Stable Diffusion: Image Generation and ControlNet Guide

Stable Diffusion is an open-source latent diffusion model that generates images, including photorealistic ones, from text prompts or transforms existing images, and it runs locally on consumer hardware. The Hugging Face diffusers library provides a clean Python API for text-to-image generation, image-to-image transformation, inpainting, and ControlNet-guided generation. In 2026, SDXL and SD 3.5 have raised the quality bar, while SDXL-Turbo enables near real-time generation in 1–4 steps.

This guide covers the complete diffusers pipeline: text-to-image with prompt engineering, img2img transformation, inpainting for targeted edits, ControlNet for spatial conditioning (pose, depth, Canny edges), loading community LoRA models, and efficient batch generation strategies.

Installation and Setup
Text-to-Image Generation
Image-to-Image Transformation
Inpainting and Outpainting
ControlNet Guided Generation
Loading LoRA Models
Batch Generation and Optimisation

Installation and Setup

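The diffusers library supports CUDA GPUs, Apple Silicon (MPS), and CPU inference. SD 1.5 runs on about 4 GB of VRAM. SDXL is more demanding: at least 6 GB is recommended, and at that size it relies on the memory optimisations covered in the tip below. CPU inference works, but expect several minutes per image. On NVIDIA GPUs, xformers provides memory-efficient attention; it is optional, because with PyTorch 2.x diffusers already uses PyTorch's built-in scaled dot-product attention by default.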

pip install diffusers transformers accelerate
pip install xformers  # Optional: memory-efficient attention (NVIDIA only)
pip install Pillow opencv-python  # Image utilities

# Verify CUDA availability
python -c "import torch; ok = torch.cuda.is_available(); print(ok, torch.cuda.get_device_name(0) if ok else 'no CUDA device')"

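Memory tip: Call pipe.enable_model_cpu_offload() instead of pipe.to("cuda") to run SDXL on GPUs with around 8 GB of VRAM. It keeps each model component on the CPU until it is needed and moves it back afterwards, which lowers peak VRAM use at the cost of speed. As a rough guide, expect about twice the generation time for about half the peak VRAM, though the exact trade-off depends on your hardware.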
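The choice between CUDA, MPS, and CPU described above can be made once at the top of a script. The helper below is a minimal sketch, not part of diffusers: pick_device is a hypothetical name, and the dtype choices (half precision on GPU backends, full precision on CPU) are a common convention rather than a requirement.

```python
from typing import Tuple


def pick_device(cuda_available: bool, mps_available: bool) -> Tuple[str, str]:
    """Return (device, dtype_name) in order of preference: CUDA, MPS, CPU."""
    if cuda_available:
        return "cuda", "float16"  # half precision halves weight memory on NVIDIA GPUs
    if mps_available:
        return "mps", "float16"   # Apple Silicon
    return "cpu", "float32"       # half precision is poorly supported on most CPUs


# In a real script the flags would come from PyTorch, for example:
#   import torch
#   device, dtype_name = pick_device(
#       torch.cuda.is_available(), torch.backends.mps.is_available()
#   )
#   dtype = getattr(torch, dtype_name)  # pass as torch_dtype when loading the pipeline
print(pick_device(cuda_available=False, mps_available=True))
```

Keeping the decision in one function means the rest of the guide's examples only need their hard-coded "cuda" strings replaced with the returned device.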

Text-to-Image Generation

The StableDiffusionXLPipeline takes a text prompt and an optional negative prompt, then iteratively denoises random noise into a coherent image. Three parameters matter most: guidance scale sets how closely the image follows the prompt, the number of steps trades speed for quality, and the seed makes a result reproducible or varies it. A well-crafted negative prompt is as important as the positive one.

from diffusers import StableDiffusionXLPipeline
import torch

# Load SDXL pipeline — downloads ~7 GB on first run
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
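pipe = pipe.to("cuda")
# Optional, NVIDIA only: this call fails unless the xformers package is installed.
# PyTorch 2.x already uses efficient attention by default, so it can be omitted.
pipe.enable_xformers_memory_efficient_attention()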

# Basic generation
prompt = (
    "a futuristic city skyline at golden hour, cyberpunk aesthetic, "
    "neon lights reflecting in rain-soaked streets, ultra realistic, "
    "8k resolution, cinematic lighting, shot on RED camera"
)
negative_prompt = (
    "blurry, low quality, watermark, text, ugly, deformed, "
    "bad anatomy, extra limbs, oversaturated"
)

generator = torch.Generator("cuda").manual_seed(42)  # Reproducible output

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
    width=1024,
    height=1024,
    generator=generator,
).images[0]

image.save("city_skyline.png")

# Generate multiple images at once
images = pipe(
    prompt=[prompt] * 4,
    negative_prompt=[negative_prompt] * 4,
    num_inference_steps=25,
    guidance_scale=7.0,
    generator=[torch.Generator("cuda").manual_seed(i) for i in range(4)],
).images

for i, img in enumerate(images):
    img.save(f"city_{i}.png")
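To make the role of guidance_scale concrete, the sketch below applies the classifier-free guidance formula to toy numbers. It is an illustration only: apply_guidance is a hypothetical helper, and real pipelines apply the same arithmetic to latent tensors at every denoising step. In diffusers, the negative prompt takes the place of the empty "unconditional" prompt.

```python
from typing import List


def apply_guidance(uncond: List[float], cond: List[float], guidance_scale: float) -> List[float]:
    """Classifier-free guidance: move from the unconditional prediction
    towards (and past) the prompt-conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy two-element "noise predictions" (real ones are latent tensors).
uncond = [0.0, 1.0]  # prediction for the negative (or empty) prompt
cond = [1.0, 1.0]    # prediction for the positive prompt

print(apply_guidance(uncond, cond, 1.0))  # scale 1.0 reproduces the conditional prediction
print(apply_guidance(uncond, cond, 7.5))  # larger scales amplify the difference
```

Only the element where the two predictions disagree changes, which is why a negative prompt has an effect: it defines what the result is pushed away from.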

Image-to-Image Transformation

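Img2img takes an existing image and a prompt, adds a controlled amount of noise to the image, and then denoises it under the guidance of the new prompt. The strength parameter (0.0–1.0) sets how much noise is added, and therefore how far the result can drift from the original: around 0.3 gives a subtle change, while 0.8 gives a drastic transformation. Higher strength also means more of the scheduled denoising steps are actually run. This is ideal for style transfer, photo enhancement, and concept iteration.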

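from diffusers import StableDiffusionXLImg2ImgPipeline
from PIL import Image
import torch

# The refiner checkpoint is tuned for polishing detail. The base checkpoint
# ("stabilityai/stable-diffusion-xl-base-1.0") also loads in this pipeline.
pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",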
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")

# Load source image (a plain resize to a square distorts non-square photos; crop first if that matters)
init_image = Image.open("photo.jpg").convert("RGB").resize((1024, 1024))

prompt = "oil painting style, impressionist brushstrokes, vibrant colours, museum quality art"
negative_prompt = "photorealistic, photography, low quality"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image=init_image,
    strength=0.6,      # 0.6 = moderate transformation
    guidance_scale=8.0,
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(99),
).images[0]

image.save("oil_painting.png")
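The interaction between strength and num_inference_steps is easy to misjudge. The sketch below reproduces, as far as I understand the default behaviour, how diffusers img2img pipelines derive the number of steps that actually run; effective_steps is a hypothetical helper, and the exact count may differ between pipeline versions or when options such as denoising_start are used.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Approximate number of denoising steps an img2img call actually runs."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.3, 0.6, 0.8):
    steps = effective_steps(50, strength)
    print(f"strength={strength}: about {steps} of 50 scheduled steps")
```

Under this assumption, the example above (strength 0.6 with 50 steps) runs roughly 30 denoising steps, so a low strength with a low step count can leave very few steps to work with.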

Inpainting and Outpainting

Inpainting replaces a masked region of an image with AI-generated content matching the surrounding context. This is essential for removing objects, replacing backgrounds, or adding elements to existing photos. The pipeline requires the original image and a binary mask (white = areas to regenerate, black = preserve).

from diffusers import StableDiffusionInpaintPipeline
from PIL import Image, ImageDraw
import torch, numpy as np

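# Mirror of the original "runwayml/stable-diffusion-inpainting" checkpoint,
# which is no longer hosted under the runwayml organisation.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")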

# Load image and create mask
image = Image.open("street_photo.jpg").convert("RGB").resize((512, 512))

# Draw a single-channel mask: white (255) is regenerated, black (0) is preserved
mask = Image.new("L", (512, 512), 0)
draw = ImageDraw.Draw(mask)
draw.rectangle([150, 100, 350, 400], fill=255)  # Region to replace (here, a car)

prompt = "a park bench surrounded by flowers, natural lighting, photorealistic"

result = pipe(
    prompt=prompt,
    image=image,
    mask_image=mask,
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

result.save("inpainted.png")

# Composite result back with original for crisp borders
mask_np = np.array(mask.convert("L")) > 127
original_np = np.array(image)
result_np = np.array(result)
composite = np.where(mask_np[:, :, None], result_np, original_np)
Image.fromarray(composite).save("composite.png")
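Outpainting, which extends an image beyond its borders, is not shown above. One common approach treats it as inpainting on a larger canvas: paste the original onto the canvas and mask everything except the original. The sketch below only computes that layout. plan_outpaint and OutpaintLayout are hypothetical names, and rounding the canvas up to a multiple of 8 is an assumption based on the latent downsampling factor of Stable Diffusion models.

```python
from typing import NamedTuple, Tuple


class OutpaintLayout(NamedTuple):
    canvas_size: Tuple[int, int]         # (width, height) of the enlarged canvas
    paste_offset: Tuple[int, int]        # top-left corner where the original is pasted
    keep_box: Tuple[int, int, int, int]  # (left, top, right, bottom) to keep black in the mask


def round_up(value: int, multiple: int = 8) -> int:
    """Round value up to the next multiple (ceiling division)."""
    return -(-value // multiple) * multiple


def plan_outpaint(width: int, height: int, left: int = 0, top: int = 0,
                  right: int = 0, bottom: int = 0, multiple: int = 8) -> OutpaintLayout:
    """Compute canvas size, paste position, and preserved region for outpainting."""
    canvas_w = round_up(width + left + right, multiple)
    canvas_h = round_up(height + top + bottom, multiple)
    keep_box = (left, top, left + width, top + height)
    return OutpaintLayout((canvas_w, canvas_h), (left, top), keep_box)


# Extend a 512x512 image by 128 pixels on the left and right.
layout = plan_outpaint(512, 512, left=128, right=128)
print(layout)
```

With Pillow, the layout would be applied by pasting the source at paste_offset on a new canvas of canvas_size, then building an all-white "L"-mode mask of the same size and filling keep_box with black, for instance with mask.paste(0, keep_box). The canvas and mask then go to the inpainting pipeline as image and mask_image, with matching width and height.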

ControlNet Guided Generation

ControlNet adds spatial conditioning to diffusion models: you provide a control image (pose skeleton, depth map, Canny edges, or segmentation map), and the model generates an image that follows that structure while matching the text prompt. This addresses the core limitation of text-only generation, which is the lack of precise spatial control.

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
from PIL import Image
import torch, cv2, numpy as np

# Load ControlNet conditioned on Canny edges
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)
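# Mirror of the original "runwayml/stable-diffusion-v1-5" checkpoint,
# which is no longer hosted under the runwayml organisation.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)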
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
pipe.enable_xformers_memory_efficient_attention()

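# Extract Canny edges from source image
source = cv2.imread("building.jpg")
if source is None:  # cv2.imread returns None instead of raising on a bad path
    raise FileNotFoundError("building.jpg could not be read")
source_gray = cv2.cvtColor(source, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(source_gray, threshold1=100, threshold2=200)
# Repeat the single-channel edge map across three channels to form an RGB control image
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))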

# Generate new image following the edge structure
prompt = "a medieval castle, dramatic lighting, fantasy art, hyperdetailed, 8k"
negative_prompt = "blurry, low quality, modern, cartoon"

images = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image=control_image,
    num_inference_steps=30,
    guidance_scale=7.5,
    controlnet_conditioning_scale=0.8,  # How strongly to follow the control
    generator=torch.Generator("cuda").manual_seed(42),
).images

images[0].save("castle_from_building_edges.png")

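ControlNet types: Besides Canny edges, use sd-controlnet-openpose for human pose control, sd-controlnet-depth for depth-map conditioning, sd-controlnet-seg for semantic segmentation, and sd-controlnet-scribble for sketch-to-image workflows. Each expects a control image of the matching kind, so a pose model needs a pose skeleton rather than an edge map.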
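When switching between control types, a small lookup keeps the repository IDs in one place. This is a sketch under one assumption: the Canny example above uses the "lllyasviel/" namespace, and the other model names listed here are assumed to live under the same prefix. controlnet_repo is a hypothetical helper, not a diffusers function.

```python
# Control type -> Hub repository ID (prefix assumed from the Canny example).
CONTROLNET_REPOS = {
    "canny": "lllyasviel/sd-controlnet-canny",
    "openpose": "lllyasviel/sd-controlnet-openpose",
    "depth": "lllyasviel/sd-controlnet-depth",
    "seg": "lllyasviel/sd-controlnet-seg",
    "scribble": "lllyasviel/sd-controlnet-scribble",
}


def controlnet_repo(kind: str) -> str:
    """Return the repository ID for a control type, with a clear error for typos."""
    try:
        return CONTROLNET_REPOS[kind.lower()]
    except KeyError:
        known = ", ".join(sorted(CONTROLNET_REPOS))
        raise ValueError(f"unknown control type {kind!r}; expected one of: {known}") from None


print(controlnet_repo("depth"))
```

The returned string can be passed to ControlNetModel.from_pretrained in place of the hard-coded Canny ID.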

Loading LoRA Models

Community LoRA models from CivitAI and the Hugging Face Hub add new styles, characters, or concepts to a base model without full retraining. You can load several LoRAs at once and control each one's influence with a weight; in diffusers this multi-adapter support relies on the peft package (pip install peft). LoRA files are small (typically 50–150 MB) and can be swapped at runtime for different aesthetics.

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Load LoRA from HuggingFace Hub
pipe.load_lora_weights(
    "CiroN2022/toy-face",
    weight_name="toy_face_sdxl.safetensors",
    adapter_name="toy_face",
)

# Load a second LoRA
pipe.load_lora_weights(
    "nerijs/pixel-art-xl",
    weight_name="pixel-art-xl.safetensors",
    adapter_name="pixel_art",
)

# Blend both LoRAs
pipe.set_adapters(["toy_face", "pixel_art"], adapter_weights=[0.7, 0.3])

prompt = "a cute robot character, toy face, pixel art style, colourful"
image = pipe(
    prompt=prompt,
    num_inference_steps=30,
    guidance_scale=7.0,
    # Global LoRA scale applied on top of the per-adapter weights; 1.0 leaves them unchanged
    cross_attention_kwargs={"scale": 1.0},
).images[0]
image.save("lora_blend.png")

# Unload LoRAs (restore base model)
pipe.unload_lora_weights()
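The adapter weights passed to set_adapters can be understood as coefficients on low-rank updates to the base weights. The toy sketch below shows that arithmetic in plain Python. It is a simplification: blend_loras is a hypothetical helper, real layers are far larger, and the per-adapter alpha/rank scaling that LoRA implementations also apply is left out. The weights 0.5 and 0.25 are chosen so the result is exact in floating point; the 0.7 and 0.3 used above work the same way.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def blend_loras(base: Matrix, adapters: Sequence[Tuple[Matrix, Matrix, float]]) -> Matrix:
    """Return base + sum(weight * (B @ A)) over all (B, A, weight) adapters."""
    merged = [row[:] for row in base]
    for lora_b, lora_a, weight in adapters:
        delta = matmul(lora_b, lora_a)  # low-rank update with the same shape as base
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += weight * value
    return merged


base = [[1.0, 0.0], [0.0, 1.0]]                # toy 2x2 layer weight
style_a = ([[1.0], [0.0]], [[0.0, 2.0]], 0.5)  # rank-1 factors (B, A) and adapter weight
style_b = ([[0.0], [1.0]], [[4.0, 0.0]], 0.25)

print(blend_loras(base, [style_a, style_b]))
```

Because each adapter stores only the two thin factors rather than a full weight matrix, the files stay small, and unloading an adapter amounts to dropping its term from the sum.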

Batch Generation and Optimisation

For high-throughput image generation (product photography, dataset augmentation, or creative pipelines), optimise for batch throughput. On lower-end hardware, use attention slicing and model CPU offload, and switch to sequential CPU offload when VRAM is extremely tight. SDXL-Turbo cuts the step count from around 30 to between 1 and 4 for near-instant generation.

from diffusers import AutoPipelineForText2Image
import torch, time
from pathlib import Path

# SDXL-Turbo: 4-step generation
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

prompts = [
    "a red sports car on a mountain road, sunset",
    "a cosy library interior, warm lighting, books everywhere",
    "a neon-lit Tokyo street at night, rainy, reflections",
    "a medieval marketplace with merchants and their stalls",
]

start = time.time()
images = pipe(
    prompt=prompts,
    num_inference_steps=4,  # Turbo only needs 4 steps
    guidance_scale=0.0,     # Turbo uses CFG=0
    generator=[torch.Generator("cuda").manual_seed(i) for i in range(len(prompts))],
).images
elapsed = time.time() - start
print(f"Generated {len(images)} images in {elapsed:.1f}s ({elapsed/len(images):.2f}s each)")

# Save all
Path("output").mkdir(exist_ok=True)
for i, img in enumerate(images):
    img.save(f"output/batch_{i:03d}.png")

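# Memory optimisation for low VRAM (< 6 GB)
# pipe.enable_attention_slicing()
# pipe.enable_model_cpu_offload()       # use instead of .to("cuda"), not in addition to it
# pipe.enable_sequential_cpu_offload()  # slowest option, lowest VRAM use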
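Passing every prompt in a single call, as above, makes peak VRAM grow with the list length. For longer prompt lists, a simple remedy is to split the work into fixed-size batches. The sketch below shows only the batching logic; chunked is a hypothetical helper, and the pipeline call is left as a comment because the right batch size depends on your GPU.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


prompts = [f"prompt {i}" for i in range(10)]
for batch_index, batch in enumerate(chunked(prompts, batch_size=4)):
    # In a real run, each batch would go to the pipeline, for example:
    #   images = pipe(prompt=batch, num_inference_steps=4, guidance_scale=0.0).images
    print(f"batch {batch_index}: {len(batch)} prompts")
```

A reasonable way to choose batch_size is to start small and increase it until throughput stops improving or memory runs out.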

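Performance: As indicative figures, SDXL-Turbo generates a 1024×1024 image in approximately 1.5 seconds on an RTX 3080, while standard SDXL at 30 steps takes 12–18 seconds. Note that SDXL-Turbo is trained for 512×512 output, which is also what the example above produces because it does not set width and height, so measure on your own hardware and at your target resolution. Use Turbo for iteration speed and full SDXL for final-quality outputs.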