Stable Diffusion is an open-source latent diffusion model that generates photorealistic images from text prompts or transforms existing images — all locally on consumer hardware. The HuggingFace diffusers library provides a clean Python API for text-to-image generation, image-to-image transformation, inpainting, and ControlNet-guided generation. In 2026, SDXL and SD 3.5 have raised the quality bar while SDXL-Turbo enables real-time generation in 1–4 steps.
This guide covers the complete diffusers pipeline: text-to-image with prompt engineering, img2img transformation, inpainting for targeted edits, ControlNet for spatial conditioning (pose, depth, canny edges), loading community LoRA models, and efficient batch generation strategies.
The diffusers library supports CUDA GPUs, Apple Silicon MPS, and CPU inference. A GPU with at least 6 GB VRAM is recommended for SDXL; SD 1.5 runs on 4 GB. For CPU inference, expect several minutes per image. Install xformers for memory-efficient attention on NVIDIA GPUs.
pip install diffusers transformers accelerate peft # peft backs the named LoRA adapters (adapter_name, set_adapters)
pip install xformers # Optional: memory-efficient attention (NVIDIA only)
pip install Pillow opencv-python # Image utilities
# Verify CUDA availability
python -c "import torch; ok = torch.cuda.is_available(); print(ok, torch.cuda.get_device_name(0) if ok else 'no CUDA device found')"
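Since the library runs on CUDA, MPS, and CPU, a script can choose its device at start-up instead of hard-coding "cuda". The helper below is a minimal sketch; the name pick_device is hypothetical, and it assumes that falling back to CPU is acceptable when PyTorch is missing or no accelerator is visible.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate with
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in builds with Apple Silicon support
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The result can be passed to pipe.to(device) and torch.Generator(device) in place of the literal "cuda" used throughout this guide.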
Use enable_model_cpu_offload() to run SDXL on 8 GB VRAM: it automatically offloads model components to the CPU when they are not in use. Call it instead of pipe.to("cuda"), not in addition to it. This roughly doubles generation time but halves peak VRAM usage.
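One way to keep these memory switches in a single place is a small lookup keyed on available VRAM. This is a sketch only: the thresholds are illustrative assumptions rather than measured limits, and memory_options and apply_memory_options are hypothetical helper names. The three method names it returns are existing diffusers pipeline methods.

```python
def memory_options(vram_gb: float) -> list[str]:
    """Map available VRAM to the pipeline methods worth enabling.

    The cut-offs are assumptions for illustration; tune them for your model.
    """
    if vram_gb >= 12:
        return []  # enough headroom: keep everything on the GPU
    if vram_gb >= 8:
        return ["enable_model_cpu_offload"]
    if vram_gb >= 6:
        return ["enable_model_cpu_offload", "enable_attention_slicing"]
    # Very low VRAM: slowest option, smallest footprint
    return ["enable_sequential_cpu_offload", "enable_attention_slicing"]


def apply_memory_options(pipe, vram_gb: float):
    """Call each selected method on the pipeline and return the pipeline."""
    for name in memory_options(vram_gb):
        getattr(pipe, name)()
    return pipe
```

When any offload option is selected, the pipeline should not also be moved with .to("cuda"), because the offload hooks manage device placement themselves.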
The StableDiffusionXLPipeline takes a text prompt and optional negative prompt, then iteratively denoises random noise into a coherent image. Key parameters — guidance scale, number of steps, and seed — control the tradeoff between prompt adherence, quality, and diversity. A well-crafted negative prompt is as important as the positive one.
from diffusers import StableDiffusionXLPipeline
import torch
# Load SDXL pipeline — downloads ~7 GB on first run
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
pipe = pipe.to("cuda")
pipe.enable_xformers_memory_efficient_attention() # NVIDIA only
# Basic generation
prompt = (
    "a futuristic city skyline at golden hour, cyberpunk aesthetic, "
    "neon lights reflecting in rain-soaked streets, ultra realistic, "
    "8k resolution, cinematic lighting, shot on RED camera"
)
negative_prompt = (
    "blurry, low quality, watermark, text, ugly, deformed, "
    "bad anatomy, extra limbs, oversaturated"
)
generator = torch.Generator("cuda").manual_seed(42) # Reproducible output
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
    width=1024,
    height=1024,
    generator=generator,
).images[0]
image.save("city_skyline.png")
# Generate multiple images at once
images = pipe(
    prompt=[prompt] * 4,
    negative_prompt=[negative_prompt] * 4,
    num_inference_steps=25,
    guidance_scale=7.0,
    # One generator per image keeps each result individually reproducible
    generator=[torch.Generator("cuda").manual_seed(i) for i in range(4)],
).images
for i, img in enumerate(images):
    img.save(f"city_{i}.png")
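To make the guidance_scale parameter used above less opaque, here is the standard classifier-free guidance combination written out on plain lists. It is a conceptual sketch, not the library's code: apply_guidance is a hypothetical name, and the real pipeline works on tensors. As far as I know, diffusers skips this step entirely when guidance_scale is 1 or lower and uses only the prompt-conditioned prediction.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional (or negative-prompt) estimate, towards the prompt estimate."""
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]


# Toy two-element "noise predictions"
uncond = [0.0, 1.0]
text = [1.0, 3.0]
print(apply_guidance(uncond, text, 1.0))  # equals the prompt-conditioned prediction
print(apply_guidance(uncond, text, 7.5))  # difference amplified 7.5 times
```

A larger scale exaggerates the difference between the two predictions, which is why high values follow the prompt more literally but tend to reduce diversity.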
Img2img takes an existing image and a prompt, adds a controlled amount of noise to the image, and then denoises it under the guidance of the new prompt. The strength parameter (0.0–1.0) sets how much noise is added, and therefore how far the result departs from the original: around 0.3 gives a subtle change, around 0.8 a drastic transformation. This is ideal for style transfer, photo enhancement, and concept iteration.
from diffusers import StableDiffusionXLImg2ImgPipeline
from PIL import Image
import torch
pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")
# Load source image
init_image = Image.open("photo.jpg").convert("RGB").resize((1024, 1024))
prompt = "oil painting style, impressionist brushstrokes, vibrant colours, museum quality art"
negative_prompt = "photorealistic, photography, low quality"
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image=init_image,
    strength=0.6, # 0.6 = moderate transformation
    guidance_scale=8.0,
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(99),
).images[0]
image.save("oil_painting.png")
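One consequence of strength that is easy to miss: because only part of the noise schedule is applied, only part of num_inference_steps is actually run. The sketch below mirrors how, to my understanding, the diffusers img2img pipelines derive the step count; img2img_steps is a hypothetical helper name, and scheduler-specific details are ignored.

```python
def img2img_steps(num_inference_steps: int, strength: float) -> int:
    """Approximate number of denoising steps an img2img call will run."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    # The early part of the schedule is skipped; only the final fraction runs
    return min(int(num_inference_steps * strength), num_inference_steps)


print(img2img_steps(50, 0.6))  # settings used in the example above
print(img2img_steps(50, 1.0))  # full schedule: the source image is essentially ignored
```

If a low strength leaves too few steps for a clean result, raising num_inference_steps compensates without changing how much of the original is preserved.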
Inpainting replaces a masked region of an image with AI-generated content matching the surrounding context. This is essential for removing objects, replacing backgrounds, or adding elements to existing photos. The pipeline requires the original image and a binary mask (white = areas to regenerate, black = preserve).
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image, ImageDraw
import torch, numpy as np
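pipe = StableDiffusionInpaintPipeline.from_pretrained(
    # The original runwayml/stable-diffusion-inpainting repository is no longer
    # hosted on the Hub; this is the mirror the diffusers docs now point to
    "stable-diffusion-v1-5/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")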
# Load image and create mask
image = Image.open("street_photo.jpg").convert("RGB").resize((512, 512))
# Draw mask — white area will be regenerated
mask = Image.new("RGB", (512, 512), "black")
draw = ImageDraw.Draw(mask)
draw.rectangle([150, 100, 350, 400], fill="white") # Mask car in image
prompt = "a park bench surrounded by flowers, natural lighting, photorealistic"
result = pipe(
    prompt=prompt,
    image=image,
    mask_image=mask,
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
result.save("inpainted.png")
# Composite result back with original for crisp borders
mask_np = np.array(mask.convert("L")) > 127
original_np = np.array(image)
result_np = np.array(result)
composite = np.where(mask_np[:, :, None], result_np, original_np)
Image.fromarray(composite).save("composite.png")
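The hand-typed rectangle above can be derived from an object's bounding box instead. The helper below is a sketch under two assumptions of mine: a little padding around the object gives the model room to blend, and snapping the edges to multiples of 8 lines the mask up with the latent grid (the VAE works at one-eighth resolution). The name padded_mask_box and the default values are hypothetical.

```python
def padded_mask_box(box, image_size, pad=16, multiple=8):
    """Grow a (left, top, right, bottom) box by `pad` pixels, snap it outward
    to a multiple of `multiple`, and clamp it to the image bounds."""
    left, top, right, bottom = box
    width, height = image_size
    # Floor the top-left corner, ceil the bottom-right corner
    left = max(0, (left - pad) // multiple * multiple)
    top = max(0, (top - pad) // multiple * multiple)
    right = min(width, -(-(right + pad) // multiple) * multiple)
    bottom = min(height, -(-(bottom + pad) // multiple) * multiple)
    return left, top, right, bottom


# The car rectangle from the example, on a 512x512 image
print(padded_mask_box((150, 100, 350, 400), (512, 512)))
```

The returned tuple can be passed straight to draw.rectangle(..., fill="white") when building the mask.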
ControlNet adds spatial conditioning to diffusion models — you provide a control image (pose skeleton, depth map, Canny edges, or segmentation map) and the model generates an image that follows that structure while matching the text prompt. This solves the core limitation of text-only generation: lack of precise spatial control.
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
from PIL import Image
import torch, cv2, numpy as np
# Load ControlNet conditioned on Canny edges
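controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    # The original runwayml/stable-diffusion-v1-5 repository is no longer
    # hosted on the Hub; this is the mirror the diffusers docs now point to
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)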
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
pipe.enable_xformers_memory_efficient_attention()
# Extract Canny edges from source image
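source = cv2.imread("building.jpg")
if source is None: # cv2.imread returns None instead of raising on a bad path
    raise FileNotFoundError("building.jpg could not be read")
source_gray = cv2.cvtColor(source, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(source_gray, threshold1=100, threshold2=200)
# Canny returns a single-channel array; convert to 3-channel RGB for the pipeline
control_image = Image.fromarray(edges).convert("RGB")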
# Generate new image following the edge structure
prompt = "a medieval castle, dramatic lighting, fantasy art, hyperdetailed, 8k"
negative_prompt = "blurry, low quality, modern, cartoon"
images = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image=control_image,
    num_inference_steps=30,
    guidance_scale=7.5,
    controlnet_conditioning_scale=0.8, # How strongly to follow the control
    generator=torch.Generator("cuda").manual_seed(42),
).images
images[0].save("castle_from_building_edges.png")
Other ControlNet checkpoints follow the same pattern: sd-controlnet-openpose for human pose control, sd-controlnet-depth for 3D depth conditioning, sd-controlnet-seg for semantic segmentation, and sd-controlnet-scribble for sketch-to-image workflows.
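Swapping the conditioning type then only means swapping the checkpoint name and the preprocessing step. A small lookup keeps that choice in one place. This sketch assumes the other checkpoints live under the same lllyasviel namespace as the Canny model loaded earlier; controlnet_repo is a hypothetical helper name.

```python
# Assumption: all SD 1.5 ControlNet checkpoints share the "lllyasviel" namespace
CONTROLNET_CHECKPOINTS = {
    "canny": "lllyasviel/sd-controlnet-canny",
    "openpose": "lllyasviel/sd-controlnet-openpose",
    "depth": "lllyasviel/sd-controlnet-depth",
    "seg": "lllyasviel/sd-controlnet-seg",
    "scribble": "lllyasviel/sd-controlnet-scribble",
}


def controlnet_repo(kind: str) -> str:
    """Return the Hub repository for a conditioning type, or fail loudly."""
    try:
        return CONTROLNET_CHECKPOINTS[kind.lower()]
    except KeyError:
        known = ", ".join(sorted(CONTROLNET_CHECKPOINTS))
        raise ValueError(f"Unknown control type {kind!r}; expected one of: {known}") from None


print(controlnet_repo("depth"))
```

The returned string goes into ControlNetModel.from_pretrained(...). Each type still needs its own control image (a depth map, a pose skeleton, and so on) in place of the Canny edges.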
Community LoRA models from CivitAI and HuggingFace Hub add new styles, characters, or concepts to base models without full retraining. Load multiple LoRAs simultaneously and control each one's influence via a weight. LoRAs are tiny (typically 50–150 MB) and can be swapped at runtime for different aesthetics.
from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
# Load LoRA from HuggingFace Hub
pipe.load_lora_weights(
    "CiroN2022/toy-face",
    weight_name="toy_face_sdxl.safetensors",
    adapter_name="toy_face",
)
# Load a second LoRA
pipe.load_lora_weights(
    "nerijs/pixel-art-xl",
    weight_name="pixel-art-xl.safetensors",
    adapter_name="pixel_art",
)
# Blend both LoRAs
pipe.set_adapters(["toy_face", "pixel_art"], adapter_weights=[0.7, 0.3])
prompt = "a cute robot character, toy face, pixel art style, colourful"
image = pipe(
    prompt=prompt,
    num_inference_steps=30,
    guidance_scale=7.0,
    cross_attention_kwargs={"scale": 1.0},
).images[0]
image.save("lora_blend.png")
# Unload LoRAs (restore base model)
pipe.unload_lora_weights()
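For intuition about why LoRA files are so small and what an adapter weight does, here is the low-rank update written out on plain lists. It is a conceptual sketch rather than how diffusers or PEFT implement it: the function names are hypothetical, and the alpha/rank scaling that real LoRA files carry is left out.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_up, lora_down, scale):
    """Return W + scale * (up @ down), the effective weight with the LoRA applied."""
    delta = matmul(lora_up, lora_down)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


def lora_param_count(d_out, d_in, rank):
    """Parameters stored by one LoRA pair, versus d_out * d_in for the full matrix."""
    return rank * (d_in + d_out)


weight = [[1.0, 0.0], [0.0, 1.0]]   # 2x2 base weight
up = [[1.0], [2.0]]                  # 2x1 (rank 1)
down = [[3.0, 4.0]]                  # 1x2
print(merge_lora(weight, up, down, scale=0.5))
print(lora_param_count(1280, 1280, rank=16), "vs", 1280 * 1280)
```

The values passed as adapter_weights to set_adapters play roughly the role of scale here, one per adapter, so a 0.7/0.3 blend adds 70% of one update and 30% of the other to the same base weights.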
For high-throughput image generation — product photography, dataset augmentation, or creative pipelines — optimise for batch throughput. Use attention slicing and model CPU offload on lower-end hardware, and enable sequential CPU offload for extreme VRAM savings. SDXL-Turbo reduces steps from 30 to 4 for near-instant generation.
from diffusers import AutoPipelineForText2Image
import torch, time
from pathlib import Path
# SDXL-Turbo: 4-step generation
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
prompts = [
    "a red sports car on a mountain road, sunset",
    "a cosy library interior, warm lighting, books everywhere",
    "a neon-lit Tokyo street at night, rainy, reflections",
    "a medieval marketplace with merchants and market stalls",
]
start = time.time()
images = pipe(
    prompt=prompts,
    num_inference_steps=4, # Turbo only needs 4 steps
    guidance_scale=0.0, # Turbo uses CFG=0
    generator=[torch.Generator("cuda").manual_seed(i) for i in range(len(prompts))],
).images
elapsed = time.time() - start
print(f"Generated {len(images)} images in {elapsed:.1f}s ({elapsed/len(images):.2f}s each)")
# Save all
Path("output").mkdir(exist_ok=True)
for i, img in enumerate(images):
    img.save(f"output/batch_{i:03d}.png")
# Memory optimisation for low VRAM (< 6 GB)
# pipe.enable_attention_slicing()
# pipe.enable_model_cpu_offload() # call this instead of .to("cuda"), not after it
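Four prompts fit in a single call, but a list of hundreds will not fit in VRAM at once. A common approach is to walk the list in fixed-size chunks and give every image its own seed so any single result can be regenerated later. The generator below is a minimal sketch; seeded_batches and its seed convention (base seed plus the prompt's index) are my own assumptions.

```python
def seeded_batches(prompts, batch_size, base_seed=0):
    """Yield (prompt_chunk, seeds) pairs; each prompt's seed is base_seed + its index."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        chunk = prompts[start:start + batch_size]
        seeds = [base_seed + start + i for i in range(len(chunk))]
        yield chunk, seeds


for chunk, seeds in seeded_batches(["a", "b", "c", "d", "e"], batch_size=2):
    print(chunk, seeds)
```

Inside the loop, each chunk would be passed as prompt=chunk together with generator=[torch.Generator("cuda").manual_seed(s) for s in seeds]. The batch size is whatever the GPU tolerates without running out of memory.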