Fine-tuning a large language model adapts a pretrained base model to a specific task, style, or domain, and for specialized use cases it can perform far better than prompting alone. The challenge has always been cost: fully fine-tuning a 7B-parameter model in mixed precision typically needs 80GB or more of GPU memory for weights, gradients, and optimizer states, plus days of compute. Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA changed that, making fine-tuning feasible on a single consumer GPU.
This guide explains how LoRA, QLoRA, and related PEFT techniques work at a conceptual level, then walks through a complete supervised fine-tuning pipeline using Hugging Face's transformers, peft, and trl libraries.
Prompt engineering is always the first thing to try — it's fast, cheap, and reversible. Fine-tuning makes sense when you have a specific task the model consistently gets wrong through prompting, when you need to reduce token costs by shrinking prompts, when you want a smaller model to match a larger model's quality on your domain, or when you need to teach the model proprietary knowledge that wasn't in its training data.
Fine-tuning is NOT needed for: general Q&A, summarization, classification (few-shot prompting usually works), one-off tasks, or cases where RAG can supply the missing knowledge. Over-engineering with fine-tuning when prompting suffices is a common and expensive mistake.
Low-Rank Adaptation (LoRA) works by freezing all original model weights and injecting small trainable rank-decomposition matrices, typically into the attention layers. Instead of updating a weight matrix W directly (which might be 4096×4096 ≈ 16.8M parameters), LoRA learns two small matrices B (4096×r) and A (r×4096), where r is the rank (typically 8–64). The weight update is the product BA, so the effective weight is W + (α/r)·BA, where α is the lora_alpha scaling factor.
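The low-rank update can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not how peft implements it: the helper names (`lora_param_count`, `matmul`, `lora_effective_weight`) and the toy 4×4 matrices are made up for the example, and B is initialised to zero (the usual LoRA convention) so the adapted weight starts out identical to W.

```python
import random

def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for one LoRA pair: B is (d_out x r), A is (r x d_in)."""
    return d_out * r + r * d_in

def matmul(x, y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_effective_weight(W, A, B, alpha: float, r: int):
    """Return W + (alpha / r) * (B @ A) without modifying W."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

# Parameter count for one 4096x4096 projection at rank 16
full = 4096 * 4096
low_rank = lora_param_count(4096, 4096, r=16)
print(f"full: {full:,}  lora: {low_rank:,}  ratio: {100 * low_rank / full:.2f}%")

# Toy 4x4 example (hypothetical values) with rank 2
d, r = 4, 2
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]  # zero init: the update BA starts at zero
W_eff = lora_effective_weight(W, A, B, alpha=4, r=r)
```

Because only A and B receive gradients, the optimizer tracks 131,072 values for this projection instead of roughly 16.8 million.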
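With rank r=16 applied to the attention projections of a 7B to 8B model, LoRA trains only about 0.1–0.3% of the parameters, yet it often reaches roughly 90–95% of full fine-tuning quality, depending on the task. Training can be up to about 3× faster and use around 70% less VRAM, because gradients and optimizer states are kept only for the adapter weights. The LoRA adapters (tens to a few hundred MB, depending on rank and target modules) are stored separately from the base model and can be swapped at inference time.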
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Meta-Llama-3-8B"

# Load base model in float16 for efficiency
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The base tokenizer has no pad token, so reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank: higher = more capacity, more params
    lora_alpha=32,  # Scaling factor: the update BA is multiplied by alpha/r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Attention projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Wrap model with LoRA
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# For this config on Llama 3 8B, expect approximately:
# trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
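The trainable-parameter count can be sanity-checked by hand. The sketch below assumes the published Llama 3 8B shapes, none of which are stated in this guide: hidden size 4096, 32 layers, grouped-query attention with 8 key/value heads of dimension 128 (so k_proj and v_proj project to 1024), and about 8.03B base parameters. The helper `lora_params` is a made-up name for illustration.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Assumed Llama 3 8B shapes (not taken from this guide)
hidden = 4096
kv_dim = 8 * 128          # 8 key/value heads x head_dim 128
num_layers = 32
base_params = 8_030_261_248
r = 16

per_layer = {
    "q_proj": lora_params(hidden, hidden, r),
    "k_proj": lora_params(hidden, kv_dim, r),
    "v_proj": lora_params(hidden, kv_dim, r),
    "o_proj": lora_params(hidden, hidden, r),
}
trainable = num_layers * sum(per_layer.values())
total = base_params + trainable
print(f"trainable params: {trainable:,} || all params: {total:,} "
      f"|| trainable%: {100 * trainable / total:.4f}")
```

Under these assumptions, rank 16 on the four attention projections trains well under 1% of the model. Adding the MLP projections or raising the rank increases the count proportionally.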
QLoRA (Quantized LoRA) combines 4-bit NF4 quantization of the base model with LoRA adapters trained in bfloat16. As a result, the weights of a 7B model, which take about 14GB in float16, shrink to under 4GB, and the model can be fine-tuned in roughly 5GB of VRAM at small batch sizes and sequence lengths. QLoRA makes fine-tuning possible on a single 24GB RTX 3090/4090, or even on a 16GB GPU with careful batch sizing.
The key ingredients are the bitsandbytes library, which provides the 4-bit quantization, and double quantization (quantizing the quantization constants themselves) for extra memory savings. Quality loss versus full fine-tuning is typically on the order of 1–3% on standard benchmarks, which is negligible for most domain-specific tasks.
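The memory arithmetic behind these figures can be sketched as follows. The block sizes are assumptions taken from the QLoRA paper's defaults (64 weights per quantization block, 256 blocks per second-level group), and `bits_per_param` and `weight_gb` are hypothetical helper names. The estimate covers weights only; activations, adapter weights, and optimizer states come on top.

```python
def bits_per_param(block_size: int = 64, double_quant: bool = True,
                   dq_block_size: int = 256) -> float:
    """Approximate storage cost per weight for 4-bit blockwise quantization."""
    if double_quant:
        # 8-bit constant per block, plus one 32-bit constant per group of blocks
        overhead = 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        # One 32-bit scaling constant per block of weights
        overhead = 32 / block_size
    return 4 + overhead

def weight_gb(n_params: float, bits: float) -> float:
    """Weight storage in gigabytes (10^9 bytes)."""
    return n_params * bits / 8 / 1e9

n = 7e9  # nominal 7B model
saved = bits_per_param(double_quant=False) - bits_per_param(double_quant=True)
print(f"float16 weights:    {weight_gb(n, 16):.1f} GB")
print(f"NF4 weights:        {weight_gb(n, bits_per_param(double_quant=False)):.2f} GB")
print(f"NF4 + double quant: {weight_gb(n, bits_per_param()):.2f} GB")
print(f"double quant saves: {saved:.3f} bits/param")
```

With these assumptions double quantization saves a little under 0.4 bits per parameter, which is a few hundred MB on a 7B model.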
from peft import get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# QLoRA: load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4, designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # Matmuls run in bf16 after dequantization
    bnb_4bit_use_double_quant=True,  # Saves roughly 0.4 bits/param
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,  # Defined in the LoRA example above
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training: freezes the base weights and enables gradient checkpointing
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# Apply LoRA on top of the quantized model (lora_config as defined above)
peft_model = get_peft_model(model, lora_config)

# Report the measured weight footprint rather than a hard-coded figure
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
The Hugging Face trl library's SFTTrainer is the standard tool for supervised fine-tuning in 2026. It handles dataset formatting, packing short examples to fill context windows (for efficiency), gradient accumulation, and W&B/TensorBoard logging automatically.
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load instruction-following dataset (each row has a "text" field with "### Human:" / "### Assistant:" turns)
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

# Normalize each row to its first Human/Assistant exchange
def format_example(example):
    human_part, _, rest = example["text"].partition("### Assistant:")
    prompt = human_part.replace("### Human:", "", 1).strip()
    answer = rest.split("### Human:")[0].strip()  # Drop any follow-up turns
    return {"text": f"### Human: {prompt}\n\n### Assistant: {answer}"}

dataset = dataset.map(format_example)

# Training configuration
training_args = SFTConfig(
    output_dir="./llama3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # Effective batch size = 2 x 4 = 8 per device
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    fp16=False,
    bf16=True,  # Use bfloat16 on Ampere+ GPUs
    logging_steps=10,
    save_steps=500,
    max_length=2048,  # Named max_seq_length in older TRL releases
    dataset_text_field="text",
    packing=True,  # Pack short sequences for efficiency
    report_to="wandb",
)

trainer = SFTTrainer(
    model=peft_model,  # Already wrapped with LoRA, so peft_config is not passed again
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # Named tokenizer= in older TRL releases
)
trainer.train()
Use packing=True for short instruction datasets: it can improve GPU utilisation from around 40% to 90% or more by filling every training sequence up to the maximum length instead of padding it, which can cut training time by up to 2×.
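To see where the utilisation gain comes from, here is a minimal sketch of one simple packing scheme: concatenate the tokenized examples with an EOS separator and cut the stream into fixed-length blocks. It is not TRL's implementation, and the function names, the example lengths, and the EOS id are all made up for illustration.

```python
from typing import List

def pad_utilisation(lengths: List[int], max_len: int) -> float:
    """Fraction of real (non-padding) tokens when each example is padded to max_len."""
    used = sum(min(n, max_len) for n in lengths)
    return used / (len(lengths) * max_len)

def pack(sequences: List[List[int]], max_len: int, eos_id: int) -> List[List[int]]:
    """Concatenate sequences (EOS-separated) and cut the stream into max_len blocks."""
    stream: List[int] = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)
    # Drop the trailing partial block so every returned block is completely full
    n_full = len(stream) // max_len
    return [stream[i * max_len:(i + 1) * max_len] for i in range(n_full)]

# Hypothetical tokenized examples, identified only by their lengths
lengths = [300, 500, 200, 900, 150]
sequences = [[1] * n for n in lengths]
blocks = pack(sequences, max_len=1024, eos_id=2)

print(f"padded utilisation: {pad_utilisation(lengths, 1024):.0%}")
print(f"packed blocks: {len(blocks)}, each with {len(blocks[0])} tokens and no padding")
```

In this toy case five padded sequences become two full blocks, so the same tokens are processed in fewer, denser forward passes. The trade-off is that examples can be split across block boundaries.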
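After training, the LoRA adapters are stored separately from the base model. For inference, you can either load them dynamically on top of the base model, which adds a small overhead from the extra adapter matrix multiplications, or merge them permanently into the base model weights, which removes that overhead entirely. For production serving of a single fine-tuned model, merging is usually the better choice.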
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Meta-Llama-3-8B"

# Load base model in float16 (not quantized, which is needed for merging)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cpu",  # Load to CPU for merging to avoid GPU OOM
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load fine-tuned LoRA adapters from a training checkpoint
peft_model = PeftModel.from_pretrained(base_model, "./llama3-finetuned/checkpoint-500")

# Merge adapters into base model weights and drop the LoRA wrappers
merged_model = peft_model.merge_and_unload()

# Save merged model and tokenizer, ready for serving
merged_model.save_pretrained("./llama3-merged")
tokenizer.save_pretrained("./llama3-merged")
print("Merged model saved; no LoRA overhead at inference time")
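Conceptually, merging folds the scaled product BA into each adapted weight matrix, so the merged layer produces the same output as the base layer plus the adapter path. The toy check below illustrates that equivalence in plain Python. It is a sketch of the idea rather than peft's code, and all matrices, values, and helper names are invented for the example.

```python
import math

def matvec(M, x):
    """Matrix-vector product for nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(X, Y):
    """Matrix-matrix product for nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, alpha: float, r: int):
    """Return the merged weight W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

# Hypothetical 3x3 layer with a rank-1 adapter
W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0], [0.5, 0.0, 1.0]]
B = [[0.1], [0.2], [-0.1]]   # 3 x 1
A = [[1.0, 0.0, 2.0]]        # 1 x 3
alpha, r = 2, 1
x = [1.0, -2.0, 3.0]

# Unmerged path: base output plus scaled adapter output
scale = alpha / r
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
unmerged = [b + scale * a for b, a in zip(base_out, adapter_out)]

# Merged path: a single matrix multiply with the folded weight
merged = matvec(merge_lora(W, A, B, alpha, r), x)

same = all(math.isclose(u, m, abs_tol=1e-9) for u, m in zip(unmerged, merged))
print("merged and unmerged outputs match:", same)
```

The merged path needs one matrix multiply per layer instead of three, which is why a merged model has no adapter overhead at inference time.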
Never ship a fine-tuned model without evaluation. At minimum, check perplexity on a held-out test set, run domain-specific benchmarks, and compare outputs against the base model and a GPT-4-class model on 50–100 representative test prompts. For chat/instruction models, use LLM-as-judge evaluation with Claude or GPT-4 as the evaluator.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load merged fine-tuned model
model = AutoModelForCausalLM.from_pretrained("./llama3-merged", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./llama3-merged")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                max_new_tokens=256, temperature=0.1, do_sample=True)

# Test prompts
test_prompts = [
    "Explain what LoRA is in simple terms.",
    "Write a Python function to chunk text for RAG.",
]
for prompt in test_prompts:
    result = pipe(f"### Human: {prompt}\n\n### Assistant:")
    generated = result[0]["generated_text"].split("### Assistant:")[1].strip()
    print(f"Q: {prompt}\nA: {generated[:200]}\n---")
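The held-out perplexity check mentioned earlier reduces to a short formula: the exponential of the mean negative log-likelihood per token. A minimal sketch follows, assuming you have already collected per-token natural-log probabilities from the model. The `perplexity` helper and the numbers are hypothetical.

```python
import math
from typing import List

def perplexity(token_log_probs: List[float]) -> float:
    """exp(mean negative log-likelihood) over a list of natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical log-probs: the base model assigns each held-out token p=0.25,
# the fine-tuned model assigns p=0.5
base_log_probs = [math.log(0.25)] * 8
tuned_log_probs = [math.log(0.5)] * 8

print(f"base perplexity:  {perplexity(base_log_probs):.2f}")
print(f"tuned perplexity: {perplexity(tuned_log_probs):.2f}")
```

Lower is better. Compare the base and fine-tuned models on the same held-out text and with the same tokenizer, since perplexities computed over different tokenizations are not comparable.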
Data quality beats data quantity. 1,000 high-quality, diverse instruction pairs often outperform 100,000 noisy examples. Spend 80% of your effort on data curation.
Start with the smallest rank that works. Try r=8 first. If validation loss plateaus early, increase to r=16 or r=32. Higher ranks aren't always better and cost more VRAM.
Target all attention projections. Including q, k, v, and o projections (and sometimes the MLP layers with target_modules="all-linear") consistently outperforms targeting only q and v.
Use a cosine LR schedule with warmup. A learning rate of 1e-4 to 3e-4 with 5% warmup and cosine decay is the most reliable starting configuration for SFT.
Monitor for catastrophic forgetting. Fine-tuning can degrade general capabilities. Evaluate on a general benchmark (MMLU, ARC) alongside your domain benchmark after training to check how much general capability was lost.
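Use gradient checkpointing. Set gradient_checkpointing=True in the training config (or use_gradient_checkpointing=True in prepare_model_for_kbit_training, as in the QLoRA example) to trade compute for memory. Activations are recomputed during the backward pass instead of being stored, which typically allows 2–3× larger batch sizes at the cost of roughly 20% slower training.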
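The cosine schedule with warmup recommended above can be written out directly. This is a sketch of the usual shape (linear warmup from zero, then half-cosine decay to zero), not the exact scheduler the Trainer builds. `lr_at_step` is a hypothetical helper, and the 1,000-step run is an arbitrary example.

```python
import math

def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-4,
               warmup_ratio: float = 0.05) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Progress through the decay phase, from 0.0 to 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
for step in (0, 25, 50, 525, 1000):
    print(f"step {step:4d}: lr = {lr_at_step(step, total):.2e}")
```

The warmup phase protects the freshly initialised adapters from large early updates, and the decay lets the model settle as training ends.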