LLM Fine-Tuning: LoRA, QLoRA and PEFT Techniques

Fine-tuning a large language model adapts a pretrained base model to a specific task, style, or domain, and for specialized use cases it can perform substantially better than prompting alone. The challenge has always been cost: fully fine-tuning a 7B-parameter model with a standard optimizer such as AdamW needs well over 80GB of GPU memory (weights, gradients, and optimizer states all have to fit) plus days of compute. Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA cut that requirement sharply, making fine-tuning feasible on a single consumer GPU.

This guide explains how LoRA, QLoRA, and related PEFT techniques work at a conceptual level, then walks through a complete supervised fine-tuning pipeline using Hugging Face's transformers, peft, and trl libraries.

Why Fine-Tune vs Prompt Engineering?

Prompt engineering is always the first thing to try — it's fast, cheap, and reversible. Fine-tuning makes sense when you have a specific task the model consistently gets wrong through prompting, when you need to reduce token costs by shrinking prompts, when you want a smaller model to match a larger model's quality on your domain, or when you need to teach the model proprietary knowledge that wasn't in its training data.

Fine-tuning is NOT needed for: general Q&A, summarization, classification (few-shot prompting usually works), one-off tasks, or cases where RAG can supply the missing knowledge. Over-engineering with fine-tuning when prompting suffices is a common and expensive mistake.

Rule of thumb: If you can't get the model to perform adequately with 10 well-crafted few-shot examples, fine-tuning is worth exploring. If you can, stick with prompting.

LoRA Explained

Low-Rank Adaptation (LoRA) works by freezing all original model weights and injecting small trainable rank-decomposition matrices into selected layers, usually the attention projections. Instead of updating a weight matrix W directly (a 4096×4096 matrix has about 16.8M parameters), LoRA learns two small matrices, B (4096×r) and A (r×4096), where r is the rank (typically 8–64). The effective weight is W + (alpha/r)·BA, where alpha is a fixed scaling factor. B starts at zero, so training begins from the unmodified base model.
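To make the mechanics concrete, here is a minimal pure-Python sketch of a LoRA forward pass. It assumes the convention from the LoRA paper, in which B has shape d×r and A has shape r×d so that BA matches the shape of W. The helper names (matvec, lora_forward) and the tiny dimensions are illustrative assumptions, not part of any library.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) without ever forming the d x d product B A."""
    base = matvec(W, x)                 # frozen path, length d
    low_rank = matvec(B, matvec(A, x))  # A x has length r, B (A x) has length d
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]

# Toy sizes (hypothetical); a real attention projection would use d = 4096 or similar
d, r, alpha = 6, 2, 4
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen, d x d
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # trainable, r x d
B = [[0.0] * r for _ in range(d)]                               # trainable, d x r, zero-initialised
x = [rng.gauss(0, 1) for _ in range(d)]

y_base = matvec(W, x)
y_init = lora_forward(W, A, B, x, alpha, r)     # equals y_base because B is all zeros

B[0][0] = 1.0                                   # pretend training changed one entry of B
y_trained = lora_forward(W, A, B, x, alpha, r)  # only output element 0 moves
```

Because B is zero at the start, the adapted model initially reproduces the base model exactly; only as B and A are trained does the output drift away from it.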

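With rank r=16 on a 7B model, LoRA trains only about 0.1–0.3% of the parameters, depending on which modules are targeted, yet it often comes close to full fine-tuning quality on the target task. Because gradients and optimizer states are needed only for the adapters, training uses far less VRAM and is usually faster; the exact savings depend on the model, rank, and batch size. The LoRA adapters (typically tens to a few hundred MB) are stored separately from the base model and can be swapped at inference time.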

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Meta-Llama-3-8B"

# Load base model in float16 for efficiency
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    r=16,                          # Rank — higher = more capacity, more params
    lora_alpha=32,                 # Scaling factor: the update BA is multiplied by lora_alpha / r
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Wrap model with LoRA
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# Example output for this config (exact figures vary by model and PEFT version):
# trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
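The trainable fraction can also be estimated by hand, since each adapted matrix with input width d_in and output width d_out adds r × (d_in + d_out) parameters. The sketch below assumes a hypothetical 7B-style model with 32 layers, 4096-wide square attention projections, and 7 billion total parameters; the function name and constants are illustrative. Real checkpoints that use grouped-query attention have narrower k and v projections, so their counts come out lower.

```python
def lora_trainable_params(rank, module_shapes, num_layers):
    """Count LoRA parameters: each adapted matrix adds rank * (d_in + d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes)
    return per_layer * num_layers

# Hypothetical 7B-style model (assumed shapes, not a specific checkpoint)
HIDDEN = 4096
NUM_LAYERS = 32
TOTAL_PARAMS = 7_000_000_000

square = (HIDDEN, HIDDEN)
qv_only = lora_trainable_params(16, [square] * 2, NUM_LAYERS)   # q_proj and v_proj
all_attn = lora_trainable_params(16, [square] * 4, NUM_LAYERS)  # q, k, v and o projections

print(f"q,v only: {qv_only:,} params ({100 * qv_only / TOTAL_PARAMS:.2f}%)")
print(f"q,k,v,o:  {all_attn:,} params ({100 * all_attn / TOTAL_PARAMS:.2f}%)")
```

For the assumed shapes this gives about 8.4M parameters (0.12%) when only q and v are targeted, and about 16.8M (0.24%) for all four attention projections.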

QLoRA: Quantization + LoRA

QLoRA (Quantized LoRA) combines 4-bit NF4 quantization of the frozen base model with LoRA adapters trained in bfloat16. The result: the weights of a 7B model, which take about 14GB in float16, shrink to roughly 4GB, and training can fit in around 5GB of VRAM at small batch sizes and short sequence lengths. QLoRA makes fine-tuning possible on a single 24GB RTX 3090/4090, or even on a 16GB GPU with careful batch sizing.

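The key ingredients are the bitsandbytes library, which performs the quantization, and double quantization (quantizing the quantization constants themselves) for extra memory savings. Quality loss vs full fine-tuning is typically small, on the order of 1–3% on standard benchmarks, which is negligible for most domain-specific tasks; measure it on your own evaluation set before relying on it.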

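from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model
import torch

# Reuses model_name and lora_config from the LoRA example above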

# QLoRA: load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 — best for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,     # Saves extra 0.4 bits/param
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for k-bit training (required for gradient checkpointing with quantized models)
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# Apply LoRA on top of quantized model
peft_model = get_peft_model(model, lora_config)
# Report the measured weight footprint instead of assuming a fixed figure
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
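The weight-memory figures for QLoRA can be sanity-checked with simple arithmetic. This sketch assumes the block sizes described in the QLoRA paper (one quantization constant per 64 weights, and a second-level constant per 256 first-level constants); the function names are illustrative. It covers weights only, so adapters, activations, and optimizer state come on top.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

def nf4_bits_per_param(block_size=64, double_quant=True, dq_block_size=256):
    """4 data bits plus the amortised cost of the quantization constants."""
    if double_quant:
        # One 8-bit constant per block, plus one fp32 constant per dq_block_size blocks
        overhead = 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        # One fp32 constant per block
        overhead = 32 / block_size
    return 4 + overhead

NUM_PARAMS = 7_000_000_000  # assumed 7B-parameter model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, nf4_bits_per_param(double_quant=False))
nf4_dq_gb = weight_memory_gb(NUM_PARAMS, nf4_bits_per_param(double_quant=True))
saved_bits = nf4_bits_per_param(double_quant=False) - nf4_bits_per_param(double_quant=True)

print(f"float16: {fp16_gb:.2f} GB")
print(f"NF4: {nf4_gb:.2f} GB, NF4 + double quantization: {nf4_dq_gb:.2f} GB")
print(f"Double quantization saves {saved_bits:.3f} bits per parameter")
```

Under these assumptions the weights drop from 14GB to roughly 3.6GB, and double quantization accounts for a saving of about 0.37 bits per parameter.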

Supervised Fine-Tuning with TRL

The Hugging Face trl library's SFTTrainer is the standard tool for supervised fine-tuning in 2026. It handles dataset formatting, packing short examples to fill context windows (for efficiency), gradient accumulation, and W&B/TensorBoard logging automatically.

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load instruction-following dataset
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

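# Normalise each record into a "### Human: ... ### Assistant: ..." prompt format
def format_example(example):
    _, _, rest = example["text"].partition("### Human:")
    human, _, assistant = rest.partition("### Assistant:")
    return {"text": f"### Human: {human.strip()}\n\n### Assistant: {assistant.strip()}"}

# Apply the formatting to every record
dataset = dataset.map(format_example)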

# Training configuration
training_args = SFTConfig(
    output_dir="./llama3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # Effective batch size = 8
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    fp16=False,
    bf16=True,                        # Use bfloat16 on Ampere+ GPUs
    logging_steps=10,
    save_steps=500,
    max_seq_length=2048,              # Named max_length in newer TRL releases
    dataset_text_field="text",
    packing=True,                     # Pack short sequences for efficiency
    report_to="wandb"
)

trainer = SFTTrainer(
    model=peft_model,                 # Already wrapped with LoRA, so no peft_config is passed
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,              # Named processing_class in newer TRL releases
)

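trainer.train()

Note: Set packing=True for short instruction datasets. Packing concatenates short examples so that each sequence fills the context window instead of being mostly padding, which can raise GPU utilisation markedly (for example from around 40% to over 90% on datasets dominated by short examples) and cut training time by up to 2×.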
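The idea behind packing can be illustrated with a few lines of plain Python. This is a simplified sketch, not TRL's actual implementation: it joins tokenized examples with an end-of-sequence token and cuts the resulting stream into fixed-size blocks. The function name, token ids, and block size are made up for illustration.

```python
def pack_sequences(tokenized_examples, eos_token_id, block_size):
    """Concatenate examples (separated by EOS) and split into full blocks of block_size tokens."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
        stream.append(eos_token_id)  # mark the boundary between examples
    num_blocks = len(stream) // block_size  # the incomplete tail is dropped
    return [stream[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]

# Three short "examples" with made-up token ids
examples = [[11, 12], [21], [31, 32, 33]]
packed = pack_sequences(examples, eos_token_id=0, block_size=4)

# Without packing, each example would be padded to block_size on its own
padded_slots = len(examples) * block_size_used if (block_size_used := 4) else 0
packed_slots = len(packed) * 4
```

Here three padded sequences would occupy 12 token slots, half of them padding, while the packed version fits the same content into two completely full blocks.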

Merging Adapters and Exporting

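After training, LoRA adapters are separate from the base model. For inference, you can either keep them separate and load them dynamically (each forward pass then does a little extra work for the adapter matrices, but you can swap adapters without reloading the base model) or merge them permanently into the base model weights (no inference overhead). For production serving of a single fine-tuned model, merging is usually the better choice.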

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model in float16 (not quantized — needed for merging)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cpu"    # Load to CPU for merging to avoid OOM
)

# Load fine-tuned LoRA adapters
peft_model = PeftModel.from_pretrained(base_model, "./llama3-finetuned/checkpoint-500")

# Merge adapters into base model weights
merged_model = peft_model.merge_and_unload()

# Save merged model — ready for serving
merged_model.save_pretrained("./llama3-merged")
tokenizer.save_pretrained("./llama3-merged")
print("Merged model saved — no LoRA overhead at inference time")
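Conceptually, merging folds the scaled low-rank product into the frozen weight matrix once, so inference needs only a single matrix multiply. The toy sketch below, with hypothetical helper names and hand-picked 2×2 values, checks that the merged weight gives the same output as the unmerged adapter path. It illustrates the arithmetic only and is not how PEFT implements merge_and_unload internally.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(matrix, vector):
    """Multiply a matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B A as a new matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Hand-picked toy values (rank 1, 2 x 2 weight)
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]        # d x r
A = [[3.0, 4.0]]          # r x d
alpha, r = 2, 1
x = [1.0, -1.0]

# Unmerged: base path plus scaled adapter path
adapter_out = matvec(B, matvec(A, x))
unmerged = [b + (alpha / r) * u for b, u in zip(matvec(W, x), adapter_out)]

# Merged: one matrix, one multiply
W_merged = merge_lora(W, A, B, alpha, r)
merged = matvec(W_merged, x)
```

Both paths produce the same vector, which is why a merged model behaves identically to base-plus-adapter while avoiding the extra multiplications.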

Evaluating Your Fine-Tuned Model

Never ship a fine-tuned model without evaluation. At minimum, check perplexity on a held-out test set, run domain-specific benchmarks, and compare outputs against the base model and a GPT-4-class model on 50–100 representative test prompts. For chat/instruction models, use LLM-as-judge evaluation with Claude or GPT-4 as the evaluator.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load merged fine-tuned model
model = AutoModelForCausalLM.from_pretrained("./llama3-merged", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./llama3-merged")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                max_new_tokens=256, temperature=0.1, do_sample=True)

# Test prompts
test_prompts = [
    "Explain what LoRA is in simple terms.",
    "Write a Python function to chunk text for RAG.",
]

for prompt in test_prompts:
    result = pipe(f"### Human: {prompt}\n\n### Assistant:")
    generated = result[0]["generated_text"].split("### Assistant:")[1].strip()
    print(f"Q: {prompt}\nA: {generated[:200]}\n---")
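For the perplexity check, the underlying formula is the exponential of the mean negative log-likelihood per token. The sketch below assumes you have already collected per-token natural-log probabilities from the model on a held-out set; the function name and sample values are illustrative.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 0.25 to every token has perplexity 4
uniform_guess = [math.log(0.25)] * 10

# Made-up log-probs: the second model is more confident about the same tokens
base_log_probs = [math.log(p) for p in (0.10, 0.20, 0.05, 0.30)]
tuned_log_probs = [math.log(p) for p in (0.40, 0.50, 0.25, 0.60)]

ppl_uniform = perplexity(uniform_guess)
ppl_base = perplexity(base_log_probs)
ppl_tuned = perplexity(tuned_log_probs)
```

Lower is better: a fine-tuned model should show lower perplexity than the base model on in-domain held-out text, though perplexity alone does not guarantee better generations.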

Fine-Tuning Best Practices

Data quality beats data quantity. A thousand high-quality, diverse instruction pairs often outperform 100,000 noisy examples. Spend the bulk of your effort, as much as 80%, on data curation.

Start with the smallest rank that works. Try r=8 first. If validation loss plateaus early, increase to r=16 or r=32. Higher ranks aren't always better and cost more VRAM.

Target all attention projections. Including q, k, v, and o projections (and sometimes the MLP layers with target_modules="all-linear") consistently outperforms targeting only q and v.

Use a cosine LR schedule with warmup. A learning rate of 1e-4 to 3e-4 with 5% warmup and cosine decay is the most reliable starting configuration for SFT.

Monitor for catastrophic forgetting. Fine-tuning can degrade general capabilities. Evaluate on a general benchmark (MMLU, ARC) alongside your domain benchmark after training to check how much general capability was lost.

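Use gradient checkpointing. Pass use_gradient_checkpointing=True to prepare_model_for_kbit_training (or set gradient_checkpointing=True in the training configuration) to trade compute for memory. Activations are recomputed during the backward pass instead of being stored, which often allows 2–3× larger batch sizes at the cost of roughly 20% slower training.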
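The warmup-plus-cosine schedule recommended above can be sketched in a few lines. This is a minimal illustration that assumes linear warmup from zero followed by cosine decay to zero; the function name and default values are hypothetical, and the schedule built into your trainer may differ in small details such as rounding of the warmup step count.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.05):
    """Learning rate with linear warmup followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

total = 1000
schedule = [lr_at_step(s, total) for s in range(total + 1)]
peak_step = max(range(total + 1), key=lambda s: schedule[s])
```

With 1,000 total steps and 5% warmup, the rate climbs for the first 50 steps, peaks at step 50, and then decays smoothly to zero by the final step.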