HuggingFace Transformers: Fine-Tuning and Inference Guide

HuggingFace Transformers is the de facto library for working with pre-trained language models in 2026, providing a unified API for thousands of models — BERT, GPT-2, LLaMA, Mistral, Falcon, and beyond. Whether you want zero-shot inference via a pipeline or full parameter-efficient fine-tuning with LoRA adapters, the Transformers ecosystem has you covered. This guide walks through the complete workflow: loading models, running inference, fine-tuning on custom data, and serving predictions at scale.

You will learn to use the pipeline abstraction for quick tasks, fine-tune BERT for text classification, apply LoRA with PEFT for resource-efficient training of large models, and serve models via the HuggingFace Inference API or a self-hosted FastAPI endpoint.

Installation and Setup
Inference Pipelines
Tokenizers and Preprocessing
Fine-Tuning with Trainer API
Parameter-Efficient Fine-Tuning with LoRA
Custom Training Loop
Model Serving and Deployment

Installation and Setup

Install the core Transformers library along with the dataset tooling and PEFT for LoRA. If you have a GPU, install the appropriate CUDA-enabled version of PyTorch first — training speed differs by orders of magnitude between CPU and GPU.

# Core stack
pip install transformers datasets accelerate evaluate
pip install peft bitsandbytes trl  # LoRA, 4-bit quantisation, and SFTTrainer
pip install fastapi uvicorn        # Only needed for the serving example

# GPU PyTorch (CUDA 12.1)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Verify setup
python -c "import transformers; print(transformers.__version__)"
python -c "import torch; print(torch.cuda.is_available())"

Note: bitsandbytes enables 4-bit and 8-bit quantisation, dramatically reducing GPU memory requirements when fine-tuning large models like LLaMA-2-7B on a consumer GPU.
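
To make the memory saving concrete, the back-of-the-envelope sketch below estimates how much memory the weights alone occupy at different precisions. It assumes a round 7 billion parameters and ignores activations, gradients, optimiser states, and quantisation overhead, so real usage will be higher. The helper name weight_memory_gb is illustrative only.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the model weights only, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7_000_000_000  # assumed round figure for a "7B" model

for name, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:10s} {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
# fp32 comes to 28.0 GB and 4-bit to 3.5 GB for the weights alone
```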

Inference Pipelines

The pipeline function is the fastest path from zero to working predictions. It handles tokenisation, model loading, and post-processing in one call. Pipelines support dozens of tasks: text classification, named entity recognition, question answering, text generation, translation, summarisation, zero-shot classification, and image classification.

from transformers import pipeline

# Sentiment analysis — downloads model automatically on first run
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
results = classifier(["HuggingFace makes NLP easy!", "This is terrible."])
# [{'label': 'POSITIVE', 'score': 0.9998}, {'label': 'NEGATIVE', 'score': 0.9994}]

# Named Entity Recognition
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
entities = ner("Elon Musk founded SpaceX in Hawthorne, California.")
for e in entities:
    print(f"{e['word']:20s} {e['entity_group']:10s} {e['score']:.3f}")

# Zero-shot classification — no fine-tuning needed
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = zero_shot(
    "The new iPhone has a revolutionary camera system",
    candidate_labels=["technology", "sports", "politics", "entertainment"]
)
print(result["labels"][0], result["scores"][0])  # technology 0.987

# Text generation
generator = pipeline("text-generation", model="gpt2", max_new_tokens=100)
output = generator("The future of AI in medicine is", do_sample=True, temperature=0.7)
print(output[0]["generated_text"])

# Question answering
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
context = """
HuggingFace was founded in 2016 and is headquartered in New York City.
It provides the Transformers library, the Hub for sharing models,
and the Datasets library for NLP benchmarks.
"""
answer = qa(question="Where is HuggingFace headquartered?", context=context)
print(answer)  # e.g. {'score': 0.993, 'start': 57, 'end': 70, 'answer': 'New York City'}
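
The post-processing step that a classification pipeline performs can be approximated in a few lines. The sketch below is a simplified illustration for the single-label case, not the library's actual code: it turns raw logits into the label and score pair shown in the sentiment example. The logits and the postprocess helper are hypothetical.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits: list[float], id2label: dict[int, str]) -> dict:
    """Pick the highest-probability class and report it with its score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for one input; real values come from the model.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
example = postprocess([-3.0, 4.0], id2label)
print(example["label"], round(example["score"], 4))
```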

Tokenizers and Preprocessing

Understanding tokenisation is essential for controlling model input. The AutoTokenizer handles subword tokenisation (BPE, WordPiece, SentencePiece) and returns tensors ready for model consumption. Batching, padding, and truncation are key parameters that affect both correctness and performance.
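
As a rough illustration of the WordPiece-style subword splitting mentioned above, the sketch below implements greedy longest-match-first segmentation over a toy vocabulary. The vocabulary and the wordpiece function are invented for this example; production tokenizers add normalisation, pre-tokenisation, and far larger vocabularies.

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this illustration.
toy_vocab = {"token", "##ization", "##ize", "hug", "##ging", "face"}
print(wordpiece("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece("hugging", toy_vocab))       # ['hug', '##ging']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```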

from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Single sentence
encoding = tokenizer("Hello, HuggingFace!", return_tensors="pt")
print(encoding.input_ids)      # tensor([[101, 7592, 1010, ...]])
print(encoding.attention_mask)  # tensor([[1, 1, 1, ...]])

# Batch with padding and truncation
texts = [
    "Short sentence.",
    "This is a much longer sentence that needs padding or truncation to fit the model input window.",
]
batch = tokenizer(
    texts,
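    padding=True,       # Pad to the longest sequence in the batch
    truncation=True,    # Truncate anything longer than max_length
    max_length=128,
    return_tensors="pt"
)
print(batch.input_ids.shape)  # torch.Size([2, N]), N = longest sequence in the batch; use padding="max_length" for [2, 128]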

# Decode tokens back to text (useful for debugging)
decoded = tokenizer.decode(batch.input_ids[0], skip_special_tokens=True)
print(decoded)

# Special tokens info
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token)
# [CLS] [SEP] [PAD]

# Token-to-character alignment (fast tokenizers only)
enc = tokenizer("New York is a city", return_offsets_mapping=True)
print(enc.offset_mapping)  # [(0, 0), (0, 3), (4, 8), ...]
print(enc.word_ids())      # [None, 0, 1, ...]: the word each token belongs to
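
The effect of padding and truncation on a batch can be sketched without any library. The helper below is a hypothetical simplification: it slices sequences to max_length and pads to the longest remaining one, whereas a real tokenizer also keeps special tokens such as [SEP] in place when truncating. The token ids are made up.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate to max_length, then pad to the longest remaining sequence."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = width - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for a short and a long sequence.
padded = pad_and_truncate(
    [[101, 7592, 102], [101, 2023, 2003, 1037, 2936, 102]],
    max_length=5,
)
print(padded["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 2023, 2003, 1037, 2936]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The attention mask is what lets the model ignore the padded positions, which is why padding affects performance but should not affect correctness.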

Fine-Tuning with Trainer API

The Trainer API abstracts the training loop, gradient accumulation, mixed-precision, distributed training, and evaluation. You provide a model, dataset, and TrainingArguments — the Trainer handles the rest. This example fine-tunes bert-base-uncased for binary text classification on a custom dataset.

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
from datasets import Dataset
import evaluate
import numpy as np

# Prepare dataset
texts = ["Great product!", "Terrible experience.", "Works perfectly.", "Never buying again."]
labels = [1, 0, 1, 0]
raw_dataset = Dataset.from_dict({"text": texts, "label": labels})
raw_dataset = raw_dataset.train_test_split(test_size=0.25, seed=42)

MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw_dataset.map(tokenize, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Load model with classification head
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Metrics
accuracy_metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=preds, references=labels)

# Training arguments
args = TrainingArguments(
    output_dir="./bert-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,   # Mixed precision — requires GPU
    logging_steps=10,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # newer releases prefer processing_class=tokenizer
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("./bert-sentiment-final")
results = trainer.evaluate()
print(results)  # e.g. {'eval_loss': ..., 'eval_accuracy': ..., ...}; with one held-out example, accuracy is either 0.0 or 1.0

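Tip: Set fp16=True on NVIDIA GPUs to enable mixed-precision training, which typically lowers activation memory and raises throughput; the exact gain depends on the model and hardware. On Apple Silicon, recent Transformers releases select the MPS device automatically when it is available, and the older use_mps_device=True flag is deprecated.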
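
Gradient accumulation, one of the features the Trainer handles, is easy to verify on a toy problem. The sketch below uses a hypothetical one-parameter model with a mean-squared-error loss and shows that summing micro-batch gradients, each scaled by 1 / accumulation_steps, reproduces the full-batch gradient. It assumes equal-sized micro-batches and illustrates the idea only, not the Trainer's internals.

```python
def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# One large batch of 8 examples.
full_grad = grad_mse(w, xs, ys)

# Four micro-batches of 2, each gradient scaled by 1 / accumulation_steps.
accumulation_steps = 4
micro = 2
accumulated = 0.0
for i in range(accumulation_steps):
    lo, hi = i * micro, (i + 1) * micro
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(f"full batch: {full_grad:.6f}  accumulated: {accumulated:.6f}")
```

The two values agree up to floating-point rounding, which is why accumulation lets a small GPU emulate a larger batch size at the cost of more forward and backward passes per optimiser step.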

Parameter-Efficient Fine-Tuning with LoRA

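Full fine-tuning of a model like LLaMA-2-7B (7 billion parameters) needs about 28 GB of GPU memory just to hold the weights in fp32, and considerably more once gradients and optimiser states are added, which puts it out of reach for most practitioners. LoRA (Low-Rank Adaptation) inserts small trainable adapter matrices into selected layers, typically the attention projections, while keeping the base model frozen. This usually cuts the number of trainable parameters by more than 99% while often coming close to full fine-tuning on downstream tasks.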

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
import torch

# 4-bit quantisation config — loads 7B model in ~4 GB VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                          # Rank of adapter matrices
    lora_alpha=32,                 # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expect about 16.8M trainable params (32 layers x 4 modules x 2 x 16 x 4096), roughly 0.25% of the ~6.7B total

# Dataset — instruction format
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train[:1000]")

training_args = TrainingArguments(
    output_dir="./llama2-lora",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,  # Matches bnb_4bit_compute_dtype=torch.bfloat16; needs an Ampere-or-newer GPU
    logging_steps=25,
    optim="paged_adamw_8bit",
    report_to="none",
)

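# Note: these keyword arguments match older TRL releases; newer ones move
# dataset_text_field and max_seq_length into SFTConfig.
trainer = SFTTrainer(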
    model=model,
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
)
trainer.train()
model.save_pretrained("./llama2-lora-adapter")
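
The arithmetic behind LoRA can be checked by hand. The sketch below counts the adapter parameters for the configuration above and runs a tiny low-rank forward pass in plain Python. The layer shapes (32 layers, 4096 x 4096 attention projections) are assumptions about LLaMA-2-7B, and the helper names are illustrative rather than part of PEFT.

```python
def lora_param_count(r: int, d_in: int, d_out: int, n_modules: int, n_layers: int) -> int:
    """Each adapted module adds A (r x d_in) and B (d_out x r)."""
    return n_layers * n_modules * r * (d_in + d_out)

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha: float, r: int):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Assumed LLaMA-2-7B shapes: 32 layers, 4096 x 4096 attention projections.
print(lora_param_count(r=16, d_in=4096, d_out=4096, n_modules=4, n_layers=32))  # 16777216

# Tiny example: B starts at zero, so the adapter initially leaves the output unchanged.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]   # shape (r, d_in) with r = 1
B = [[0.0], [0.0]]  # shape (d_out, r), zero-initialised
print(lora_forward(W, A, B, [3.0, 4.0], alpha=2.0, r=1))  # [3.0, 4.0]
```

Because B is zero at the start, training begins from exactly the base model's behaviour and the adapter only gradually shifts the output.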

Custom Training Loop

When the Trainer API doesn't offer enough control — custom optimiser schedules, multi-task losses, or curriculum learning — write a custom training loop with PyTorch. The Accelerate library handles device placement, mixed precision, and distributed setup with minimal boilerplate.

from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
import torch

accelerator = Accelerator(mixed_precision="fp16")  # fp16 needs a GPU; use mixed_precision="no" on CPU

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Dummy dataset
texts = ["I love this!", "I hate this!"] * 50
labels = [1, 0] * 50
enc = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
dataset = TensorDataset(enc.input_ids, enc.attention_mask, torch.tensor(labels))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=10, num_training_steps=len(loader)*3)

model, optimizer, loader, scheduler = accelerator.prepare(model, optimizer, loader, scheduler)

model.train()
for epoch in range(3):
    total_loss = 0
    for input_ids, attention_mask, label_ids in loader:
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=label_ids)
        loss = outputs.loss
        accelerator.backward(loss)
        accelerator.clip_grad_norm_(model.parameters(), 1.0)  # unscales fp16 gradients before clipping
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} | Loss: {total_loss/len(loader):.4f}")
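
The learning-rate schedule used in the loop above follows a simple shape: a linear ramp from zero during warmup, then a linear decay to zero. The function below is a plain-Python sketch of that shape, written to show how the multiplier evolves; the name linear_warmup_decay is hypothetical, and the step counts assume 100 examples at batch size 8 for 3 epochs on a single device.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimiser step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # ramp up from 0 to 1
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))  # decay to 0

base_lr = 2e-5
warmup_steps, total_steps = 10, 39  # 13 batches per epoch * 3 epochs
for step in (0, 5, 10, 25, 39):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:2d}  lr = {lr:.2e}")
```

The peak learning rate is reached exactly at the end of warmup, so a longer warmup trades slower early progress for more stable initial updates.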

Model Serving and Deployment

After training, serve your model via FastAPI for low-latency inference. Cache the model in memory at startup, use batching for throughput, and implement health checks. For production, wrap in Docker and deploy behind a load balancer. The HuggingFace Inference API is an alternative for hosted serving without infrastructure management.

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline
import torch

app = FastAPI(title="HuggingFace Sentiment API")

# Load once at startup — not on every request
@app.on_event("startup")
async def load_model():
    device = 0 if torch.cuda.is_available() else -1
    app.state.classifier = pipeline(
        "sentiment-analysis",
        model="./bert-sentiment-final",
        device=device,
    )

class PredictRequest(BaseModel):
    texts: list[str]

class PredictResponse(BaseModel):
    results: list[dict]

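@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    # Plain def: FastAPI runs it in a worker thread, so the blocking
    # model call does not stall the event loop.
    results = app.state.classifier(req.texts, batch_size=32)
    return PredictResponse(results=results)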

@app.get("/health")
async def health():
    return {"status": "ok"}

# Run: uvicorn app:app --host 0.0.0.0 --port 8000
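
Batching for throughput means grouping incoming texts into fixed-size chunks before they reach the model. The pipeline's batch_size argument does this internally; the hypothetical helper below only illustrates the grouping, for example if you wanted to cap the size of each call yourself.

```python
from typing import Iterator

def chunked(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"review {i}" for i in range(70)]
sizes = [len(batch) for batch in chunked(texts, batch_size=32)]
print(sizes)  # [32, 32, 6]
```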

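Production tip: torch.compile(model) (PyTorch 2.x) JIT-compiles the model graph and can speed up repeated inference calls, particularly with fixed sequence lengths. Speedups of around 20–30% are sometimes seen, but the gain varies with model, hardware, and input shapes, so benchmark before relying on it.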