HuggingFace Transformers is the de facto library for working with pre-trained language models in 2026, providing a unified API for thousands of models — BERT, GPT-2, LLaMA, Mistral, Falcon, and beyond. Whether you want zero-shot inference via a pipeline or full parameter-efficient fine-tuning with LoRA adapters, the Transformers ecosystem has you covered. This guide walks through the complete workflow: loading models, running inference, fine-tuning on custom data, and serving predictions at scale.
You will learn to use the pipeline abstraction for quick tasks, fine-tune BERT for text classification, apply LoRA with PEFT for resource-efficient training of large models, and serve models via the HuggingFace Inference API or a self-hosted FastAPI endpoint.
Install the core Transformers library along with the dataset tooling and PEFT for LoRA. If you have a GPU, install the appropriate CUDA-enabled version of PyTorch first — training speed differs by orders of magnitude between CPU and GPU.
# Core stack
pip install transformers datasets accelerate evaluate
pip install peft bitsandbytes trl # For LoRA, 4-bit quantisation, and SFTTrainer
# GPU PyTorch (CUDA 12.1)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# Verify setup
python -c "import transformers; print(transformers.__version__)"
python -c "import torch; print(torch.cuda.is_available())"
bitsandbytes enables 4-bit and 8-bit quantisation, dramatically reducing GPU memory requirements when fine-tuning large models like LLaMA-2-7B on a consumer GPU.
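To make the memory saving concrete, here is a rough back-of-the-envelope sketch. It counts weight storage only (no activations, gradients, optimizer states, or quantisation overhead). The helper name `weight_memory_gb` and the round 7-billion parameter count are illustrative assumptions, not measured values.

```python
# Rough weight-only memory estimate; ignores activations, gradients,
# optimizer states, and the small overhead of quantisation constants.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Assumed round figure of 7 billion parameters for a "7B" model.
for precision in BYTES_PER_PARAM:
    print(f"{precision:>5}: {weight_memory_gb(7e9, precision):.1f} GB")
```

Real usage will be higher than these figures, since training also needs room for activations and adapter gradients.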
The pipeline function is the fastest path from zero to working predictions. It handles tokenisation, model loading, and post-processing in one call. Pipelines support dozens of tasks: text classification, named entity recognition, question answering, text generation, translation, summarisation, zero-shot classification, and image classification.
from transformers import pipeline
# Sentiment analysis — downloads model automatically on first run
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
results = classifier(["HuggingFace makes NLP easy!", "This is terrible."])
# [{'label': 'POSITIVE', 'score': 0.9998}, {'label': 'NEGATIVE', 'score': 0.9994}]
# Named Entity Recognition
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
entities = ner("Elon Musk founded SpaceX in Hawthorne, California.")
for e in entities:
    print(f"{e['word']:20s} {e['entity_group']:10s} {e['score']:.3f}")
# Zero-shot classification: no fine-tuning needed
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = zero_shot(
    "The new iPhone has a revolutionary camera system",
    candidate_labels=["technology", "sports", "politics", "entertainment"]
)
print(result["labels"][0], result["scores"][0]) # technology 0.987
# Text generation
generator = pipeline("text-generation", model="gpt2", max_new_tokens=100)
output = generator("The future of AI in medicine is", do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
# Question answering
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
context = """
HuggingFace was founded in 2016 and is headquartered in New York City.
It provides the Transformers library, the Hub for sharing models,
and the Datasets library for NLP benchmarks.
"""
answer = qa(question="Where is HuggingFace headquartered?", context=context)
print(answer) # dict with 'score', 'start' and 'end' (character offsets into context), and 'answer': 'New York City'
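The post-processing step that a classification pipeline performs can be sketched in plain Python: convert raw logits to probabilities with a softmax, then report the highest-scoring label. The logits and the `id2label` mapping below are made-up values for illustration, not output from a real model, and `postprocess` is a hypothetical helper rather than a library function.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits: list[float], id2label: dict[int, str]) -> dict:
    """Mimic a classifier's final step: best label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for one sentence and a two-class label map.
example = postprocess([-3.2, 4.1], {0: "NEGATIVE", 1: "POSITIVE"})
print(example["label"], round(example["score"], 4))
```

This is why pipeline scores are probabilities in the range 0 to 1, not raw model outputs.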
Understanding tokenisation is essential for controlling model input. The AutoTokenizer handles subword tokenisation (BPE, WordPiece, SentencePiece) and returns tensors ready for model consumption. Batching, padding, and truncation are key parameters that affect both correctness and performance.
from transformers import AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Single sentence
encoding = tokenizer("Hello, HuggingFace!", return_tensors="pt")
print(encoding.input_ids) # tensor([[101, 7592, 1010, ...]])
print(encoding.attention_mask) # tensor([[1, 1, 1, ...]])
# Batch with padding and truncation
texts = [
    "Short sentence.",
    "This is a much longer sentence that needs padding or truncation to fit the model input window.",
]
batch = tokenizer(
    texts,
    padding=True, # Pad to the longest sequence in the batch
    truncation=True, # Truncate anything longer than max_length
    max_length=128,
    return_tensors="pt"
)
print(batch.input_ids.shape) # torch.Size([2, N]), where N is the longest sequence in the batch; use padding="max_length" to get [2, 128]
# Decode tokens back to text (useful for debugging)
decoded = tokenizer.decode(batch.input_ids[0], skip_special_tokens=True)
print(decoded)
# Special tokens info
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token)
# [CLS] [SEP] [PAD]
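# Word-to-token alignment (requires a fast tokenizer)
aligned = tokenizer("New York is a city", return_offsets_mapping=True)
print(aligned.offset_mapping) # Character spans per token: [(0, 0), (0, 3), (4, 8), ...]
print(aligned.word_ids()) # Word index per token: [None, 0, 1, 2, 3, 4, None]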
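What padding and truncation do to a batch can be illustrated without a tokenizer. The sketch below works on hypothetical token-id lists (the ids and `pad_id=0` are assumptions for illustration) and pads to the longest sequence in the batch, which is how `padding=True` behaves. Padding every sequence to a fixed length corresponds to `padding="max_length"` instead.

```python
def pad_batch(sequences: list[list[int]], max_length: int, pad_id: int = 0):
    """Truncate to max_length, then pad every sequence to the longest one left.

    Real tokenizers also preserve special tokens such as [SEP] when
    truncating; this sketch simply slices.
    """
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Hypothetical token ids: one short and one long sequence.
ids, mask = pad_batch([[101, 2460, 102], [101, 2023, 2003, 1037, 2936, 102]], max_length=5)
print(ids)
print(mask)
```

The attention mask is what lets the model ignore padded positions, so batches of mixed lengths give the same predictions as unpadded inputs.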
The Trainer API abstracts the training loop, gradient accumulation, mixed-precision, distributed training, and evaluation. You provide a model, dataset, and TrainingArguments — the Trainer handles the rest. This example fine-tunes bert-base-uncased for binary text classification on a custom dataset.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
from datasets import Dataset
import evaluate
import numpy as np
# Prepare dataset
texts = ["Great product!", "Terrible experience.", "Works perfectly.", "Never buying again."]
labels = [1, 0, 1, 0]
raw_dataset = Dataset.from_dict({"text": texts, "label": labels})
raw_dataset = raw_dataset.train_test_split(test_size=0.25, seed=42)
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)
tokenized = raw_dataset.map(tokenize, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Load model with classification head
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
# Metrics
accuracy_metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=preds, references=labels)
# Training arguments
args = TrainingArguments(
    output_dir="./bert-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch", # Older Transformers releases name this evaluation_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True, # Mixed precision; requires a CUDA GPU
    logging_steps=10,
    report_to="none",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    processing_class=tokenizer, # Older Transformers releases take tokenizer=tokenizer instead
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model("./bert-sentiment-final")
results = trainer.evaluate()
print(results) # e.g. {'eval_loss': ..., 'eval_accuracy': ..., ...}; scores on this four-example toy dataset are not meaningful
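Set fp16=True on NVIDIA GPUs to enable mixed-precision training, which usually lowers activation memory and raises throughput on GPUs with Tensor Cores; the size of the gain depends on the model and hardware. On Apple Silicon, recent Transformers releases select the MPS device automatically when it is available, and the older use_mps_device=True flag is deprecated.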
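The metric function passed to the Trainer reduces each row of logits to a predicted class and compares it with the reference label. A dependency-free sketch of that computation, using made-up logits and a hypothetical helper name, looks like this:

```python
def accuracy_from_logits(logits: list[list[float]], labels: list[int]) -> float:
    """Argmax each row of logits, then return the fraction matching the labels."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Hypothetical logits for four examples in a binary task.
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_labels = [1, 0, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))
```

With realistic evaluation sets, reporting F1 or per-class recall alongside accuracy is usually worthwhile, especially when classes are imbalanced.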
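Full fine-tuning of a model like LLaMA-2-7B (7 billion parameters) needs about 28 GB of GPU memory just to hold the weights in fp32, and several times that once gradients and Adam optimizer states are included, which puts it out of reach for most practitioners. LoRA (Low-Rank Adaptation) inserts small trainable adapter matrices into selected layers, typically the attention projections, while keeping the base model frozen. This cuts the number of trainable parameters by more than 99% while often reaching downstream performance close to full fine-tuning.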
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
import torch
# 4-bit quantisation config — loads 7B model in ~4 GB VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# LoRA configuration
lora_config = LoraConfig(
    r=16, # Rank of adapter matrices
    lora_alpha=32, # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 16,777,216 (about 0.25% of the ~6.7B total; the exact totals printed depend on how PEFT counts quantised weights)
# Dataset — instruction format
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train[:1000]")
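training_args = TrainingArguments(
    output_dir="./llama2-lora",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True, # Matches bnb_4bit_compute_dtype; use fp16=True on GPUs without bfloat16 support
    logging_steps=25,
    optim="paged_adamw_8bit",
    report_to="none",
)
# Note: the keyword arguments below follow older TRL releases. Newer releases expect
# dataset_text_field and the sequence-length limit on SFTConfig and take
# processing_class instead of tokenizer; check the TRL docs for your installed version.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
)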
trainer.train()
model.save_pretrained("./llama2-lora-adapter")
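The parameter savings from LoRA can be checked by hand: each adapted weight matrix gains two low-rank factors, A (rank × d_in) and B (d_out × rank). The sketch below assumes the shapes of a 7B LLaMA-style model (hidden size 4096, 32 layers, four square attention projections per layer). The helper name `lora_trainable_params` is hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int,
                          modules_per_layer: int, num_layers: int) -> int:
    """Each adapted weight gains A (rank x d_in) and B (d_out x rank)."""
    per_module = rank * (d_in + d_out)
    return per_module * modules_per_layer * num_layers


# Assumed shapes for a 7B LLaMA-style model: hidden size 4096, 32 layers,
# and four square attention projections (q, k, v, o) per layer.
adapters = lora_trainable_params(4096, 4096, rank=16, modules_per_layer=4, num_layers=32)
adapted_weights = 4096 * 4096 * 4 * 32  # parameters in the frozen projections themselves
print(adapters, f"{adapters / adapted_weights:.2%} of the adapted weights")
```

Doubling the rank doubles the adapter size, so `r` is the main knob for trading capacity against memory. Adding or removing entries in `target_modules` scales the count in the same linear way.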
When the Trainer API doesn't offer enough control — custom optimiser schedules, multi-task losses, or curriculum learning — write a custom training loop with PyTorch. The Accelerate library handles device placement, mixed precision, and distributed setup with minimal boilerplate.
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
import torch
accelerator = Accelerator(mixed_precision="fp16") # fp16 needs a CUDA GPU; use mixed_precision="no" on CPU
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Dummy dataset
texts = ["I love this!", "I hate this!"] * 50
labels = [1, 0] * 50
enc = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
dataset = TensorDataset(enc.input_ids, enc.attention_mask, torch.tensor(labels))
loader = DataLoader(dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=10, num_training_steps=len(loader)*3)
model, optimizer, loader, scheduler = accelerator.prepare(model, optimizer, loader, scheduler)
model.train()
for epoch in range(3):
    total_loss = 0.0
    for input_ids, attention_mask, label_ids in loader:
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=label_ids)
        loss = outputs.loss
        accelerator.backward(loss)
        # Clip through Accelerate so fp16 gradients are unscaled before the norm is computed
        accelerator.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} | Loss: {total_loss/len(loader):.4f}")
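The shape of a linear warmup-then-decay schedule is easy to see in isolation. The sketch below is written from the general definition and is meant to mirror what `get_linear_schedule_with_warmup` produces. The function name `linear_warmup_multiplier` is hypothetical, and the step count assumes 100 examples at batch size 8 for 3 epochs.

```python
def linear_warmup_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier: ramp 0 -> 1 during warmup, then decay linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


base_lr = 2e-5
total = 39  # Assumed: 13 batches per epoch * 3 epochs
for step in (0, 5, 10, 25, 39):
    lr = base_lr * linear_warmup_multiplier(step, warmup_steps=10, total_steps=total)
    print(f"step {step:2d}: lr = {lr:.2e}")
```

The optimizer's learning rate at each step is the base rate times this multiplier, which is why `scheduler.step()` must be called once per optimizer step.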
After training, serve your model via FastAPI for low-latency inference. Cache the model in memory at startup, use batching for throughput, and implement health checks. For production, wrap in Docker and deploy behind a load balancer. The HuggingFace Inference API is an alternative for hosted serving without infrastructure management.
from contextlib import asynccontextmanager
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline
import torch
# Load once at startup, not on every request (lifespan replaces the deprecated on_event hook)
@asynccontextmanager
async def lifespan(app: FastAPI):
    device = 0 if torch.cuda.is_available() else -1
    app.state.classifier = pipeline(
        "sentiment-analysis",
        model="./bert-sentiment-final",
        device=device,
    )
    yield
app = FastAPI(title="HuggingFace Sentiment API", lifespan=lifespan)
class PredictRequest(BaseModel):
    texts: list[str]
class PredictResponse(BaseModel):
    results: list[dict]
# Plain def: FastAPI runs it in a worker thread, so inference does not block the event loop
@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    results = app.state.classifier(req.texts, batch_size=32)
    return PredictResponse(results=results)
@app.get("/health")
async def health():
    return {"status": "ok"}
# Run: uvicorn app:app --host 0.0.0.0 --port 8000
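A client for the endpoint needs only the standard library. The sketch below builds, but does not send, a POST request whose JSON body matches the `PredictRequest` schema. The helper name `build_predict_request` and the `http://localhost:8000` base URL are assumptions for a local test setup.

```python
import json
import urllib.request


def build_predict_request(texts: list[str],
                          base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST request matching the PredictRequest schema."""
    body = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(["Great product!", "Never buying again."])
print(request.get_method(), request.full_url)

# To send it once the server is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Sending several texts in one request lets the server batch them through the model in a single call, which is generally cheaper than one request per text.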
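Use torch.compile(model) (PyTorch 2.x) to JIT-compile the model graph, which can speed up repeated inference calls by roughly 20–30% depending on the model and hardware. Gains are most reliable with fixed sequence lengths, because changing input shapes can trigger recompilation.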