Ollama makes running large language models locally as simple as ollama run llama3. No GPU cloud costs, no data leaving your machine, no API rate limits. In 2026, with 4-bit quantized models fitting in 8GB of RAM and Apple Silicon delivering impressive inference speeds, local LLMs have become a genuine alternative to cloud APIs for many development and privacy-sensitive use cases.
This guide covers installing Ollama, running popular models, building Python applications with the Ollama API, integrating with LangChain, and deploying Ollama in production via Docker.
Ollama runs on macOS (Apple Silicon and Intel), Linux, and Windows (via WSL2 or native installer). It bundles llama.cpp under the hood — the C++ inference engine that supports CPU, GPU, and Metal acceleration — with a clean CLI and REST API on top.
# macOS / Linux — one-line install
curl -fsSL https://ollama.com/install.sh | sh
# Windows — download installer from https://ollama.com/download
# Verify installation
ollama --version
# ollama version 0.5.x
# Start the Ollama server (runs on http://localhost:11434); skip this if the installer already started it as a background service
ollama serve
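To confirm from code that the server is reachable, one option is to ask the REST API which models it has downloaded. The following is a minimal sketch using only the standard library. It assumes the default address shown above and the `/api/tags` endpoint, which returns the local models as JSON; the helper names `parse_model_names` and `list_local_models` are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used by `ollama serve`


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from a decoded /api/tags response body."""
    return [model["name"] for model in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Return the names of the models the local server has downloaded.

    Raises urllib.error.URLError if the server cannot be reached.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.load(resp))


# Example (requires a running server):
#   print(list_local_models())
```

An empty list means the server is up but no models have been pulled yet; a `URLError` means nothing is listening on that address.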
Ollama's CLI handles pulling and running models with a single command. Models are downloaded from the Ollama library (ollama.com/library) and cached locally. Once downloaded, they load from disk with no internet connection required.
# Pull and run Llama 3.1 8B (4-bit quantized, ~5GB download)
ollama run llama3.1
# Run Mistral 7B
ollama run mistral
# Run Google's Gemma 2 9B
ollama run gemma2:9b
# Run a coding-focused model
ollama run codellama:13b
# Run with specific quantization level
ollama run llama3.1:8b-instruct-q8_0 # 8-bit, higher quality, ~8GB
# List downloaded models
ollama list
# Remove a model
ollama rm mistral
# Interactive chat session — type /bye to exit
ollama run llama3.1
>>> Tell me about retrieval augmented generation
Choosing the right model depends on your available RAM (or VRAM if using a GPU). The quantized models below are 4-bit (Q4_K_M) unless noted — the best quality-to-size tradeoff for most uses.
| Model | Size | RAM Needed | Best For |
|---|---|---|---|
| llama3.2:3b | 2GB | 8GB | Fast responses, basic tasks |
| llama3.1:8b | 4.7GB | 8GB | General purpose, great quality |
| mistral:7b | 4.1GB | 8GB | Instruction following, coding |
| gemma2:9b | 5.5GB | 16GB | Reasoning, multilingual |
| llama3.1:70b | 40GB | 64GB | Near-GPT-4 quality |
| codellama:13b | 7.4GB | 16GB | Code generation and completion |
| nomic-embed-text | 274MB | 4GB | Embeddings for RAG |
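The table can be turned into a small lookup, and the sizes can be sanity-checked with back-of-the-envelope arithmetic. The sketch below copies the chat-model rows from the table above; the function names are illustrative, and the size estimate is a rough lower bound that assumes every weight is stored at the nominal bit width.

```python
# (model tag, download size in GB, RAM needed in GB), copied from the table above.
# The embedding model is left out because it is not a chat model.
MODELS = [
    ("llama3.2:3b", 2.0, 8),
    ("llama3.1:8b", 4.7, 8),
    ("mistral:7b", 4.1, 8),
    ("gemma2:9b", 5.5, 16),
    ("codellama:13b", 7.4, 16),
    ("llama3.1:70b", 40.0, 64),
]


def models_that_fit(ram_gb: float) -> list[str]:
    """Return the tags of all listed models whose RAM requirement fits."""
    return [tag for tag, _size_gb, ram_needed in MODELS if ram_needed <= ram_gb]


def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight size: parameters x bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


print(models_that_fit(16))
print(estimate_weights_gb(8, 4))  # nominal 4-bit size of an 8B model, in GB
```

Real downloads come out somewhat larger than the nominal estimate, as the table shows, because a Q4_K_M file does not store every tensor at exactly 4 bits. Inference also needs memory beyond the weights for the context window.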
Ollama exposes a REST API compatible with the OpenAI API format, plus its own Python client library. Both work — use the OpenAI-compatible endpoint if you want to swap between local and cloud models with one config change.
import ollama
# Simple chat completion
response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain Python decorators in 3 sentences."},
    ],
)
print(response["message"]["content"])
# Streaming response: tokens appear as they generate
stream = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
# Generate embeddings locally (free, private)
embedding = ollama.embeddings(
    model="nomic-embed-text",
    prompt="Retrieval augmented generation improves LLM accuracy",
)
print(f"Embedding dimensions: {len(embedding['embedding'])}")  # 768
# OpenAI-compatible client: swap cloud/local with one line
from openai import OpenAI
# Point to local Ollama instead of OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required but ignored
)
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "What is RAG?"}],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)
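The "one config change" idea can be sketched as a small helper that picks the client settings from an environment variable. This is an illustration under assumptions: the variable name `LLM_BACKEND` and the function `client_kwargs` are hypothetical, and the cloud branch relies on the OpenAI client reading `OPENAI_API_KEY` from the environment when no key is passed.

```python
import os


def client_kwargs(backend: str) -> dict:
    """Return keyword arguments for openai.OpenAI(...) for the chosen backend."""
    if backend == "local":
        # Same values as the example above; Ollama ignores the API key.
        return {"base_url": "http://localhost:11434/v1", "api_key": "ollama"}
    if backend == "cloud":
        # No overrides: the client uses its default URL and OPENAI_API_KEY.
        return {}
    raise ValueError(f"unknown backend: {backend!r}")


# LLM_BACKEND is a hypothetical variable name chosen for this sketch.
backend = os.environ.get("LLM_BACKEND", "local")
print(client_kwargs(backend))

# Usage (requires the openai package):
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs(backend))
```

The model name passed to `chat.completions.create` still has to match the backend, so in practice it would be switched alongside these settings.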
LangChain has first-class Ollama support via the langchain-ollama package. This lets you build chains, agents, and RAG pipelines entirely locally, with no API costs and no data leaving your machine, which matters for enterprise use cases with sensitive data.
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_core.messages import HumanMessage, SystemMessage
# Local LLM via LangChain
llm = ChatOllama(model="llama3.1", temperature=0.1)
messages = [
    SystemMessage(content="You are a technical writer. Be concise."),
    HumanMessage(content="Explain vector databases in 2 sentences."),
]
response = llm.invoke(messages)
print(response.content)
# Local embeddings — completely free and private
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector = embeddings.embed_query("What is a transformer model?")
print(f"Embedding dim: {len(vector)}") # 768
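Embedding vectors become useful once they are compared. A common measure is cosine similarity, which is roughly what a vector store computes when it retrieves the closest chunks. The sketch below is plain Python with illustrative function names; real vector stores use optimized index structures and may use a different distance metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: list[float], doc_vecs: list[list[float]]) -> list[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Usage with the embeddings object above (requires a running Ollama server):
#   texts = ["Transformers use attention.", "Bread needs yeast."]
#   doc_vecs = embeddings.embed_documents(texts)
#   query_vec = embeddings.embed_query("What is a transformer model?")
#   best = texts[rank_by_similarity(query_vec, doc_vecs)[0]]
```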
Combining Ollama (local LLM + embeddings) with Chroma (local vector store) creates a completely private, offline RAG pipeline. No API keys, no cloud costs, no data leaves your machine. This is ideal for processing confidential documents.
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA
# Load a PDF document (stays on your machine)
loader = PyPDFLoader("confidential_report.pdf")
docs = loader.load()
# Chunk the document
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# Embed locally with Ollama — completely private
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./local_rag_db")
# Query with local Llama 3.1
llm = ChatOllama(model="llama3.1", temperature=0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
)
result = qa.invoke({"query": "What are the key findings in section 3?"})
print(result["result"])
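Because the chain is built with return_source_documents=True, the result also carries the retrieved chunks under the "source_documents" key, which helps when checking an answer against the PDF. A minimal sketch of a formatter follows; the function name is illustrative, and it assumes each document exposes `metadata` and `page_content`, with the "source" and "page" metadata keys that PyPDFLoader sets.

```python
def format_sources(source_documents, preview_chars: int = 80) -> list[str]:
    """Build one summary line per retrieved chunk."""
    lines = []
    for index, doc in enumerate(source_documents, start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")  # PyPDFLoader pages are zero-based
        # Collapse whitespace so the preview stays on a single line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"[{index}] {source} (page {page}): {preview}")
    return lines


# Usage with the result from the query above:
#   for line in format_sources(result["source_documents"]):
#       print(line)
```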
For server deployments, running Ollama in Docker with GPU passthrough gives you a clean, reproducible environment. This is useful for self-hosted internal tools, CI/CD pipelines that need LLM inference, or team-shared Ollama instances.
# CPU-only Docker run
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# With NVIDIA GPU support
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# Pull model inside container
docker exec -it ollama ollama pull llama3.1
# Docker Compose setup
cat > docker-compose.yml <<'EOF'
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped
    # Uncomment for GPU:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]
volumes:
  ollama_data:
EOF
docker compose up -d
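In a CI/CD pipeline, the container may need a moment before it accepts requests, so scripts that call it right after startup can fail intermittently. One way to handle this is a small readiness loop, sketched below with the standard library. It assumes the port mapping shown above and that the server answers a plain GET on its root URL with HTTP 200; the function names and retry defaults are illustrative.

```python
import time
import urllib.request
from typing import Callable


def server_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the server answers a GET on its root URL with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, refused connections, and timeouts
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30, delay: float = 1.0) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Usage after `docker compose up -d`:
#   if not wait_until_ready(server_is_up):
#       raise SystemExit("Ollama did not become ready in time")
```

Passing the probe in as a function keeps the retry logic testable without a running container.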
Ollama's Modelfile lets you create custom model variants with a fixed system prompt, different temperature settings, or even load your own GGUF fine-tuned model. It's like a Dockerfile for LLMs.
# Create a custom coding assistant
cat > Modelfile <<'EOF'
FROM llama3.1
# Set a permanent system prompt
SYSTEM """
You are an expert Python developer who writes clean, well-documented code.
Always include type hints. Never use deprecated APIs.
When asked for code, provide complete, runnable examples.
"""
# Tune generation parameters
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
EOF
# Build and run the custom model
ollama create python-expert -f Modelfile
ollama run python-expert
>>> Write a FastAPI endpoint that validates email addresses
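When several variants are needed (for example, one per team or per task), the Modelfile text can be generated instead of written by hand. The sketch below only builds the file contents from the same FROM, SYSTEM, and PARAMETER instructions used above; the function name `render_modelfile` is illustrative, and it assumes the system prompt does not itself contain triple quotes.

```python
def render_modelfile(base_model: str, system_prompt: str, parameters: dict) -> str:
    """Render Modelfile text with a base model, system prompt, and parameters."""
    lines = [
        f"FROM {base_model}",
        f'SYSTEM """{system_prompt.strip()}"""',
    ]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.1",
    "You are an expert Python developer who writes clean, well-documented code.",
    {"temperature": 0.2, "top_p": 0.9, "num_ctx": 8192},
)
print(text)

# To use it, write the text to a file and build as shown above:
#   with open("Modelfile", "w") as f:
#       f.write(text)
#   then run: ollama create python-expert -f Modelfile
```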