Ollama: A Guide to Running Local LLMs on Your Machine

Ollama makes running large language models locally as simple as typing ollama run llama3: no GPU cloud costs, no data leaving your machine, and no API rate limits. In 2026, with 4-bit quantized 7B–8B models fitting in 8GB of RAM and Apple Silicon delivering interactive inference speeds, local LLMs have become a genuine alternative to cloud APIs for many development and privacy-sensitive use cases.

This guide covers installing Ollama, running popular models, building Python applications with the Ollama API, integrating with LangChain, building a local RAG pipeline, deploying Ollama via Docker, and creating custom models with a Modelfile.

Installing Ollama
Running Models
Popular Models and Hardware Requirements
Python API Integration
LangChain + Ollama
Building a Local RAG Pipeline
Running Ollama with Docker
Custom Models with Modelfile

Installing Ollama

Ollama runs on macOS (Apple Silicon and Intel), Linux, and Windows (via WSL2 or native installer). It bundles llama.cpp under the hood — the C++ inference engine that supports CPU, GPU, and Metal acceleration — with a clean CLI and REST API on top.

# macOS / Linux — one-line install
curl -fsSL https://ollama.com/install.sh | sh

# Windows — download installer from https://ollama.com/download

# Verify installation
ollama --version
# ollama version 0.5.x

# Start the Ollama server manually if it is not already running (listens on http://localhost:11434)
ollama serve

Note: On macOS, Ollama runs as a menu bar app and starts automatically. On Linux, the install script sets it up as a systemd service. In both cases the REST API is available at http://localhost:11434 by default, so you only need ollama serve when the server is not already running.
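
To confirm that the server is reachable from code, one option is to query the native API's /api/tags endpoint, which lists locally installed models. The sketch below uses only the standard library; model_names and list_local_models are hypothetical helper names, and the URL assumes the default address mentioned above.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434"  # assumption: default address, unchanged


def model_names(tags_response: dict) -> list:
    """Extract model names from a decoded /api/tags response body."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> Optional[list]:
    """Return installed model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None


if __name__ == "__main__":
    names = list_local_models()
    if names is None:
        print("Ollama is not reachable; try `ollama serve`.")
    else:
        print("Installed models:", names or "(none yet)")
```

An empty list right after installation is expected, since no models have been pulled yet.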

Running Models

Ollama's CLI pulls and runs a model with a single command. Models are downloaded from the Ollama library (ollama.com/library) and cached locally, so after the first download they run without an internet connection; the only startup cost is loading the weights into memory.

# Pull and run Llama 3.1 8B (4-bit quantized, ~5GB download)
ollama run llama3.1

# Run Mistral 7B
ollama run mistral

# Run Google's Gemma 2 9B
ollama run gemma2:9b

# Run a coding-focused model
ollama run codellama:13b

# Run with specific quantization level
ollama run llama3.1:8b-instruct-q8_0   # 8-bit, higher quality, ~8GB

# List downloaded models
ollama list

# Remove a model
ollama rm mistral

# Interactive chat session — type /bye to exit
ollama run llama3.1
>>> Tell me about retrieval augmented generation
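
The commands above address models as name:tag, and a bare name such as llama3.1 resolves to the latest tag. A minimal sketch of that convention follows; split_model_ref is a hypothetical helper, not part of Ollama, and it deliberately ignores references that include a registry host with a port.

```python
DEFAULT_TAG = "latest"  # tag used when a model reference omits one


def split_model_ref(ref: str) -> tuple:
    """Split 'name:tag' into (name, tag), filling in the default tag."""
    name, _, tag = ref.partition(":")
    if not name:
        raise ValueError(f"model reference has no name: {ref!r}")
    return name, tag or DEFAULT_TAG


if __name__ == "__main__":
    for ref in ["llama3.1", "gemma2:9b", "llama3.1:8b-instruct-q8_0"]:
        print(ref, "->", split_model_ref(ref))
```

This is handy when comparing a configured model name against the output of ollama list, which shows fully tagged names.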

Popular Models and Hardware Requirements

Choosing the right model depends on your available RAM (or VRAM if using a GPU). The quantized models below are 4-bit (Q4_K_M) unless noted — the best quality-to-size tradeoff for most uses.

Model	Download Size	Minimum RAM	Best For
llama3.2:3b	2GB	8GB	Fast responses, basic tasks
llama3.1:8b	4.7GB	8GB	General purpose, great quality
mistral:7b	4.1GB	8GB	Instruction following, coding
gemma2:9b	5.5GB	16GB	Reasoning, multilingual
llama3.1:70b	40GB	64GB	Near-GPT-4 quality
codellama:13b	7.4GB	16GB	Code generation and completion
nomic-embed-text	274MB	4GB	Embeddings for RAG
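
As a rough cross-check on the table, the size of a model's weights is approximately parameters × bits per weight ÷ 8. The sketch below applies that rule of thumb and filters the table's RAM column for a given machine. Both function names are hypothetical, and the estimate is a lower bound: formats such as Q4_K_M tend to average slightly more than 4 bits per weight, and the context cache needs memory as well, which is why the listed sizes come out higher.

```python
def weights_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params_billions * bits_per_weight / 8


# RAM requirement per model, copied from the table above (GB).
RAM_NEEDED_GB = {
    "llama3.2:3b": 8,
    "llama3.1:8b": 8,
    "mistral:7b": 8,
    "gemma2:9b": 16,
    "llama3.1:70b": 64,
    "codellama:13b": 16,
    "nomic-embed-text": 4,
}


def models_that_fit(ram_gb: float) -> list:
    """Return the table's models whose listed RAM requirement fits in ram_gb."""
    return sorted(m for m, need in RAM_NEEDED_GB.items() if need <= ram_gb)


if __name__ == "__main__":
    print(f"8B at 4-bit:  ~{weights_size_gb(8, 4):.1f} GB of weights")
    print(f"70B at 4-bit: ~{weights_size_gb(70, 4):.1f} GB of weights")
    print("Fits in 16 GB:", models_that_fit(16))
```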

Python API Integration

Ollama exposes its own REST API, an OpenAI-compatible endpoint under /v1, and an official Python client library (pip install ollama). All three work; use the OpenAI-compatible endpoint if you want to swap between local and cloud models with one config change.

import ollama

# Simple chat completion
response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain Python decorators in 3 sentences."}
    ]
)
print(response["message"]["content"])

# Streaming response — tokens appear as they generate
stream = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
    stream=True
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)

# Generate embeddings locally (free, private)
embedding = ollama.embeddings(
    model="nomic-embed-text",
    prompt="Retrieval augmented generation improves LLM accuracy"
)
print(f"Embedding dimensions: {len(embedding['embedding'])}")  # 768

# OpenAI-compatible client — swap cloud/local with one line
from openai import OpenAI

# Point to local Ollama instead of OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but ignored
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "What is RAG?"}],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)
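
Both clients above are thin wrappers over HTTP. For environments where installing a client library is not an option, the native /api/chat endpoint can be called with the standard library alone. The following is a sketch rather than a full client: build_chat_request and iter_stream_text are hypothetical helper names, and it assumes the streamed response arrives as one JSON object per line, ending with an object whose done field is true.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def build_chat_request(model: str, messages: list, stream: bool = False,
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a POST request for Ollama's native /api/chat endpoint."""
    payload = {"model": model, "messages": messages, "stream": stream}
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed response, one JSON object per line."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break  # final object carries timing stats, no more text follows


def stream_chat(model: str, prompt: str) -> None:
    """Print a streamed reply; requires a running Ollama server."""
    req = build_chat_request(model, [{"role": "user", "content": prompt}], stream=True)
    with urllib.request.urlopen(req, timeout=120) as resp:
        for text in iter_stream_text(resp):
            print(text, end="", flush=True)
    print()
```

Calling stream_chat("llama3.1", "What is RAG?") with the server running prints the reply token by token, much like the stream=True example above.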

LangChain + Ollama

LangChain has first-class Ollama support via the langchain-ollama package. This lets you build chains, agents, and RAG pipelines entirely locally, with no API costs and no data leaving your machine, which is critical for enterprise use cases with sensitive data.

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_core.messages import HumanMessage, SystemMessage

# Local LLM via LangChain
llm = ChatOllama(model="llama3.1", temperature=0.1)

messages = [
    SystemMessage(content="You are a technical writer. Be concise."),
    HumanMessage(content="Explain vector databases in 2 sentences.")
]

response = llm.invoke(messages)
print(response.content)

# Local embeddings — completely free and private
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector = embeddings.embed_query("What is a transformer model?")
print(f"Embedding dim: {len(vector)}")  # 768
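
Embedding vectors like the one above are usually compared with cosine similarity, which is also the kind of comparison a vector store performs during retrieval. A dependency-free sketch, where cosine_similarity and rank_by_similarity are illustrative names and the tiny 2-dimensional vectors stand in for real 768-dimensional embeddings:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float], docs: Sequence[Sequence[float]]) -> list:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    print(rank_by_similarity(query, docs))
```

In practice the query vector would come from embeddings.embed_query and the document vectors from embeddings.embed_documents.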

Building a Local RAG Pipeline

Combining Ollama (local LLM + embeddings) with Chroma (local vector store) creates a completely private, offline RAG pipeline. No API keys, no cloud costs, no data leaves your machine. This is ideal for processing confidential documents.

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter  # replaces the older langchain.text_splitter path
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA

# Load a PDF document (stays on your machine)
loader = PyPDFLoader("confidential_report.pdf")
docs = loader.load()

# Chunk the document
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Embed locally with Ollama — completely private
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./local_rag_db")

# Query with local Llama 3.1
llm = ChatOllama(model="llama3.1", temperature=0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

result = qa.invoke({"query": "What are the key findings in section 3?"})
print(result["result"])
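
To make the chunk_size=500 and chunk_overlap=50 settings concrete, here is a deliberately simplified fixed-window chunker. It is not how RecursiveCharacterTextSplitter works internally (that class prefers to split on paragraph and sentence separators, so its chunks will differ); chunk_text is a hypothetical function that only illustrates what size and overlap mean.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    sample = "x" * 1000
    print([len(c) for c in chunk_text(sample)])
```

With these settings each window starts 450 characters after the previous one, so the last 50 characters of a chunk reappear at the start of the next. That overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.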

Note: On Apple Silicon Macs (M1 and later), Ollama uses Metal GPU acceleration automatically. A 7B model at 4-bit quantization typically runs at roughly 30–60 tokens/second on an M3 Pro, which is fast enough for interactive use.
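
Throughput on your own hardware can be measured rather than assumed: the final response object from Ollama's native API includes eval_count (tokens generated) and eval_duration (nanoseconds spent generating). A small sketch, with tokens_per_second and seconds_to_generate as hypothetical helper names:

```python
from typing import Mapping, Optional


def tokens_per_second(response: Mapping) -> Optional[float]:
    """Compute generation speed from eval_count and eval_duration (nanoseconds)."""
    count = response.get("eval_count")
    duration_ns = response.get("eval_duration")
    if not count or not duration_ns:
        return None  # fields missing, e.g. on a non-final streaming chunk
    return count / (duration_ns / 1e9)


def seconds_to_generate(num_tokens: int, tok_per_s: float) -> float:
    """Estimate wall-clock generation time for a reply of num_tokens tokens."""
    return num_tokens / tok_per_s


if __name__ == "__main__":
    # Illustrative numbers, not a measurement.
    sample = {"eval_count": 300, "eval_duration": 6_000_000_000}
    print(f"{tokens_per_second(sample):.1f} tokens/second")
    for speed in (30, 60):
        print(f"500 tokens at {speed} tok/s: {seconds_to_generate(500, speed):.1f} s")
```

At the quoted 30–60 tokens/second, a 500-token answer takes roughly 8 to 17 seconds, which is why streaming the output matters for perceived responsiveness.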

Running Ollama with Docker

For server deployments, running Ollama in Docker with GPU passthrough gives you a clean, reproducible environment. This is useful for self-hosted internal tools, CI/CD pipelines that need LLM inference, or team-shared Ollama instances.

# CPU-only Docker run
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# With NVIDIA GPU support (requires the NVIDIA Container Toolkit on the host)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Pull model inside container
docker exec -it ollama ollama pull llama3.1

# Docker Compose setup
cat > docker-compose.yml <<'EOF'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped
    # Uncomment for GPU:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]
volumes:
  ollama_data:
EOF
docker compose up -d
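
A container started with -d may take a moment before the API accepts connections, so scripts and CI jobs that call it immediately can fail intermittently. One way to handle this is to poll until the server answers. The sketch below is an assumption-laden example rather than an official pattern: http_probe and wait_until_ready are hypothetical names, and the probe uses the /api/version endpoint on the port published above.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(base_url: str = "http://localhost:11434") -> Callable[[], bool]:
    """Return a function that reports whether the Ollama API answers."""
    def probe() -> bool:
        try:
            with urllib.request.urlopen(f"{base_url}/api/version", timeout=2) as resp:
                return resp.status == 200
        except (OSError, http.client.HTTPException):
            return False  # not listening yet, or connection dropped mid-start
    return probe


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 30.0,
                     interval_s: float = 0.5) -> bool:
    """Call probe repeatedly until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A deployment script could call wait_until_ready(http_probe()) after docker compose up -d and only then run docker exec to pull models. Passing the probe in as a function keeps the retry loop testable without a running server.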

Custom Models with Modelfile

Ollama's Modelfile lets you create custom model variants with a fixed system prompt or different generation parameters, or even package your own fine-tuned GGUF model. It's like a Dockerfile for LLMs.

# Create a custom coding assistant
cat > Modelfile <<'EOF'
FROM llama3.1

# Set a permanent system prompt
SYSTEM """
You are an expert Python developer who writes clean, well-documented code.
Always include type hints. Never use deprecated APIs.
When asked for code, provide complete, runnable examples.
"""

# Tune generation parameters
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
EOF

# Build and run the custom model
ollama create python-expert -f Modelfile
ollama run python-expert
>>> Write a FastAPI endpoint that validates email addresses
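
When several variants share a base model, the Modelfile can be generated instead of written by hand. The sketch below renders the same three instructions used above (FROM, SYSTEM, PARAMETER) from Python values. render_modelfile is a hypothetical helper, and it covers only these instructions, not the full Modelfile syntax.

```python
def render_modelfile(base: str, system: str, parameters: dict) -> str:
    """Render a minimal Modelfile with FROM, SYSTEM, and PARAMETER lines."""
    if '"""' in system:
        raise ValueError("system prompt must not contain triple quotes")
    lines = [f"FROM {base}", ""]
    lines += ['SYSTEM """', system.strip(), '"""', ""]
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        base="llama3.1",
        system="You are an expert Python developer who writes clean, well-documented code.",
        parameters={"temperature": 0.2, "top_p": 0.9, "num_ctx": 8192},
    )
    print(text)
```

Writing the result to a file named Modelfile and running ollama create python-expert -f Modelfile would then build the variant exactly as in the shell example.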