Retrieval-Augmented Generation (RAG) is one of the most widely adopted patterns for grounding LLM responses in factual, up-to-date information. Instead of relying solely on the model's training data, RAG retrieves relevant documents from a knowledge base at query time and injects them into the prompt. This reduces hallucinations and lets LLMs answer questions about private or recent data.
This guide covers every component of a production RAG system: document ingestion, chunking strategies, embedding models, vector stores, retrieval algorithms, reranking, and generation. Code examples use LangChain and Python throughout.
A RAG system has two phases: an indexing phase (offline) where documents are chunked, embedded, and stored in a vector database, and a query phase (online) where the user's question is embedded, similar chunks are retrieved, and the retrieved context plus the question are sent to an LLM for answer generation.
The beauty of RAG is that you can update your knowledge base without retraining the LLM. Add a new document, embed it, and the system can answer questions about it immediately. This makes RAG ideal for internal knowledge bases, customer support bots, and any application where the data changes frequently.
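To make the two phases concrete, here is a dependency-free sketch. It is a toy under stated assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, and `ToyIndex` is a hypothetical in-memory list standing in for a vector database. It only illustrates the flow of indexing, updating, and querying.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': bag-of-words counts (assumption: stands in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyIndex:
    """Hypothetical in-memory stand-in for a vector store."""

    def __init__(self):
        self.entries = []  # list of (vector, chunk_text)

    def add(self, chunk):
        # Indexing phase: embed the chunk and store it
        self.entries.append((embed(chunk), chunk))

    def retrieve(self, question, k=1):
        # Query phase: embed the question and rank stored chunks by similarity
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

index = ToyIndex()
index.add("refunds are available within 30 days of purchase")
index.add("support is open monday to friday")

# Updating the knowledge base is just another add(); no model is retrained
index.add("annual plans renew automatically every 12 months")

question = "when do annual plans renew"
top = index.retrieve(question, k=1)

# The retrieved context plus the question is what would be sent to the LLM
prompt = f"Context:\n{top[0]}\n\nQuestion: {question}"
print(prompt)
```

The newly added chunk is retrievable as soon as it is stored, which is the property that makes RAG attractive for frequently changing data.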
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
# 1. Load documents
loader = TextLoader("knowledge_base.txt")
documents = loader.load()
# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
# 4. Create retrieval chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True
)
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
Chunking strategy is arguably the most important parameter in a RAG system — poor chunking causes poor retrieval. The goal is to create chunks that are semantically coherent: each chunk should contain one complete idea, not half of two ideas.
Chunking strategies:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# Option 1: Recursive — fast and reliable
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=60,
    separators=["\n\n", "\n", ". ", " ", ""]
)
# Option 2: Semantic — slower but smarter
embeddings = OpenAIEmbeddings()
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
text = "Your long document text here..."
recursive_chunks = recursive_splitter.create_documents([text])
semantic_chunks = semantic_splitter.create_documents([text])
print(f"Recursive: {len(recursive_chunks)} chunks")
print(f"Semantic: {len(semantic_chunks)} chunks")
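To see what the size and overlap parameters do, here is a minimal character-based sketch. It is not the algorithm RecursiveCharacterTextSplitter uses (that one also tries the separators in order to avoid cutting mid-sentence); `chunk_with_overlap` is a hypothetical helper that only shows how overlap repeats the tail of one chunk at the head of the next.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size character windows; each window starts chunk_size - chunk_overlap after the last."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(text, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
```

With a chunk size of 10 and an overlap of 3, the last three characters of each chunk reappear at the start of the next one. That is why a sentence straddling a boundary still shows up intact in at least one chunk when the overlap is large enough.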
Embeddings are dense vector representations of text. Two chunks are "similar" when their vectors are close in high-dimensional space (measured by cosine similarity or dot product). Choosing the right embedding model significantly affects retrieval quality.
Top embedding models in 2026:
text-embedding-3-large (OpenAI): 3072 dimensions, best quality, $0.13/1M tokens
text-embedding-3-small (OpenAI): 1536 dimensions, great price/quality ratio
BAAI/bge-large-en-v1.5: free, local, excellent for English
nomic-embed-text: 768 dimensions, open-source, runs via Ollama
from sentence_transformers import SentenceTransformer
import numpy as np
# Free local embedding model — no API costs
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
documents = [
    "Python is a high-level programming language.",
    "Django is a Python web framework.",
    "Java is a statically typed language."
]
# Encode documents
embeddings = model.encode(documents, normalize_embeddings=True)
print(f"Shape: {embeddings.shape}") # (3, 1024)
# Cosine similarity (embeddings are normalized, so dot product = cosine sim)
query_embedding = model.encode(["What web frameworks exist for Python?"],
                               normalize_embeddings=True)
similarities = np.dot(query_embedding, embeddings.T)[0]
print(dict(zip(documents, similarities)))
# Django doc gets highest score
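The claim that the dot product equals cosine similarity for normalized vectors can be checked by hand. The sketch below uses plain Python lists and two made-up 2-dimensional vectors (real embeddings have hundreds or thousands of dimensions); the helper names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    # Cosine of the angle between a and b; independent of vector length
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    # Scale a to unit length
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine_similarity(a, b)                  # 24 / (5 * 5)
dot_normalized = dot(normalize(a), normalize(b))  # same value after normalization
cos_scaled = cosine_similarity([6.0, 8.0], b)     # scaling a vector does not change the cosine

print(cos_ab, dot_normalized, cos_scaled)
```

Normalizing once at indexing time lets the store use a plain dot product at query time, which is cheaper than recomputing norms for every comparison.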
Vector stores persist embeddings and enable fast approximate nearest-neighbor (ANN) search. For small datasets (<100K chunks), Chroma or FAISS work fine. For production at scale, Pinecone, Weaviate, or Qdrant offer managed infrastructure with filtering, namespaces, and metadata search.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Create FAISS index from documents
docs = [
    Document(page_content="RAG improves LLM accuracy", metadata={"source": "blog", "date": "2026"}),
    Document(page_content="Vector databases store embeddings", metadata={"source": "wiki"}),
    Document(page_content="LangChain simplifies LLM apps", metadata={"source": "docs"}),
]
vectorstore = FAISS.from_documents(docs, embeddings)
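# Save and load (for production)
vectorstore.save_local("faiss_index")
# load_local unpickles the stored docstore, so only set this flag for index files you created or trust
loaded = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)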
# Similarity search with scores (no metadata filter is applied here)
results = loaded.similarity_search_with_score(
    "How do I improve LLM accuracy?", k=2
)
for doc, score in results:
    print(f"Score: {score:.4f} | {doc.page_content}")
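For intuition about what a vector store computes, here is a brute-force exact search with an optional metadata filter. ANN indexes approximate this result without scanning every vector. The record layout, the `where` argument, and the 2-dimensional vectors are assumptions made for illustration, not the FAISS or Chroma API.

```python
import heapq

def l2_squared(a, b):
    """Squared Euclidean distance; lower means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_search(query, records, k=2, where=None):
    """records: list of (vector, text, metadata). where: dict of metadata equality filters."""
    candidates = [
        r for r in records
        if where is None or all(r[2].get(key) == val for key, val in where.items())
    ]
    # Scan every candidate and keep the k closest
    return heapq.nsmallest(k, candidates, key=lambda r: l2_squared(query, r[0]))

records = [
    ([1.0, 0.0], "RAG improves LLM accuracy", {"source": "blog"}),
    ([0.0, 1.0], "Vector databases store embeddings", {"source": "wiki"}),
    ([0.9, 0.1], "LangChain simplifies LLM apps", {"source": "docs"}),
]
query = [1.0, 0.0]

unfiltered = [text for _, text, _ in exact_search(query, records, k=2)]
wiki_only = [text for _, text, _ in exact_search(query, records, k=2, where={"source": "wiki"})]
print(unfiltered)
print(wiki_only)
```

The linear scan costs one distance computation per stored chunk for every query, which is acceptable for small collections and is the cost that ANN structures are designed to avoid at scale.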
Basic similarity search retrieves the k most similar chunks. But in production, naive top-k retrieval often fails for complex questions. Advanced retrieval strategies dramatically improve answer quality.
Hybrid search: Combines semantic (vector) search with keyword (BM25) search. Semantic search handles paraphrases; BM25 handles exact terms such as product names, error codes, and identifiers. Many production RAG systems use hybrid retrieval.
Multi-query retrieval: Uses an LLM to generate multiple phrasings of the original question, retrieves for each, and merges results. Helps when the user's phrasing doesn't match the document's language.
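The multi-query idea can be sketched without an LLM. In the toy below, `generate_variants` is a hypothetical stand-in that returns canned rephrasings (a real system would ask an LLM for them), and `keyword_retrieve` is a deliberately naive word-overlap retriever, so the vocabulary mismatch is easy to see.

```python
def generate_variants(question):
    """Hypothetical stand-in for an LLM call that rephrases the question."""
    canned = {
        "How do I get my money back?": [
            "How do I get my money back?",
            "What is the refund policy?",
            "How can I request a refund?",
        ]
    }
    return canned.get(question, [question])

def keyword_retrieve(query, docs, k=2):
    """Naive retriever: rank docs by number of shared words, dropping zero-overlap docs."""
    q_terms = set(query.lower().strip("?").split())
    scored = [(len(q_terms & set(d.lower().split())), d) for d in docs]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]

def multi_query_retrieve(question, docs, k=2):
    """Retrieve for every phrasing and merge, keeping first-seen order and removing duplicates."""
    merged = []
    for variant in generate_variants(question):
        for doc in keyword_retrieve(variant, docs, k):
            if doc not in merged:
                merged.append(doc)
    return merged

docs = [
    "our refund policy lets you request a refund within 30 days",
    "shipping takes 3 to 5 business days",
    "contact support by email",
]

question = "How do I get my money back?"
single = keyword_retrieve(question, docs)
multi = multi_query_retrieve(question, docs)
print("single query:", single)
print("multi query: ", multi)
```

The original phrasing shares no words with the refund document, so the single query finds nothing, while the rephrased variants surface it. Each variant costs an extra retrieval call, and generating the variants costs one extra LLM call.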
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
docs = [
    Document(page_content="Python asyncio enables concurrent I/O operations"),
    Document(page_content="Use async/await keywords for asynchronous Python code"),
    Document(page_content="Event loops manage coroutine execution in asyncio"),
]
# Keyword retriever
bm25_retriever = BM25Retriever.from_documents(docs, k=2)
# Semantic retriever
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(docs, embeddings)
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
# Hybrid: 40% BM25, 60% semantic
hybrid = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.4, 0.6]
)
results = hybrid.invoke("How does Python handle concurrency?")
for r in results:
    print(r.page_content)
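A common way to merge a keyword ranking with a semantic ranking is weighted Reciprocal Rank Fusion (RRF), which works on ranks only and therefore does not need the two score scales to be comparable. The sketch below is an independent illustration, not LangChain's source code; the function name, the document IDs, and the constant c=60 (a conventional default) are assumptions.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each list adds weight / (c + rank) to a document's score."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]      # keyword retriever output, best first
semantic_ranking = ["doc_c", "doc_a", "doc_d"]  # vector retriever output, best first

fused = weighted_rrf([bm25_ranking, semantic_ranking], weights=[0.4, 0.6])
print(fused)
```

Documents that appear in both lists (doc_a, doc_c) rise to the top. A document found by only one retriever is still kept, ranked according to that retriever's weight.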
Reranking is a second-pass scoring that takes the top-N retrieved chunks and reorders them by actual relevance to the query. This two-stage approach (retrieve broadly, rerank precisely) is the state-of-the-art in 2026 production RAG. Common rerankers include Cohere Rerank, cross-encoder models, and FlashRank (a lightweight local option).
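# Recent LangChain releases also ship CohereRerank in the langchain-cohere package
# (from langchain_cohere import CohereRerank); adjust the import to your installed version
from langchain.retrievers.document_compressors import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever
# Set up base retriever (gets 20 candidates); `vectorstore` is the store built in the earlier examples
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})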
# Cohere reranker reduces to top 5 most relevant
compressor = CohereRerank(
    cohere_api_key="your-key",
    model="rerank-english-v3.0",
    top_n=5
)
# Compression retriever = retrieve + rerank
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)
docs = compression_retriever.invoke("What is the refund policy for annual plans?")
for i, doc in enumerate(docs):
    print(f"Rank {i+1}: {doc.page_content[:100]}")
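The retrieve-then-rerank shape can be shown with toy scorers. In this sketch the first stage counts shared words (cheap and order-blind), and `rerank_score` is a hypothetical stand-in for a cross-encoder that rewards shared word pairs. A real reranker scores the query and the document jointly with a model; only the two-stage structure carries over.

```python
def first_stage(query, docs, n):
    """Cheap, recall-oriented stage: rank by number of shared words."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:n]

def bigrams(text):
    words = text.lower().split()
    return set(zip(words, words[1:]))

def rerank_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: counts shared adjacent word pairs."""
    return len(bigrams(query) & bigrams(doc))

def retrieve_and_rerank(query, docs, n=3, top_n=1):
    candidates = first_stage(query, docs, n)  # retrieve broadly
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:top_n]                   # keep only the most relevant

docs = [
    "annual plans and monthly plans differ in refund eligibility and policy terms for teams",
    "the refund policy for annual plans allows cancellation within 30 days",
    "shipping is free for orders over 50 dollars",
    "our office is closed on public holidays",
]
query = "refund policy for annual plans"

before = first_stage(query, docs, 1)
after = retrieve_and_rerank(query, docs, n=3, top_n=1)
print("first stage top hit:", before[0])
print("after reranking:    ", after[0])
```

Both of the first two documents contain all five query words, so the first stage cannot separate them. The second pass, which looks at how the words are combined, promotes the document that actually states the policy.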
The final step is injecting retrieved context into a well-designed prompt and calling the LLM. A good RAG prompt explicitly instructs the model to use only the provided context, cite sources when possible, and say "I don't know" when the context doesn't contain the answer (rather than hallucinating).
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
RAG_PROMPT = """You are a helpful assistant. Answer the question using ONLY the context below.
If the context does not contain the answer, say "I don't have enough information to answer that."
Do not make up information.
Context:
{context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(RAG_PROMPT)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
def format_docs(docs):
    return "\n\n---\n\n".join(d.page_content for d in docs)
# LCEL chain: question → retrieve → format → LLM → parse
rag_chain = (
    {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
answer = rag_chain.invoke("What is the return window for products?")
print(answer)
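If you want the model to cite sources, the context has to carry something citable. One possible approach, sketched here with plain dictionaries instead of LangChain Document objects, is to number each chunk and show its source. The helper names, the `text` and `source` keys, and the file names are assumptions for illustration.

```python
def format_docs_with_sources(docs):
    """Number each chunk and show its source so the model can cite it as [n]."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        source = doc.get("source", "unknown")
        blocks.append(f"[{i}] (source: {source})\n{doc['text']}")
    return "\n\n---\n\n".join(blocks)

def build_prompt(question, docs):
    """Assemble a grounded prompt: instructions, numbered context, then the question."""
    return (
        "Answer the question using ONLY the context below. "
        "Cite sources by their bracketed number, e.g. [1].\n"
        'If the context does not contain the answer, say "I don\'t have enough information to answer that."\n\n'
        f"Context:\n{format_docs_with_sources(docs)}\n\n"
        f"Question: {question}\nAnswer:"
    )

docs = [
    {"text": "Products can be returned within 30 days of delivery.", "source": "returns.md"},
    {"text": "Refunds are issued to the original payment method.", "source": "refunds.md"},
]
prompt_text = build_prompt("What is the return window for products?", docs)
print(prompt_text)
```

The same numbering can be reused after generation to map each `[n]` in the answer back to the chunk's metadata and display a link to the user.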
Use streaming (rag_chain.stream(...)) to improve perceived latency. RAG pipelines can take 2–5 seconds end-to-end, and streaming lets the user see text as soon as the first tokens are generated instead of waiting for the full answer.
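Streaming does not shorten generation; it changes when the first text becomes visible. The sketch below simulates this with a generator. `fake_llm_stream` and its per-token delay are hypothetical stand-ins for a real streaming call.

```python
import time

def fake_llm_stream(answer, delay=0.01):
    """Hypothetical stand-in for a streaming LLM: yields one word at a time."""
    for token in answer.split(" "):
        time.sleep(delay)  # simulated per-token generation time
        yield token + " "

start = time.perf_counter()
first_token_at = None
pieces = []
for piece in fake_llm_stream("Returns are accepted within 30 days."):
    if first_token_at is None:
        # Time until the user sees the first text
        first_token_at = time.perf_counter() - start
    pieces.append(piece)
    print(piece, end="", flush=True)
total = time.perf_counter() - start

print(f"\nfirst token after {first_token_at:.3f}s, full answer after {total:.3f}s")
```

Time to first token is the number to watch for perceived latency. The retrieval and reranking steps still run before any token is produced, so they remain worth optimizing.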
Evaluating RAG systems requires measuring both retrieval quality and generation quality separately. RAGAS is the leading open-source framework for this. It measures faithfulness (does the answer match the context?), answer relevancy (does the answer address the question?), context precision, and context recall.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset
# Build evaluation dataset
eval_data = {
    "question": ["What is RAG?", "How do embeddings work?"],
    "answer": [
        "RAG retrieves relevant documents and injects them into LLM prompts.",
        "Embeddings are dense vector representations of text in high-dimensional space."
    ],
    "contexts": [
        ["RAG stands for Retrieval Augmented Generation. It grounds LLM responses in retrieved documents."],
        ["Text embeddings map words and sentences to vectors where semantic similarity equals vector proximity."]
    ],
    "ground_truth": [
        "RAG combines retrieval with generation to ground LLM responses.",
        "Embeddings are numerical vector representations capturing semantic meaning."
    ]
}
dataset = Dataset.from_dict(eval_data)
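# evaluate() uses an LLM to score each sample, so model credentials must be configured
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(results)
# Illustrative output only; actual scores depend on the data and the judge model:
# {'faithfulness': 0.95, 'answer_relevancy': 0.88, 'context_precision': 0.91}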
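To build intuition for the retrieval-side metrics, here is a simplified set-based version of precision and recall over chunk IDs. This is not how RAGAS computes context precision or context recall (those rely on an LLM judge and the ground-truth answer); the chunk IDs and the hand-labeled `relevant` set are assumptions for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in relevant) / len(top)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    top = set(retrieved[:k])
    return sum(1 for d in relevant if d in top) / len(relevant)

retrieved = ["c1", "c7", "c3", "c9"]  # retriever output, best first
relevant = {"c1", "c3", "c5"}         # hand-labeled relevant chunks for this question

p = precision_at_k(retrieved, relevant, k=4)
r = recall_at_k(retrieved, relevant, k=4)
print(f"precision@4 = {p:.2f}, recall@4 = {r:.2f}")
```

Low precision means the prompt is padded with irrelevant chunks. Low recall means the answer may not be in the context at all, in which case no amount of prompt engineering on the generation side will fix it.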