LlamaIndex (formerly GPT Index) is a widely used framework for building document-centric AI applications: knowledge bases, document Q&A systems, multi-document research tools, and structured data query engines. Where LangChain is a general-purpose LLM framework, LlamaIndex is purpose-built for ingesting, indexing, and querying data, with first-class support for PDFs, databases, APIs, and complex document structures.
This guide walks through LlamaIndex's core abstractions — documents, nodes, indices, query engines, and agents — with complete Python examples for each pattern. By the end you'll be able to build a production-grade document Q&A system that handles PDFs, applies metadata filtering, and supports multi-document synthesis.
Understanding LlamaIndex's object model is key to using it effectively. The framework has a clean hierarchy: Documents are raw text or files; Nodes are chunked pieces of documents with metadata; Indices organize nodes for fast retrieval (most commonly as a VectorStoreIndex); Retrievers fetch relevant nodes given a query; Response Synthesizers combine retrieved nodes into a final answer using an LLM; and Query Engines combine a retriever and synthesizer into a single queryable interface.
from llama_index.core import Settings, VectorStoreIndex, Document
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure global settings — applied to all operations
Settings.llm = OpenAI(model="gpt-4o", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.chunk_size = 512
Settings.chunk_overlap = 64
# Create documents manually
documents = [
    Document(text="LlamaIndex is a framework for building LLM applications over data.",
             metadata={"source": "docs", "category": "framework"}),
    Document(text="VectorStoreIndex embeds documents and enables semantic search.",
             metadata={"source": "docs", "category": "index"}),
]
# Build index — automatically chunks, embeds, and stores
index = VectorStoreIndex.from_documents(documents)
# Create query engine — the main interface for Q&A
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What is LlamaIndex used for?")
print(response) # Final synthesized answer
print(response.source_nodes[0].text) # Source chunk used
Set Settings.llm and Settings.embed_model explicitly at the top of your script. LlamaIndex applies these settings globally, and if you leave them unset it can silently fall back to older default models.
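To get a feel for what chunk_size and chunk_overlap control, here is a minimal sketch of overlapping chunking in plain Python. It counts whitespace-separated words rather than tokens and ignores sentence boundaries, so treat it as an illustration of the idea and not as LlamaIndex's actual splitter. The function name chunk_words is hypothetical.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size words that share chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Ten words, windows of four words, one word shared between neighbours
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the tail of the previous one, which is why a sentence that straddles a chunk boundary can still be retrieved intact from at least one chunk.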
The most common LlamaIndex pattern is loading a PDF or directory of documents and building a Q&A system in under 10 lines. LlamaIndex's SimpleDirectoryReader handles PDF, Word, Markdown, HTML, and plain text files automatically.
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.core import load_index_from_storage
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # First run: load all documents from a directory (PDF, TXT, DOCX, MD...)
    documents = SimpleDirectoryReader("./documents").load_data()
    print(f"Loaded {len(documents)} documents")
    # Build vector index: this embeds and stores all chunks
    index = VectorStoreIndex.from_documents(documents, show_progress=True)
    # Persist to disk to avoid re-embedding on every run
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Subsequent runs: load the existing index from disk
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
# Different response modes
query_engine_compact = index.as_query_engine(response_mode="compact")
query_engine_detail = index.as_query_engine(response_mode="tree_summarize")
q = "What are the main topics covered in these documents?"
print(query_engine_compact.query(q)) # Concise answer
print(query_engine_detail.query(q)) # Detailed synthesized answer
LlamaHub, LlamaIndex's loader registry, provides 150+ data loaders for sources such as Notion, Confluence, Google Drive, GitHub, Slack, databases, REST APIs, and YouTube transcripts. Each loader returns a list of Document objects that feed directly into an index.
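import os
from llama_index.readers.web import SimpleWebPageReader
from llama_index.readers.github import GithubClient, GithubRepositoryReader
from llama_index.core import VectorStoreIndex
# Both readers ship as separate packages:
#   pip install llama-index-readers-web llama-index-readers-github
# Load web pages
web_loader = SimpleWebPageReader(html_to_text=True)
web_docs = web_loader.load_data(urls=["https://docs.llamaindex.ai/en/stable/"])
# Load GitHub repository (README, docs, etc.)
# The reader needs an authenticated client; the token is read from GITHUB_TOKEN here
github_client = GithubClient(github_token=os.environ["GITHUB_TOKEN"])
github_loader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    filter_file_extensions=([".md", ".rst"], GithubRepositoryReader.FilterType.INCLUDE),
    verbose=False,
    concurrent_requests=5,
)
github_docs = github_loader.load_data(branch="main")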
# Combine all sources into one index
all_docs = web_docs + github_docs
index = VectorStoreIndex.from_documents(all_docs)
engine = index.as_query_engine()
print(engine.query("How do I install LlamaIndex?"))
The default query engine works well, but production applications need fine-grained control over retrieval parameters, node postprocessors (reranking, keyword filtering, similarity cutoffs), and response synthesis strategy.
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import (
    SimilarityPostprocessor,
    KeywordNodePostprocessor,
    SentenceTransformerRerank,
)
from llama_index.core.response_synthesizers import get_response_synthesizer
# Step 1: Configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,  # Retrieve 10 candidates
)
# Step 2: Define postprocessors (they run in list order)
sim_filter = SimilarityPostprocessor(similarity_cutoff=0.7)  # Drop weak embedding matches
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",
    top_n=4,  # Keep top 4 after reranking
)
# Step 3: Configure response synthesis
synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",  # Hierarchical summarization for long answers
    verbose=True,
)
# Assemble the query engine
# The similarity cutoff runs first: the reranker overwrites each node's score with a
# cross-encoder score, which is not on the same scale as embedding similarity.
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
    node_postprocessors=[sim_filter, reranker],
)
response = query_engine.query("Summarize the architecture patterns used in this codebase.")
print(response)
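The two postprocessing ideas used here, a score cutoff and a top-n rerank, can be sketched in plain Python to show what happens to the candidate list. This is a toy under stated assumptions: ScoredChunk, apply_cutoff, and rerank are hypothetical names, and the reranker scores by word overlap with the query as a stand-in for a cross-encoder model.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    score: float  # retrieval similarity, higher is better


def apply_cutoff(chunks: list[ScoredChunk], cutoff: float) -> list[ScoredChunk]:
    """Drop chunks whose retrieval score is below the cutoff."""
    return [c for c in chunks if c.score >= cutoff]


def rerank(chunks: list[ScoredChunk], query: str, top_n: int) -> list[ScoredChunk]:
    """Re-order by query-word overlap (toy stand-in for a cross-encoder), keep top_n."""
    query_terms = set(query.lower().split())

    def overlap(chunk: ScoredChunk) -> int:
        return len(query_terms & set(chunk.text.lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:top_n]


candidates = [
    ScoredChunk("vector index stores embeddings", 0.82),
    ScoredChunk("the router picks a query engine", 0.75),
    ScoredChunk("unrelated release notes", 0.40),
    ScoredChunk("query engine combines retriever and synthesizer", 0.78),
]
query = "how does the query engine combine retriever and synthesizer"

kept = apply_cutoff(candidates, cutoff=0.7)  # removes the 0.40 chunk
final = rerank(kept, query, top_n=2)         # keeps the two best matches for the query
for chunk in final:
    print(chunk.text)
```

Retrieving more candidates than you need and then narrowing them down is the point of the pattern: the cheap vector search casts a wide net, and the more expensive reranker only sees a short list.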
tree_summarize mode recursively summarizes chunks in a tree structure — much better than compact for long documents where the answer spans many sections. Use it for summarization tasks; use compact for specific fact retrieval.
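The recursive idea behind tree-style summarization can be sketched as follows. This is a simplified illustration, not the library's implementation: the fixed fan_in grouping and the fake_summarize stand-in for an LLM call are both assumptions made so the example runs offline.

```python
from typing import Callable


def tree_summarize(
    chunks: list[str],
    summarize: Callable[[list[str]], str],
    fan_in: int = 2,
) -> str:
    """Summarize groups of fan_in texts, level by level, until one text remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = list(chunks)
    while True:
        # Each pass replaces every group of fan_in texts with one summary
        level = [summarize(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        if len(level) == 1:
            return level[0]


def fake_summarize(texts: list[str]) -> str:
    # Hypothetical stand-in for an LLM call: records which inputs were merged
    return "(" + " + ".join(texts) + ")"


result = tree_summarize(["A", "B", "C", "D"], fake_summarize)
print(result)
```

With four chunks and a fan-in of two, the sketch makes three summarize calls: two at the leaf level and one at the root. That extra cost is the trade-off for an answer that draws on every retrieved chunk.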
Metadata filters narrow retrieval to a subset of your knowledge base before semantic search runs. This is essential in multi-tenant systems (filter by user/org), date-sensitive applications (filter by recency), or domain-specific queries (filter by document type or department).
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator
# Build index with rich metadata
docs = [
    Document(text="Q4 revenue was $5.2M...", metadata={"dept": "finance", "year": 2026, "quarter": "Q4"}),
    Document(text="New product roadmap for 2027...", metadata={"dept": "product", "year": 2026}),
    Document(text="Q3 revenue was $4.8M...", metadata={"dept": "finance", "year": 2026, "quarter": "Q3"}),
]
index = VectorStoreIndex.from_documents(docs)
# Filter to finance department, 2026 only
filters = MetadataFilters(filters=[
    MetadataFilter(key="dept", value="finance", operator=FilterOperator.EQ),
    MetadataFilter(key="year", value=2026, operator=FilterOperator.EQ),
])
engine = index.as_query_engine(filters=filters, similarity_top_k=3)
response = engine.query("What was the revenue trend?")
print(response) # Only uses finance docs, ignores product roadmap
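Conceptually, equality filters like these act as a pre-filter on the candidate set. The sketch below reproduces that logic in plain Python on the same three records, assuming the filters are combined with AND; the matches helper is a hypothetical name, and a real vector store would apply the filter inside its own query path.

```python
from typing import Any


def matches(metadata: dict[str, Any], required: dict[str, Any]) -> bool:
    """True when every required key is present with an equal value (AND of EQ filters)."""
    return all(key in metadata and metadata[key] == value for key, value in required.items())


records = [
    {"text": "Q4 revenue was $5.2M...", "metadata": {"dept": "finance", "year": 2026, "quarter": "Q4"}},
    {"text": "New product roadmap for 2027...", "metadata": {"dept": "product", "year": 2026}},
    {"text": "Q3 revenue was $4.8M...", "metadata": {"dept": "finance", "year": 2026, "quarter": "Q3"}},
]

# Same constraints as the filters in the example: finance department, year 2026
required = {"dept": "finance", "year": 2026}
candidates = [r["text"] for r in records if matches(r["metadata"], required)]
print(candidates)
```

Only the surviving candidates are ranked by similarity afterwards, so a document that fails the filter cannot appear in the answer no matter how well it matches the query.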
When different query types need different retrieval strategies — some questions need keyword search, others need semantic search, and some need SQL — LlamaIndex's RouterQueryEngine uses an LLM to route each query to the most appropriate sub-engine automatically.
from llama_index.core import VectorStoreIndex
from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
# Build separate indices for different document collections
# (api_documents and blog_documents are assumed to be lists of Document objects loaded earlier)
api_docs_engine = VectorStoreIndex.from_documents(api_documents).as_query_engine()
blog_engine = VectorStoreIndex.from_documents(blog_documents).as_query_engine()
# Wrap as tools with clear descriptions — LLM uses these to route
tools = [
    QueryEngineTool.from_defaults(
        query_engine=api_docs_engine,
        description="Useful for questions about API reference, parameters, and technical specifications."
    ),
    QueryEngineTool.from_defaults(
        query_engine=blog_engine,
        description="Useful for conceptual questions, tutorials, best practices, and use cases."
    ),
]
# Router automatically picks the right engine per query
router = RouterQueryEngine(selector=LLMSingleSelector.from_defaults(), query_engine_tools=tools)
print(router.query("What parameters does the /embed endpoint accept?")) # → API docs
print(router.query("When should I use RAG vs fine-tuning?")) # → Blog
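Routing comes down to choosing one tool based on its description and then delegating the query to it. The sketch below shows that control flow without any LLM: it swaps the LLM selector for a simple word-overlap score, which is a deliberately crude stand-in, and the Tool, select_tool, and route names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]


def terms(text: str) -> set[str]:
    """Lowercased alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_tool(query: str, tools: list[Tool]) -> Tool:
    """Pick the tool whose description shares the most words with the query.

    Word overlap replaces the LLM's judgement here purely for illustration.
    """
    query_terms = terms(query)
    return max(tools, key=lambda tool: len(query_terms & terms(tool.description)))


def route(query: str, tools: list[Tool]) -> str:
    """Select a tool and delegate the query to it."""
    return select_tool(query, tools).run(query)


tools = [
    Tool("api_docs",
         "Useful for questions about API reference, parameters, and technical specifications.",
         lambda q: f"[api_docs] {q}"),
    Tool("blog",
         "Useful for conceptual questions, tutorials, best practices, and use cases.",
         lambda q: f"[blog] {q}"),
]

print(route("What parameters does the /embed endpoint accept?", tools))
print(route("When should I use RAG vs fine-tuning?", tools))
```

Even in this toy version the routing decision depends entirely on the description text, which is why specific, non-overlapping tool descriptions matter so much for the real LLM-based selector.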
LlamaIndex agents combine query engines as tools with an LLM reasoning loop, enabling multi-hop queries, tool chaining, and autonomous document exploration. The FunctionCallingAgent uses OpenAI/Anthropic function calling for reliable tool selection.
from llama_index.core.agent import FunctionCallingAgent
from llama_index.core.tools import FunctionTool, QueryEngineTool
from llama_index.llms.openai import OpenAI
# Convert query engine to tool
doc_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="document_search",
    description="Search the internal knowledge base for any technical information."
)
# Add a custom function tool
def get_current_date() -> str:
    """Returns today's date in ISO format."""
    from datetime import date
    return date.today().isoformat()
date_tool = FunctionTool.from_defaults(fn=get_current_date)
# Build agent
agent = FunctionCallingAgent.from_tools(
    tools=[doc_tool, date_tool],
    llm=OpenAI(model="gpt-4o", temperature=0),
    verbose=True,
    max_function_calls=5
)
response = agent.chat("What does our documentation say about authentication, and when was it last relevant?")
print(response)
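The reasoning loop behind a tool-calling agent can be sketched in a few lines. This is an assumption-laden toy rather than the framework's agent: run_agent and scripted_decide are hypothetical names, tools take no arguments, and the decide callable stands in for the LLM that would normally choose between calling a tool and answering.

```python
from datetime import date
from typing import Callable

Observation = tuple[str, str]  # (tool name, tool result)
Decision = tuple[str, str]     # ("call", tool name) or ("answer", final text)


def get_current_date() -> str:
    """Returns today's date in ISO format."""
    return date.today().isoformat()


def run_agent(
    question: str,
    tools: dict[str, Callable[[], str]],
    decide: Callable[[str, list[Observation]], Decision],
    max_function_calls: int = 5,
) -> str:
    """Ask the policy what to do, run the chosen tool, and feed the result back."""
    observations: list[Observation] = []
    for _ in range(max_function_calls):
        action, value = decide(question, observations)
        if action == "answer":
            return value
        observations.append((value, tools[value]()))  # value is a tool name here
    return "Stopped: reached max_function_calls without a final answer."


def scripted_decide(question: str, observations: list[Observation]) -> Decision:
    # Hypothetical stand-in for the LLM: call the date tool once, then answer
    if not observations:
        return ("call", "get_current_date")
    tool_name, result = observations[-1]
    return ("answer", f"{tool_name} returned {result}")


tools = {"get_current_date": get_current_date}
answer = run_agent("What is today's date?", tools, scripted_decide)
print(answer)
```

The call cap plays the same role as max_function_calls in the agent above: it bounds cost and prevents a model that never settles on an answer from looping indefinitely.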