LlamaIndex: Document QA and Knowledge Base Guide

LlamaIndex (formerly GPT Index) is a widely used framework for building document-centric AI applications: knowledge bases, document Q&A systems, multi-document research tools, and structured data query engines. Where LangChain is a general-purpose LLM framework, LlamaIndex focuses on ingesting, indexing, and querying data, with first-class support for PDFs, databases, APIs, and complex document structures.

This guide walks through LlamaIndex's core abstractions — documents, nodes, indices, query engines, and agents — with complete Python examples for each pattern. By the end you'll be able to build a production-grade document Q&A system that handles PDFs, applies metadata filtering, and supports multi-document synthesis.

Core Abstractions

Understanding LlamaIndex's object model is key to using it effectively. The framework has a clean hierarchy: Documents are raw text or files; Nodes are chunked pieces of documents with metadata; Indices organize nodes for fast retrieval (most commonly as a VectorStoreIndex); Retrievers fetch relevant nodes given a query; Response Synthesizers combine retrieved nodes into a final answer using an LLM; and Query Engines combine a retriever and synthesizer into a single queryable interface.

from llama_index.core import Settings, VectorStoreIndex, Document
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure global settings — applied to all operations
Settings.llm = OpenAI(model="gpt-4o", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.chunk_size = 512
Settings.chunk_overlap = 64

# Create documents manually
documents = [
    Document(text="LlamaIndex is a framework for building LLM applications over data.",
             metadata={"source": "docs", "category": "framework"}),
    Document(text="VectorStoreIndex embeds documents and enables semantic search.",
             metadata={"source": "docs", "category": "index"}),
]

# Build index — automatically chunks, embeds, and stores
index = VectorStoreIndex.from_documents(documents)

# Create query engine — the main interface for Q&A
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What is LlamaIndex used for?")
print(response)                    # Final synthesized answer
print(response.source_nodes[0].text)  # Source chunk used
Note: Set Settings.llm and Settings.embed_model explicitly at the top of your script. LlamaIndex applies these settings globally; if you leave them unset, it silently falls back to its built-in default models, which may not be the ones you intended.
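
To make the chunk_size and chunk_overlap settings above more concrete, here is a minimal standard-library sketch of overlapping chunking. It is an illustration only, not LlamaIndex's splitter: it counts whitespace-separated words rather than tokens and ignores sentence boundaries, and the function name chunk_words is hypothetical.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size words; neighbours share chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Ten toy "words" (w0..w9), windows of 4 words, 1 word shared between neighbours.
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.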

Basic Document Q&A

The most common LlamaIndex pattern is loading a PDF or directory of documents and building a Q&A system in under 10 lines. LlamaIndex's SimpleDirectoryReader handles PDF, Word, Markdown, HTML, and plain text files automatically.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.core import load_index_from_storage

# Load all documents from a directory (PDF, TXT, DOCX, MD...)
documents = SimpleDirectoryReader("./documents").load_data()
print(f"Loaded {len(documents)} documents")

# Build vector index — this embeds and stores all chunks
index = VectorStoreIndex.from_documents(documents, show_progress=True)

# Persist to disk — avoid re-embedding on every run
index.storage_context.persist(persist_dir="./storage")

# Load from disk on subsequent runs
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# Different response modes
query_engine_compact = index.as_query_engine(response_mode="compact")
query_engine_detail  = index.as_query_engine(response_mode="tree_summarize")

q = "What are the main topics covered in these documents?"
print(query_engine_compact.query(q))   # Concise answer
print(query_engine_detail.query(q))    # Detailed synthesized answer
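
The snippet above persists and reloads the index in the same run. In a real script you would typically build on the first run and load afterwards. The sketch below shows one way to structure that decision using only the standard library; load_or_build is a hypothetical helper, and the two callables stand in for the LlamaIndex calls.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(persist_dir: str, build: Callable[[], T], load: Callable[[], T]) -> T:
    """Call load() if persist_dir exists and is non-empty, otherwise call build()."""
    path = Path(persist_dir)
    if path.is_dir() and any(path.iterdir()):
        return load()
    return build()


# Demonstration with stand-in callables instead of real index objects.
with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp) / "storage"

    def build() -> str:
        storage.mkdir(parents=True, exist_ok=True)
        (storage / "index.json").write_text("{}")  # stands in for persist()
        return "built"

    def load() -> str:
        return "loaded"

    first_run = load_or_build(str(storage), build, load)   # directory missing, so build
    second_run = load_or_build(str(storage), build, load)  # directory populated, so load

print(first_run, second_run)
```

With the example above, build would wrap VectorStoreIndex.from_documents followed by index.storage_context.persist, and load would wrap StorageContext.from_defaults followed by load_index_from_storage.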

Loading Data from Multiple Sources

LlamaIndex's LlamaHub provides 150+ data loaders for sources such as Notion, Confluence, Google Drive, GitHub, Slack, databases, REST APIs, and YouTube transcripts. Each loader is installed as its own package (the two readers below come from llama-index-readers-web and llama-index-readers-github) and returns a list of Document objects that feed directly into an index.

import os

from llama_index.readers.web import SimpleWebPageReader
from llama_index.readers.github import GithubClient, GithubRepositoryReader
from llama_index.core import VectorStoreIndex

# Load web pages
web_loader = SimpleWebPageReader(html_to_text=True)
web_docs = web_loader.load_data(urls=["https://docs.llamaindex.ai/en/stable/"])

# Load GitHub repository (README, docs, etc.)
# The reader requires a GitHub client; the token is read from the GITHUB_TOKEN environment variable here.
github_client = GithubClient(github_token=os.environ["GITHUB_TOKEN"])
github_loader = GithubRepositoryReader(
    github_client=github_client,
    owner="run-llama",
    repo="llama_index",
    filter_file_extensions=([".md", ".rst"], GithubRepositoryReader.FilterType.INCLUDE),
    verbose=False,
    concurrent_requests=5,
)
github_docs = github_loader.load_data(branch="main")

# Combine all sources into one index
all_docs = web_docs + github_docs
index = VectorStoreIndex.from_documents(all_docs)
engine = index.as_query_engine()
print(engine.query("How do I install LlamaIndex?"))

Advanced Query Engine Configuration

The default query engine works well, but production applications need fine-grained control over retrieval parameters, node postprocessors (reranking, keyword filtering, similarity cutoffs), and response synthesis strategy.
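
As a rough mental model of how node postprocessors chain together, the following standard-library sketch applies a similarity cutoff and a toy reranker in sequence. Everything here is hypothetical (ScoredNode, similarity_cutoff, rerank, and the word-count scoring function are stand-ins, not LlamaIndex classes). It also shows why order matters: once a reranker has replaced the scores, a cutoff tuned for retrieval similarity no longer means the same thing.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredNode:
    text: str
    score: float  # retrieval similarity; a reranker may overwrite it


Postprocessor = Callable[[List[ScoredNode]], List[ScoredNode]]


def similarity_cutoff(cutoff: float) -> Postprocessor:
    """Drop nodes whose score is below cutoff."""
    return lambda nodes: [n for n in nodes if n.score >= cutoff]


def rerank(score_fn: Callable[[str], float], top_n: int) -> Postprocessor:
    """Re-score nodes with score_fn and keep the top_n highest."""
    def apply(nodes: List[ScoredNode]) -> List[ScoredNode]:
        rescored = [ScoredNode(n.text, score_fn(n.text)) for n in nodes]
        return sorted(rescored, key=lambda n: n.score, reverse=True)[:top_n]
    return apply


def run_pipeline(nodes: List[ScoredNode], postprocessors: List[Postprocessor]) -> List[ScoredNode]:
    """Apply each postprocessor to the output of the previous one."""
    for step in postprocessors:
        nodes = step(nodes)
    return nodes


candidates = [
    ScoredNode("auth tokens expire after one hour", 0.82),
    ScoredNode("tokens", 0.75),
    ScoredNode("release notes for version two", 0.40),
]
# Toy reranker: scores by word count, on a different scale than cosine similarity.
toy_reranker = rerank(score_fn=lambda text: float(len(text.split())), top_n=2)

kept = run_pipeline(candidates, [similarity_cutoff(0.7), toy_reranker])
reversed_order = run_pipeline(candidates, [toy_reranker, similarity_cutoff(0.7)])
print([n.text for n in kept])
print([n.text for n in reversed_order])
```

In the reversed order, the low-similarity node survives because the cutoff is compared against the reranker's scores rather than the original retrieval scores.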

from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import (
    SimilarityPostprocessor,
    KeywordNodePostprocessor,
    SentenceTransformerRerank
)
from llama_index.core.response_synthesizers import get_response_synthesizer

# Step 1: Configure retriever (reuses the `index` built in the earlier examples)
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,    # Retrieve 10 candidates
)

# Step 2: Apply postprocessors in order
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",
    top_n=4            # Keep top 4 after reranking
)
sim_filter = SimilarityPostprocessor(similarity_cutoff=0.7)

# Step 3: Configure response synthesis
synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",   # Hierarchical summarization for long answers
    verbose=True
)

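# Assemble the query engine
# Postprocessors run in list order. The similarity cutoff goes first because it
# applies to retrieval scores; the reranker then assigns its own relevance scores,
# which are on a different scale.
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
    node_postprocessors=[sim_filter, reranker]
)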

response = query_engine.query("Summarize the architecture patterns used in this codebase.")
print(response)
Note: tree_summarize mode recursively summarizes chunks in a tree structure, which suits long documents where the answer spans many sections. Use it for summarization tasks; use compact for specific fact retrieval.
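
The recursive idea behind tree_summarize can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LlamaIndex's implementation: a real synthesizer packs as much text as fits into the context window per LLM call, whereas this sketch groups a fixed number of texts (fan_in), and fake_llm_summarize is a hypothetical stand-in for the LLM.

```python
from typing import Callable, List


def tree_summarize(chunks: List[str], summarize: Callable[[List[str]], str], fan_in: int = 2) -> str:
    """Repeatedly summarize groups of fan_in texts until a single text remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2, otherwise the tree never shrinks")
    if not chunks:
        return ""
    level = list(chunks)
    while len(level) > 1:
        # One tree level: each group of fan_in texts becomes one summary.
        level = [summarize(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


calls: List[List[str]] = []


def fake_llm_summarize(texts: List[str]) -> str:
    """Stand-in for an LLM call: records its input and joins the texts visibly."""
    calls.append(list(texts))
    return "(" + " + ".join(texts) + ")"


result = tree_summarize(["A", "B", "C", "D", "E"], fake_llm_summarize, fan_in=2)
print(result)
print(len(calls), "summarize calls")
```

The nesting of parentheses in the output mirrors the tree: every chunk contributes to the final answer, but each call only sees a bounded amount of text.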

Metadata Filtering

Metadata filters narrow retrieval to a subset of your knowledge base before semantic search runs. This is essential in multi-tenant systems (filter by user/org), date-sensitive applications (filter by recency), or domain-specific queries (filter by document type or department).

from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator

# Build index with rich metadata
from llama_index.core import Document, VectorStoreIndex

docs = [
    Document(text="Q4 revenue was $5.2M...", metadata={"dept": "finance", "year": 2026, "quarter": "Q4"}),
    Document(text="New product roadmap for 2027...", metadata={"dept": "product", "year": 2026}),
    Document(text="Q3 revenue was $4.8M...", metadata={"dept": "finance", "year": 2026, "quarter": "Q3"}),
]
index = VectorStoreIndex.from_documents(docs)

# Filter to finance department, 2026 only
filters = MetadataFilters(filters=[
    MetadataFilter(key="dept", value="finance", operator=FilterOperator.EQ),
    MetadataFilter(key="year", value=2026, operator=FilterOperator.EQ),
])

engine = index.as_query_engine(filters=filters, similarity_top_k=3)
response = engine.query("What was the revenue trend?")
print(response)  # Only uses finance docs, ignores product roadmap
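
Conceptually, the filters above act as an AND of equality checks applied before ranking. The sketch below reproduces that behaviour with plain Python data. The similarity scores are made-up numbers for illustration, and matches and filtered_top_k are hypothetical helpers rather than LlamaIndex functions.

```python
from typing import Any, Dict, List, Tuple

Record = Tuple[str, Dict[str, Any], float]  # (text, metadata, similarity score)


def matches(metadata: Dict[str, Any], required: Dict[str, Any]) -> bool:
    """True when every required key is present with an equal value (AND of EQ filters)."""
    return all(key in metadata and metadata[key] == value for key, value in required.items())


def filtered_top_k(records: List[Record], required: Dict[str, Any], k: int) -> List[str]:
    """Filter by metadata first, then return the k highest-scoring texts."""
    eligible = [r for r in records if matches(r[1], required)]
    eligible.sort(key=lambda r: r[2], reverse=True)
    return [text for text, _, _ in eligible[:k]]


# Scores are invented; in a real system they come from embedding similarity.
records: List[Record] = [
    ("Q4 revenue was $5.2M...", {"dept": "finance", "year": 2026, "quarter": "Q4"}, 0.71),
    ("New product roadmap for 2027...", {"dept": "product", "year": 2026}, 0.93),
    ("Q3 revenue was $4.8M...", {"dept": "finance", "year": 2026, "quarter": "Q3"}, 0.68),
]
hits = filtered_top_k(records, {"dept": "finance", "year": 2026}, k=3)
print(hits)
```

Note that the product roadmap is excluded even though it has the highest score: filtering happens before ranking, so an out-of-scope document can never crowd out in-scope ones.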

Multi-Document Query with RouterQueryEngine

When different query types need different retrieval strategies — some questions need keyword search, others need semantic search, and some need SQL — LlamaIndex's RouterQueryEngine uses an LLM to route each query to the most appropriate sub-engine automatically.

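from llama_index.core import VectorStoreIndex
from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector

# Build separate indices for different document collections.
# api_documents and blog_documents are assumed to be lists of Document objects
# that you have already loaded (for example, with SimpleDirectoryReader).
api_docs_engine = VectorStoreIndex.from_documents(api_documents).as_query_engine()
blog_engine = VectorStoreIndex.from_documents(blog_documents).as_query_engine()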

# Wrap as tools with clear descriptions — LLM uses these to route
tools = [
    QueryEngineTool.from_defaults(
        query_engine=api_docs_engine,
        description="Useful for questions about API reference, parameters, and technical specifications."
    ),
    QueryEngineTool.from_defaults(
        query_engine=blog_engine,
        description="Useful for conceptual questions, tutorials, best practices, and use cases."
    ),
]

# Router automatically picks the right engine per query
router = RouterQueryEngine(selector=LLMSingleSelector.from_defaults(), query_engine_tools=tools)

print(router.query("What parameters does the /embed endpoint accept?"))  # → API docs
print(router.query("When should I use RAG vs fine-tuning?"))             # → Blog
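
The routing step itself is easy to picture with a toy selector. In the sketch below, a word-overlap heuristic stands in for the LLM that LLMSingleSelector would call; Tool, select_tool, and route are hypothetical names, and a real selector reasons over the descriptions rather than counting shared words.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]


def tokens(text: str) -> Set[str]:
    """Lowercase alphabetic words in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_tool(query: str, tools: List[Tool]) -> Tool:
    """Stand-in selector: pick the tool whose description shares the most words with the query."""
    query_words = tokens(query)
    return max(tools, key=lambda t: len(query_words & tokens(t.description)))


def route(query: str, tools: List[Tool]) -> str:
    """Send the query to the selected tool and return its answer."""
    return select_tool(query, tools).run(query)


tools = [
    Tool("api_docs",
         "Useful for questions about API reference, parameters, and technical specifications.",
         lambda q: f"[api_docs] {q}"),
    Tool("blog",
         "Useful for conceptual questions, tutorials, best practices, and use cases.",
         lambda q: f"[blog] {q}"),
]

print(route("What parameters does the /embed endpoint accept?", tools))
print(route("Which tutorials cover best practices?", tools))
```

Whatever the selector, the descriptions are the only signal it has, which is why specific, non-overlapping tool descriptions matter for routing quality.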

LlamaIndex Agents

LlamaIndex agents combine query engines as tools with an LLM reasoning loop, enabling multi-hop queries, tool chaining, and autonomous document exploration. The FunctionCallingAgent relies on the LLM's native function-calling API (for example, OpenAI's or Anthropic's) to select tools and fill in their arguments.

from llama_index.core.agent import FunctionCallingAgent
from llama_index.core.tools import FunctionTool, QueryEngineTool
from llama_index.llms.openai import OpenAI

# Convert query engine to tool
doc_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="document_search",
    description="Search the internal knowledge base for any technical information."
)

# Add a custom function tool
def get_current_date() -> str:
    """Returns today's date in ISO format."""
    from datetime import date
    return date.today().isoformat()

date_tool = FunctionTool.from_defaults(fn=get_current_date)

# Build agent
agent = FunctionCallingAgent.from_tools(
    tools=[doc_tool, date_tool],
    llm=OpenAI(model="gpt-4o", temperature=0),
    verbose=True,
    max_function_calls=5
)

response = agent.chat("What does our documentation say about authentication, and what is today's date?")
print(response)
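
The control flow behind an agent with a max_function_calls budget can be sketched without any LLM. In this minimal illustration, scripted_plan is a hypothetical stand-in for the model's tool-selection step, and run_agent is not a LlamaIndex function; a real agent would also feed the observations back to the LLM to compose a final answer.

```python
from datetime import date
from typing import Callable, Dict, List, Optional, Tuple

# A planner returns (tool_name, argument) to request a tool call, or None to stop.
Planner = Callable[[str, List[str]], Optional[Tuple[str, str]]]


def run_agent(question: str, tools: Dict[str, Callable[[str], str]],
              plan: Planner, max_function_calls: int = 5) -> List[str]:
    """Call tools chosen by plan until it stops or the call budget is used up."""
    observations: List[str] = []
    for _ in range(max_function_calls):
        decision = plan(question, observations)
        if decision is None:
            break  # the planner has enough information
        tool_name, argument = decision
        observations.append(f"{tool_name}: {tools[tool_name](argument)}")
    return observations


def scripted_plan(question: str, observations: List[str]) -> Optional[Tuple[str, str]]:
    """Stand-in for the LLM: search first, then ask for the date, then stop."""
    script = [("document_search", question), ("get_current_date", "")]
    return script[len(observations)] if len(observations) < len(script) else None


tools = {
    "document_search": lambda q: f"3 passages matched '{q}'",  # fake search result
    "get_current_date": lambda _: date.today().isoformat(),
}

trace = run_agent("authentication", tools, scripted_plan)
for line in trace:
    print(line)
```

The budget is a safety valve: if the planner never decides to stop, the loop still terminates after max_function_calls tool calls.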