Python LangChain: Build LLM Applications and Agents
LangChain is the most popular framework for building production LLM applications. It provides composable primitives for chaining prompts and models, retrieval-augmented generation (RAG) pipelines, agents that use tools, and conversation memory. This guide covers LangChain Expression Language (LCEL), RAG with Chroma and FAISS, tool-using agents, memory management, and streaming responses for FastAPI.
Table of Contents
Setup and LLM Configuration
pip install langchain langchain-openai langchain-anthropic langchain-community
pip install chromadb faiss-cpu tiktoken
import os
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
# OpenAI
llm = ChatOpenAI(
model="gpt-4o",
temperature=0,
api_key=os.environ["OPENAI_API_KEY"],
max_tokens=2048,
)
# Anthropic Claude
claude = ChatAnthropic(
model="claude-sonnet-4-6",
temperature=0,
api_key=os.environ["ANTHROPIC_API_KEY"],
)
# Simple invocation
from langchain_core.messages import HumanMessage, SystemMessage
response = llm.invoke([
SystemMessage(content="You are a concise Python expert."),
HumanMessage(content="What is the GIL?"),
])
print(response.content)
LangChain Expression Language (LCEL)
LCEL uses the | operator to compose chains declaratively. Every component is a Runnable with invoke, stream, batch, and async variants. LCEL chains auto-support streaming, parallel execution, and LangSmith tracing.
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
# Basic chain: prompt | llm | parser
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant. Answer in {language}."),
("human", "{question}"),
])
chain = prompt | llm | StrOutputParser()
answer = chain.invoke({"question": "What is asyncio?", "language": "English"})
# Parallel chains — run multiple branches simultaneously
parallel = RunnableParallel(
summary=ChatPromptTemplate.from_template("Summarize: {text}") | llm | StrOutputParser(),
keywords=ChatPromptTemplate.from_template("Extract keywords from: {text}") | llm | StrOutputParser(),
)
result = parallel.invoke({"text": "LangChain is a framework for LLM applications..."})
print(result["summary"])
print(result["keywords"])
# JSON output parsing
from pydantic import BaseModel
class CodeReview(BaseModel):
issues: list[str]
rating: int
suggestion: str
parser = JsonOutputParser(pydantic_object=CodeReview)
review_chain = (
ChatPromptTemplate.from_template(
"Review this Python code and respond in JSON:\n{code}\n{format_instructions}"
).partial(format_instructions=parser.get_format_instructions())
| llm
| parser
)
review = review_chain.invoke({"code": "x = [i for i in range(1000000)]"})
RAG: Retrieval-Augmented Generation
RAG grounds LLM responses in your own documents. The pipeline: chunk documents → embed them → store in a vector DB → at query time, retrieve the most similar chunks → include them in the prompt as context.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.runnables import RunnablePassthrough
# 1. Load documents
loader = PyPDFLoader("technical_docs.pdf")
docs = loader.load()
# Or load from URLs
web_loader = WebBaseLoader(["https://docs.python.org/3/library/asyncio.html"])
web_docs = web_loader.load()
# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ".", " "],
)
chunks = splitter.split_documents(docs)
# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
chunks,
embeddings,
persist_directory="./chroma_db",
)
# 4. Create retriever
retriever = vectorstore.as_retriever(
search_type="mmr", # maximal marginal relevance — diverse results
search_kwargs={"k": 5, "fetch_k": 20},
)
# 5. RAG chain
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
rag_prompt = ChatPromptTemplate.from_template("""Answer based only on the context:
Context:
{context}
Question: {question}
If the context doesn't contain the answer, say "I don't know".""")
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| rag_prompt
| llm
| StrOutputParser()
)
answer = rag_chain.invoke("How does asyncio handle concurrent tasks?")
Agents and Tools
Agents use LLMs to decide which tools to call, observe results, and iterate until they have an answer. LangChain's tool-calling agents work with any LLM that supports function/tool calling.
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.tools import tool
from langchain_community.tools import DuckDuckGoSearchRun
import requests
# Define custom tools with @tool decorator
@tool
def get_weather(city: str) -> str:
"""Get the current weather for a city."""
resp = requests.get(f"https://wttr.in/{city}?format=3", timeout=5)
return resp.text
@tool
def calculate(expression: str) -> str:
"""Evaluate a mathematical expression safely."""
try:
result = eval(expression, {"__builtins__": {}}, {
"abs": abs, "round": round, "min": min, "max": max,
})
return str(result)
except Exception as e:
return f"Error: {e}"
@tool
def search_web(query: str) -> str:
"""Search the web for current information."""
search = DuckDuckGoSearchRun()
return search.run(query)
tools = [get_weather, calculate, search_web]
# Create agent
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant with access to tools."),
("human", "{input}"),
("placeholder", "{agent_scratchpad}"),
])
agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=5)
result = executor.invoke({"input": "What is the weather in Mysore, India right now?"})
print(result["output"])
Conversation Memory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import RedisChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
# In-memory history per session
from langchain_core.chat_history import InMemoryChatMessageHistory
store = {}
def get_session_history(session_id: str) -> BaseChatMessageHistory:
if session_id not in store:
store[session_id] = InMemoryChatMessageHistory()
return store[session_id]
# Or Redis-backed for production (persists across restarts)
def get_redis_history(session_id: str) -> RedisChatMessageHistory:
return RedisChatMessageHistory(
session_id=session_id,
url=os.environ["REDIS_URL"],
ttl=3600, # expire after 1 hour
)
# Wrap chain with history
chat_prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful Python tutor."),
("placeholder", "{history}"),
("human", "{input}"),
])
chain = chat_prompt | llm | StrOutputParser()
with_history = RunnableWithMessageHistory(
chain,
get_session_history,
input_messages_key="input",
history_messages_key="history",
)
# Multi-turn conversation
config = {"configurable": {"session_id": "user_123"}}
r1 = with_history.invoke({"input": "What is a decorator?"}, config=config)
r2 = with_history.invoke({"input": "Can you give an example?"}, config=config)
r3 = with_history.invoke({"input": "How is that different from a wrapper?"}, config=config)
Streaming with FastAPI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain_core.messages import HumanMessage
import asyncio
app = FastAPI()
@app.post("/chat/stream")
async def chat_stream(question: str):
async def generate():
async for chunk in llm.astream([HumanMessage(content=question)]):
if chunk.content:
yield f"data: {chunk.content}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(generate(), media_type="text/event-stream")
@app.post("/rag/stream")
async def rag_stream(question: str):
async def generate():
async for chunk in rag_chain.astream(question):
yield f"data: {chunk}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(generate(), media_type="text/event-stream")
LangSmith Tracing
import os
# Enable LangSmith tracing — set env vars before importing LangChain
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"
# All chain invocations are now traced automatically
# View at: smith.langchain.com
# Manual tracing with callbacks
from langchain_core.callbacks import StdOutCallbackHandler
result = chain.invoke(
{"question": "What is LangChain?", "language": "English"},
config={"callbacks": [StdOutCallbackHandler()]},
)
Frequently Asked Questions
- LangChain vs LlamaIndex — which should I use for RAG?
- LlamaIndex (formerly GPT Index) is more focused on RAG and has better built-in support for complex retrieval strategies (hybrid search, recursive retrieval, query routing). LangChain is broader — better for agents, multi-LLM pipelines, and applications that mix RAG with other tasks. Many production apps use both.
- How do I reduce LLM costs with LangChain?
- Cache repeated LLM calls with
langchain_community.cache.SQLiteCacheor Redis cache. Use a cheaper model for routing/classification and an expensive model only for final generation. Implement semantic caching to cache similar (not just identical) queries. - How do I evaluate RAG quality?
- Use RAGAS (Retrieval Augmented Generation Assessment) for automated metrics: context precision, context recall, faithfulness, and answer relevance. Or use LangSmith's dataset and evaluation features to run your chain against a golden dataset.