Python LangChain: Build LLM Applications and Agents
LangChain is one of the most widely adopted Python frameworks for building applications powered by large language models. It provides composable building blocks (prompts, models, output parsers, retrievers, memory, and agents) that snap together via the LangChain Expression Language (LCEL). In 2026, LangChain v0.3 has stabilised the LCEL pipe syntax and introduced first-class streaming, making it suitable for both rapid prototyping and production RAG pipelines.
This guide covers LCEL chain composition, retrieval-augmented generation (RAG) with ChromaDB, ReAct agents with tool use, conversation memory, streaming responses, and structuring a production LLM application. See also the Python OpenAI API guide for direct API usage without a framework.
Installation and Setup
LangChain is split into focused packages. Install langchain for the core framework, langchain-openai for OpenAI integration, and langchain-community for community connectors including ChromaDB, Pinecone, and dozens of other vector stores.
pip install langchain langchain-openai langchain-community
pip install chromadb # Local vector store
pip install faiss-cpu # Alternative: Facebook AI Similarity Search
pip install tiktoken pypdf # Token counting and PDF loading
export OPENAI_API_KEY="sk-..."
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Basic invocation
response = llm.invoke([
    SystemMessage(content="You are a concise technical writer."),
    HumanMessage(content="Explain Docker in one sentence."),
])
print(response.content)
LCEL: Chain Composition
LangChain Expression Language (LCEL) uses the | pipe operator to compose Runnables: objects that expose invoke(), stream(), and batch() methods. Chains are built left to right, with the output of each step feeding into the next. Because every chain is itself a Runnable, it supports batch and async execution, streaming, and fallbacks without extra code, and the branches of a RunnableParallel run concurrently.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Simple chain: prompt | llm | parser
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Be concise."),
    ("human", "{question}"),
])
chain = prompt | llm | StrOutputParser()
print(chain.invoke({"question": "What is a Python generator?"}))
# Chain with JSON output
json_prompt = ChatPromptTemplate.from_template(
    "Extract entities from this text as JSON with keys 'people', 'places', 'orgs'.\n\nText: {text}"
)
json_chain = json_prompt | llm | JsonOutputParser()
result = json_chain.invoke({"text": "Elon Musk founded SpaceX in Hawthorne, California."})
print(result) # {'people': ['Elon Musk'], 'places': ['Hawthorne, California'], 'orgs': ['SpaceX']}
# Parallel chains — run multiple chains simultaneously
analysis_chain = RunnableParallel(
    summary=ChatPromptTemplate.from_template("Summarise in 20 words: {text}") | llm | StrOutputParser(),
    sentiment=ChatPromptTemplate.from_template("Sentiment (positive/negative/neutral) of: {text}") | llm | StrOutputParser(),
    keywords=ChatPromptTemplate.from_template("List 5 keywords from: {text}") | llm | StrOutputParser(),
)
results = analysis_chain.invoke({"text": "NumPy 2.0 brings a faster array API and improved type annotations."})
print(results["summary"])
print(results["sentiment"])
# Batch processing — parallel invocations
questions = [
    {"question": "What is REST?"},
    {"question": "What is GraphQL?"},
    {"question": "What is gRPC?"},
]
answers = chain.batch(questions, config={"max_concurrency": 3})
for q, a in zip(questions, answers):
    print(f"Q: {q['question']}\nA: {a}\n")
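To make the pipe syntax less magical, here is a minimal sketch of how an operator like | can compose steps. It is a toy illustration, not LangChain's implementation: the Step class and the fill_prompt, fake_llm, and parse stages are hypothetical stand-ins that only mimic the prompt | llm | parser shape.

```python
from typing import Any, Callable


class Step:
    """Toy stand-in for a Runnable: wraps a function and supports `|`."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def invoke(self, value: Any) -> Any:
        return self.fn(value)

    def batch(self, values: list[Any]) -> list[Any]:
        # Sequential here for simplicity; a real framework may run these concurrently.
        return [self.invoke(v) for v in values]

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stages mirroring prompt | llm | parser.
fill_prompt = Step(lambda d: f"Q: {d['question']}")
fake_llm = Step(lambda text: {"content": text.upper()})
parse = Step(lambda msg: msg["content"])

toy_chain = fill_prompt | fake_llm | parse
print(toy_chain.invoke({"question": "what is rest?"}))  # Q: WHAT IS REST?
```

Because the composed object is itself a Step, invoke() and batch() work on the whole chain exactly as they do on a single stage, which is the property LCEL relies on.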
RAG with Vector Stores
Retrieval-Augmented Generation grounds LLM responses in your own documents by embedding them into a vector store and retrieving relevant chunks at query time. The pipeline: load documents → split into chunks → embed → store → retrieve → generate. ChromaDB provides a local persistent vector store requiring zero infrastructure.
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# Step 1: Load documents
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
print(f"Loaded {len(documents)} pages")
# Step 2: Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
# Step 3: Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
# Step 4: Create retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance: favours diverse results
    search_kwargs={"k": 5, "fetch_k": 20},
)
# Step 5: RAG chain with LCEL
rag_prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the provided context.
If the answer is not in the context, say "I don't have information about that."
Context:
{context}
Question: {question}
""")
def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
answer = rag_chain.invoke("What are the key findings in the Q1 report?")
print(answer)
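The retriever above uses search_type="mmr" with k and fetch_k. As a rough sketch of what Maximal Marginal Relevance does, the pure-Python example below first keeps the fetch_k candidates closest to the query, then greedily picks k of them while penalising similarity to items already chosen. The function names, the lambda_mult weighting, and the 2-D toy vectors are illustrative assumptions, not the vector store's actual code.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k=2, fetch_k=4, lambda_mult=0.5):
    """Return indices of k candidates, balancing relevance against redundancy."""
    # Stage 1: keep the fetch_k candidates most similar to the query.
    ranked = sorted(range(len(candidates)),
                    key=lambda i: cosine(query, candidates[i]), reverse=True)
    pool = ranked[:fetch_k]
    selected: list[int] = []
    # Stage 2: greedily add the candidate with the best trade-off score.
    while pool and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


def unit(degrees: float) -> list[float]:
    """Toy 2-D 'embedding' pointing at the given angle."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]


query = unit(0)
# Candidates 0 and 1 are near-duplicates; candidate 2 is relevant but different.
candidates = [unit(10), unit(12), unit(-40), unit(90)]
print(mmr_select(query, candidates, lambda_mult=1.0))  # [0, 1]  pure relevance
print(mmr_select(query, candidates, lambda_mult=0.5))  # [0, 2]  diversity-aware
```

With pure relevance the two near-duplicate chunks both make the cut; with the diversity penalty the second slot goes to a different chunk, which is why MMR can give the LLM broader context for the same k.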
ReAct Agents with Tools
LangChain agents extend chains with decision-making: the LLM chooses which tool to call based on the query, executes it, observes the result, and iterates until a final answer is reached. The ReAct (Reason + Act) pattern interleaves reasoning steps with tool calls for transparency and debuggability.
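The loop described above can be sketched without any LLM at all. In the toy below, scripted_model is a hypothetical stand-in that plays the model's role, lookup_temp is an invented tool, and the temperatures are made-up values for illustration only. It shows the control flow (act, observe, repeat, stop at a final answer or an iteration cap), not LangChain's agent internals.

```python
from typing import Callable

# Hypothetical tool registry: plain functions keyed by name (values are invented).
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_temp": lambda city: {"London": "11", "Tokyo": "19"}.get(city, "unknown"),
}


def scripted_model(scratchpad: list[str]) -> str:
    """Stand-in for the LLM: picks the next step from what has been observed."""
    observations = [s for s in scratchpad if s.startswith("Observation:")]
    if len(observations) == 0:
        return "Action: lookup_temp[London]"
    if len(observations) == 1:
        return "Action: lookup_temp[Tokyo]"
    temps = [int(o.split(": ")[1]) for o in observations]
    return f"Final Answer: {abs(temps[0] - temps[1])} degrees"


def run_agent(model: Callable[[list[str]], str], max_iterations: int = 5) -> str:
    scratchpad: list[str] = []
    for _ in range(max_iterations):
        step = model(scratchpad)
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        # Parse "Action: tool[input]", run the tool, and record the observation.
        name, _, arg = step.removeprefix("Action: ").partition("[")
        scratchpad.append(step)
        scratchpad.append(f"Observation: {TOOLS[name](arg.rstrip(']'))}")
    return "Stopped: iteration limit reached"


print(run_agent(scripted_model))  # 8 degrees
```

The iteration cap plays the same role as max_iterations on an agent executor: it bounds cost when the model never converges on a final answer.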
from langchain_openai import ChatOpenAI
from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import tool
from langchain import hub
import requests, datetime
llm = ChatOpenAI(model="gpt-4o", temperature=0)
@tool
def get_weather(city: str) -> str:
    """Get current weather temperature for a city."""
    # Passing params lets requests URL-encode city names with spaces or accents.
    geo = requests.get(
        "https://geocoding-api.open-meteo.com/v1/search",
        params={"name": city, "count": 1},
        timeout=10,
    ).json()
    if not geo.get("results"):
        return f"City '{city}' not found"
    loc = geo["results"][0]
    w = requests.get(
        "https://api.open-meteo.com/v1/forecast",
        params={
            "latitude": loc["latitude"],
            "longitude": loc["longitude"],
            "current": "temperature_2m",
        },
        timeout=10,
    ).json()
    return f"{city}: {w['current']['temperature_2m']}°C"
@tool
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression. Example: '2 + 2 * 3'"""
    # Caution: eval() with empty builtins is not a real sandbox; do not expose
    # this tool to untrusted input without a stricter expression parser.
    try:
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {e}"
@tool
def get_date() -> str:
    """Get today's date."""
    return datetime.date.today().isoformat()
tools = [get_weather, calculate, get_date]
# Pull standard ReAct prompt from LangChain Hub
prompt = hub.pull("hwchase17/react")
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=5,
    handle_parsing_errors=True,
)
result = executor.invoke({
    "input": "What is the temperature difference between London and Tokyo today?"
})
print(result["output"])
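The calculate tool relies on eval() with empty builtins, which can still be escaped through attribute access on literals. One possible hardening, sketched below under the assumption that only basic arithmetic is needed, is to parse the expression with the standard ast module and allow only numeric literals and a fixed set of operators. The safe_eval name is hypothetical, and exponentiation is deliberately left out so a huge power cannot stall the process.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str):
    """Evaluate +, -, *, /, //, % on int and float literals only."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        # Names, calls, attribute access, strings, etc. all end up here.
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")
    return walk(ast.parse(expression, mode="eval"))


print(safe_eval("2 + 2 * 3"))  # 8
```

Inside the tool, the body of calculate could call safe_eval(expression) in place of eval and keep the same try/except wrapper, since rejected input raises ValueError and malformed input raises SyntaxError.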
Conversation Memory
Stateless LLM calls forget the conversation after each turn. LangChain's memory components maintain conversation history and inject it into the prompt. ConversationBufferWindowMemory keeps the last N exchanges; ConversationSummaryMemory compresses older history with a summarisation call to stay within context limits.
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationChain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
# Simple conversation chain with windowed memory.
# Note: ConversationChain and the langchain.memory classes are legacy APIs that
# recent releases mark as deprecated; the LCEL pattern further down replaces them.
memory = ConversationBufferWindowMemory(k=5, return_messages=True)
conversation = ConversationChain(llm=llm, memory=memory, verbose=False)
responses = []
for msg in ["Hi, I'm learning Python.", "What should I learn first?", "How long will it take?"]:
    resp = conversation.predict(input=msg)
    responses.append(resp)
    print(f"User: {msg}\nBot: {resp}\n")
# LCEL with per-session message history (production pattern)
store: dict[str, ChatMessageHistory] = {}
def get_session_history(session_id: str) -> ChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful Python tutor. Be encouraging and concise."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])
chain = prompt | llm
chain_with_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)
# Each session_id maintains independent conversation history
for msg in ["What is a list comprehension?", "Give me an example.", "And a dict comprehension?"]:
    resp = chain_with_history.invoke(
        {"input": msg},
        config={"configurable": {"session_id": "user-123"}},
    )
    print(f"User: {msg}\nBot: {resp.content}\n")
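The "keep the last N exchanges" behaviour mentioned at the start of this section is easy to picture with a bounded deque. The sketch below is a standalone illustration, not the library class: WindowMemory and its method names are hypothetical, and it stores plain (role, text) tuples rather than message objects.

```python
from collections import deque


class WindowMemory:
    """Keep only the most recent k (user, assistant) exchanges."""

    def __init__(self, k: int) -> None:
        self.exchanges: deque = deque(maxlen=k)

    def save(self, user: str, assistant: str) -> None:
        # Appending to a full deque silently drops the oldest exchange.
        self.exchanges.append((user, assistant))

    def as_messages(self) -> list[tuple[str, str]]:
        # Flatten into (role, text) pairs, ready to place ahead of the new input.
        messages = []
        for user, assistant in self.exchanges:
            messages.append(("human", user))
            messages.append(("ai", assistant))
        return messages


window = WindowMemory(k=2)
window.save("Hi, I'm learning Python.", "Great, welcome!")
window.save("What should I learn first?", "Start with variables and loops.")
window.save("How long will it take?", "A few weeks of regular practice.")
print(window.as_messages()[0])  # ('human', 'What should I learn first?')
```

The trade-off is visible in the output: the first exchange is gone, so the model no longer sees that the user is learning Python. Summarising older turns instead of dropping them is the alternative the section describes.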
Streaming Responses
Streaming returns tokens as they are generated, so the user sees the first token almost immediately instead of waiting several seconds for the complete response. All LCEL chains support streaming via .stream() and its async counterpart .astream(). For web applications, send the tokens as server-sent events (SSE) using FastAPI's StreamingResponse.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7, streaming=True)
prompt = ChatPromptTemplate.from_template("Write a short blog post about: {topic}")
chain = prompt | llm | StrOutputParser()
# Synchronous streaming — print tokens as they arrive
print("Streaming: ", end="", flush=True)
for chunk in chain.stream({"topic": "Python asyncio"}):
    print(chunk, end="", flush=True)
print()
# Async streaming
import asyncio
async def astream_example():
    async for chunk in chain.astream({"topic": "NumPy broadcasting"}):
        print(chunk, end="", flush=True)
asyncio.run(astream_example())
# FastAPI streaming endpoint
app = FastAPI()
@app.get("/stream")
async def stream_endpoint(topic: str):
    async def generate():
        async for chunk in chain.astream({"topic": topic}):
            # Assumes the chunk has no newline; multi-line text needs one "data:" line per line
            yield f"data: {chunk}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
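One detail the endpoint glosses over is that a model chunk may contain a newline, and in the SSE wire format a blank line ends the event. A small framing helper, sketched below with hypothetical names (sse_event, parse_sse), emits one data: line per line of text. The parser is a deliberately simplified client-side counterpart, included only to show that the text survives the round trip.

```python
def sse_event(chunk: str) -> str:
    """Frame one chunk as a server-sent event, one `data:` line per text line."""
    lines = chunk.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


def parse_sse(stream: str) -> list[str]:
    """Simplified client: split events on blank lines and rejoin data lines."""
    events = []
    for block in stream.split("\n\n"):
        if block:
            data_lines = [line[len("data: "):] for line in block.split("\n")]
            events.append("\n".join(data_lines))
    return events


chunks = ["Hello", " world", "line one\nline two"]
wire = "".join(sse_event(c) for c in chunks)
print(parse_sse(wire) == chunks)  # True
```

In the endpoint, the yield inside generate() could then become yield sse_event(chunk), leaving the rest of the handler unchanged.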
Production Patterns
Production LangChain applications need caching (to reduce API costs), fallback models (for resilience), and observability via LangSmith tracing. Cache identical prompts with InMemoryCache during development and Redis in production. Use .with_fallbacks() to automatically retry with a cheaper model on rate limits.
from langchain_openai import ChatOpenAI
from langchain_core.globals import set_llm_cache
from langchain_community.cache import InMemoryCache, RedisCache
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
# Enable caching — identical prompts return cached response
set_llm_cache(InMemoryCache()) # Dev
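# Production: RedisCache takes a Redis client object rather than a URL string
# from redis import Redis
# set_llm_cache(RedisCache(redis_=Redis.from_url("redis://localhost:6379")))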
# Model with fallback — try GPT-4o, fall back to GPT-3.5-turbo
primary = ChatOpenAI(model="gpt-4o", temperature=0)
fallback = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
llm_with_fallback = primary.with_fallbacks([fallback])
# Retry logic
llm_with_retry = primary.with_retry(
    retry_if_exception_type=(Exception,),
    wait_exponential_jitter=True,
    stop_after_attempt=3,
)
# LangSmith tracing (set env vars to enable)
import os
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = "ls-..."
prompt = ChatPromptTemplate.from_template("Answer briefly: {question}")
chain = prompt | llm_with_fallback | StrOutputParser()
# Tag runs for filtering in LangSmith dashboard
result = chain.invoke(
    {"question": "What is LangChain LCEL?"},
    config={"tags": ["production", "v1.0"], "metadata": {"user_id": "u-123"}},
)
print(result)
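To show what retry and fallback amount to conceptually, here is a framework-free sketch. Each model is retried with exponential backoff plus jitter, and the next model is tried only when the previous one has exhausted its attempts. The function names, delay values, and the two fake models are assumptions for illustration. Nesting retry inside fallback is one possible composition, not necessarily how the library wires them together.

```python
import random
import time
from typing import Callable, Optional, Sequence


def call_with_retry(fn: Callable[[str], str], prompt: str,
                    attempts: int = 3, base_delay: float = 0.01) -> str:
    """Retry fn with exponential backoff plus jitter; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn(prompt)
        except Exception:
            if attempt == attempts - 1:
                raise
            # Delay doubles each attempt; jitter avoids synchronised retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("attempts must be at least 1")


def call_with_fallbacks(models: Sequence[Callable[[str], str]], prompt: str) -> str:
    """Try each model in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return call_with_retry(model, prompt)
        except Exception as error:
            last_error = error
    raise RuntimeError("all models failed") from last_error


calls = {"primary": 0}


def flaky_primary(prompt: str) -> str:
    # Hypothetical primary model that is always rate limited.
    calls["primary"] += 1
    raise TimeoutError("rate limited")


def steady_fallback(prompt: str) -> str:
    return f"fallback answer to: {prompt}"


print(call_with_fallbacks([flaky_primary, steady_fallback], "What is LCEL?"))
print(calls["primary"])  # 3
```

Catching every Exception, as both this sketch and the retry configuration above do, also retries on non-transient errors such as invalid requests. Narrowing the exception types is usually worth the extra line.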
Use gpt-4o-mini for simple extraction and classification tasks: it costs roughly 15x less than gpt-4o and performs comparably on many structured-output tasks. Reserve full gpt-4o for complex reasoning chains.