Python LangChain: Build LLM Applications and Agents

LangChain is a widely used framework for building LLM applications. It provides composable primitives for chaining prompts and models, building retrieval-augmented generation (RAG) pipelines, running agents that use tools, and managing conversation memory. This guide covers LangChain Expression Language (LCEL), RAG with Chroma, tool-using agents, conversation memory, streaming responses with FastAPI, and LangSmith tracing.

Setup and LLM Configuration
LangChain Expression Language (LCEL)
RAG: Retrieval-Augmented Generation
Agents and Tools
Conversation Memory
Streaming with FastAPI
LangSmith Tracing
Frequently Asked Questions

Setup and LLM Configuration

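pip install langchain langchain-openai langchain-anthropic langchain-community
pip install chromadb faiss-cpu tiktoken
# Used by later sections: PDF/web loaders, Redis history, FastAPI streaming
pip install pypdf beautifulsoup4 redis fastapi uvicorn
# DuckDuckGoSearchRun also needs a DuckDuckGo client package
# (duckduckgo-search or ddgs, depending on your langchain-community version)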

import os
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

# OpenAI
llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    api_key=os.environ["OPENAI_API_KEY"],
    max_tokens=2048,
)

# Anthropic Claude
claude = ChatAnthropic(
    model="claude-sonnet-4-6",
    temperature=0,
    api_key=os.environ["ANTHROPIC_API_KEY"],
)

# Simple invocation
from langchain_core.messages import HumanMessage, SystemMessage

response = llm.invoke([
    SystemMessage(content="You are a concise Python expert."),
    HumanMessage(content="What is the GIL?"),
])
print(response.content)

LangChain Expression Language (LCEL)

LCEL uses the | operator to compose chains declaratively. Every component is a Runnable that exposes invoke, stream, and batch, plus the async counterparts ainvoke, astream, and abatch. A chain built with | is itself a Runnable, so it supports streaming, parallel execution, and LangSmith tracing without extra code.

from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

# Basic chain: prompt | llm | parser
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer in {language}."),
    ("human", "{question}"),
])

chain = prompt | llm | StrOutputParser()
answer = chain.invoke({"question": "What is asyncio?", "language": "English"})

# Parallel chains — run multiple branches simultaneously
parallel = RunnableParallel(
    summary=ChatPromptTemplate.from_template("Summarize: {text}") | llm | StrOutputParser(),
    keywords=ChatPromptTemplate.from_template("Extract keywords from: {text}") | llm | StrOutputParser(),
)
result = parallel.invoke({"text": "LangChain is a framework for LLM applications..."})
print(result["summary"])
print(result["keywords"])

# JSON output parsing
from pydantic import BaseModel

class CodeReview(BaseModel):
    issues: list[str]
    rating: int
    suggestion: str

parser = JsonOutputParser(pydantic_object=CodeReview)
review_chain = (
    ChatPromptTemplate.from_template(
        "Review this Python code and respond in JSON:\n{code}\n{format_instructions}"
    ).partial(format_instructions=parser.get_format_instructions())
    | llm
    | parser
)
review = review_chain.invoke({"code": "x = [i for i in range(1000000)]"})
# JsonOutputParser returns a plain dict (e.g. review["rating"]), not a CodeReview instance
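
To make the | composition described in this section concrete, here is a toy sketch of how a pipe operator can chain steps. The Step class and the three stages are hypothetical stand-ins written for illustration, not LangChain's Runnable implementation. The sketch only mirrors the idea that a | b builds a new object whose invoke feeds the output of a into b.

```python
from typing import Any, Callable


class Step:
    """Toy stand-in for a Runnable: wraps a function and supports the | operator."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def invoke(self, value: Any) -> Any:
        return self.func(value)

    def batch(self, values: list) -> list:
        # Run the same step over several inputs, preserving order.
        return [self.invoke(v) for v in values]

    def __or__(self, other: "Step") -> "Step":
        # a | b returns a new Step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stages mimicking prompt | llm | parser.
fill_prompt = Step(lambda d: f"Answer in {d['language']}: {d['question']}")
fake_llm = Step(lambda text: {"content": text.upper()})  # pretend model call
parse = Step(lambda msg: msg["content"])

toy_chain = fill_prompt | fake_llm | parse
toy_answer = toy_chain.invoke({"question": "What is asyncio?", "language": "English"})
print(toy_answer)
```

Because the composed object has the same interface as its parts, toy_chain.batch([...]) works without any extra code, which is the property LCEL relies on.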

RAG: Retrieval-Augmented Generation

RAG grounds LLM responses in your own documents. The pipeline: chunk documents → embed them → store in a vector DB → at query time, retrieve the most similar chunks → include them in the prompt as context.
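
The pipeline steps above can be sketched without any external service. In this minimal sketch, the word-count embed function is a deliberately crude stand-in for a real embedding model, the list named index stands in for a vector DB, and every name is hypothetical. It shows only the mechanics: embed chunks, rank them by similarity to the question, and paste the winners into the prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses a model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Pre-chunked toy documents.
toy_chunks = [
    "asyncio runs coroutines on an event loop.",
    "The GIL allows one thread to execute Python bytecode at a time.",
    "Chroma stores embeddings on disk.",
]

# The "vector store": each chunk paired with its embedding.
index = [(chunk, embed(chunk)) for chunk in toy_chunks]


def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


question = "What does the event loop do in asyncio?"
context = "\n\n".join(retrieve(question, k=1))
prompt_text = f"Context:\n{context}\n\nQuestion: {question}"
print(prompt_text)
```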

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.runnables import RunnablePassthrough

# 1. Load documents
loader = PyPDFLoader("technical_docs.pdf")
docs = loader.load()

# Or load from URLs
web_loader = WebBaseLoader(["https://docs.python.org/3/library/asyncio.html"])
web_docs = web_loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " "],
)
chunks = splitter.split_documents(docs)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./chroma_db",
)

# 4. Create retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",       # maximal marginal relevance — diverse results
    search_kwargs={"k": 5, "fetch_k": 20},
)

# 5. RAG chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_prompt = ChatPromptTemplate.from_template("""Answer based only on the context:

Context:
{context}

Question: {question}

If the context doesn't contain the answer, say "I don't know".""")

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("How does asyncio handle concurrent tasks?")
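
The retriever above uses search_type="mmr", fetching fetch_k candidates and keeping k of them. As a rough illustration of what maximal marginal relevance does, here is a small sketch over 2D vectors. The function names and the toy vectors are assumptions made for this example, and the scoring formula is the textbook one rather than a copy of any vector store's code: each pick maximizes relevance to the query minus similarity to what has already been selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(query, candidates, k, lambda_mult=0.5):
    """Return indices of k candidates, balancing relevance and diversity.

    lambda_mult=1.0 ranks by relevance only; lower values penalize
    candidates that resemble ones already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = (1.0, 0.0)
candidate_vecs = [
    (0.9, 0.44),   # 0: most relevant
    (0.9, 0.45),   # 1: near-duplicate of 0
    (0.9, -0.5),   # 2: slightly less relevant, but points elsewhere
]

picked = mmr(query_vec, candidate_vecs, k=2)
print(picked)
```

With the default weighting the near-duplicate is skipped in favor of the more distinct candidate. Setting lambda_mult=1.0 falls back to plain similarity ranking.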

Agents and Tools

Agents use LLMs to decide which tools to call, observe results, and iterate until they have an answer. LangChain's tool-calling agents work with any LLM that supports function/tool calling.
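
The decide, call, observe, repeat cycle described above can be sketched in plain Python. In this toy version a scripted function stands in for the LLM and the weather tool returns canned data. All names are hypothetical, and the loop is an illustration of the control flow, not how AgentExecutor is implemented.

```python
from typing import Callable


def toy_weather(city: str) -> str:
    return f"{city}: sunny, 25C"  # canned data, not a real API call


TOOLS: dict[str, Callable[[str], str]] = {"get_weather": toy_weather}


def scripted_model(question: str, observations: list[str]) -> dict:
    """Stand-in for an LLM: request a tool first, then answer from its result."""
    if not observations:
        return {"tool": "get_weather", "arg": "Mysore"}
    return {"final": f"Current conditions: {observations[-1]}"}


def run_agent(question: str, max_iterations: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_iterations):
        decision = scripted_model(question, observations)
        if "final" in decision:
            return decision["final"]
        # Execute the requested tool and feed the result back to the model.
        tool_fn = TOOLS[decision["tool"]]
        observations.append(tool_fn(decision["arg"]))
    # Safety valve: never loop forever if the model keeps requesting tools.
    return "Stopped: iteration limit reached."


agent_answer = run_agent("What is the weather in Mysore?")
print(agent_answer)
```

The iteration cap plays the same role as max_iterations in the executor configuration: it bounds cost when the model never settles on an answer.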

from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.tools import tool
from langchain_community.tools import DuckDuckGoSearchRun
import requests

# Define custom tools with @tool decorator
@tool
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    resp = requests.get(f"https://wttr.in/{city}?format=3", timeout=5)
    return resp.text

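@tool
def calculate(expression: str) -> str:
    """Evaluate a simple arithmetic expression."""
    # WARNING: removing __builtins__ does not make eval() a sandbox; attribute
    # access on literals can still reach arbitrary objects. Use only with
    # trusted input.
    try:
        result = eval(expression, {"__builtins__": {}}, {
            "abs": abs, "round": round, "min": min, "max": max,
        })
        return str(result)
    except Exception as e:
        return f"Error: {e}"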

@tool
def search_web(query: str) -> str:
    """Search the web for current information."""
    search = DuckDuckGoSearchRun()
    return search.run(query)

tools = [get_weather, calculate, search_web]

# Create agent
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant with access to tools."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=5)

result = executor.invoke({"input": "What is the weather in Mysore, India right now?"})
print(result["output"])
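
A calculator tool receives text written by the model, so it should be treated as untrusted input. Emptying __builtins__ does not sandbox eval, because attribute access on a literal can still reach other objects. One alternative is to parse the expression and allow only numeric literals and a fixed set of arithmetic operators. The sketch below assumes that approach; safe_eval and the operator tables are hypothetical names, and exponentiation is left out on purpose because it makes very large computations easy to trigger.

```python
import ast
import operator
from typing import Union

Number = Union[int, float]

# Whitelisted operators; any other syntax is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str) -> Number:
    """Evaluate +, -, *, /, % over numeric literals; raise ValueError otherwise."""

    def walk(node: ast.AST) -> Number:
        # type(...) in (int, float) deliberately excludes bool, str, and so on.
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"Unsupported syntax: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval").body)


print(safe_eval("2 + 3 * 4"))
```

Names, function calls, and attribute access all fall through to the ValueError branch, so an input such as ().__class__ is rejected instead of evaluated.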

Conversation Memory

from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import RedisChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

# In-memory history per session
from langchain_core.chat_history import InMemoryChatMessageHistory

store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

# Or Redis-backed for production (persists across restarts)
def get_redis_history(session_id: str) -> RedisChatMessageHistory:
    return RedisChatMessageHistory(
        session_id=session_id,
        url=os.environ["REDIS_URL"],
        ttl=3600,  # expire after 1 hour
    )

# Wrap chain with history
chat_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful Python tutor."),
    ("placeholder", "{history}"),
    ("human", "{input}"),
])

# Named chat_chain so it does not shadow the LCEL chain defined earlier
chat_chain = chat_prompt | llm | StrOutputParser()

with_history = RunnableWithMessageHistory(
    chat_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

# Multi-turn conversation
config = {"configurable": {"session_id": "user_123"}}
r1 = with_history.invoke({"input": "What is a decorator?"}, config=config)
r2 = with_history.invoke({"input": "Can you give an example?"}, config=config)
r3 = with_history.invoke({"input": "How is that different from a wrapper?"}, config=config)
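
As a rough mental model of what a history wrapper does around each call, here is a toy sketch: look up the message list for the session, pass it along with the new input, then store both sides of the turn. The names histories, fake_model, and invoke_with_history are hypothetical, and the model is a stub that only reports how much history it was shown.

```python
from collections import defaultdict

# session_id -> list of (role, text) messages
histories: dict[str, list[tuple[str, str]]] = defaultdict(list)


def fake_model(messages: list[tuple[str, str]]) -> str:
    """Stand-in for an LLM: reports how many prior messages it was shown."""
    return f"(saw {len(messages) - 1} earlier messages) reply to: {messages[-1][1]}"


def invoke_with_history(session_id: str, user_input: str) -> str:
    history = histories[session_id]
    reply = fake_model(history + [("human", user_input)])
    # Persist both sides of the turn so the next call can see them.
    history.append(("human", user_input))
    history.append(("ai", reply))
    return reply


first = invoke_with_history("user_123", "What is a decorator?")
second = invoke_with_history("user_123", "Can you give an example?")
other = invoke_with_history("user_456", "Hello")
print(second)
```

The second call in the same session sees the first exchange, while a different session_id starts from an empty history. Since the stored list grows with every turn, long conversations usually need trimming or summarizing before they are sent to the model.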

Streaming with FastAPI

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain_core.messages import HumanMessage

app = FastAPI()

@app.post("/chat/stream")
async def chat_stream(question: str):
    async def generate():
        async for chunk in llm.astream([HumanMessage(content=question)]):
            if chunk.content:
                yield f"data: {chunk.content}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

@app.post("/rag/stream")
async def rag_stream(question: str):
    async def generate():
        async for chunk in rag_chain.astream(question):
            yield f"data: {chunk}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
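
One caveat with the handlers above: in the server-sent events format a blank line ends an event, so a chunk that contains a newline would be split or truncated on the client. A small framing helper avoids this by emitting one data: line per line of text, which the client joins back with newlines. The helper name sse_event is hypothetical; this is a minimal sketch of the framing rule, not part of FastAPI or LangChain.

```python
def sse_event(text: str) -> str:
    """Frame text as a single SSE event, one data: line per line of text."""
    lines = text.split("\n")
    # Each line gets its own data: field; the trailing blank line ends the event.
    return "".join(f"data: {line}\n" for line in lines) + "\n"


print(repr(sse_event("hello")))
print(repr(sse_event("def f():\n    return 1")))
```

Inside a generator, yield sse_event(chunk) would replace the f-string that builds the data: line by hand.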

LangSmith Tracing

import os

# Enable LangSmith tracing — set env vars before importing LangChain
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"

# All chain invocations are now traced automatically
# View at: smith.langchain.com

# Manual tracing with callbacks
from langchain_core.callbacks import StdOutCallbackHandler

result = chain.invoke(
    {"question": "What is LangChain?", "language": "English"},
    config={"callbacks": [StdOutCallbackHandler()]},
)

LangChain vs direct API calls: For simple single-turn queries, calling the OpenAI or Anthropic API directly is simpler and has less overhead. Use LangChain when you need RAG, agents, memory, multi-step chains, or the ability to swap LLM providers without changing application code.

Frequently Asked Questions

LangChain vs LlamaIndex: which should I use for RAG?
LlamaIndex (formerly GPT Index) focuses on RAG and offers more built-in support for complex retrieval strategies such as hybrid search, recursive retrieval, and query routing. LangChain is broader and fits agents, multi-LLM pipelines, and applications that mix RAG with other tasks. The two can also be combined in one application.

How do I reduce LLM costs with LangChain?
Cache repeated LLM calls with langchain_community.cache.SQLiteCache or a Redis-backed cache. Use a cheaper model for routing and classification, and reserve the expensive model for final generation. Semantic caching goes further by reusing answers for similar queries, not just identical ones.

How do I evaluate RAG quality?
Use RAGAS (Retrieval Augmented Generation Assessment) for automated metrics: context precision, context recall, faithfulness, and answer relevance. Alternatively, use LangSmith's dataset and evaluation features to run your chain against a golden dataset.
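
The exact-match caching mentioned in the cost answer can be illustrated with a small standalone sketch. PromptCache and fake_llm_call are hypothetical names written for this example and do not reflect how SQLiteCache is implemented. The idea is simply to key each response by a hash of the model name and prompt, and to call the model only on a miss.

```python
import hashlib
import json
from typing import Callable


class PromptCache:
    """Exact-match cache: identical (model, prompt) pairs reuse the stored reply."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, call: Callable[[str], str]) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = call(prompt)
        return self._store[key]


calls: list[str] = []


def fake_llm_call(prompt: str) -> str:
    calls.append(prompt)  # record each "billable" call
    return prompt[::-1]   # placeholder response


cache = PromptCache()
a = cache.get_or_call("gpt-4o", "What is the GIL?", fake_llm_call)
b = cache.get_or_call("gpt-4o", "What is the GIL?", fake_llm_call)   # served from cache
c = cache.get_or_call("gpt-4o", "What is the GIL ?", fake_llm_call)  # differs by one space
print(cache.hits, cache.misses, len(calls))
```

The third call misses even though the question is effectively the same, which is the gap semantic caching is meant to close.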
