Claude is Anthropic's family of large language models, ranging from the fast and economical Claude Haiku to the highly capable Claude Opus. In 2026, Claude models are distinguished by their large context windows (up to 200K tokens), strong instruction-following, safety-conscious design, and prompt caching, which can cut the input cost of the cached portion by up to 90% for applications that repeatedly send the same system prompt or document context.
This guide covers everything you need to build production applications with the Anthropic Python SDK: basic messaging, streaming, vision, tool use, prompt caching, batch processing, and cost management strategies.
The Anthropic Python SDK is the official client library. Install it with pip, set your API key as an environment variable, and you're ready. The SDK handles retries, timeouts, and error parsing automatically.
# Install the Anthropic SDK
pip install anthropic
# Set API key (get from console.anthropic.com)
export ANTHROPIC_API_KEY="sk-ant-api03-..."
import anthropic
import os
# Client auto-reads ANTHROPIC_API_KEY from environment
client = anthropic.Anthropic()
# Or pass explicitly
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
# Available models as of June 2026
MODELS = {
    "fastest": "claude-haiku-4-5",    # Lowest cost, fastest
    "balanced": "claude-sonnet-4-5",  # Best price/performance
    "powerful": "claude-opus-4-5",    # Most capable
}
# Simple test
response = client.messages.create(
    model=MODELS["balanced"],
    max_tokens=100,
    messages=[{"role": "user", "content": "Say hello in 3 languages."}],
)
print(response.content[0].text)
The SDK reads the ANTHROPIC_API_KEY environment variable automatically when it is set, so the explicit api_key argument is optional.
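A missing key is usually easier to diagnose at startup than at the first request. The following is a minimal sketch of a fail-fast check. `require_api_key` is a hypothetical helper, not part of the SDK, and it accepts the environment as a mapping so it can be exercised without real credentials.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None,
                    name: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper: the SDK does not provide this function.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the app.")
    return key
```

Calling `require_api_key()` once during application startup turns a misconfigured deployment into an immediate, readable error.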
Claude's API uses a messages format with alternating user/assistant turns and an optional system prompt. Unlike OpenAI's API, the system prompt is a top-level parameter rather than a message role, which makes it cleaner to separate instructions from conversation history.
import anthropic
client = anthropic.Anthropic()
# Single-turn with system prompt
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system="""You are a senior Python developer. Write clean, idiomatic Python code
with type hints and docstrings. Prefer standard library over third-party packages.""",
    messages=[
        {"role": "user", "content": "Write a function to paginate a list."}
    ],
)
print(response.content[0].text)
print(f"\nTokens used: {response.usage.input_tokens} in, {response.usage.output_tokens} out")
# Multi-turn conversation
messages = [
    {"role": "user", "content": "What is dependency injection?"},
    {"role": "assistant", "content": "Dependency injection is a design pattern..."},
    {"role": "user", "content": "Show me a Python example."},
]
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=messages,
)
print(response.content[0].text)
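In a real chat application the `messages` list grows turn by turn, and a small helper can catch malformed histories before they reach the API. The sketch below enforces the alternating user/assistant pattern described above. `append_turn` is a hypothetical helper, and its strict alternation rule is an assumption taken from that description; the API itself may be more lenient.

```python
from typing import Dict, List

Message = Dict[str, str]


def append_turn(history: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history with one turn appended.

    Enforces strict user/assistant alternation (an assumption of this
    sketch) and never mutates the list that was passed in.
    """
    if role not in ("user", "assistant"):
        raise ValueError(f"unknown role: {role!r}")
    if not history and role != "user":
        raise ValueError("a conversation must start with a user turn")
    if history and history[-1]["role"] == role:
        raise ValueError(f"two consecutive {role!r} turns")
    return history + [{"role": role, "content": content}]


history = append_turn([], "user", "What is dependency injection?")
history = append_turn(history, "assistant", "Dependency injection is a design pattern...")
history = append_turn(history, "user", "Show me a Python example.")
```

The resulting `history` has the same shape as the `messages` list in the multi-turn example and can be passed to `client.messages.create` unchanged.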
Streaming sends tokens to your application as they are generated, rather than waiting for the full response. This dramatically improves perceived latency in chat applications — users see text appearing immediately instead of waiting 5–10 seconds for a complete response.
import anthropic
client = anthropic.Anthropic()
# Synchronous streaming
print("Streaming response: ", end="", flush=True)
with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=500,
    messages=[{"role": "user", "content": "Explain the CAP theorem in detail."}],
) as stream:
    for text_chunk in stream.text_stream:
        print(text_chunk, end="", flush=True)
    # Read the accumulated message while the stream context is still open
    final_message = stream.get_final_message()
print(f"\n\nTotal tokens: {final_message.usage.input_tokens + final_message.usage.output_tokens}")
# Async streaming for FastAPI / async applications
import asyncio
import anthropic
async def stream_response(user_message: str):
    async with anthropic.AsyncAnthropic() as async_client:
        async with async_client.messages.stream(
            model="claude-sonnet-4-5",
            max_tokens=500,
            messages=[{"role": "user", "content": user_message}],
        ) as stream:
            async for text in stream.text_stream:
                yield text  # Yield to FastAPI SSE / WebSocket
# FastAPI SSE endpoint example
# from fastapi.responses import StreamingResponse
# @app.get("/stream")
# async def chat_stream(q: str):
#     return StreamingResponse(stream_response(q), media_type="text/event-stream")
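The commented endpoint above yields raw text under the `text/event-stream` media type. Clients that follow the Server-Sent Events format (such as the browser `EventSource` API) expect each message framed as `data:` lines followed by a blank line. The sketch below shows one way to frame chunks, assuming that standard framing is what the client needs. `to_sse` is a hypothetical helper, not part of FastAPI or the Anthropic SDK.

```python
def to_sse(chunk: str, event: str = "") -> str:
    """Frame one text chunk as a Server-Sent Events message.

    Each line of the chunk gets its own "data:" field, because a raw
    newline inside a field would otherwise end the field early. A blank
    line terminates the message.
    """
    frame = "".join(f"data: {line}\n" for line in chunk.split("\n"))
    if event:
        frame = f"event: {event}\n" + frame
    return frame + "\n"
```

Inside `stream_response`, yielding `to_sse(text)` instead of `text` would produce framed events. A final `to_sse("", event="done")` is one possible way to signal completion, with the event name being an arbitrary choice.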
Claude's vision capability accepts images as base64-encoded data or public URLs. You can send screenshots, diagrams, charts, photos, or documents for analysis. Claude can describe images, extract text (OCR), compare multiple images, and answer questions about visual content.
import anthropic
import base64
from pathlib import Path
client = anthropic.Anthropic()
# Method 1: Base64 encoded image (local file)
image_data = base64.standard_b64encode(Path("chart.png").read_bytes()).decode("utf-8")
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "Analyze this chart. What are the key trends? What anomalies do you see?",
            },
        ],
    }],
)
print(response.content[0].text)
# Method 2: URL (public images)
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "url", "url": "https://example.com/diagram.png"}},
            {"type": "text", "text": "Extract all text from this diagram."},
        ],
    }],
)
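The base64 example hard-codes `image/png`, which is easy to get wrong once files of several types are involved. The sketch below builds the same image content block from a filename and raw bytes, choosing the media type from the file extension. `image_block` and `SUPPORTED_MEDIA_TYPES` are hypothetical names introduced here for illustration.

```python
import base64
from pathlib import PurePath

# Extension-to-media-type table for the formats this sketch accepts.
SUPPORTED_MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def image_block(filename: str, data: bytes) -> dict:
    """Build a base64 image content block, inferring media_type from the extension."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"unsupported image type: {suffix or filename!r}")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": SUPPORTED_MEDIA_TYPES[suffix],
            "data": base64.standard_b64encode(data).decode("utf-8"),
        },
    }
```

With this helper, the first entry of the `content` list becomes `image_block("chart.png", Path("chart.png").read_bytes())`. Inferring the type from the extension is a convenience, not a check of the file's actual contents.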
Prompt caching is Claude's most powerful cost optimization feature. When you mark parts of your prompt with "cache_control": {"type": "ephemeral"}, Anthropic caches that content server-side for 5 minutes. Subsequent requests that reuse the same cached prefix pay only 10% of the normal input token cost for the cached portion — a 90% discount.
The ideal use case is a long system prompt or large document that stays the same across many user queries. Cache the document once, then every query against it uses cached tokens at a fraction of the cost.
import anthropic
client = anthropic.Anthropic()
# Load a large document (e.g., a 50-page technical specification)
with open("technical_spec.txt") as f:
    large_document = f.read()  # ~50,000 tokens
def query_document(user_question: str) -> str:
    """Query the document. Cached tokens cost 90% less after the first call."""
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a technical expert. Answer questions about the provided specification accurately.",
            },
            {
                "type": "text",
                "text": large_document,
                "cache_control": {"type": "ephemeral"},  # Cache this for 5 minutes
            },
        ],
        messages=[{"role": "user", "content": user_question}],
    )
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache read tokens: {getattr(usage, 'cache_read_input_tokens', 0)}")  # These are cheap!
    print(f"Cache write tokens: {getattr(usage, 'cache_creation_input_tokens', 0)}")  # First time only
    return response.content[0].text
# First call: writes the cache (cache writes are billed at a premium over the base input rate)
answer1 = query_document("What are the authentication requirements in section 3?")
# Subsequent calls: reads from cache (90% cheaper for document tokens)
answer2 = query_document("What error codes are defined in the spec?")
answer3 = query_document("Summarize the rate limiting rules.")
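To see when caching pays off, it helps to work through the arithmetic. The sketch below compares the input cost of the document tokens with and without caching. The 10% read rate comes from the text above. The 1.25x write multiplier, the $3 per million tokens price, and the `caching_cost` helper itself are assumptions for illustration, and the model assumes every query arrives while the cache is still warm.

```python
def caching_cost(doc_tokens: int, n_queries: int, price_per_mtok: float,
                 write_mult: float = 1.25, read_mult: float = 0.10) -> tuple:
    """Return (uncached_cost, cached_cost) for the document tokens only.

    write_mult and price_per_mtok are illustrative assumptions; read_mult
    reflects the 10% cached-read rate. Assumes the cache never expires
    between queries.
    """
    if n_queries < 1:
        raise ValueError("n_queries must be at least 1")
    base = doc_tokens / 1_000_000 * price_per_mtok  # cost of one uncached pass
    uncached = base * n_queries
    cached = base * write_mult + base * read_mult * (n_queries - 1)
    return uncached, cached


uncached, cached = caching_cost(doc_tokens=50_000, n_queries=10, price_per_mtok=3.0)
print(f"without caching: ${uncached:.4f}, with caching: ${cached:.4f}")
```

Under these assumptions a single query costs slightly more with caching (because of the write premium), and the saving appears from the second query onward.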
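The Anthropic Message Batches API lets you submit a large set of requests in a single call (the documented limit at the time of writing is 100,000 requests or 256 MB per batch, whichever is reached first) and processes them asynchronously, with results available within 24 hours, at a 50% cost discount. This is ideal for bulk document processing, dataset annotation, evaluation pipelines, and any workload where real-time response is not required.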
import anthropic
import json
client = anthropic.Anthropic()
# Prepare a batch of requests
requests = [
    anthropic.types.message_create_params.MessageCreateParamsNonStreaming(
        model="claude-haiku-4-5",  # Use Haiku for bulk tasks: fastest and cheapest
        max_tokens=256,
        messages=[{"role": "user", "content": f"Classify this review as positive/negative/neutral: '{review}'"}],
    )
    for review in ["Great product!", "Terrible service.", "It was okay."]
]
# Wrap in MessageBatchRequestParam with custom_id for tracking
batch_requests = [
    {"custom_id": f"review-{i}", "params": req}
    for i, req in enumerate(requests)
]
# Submit batch
batch = client.messages.batches.create(requests=batch_requests)
print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")
# Poll for completion (or use webhook)
import time
while batch.processing_status == "in_progress":
    time.sleep(60)
    batch = client.messages.batches.retrieve(batch.id)
    print(f"Status: {batch.processing_status}")
# Retrieve results
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text}")
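When a workload is larger than a single batch allows, the request list has to be split before submission. The sketch below shows a generic chunking helper. `chunked` is a hypothetical name, and the batch size is left as a parameter that should be set from the currently documented per-batch limit.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Each chunk of `batch_requests` could then be passed to its own `client.messages.batches.create(requests=chunk)` call. Keeping `custom_id` values unique across all chunks makes it possible to merge the results afterwards.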
Claude API costs are based on input and output tokens. Here are the most impactful strategies to reduce your bill without sacrificing quality:
Use the right model tier. Claude Haiku costs several times less per token than Claude Opus (the exact ratio depends on the model generation, so check the current pricing page). Use Haiku for classification, extraction, and simple Q&A. Reserve Sonnet for complex reasoning and Opus for the most demanding tasks. Route requests by complexity.
Enable prompt caching. If your system prompt exceeds 1,024 tokens or you're repeatedly querying the same documents, prompt caching gives a 90% discount on cached tokens. This is the single highest-impact optimization for most applications.
Use the Batch API. 50% discount for non-real-time workloads. Ideal for evaluation, annotation, and bulk processing pipelines.
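Limit max_tokens sensibly. max_tokens is a hard cap on output length, and you are billed for the output tokens actually generated, not for the cap itself. A tight cap still matters because it bounds the worst-case cost and latency of a response that runs long: a 4096 limit on a task that only needs 200 tokens leaves you paying for any verbose or runaway output.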
Cache responses in your application. For repeated identical queries (e.g., FAQ answers), cache the LLM response in Redis with a TTL. The cheapest API call is one you don't make.
import anthropic
import hashlib
import redis
import json
client = anthropic.Anthropic()
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
def cached_claude_call(system: str, user: str, model: str = "claude-haiku-4-5",
                       max_tokens: int = 256, ttl: int = 3600) -> str:
    """Application-level response cache: avoids duplicate API calls entirely."""
    # Hash a JSON-encoded list so field boundaries are unambiguous, and include
    # max_tokens because a different cap can produce a differently truncated answer.
    key_material = json.dumps([model, max_tokens, system, user])
    cache_key = hashlib.sha256(key_material.encode()).hexdigest()
    cached = cache.get(cache_key)
    if cached is not None:
        return json.loads(cached)
    response = client.messages.create(
        model=model, max_tokens=max_tokens,
        system=system, messages=[{"role": "user", "content": user}],
    )
    result = response.content[0].text
    cache.setex(cache_key, ttl, json.dumps(result))
    return result
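The "route requests by complexity" advice above can be sketched as a small lookup function. The task categories, the `needs_deep_reasoning` flag, and the `pick_model` name are all hypothetical. Real routing rules depend on your own workload and should be validated against quality measurements.

```python
# Model IDs are the ones used throughout this guide; the tiers are illustrative.
MODEL_BY_TIER = {
    "simple": "claude-haiku-4-5",
    "complex": "claude-sonnet-4-5",
    "hardest": "claude-opus-4-5",
}

# Task types treated as simple in this sketch (an assumption, not a rule).
SIMPLE_TASKS = {"classification", "extraction", "faq"}


def pick_model(task_type: str, needs_deep_reasoning: bool = False) -> str:
    """Choose a model ID from a coarse task label."""
    if needs_deep_reasoning:
        return MODEL_BY_TIER["hardest"]
    if task_type in SIMPLE_TASKS:
        return MODEL_BY_TIER["simple"]
    return MODEL_BY_TIER["complex"]
```

The return value can be passed as the `model` argument of `cached_claude_call` or `client.messages.create`, so cheap tasks default to the cheapest tier and only explicitly flagged requests reach the most expensive one.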