Python Observability: OpenTelemetry Tracing and Metrics
OpenTelemetry (OTel) is the CNCF standard for observability instrumentation: a single vendor-neutral API for collecting traces, metrics, and logs. In Python, separate instrumentation packages cover popular libraries (FastAPI, SQLAlchemy, httpx, Redis) and need little or no change to application code beyond a one-time call at startup. Telemetry can be exported to Jaeger, Prometheus, Datadog, Grafana Tempo, or any OTLP-compatible backend. The core value proposition is instrumenting once and switching backends through configuration rather than code changes.
Installation and Setup
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-instrumentation-fastapi \
    opentelemetry-instrumentation-sqlalchemy \
    opentelemetry-instrumentation-httpx \
    opentelemetry-instrumentation-redis \
    opentelemetry-instrumentation-psycopg2 \
    opentelemetry-exporter-otlp-proto-grpc \
    opentelemetry-exporter-prometheus
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource
# Configure a tracer provider with service metadata
resource = Resource.create({
    "service.name": "order-service",
    "service.version": "2.1.0",
    "deployment.environment": "production",
})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-service", "2.1.0")
Auto-Instrumentation
OpenTelemetry provides instrumentors for most widely used Python libraries. Call them once at startup, and every request, database query, cache operation, and outbound HTTP call automatically generates spans with attributes such as SQL query text, HTTP status codes, and Redis commands.
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
def setup_auto_instrumentation():
    """Call once at application startup."""
    FastAPIInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument()
    HTTPXClientInstrumentor().instrument()
    RedisInstrumentor().instrument()
    Psycopg2Instrumentor().instrument()
# After calling this, every FastAPI request creates a trace with child spans for:
# - Incoming HTTP request (method, path, status)
# - SQL queries (query text, duration)
# - Redis operations (command, key)
# - Outbound HTTP calls via httpx (url, status, duration)
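Conceptually, an instrumentor works by wrapping a library's entry points so that every call is timed and recorded. The sketch below illustrates that idea with the standard library only. FakeClient, instrument, and the recorded dictionary format are hypothetical stand-ins, not OpenTelemetry's implementation.

```python
import functools
import time

recorded_spans = []  # hypothetical sink standing in for a span exporter


class FakeClient:
    """Stand-in for a third-party library class (hypothetical)."""

    def get(self, key: str) -> str:
        return f"value-for-{key}"


def instrument(cls, method_name: str) -> None:
    """Replace cls.method_name with a wrapper that records name, duration, and errors."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return original(self, *args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            # Runs on success and on failure, like ending a span
            recorded_spans.append({
                "name": f"{cls.__name__}.{method_name}",
                "duration_s": time.perf_counter() - start,
                "error": error,
            })

    setattr(cls, method_name, wrapper)


instrument(FakeClient, "get")         # done once at startup
result = FakeClient().get("user:42")  # call sites stay unchanged
```

The call site is untouched, which is why a single instrument() call at startup is enough to cover every later use of the library.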
Manual Spans and Attributes
Add custom spans for business-critical operations that don't map to a library call. Spans can carry attributes (key-value metadata), events (timestamped annotations), and status (OK, ERROR). They appear as nested operations in Jaeger and Datadog APM.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
tracer = trace.get_tracer(__name__)
async def process_order(order_id: str, user_id: int) -> dict:
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("user.id", user_id)
        span.set_attribute("order.source", "web")
        try:
            with tracer.start_as_current_span("validate_inventory") as child:
                items = await check_inventory(order_id)
                child.set_attribute("inventory.items_checked", len(items))
            with tracer.start_as_current_span("charge_payment") as child:
                charge_id = await charge_stripe(user_id, items)
                child.set_attribute("payment.charge_id", charge_id)
                child.add_event("payment_captured", {"amount": 99.99, "currency": "USD"})
            span.set_status(Status(StatusCode.OK))
            return {"order_id": order_id, "status": "confirmed"}
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise

async def check_inventory(order_id: str) -> list:
    return []  # stub

async def charge_stripe(user_id: int, items: list) -> str:
    return "ch_xyz"  # stub
Metrics: Counters, Histograms, Gauges
The OpenTelemetry metrics API provides counters (monotonically increasing), up-down counters (bidirectional), histograms (distributions), and observable gauges (sampled values). Metrics are exported to Prometheus or as OTLP metrics to Grafana Cloud or Datadog.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server
# Prometheus reader: collects metrics for scraping; start_http_server() serves them at /metrics
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter("order-service")
# Counter — monotonically increasing
orders_total = meter.create_counter(
    "orders.total",
    unit="1",
    description="Total number of orders processed",
)
# Histogram: tracks distribution (latency, sizes)
request_duration = meter.create_histogram(
    "http.request.duration",
    unit="ms",
    description="HTTP request duration in milliseconds",
)
# UpDownCounter: can increase or decrease (queue depth, active connections)
active_connections = meter.create_up_down_counter(
    "db.connections.active",
    unit="1",
    description="Number of active database connections",
)
# Record metrics
def record_order(order_type: str, amount: float):
    orders_total.add(1, attributes={"order.type": order_type, "env": "prod"})

def record_request(method: str, path: str, status: int, duration_ms: float):
    request_duration.record(
        duration_ms,
        attributes={"http.method": method, "http.route": path, "http.status_code": status},
    )

# Serve the /metrics scrape endpoint on port 9090. This is also the Prometheus
# server's own default port, so choose another if both run on the same host.
start_http_server(9090)
Exporters: OTLP, Jaeger, Prometheus
Configure exporters to ship traces and metrics to your observability backend. OTLP (OpenTelemetry Protocol) is the universal format — use it with the OTel Collector as a proxy for routing to multiple backends simultaneously.
from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
resource = Resource.create({"service.name": "order-service"})
# OTLP exporter: sends to the OpenTelemetry Collector, Grafana Tempo, or the Datadog Agent
otlp_span_exporter = OTLPSpanExporter(
    endpoint="http://otel-collector:4317",  # gRPC
    # For OTLP over HTTP, install opentelemetry-exporter-otlp-proto-http, import
    # OTLPSpanExporter from opentelemetry.exporter.otlp.proto.http.trace_exporter,
    # and use endpoint="http://otel-collector:4318/v1/traces"
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(otlp_span_exporter))
trace.set_tracer_provider(provider)  # register globally so trace.get_tracer() uses it
# Metrics via OTLP; the Collector can forward them to a Prometheus-compatible backend
otlp_metric_exporter = OTLPMetricExporter(endpoint="http://otel-collector:4317")
metric_reader = PeriodicExportingMetricReader(otlp_metric_exporter, export_interval_millis=60_000)
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[metric_reader]))
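# OTel Collector config (otel-collector-config.yaml).
# Recent Collector releases no longer include a dedicated Jaeger exporter;
# Jaeger accepts OTLP directly, so an OTLP exporter is pointed at it instead.
# The datadog exporter ships in the Collector "contrib" distribution.
# receivers:
#   otlp:
#     protocols: { grpc: {endpoint: 0.0.0.0:4317}, http: {endpoint: 0.0.0.0:4318} }
# exporters:
#   otlp/jaeger: { endpoint: jaeger:4317, tls: { insecure: true } }
#   prometheus: { endpoint: 0.0.0.0:8889 }
#   datadog: { api: { key: ${DD_API_KEY} } }
# service:
#   pipelines:
#     traces: { receivers: [otlp], exporters: [otlp/jaeger, datadog] }
#     metrics: { receivers: [otlp], exporters: [prometheus, datadog] }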
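Hard-coding the Collector address works for a demo, but deployments usually take it from the environment. The OTLP exporters themselves honor the standard OTEL_EXPORTER_OTLP_ENDPOINT variable when no endpoint argument is passed. The helper below is a hypothetical sketch that makes the same lookup explicit, preferring the traces-specific variable and falling back to the default local gRPC address.

```python
import os
from typing import Mapping, Optional

DEFAULT_OTLP_GRPC_ENDPOINT = "http://localhost:4317"


def resolve_otlp_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OTLP endpoint, preferring the traces-specific variable."""
    env = os.environ if env is None else env
    return (
        env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
        or env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
        or DEFAULT_OTLP_GRPC_ENDPOINT
    )


# Example lookups against explicit mappings rather than the real environment
from_generic = resolve_otlp_endpoint({"OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4317"})
from_specific = resolve_otlp_endpoint({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4317",
    "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": "http://tempo:4317",
})
from_default = resolve_otlp_endpoint({})
```

The result can be passed as the endpoint argument of OTLPSpanExporter, which keeps the backend address out of the code and in deployment configuration.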
FastAPI Full Setup
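A production-style FastAPI app with traces, metrics, auto-instrumentation, and OTLP export wired together. The providers are registered and the app is instrumented at import time, because instrument_app hooks into the app's middleware stack and that is safest to do before the application starts serving. The lifespan context manager is left to flush and shut down the providers on exit.
from contextlib import asynccontextmanager
from fastapi import FastAPI
from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
OTLP_ENDPOINT = "http://otel-collector:4317"
resource = Resource.create({"service.name": "api", "service.version": "1.0.0"})
# Tracing: batch spans and ship them over OTLP/gRPC
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=OTLP_ENDPOINT))
)
trace.set_tracer_provider(tracer_provider)
# Metrics: export periodically to the same OTLP endpoint
meter_provider = MeterProvider(
    resource=resource,
    metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter(endpoint=OTLP_ENDPOINT))],
)
metrics.set_meter_provider(meter_provider)
@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    # Flush buffered spans and metrics before the process exits
    tracer_provider.shutdown()
    meter_provider.shutdown()
app = FastAPI(lifespan=lifespan)
# Auto-instrument before the app starts handling requests
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
tracer = trace.get_tracer("api")
@app.get("/orders/{order_id}")
async def get_order(order_id: str):
    with tracer.start_as_current_span("db.fetch_order") as span:
        span.set_attribute("order.id", order_id)
        # DB call here
    return {"order_id": order_id, "status": "confirmed"}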
Context Propagation Across Services
Distributed tracing only works when trace context propagates across service boundaries via HTTP headers. OpenTelemetry's auto-instrumentation injects and extracts W3C TraceContext headers automatically for httpx and requests.
from fastapi import FastAPI, Request
from opentelemetry import context
from opentelemetry.propagate import inject, extract
import httpx

# Inject context into outbound requests (httpx auto-instrumentation does this for you)
async def call_downstream_service(url: str) -> dict:
    headers = {}
    inject(headers)  # Adds traceparent (and tracestate, when set) headers
    async with httpx.AsyncClient() as client:
        response = await client.get(url, headers=headers)
        return response.json()

# Extract context from inbound requests (FastAPI auto-instrumentation does this)
app = FastAPI()

@app.middleware("http")
async def propagate_trace(request: Request, call_next):
    ctx = extract(dict(request.headers))  # Extract parent context
    token = context.attach(ctx)
    try:
        response = await call_next(request)
        return response
    finally:
        context.detach(token)
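The header that inject writes and extract reads follows the W3C Trace Context format: a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and a flags byte, joined by hyphens. The sketch below builds and parses that header with the standard library to show what actually travels between services. format_traceparent and parse_traceparent are hypothetical helpers, not OpenTelemetry functions, and the parser only handles version 00.

```python
import re
from typing import NamedTuple, Optional

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class ParentContext(NamedTuple):
    trace_id: int
    span_id: int
    sampled: bool


def format_traceparent(trace_id: int, span_id: int, sampled: bool) -> str:
    """Serialize IDs as zero-padded lowercase hex with the sampled flag in bit 0."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(header: str) -> Optional[ParentContext]:
    """Return the parent context, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id = int(match.group(1), 16)
    span_id = int(match.group(2), 16)
    if trace_id == 0 or span_id == 0:  # all-zero IDs are invalid
        return None
    flags = int(match.group(3), 16)
    return ParentContext(trace_id, span_id, bool(flags & 0x01))


header = format_traceparent(
    trace_id=0x4BF92F3577B34DA6A3CE929D0E0E4736,
    span_id=0x00F067AA0BA902B7,
    sampled=True,
)
parent = parse_traceparent(header)
```

A downstream service that parses this header starts its spans under the same trace ID, which is what lets a backend stitch the hops into one trace.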
Frequently Asked Questions
- What is the difference between traces, metrics, and logs?
- Traces show the path of a single request through distributed systems (latency at each hop). Metrics are aggregated time-series numbers (request rate, error rate, p99 latency). Logs are discrete events with context. OTel aims to unify all three under one SDK and correlate them via trace IDs.
- Does OTel add significant overhead?
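- Usually small. Spans are batched and exported asynchronously, off the request path, and high-traffic systems typically sample only 1–10% of traces. Head-based sampling makes the keep-or-drop decision in the SDK when a trace starts; the ratio can be set through the OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG environment variables, so tuning it does not require touching application code. Tail-based sampling runs in the Collector after a trace completes, which lets it keep every trace that contains an error regardless of the base rate.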
- OpenTelemetry vs Datadog APM agent?
- OTel is vendor-neutral — you can switch backends without code changes. The Datadog agent is proprietary but has deeper Datadog-specific integrations. Many teams use OTel to instrument and the Datadog exporter to ship, getting the best of both.
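Head-based ratio sampling, mentioned in the overhead answer above, is usually deterministic: the decision is derived from the trace ID, so every service that sees the same trace makes the same choice. The function below is a simplified standard-library sketch of that idea, not the SDK's code. In a real service the equivalent is configured with the SDK's ParentBased and TraceIdRatioBased samplers.

```python
import random

TRACE_ID_LIMIT = 1 << 64  # decisions use the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when its low 64 bits fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound


# Simulate 100,000 random trace IDs at a 10% sampling ratio
rng = random.Random(7)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(t, 0.10) for t in trace_ids)
kept_fraction = kept / len(trace_ids)
```

Because the outcome depends only on the trace ID and the ratio, no coordination between services is needed for them to agree on which traces to keep.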