Python Observability: OpenTelemetry Tracing and Metrics

OpenTelemetry (OTel) is the CNCF standard for observability instrumentation: a single vendor-neutral API for collecting traces, metrics, and logs. Instrumentation packages for popular Python libraries (FastAPI, SQLAlchemy, httpx, Redis) generate telemetry with little or no change to application code, and the SDK exports to Jaeger, Prometheus, Datadog, Grafana Tempo, or any OTLP-compatible backend. Instrumenting once and switching backends through configuration rather than code changes is the core value proposition.

Installation and Setup

pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-instrumentation-fastapi \
    opentelemetry-instrumentation-sqlalchemy \
    opentelemetry-instrumentation-httpx \
    opentelemetry-instrumentation-redis \
    opentelemetry-instrumentation-psycopg2 \
    opentelemetry-exporter-otlp-proto-grpc \
    opentelemetry-exporter-prometheus

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource

# Configure a tracer provider with service metadata
resource = Resource.create({
    "service.name": "order-service",
    "service.version": "2.1.0",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service", "2.1.0")
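
BatchSpanProcessor buffers finished spans and hands them to the exporter in groups rather than one at a time. The following stdlib-only sketch illustrates that idea. `ToyBatcher` and its parameters are hypothetical names for illustration, not the SDK's implementation, which also flushes on a timer from a background thread.

```python
from typing import Callable, List


class ToyBatcher:
    """Buffers items and exports them in batches (illustrative, not the OTel SDK)."""

    def __init__(self, export: Callable[[List[str]], None], max_batch_size: int = 3):
        self._export = export
        self._max_batch_size = max_batch_size
        self._buffer: List[str] = []

    def on_end(self, span_name: str) -> None:
        # Called when a span finishes: queue it, export once the batch is full
        self._buffer.append(span_name)
        if len(self._buffer) >= self._max_batch_size:
            self.force_flush()

    def force_flush(self) -> None:
        # Export whatever is buffered, e.g. at shutdown
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._export(batch)


exported_batches: List[List[str]] = []
batcher = ToyBatcher(exported_batches.append, max_batch_size=3)
for name in ["a", "b", "c", "d"]:
    batcher.on_end(name)
batcher.force_flush()  # flush the partial batch left over
```

The trade-off is the same in the real processor: batching keeps export work off the request path, but a process that is killed before the buffer is flushed loses whatever spans were still queued.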

Auto-Instrumentation

OpenTelemetry provides instrumentors for most widely used Python libraries. Call them once at startup, before the application, engines, and clients are created, and every request, database query, cache operation, and outbound HTTP call generates spans with attributes such as SQL query text, HTTP status codes, and Redis commands.

from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor

def setup_auto_instrumentation():
    """Call once at startup, before the app, engines, and clients are created."""
    FastAPIInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument()
    HTTPXClientInstrumentor().instrument()
    RedisInstrumentor().instrument()
    # If SQLAlchemy runs on psycopg2, enabling both can record each query twice
    Psycopg2Instrumentor().instrument()

# After calling this, every FastAPI request creates a trace with child spans for:
# - Incoming HTTP request (method, path, status)
# - SQL queries (query text, duration)
# - Redis operations (command, key)
# - Outbound HTTP calls via httpx (url, status, duration)

Manual Spans and Attributes

Add custom spans for business-critical operations that don't map to a library call. Spans can carry attributes (key-value metadata), events (timestamped annotations), and a status (UNSET by default, or OK or ERROR when set explicitly). A span opened inside another span becomes its child, so the pair appears as nested operations in Jaeger and Datadog APM.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)


async def process_order(order_id: str, user_id: int) -> dict:
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("user.id", user_id)
        span.set_attribute("order.source", "web")

        try:
            with tracer.start_as_current_span("validate_inventory") as child:
                items = await check_inventory(order_id)
                child.set_attribute("inventory.items_checked", len(items))

            with tracer.start_as_current_span("charge_payment") as child:
                charge_id = await charge_stripe(user_id, items)
                child.set_attribute("payment.charge_id", charge_id)
                child.add_event("payment_captured", {"amount": 99.99, "currency": "USD"})

            span.set_status(Status(StatusCode.OK))
            return {"order_id": order_id, "status": "confirmed"}

        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise


async def check_inventory(order_id: str) -> list:
    return []  # stub

async def charge_stripe(user_id: int, items: list) -> str:
    return "ch_xyz"  # stub

Metrics: Counters, Histograms, Gauges

The OpenTelemetry metrics API provides counters (monotonically increasing), up-down counters (bidirectional), histograms (distributions), and observable gauges (sampled values). Metrics are exported to Prometheus or as OTLP metrics to Grafana Cloud or Datadog.

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server

# Prometheus reader: registers OTel metrics with prometheus_client's default registry
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter("order-service")

# Counter — monotonically increasing
orders_total = meter.create_counter(
    "orders.total",
    unit="1",
    description="Total number of orders processed",
)

# Histogram — tracks distribution (latency, sizes)
request_duration = meter.create_histogram(
    "http.request.duration",
    unit="ms",
    description="HTTP request duration in milliseconds",
)

# UpDownCounter — can increase or decrease (queue depth, active connections)
active_connections = meter.create_up_down_counter(
    "db.connections.active",
    unit="1",
    description="Number of active database connections",
)

# Record metrics
def record_order(order_type: str, amount: float):
    orders_total.add(1, attributes={"order.type": order_type, "env": "prod"})

def record_request(method: str, path: str, status: int, duration_ms: float):
    request_duration.record(
        duration_ms,
        attributes={"http.method": method, "http.route": path, "http.status_code": status},
    )

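# Serve /metrics for Prometheus to scrape. Port 9090 is also the Prometheus server's
# own default, so pick another port if both run on the same host.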
start_http_server(9090)
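
A histogram does not store every recorded value. It counts how many values fall into each bucket, which is what lets a backend estimate percentiles cheaply. A minimal stdlib sketch of explicit-bucket aggregation follows. `ToyHistogram` and the boundaries chosen here are illustrative assumptions, not the SDK's implementation or its defaults.

```python
from bisect import bisect_left
from typing import List, Sequence


class ToyHistogram:
    """Explicit-bucket histogram with upper-inclusive boundaries (illustrative)."""

    def __init__(self, boundaries: Sequence[float]):
        self.boundaries = list(boundaries)
        # One bucket per boundary, plus an overflow bucket for larger values
        self.bucket_counts: List[int] = [0] * (len(self.boundaries) + 1)
        self.count = 0
        self.total = 0.0

    def record(self, value: float) -> None:
        # bisect_left places a value equal to a boundary in that boundary's bucket
        self.bucket_counts[bisect_left(self.boundaries, value)] += 1
        self.count += 1
        self.total += value


latency_ms = ToyHistogram(boundaries=[10, 50, 100, 500])
for duration in [4, 10, 37, 120, 480, 2300]:
    latency_ms.record(duration)
```

Percentile accuracy depends on the boundaries: two of the six values above land in the 100 to 500 ms bucket, and the histogram can no longer tell them apart.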

Exporters: OTLP, Jaeger, Prometheus

Configure exporters to ship traces and metrics to your observability backend. OTLP (OpenTelemetry Protocol) is the universal format — use it with the OTel Collector as a proxy for routing to multiple backends simultaneously.

from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader


# OTLP exporter: sends to OpenTelemetry Collector, Grafana Tempo, Datadog Agent
otlp_span_exporter = OTLPSpanExporter(
    endpoint="http://otel-collector:4317",  # gRPC
    # For OTLP over HTTP, install opentelemetry-exporter-otlp-proto-http and import
    # OTLPSpanExporter from opentelemetry.exporter.otlp.proto.http.trace_exporter,
    # with endpoint="http://otel-collector:4318/v1/traces"
)
# `resource` is the Resource built in Installation and Setup
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(otlp_span_exporter))
trace.set_tracer_provider(provider)  # register globally so get_tracer() uses it

# Metrics via OTLP; the Collector forwards them to Prometheus or another backend
otlp_metric_exporter = OTLPMetricExporter(endpoint="http://otel-collector:4317")
metric_reader = PeriodicExportingMetricReader(otlp_metric_exporter, export_interval_millis=60_000)
# Register the provider; constructing it without registering it exports nothing
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[metric_reader]))

# OTel Collector config (otel-collector-config.yaml):
# receivers:
#   otlp:
#     protocols: { grpc: {endpoint: 0.0.0.0:4317}, http: {endpoint: 0.0.0.0:4318} }
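# exporters:
#   # Jaeger ingests OTLP natively; the dedicated jaeger exporter was removed
#   # from recent Collector releases
#   otlp/jaeger: { endpoint: jaeger:4317, tls: { insecure: true } }
#   prometheus: { endpoint: 0.0.0.0:8889 }
#   datadog: { api: { key: ${DD_API_KEY} } }
# service:
#   pipelines:
#     traces: { receivers: [otlp], exporters: [otlp/jaeger, datadog] }
#     metrics: { receivers: [otlp], exporters: [prometheus, datadog] }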

FastAPI Full Setup

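A production FastAPI app with full OTel setup: traces, metrics, auto-instrumentation, and OTLP export. Providers are configured at import time and the app is instrumented right after it is created; the lifespan context manager flushes and shuts the providers down on exit. Calling instrument_app inside lifespan is too late, because Starlette builds its middleware stack before the lifespan handler runs.

from contextlib import asynccontextmanager
from fastapi import FastAPI, Request
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

resource = Resource.create({"service.name": "api", "service.version": "1.0.0"})

# Setup tracing
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(tracer_provider)

# Setup metrics
meter_provider = MeterProvider(
    resource=resource,
    metric_readers=[
        PeriodicExportingMetricReader(OTLPMetricExporter(endpoint="http://otel-collector:4317"))
    ],
)
metrics.set_meter_provider(meter_provider)


@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    # Flush buffered spans and metrics before the process exits
    tracer_provider.shutdown()
    meter_provider.shutdown()


app = FastAPI(lifespan=lifespan)

# Auto-instrument once the app exists, before it serves traffic
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()

tracer = trace.get_tracer("api")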


@app.get("/orders/{order_id}")
async def get_order(order_id: str, request: Request):
    with tracer.start_as_current_span("db.fetch_order") as span:
        span.set_attribute("order.id", order_id)
        # DB call here
        return {"order_id": order_id, "status": "confirmed"}

Context Propagation Across Services

Distributed tracing only works when trace context propagates across service boundaries via HTTP headers. OpenTelemetry's auto-instrumentation injects and extracts W3C TraceContext headers automatically for httpx and requests.

from opentelemetry.propagate import inject, extract
from opentelemetry import trace, context
import httpx


# Inject context into outbound requests (httpx auto-instrumentation does this for you)
async def call_downstream_service(url: str) -> dict:
    headers = {}
    inject(headers)  # Adds traceparent, tracestate headers
    async with httpx.AsyncClient() as client:
        response = await client.get(url, headers=headers)
        return response.json()


# Extract context from inbound requests. FastAPI auto-instrumentation already does
# this, so add this middleware only when the FastAPI instrumentor is not enabled.
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def propagate_trace(request: Request, call_next):
    ctx = extract(dict(request.headers))  # Extract parent context
    token = context.attach(ctx)
    try:
        response = await call_next(request)
        return response
    finally:
        context.detach(token)
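
The `traceparent` header that `inject` writes follows the W3C Trace Context format `version-trace_id-parent_id-flags`: a 32-hex-digit trace ID, a 16-hex-digit span ID, and a flags byte whose lowest bit marks the trace as sampled. A stdlib-only sketch of building and parsing it follows. The helper names are hypothetical, and in real code the propagator handles this for you.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class ParentContext(NamedTuple):
    trace_id: int
    span_id: int
    sampled: bool


def format_traceparent(trace_id: int, span_id: int, sampled: bool) -> str:
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(header: str) -> Optional[ParentContext]:
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_hex, span_hex, flags_hex = match.groups()
    trace_id, span_id = int(trace_hex, 16), int(span_hex, 16)
    # Version ff and all-zero IDs are invalid per the spec
    if version == "ff" or trace_id == 0 or span_id == 0:
        return None
    return ParentContext(trace_id, span_id, bool(int(flags_hex, 16) & 0x01))


header = format_traceparent(
    trace_id=0x4BF92F3577B34DA6A3CE929D0E0E4736,
    span_id=0x00F067AA0BA902B7,
    sampled=True,
)
parent = parse_traceparent(header)
```

A downstream service that receives this header starts its own spans with the same trace ID and uses the received span ID as the parent, which is what stitches the two services into one trace.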

Frequently Asked Questions

What is the difference between traces, metrics, and logs?
Traces show the path of a single request through distributed systems (latency at each hop). Metrics are aggregated time-series numbers (request rate, error rate, p99 latency). Logs are discrete events with context. OTel aims to unify all three under one SDK and correlate them via trace IDs.
Does OTel add significant overhead?
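Usually little, though it depends on span volume and sampling. Exports are batched and sent from a background thread, off the request path. By default the SDK samples every trace it starts; high-traffic systems commonly configure head-based sampling in the SDK (keeping, for example, 1–10% of traces), which decides at the start of a trace. Tail-based sampling runs in the Collector, decides after a trace has completed, and can keep every trace that contains an error regardless of the base rate.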
OpenTelemetry vs Datadog APM agent?
OTel is vendor-neutral — you can switch backends without code changes. The Datadog agent is proprietary but has deeper Datadog-specific integrations. Many teams use OTel to instrument and the Datadog exporter to ship, getting the best of both.
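
Ratio-based sampling, mentioned in the overhead answer above, is usually made deterministic by deriving the decision from the trace ID, so every service that sees the same trace reaches the same verdict. A stdlib-only sketch of that idea follows. `should_sample` is a hypothetical helper, and the SDK's own sampler may differ in detail.

```python
import random

_TRACE_ID_LIMIT = 1 << 64  # decisions use the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * _TRACE_ID_LIMIT)
    return (trace_id & (_TRACE_ID_LIMIT - 1)) < bound


rng = random.Random(42)  # seeded so the demo is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(should_sample(tid, 0.10) for tid in trace_ids)
kept_fraction = kept / len(trace_ids)  # close to 0.10, not exactly
```

Because the decision is a pure function of the trace ID and the ratio, no coordination between services is needed, provided they are configured with the same ratio.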