A load balancer sits between clients and servers, distributing incoming requests across multiple backend instances. It's one of the most fundamental components in any scalable system — virtually every production deployment uses one.
This guide covers all major load balancing algorithms, the critical L4 vs L7 distinction, health checks, sticky sessions, and the tools used in real production systems.
Large systems have load balancers at multiple layers. Each layer handles a different concern — global traffic routing, SSL termination, service-to-service routing, and database connection pooling.
# OSI Model reminder:
# Layer 4 = Transport layer = TCP/UDP (IP address + port number)
# Layer 7 = Application layer = HTTP/HTTPS (URL, headers, cookies, body)
# L4 Load Balancer (TCP/UDP level):
# ─ Routes based on: source IP, destination IP, port, protocol
# ─ Cannot inspect request content — treats all traffic as byte streams
# ─ Very fast: minimal processing overhead, sub-millisecond
# ─ Examples: AWS Network Load Balancer (NLB), HAProxy TCP mode, F5 BIG-IP
#
# Use L4 when:
# ✓ Non-HTTP traffic (MySQL, Redis, MQTT, custom binary protocol)
# ✓ Need absolute maximum throughput (millions of connections per second)
# ✓ End-to-end encryption required (TLS passthrough — LB doesn't decrypt)
# L7 Load Balancer (HTTP level):
# ─ Routes based on: URL path, HTTP headers, cookies, query params, body
# ─ Can do: SSL termination, content-based routing, header modification, A/B testing
# ─ Slightly more overhead: must parse HTTP headers before routing
# ─ Examples: AWS ALB, Nginx, Envoy, Traefik, HAProxy HTTP mode
#
# Use L7 when:
# ✓ HTTP/HTTPS traffic (web apps, REST APIs, GraphQL)
# ✓ Content-based routing: /api/* → api servers, /static/* → CDN
# ✓ Header-based routing: different headers → different microservices
# ✓ A/B testing: 10% traffic to new deployment (canary)
# ✓ WebSocket support (L7 LBs handle upgrade: websocket headers)
# Content-based routing example (Nginx L7):
upstream api_servers    { server api1:8080; server api2:8080; server api3:8080; }
upstream static_servers { server cdn1:8080; server cdn2:8080; }
upstream ws_servers     { server ws1:8080; server ws2:8080; }
server {
    location /api/    { proxy_pass http://api_servers; }
    location /static/ { proxy_pass http://static_servers; }
    location /ws/ {
        proxy_pass http://ws_servers;
        proxy_http_version 1.1;                  # WebSocket upgrade requires HTTP/1.1
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";   # hop-by-hop header, must be set explicitly
    }
}
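As a rough illustration of what the Nginx `location` blocks above decide, the following Python sketch picks a backend pool by longest matching path prefix. The `ROUTES` table and `pick_pool` function are hypothetical names for this example, not part of Nginx, and real Nginx location matching has more rules than shown here.

```python
# Hypothetical routing table mirroring the Nginx config: path prefix -> backend pool.
ROUTES = {
    "/api/": ["api1:8080", "api2:8080", "api3:8080"],
    "/static/": ["cdn1:8080", "cdn2:8080"],
    "/ws/": ["ws1:8080", "ws2:8080"],
}


def pick_pool(path: str) -> list:
    """Return the pool whose prefix is the longest match for the request path."""
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    if not matches:
        return []  # no route: a real LB would answer 404 or use a default pool
    return ROUTES[max(matches, key=len)]


print(pick_pool("/api/users/42"))
print(pick_pool("/static/logo.png"))
```

An L4 balancer cannot make this decision, because the request path is only visible after the HTTP request has been parsed.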
Round Robin: requests are distributed sequentially: S1 → S2 → S3 → S1 → ... Equal distribution; assumes identical servers.
Weighted Round Robin: S1 (weight 3) gets 3× more traffic than S2 (weight 1). Use when servers have different capacities.
Least Connections: route to the server with the fewest active connections. Best when requests have variable processing time.
IP Hash: hash(client_IP) % N → the same client always hits the same server, as long as N does not change. Supports stateful sessions without a shared session store.
Least Response Time: combine active connections and response time. Route to the server that is fastest AND least loaded.
Random: pick a random server. Statistically equivalent to round robin at high request volume. Plain random is rarely the first choice in production, though the "power of two choices" variant (sample two servers, take the less loaded one) is available in Nginx and Envoy.
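The least-response-time idea can be sketched as follows. The scoring formula, `(active connections + 1) × average latency`, and the exponentially weighted moving average are assumptions chosen for illustration; real load balancers each define their own metric. The class and method names are hypothetical.

```python
class LeastResponseTimeLB:
    """Illustrative only: score = (active + 1) * EWMA latency, lowest score wins."""

    def __init__(self, servers, alpha=0.2):
        self.alpha = alpha                         # weight given to the newest latency sample
        self.active = {s: 0 for s in servers}      # in-flight requests per server
        self.latency = {s: 0.0 for s in servers}   # EWMA in seconds; 0.0 until first sample

    def get_server(self):
        # Unsampled servers score 0, so they are tried first (cheap exploration).
        return min(self.active, key=lambda s: (self.active[s] + 1) * self.latency[s])

    def on_request_start(self, server):
        self.active[server] += 1

    def on_request_end(self, server, elapsed):
        self.active[server] = max(0, self.active[server] - 1)
        prev = self.latency[server]
        if prev == 0.0:
            self.latency[server] = elapsed
        else:
            self.latency[server] = (1 - self.alpha) * prev + self.alpha * elapsed
```

A fast server keeps winning until its in-flight count grows enough to outweigh its latency advantage, which is the "fastest AND least loaded" trade-off described above.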
# Simple round robin (not thread-safe; illustration only):
class RoundRobinLB:
    def __init__(self, servers: list):
        self.servers = servers
        self.index = 0

    def get_server(self) -> str:
        server = self.servers[self.index % len(self.servers)]
        self.index += 1
        return server

lb = RoundRobinLB(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
lb.get_server()  # 10.0.0.1:8080
lb.get_server()  # 10.0.0.2:8080
lb.get_server()  # 10.0.0.3:8080
lb.get_server()  # 10.0.0.1:8080 (wraps around)
# Problem: if one server is slow (not down, just slow) → it builds up a queue
# while other servers sit idle — round robin doesn't adapt
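Since the round robin class above is flagged as not thread-safe, here is one possible thread-safe variant. It is a minimal sketch that guards an `itertools.cycle` iterator with a lock; the class name is hypothetical, and it does nothing about the slow-server problem just described.

```python
import itertools
import threading


class ThreadSafeRoundRobinLB:
    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))  # endless S1, S2, S3, S1, ...
        self._lock = threading.Lock()

    def get_server(self) -> str:
        # The lock makes "read position and advance" a single atomic step.
        with self._lock:
            return next(self._cycle)


lb = ThreadSafeRoundRobinLB(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
print([lb.get_server() for _ in range(4)])
```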
# Track active connections per server:
class LeastConnectionsLB:
    def __init__(self, servers: list):
        self.connections = {s: 0 for s in servers}

    def get_server(self) -> str:
        # Pick the server with the fewest active connections
        return min(self.connections, key=self.connections.get)

    def on_request_start(self, server: str):
        self.connections[server] += 1

    def on_request_end(self, server: str):
        self.connections[server] = max(0, self.connections[server] - 1)
# Example: 3 servers
# S1: 50 active connections (processing batch job)
# S2: 5 active connections
# S3: 3 active connections ← next request goes here
# Best for: database connections, WebSocket connections, file uploads, long-polling
# Overkill for: stateless APIs where every request completes in <50ms
# (round robin is fine — variance evens out at scale)
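The start/end bookkeeping in the least-connections example is easy to get wrong if a request raises an exception. One way to tie both calls together is a context manager, sketched below. The `LeastConnections` class and its `lease` method are hypothetical names for this example, and the sketch is single-threaded.

```python
from contextlib import contextmanager


class LeastConnections:
    def __init__(self, servers):
        self.connections = {s: 0 for s in servers}

    @contextmanager
    def lease(self):
        """Pick the least-loaded server and hold one connection slot on it."""
        server = min(self.connections, key=self.connections.get)
        self.connections[server] += 1
        try:
            yield server
        finally:
            # Runs even if the request fails, so counts never drift upward.
            self.connections[server] -= 1


lb = LeastConnections(["S1", "S2", "S3"])
lb.connections.update({"S1": 50, "S2": 5, "S3": 3})
with lb.lease() as server:
    print(server, lb.connections[server])
```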
# Scenario: you added 2 new high-memory servers but kept 2 old ones
# Old servers: 8 CPU, 16GB RAM → weight 1
# New servers: 32 CPU, 128GB RAM → weight 4
# Weight 1:1:4:4 means out of every 10 requests:
# old_1 gets 1, old_2 gets 1, new_1 gets 4, new_2 gets 4
# Nginx weighted config:
upstream backend {
    server 10.0.0.1:8080 weight=1;  # old server
    server 10.0.0.2:8080 weight=1;  # old server
    server 10.0.0.3:8080 weight=4;  # new server
    server 10.0.0.4:8080 weight=4;  # new server
}
# Also useful for canary deployments:
upstream backend {
    server v1.api:8080 weight=9;  # 90% traffic to stable version
    server v2.api:8080 weight=1;  # 10% canary to new version
}
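To make the 1:1:4:4 split concrete, here is a minimal sketch of a "smooth" weighted round robin, a well-known variant that spreads the heavy servers' turns out instead of sending them in bursts. The class name is hypothetical, and this is an illustration of the technique rather than any particular load balancer's code.

```python
from collections import Counter


class SmoothWeightedRR:
    def __init__(self, weights: dict):
        self.weights = dict(weights)            # server -> configured weight
        self.current = {s: 0 for s in weights}  # running score per server
        self.total = sum(weights.values())

    def get_server(self) -> str:
        # Every server earns its weight each round; the leader is picked
        # and then pays back the total, which pushes it down the ranking.
        for server, weight in self.weights.items():
            self.current[server] += weight
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen


lb = SmoothWeightedRR({"old_1": 1, "old_2": 1, "new_1": 4, "new_2": 4})
print(Counter(lb.get_server() for _ in range(10)))
```

Over any 10 consecutive picks starting from a fresh state, each old server is chosen once and each new server four times.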
# Use case: stateful server-side sessions (legacy apps, WebSocket routing)
# Problem to solve: user logs in → session stored in memory on Server 1
# next request goes to Server 2 → session not found → logged out
# IP Hash solution:
# hash(client_IP) % N → same user always hits same server
# Nginx IP hash config:
upstream backend {
    ip_hash;  # for IPv4, Nginx hashes only the first three octets of the client address
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}
# Limitation: if a server goes down, all its users lose sessions simultaneously
# Better solution: use a shared session store (Redis) + round robin
# Then any server can serve any request → no stickiness needed
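# Consistent hashing for IP routing:
# Modulo hashing (hash % N) with N servers → adding a server remaps roughly N/(N+1) of clients
# Consistent hash → only about 1/(N+1) of clients are remapped (see consistent-hashing-explained.html)
# Nginx supports this via `hash $remote_addr consistent;`, HAProxy via `hash-type consistent`,
# and Envoy via its ring hash and Maglev policies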
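The remapping difference can be measured directly. The sketch below compares modulo hashing with a simple hash ring when a fifth server is added to four. The `HashRing` class, the MD5-based hash, the 100 virtual nodes per server, and the synthetic client IPs are all assumptions made for this illustration.

```python
import bisect
import hashlib


def h(key: str) -> int:
    # Stable hash (Python's built-in hash() is randomized per process).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each server owns `vnodes` points on the ring to even out its share.
        self._points = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
        self._hashes = [point for point, _ in self._points]

    def get_server(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._hashes, h(key)) % len(self._points)
        return self._points[idx][1]


def mod_server(key: str, servers: list) -> str:
    return servers[h(key) % len(servers)]


clients = [f"10.0.{i // 256}.{i % 256}" for i in range(2000)]
old, new = ["s1", "s2", "s3", "s4"], ["s1", "s2", "s3", "s4", "s5"]
ring_old, ring_new = HashRing(old), HashRing(new)

moved_mod = sum(mod_server(c, old) != mod_server(c, new) for c in clients) / len(clients)
moved_ring = sum(ring_old.get_server(c) != ring_new.get_server(c) for c in clients) / len(clients)
print(f"mod-N remapped {moved_mod:.0%}, ring remapped {moved_ring:.0%}")
```

With the ring, every client that moves lands on the new server; no client is shuffled between the existing four.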
A load balancer is only as good as its ability to detect unhealthy servers and stop routing to them.
# Two types of health checks:
# 1. Passive (in-band): observe errors on real traffic
# LB notices: Server 3 returned 5 consecutive 500 errors → mark unhealthy
# → stop routing to Server 3, alert ops
# Pro: zero overhead (no extra requests), catches real failures
# Con: real users experience the failures before the server is marked down
# 2. Active (out-of-band): LB sends periodic synthetic health check requests
GET /health → 200 OK (server is healthy)
GET /health → 503 (server is unhealthy)
GET /health → timeout (server is dead / network issue)
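# Health check endpoint best practices:
# (assumes FastAPI, which matches the decorator and JSONResponse usage below;
#  `db` and `redis_client` are assumed to be initialized elsewhere in the app)
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
def health_check():
    checks = {}
    # Check DB connectivity; the database is critical, so fail immediately with 503
    try:
        db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"
        return JSONResponse({"status": "unhealthy", "checks": checks}, status_code=503)
    # Check Redis connectivity
    try:
        redis_client.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "error"
        # Decide: is Redis critical? Return 503 or degrade gracefully?
    return {"status": "healthy", "checks": checks}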
# Health check parameters (Nginx):
upstream backend {
    # Passive health: mark down after 3 failures within 30 seconds, retry after 30s
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
}
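The "3 failures in 30 seconds, retry after 30s" rule can be sketched as a small failure counter over a sliding time window. This approximates the behaviour described in the comment above and is not Nginx's actual implementation; the `PassiveHealth` class and its injectable `clock` parameter are hypothetical.

```python
import time


class PassiveHealth:
    def __init__(self, max_fails=3, fail_timeout=30.0, clock=time.monotonic):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.clock = clock          # injectable so the logic can be tested without sleeping
        self.failures = []          # timestamps of failures inside the current window
        self.down_until = 0.0       # server is skipped until this time

    def record_failure(self):
        now = self.clock()
        # Keep only failures that are still inside the window, then add this one.
        self.failures = [t for t in self.failures if now - t < self.fail_timeout]
        self.failures.append(now)
        if len(self.failures) >= self.max_fails:
            self.down_until = now + self.fail_timeout
            self.failures.clear()

    def is_available(self) -> bool:
        return self.clock() >= self.down_until
```

Failures that are spread further apart than the window never accumulate, so an occasional error does not take a server out of rotation.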
# AWS ALB active health check settings:
# Interval: 30 seconds (check every 30s)
# Timeout: 5 seconds (consider unhealthy if no response)
# Healthy threshold: 2 (2 consecutive successes → healthy)
# Unhealthy threshold: 3 (3 consecutive failures → unhealthy)
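The healthy and unhealthy thresholds above amount to a small state machine: a target only changes state after enough consecutive probe results contradict its current state. The sketch below illustrates that logic under the listed settings; `TargetHealth` is a hypothetical name and this is not AWS code.

```python
class TargetHealth:
    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes that contradict the current state

    def record_probe(self, success: bool) -> bool:
        if success == self.healthy:
            self._streak = 0  # result agrees with current state: reset the streak
        else:
            self._streak += 1
            needed = self.healthy_threshold if success else self.unhealthy_threshold
            if self._streak >= needed:
                self.healthy = success
                self._streak = 0
        return self.healthy


target = TargetHealth()
probes = [False, False, True, False, False, False, True, True]
print([target.record_probe(ok) for ok in probes])
# → [True, True, True, True, True, False, False, True]
```

With a 30-second interval and an unhealthy threshold of 3, a dead server can keep receiving traffic for on the order of 90 seconds, which is the cost of making the check tolerant of single blips.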
# Graceful shutdown:
# When a server is being decommissioned (deploy, restart):
# 1. Signal the LB to stop sending NEW requests (deregister)
# 2. Wait for in-flight requests to complete (drain timeout: 30–300s)
# 3. Then shutdown the server
# Don't just kill the server → drops active requests (502 errors for users)
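The three shutdown steps can be sketched from the server's side as follows. `DrainableServer` and its methods are hypothetical names; in a real deployment, step 1 would be a deregistration call to the load balancer rather than a local flag.

```python
import threading
import time


class DrainableServer:
    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self.accepting = True

    def try_begin(self) -> bool:
        """Called when a request arrives; refused once draining has started."""
        with self._lock:
            if not self.accepting:
                return False
            self._in_flight += 1
            return True

    def end(self):
        """Called when a request finishes."""
        with self._lock:
            self._in_flight -= 1

    def drain(self, timeout: float, poll: float = 0.01) -> bool:
        """Stop taking new work, then wait for in-flight work. True if fully drained."""
        with self._lock:
            self.accepting = False                 # step 1: no NEW requests
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:         # step 2: wait for in-flight requests
            with self._lock:
                if self._in_flight == 0:
                    return True                    # step 3: safe to shut down
            time.sleep(poll)
        with self._lock:
            return self._in_flight == 0            # timed out: caller decides what to do
```

A `False` return means the drain timeout expired with requests still running, which is the case where a hard kill would drop them.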
# Problem: stateful applications store session in server memory
# User request 1 → Server A → session created in Server A memory
# User request 2 → Server B → no session found → user appears logged out
# Sticky session methods:
# Method 1: Cookie-based stickiness (most common)
# LB injects a cookie: AWSALB=srv-id-abc123
# Subsequent requests from same browser include this cookie
# LB reads it → always routes to Server A
# Pro: works regardless of IP change (mobile users switching networks)
# Con: loses stickiness if cookie is deleted; server death → user logged out
# AWS ALB stickiness (duration-based):
# Enable: Target Group → Attributes → Stickiness → Enabled → Duration: 1 day
# Method 2: IP Hash (see above)
# Pro: no cookie overhead
# Con: IP can change (mobile, NAT, VPN), all users behind corporate NAT go to same server
# Method 3: Don't use sticky sessions — use shared state
# → Store session in Redis (shared across all servers)
# → Any server can serve any request → true horizontal scalability
# → This is the correct architectural choice for new systems
# Redis session store:
import json
import redis

session_store = redis.Redis(host='redis.prod', port=6379)

def get_session(session_id: str) -> dict:
    data = session_store.get(f"session:{session_id}")
    return json.loads(data) if data else {}

def save_session(session_id: str, data: dict, ttl_seconds: int = 3600):
    # SETEX writes the value and its expiry in one atomic command
    session_store.setex(f"session:{session_id}", ttl_seconds, json.dumps(data))
# Now any of your 20 API servers can serve the user → no sticky sessions needed
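For comparison with the shared-store approach, Method 1 (cookie-based stickiness) can be sketched like this. The cookie name `LB_SERVER`, the class, and the plain-text cookie value are assumptions for illustration; real load balancers typically use opaque or encrypted cookie values.

```python
import itertools


class CookieStickyLB:
    COOKIE = "LB_SERVER"  # hypothetical cookie name

    def __init__(self, servers):
        self.healthy = set(servers)
        self._rr = itertools.cycle(list(servers))  # fallback picker for new clients

    def route(self, cookies: dict):
        """Return (server, cookies_to_set) for a request carrying `cookies`."""
        if not self.healthy:
            raise RuntimeError("no healthy servers")
        server = cookies.get(self.COOKIE)
        if server not in self.healthy:
            # No cookie yet, or the pinned server is gone: pick a new healthy one.
            # The user's in-memory session on the old server is lost at this point.
            server = next(s for s in self._rr if s in self.healthy)
        return server, {self.COOKIE: server}


lb = CookieStickyLB(["A", "B"])
server, cookie = lb.route({})          # first visit: no cookie
print(server, lb.route(cookie)[0])     # later requests follow the cookie
```

The failover branch shows why stickiness does not remove the need for shared state: the request is rerouted, but the session data stays behind on the dead server.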
| Tool | Layer | Algorithms | Best For |
|---|---|---|---|
| Nginx | L7 (HTTP); L4 via the `stream` module | Round robin, least conn, IP hash, random, weighted | Web servers, reverse proxy, SSL termination, content routing |
| HAProxy | L4 + L7 | Round robin, least conn, source IP, URI hash, random | High-performance TCP/HTTP load balancing, database proxying |
| AWS ALB | L7 | Round robin, least outstanding requests | AWS-native HTTP/HTTPS + WebSocket, Lambda targets, ECS |
| AWS NLB | L4 | Flow hash (5-tuple: IP, port, protocol) | Ultra-low latency TCP/UDP, static IP, non-HTTP protocols |
| Envoy Proxy | L7 | Round robin, least request, ring hash, random, Maglev | Service mesh sidecar (Istio), microservices, gRPC |
| Traefik | L7 | Round robin, weighted, sticky | Kubernetes-native, auto-discover Docker/K8s services |
| PgBouncer | DB proxy | Connection pooling (not HTTP) | PostgreSQL connection pooling — critical at scale (PostgreSQL has process-per-connection model) |
# Google published Maglev (2016); it is used in their production load balancers
# Goal: consistent hashing for LBs, so the same client → same backend across LB instances
# This matters for multi-LB setups (multiple LBs behind an Anycast IP)
#
# How Maglev works:
# 1. Pre-compute a lookup table of size M (a prime much larger than the backend count, e.g. 65537)
# 2. Each backend derives its own permutation of the M slots from two hash functions (offset and skip)
# 3. Backends take turns, round-robin, claiming their next preferred free slot until all M entries are assigned
# 4. Lookup: hash(5-tuple) % M → table[index] → backend
#
# Result:
# - O(1) lookup (table is an array)
# - Minimal disruption: when a backend is added/removed, little more than its ~1/N share of
#   table entries changes, whereas mod-N hashing remaps most flows
# - Consistent across all LB instances (each computes the same table from the same backend list)
#
# Used by: Google, Envoy's "Maglev" balancing policy
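The table-filling steps above can be sketched in a few lines of Python. This is a minimal illustration under assumptions: SHA-256 with a salt stands in for the two hash functions, the table size of 251 is a small prime chosen for readability, and the function names are hypothetical.

```python
import hashlib


def _h(data: str, salt: str) -> int:
    return int(hashlib.sha256((salt + data).encode()).hexdigest(), 16)


def build_table(backends, m=251):
    """Fill a lookup table of prime size m; each backend claims slots in turn."""
    if not backends:
        raise ValueError("need at least one backend")
    n = len(backends)
    # Step 2: offset and skip define backend i's permutation of the m slots.
    offsets = [_h(b, "offset") % m for b in backends]
    skips = [_h(b, "skip") % (m - 1) + 1 for b in backends]
    next_idx = [0] * n          # position of each backend in its own permutation
    table = [None] * m
    filled = 0
    while True:                 # Step 3: round-robin until every slot is taken
        for i in range(n):
            slot = (offsets[i] + next_idx[i] * skips[i]) % m
            while table[slot] is not None:   # skip slots already claimed
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % m
            table[slot] = backends[i]
            next_idx[i] += 1
            filled += 1
            if filled == m:
                return table


def lookup(table, flow: str) -> str:
    # Step 4: one hash and one array read, so lookup cost is constant.
    return table[_h(flow, "flow") % len(table)]


table = build_table(["backend-1", "backend-2", "backend-3"])
print(lookup(table, "203.0.113.7:51234->10.0.0.1:443/tcp"))
```

Because each backend claims exactly one slot per turn, table shares differ by at most one entry, and any LB instance that runs `build_table` on the same backend list produces an identical table.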
| Scenario | Algorithm | Why |
|---|---|---|
| Stateless API servers, homogeneous hardware | Round Robin | Simple, even distribution, zero state in LB |
| Long-lived connections (WebSocket, SSE, DB) | Least Connections | Avoids overloading servers with many slow connections |
| Mixed server capacities (old + new hardware) | Weighted Round Robin | Proportional distribution by capacity |
| Stateful servers (legacy, in-memory sessions) | IP Hash / Cookie Sticky | Same client always hits same server |
| Multi-LB cluster, connection consistency matters | Consistent Hash / Maglev | All LB instances make the same routing decision |
| CPU-intensive requests, variable response time | Least Response Time | Routes to server that's actually fast, not just idle |
| Canary / A/B deployment | Weighted Round Robin | 10% to new, 90% to old — gradual rollout |
# Interviewer: "How would you load balance 50M requests/day across 10 API servers?"
#
# Answer: "I'd use an L7 load balancer (AWS ALB or Nginx), starting with round robin.
# 50M requests/day ≈ 580 req/sec average, which is light work for a pool of 10 servers.
# For the algorithm: round robin works fine for stateless APIs where all requests complete
# in similar time. I'd upgrade to least-connections if any requests are slow (file uploads,
# PDF generation), since those would back up one server disproportionately with round robin.
# Health checks: active HTTP checks on /health every 30s, unhealthy threshold of 3 failures.
# Graceful shutdown: 30-second drain on deploys so in-flight requests complete.
# Session state: in Redis, not server memory, so we can use any algorithm without stickiness.
# For the database tier: read replicas + HAProxy or ProxySQL to distribute SELECT queries
# across replicas, while writes always go to the primary."
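The capacity estimate in the answer is simple arithmetic, shown here for checking. The peak factor of 3 is an assumption added for illustration (real traffic is rarely flat across the day); it does not come from the scenario.

```python
REQUESTS_PER_DAY = 50_000_000
SERVERS = 10
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

avg_rps = REQUESTS_PER_DAY / SECONDS_PER_DAY
per_server_rps = avg_rps / SERVERS

PEAK_FACTOR = 3  # assumption: peak traffic is about 3x the daily average
peak_per_server_rps = per_server_rps * PEAK_FACTOR

print(round(avg_rps), round(per_server_rps), round(peak_per_server_rps))
# → 579 58 174
```

Even with the assumed peak factor, each server handles well under 200 requests per second, which supports the claim that the algorithm choice matters more for slow requests than for raw volume.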