Load Balancing Algorithms

Round robin, least connections, IP hash, consistent hashing — L4 vs L7, health checks, and sticky sessions

A load balancer sits between clients and servers, distributing incoming requests across multiple backend instances. It's one of the most fundamental components in any scalable system — virtually every production deployment uses one.

This guide covers all major load balancing algorithms, the critical L4 vs L7 distinction, health checks, sticky sessions, and the tools used in real production systems.

1 Where Load Balancers Live

Internet
→ DNS (GeoDNS routes to the nearest region)
→ Global Load Balancer (Anycast IP, e.g. CloudFront, Cloudflare)
→ Regional L7 Load Balancer (AWS ALB, Nginx, Envoy)
→ Service Layer (API Gateway)
→ Internal L7 Load Balancer (Envoy sidecar / service mesh)
→ App Servers (multiple instances)
→ DB Load Balancer (PgBouncer, ProxySQL)
→ Database read replicas

Large systems have load balancers at multiple layers. Each layer handles a different concern — global traffic routing, SSL termination, service-to-service routing, and database connection pooling.

In system design interviews: when you add "load balancer" to a diagram, specify whether it's L4 or L7, and which algorithm you'd use. "Add an L7 load balancer using least-connections between the API servers" is more impressive than just a box labeled "LB".

2 L4 vs L7 Load Balancers

# OSI Model reminder:
# Layer 4 = Transport layer = TCP/UDP (IP address + port number)
# Layer 7 = Application layer = HTTP/HTTPS (URL, headers, cookies, body)

# L4 Load Balancer (TCP/UDP level):
# ─ Routes based on: source IP, destination IP, port, protocol
# ─ Cannot inspect request content — treats all traffic as byte streams
# ─ Very fast: minimal processing overhead, sub-millisecond
# ─ Examples: AWS Network Load Balancer (NLB), HAProxy TCP mode, F5 BIG-IP
#
# Use L4 when:
# ✓ Non-HTTP traffic (MySQL, Redis, MQTT, custom binary protocol)
# ✓ Need absolute maximum throughput (millions of connections per second)
# ✓ End-to-end encryption required (TLS passthrough — LB doesn't decrypt)

# L7 Load Balancer (HTTP level):
# ─ Routes based on: URL path, HTTP headers, cookies, query params, body
# ─ Can do: SSL termination, content-based routing, header modification, A/B testing
# ─ Slightly more overhead: must parse HTTP headers before routing
# ─ Examples: AWS ALB, Nginx, Envoy, Traefik, HAProxy HTTP mode
#
# Use L7 when:
# ✓ HTTP/HTTPS traffic (web apps, REST APIs, GraphQL)
# ✓ Content-based routing: /api/* → api servers, /static/* → CDN
# ✓ Header-based routing: different headers → different microservices
# ✓ A/B testing: 10% traffic to new deployment (canary)
# ✓ WebSocket routing by path (L7 LBs understand the HTTP "Upgrade: websocket" handshake)

# Content-based routing example (Nginx L7):
upstream api_servers    { server api1:8080; server api2:8080; server api3:8080; }
upstream static_servers { server cdn1:8080; server cdn2:8080; }
upstream ws_servers     { server ws1:8080;  server ws2:8080; }

server {
    listen 80;
    location /api/     { proxy_pass http://api_servers; }
    location /static/  { proxy_pass http://static_servers; }
    location /ws/      {
        proxy_pass http://ws_servers;
        # WebSocket upgrade needs HTTP/1.1 plus both hop-by-hop headers forwarded
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
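Nginx's location matching can be modeled in a few lines of Python. This is a minimal sketch that assumes prefix-only locations; Nginx also supports exact and regex matches, which are ignored here. The pool names and the `route` helper are hypothetical. Among the matching prefixes the longest one wins, and a round-robin iterator then picks a server within that pool.

```python
from itertools import cycle
from typing import Iterator, Optional

# Hypothetical pools mirroring the upstream blocks above
POOLS: dict[str, Iterator[str]] = {
    "/api/": cycle(["api1:8080", "api2:8080", "api3:8080"]),
    "/static/": cycle(["cdn1:8080", "cdn2:8080"]),
    "/ws/": cycle(["ws1:8080", "ws2:8080"]),
}


def route(path: str) -> Optional[str]:
    """Return a backend for `path` using longest-prefix match, or None if nothing matches."""
    matches = [prefix for prefix in POOLS if path.startswith(prefix)]
    if not matches:
        return None  # this sketch has no catch-all location
    longest = max(matches, key=len)
    return next(POOLS[longest])  # round robin within the chosen pool
```

Calling `route("/api/users")` and then `route("/api/orders")` returns api1:8080 followed by api2:8080. A path such as /admin matches no location, so the sketch returns None.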

3 Load Balancing Algorithms

Round Robin

Default choice

Requests distributed sequentially: S1 → S2 → S3 → S1 → ... Equal distribution, assumes identical servers.

Weighted Round Robin

Heterogeneous servers

S1 (weight 3) gets 3× as much traffic as S2 (weight 1). Use when servers have different capacities.

Least Connections

Long-lived connections

Route to server with fewest active connections. Best when requests have variable processing time.

IP Hash

Session affinity

hash(client_IP) % N → same client always hits same server. Stateful sessions without a shared session store.

Least Response Time

Performance-aware

Combines the active connection count with measured response time, then routes to the server that is both fast and lightly loaded.

Random

Simple / testing

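Pick a random server. At high request volume, the distribution evens out much like round robin. Plain random selection is uncommon in production. Its "power of two choices" variant is widely used, though: sample two servers at random and send the request to the less loaded one. Envoy's least-request policy and Nginx's "random two least_conn" both work this way.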
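There is no single standard formula for Least Response Time. The sketch below scores each server as smoothed latency × (in-flight requests + 1), where latency is tracked as an exponentially weighted moving average (EWMA). The class names, the 100 ms starting estimate, and the smoothing factor `alpha` are illustrative choices, not any product's defaults.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    active: int = 0          # in-flight requests
    ewma_ms: float = 100.0   # smoothed latency; the 100 ms starting guess is an assumption


class LeastResponseTimeLB:
    """Sketch: pick the backend with the lowest ewma_ms * (active + 1)."""

    def __init__(self, names: list[str], alpha: float = 0.3):
        self.backends = {name: Backend(name) for name in names}
        self.alpha = alpha  # EWMA smoothing factor (illustrative value)

    def get_server(self) -> str:
        best = min(self.backends.values(), key=lambda b: b.ewma_ms * (b.active + 1))
        best.active += 1
        return best.name

    def on_response(self, name: str, latency_ms: float) -> None:
        b = self.backends[name]
        b.active = max(0, b.active - 1)
        # Blend the new sample into the running average
        b.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * b.ewma_ms
```

The `+ 1` makes latency count even when a server is idle. Without it, an idle but slow server would always beat a busy fast one.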

Round Robin — Implementation

# Simple round robin (not thread-safe — illustration only):
class RoundRobinLB:
    def __init__(self, servers: list):
        self.servers = servers
        self.index = 0

    def get_server(self) -> str:
        server = self.servers[self.index % len(self.servers)]
        self.index += 1
        return server

lb = RoundRobinLB(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
lb.get_server()  # 10.0.0.1:8080
lb.get_server()  # 10.0.0.2:8080
lb.get_server()  # 10.0.0.3:8080
lb.get_server()  # 10.0.0.1:8080 (wraps around)

# Problem: if one server is slow (not down, just slow) → it builds up a queue
# while other servers sit idle — round robin doesn't adapt
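The class above is not thread-safe, because reading and advancing `index` are separate steps. A minimal fix, assuming the balancer runs as a multi-threaded Python process, is to guard the read-and-advance with a `threading.Lock`. Keeping the index modulo the pool size also stops it from growing without bound. The class name is hypothetical.

```python
import threading


class ThreadSafeRoundRobinLB:
    def __init__(self, servers: list[str]):
        if not servers:
            raise ValueError("need at least one server")
        self._servers = list(servers)
        self._index = 0
        self._lock = threading.Lock()

    def get_server(self) -> str:
        # Read-and-advance must happen atomically, or two threads can pick the same server
        with self._lock:
            server = self._servers[self._index]
            self._index = (self._index + 1) % len(self._servers)
        return server
```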

Least Connections — The Smart Default for Variable Load

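# Track active connections per server (not thread-safe, illustration only):
class LeastConnectionsLB:
    def __init__(self, servers: list):
        self.connections = {s: 0 for s in servers}

    def get_server(self) -> str:
        # Fewest in-flight requests wins; ties go to the earliest server in the list
        return min(self.connections, key=self.connections.get)

    def on_request_start(self, server: str):
        # Caller must invoke this after get_server(), or the count never rises
        self.connections[server] += 1

    def on_request_end(self, server: str):
        # Clamp at 0 to tolerate duplicate completion callbacks
        self.connections[server] = max(0, self.connections[server] - 1)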

# Example: 3 servers
# S1: 50 active connections  (processing batch job)
# S2: 5 active connections
# S3: 3 active connections   ← next request goes here

# Best for: database connections, WebSocket connections, file uploads, long-polling
# Overkill for: stateless APIs where every request completes in <50ms
#               (round robin is fine — variance evens out at scale)
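A toy discrete-time simulation reproduces the slow-server problem described in the round-robin section. It rests on these hypothetical assumptions:

- Two fast servers finish a request in 1 tick, and one slow server needs 4 ticks.
- Each server works through its queue one request at a time.
- 2 requests arrive per tick.

"Least connections" here means fewest outstanding requests, as in the class above.

```python
from itertools import cycle

SERVICE_TICKS = {"fast1": 1, "fast2": 1, "slow": 4}  # hypothetical per-request cost


def simulate(policy: str, ticks: int = 200, arrivals_per_tick: int = 2) -> dict[str, int]:
    """Return each server's outstanding requests after `ticks` ticks."""
    outstanding = {s: 0 for s in SERVICE_TICKS}  # queued + in service
    progress = {s: 0 for s in SERVICE_TICKS}     # ticks spent on the head-of-line request
    rr = cycle(SERVICE_TICKS)
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            if policy == "round_robin":
                server = next(rr)
            elif policy == "least_connections":
                server = min(outstanding, key=outstanding.get)
            else:
                raise ValueError(f"unknown policy: {policy}")
            outstanding[server] += 1
        # Every busy server spends one tick on its current request
        for server, cost in SERVICE_TICKS.items():
            if outstanding[server]:
                progress[server] += 1
                if progress[server] == cost:
                    outstanding[server] -= 1
                    progress[server] = 0
    return outstanding
```

With round robin, the slow server's backlog keeps growing and passes 80 requests after 200 ticks. With least connections, every queue stays empty, because the fast servers absorb all the traffic and the slow one is never needed.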

Weighted Round Robin — Heterogeneous Infrastructure

# Scenario: you added 2 new high-memory servers but kept 2 old ones
# Old servers: 8 CPU, 16GB RAM  → weight 1
# New servers: 32 CPU, 128GB RAM → weight 4

# Weight 1:1:4:4 means out of every 10 requests:
# old_1 gets 1, old_2 gets 1, new_1 gets 4, new_2 gets 4

# Nginx weighted config:
upstream backend {
    server 10.0.0.1:8080 weight=1;  # old server
    server 10.0.0.2:8080 weight=1;  # old server
    server 10.0.0.3:8080 weight=4;  # new server
    server 10.0.0.4:8080 weight=4;  # new server
}

# Also useful for canary deployments:
upstream backend {
    server v1.api:8080 weight=9;   # 90% traffic to stable version
    server v2.api:8080 weight=1;   # 10% canary to new version
}
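A naive weighted scheme could send the heavy servers their requests in bursts. Smooth weighted round robin avoids this, and it is the interleaving scheme Nginx's round-robin upstream is generally described as using. The class and server names below are hypothetical. On every pick, each server's current weight grows by its configured weight. The server with the highest current weight wins, and the winner's current weight is then reduced by the total of all weights.

```python
class SmoothWeightedRoundRobin:
    def __init__(self, weights: dict[str, int]):
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be positive")
        self.weights = dict(weights)
        self.current = {server: 0 for server in weights}
        self.total = sum(weights.values())

    def get_server(self) -> str:
        # Raise every server's current weight by its configured weight
        for server, weight in self.weights.items():
            self.current[server] += weight
        # Highest current weight wins; max() keeps the first server on ties
        best = max(self.current, key=self.current.get)
        # Penalize the winner so the others catch up on later picks
        self.current[best] -= self.total
        return best
```

With weights 1:1:4:4 (old_1, old_2, new_1, new_2), the first ten picks come out as new_1, new_2, old_1, new_1, new_2, old_2, new_1, new_2, new_1, new_2. That is the same 1/1/4/4 split as the config above, but interleaved rather than bursty.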

IP Hash — Sticky Sessions Without a Cookie

# Use case: stateful server-side sessions (legacy apps, WebSocket routing)
# Problem to solve: user logs in → session stored in memory on Server 1
#                   next request goes to Server 2 → session not found → logged out

# IP Hash solution:
# hash(client_IP) % N → same user always hits same server

# Nginx IP hash config:
upstream backend {
    ip_hash;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

# Limitation: if a server goes down, all its users lose sessions simultaneously
# Better solution: use a shared session store (Redis) + round robin
# Then any server can serve any request → no stickiness needed

# Consistent Hashing for IP routing:
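# ip_hash (mod-N hashing): going from N to N+1 servers remaps about N/(N+1) of users
# Consistent hash → only about 1/(N+1) of users are remapped (see consistent-hashing-explained.html)
# Nginx ("hash $remote_addr consistent"), HAProxy ("hash-type consistent") and Envoy (ring hash, Maglev) support consistent hashing natively; AWS ALB does not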
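A small experiment can check the remapping claim. The sketch compares two schemes:

- **Plain mod-N hashing:** roughly what ip_hash does, although the real ip_hash hashes only the first three octets of an IPv4 address.
- **A generic hash ring with virtual nodes:** a textbook version, not Nginx's ketama-based implementation.

The server names and IP range are made up.

```python
import bisect
import hashlib


def h(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is randomized per process
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def mod_n(ip: str, servers: list[str]) -> str:
    return servers[h(ip) % len(servers)]


class HashRing:
    def __init__(self, servers: list[str], vnodes: int = 100):
        # Place each server at `vnodes` points on the ring to even out the split
        self._ring = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def lookup(self, ip: str) -> str:
        # First ring point clockwise from the key's hash, wrapping past the end
        idx = bisect.bisect(self._points, h(ip)) % len(self._ring)
        return self._ring[idx][1]


def remapped_fraction(before, after, ips: list[str]) -> float:
    return sum(before(ip) != after(ip) for ip in ips) / len(ips)


OLD = ["s1", "s2", "s3"]
NEW = OLD + ["s4"]
IPS = [f"10.0.{i // 256}.{i % 256}" for i in range(20_000)]
```

Going from 3 to 4 servers, mod-N moves about 75% of clients, matching N/(N+1) with N = 3. The ring moves roughly 25%, and every client that moves lands on the new server s4.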

4 Health Checks

A load balancer is only as good as its ability to detect unhealthy servers and stop routing to them.

# Two types of health checks:

# 1. Passive (in-band): observe errors in real traffic
#    LB notices: Server 3 returned 5 consecutive 500 errors → mark unhealthy
#    → stop routing to Server 3, alert ops
#    Pro: zero overhead (no extra requests), catches real failures
#    Con: real users experience the failures before the server is marked down

# 2. Active (periodic probe) — LB sends synthetic health check requests
GET /health → 200 OK  (server is healthy)
GET /health → 503     (server is unhealthy)
GET /health → timeout (server is dead / network issue)

# Health check endpoint best practices (FastAPI assumed; db and redis_client are created elsewhere):
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
def health_check():
    checks = {}
    # Check DB connectivity: the DB is critical, so a failure returns 503 and the LB stops routing here
    try:
        db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"
        return JSONResponse({"status": "unhealthy", "checks": checks}, status_code=503)

    # Check Redis connectivity
    try:
        redis_client.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "error"
        # Decide: is Redis critical? Return 503 or degrade gracefully?

    return {"status": "healthy", "checks": checks}

# Passive health check parameters (open-source Nginx; active "health_check" probes need NGINX Plus):
upstream backend {
    # Mark a server down after 3 failures within 30s, then retry it after 30s
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
}
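The max_fails/fail_timeout behavior can be modeled roughly as follows. This simplified sketch is not Nginx's actual implementation. It assumes a sliding failure window and takes an injectable clock so the logic can be tested. The class name is hypothetical.

```python
import time
from typing import Callable


class PassiveHealth:
    """Mark a server unavailable after `max_fails` failures within `fail_timeout` seconds."""

    def __init__(self, max_fails: int = 3, fail_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.clock = clock
        self._failures: dict[str, list[float]] = {}
        self._down_until: dict[str, float] = {}

    def record_failure(self, server: str) -> None:
        now = self.clock()
        # Keep only failures that are still inside the window
        recent = [t for t in self._failures.get(server, []) if now - t < self.fail_timeout]
        recent.append(now)
        if len(recent) >= self.max_fails:
            # Take the server out of rotation; real traffic retries it after fail_timeout
            self._down_until[server] = now + self.fail_timeout
            recent = []
        self._failures[server] = recent

    def is_available(self, server: str) -> bool:
        return self.clock() >= self._down_until.get(server, float("-inf"))
```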

# AWS ALB active health check settings (example values):
# Interval:            30 seconds (probe every 30s)
# Timeout:             5 seconds (a probe with no response within 5s counts as a failure)
# Healthy threshold:   2 (2 consecutive successes → healthy)
# Unhealthy threshold: 3 (3 consecutive failures → unhealthy)
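The healthy and unhealthy thresholds amount to a small state machine over consecutive probe results. This minimal sketch assumes that a target starts out healthy and that a timed-out probe is recorded as a failure. The class and method names are hypothetical.

```python
class TargetHealth:
    """Flip state only after N consecutive probe results that disagree with it."""

    def __init__(self, healthy_threshold: int = 2, unhealthy_threshold: int = 3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True   # assumption: new targets start healthy
        self._streak = 0      # consecutive results contradicting the current state

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self._streak = 0  # result agrees with the current state; reset the count
        else:
            self._streak += 1
            needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
            if self._streak >= needed:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy
```

A single successful probe resets the failure streak, so a flapping target stays in rotation. With a 30s interval and an unhealthy threshold of 3, a dead target can keep receiving traffic for roughly a minute or more before it is pulled. That delay is one reason to pair active checks with passive ones.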

# Graceful shutdown:
# When a server is being decommissioned (deploy, restart):
# 1. Signal the LB to stop sending NEW requests (deregister)
# 2. Wait for in-flight requests to complete (drain timeout: 30–300s)
# 3. Then shutdown the server
# Don't just kill the server → drops active requests (502 errors for users)
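Step 2 can be sketched inside the server process. This minimal sketch assumes the server tracks its own in-flight requests and sets a draining flag once LB deregistration has started; its /health endpoint could report that flag as a 503. `Drainer` and its methods are hypothetical names, not a framework API.

```python
import threading


class Drainer:
    """Hypothetical helper: counts in-flight requests so shutdown can wait for them."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self.draining = False  # a /health handler could return 503 while this is True

    def start_request(self) -> bool:
        with self._cond:
            if self.draining:
                return False  # refuse new work once draining has begun
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()  # wake drain() so it can re-check the count

    def drain(self, timeout: float) -> bool:
        """Stop accepting work, then wait up to `timeout` seconds; True if all requests finished."""
        with self._cond:
            self.draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)
```

In practice, drain() would be called from a shutdown handler (for example, on SIGTERM). Its timeout should be shorter than whatever grace period the platform allows before it force-kills the process.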

5 Sticky Sessions (Session Affinity)

# Problem: stateful applications store session in server memory
# User request 1 → Server A → session created in Server A memory
# User request 2 → Server B → no session found → user appears logged out

# Sticky session methods:

# Method 1: Cookie-based stickiness (most common)
# LB injects a cookie: AWSALB=srv-id-abc123
# Subsequent requests from same browser include this cookie
# LB reads it → always routes to Server A
# Pro: works regardless of IP change (mobile users switching networks)
# Con: loses stickiness if cookie is deleted; server death → user logged out

# AWS ALB stickiness (duration-based):
# Enable: Target Group → Attributes → Stickiness → Enabled → Duration: 1 day
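Cookie stickiness can be modeled in a few lines. This is a hypothetical sketch, not how ALB works internally. The cookie name LB_STICKY is made up, and the cookie stores the server address in plain text, whereas a real LB would use an opaque or encrypted value. The sketch also shows the failure mode listed above: when the pinned server dies, the user is re-pinned elsewhere, and any in-memory session on the old server is lost.

```python
from itertools import cycle


class CookieStickyLB:
    """Hypothetical sketch of LB-injected cookie stickiness."""

    COOKIE = "LB_STICKY"  # made-up cookie name; ALB uses its own AWSALB cookie

    def __init__(self, servers: list[str]):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rr = cycle(self.servers)

    def _next_healthy(self) -> str:
        # Round robin, skipping servers that health checks have marked down
        for _ in range(len(self.servers)):
            server = next(self._rr)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers")

    def route(self, cookies: dict[str, str]) -> tuple[str, dict[str, str]]:
        """Return (server, cookies to set on the response)."""
        pinned = cookies.get(self.COOKIE)
        if pinned in self.healthy:
            return pinned, {}  # sticky hit: no new cookie needed
        server = self._next_healthy()  # first visit, or the pinned server is gone
        return server, {self.COOKIE: server}
```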

# Method 2: IP Hash (see above)
# Pro: no cookie overhead
# Con: the client IP can change (mobile networks, VPNs), and every user behind one corporate NAT lands on the same server

# Method 3: Don't use sticky sessions — use shared state
# → Store session in Redis (shared across all servers)
# → Any server can serve any request → true horizontal scalability
# → This is the correct architectural choice for new systems

# Redis session store (redis-py):
import json
import redis

session_store = redis.Redis(host='redis.prod', port=6379)

def get_session(session_id: str) -> dict:
    data = session_store.get(f"session:{session_id}")
    return json.loads(data) if data else {}

def save_session(session_id: str, data: dict, ttl_seconds: int = 3600) -> None:
    # SETEX writes the value and its expiry in one atomic command
    session_store.setex(f"session:{session_id}", ttl_seconds, json.dumps(data))

# Now any of your 20 API servers can serve the user → no sticky sessions needed
Interview framing: "I'd avoid sticky sessions by externalizing session state to Redis. This gives us true horizontal scalability — any server can handle any request, and we can add/remove servers without disrupting existing sessions."

6 Load Balancer Tools in Production

| Tool | Layer | Algorithms | Best For |
|---|---|---|---|
| Nginx | L7 (HTTP); L4 via the stream module | Round robin, least conn, IP hash, generic hash, random, weighted | Web servers, reverse proxy, SSL termination, content routing |
| HAProxy | L4 + L7 | Round robin, least conn, source IP, URI hash, random | High-performance TCP/HTTP load balancing, database proxying |
| AWS ALB | L7 | Round robin, least outstanding requests, weighted random | AWS-native HTTP/HTTPS + WebSocket, Lambda targets, ECS |
| AWS NLB | L4 | Flow hash (5-tuple: protocol, source/destination IP, source/destination port) | Ultra-low-latency TCP/UDP, static IP, non-HTTP protocols |
| Envoy Proxy | L4 + L7 | Round robin, least request, ring hash, random, Maglev | Service mesh sidecar (Istio), microservices, gRPC |
| Traefik | L7 | Round robin, weighted, sticky | Kubernetes-native, auto-discovers Docker/K8s services |
| PgBouncer | DB proxy | Connection pooling (not HTTP) | PostgreSQL connection pooling, critical at scale because PostgreSQL uses a process-per-connection model |

The Maglev Algorithm — Google's Consistent LB

# Google published Maglev (2016) — used in their production load balancers
# Goal: consistent hashing for LBs — same client → same backend across LB instances
# This matters for multi-LB setups (multiple LBs behind an Anycast IP)

# How Maglev works:
# 1. Pre-compute a lookup table of size M (large prime, e.g. 65537)
# 2. Each backend derives its own preference order over the M slots (a permutation) from two hashes of its name: offset and skip
# 3. Backends take turns claiming their next preferred empty slot until all M entries are assigned
# 4. Lookup: hash(5-tuple) % M → table[hash] → backend

# Result:
# - O(1) lookup (table is an array)
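# - Near-even balance: each backend owns about M/N table entries
# - Minimal disruption: adding/removing a backend moves few entries beyond the ones it gains or loses (mod-N hashing remaps about N/(N+1) of keys going from N to N+1 backends)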
# - Consistent across all LB instances (all use the same pre-computed table)
# Used by: Google, Envoy's "Maglev" balancing policy
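The table-filling step can be sketched in Python following the population procedure from the Maglev paper. Two substitutions are assumptions: hashlib-based functions stand in for the paper's hash functions, and the backend names are made up. The example uses a small prime M so it runs instantly; the paper's default is 65537.

```python
import hashlib


def _h(name: str, salt: str) -> int:
    return int.from_bytes(hashlib.sha256(f"{salt}:{name}".encode()).digest()[:8], "big")


def maglev_table(backends: list[str], m: int = 65537) -> list:
    """Build a Maglev lookup table of size m (m must be prime and larger than len(backends))."""
    if not backends:
        raise ValueError("need at least one backend")
    n = len(backends)
    # Each backend's preference list: slot_j = (offset + j * skip) % m, a permutation of 0..m-1
    offset = [_h(b, "offset") % m for b in backends]
    skip = [_h(b, "skip") % (m - 1) + 1 for b in backends]
    next_j = [0] * n            # how far each backend has walked its preference list
    table: list = [None] * m
    filled = 0
    while True:
        for i in range(n):      # backends take turns claiming one slot each
            slot = (offset[i] + next_j[i] * skip[i]) % m
            while table[slot] is not None:   # skip slots already taken
                next_j[i] += 1
                slot = (offset[i] + next_j[i] * skip[i]) % m
            table[slot] = backends[i]
            next_j[i] += 1
            filled += 1
            if filled == m:
                return table


def lookup(table: list, flow_key: str) -> str:
    # The real system hashes the packet's 5-tuple; any stable string key works here
    return table[_h(flow_key, "flow") % len(table)]
```

Backends claim slots in turns, so their entry counts differ by at most one. Every LB instance runs the same deterministic procedure on the same backend list, so they all build identical tables.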

7 Choosing the Right Algorithm

| Scenario | Algorithm | Why |
|---|---|---|
| Stateless API servers, homogeneous hardware | Round Robin | Simple, even distribution, zero state in LB |
| Long-lived connections (WebSocket, SSE, DB) | Least Connections | Avoids overloading servers with many slow connections |
| Mixed server capacities (old + new hardware) | Weighted Round Robin | Proportional distribution by capacity |
| Stateful servers (legacy, in-memory sessions) | IP Hash / Cookie Sticky | Same client always hits same server |
| Multi-LB cluster, connection consistency matters | Consistent Hash / Maglev | All LB instances make the same routing decision |
| CPU-intensive requests, variable response time | Least Response Time | Routes to server that's actually fast, not just idle |
| Canary / A/B deployment | Weighted Round Robin | 10% to new, 90% to old; gradual rollout |

Sample Interview Answer

# Interviewer: "How would you load balance 50M requests/day across 10 API servers?"

# Answer:
"I'd use an L7 load balancer (AWS ALB or Nginx) and start with round robin.
50M requests/day ≈ 580 req/sec average, or about 58 req/sec per server, which is light work for a pool of 10 servers.

For the algorithm: round robin works fine for stateless APIs where all requests complete
in similar time. I'd upgrade to least-connections if any requests are slow (file uploads,
PDF generation) since those would back up one server disproportionately with round robin.

Health checks: active HTTP checks on /health every 30s, unhealthy threshold of 3 failures.
Graceful shutdown: 30-second drain on deploys so in-flight requests complete.
Session state: in Redis, not server memory — so we can use any algorithm without stickiness.

For the database tier: read replicas + HAProxy or ProxySQL to distribute SELECT queries
across replicas, while writes always go to the primary."
