Design a Distributed Rate Limiter

All 5 algorithms explained + distributed Redis implementation + API Gateway integration

June 2026 16 min read Mid / Senior Engineers

A rate limiter controls how many requests a client can make in a given time window — protecting your APIs from abuse, preventing DoS attacks, enforcing fair usage, and reducing infrastructure costs. It's asked in nearly every distributed systems interview.

What makes this interesting is the distributed coordination problem: with 100 API servers behind a load balancer, how do you enforce "100 requests per minute per user" globally, not just per-server?

Pre-reading: The System Design Interview Guide covers the general framework used in this article.

1 Requirements

Functional

Limit requests per user/IP/API key per time window
Support multiple rules: per-endpoint, per-user tier (free/paid), global
Return 429 Too Many Requests with Retry-After header
Return rate limit headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset
Rules configurable without code deploy

Non-Functional

Low latency: <5ms overhead per request
High accuracy: don't allow bursts that significantly exceed the limit
Fault tolerant: if rate limiter fails, allow requests (fail open) or deny all (fail closed) — configurable
Distributed: consistent limits across all API server instances
Scale: 100K+ RPS

2 The 5 Rate Limiting Algorithms

1. Fixed Window Counter

Divide time into fixed windows (e.g., each minute). Count requests per window. Reset counter at window boundary.

# Redis key: "rl:{user_id}:{window_start_minute}"
# Window: floor(now / 60) → e.g., timestamp 1719230400 → window key 28653840

def is_allowed_fixed_window(user_id, limit=100):
    window = int(time.time()) // 60
    key = f"rl:{user_id}:{window}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, 60)   # set TTL on first request
    return count <= limit

Simple Memory efficient Boundary spike bug

Critical flaw: A user can make 100 requests at 12:00:59 and 100 more at 12:01:01 — 200 requests in 2 seconds, while the limit is 100/minute.

2. Sliding Window Log

Store a sorted set of timestamps for each user's requests. Count timestamps within the last window_size seconds.

def is_allowed_sliding_log(user_id, limit=100, window_sec=60):
    now = time.time()
    window_start = now - window_sec
    key = f"rl_log:{user_id}"

    redis.zremrangebyscore(key, 0, window_start)     # remove old entries
    count = redis.zcard(key)
    if count < limit:
        redis.zadd(key, {str(now): now})             # add current timestamp
        redis.expire(key, window_sec + 1)
        return True
    return False

Accurate — no boundary spike High memory — O(N) per user

Stores every request timestamp. For 100 req/min limit × 1M users = 100M entries in Redis — high memory cost.

3. Sliding Window Counter Recommended

Hybrid of fixed window + sliding: use two adjacent fixed window counts + interpolate based on how far into the current window we are.

def is_allowed_sliding_counter(user_id, limit=100, window_sec=60):
    now = time.time()
    current_window = int(now) // window_sec
    prev_window = current_window - 1
    elapsed_in_window = now % window_sec    # seconds elapsed in current window

    curr_key = f"rl:{user_id}:{current_window}"
    prev_key = f"rl:{user_id}:{prev_window}"

    curr_count = int(redis.get(curr_key) or 0)
    prev_count = int(redis.get(prev_key) or 0)

    # Weight prev window by how much of it is still "in scope"
    # If 20s into a 60s window, prev window contributes 40/60 = 67%
    weight = 1.0 - (elapsed_in_window / window_sec)
    estimated_count = prev_count * weight + curr_count

    if estimated_count < limit:
        redis.incr(curr_key)
        redis.expire(curr_key, window_sec * 2)
        return True
    return False

Accurate Memory efficient — O(1) per user Simple Redis ops

Slight approximation (assumes prev window requests were uniformly distributed) but within 0.003% error at scale. Used by Cloudflare, Stripe, and most production systems.

4. Token Bucket

A bucket holds up to capacity tokens. Tokens refill at a steady rate. Each request consumes one token. Burst is naturally allowed up to bucket capacity.

def is_allowed_token_bucket(user_id, capacity=100, refill_rate=10):
    # refill_rate = tokens added per second
    key = f"tb:{user_id}"
    now = time.time()

    # Atomic Lua script (prevents race condition):
    lua_script = """
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local data = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(data[1]) or capacity
    local last_refill = tonumber(data[2]) or now

    -- Refill tokens since last request
    local elapsed = now - last_refill
    tokens = math.min(capacity, tokens + elapsed * refill_rate)

    if tokens >= 1 then
        tokens = tokens - 1
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return 1   -- allowed
    end
    return 0   -- denied
    """
    return redis.eval(lua_script, 1, key, capacity, refill_rate, now) == 1

Handles bursts gracefully Smooth rate Harder to reason about limit

Best for APIs where short bursts are acceptable (e.g., a user sending 20 messages rapidly is fine; sustained 100/sec is not).

5. Leaky Bucket

Requests enter a queue (the "bucket"). The queue drains at a fixed rate. If the queue is full, the request is dropped. Smooths out bursts — all requests exit at the same rate.

# Implemented as a FIFO queue with fixed processing rate
# Use case: async request processing where output rate must be constant
# e.g., payment processing: process exactly 50 payments/sec regardless of burst

# Less common in API rate limiting (most APIs want burst tolerance, not strict smoothing)
# More common in: network traffic shaping, IoT sensor ingestion, billing pipelines

Constant output rate No burst tolerance Queue adds latency

Algorithm Comparison

Algorithm	Memory	Burst Handling	Accuracy	Complexity
Fixed Window	O(1)	Spike at boundary	Low	Low
Sliding Window Log	O(N)	Handles well	High	Medium
Sliding Window Counter	O(1)	Handles well	~High	Low
Token Bucket	O(1)	Best — explicit burst	High	Medium
Leaky Bucket	O(N) queue	None — smoothed	High	Medium

For interviews: Recommend Sliding Window Counter for accuracy + memory efficiency, or Token Bucket if the system needs burst tolerance (most user-facing APIs).

3 The Distributed Problem

With a single API server, rate limiting is trivial — use an in-memory counter. The challenge is enforcing a global limit across 100 API servers.

Why Per-Server Counters Fail

# Setup: 100 API servers, user limit = 100 req/min
# Without distributed coordination:
# User sends 100 requests, load balancer distributes evenly
# → Each server sees 1 request → each server's counter = 1
# → All 100 allowed → but the user has sent 100 requests!
# Worse: user sends 500 requests (5 per server × 100 servers) — all allowed

# The fix: shared counter in Redis
# All 100 API servers increment the SAME Redis key
# Redis is single-threaded → INCR is atomic → no race condition

Redis-Based Distributed Rate Limiter

# Architecture:
Client → Load Balancer → API Server
                              ↓
                          Rate Limiter Middleware (in-process)
                              ↓
                          Redis Cluster (shared counter)
                              ↓ (if allowed)
                          Business Logic Handler

# Production implementation with Lua for atomicity:
RATE_LIMIT_SCRIPT = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local window_start = math.floor(now / window) * window
local full_key = key .. ':' .. window_start

local count = redis.call('INCR', full_key)
if count == 1 then
    redis.call('EXPIRE', full_key, window + 1)
end
local remaining = math.max(0, limit - count)
local reset = window_start + window
return {count, remaining, reset, limit}
"""

def check_rate_limit(user_id: str, endpoint: str) -> RateLimitResult:
    rule = get_rule(user_id, endpoint)   # from config store
    key = f"rl:{user_id}:{endpoint}"
    count, remaining, reset, limit = redis.eval(
        RATE_LIMIT_SCRIPT, 1, key, limit, rule.window_sec, time.time()
    )
    return RateLimitResult(
        allowed=count <= limit,
        limit=limit,
        remaining=remaining,
        reset=int(reset)
    )

# In the API middleware:
def rate_limit_middleware(request):
    result = check_rate_limit(request.user_id, request.endpoint)
    response.headers['X-RateLimit-Limit'] = result.limit
    response.headers['X-RateLimit-Remaining'] = result.remaining
    response.headers['X-RateLimit-Reset'] = result.reset
    if not result.allowed:
        return Response(429, headers={'Retry-After': result.reset - int(time.time())})

Redis Cluster for Scale

# Single Redis node: ~500K ops/sec → enough for 500K RPS rate limit checks
# For 1M+ RPS: Redis Cluster (16,384 hash slots across multiple nodes)
# Shard by user_id: hash(user_id) → consistent node for that user's counter

# Redis Cluster with 3 primary + 3 replica nodes:
# Throughput: 3 × 500K = 1.5M ops/sec
# Latency: <1ms within same datacenter

# Local cache optimization (for high-traffic users):
# Keep a small local counter per-server; sync to Redis every 100ms
# Allows up to N_servers × per_100ms_batch extra requests in worst case
# Acceptable approximation for most APIs; NOT for billing-critical limits

4 Full System Design

Components

Client Request
    ↓
Load Balancer (L7 — API Gateway or nginx)
    ↓
Rate Limiter Middleware (runs in API Gateway or as sidecar)
    ↓ (check)
Redis Cluster (distributed counters)
    ↓ (rules lookup)
Rule Config Store (Redis or etcd — rule definitions per user tier / endpoint)
    ↓ (if allowed)
API Server / Microservice

When denied:
    → 429 Too Many Requests
    → Headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After
    → Log to analytics (throttled requests dashboard)

Where to Place the Rate Limiter

Location	Pros	Cons
Client-side	Zero server load for throttled requests	Easily bypassed — never rely on client
API Gateway	Single enforcement point; language-agnostic; before request reaches service	Gateway is a SPOF — must be HA; can't access business context
Middleware (in-process)	Full access to request context; per-service control	Duplicated logic across services; harder to coordinate globally
Sidecar (service mesh)	No code change to services; Envoy/Istio handles it	Ops complexity; latency overhead per-hop

Best approach: API Gateway for coarse-grained limits (per IP, per API key globally); Middleware for fine-grained limits (per endpoint, per user tier). Layered defense.

Rule Configuration

# Rules stored in Redis or etcd (hot-reloadable without code deploy)
{
  "rules": [
    {
      "name": "global_ip_limit",
      "key": "ip:{ip_address}",
      "limit": 1000,
      "window_sec": 60,
      "action": "allow"   // or "deny", "throttle"
    },
    {
      "name": "free_tier_api_limit",
      "key": "user:{user_id}",
      "condition": "user.tier == 'free'",
      "limit": 100,
      "window_sec": 3600,
      "action": "deny"
    },
    {
      "name": "paid_tier_api_limit",
      "key": "user:{user_id}",
      "condition": "user.tier == 'paid'",
      "limit": 10000,
      "window_sec": 3600,
      "action": "deny"
    },
    {
      "name": "login_endpoint",
      "key": "ip:{ip_address}:login",
      "limit": 5,
      "window_sec": 900,   // 15 minutes
      "action": "deny"     // strict — prevent brute force
    }
  ]
}

# Rule evaluation order: most specific first (endpoint > user tier > IP > global)
# Rules hot-reloaded: API Gateway polls etcd every 5 seconds

5 Fault Tolerance & Edge Cases

Redis Failure — Fail Open vs Fail Closed

# Two strategies when Redis is unavailable:
# Fail Open: allow all requests (prefer availability over rate limiting)
#   → Risk: burst traffic passes through; DDoS protection is lost
#   → Best for: non-security rate limits (tier limits, cost controls)

# Fail Closed: deny all requests (prefer safety over availability)
#   → Risk: all users blocked until Redis recovers
#   → Best for: security-critical limits (login brute-force, payment fraud)

def check_rate_limit_with_fallback(user_id, endpoint):
    try:
        return check_rate_limit(user_id, endpoint)
    except RedisException:
        if FAIL_OPEN:
            return RateLimitResult(allowed=True, ...)   # fail open
        else:
            return RateLimitResult(allowed=False, ...)  # fail closed

# Mitigation: Redis Cluster with replication — single node failure doesn't affect limits
# Further mitigation: local in-process fallback counter (less accurate but functional)

Race Condition — Why Lua Scripts Are Essential

# Problem: without atomicity:
# Server A: count = redis.get(key) → 99
# Server B: count = redis.get(key) → 99
# Server A: if 99 < 100: redis.set(key, 100) → allowed
# Server B: if 99 < 100: redis.set(key, 100) → allowed (WRONG — should be 101)

# Solution 1: Lua script (atomic execution on Redis — no interleave possible)
# Solution 2: Redis transactions (WATCH + MULTI/EXEC) — slower, more complex
# Solution 3: Redis atomic INCR + check result

# INCR is always atomic in Redis:
count = redis.incr(key)   # always returns the new value after increment
if count == 1:
    redis.expire(key, window)
return count <= limit

Multi-Datacenter Rate Limiting

# Challenge: user makes 60 requests in US-East and 60 in EU-West simultaneously
# → Two Redis clusters, each counts independently → user gets 120 req against a 100 limit

# Option 1: Route all rate limit checks to a single global Redis cluster
#   Latency cost: 100–150ms cross-region round-trip → not acceptable for <5ms requirement

# Option 2: Per-region limits (accept the tradeoff)
#   Each region enforces 50% of global limit locally
#   Simple and fast; slightly less accurate for global limits

# Option 3: Global coordination via Synchronizer (async)
#   Each region keeps local counter + periodically syncs delta to global counter
#   Read global counter for enforcement; local for latency
#   Used by Stripe and similar high-throughput billing systems

# Option 4: Overcount + forgive approach
#   Allow limit + slack per region; detect and correct overage in next window
#   Works when exact precision is less critical than low latency

Headers: What to Return

HTTP/1.1 200 OK
X-RateLimit-Limit: 100          # total allowed in this window
X-RateLimit-Remaining: 47       # remaining requests in current window
X-RateLimit-Reset: 1719231060   # Unix timestamp when window resets

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1719231060
Retry-After: 42                 # seconds until next window
{
  "error": "rate_limit_exceeded",
  "message": "You have exceeded 100 requests per minute. Retry after 42 seconds.",
  "docs_url": "https://api.yourcompany.com/docs/rate-limits"
}

6 Trade-offs & Design Decisions

Sliding Window Counter vs Token Bucket: Sliding window counter is easier to explain to users ("100 requests per minute") and simpler to implement. Token bucket is better when you want explicit burst handling — users can burst up to the bucket size, then are limited to the refill rate. For most APIs, sliding window is the right choice.
Where to store counters: Redis is the clear choice — it's in-memory (fast), supports atomic operations (INCR, Lua), has TTL (auto-cleanup), and clusters easily. Database (MySQL/PostgreSQL) is too slow for the <5ms requirement. In-memory (per-server HashMap) doesn't work across multiple servers.
Consistency model: Rate limiting is one of the few cases where we want strong consistency across servers (so limits are accurate). Redis's single-threaded INCR gives us linearizability within a single node — exactly what we need.
Granularity: Per-user limits are more fair than per-IP limits (users can share IPs via NAT/VPN). But per-user requires authentication — for anonymous APIs, per-IP is the fallback. Layer both: deny at IP level first (cheaper), then per-user if authenticated.

What to Study Next

API Design (REST, GraphQL, gRPC) — APIs that need rate limiting
Consistent Hashing — used for Redis cluster sharding
URL Shortener Design — applies rate limiting to the write API
System Design Interview Framework — the approach used here
All System Design Topics →