A rate limiter controls how many requests a client can make in a given time window — protecting your APIs from abuse, preventing DoS attacks, enforcing fair usage, and reducing infrastructure costs. It's asked in nearly every distributed systems interview.
What makes this interesting is the distributed coordination problem: with 100 API servers behind a load balancer, how do you enforce "100 requests per minute per user" globally, not just per-server?
429 Too Many Requests with Retry-After headerX-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-ResetDivide time into fixed windows (e.g., each minute). Count requests per window. Reset counter at window boundary.
# Redis key: "rl:{user_id}:{window_start_minute}"
# Window: floor(now / 60) → e.g., timestamp 1719230400 → window key 28653840
def is_allowed_fixed_window(user_id, limit=100):
window = int(time.time()) // 60
key = f"rl:{user_id}:{window}"
count = redis.incr(key)
if count == 1:
redis.expire(key, 60) # set TTL on first request
return count <= limit
Simple
Memory efficient
Boundary spike bug
Critical flaw: A user can make 100 requests at 12:00:59 and 100 more at 12:01:01 — 200 requests in 2 seconds, while the limit is 100/minute.
Store a sorted set of timestamps for each user's requests. Count timestamps within the last window_size seconds.
def is_allowed_sliding_log(user_id, limit=100, window_sec=60):
now = time.time()
window_start = now - window_sec
key = f"rl_log:{user_id}"
redis.zremrangebyscore(key, 0, window_start) # remove old entries
count = redis.zcard(key)
if count < limit:
redis.zadd(key, {str(now): now}) # add current timestamp
redis.expire(key, window_sec + 1)
return True
return False
Accurate — no boundary spike
High memory — O(N) per user
Stores every request timestamp. For 100 req/min limit × 1M users = 100M entries in Redis — high memory cost.
Hybrid of fixed window + sliding: use two adjacent fixed window counts + interpolate based on how far into the current window we are.
def is_allowed_sliding_counter(user_id, limit=100, window_sec=60):
now = time.time()
current_window = int(now) // window_sec
prev_window = current_window - 1
elapsed_in_window = now % window_sec # seconds elapsed in current window
curr_key = f"rl:{user_id}:{current_window}"
prev_key = f"rl:{user_id}:{prev_window}"
curr_count = int(redis.get(curr_key) or 0)
prev_count = int(redis.get(prev_key) or 0)
# Weight prev window by how much of it is still "in scope"
# If 20s into a 60s window, prev window contributes 40/60 = 67%
weight = 1.0 - (elapsed_in_window / window_sec)
estimated_count = prev_count * weight + curr_count
if estimated_count < limit:
redis.incr(curr_key)
redis.expire(curr_key, window_sec * 2)
return True
return False
Accurate
Memory efficient — O(1) per user
Simple Redis ops
Slight approximation (assumes prev window requests were uniformly distributed) but within 0.003% error at scale. Used by Cloudflare, Stripe, and most production systems.
A bucket holds up to capacity tokens. Tokens refill at a steady rate. Each request consumes one token. Burst is naturally allowed up to bucket capacity.
def is_allowed_token_bucket(user_id, capacity=100, refill_rate=10):
# refill_rate = tokens added per second
key = f"tb:{user_id}"
now = time.time()
# Atomic Lua script (prevents race condition):
lua_script = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now
-- Refill tokens since last request
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)
if tokens >= 1 then
tokens = tokens - 1
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return 1 -- allowed
end
return 0 -- denied
"""
return redis.eval(lua_script, 1, key, capacity, refill_rate, now) == 1
Handles bursts gracefully
Smooth rate
Harder to reason about limit
Best for APIs where short bursts are acceptable (e.g., a user sending 20 messages rapidly is fine; sustained 100/sec is not).
Requests enter a queue (the "bucket"). The queue drains at a fixed rate. If the queue is full, the request is dropped. Smooths out bursts — all requests exit at the same rate.
# Implemented as a FIFO queue with fixed processing rate # Use case: async request processing where output rate must be constant # e.g., payment processing: process exactly 50 payments/sec regardless of burst # Less common in API rate limiting (most APIs want burst tolerance, not strict smoothing) # More common in: network traffic shaping, IoT sensor ingestion, billing pipelinesConstant output rate No burst tolerance Queue adds latency
| Algorithm | Memory | Burst Handling | Accuracy | Complexity |
|---|---|---|---|---|
| Fixed Window | O(1) | Spike at boundary | Low | Low |
| Sliding Window Log | O(N) | Handles well | High | Medium |
| Sliding Window Counter | O(1) | Handles well | ~High | Low |
| Token Bucket | O(1) | Best — explicit burst | High | Medium |
| Leaky Bucket | O(N) queue | None — smoothed | High | Medium |
With a single API server, rate limiting is trivial — use an in-memory counter. The challenge is enforcing a global limit across 100 API servers.
# Setup: 100 API servers, user limit = 100 req/min # Without distributed coordination: # User sends 100 requests, load balancer distributes evenly # → Each server sees 1 request → each server's counter = 1 # → All 100 allowed → but the user has sent 100 requests! # Worse: user sends 500 requests (5 per server × 100 servers) — all allowed # The fix: shared counter in Redis # All 100 API servers increment the SAME Redis key # Redis is single-threaded → INCR is atomic → no race condition
# Architecture:
Client → Load Balancer → API Server
↓
Rate Limiter Middleware (in-process)
↓
Redis Cluster (shared counter)
↓ (if allowed)
Business Logic Handler
# Production implementation with Lua for atomicity:
RATE_LIMIT_SCRIPT = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local window_start = math.floor(now / window) * window
local full_key = key .. ':' .. window_start
local count = redis.call('INCR', full_key)
if count == 1 then
redis.call('EXPIRE', full_key, window + 1)
end
local remaining = math.max(0, limit - count)
local reset = window_start + window
return {count, remaining, reset, limit}
"""
def check_rate_limit(user_id: str, endpoint: str) -> RateLimitResult:
rule = get_rule(user_id, endpoint) # from config store
key = f"rl:{user_id}:{endpoint}"
count, remaining, reset, limit = redis.eval(
RATE_LIMIT_SCRIPT, 1, key, limit, rule.window_sec, time.time()
)
return RateLimitResult(
allowed=count <= limit,
limit=limit,
remaining=remaining,
reset=int(reset)
)
# In the API middleware:
def rate_limit_middleware(request):
result = check_rate_limit(request.user_id, request.endpoint)
response.headers['X-RateLimit-Limit'] = result.limit
response.headers['X-RateLimit-Remaining'] = result.remaining
response.headers['X-RateLimit-Reset'] = result.reset
if not result.allowed:
return Response(429, headers={'Retry-After': result.reset - int(time.time())})
# Single Redis node: ~500K ops/sec → enough for 500K RPS rate limit checks # For 1M+ RPS: Redis Cluster (16,384 hash slots across multiple nodes) # Shard by user_id: hash(user_id) → consistent node for that user's counter # Redis Cluster with 3 primary + 3 replica nodes: # Throughput: 3 × 500K = 1.5M ops/sec # Latency: <1ms within same datacenter # Local cache optimization (for high-traffic users): # Keep a small local counter per-server; sync to Redis every 100ms # Allows up to N_servers × per_100ms_batch extra requests in worst case # Acceptable approximation for most APIs; NOT for billing-critical limits
Client Request
↓
Load Balancer (L7 — API Gateway or nginx)
↓
Rate Limiter Middleware (runs in API Gateway or as sidecar)
↓ (check)
Redis Cluster (distributed counters)
↓ (rules lookup)
Rule Config Store (Redis or etcd — rule definitions per user tier / endpoint)
↓ (if allowed)
API Server / Microservice
When denied:
→ 429 Too Many Requests
→ Headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After
→ Log to analytics (throttled requests dashboard)
| Location | Pros | Cons |
|---|---|---|
| Client-side | Zero server load for throttled requests | Easily bypassed — never rely on client |
| API Gateway | Single enforcement point; language-agnostic; before request reaches service | Gateway is a SPOF — must be HA; can't access business context |
| Middleware (in-process) | Full access to request context; per-service control | Duplicated logic across services; harder to coordinate globally |
| Sidecar (service mesh) | No code change to services; Envoy/Istio handles it | Ops complexity; latency overhead per-hop |
# Rules stored in Redis or etcd (hot-reloadable without code deploy)
{
"rules": [
{
"name": "global_ip_limit",
"key": "ip:{ip_address}",
"limit": 1000,
"window_sec": 60,
"action": "allow" // or "deny", "throttle"
},
{
"name": "free_tier_api_limit",
"key": "user:{user_id}",
"condition": "user.tier == 'free'",
"limit": 100,
"window_sec": 3600,
"action": "deny"
},
{
"name": "paid_tier_api_limit",
"key": "user:{user_id}",
"condition": "user.tier == 'paid'",
"limit": 10000,
"window_sec": 3600,
"action": "deny"
},
{
"name": "login_endpoint",
"key": "ip:{ip_address}:login",
"limit": 5,
"window_sec": 900, // 15 minutes
"action": "deny" // strict — prevent brute force
}
]
}
# Rule evaluation order: most specific first (endpoint > user tier > IP > global)
# Rules hot-reloaded: API Gateway polls etcd every 5 seconds
# Two strategies when Redis is unavailable:
# Fail Open: allow all requests (prefer availability over rate limiting)
# → Risk: burst traffic passes through; DDoS protection is lost
# → Best for: non-security rate limits (tier limits, cost controls)
# Fail Closed: deny all requests (prefer safety over availability)
# → Risk: all users blocked until Redis recovers
# → Best for: security-critical limits (login brute-force, payment fraud)
def check_rate_limit_with_fallback(user_id, endpoint):
try:
return check_rate_limit(user_id, endpoint)
except RedisException:
if FAIL_OPEN:
return RateLimitResult(allowed=True, ...) # fail open
else:
return RateLimitResult(allowed=False, ...) # fail closed
# Mitigation: Redis Cluster with replication — single node failure doesn't affect limits
# Further mitigation: local in-process fallback counter (less accurate but functional)
# Problem: without atomicity:
# Server A: count = redis.get(key) → 99
# Server B: count = redis.get(key) → 99
# Server A: if 99 < 100: redis.set(key, 100) → allowed
# Server B: if 99 < 100: redis.set(key, 100) → allowed (WRONG — should be 101)
# Solution 1: Lua script (atomic execution on Redis — no interleave possible)
# Solution 2: Redis transactions (WATCH + MULTI/EXEC) — slower, more complex
# Solution 3: Redis atomic INCR + check result
# INCR is always atomic in Redis:
count = redis.incr(key) # always returns the new value after increment
if count == 1:
redis.expire(key, window)
return count <= limit
# Challenge: user makes 60 requests in US-East and 60 in EU-West simultaneously # → Two Redis clusters, each counts independently → user gets 120 req against a 100 limit # Option 1: Route all rate limit checks to a single global Redis cluster # Latency cost: 100–150ms cross-region round-trip → not acceptable for <5ms requirement # Option 2: Per-region limits (accept the tradeoff) # Each region enforces 50% of global limit locally # Simple and fast; slightly less accurate for global limits # Option 3: Global coordination via Synchronizer (async) # Each region keeps local counter + periodically syncs delta to global counter # Read global counter for enforcement; local for latency # Used by Stripe and similar high-throughput billing systems # Option 4: Overcount + forgive approach # Allow limit + slack per region; detect and correct overage in next window # Works when exact precision is less critical than low latency
HTTP/1.1 200 OK
X-RateLimit-Limit: 100 # total allowed in this window
X-RateLimit-Remaining: 47 # remaining requests in current window
X-RateLimit-Reset: 1719231060 # Unix timestamp when window resets
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1719231060
Retry-After: 42 # seconds until next window
{
"error": "rate_limit_exceeded",
"message": "You have exceeded 100 requests per minute. Retry after 42 seconds.",
"docs_url": "https://api.yourcompany.com/docs/rate-limits"
}