Search autocomplete (or "typeahead") is one of the most latency-sensitive features you'll ever build. As a user types each character, suggestions must appear within 50 milliseconds — otherwise the dropdown feels laggy. At Google scale, that means processing billions of queries per day while returning relevant, personalised suggestions faster than a human blink.
This design covers the complete system: the trie data structure powering suggestions, how to rank and update suggestions in real time, caching strategy, and scaling to 100K queries per second.
Critical clarification: Does "autocomplete" mean prefix match only, or also fuzzy/typo-tolerant matching? Prefix-only is much simpler to design. Fuzzy matching (like Google's "did you mean?") requires edit distance computation — a separate system. Scope to prefix for the interview.
# Query volume: 100K autocomplete requests/sec Each request = one keypress → user types avg 5 chars → 100K/5 = 20K actual search submissions/sec # Trie size: 10B unique queries, avg 30 chars each = 300B characters But trie shares prefixes → actual memory much less Estimate: 26 children × 8 bytes pointer × 10B nodes = ~2 TB raw After compression (Patricia trie): ~100–200 GB → Must be stored in-memory for <50ms latency requirement # Cache for hot prefixes: Top 1% prefixes get 80% of traffic (Pareto) 100K prefixes × 200 bytes (5 suggestions × 40 chars each) = 20 MB → Entire hot prefix cache fits in L2 cache easily # Write throughput (query log processing): 20K search submissions/sec → 20K log events/sec to Kafka Batch processing every hour → trie rebuild or incremental update
A trie (prefix tree) is the core data structure. Each node represents a character. The path from root to any node spells a prefix. Nodes that end a valid query store its frequency score.
class TrieNode:
def __init__(self):
self.children: dict[str, TrieNode] = {} # char → child node
self.is_end: bool = False
self.frequency: int = 0
self.top_k: list[tuple[int, str]] = [] # pre-computed: [(freq, query), ...]
class Trie:
def __init__(self, k=5):
self.root = TrieNode()
self.k = k # top-K suggestions to return
def insert(self, query: str, frequency: int):
node = self.root
for char in query:
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
# Update top-K at each ancestor node (crucial optimisation):
node.top_k = self._merge_top_k(node.top_k, (frequency, query), self.k)
node.is_end = True
node.frequency = frequency
def search(self, prefix: str) -> list[str]:
node = self.root
for char in prefix:
if char not in node.children:
return []
node = node.children[char]
# top_k is pre-computed at every node — O(prefix_length) lookup!
return [query for freq, query in node.top_k]
def _merge_top_k(self, current_top_k, new_entry, k):
merged = current_top_k + [new_entry]
merged.sort(reverse=True) # sort by frequency descending
return merged[:k]
# Key insight: storing top_k at every node trades memory for speed
# Without top_k pre-computation: search requires DFS through entire subtree = O(N)
# With top_k pre-computation: search is O(prefix_length) = effectively O(1) per keystroke
# Memory cost of pre-computed top_k:
# Each node stores 5 × (8 bytes freq + 30 bytes query) = 190 bytes of top_k data
# 10B nodes × 190 bytes = 1.9 TB — too much
# Reality: most tries have far fewer nodes after compression
# Problem: single-char children chains waste memory # "search" = s→e→a→r→c→h — 6 nodes for one branch # Patricia Trie: collapse linear chains into single edges with string labels # "search" = s → "earch" (one node instead of 6) # Memory reduction: 10B characters → ~1B nodes (10× compression) # This is what real implementations use (Radix tree / PATRICIA trie) # In practice: use an existing library # Python: marisa-trie (memory-efficient, C++ backed) # Java: Apache Lucene's FST (Finite State Transducer) — used by Elasticsearch # Redis: use sorted sets per prefix (discussed below)
Two separate flows:
=== Query (Read) Path — latency critical ===
User keystroke → Client (debounce 100ms) → CDN cache → Autocomplete API
↓
Redis cache (prefix → top-5)
↓ (cache miss)
Trie Service (in-memory trie)
↓
Return top-5 suggestions
=== Update (Write) Path — batch, not real-time ===
User submits search → Search Log (Kafka) → Log Aggregator (Spark Streaming)
↓
Frequency Table (Cassandra)
↓ (hourly batch job)
Trie Builder (re-computes updated nodes)
↓
Trie Store (S3 — serialized snapshots)
↓ (hot-swap)
Trie Service (loads new snapshot)
// Don't fire an API call for every single keystroke — debounce by 100ms
// Without debounce: typing "system" = 6 API calls in rapid succession
// With debounce: only fires after user pauses typing for 100ms
// Reduces API calls by ~60% on average
const debounce = (fn, delay) => {
let timeout;
return (...args) => {
clearTimeout(timeout);
timeout = setTimeout(() => fn(...args), delay);
};
};
const fetchSuggestions = debounce(async (prefix) => {
if (prefix.length < 2) return; // don't suggest for 1-char prefix
const res = await fetch(`/api/autocomplete?q=${encodeURIComponent(prefix)}`);
const { suggestions } = await res.json();
displayDropdown(suggestions);
}, 100);
# Short prefixes (1–3 chars) are queried by essentially ALL users # "th" → "the", "the office", "therapy", "thanos", "thunder" # This result is the same for all users → perfect CDN cache candidate # Cache-Control header on autocomplete responses: # Short prefix (1–2 chars): Cache-Control: public, max-age=3600 (1hr CDN cache) # Medium prefix (3–5 chars): Cache-Control: public, max-age=300 (5min CDN cache) # Long prefix (6+ chars): Cache-Control: private, max-age=0 (no CDN cache — personalised) # With CDN: top 1% prefixes (which get 80% of traffic) are served from edge # → 80% of 100K RPS = 80K RPS served by CDN nodes globally # → Only 20K RPS reach your origin servers
# For cache misses at CDN → check Redis before hitting trie in memory
# Redis sorted set per prefix: ZADD prefix:sys score query
# ZADD "prefix:sys" 380000 "system" 290000 "system design" 175000 "systems"
# ZREVRANGE "prefix:sys" 0 4 → top-5 in O(log N)
# OR simpler: Redis string cache (prefix → JSON array of suggestions)
redis.setex(f"ac:{prefix}", 300, json.dumps(suggestions)) # 5-min TTL
# Cache warm-up: on trie update, pre-populate Redis for top 10K prefixes
# These are the prefixes that will get the most traffic
# Cache invalidation: when trie is rebuilt (hourly), invalidate affected prefix keys
# Use Redis SCAN + DELETE pattern match "ac:{first_3_chars}*"
# For cache misses: query the in-memory trie
# The trie is loaded into RAM once and serves many requests
# Lookup: O(prefix_length) — effectively O(1) for typical 5-char prefix
def get_suggestions(prefix: str) -> list[str]:
prefix = prefix.lower().strip()
if len(prefix) < 1 or len(prefix) > 50:
return []
# Try cache first
cached = redis.get(f"ac:{prefix}")
if cached:
return json.loads(cached)
# Hit trie
suggestions = trie.search(prefix)
# Cache result
if suggestions:
redis.setex(f"ac:{prefix}", 300, json.dumps(suggestions))
return suggestions
# API response:
GET /api/autocomplete?q=sys
{
"prefix": "sys",
"suggestions": [
"system design",
"system",
"systems",
"syscall",
"system requirements"
],
"source": "cache" // for debugging: "cache" | "trie"
}
# Step 1: Every search submission → Kafka topic "search-queries"
# Payload: { user_id, query, timestamp, locale, session_id }
# Step 2: Spark Streaming aggregation (sliding window)
# Group by (query, 1-hour window) → count occurrences
# Output: { query: "system design", frequency: 48273, window: "2026-06-24T14:00" }
# Step 3: Frequency table in Cassandra (upsert with decay)
CREATE TABLE query_frequencies (
query TEXT PRIMARY KEY,
frequency COUNTER, -- Cassandra counter type: atomic increment
last_seen TIMESTAMP,
locale TEXT
);
UPDATE query_frequencies SET frequency = frequency + 1 WHERE query = ?;
# Frequency decay: queries trending down should be demoted
# Apply exponential decay: freq_today = freq_yesterday × 0.9 + new_count_today
# This means a query needs sustained search volume to stay in top suggestions
# Option A: Full rebuild (simpler, recommended for most scales)
def rebuild_trie():
# 1. Read top 10M queries by frequency from Cassandra
top_queries = cassandra.query(
"SELECT query, frequency FROM query_frequencies ORDER BY frequency DESC LIMIT 10000000"
)
# 2. Build new trie in memory (takes ~5 minutes for 10M queries)
new_trie = Trie(k=5)
for query, freq in top_queries:
new_trie.insert(query, freq)
# 3. Serialize to binary (pickle / protobuf) and upload to S3
snapshot = serialize(new_trie)
s3.put("trie-snapshots/trie-latest.bin", snapshot)
# 4. Signal trie service workers to hot-swap (blue-green deployment)
redis.publish("trie-reload", "trie-snapshots/trie-latest.bin")
# Hot-swap: each trie server keeps TWO tries in memory
# Active trie: serves requests
# Standby trie: downloads + loads new snapshot
# On reload signal: atomic swap (pointer swap) — zero downtime
# Option B: Incremental update (more complex, for real-time requirements)
# Update only the affected nodes (queries that changed frequency since last build)
# Harder to implement correctly — full rebuild at 10M queries takes only 5 min anyway
# Problem: "World Cup 2026" suddenly trends → not in trie yet → missing from suggestions
# Hourly rebuild means 59-minute lag → bad for breaking news / live events
# Solution: two-tier suggestion system
# Tier 1: Pre-built trie (top 10M historical queries, updated hourly)
# Tier 2: Real-time trending layer (Redis sorted set, updated every 5 minutes)
# Trending queries identified by Spark: sudden spike relative to 7-day baseline
# If query volume > 3× baseline in last 15 minutes → mark as "trending"
# Merge in API layer:
def get_suggestions(prefix):
trie_results = trie.search(prefix) # historical top-K
trending = redis.zrevrange(f"trending:{prefix}", 0, 2) # top-3 trending
# Merge: trending first (flagged with ⬆ icon), then historical
merged = dedupe(trending + trie_results)
return merged[:5]
# The trie is read-only → trivially horizontally scalable # Each server loads the full trie snapshot into RAM (~100GB after compression) # 100K QPS, each request takes <1ms → 100K RPS/server is feasible (event-driven) # But trie lookup is CPU-bound, not I/O-bound → 1 request per core (not async) # Practical estimate: 16-core server, 100K QPS → 6.25K RPS/core → <1ms each → fine # For safety: 5 servers × 20K RPS each = 100K QPS with N+2 redundancy # Deployment: 5 trie servers behind L7 load balancer # All 5 serve the same data (identical trie snapshot) # No sharding needed — the full trie fits in RAM # Trie snapshot size: # 10M queries × 40 chars avg = 400 MB raw text # Patricia trie compression → ~50–100 MB in memory (with top-K precomputed) # Even if 10× larger: 1 GB per server → trivial with 128 GB RAM servers
# Deploy trie servers in each region (US, EU, APAC) # CDN routes users to nearest region: 60% US, 25% EU, 15% APAC # Trie snapshot replicated to all regions from S3 (global CDN for the snapshot itself) # Result: typical user gets suggestions from a server <50ms away → total latency <10ms
# Basic autocomplete: same suggestions for all users with same prefix
# Personalised autocomplete: factor in user's search history + location
# Implementation:
# 1. Get global top-K suggestions for prefix (trie)
# 2. Get user's personal top searches for prefix (Redis: "user:{id}:ac:{prefix}")
# 3. Merge: personal suggestions first, then global, deduplicated
# User's personal search history stored in Redis:
# Key: "user:{user_id}:history" Value: sorted set of {query: last_searched_timestamp}
# On search: ZADD "user:{user_id}:history" {timestamp} {query}; ZREMRANGEBYRANK ... 0 -101
# Personalisation trade-off: increases query complexity, can't use CDN cache
# Disable for anonymous users; enable for logged-in users with history > 10 searches
| Decision | Chosen Approach | Trade-off |
|---|---|---|
| Data structure | Patricia Trie in RAM with pre-computed top-K | High memory use vs O(1) lookup. Worth it — RAM is cheap, latency is not. |
| Update strategy | Hourly batch rebuild + hot-swap | Simple, reliable, zero-downtime. Trade-off: 1-hour freshness lag. Acceptable for most apps. |
| Trie sharding | Full trie on each server (no sharding) | Simpler routing, no cross-shard joins. Works as long as trie fits in RAM (~100GB after compression). Shard by first character only if trie exceeds available RAM. |
| Cache placement | CDN → Redis → Trie (three tiers) | CDN handles 80% of traffic for common prefixes. Redis handles rare cache misses. Trie is last resort. Inverted dependency — CDN hit means <5ms; trie hit means <50ms. |
| Ranking signal | Query frequency only | Simple, explainable, fast to compute. In production, also factor: recency (trending), user's location, query completion rate. More signals = better suggestions but more complex ranking. |
# Option 1: Shard trie by first character (26 shards or by frequency) # Route "a*" queries to shard A, "b*" to shard B, etc. # Problem: uneven distribution — "s" is far more common than "x" # Better: shard by frequency bucket (top 1M queries on fast servers, rest on slower) # Option 2: Use Elasticsearch instead of custom trie # Elasticsearch has built-in completion suggester (edge n-gram based, not trie) # Pros: managed infrastructure, handles 10B+ terms, built-in ranking # Cons: higher latency (5–20ms vs <1ms for in-memory trie), more infrastructure # Use when: you also need fuzzy matching, multi-language, faceted suggestions # Option 3: Redis ZRANGEBYLEX (for moderate scale, up to 1M unique queries) # Store all queries in a sorted set with empty scores (alphabetical ordering) # ZRANGEBYLEX "autocomplete" "[sys" "[sys\xff" LIMIT 0 5 # Fast, simple, no custom trie — but limited to alphabetical ranking (no frequency)