Design Search Autocomplete (like Google / Amazon)

Trie data structure, top-K suggestions, real-time ranking updates, and 100K QPS with <50ms latency

Search autocomplete (or "typeahead") is one of the most latency-sensitive features you'll ever build. As a user types each character, suggestions must appear within 50 milliseconds — otherwise the dropdown feels laggy. At Google scale, that means processing billions of queries per day while returning relevant, personalised suggestions faster than a human blink.

This design covers the complete system: the trie data structure powering suggestions, how to rank and update suggestions in real time, caching strategy, and scaling to 100K queries per second.

Pre-reading: The System Design Interview Guide covers the general approach used throughout.

1 Requirements

Functional

  • Return top-5 suggestions for any prefix the user types
  • Suggestions ranked by query frequency (most searched first)
  • Support all lowercase English characters (extend to Unicode later)
  • Suggestions updated in near real-time (within 1 hour of trending queries)
  • Optional: personalised suggestions based on user's search history
  • Optional: filter offensive/inappropriate suggestions

Non-Functional

  • 100K queries per second
  • Latency: <50ms p99 (from keypress to suggestions displayed)
  • High availability: 99.9%
  • Eventual consistency: suggestion freshness lag of up to 1 hour is acceptable
  • Scale: support 10B unique queries in the dataset

Critical clarification: Does "autocomplete" mean prefix match only, or also fuzzy/typo-tolerant matching? Prefix-only is much simpler to design. Fuzzy matching (like Google's "did you mean?") requires edit distance computation — a separate system. Scope to prefix for the interview.

2 Capacity Estimation

# Query volume:
100K autocomplete requests/sec
Each request = one keypress → user types avg 5 chars → 100K/5 = 20K actual search submissions/sec

# Trie size:
10B unique queries, avg 30 chars each = 300B characters
But trie shares prefixes → actual memory much less
Estimate: 26 children × 8 bytes pointer × 10B nodes = ~2 TB raw
After compression (Patricia trie): ~100–200 GB
→ Must be stored in-memory for <50ms latency requirement

# Cache for hot prefixes:
Top 1% prefixes get 80% of traffic (Pareto)
100K prefixes × 200 bytes (5 suggestions × 40 chars each) = 20 MB
→ Entire hot prefix cache fits in L2 cache easily

# Write throughput (query log processing):
20K search submissions/sec → 20K log events/sec to Kafka
Batch processing every hour → trie rebuild or incremental update

3 The Trie Data Structure

A trie (prefix tree) is the core data structure. Each node represents a character. The path from root to any node spells a prefix. Nodes that end a valid query store its frequency score.

Queries: "system", "system design", "systems", "search", "search engine" Trie structure: root ├── s │ ├── e │ │ └── a │ │ └── r │ │ └── c │ │ └── h [freq:450000] │ │ └── " engine" [freq:220000] │ └── y │ └── s │ └── t │ └── e │ └── m [freq:380000] │ ├── " design" [freq:290000] │ └── s [freq:175000] Query for prefix "sys" → traverse to node 's'→'y'→'s' → collect all words in subtree ranked by frequency → return: ["system design", "system", "systems"]
class TrieNode:
    def __init__(self):
        self.children: dict[str, TrieNode] = {}   # char → child node
        self.is_end: bool = False
        self.frequency: int = 0
        self.top_k: list[tuple[int, str]] = []    # pre-computed: [(freq, query), ...]

class Trie:
    def __init__(self, k=5):
        self.root = TrieNode()
        self.k = k   # top-K suggestions to return

    def insert(self, query: str, frequency: int):
        node = self.root
        for char in query:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
            # Update top-K at each ancestor node (crucial optimisation):
            node.top_k = self._merge_top_k(node.top_k, (frequency, query), self.k)
        node.is_end = True
        node.frequency = frequency

    def search(self, prefix: str) -> list[str]:
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        # top_k is pre-computed at every node — O(prefix_length) lookup!
        return [query for freq, query in node.top_k]

    def _merge_top_k(self, current_top_k, new_entry, k):
        merged = current_top_k + [new_entry]
        merged.sort(reverse=True)   # sort by frequency descending
        return merged[:k]

# Key insight: storing top_k at every node trades memory for speed
# Without top_k pre-computation: search requires DFS through entire subtree = O(N)
# With top_k pre-computation:    search is O(prefix_length) = effectively O(1) per keystroke

# Memory cost of pre-computed top_k:
# Each node stores 5 × (8 bytes freq + 30 bytes query) = 190 bytes of top_k data
# 10B nodes × 190 bytes = 1.9 TB — too much
# Reality: most tries have far fewer nodes after compression

Patricia Trie (Compressed Trie) — Memory Optimisation

# Problem: single-char children chains waste memory
# "search" = s→e→a→r→c→h — 6 nodes for one branch

# Patricia Trie: collapse linear chains into single edges with string labels
# "search" = s → "earch" (one node instead of 6)

# Memory reduction: 10B characters → ~1B nodes (10× compression)
# This is what real implementations use (Radix tree / PATRICIA trie)

# In practice: use an existing library
# Python: marisa-trie (memory-efficient, C++ backed)
# Java: Apache Lucene's FST (Finite State Transducer) — used by Elasticsearch
# Redis: use sorted sets per prefix (discussed below)

4 High-Level Design

Two separate flows:

=== Query (Read) Path — latency critical ===
User keystroke → Client (debounce 100ms) → CDN cache → Autocomplete API
                                                            ↓
                                                     Redis cache (prefix → top-5)
                                                            ↓ (cache miss)
                                                     Trie Service (in-memory trie)
                                                            ↓
                                                     Return top-5 suggestions

=== Update (Write) Path — batch, not real-time ===
User submits search → Search Log (Kafka) → Log Aggregator (Spark Streaming)
                                                  ↓
                                     Frequency Table (Cassandra)
                                                  ↓ (hourly batch job)
                                     Trie Builder (re-computes updated nodes)
                                                  ↓
                                     Trie Store (S3 — serialized snapshots)
                                                  ↓ (hot-swap)
                                     Trie Service (loads new snapshot)
The read path and write path are completely decoupled. The trie is read-optimised and updated in batch (hourly). This allows <50ms read latency without being slowed by write complexity.

5 Read Path — Getting Suggestions Fast

Client-Side Debouncing

// Don't fire an API call for every single keystroke — debounce by 100ms
// Without debounce: typing "system" = 6 API calls in rapid succession
// With debounce: only fires after user pauses typing for 100ms
// Reduces API calls by ~60% on average

const debounce = (fn, delay) => {
  let timeout;
  return (...args) => {
    clearTimeout(timeout);
    timeout = setTimeout(() => fn(...args), delay);
  };
};

const fetchSuggestions = debounce(async (prefix) => {
  if (prefix.length < 2) return;   // don't suggest for 1-char prefix
  const res = await fetch(`/api/autocomplete?q=${encodeURIComponent(prefix)}`);
  const { suggestions } = await res.json();
  displayDropdown(suggestions);
}, 100);

CDN Caching for Common Prefixes

# Short prefixes (1–3 chars) are queried by essentially ALL users
# "th" → "the", "the office", "therapy", "thanos", "thunder"
# This result is the same for all users → perfect CDN cache candidate

# Cache-Control header on autocomplete responses:
# Short prefix (1–2 chars): Cache-Control: public, max-age=3600   (1hr CDN cache)
# Medium prefix (3–5 chars): Cache-Control: public, max-age=300   (5min CDN cache)
# Long prefix (6+ chars): Cache-Control: private, max-age=0       (no CDN cache — personalised)

# With CDN: top 1% prefixes (which get 80% of traffic) are served from edge
# → 80% of 100K RPS = 80K RPS served by CDN nodes globally
# → Only 20K RPS reach your origin servers

Redis Cache Layer

# For cache misses at CDN → check Redis before hitting trie in memory
# Redis sorted set per prefix: ZADD prefix:sys score query
# ZADD "prefix:sys" 380000 "system" 290000 "system design" 175000 "systems"
# ZREVRANGE "prefix:sys" 0 4 → top-5 in O(log N)

# OR simpler: Redis string cache (prefix → JSON array of suggestions)
redis.setex(f"ac:{prefix}", 300, json.dumps(suggestions))  # 5-min TTL

# Cache warm-up: on trie update, pre-populate Redis for top 10K prefixes
# These are the prefixes that will get the most traffic

# Cache invalidation: when trie is rebuilt (hourly), invalidate affected prefix keys
# Use Redis SCAN + DELETE pattern match "ac:{first_3_chars}*"

Trie Service — In-Memory Lookup

# For cache misses: query the in-memory trie
# The trie is loaded into RAM once and serves many requests
# Lookup: O(prefix_length) — effectively O(1) for typical 5-char prefix

def get_suggestions(prefix: str) -> list[str]:
    prefix = prefix.lower().strip()
    if len(prefix) < 1 or len(prefix) > 50:
        return []

    # Try cache first
    cached = redis.get(f"ac:{prefix}")
    if cached:
        return json.loads(cached)

    # Hit trie
    suggestions = trie.search(prefix)

    # Cache result
    if suggestions:
        redis.setex(f"ac:{prefix}", 300, json.dumps(suggestions))

    return suggestions

# API response:
GET /api/autocomplete?q=sys
{
  "prefix": "sys",
  "suggestions": [
    "system design",
    "system",
    "systems",
    "syscall",
    "system requirements"
  ],
  "source": "cache"  // for debugging: "cache" | "trie"
}

6 Write Path — Keeping Suggestions Fresh

Query Log Pipeline

# Step 1: Every search submission → Kafka topic "search-queries"
# Payload: { user_id, query, timestamp, locale, session_id }

# Step 2: Spark Streaming aggregation (sliding window)
# Group by (query, 1-hour window) → count occurrences
# Output: { query: "system design", frequency: 48273, window: "2026-06-24T14:00" }

# Step 3: Frequency table in Cassandra (upsert with decay)
CREATE TABLE query_frequencies (
  query       TEXT PRIMARY KEY,
  frequency   COUNTER,            -- Cassandra counter type: atomic increment
  last_seen   TIMESTAMP,
  locale      TEXT
);
UPDATE query_frequencies SET frequency = frequency + 1 WHERE query = ?;

# Frequency decay: queries trending down should be demoted
# Apply exponential decay: freq_today = freq_yesterday × 0.9 + new_count_today
# This means a query needs sustained search volume to stay in top suggestions

Trie Rebuild (Hourly Batch Job)

# Option A: Full rebuild (simpler, recommended for most scales)
def rebuild_trie():
    # 1. Read top 10M queries by frequency from Cassandra
    top_queries = cassandra.query(
        "SELECT query, frequency FROM query_frequencies ORDER BY frequency DESC LIMIT 10000000"
    )
    # 2. Build new trie in memory (takes ~5 minutes for 10M queries)
    new_trie = Trie(k=5)
    for query, freq in top_queries:
        new_trie.insert(query, freq)

    # 3. Serialize to binary (pickle / protobuf) and upload to S3
    snapshot = serialize(new_trie)
    s3.put("trie-snapshots/trie-latest.bin", snapshot)

    # 4. Signal trie service workers to hot-swap (blue-green deployment)
    redis.publish("trie-reload", "trie-snapshots/trie-latest.bin")

# Hot-swap: each trie server keeps TWO tries in memory
# Active trie: serves requests
# Standby trie: downloads + loads new snapshot
# On reload signal: atomic swap (pointer swap) — zero downtime

# Option B: Incremental update (more complex, for real-time requirements)
# Update only the affected nodes (queries that changed frequency since last build)
# Harder to implement correctly — full rebuild at 10M queries takes only 5 min anyway

Handling Trending Queries

# Problem: "World Cup 2026" suddenly trends → not in trie yet → missing from suggestions
# Hourly rebuild means 59-minute lag → bad for breaking news / live events

# Solution: two-tier suggestion system
# Tier 1: Pre-built trie (top 10M historical queries, updated hourly)
# Tier 2: Real-time trending layer (Redis sorted set, updated every 5 minutes)

# Trending queries identified by Spark: sudden spike relative to 7-day baseline
# If query volume > 3× baseline in last 15 minutes → mark as "trending"

# Merge in API layer:
def get_suggestions(prefix):
    trie_results = trie.search(prefix)          # historical top-K
    trending = redis.zrevrange(f"trending:{prefix}", 0, 2)  # top-3 trending

    # Merge: trending first (flagged with ⬆ icon), then historical
    merged = dedupe(trending + trie_results)
    return merged[:5]

7 Scaling to 100K QPS

Trie Service Horizontal Scaling

# The trie is read-only → trivially horizontally scalable
# Each server loads the full trie snapshot into RAM (~100GB after compression)
# 100K QPS, each request takes <1ms → 100K RPS/server is feasible (event-driven)
# But trie lookup is CPU-bound, not I/O-bound → 1 request per core (not async)

# Practical estimate: 16-core server, 100K QPS → 6.25K RPS/core → <1ms each → fine
# For safety: 5 servers × 20K RPS each = 100K QPS with N+2 redundancy

# Deployment: 5 trie servers behind L7 load balancer
# All 5 serve the same data (identical trie snapshot)
# No sharding needed — the full trie fits in RAM

# Trie snapshot size:
# 10M queries × 40 chars avg = 400 MB raw text
# Patricia trie compression → ~50–100 MB in memory (with top-K precomputed)
# Even if 10× larger: 1 GB per server → trivial with 128 GB RAM servers

Multi-Region Deployment

# Deploy trie servers in each region (US, EU, APAC)
# CDN routes users to nearest region: 60% US, 25% EU, 15% APAC
# Trie snapshot replicated to all regions from S3 (global CDN for the snapshot itself)
# Result: typical user gets suggestions from a server <50ms away → total latency <10ms

Personalisation (Advanced)

# Basic autocomplete: same suggestions for all users with same prefix
# Personalised autocomplete: factor in user's search history + location

# Implementation:
# 1. Get global top-K suggestions for prefix (trie)
# 2. Get user's personal top searches for prefix (Redis: "user:{id}:ac:{prefix}")
# 3. Merge: personal suggestions first, then global, deduplicated

# User's personal search history stored in Redis:
# Key: "user:{user_id}:history"  Value: sorted set of {query: last_searched_timestamp}
# On search: ZADD "user:{user_id}:history" {timestamp} {query}; ZREMRANGEBYRANK ... 0 -101

# Personalisation trade-off: increases query complexity, can't use CDN cache
# Disable for anonymous users; enable for logged-in users with history > 10 searches

8 Trade-offs & Design Decisions

DecisionChosen ApproachTrade-off
Data structure Patricia Trie in RAM with pre-computed top-K High memory use vs O(1) lookup. Worth it — RAM is cheap, latency is not.
Update strategy Hourly batch rebuild + hot-swap Simple, reliable, zero-downtime. Trade-off: 1-hour freshness lag. Acceptable for most apps.
Trie sharding Full trie on each server (no sharding) Simpler routing, no cross-shard joins. Works as long as trie fits in RAM (~100GB after compression). Shard by first character only if trie exceeds available RAM.
Cache placement CDN → Redis → Trie (three tiers) CDN handles 80% of traffic for common prefixes. Redis handles rare cache misses. Trie is last resort. Inverted dependency — CDN hit means <5ms; trie hit means <50ms.
Ranking signal Query frequency only Simple, explainable, fast to compute. In production, also factor: recency (trending), user's location, query completion rate. More signals = better suggestions but more complex ranking.

What if the Trie Doesn't Fit in RAM?

# Option 1: Shard trie by first character (26 shards or by frequency)
# Route "a*" queries to shard A, "b*" to shard B, etc.
# Problem: uneven distribution — "s" is far more common than "x"
# Better: shard by frequency bucket (top 1M queries on fast servers, rest on slower)

# Option 2: Use Elasticsearch instead of custom trie
# Elasticsearch has built-in completion suggester (edge n-gram based, not trie)
# Pros: managed infrastructure, handles 10B+ terms, built-in ranking
# Cons: higher latency (5–20ms vs <1ms for in-memory trie), more infrastructure
# Use when: you also need fuzzy matching, multi-language, faceted suggestions

# Option 3: Redis ZRANGEBYLEX (for moderate scale, up to 1M unique queries)
# Store all queries in a sorted set with empty scores (alphabetical ordering)
# ZRANGEBYLEX "autocomplete" "[sys" "[sys\xff" LIMIT 0 5
# Fast, simple, no custom trie — but limited to alphabetical ranking (no frequency)

What to Study Next