Database Sharding — Complete Guide

Sharding strategies, shard key selection, hot shard problem, cross-shard queries, and when to shard

Database sharding (horizontal partitioning) is the technique of splitting a single large database into multiple smaller databases called shards. Each shard holds a subset of the data and runs on its own server — together they hold the complete dataset.

Sharding is the go-to solution when your data outgrows a single machine. But it introduces significant complexity — cross-shard queries, rebalancing, and the critical shard key decision. This guide explains everything you need to know for system design interviews and real-world implementation.

Pre-reading: Consistent Hashing explains the key routing mechanism behind hash-based sharding.

1 When Do You Need Sharding?

Don't shard prematurely — it adds enormous complexity. Try simpler options first:

# Scaling ladder (try in order before sharding):
1. Vertical scaling:    bigger machine (more CPU, RAM, NVMe SSDs)
                        → PostgreSQL on 192-core, 12TB RAM can handle billions of rows
2. Read replicas:       3–5 read-only replicas for read-heavy workloads
                        → offloads 80%+ of reads from primary
3. Caching:             Redis/Memcached in front of DB
                        → 99% cache hit rate means DB rarely touched
4. Table partitioning:  Partition by date within a SINGLE server (PostgreSQL, MySQL)
                        → fast range queries, automatic old data archival
5. Vertical sharding:   Move specific tables to their own dedicated DB server
                        (user table on Server A, orders table on Server B)
6. Horizontal sharding: Split rows of ONE large table across multiple servers
                        → only when everything else is insufficient

# When sharding is necessary:
✓ Single table > 100–500 GB and growing fast
✓ Write TPS consistently > 10K on a single primary (even with SSD)
✓ Vertical scaling cost exceeds horizontal sharding cost
✓ Data residency / compliance requires geographic separation
Most startups never need sharding. A well-tuned PostgreSQL on a cloud instance can handle 10M users and 100M rows without sharding. Start sharding only when you have a measurable bottleneck.

2 Sharding Strategies

Hash Sharding

shard = hash(shard_key) % N

Even distribution, no hot spots. Bad for range queries. Adding shards remaps data (use consistent hashing to mitigate).

Range Sharding

Shard A: IDs 1–1M, Shard B: 1M–2M, etc.

Good for range queries. Sequential writes create hot shard at tail. Easy to reason about.

Directory Sharding

Lookup table: user_id → shard_id

Maximum flexibility. Lookup table is a SPOF. Extra network hop per request.

Geo Sharding

US users → US shard, EU → EU shard

Data residency compliance (GDPR). Low latency for regional users. Unbalanced if regions grow unevenly.

Hash Sharding — Most Common

# Simple hash sharding:
def get_shard(user_id: int, n_shards: int) -> int:
    return hash(str(user_id)) % n_shards

# Problem: adding a shard remaps (N-1)/N of all keys → mass data migration
# Solution: use consistent hashing (see consistent-hashing-explained.html)
def get_shard_consistent(user_id: int, ring: ConsistentHashRing) -> str:
    return ring.get_server(str(user_id))

# Shard routing at application layer:
class ShardRouter:
    def __init__(self, shard_connections: dict):
        self.ring = ConsistentHashRing(list(shard_connections.keys()))
        self.connections = shard_connections  # {shard_name: DB_connection}

    def get_connection(self, user_id: int):
        shard_name = self.ring.get_server(str(user_id))
        return self.connections[shard_name]

router = ShardRouter({
    "shard_0": psycopg2.connect("host=shard0.db ..."),
    "shard_1": psycopg2.connect("host=shard1.db ..."),
    "shard_2": psycopg2.connect("host=shard2.db ..."),
})

# Query: always include shard key in WHERE clause
def get_user(user_id: int) -> dict:
    conn = router.get_connection(user_id)
    return conn.execute("SELECT * FROM users WHERE id = %s", user_id).fetchone()

Range Sharding — When Sequential Order Matters

# Range sharding config (stored in ZooKeeper or etcd):
SHARD_RANGES = [
    {"shard": "shard_0", "min": 0,          "max": 25_000_000},
    {"shard": "shard_1", "min": 25_000_001, "max": 50_000_000},
    {"shard": "shard_2", "min": 50_000_001, "max": 75_000_000},
    {"shard": "shard_3", "min": 75_000_001, "max": float("inf")},
]

# Hot spot problem: all new users get ID > 75M → all writes go to shard_3
# Fix: use UUIDs instead of sequential IDs for user IDs
# Or: pre-split based on expected growth (Cassandra's approach)

# Good use case: IoT time-series data
# Shard by date range: shard_2026_06 contains all readings from June 2026
# Range queries "give me readings from 2026-06-01 to 2026-06-30" = single shard ✓

3 Choosing the Right Shard Key

The shard key decision is permanent (or extremely expensive to change). Choose carefully.

Properties of a Good Shard Key

# 1. HIGH CARDINALITY — many unique values
#    Bad:  status field (3 values: active/inactive/deleted) → 3 shards max, all data in 1-2
#    Good: user_id (millions of unique values)

# 2. EVEN DISTRIBUTION — no hot shards
#    Bad:  country_code if 90% of users are from one country → one shard gets 90% of data
#    Good: user_id (random/UUID) → uniform distribution

# 3. MINIMISE CROSS-SHARD QUERIES — related data in same shard
#    Bad:  shard by product_id for an e-commerce app
#          → order "user A bought product B" → user on shard 2, product on shard 5 → JOIN = cross-shard
#    Good: shard by user_id → all of user A's orders, cart, preferences on same shard

# 4. MONOTONICALLY INCREASING KEYS ARE DANGEROUS for hash sharding
#    Sequential IDs (1, 2, 3, ...) hash well
#    But auto-increment creates write hotspot in range sharding → use UUIDs or Snowflake IDs

Common Shard Key Choices by System

SystemShard KeyWhy
Social network (Twitter)user_idUser's tweets, follows, and feed all co-located
E-commerce (Amazon)user_id or order_idOrder history per user in same shard; Checkout queries by order_id
Chat (WhatsApp)conversation_idAll messages in a conversation in same shard
Multi-tenant SaaStenant_id (org_id)One tenant's data stays together; easy tenant-level backup/delete
URL shortenershort_codeRedirect lookup is always by short_code
IoT time-seriesdevice_id + time bucketAll readings for a device together; time range queries efficient

4 Challenges & Solutions

Problem 1: Cross-Shard Queries (Joins and Aggregations)

# Problem: "Get top 10 users by follower_count"
# → follower counts are on different shards → must query all shards → expensive

# Solution A: Scatter-gather (fan out to all shards, merge results)
def get_top_users_globally(n=10):
    results = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(shard.query,
                   "SELECT id, follower_count FROM users ORDER BY follower_count DESC LIMIT %s", n)
                   for shard in all_shards]
        for future in concurrent.futures.as_completed(futures):
            results.extend(future.result())
    results.sort(key=lambda x: x.follower_count, reverse=True)
    return results[:n]
# Cost: O(N_shards) queries → acceptable if N_shards is small (8–16)

# Solution B: Denormalize — maintain a global analytics table in a separate non-sharded DB
# Write: when follower_count changes → async update to analytics.user_stats
# Read: query analytics.user_stats directly (not sharded → single query)

# Solution C: Avoid the query — pre-compute and cache the leaderboard
# Cron job every 5 minutes → scatter-gather → store result in Redis → serve from cache

Problem 2: Distributed Transactions (Cross-Shard Writes)

# Problem: Transfer $100 from User A (Shard 1) to User B (Shard 3)
# Need atomicity: either BOTH accounts update or NEITHER

# Solution A: Two-Phase Commit (2PC) — strong consistency
# Phase 1 (Prepare): coordinator asks both shards "can you commit?"
# Phase 2 (Commit): if both say yes → both commit; otherwise both rollback
# Problem: coordinator failure during Phase 2 → stuck in limbo (slow, fragile)

# Solution B: Saga Pattern — eventual consistency
# Step 1: Debit User A → success
# Step 2: Credit User B → if fails → compensate: Credit User A back
# Implemented as sequence of local transactions with compensating actions
# Used by most financial systems at scale (much more resilient than 2PC)

# Solution C: Redesign to avoid cross-shard transactions
# For payments: route both users to same "payment shard" (by payment_id, not user_id)
# The payment record is a single-shard operation; update user balances asynchronously

Problem 3: Hot Shard (Data Skew)

# Problem: 10% of users are power users → shard_2 (which has power users) gets 80% of traffic

# Detection: monitor query volume per shard
# Alert if any shard handles > 2× the average traffic

# Solution A: Shard splitting — split the hot shard into 2
# Shard 2 (users A-Z) → Shard 2a (users A-M) + Shard 2b (users N-Z)
# Data migration: copy half the data, update routing, switch traffic

# Solution B: Add read replicas to the hot shard only
# Hot shard: 1 primary + 5 read replicas (others: 1 primary + 1 replica)
# Read requests load-balanced across replicas

# Solution C: Tiered caching — aggressive cache in front of hot shard
# If power users' data is heavily read, cache it → DB sees fewer requests

Problem 4: Resharding (Adding New Shards)

# Problem: going from 4 shards to 8 shards
# With naive hashing: massive data migration (every key remapped)
# With consistent hashing: only 1/8 of data moves (from the half of each old shard)

# Resharding process (zero-downtime):
# 1. Add new shards to the ring
# 2. Start dual-writing: write to old shard AND new shard for each new write
# 3. Background job: migrate existing data from old shard to new shard
# 4. Once migration complete: switch reads to new shard, stop dual-writing, decommission old

# Resharding is expensive — plan shard count conservatively
# Over-shard initially: 64 or 128 shards even if starting small
# Shards can run on fewer physical servers (multiple shards per server)
# Adding servers later = move shards (not resharding), which is much simpler

5 Vertical Sharding vs Horizontal Sharding

TypeWhat It DoesWhen to UseLimitation
Vertical Scaling Bigger machine (more RAM, CPU) First response to bottleneck. Quick and simple. Hardware limit exists. Expensive at high end.
Vertical Sharding Split different tables to different servers (users table on DB1, orders table on DB2) One table is causing all the bottleneck. JOIN between tables on different servers = expensive.
Horizontal Sharding Split rows of same table across multiple servers Single large table; unlimited horizontal scale. Cross-shard JOINs; no distributed transactions.
Read Replicas Copy of primary DB for read-only queries Read-heavy workloads (analytics, reporting). Replication lag; doesn't help write bottleneck.
In interviews: Mention all four options when discussing database scaling. Only recommend sharding after explaining why simpler options are insufficient. "I'd start with read replicas since we're 100:1 read-heavy, add a Redis cache for hot data, and only consider sharding if we hit primary write throughput limits."

6 Managed Sharding Solutions

Building your own sharding layer is complex. In practice, use one of these:

SolutionTypeSharding ApproachBest For
Amazon Aurora ShardingManaged MySQL/PostgreSQLHorizontal + read replicasAWS-native apps with <128TB
Vitess (YouTube/PlanetScale)MySQL proxy layerHash + range sharding over MySQL clustersLarge MySQL deployments, open source
Citus (PostgreSQL extension)PostgreSQL extensionHash/range sharding, distributed queriesPostgreSQL users needing horizontal scale
Amazon DynamoDBManaged NoSQLAutomatic hash sharding via partition keysKey-value/document access patterns, serverless
CassandraDistributed NoSQLConsistent hashing with virtual nodesWrite-heavy, time-series, global distribution
MongoDB AtlasManaged document DBHash/range sharding on shard keyDocument data with flexible schema
Interview advice: Mention Vitess for MySQL or Citus for PostgreSQL when asked about sharding — they show you know real-world solutions, not just theoretical concepts. "Rather than building our own sharding middleware, I'd use Citus which adds horizontal sharding as a PostgreSQL extension — we keep familiar SQL semantics while getting distributed scale."

What to Study Next