A chat system is one of the richest system design problems — it touches real-time communication protocols, message delivery guarantees, storage at massive scale, group chat fan-out strategies, and offline/online presence. This walkthrough covers a WhatsApp-scale design: 500M daily active users, 100B messages per day.
Key clarification: Does the system need end-to-end encryption (E2E)? For WhatsApp-style: yes (Signal Protocol). For Slack-style: no (server-side encryption only). This changes the server's ability to read/process messages.
# Messages: 500M users × 50 messages/day = 25B messages/day 25B ÷ 86,400 sec = ~290,000 messages/sec → 290K msg/sec peak # Storage per message: sender_id (8B) + receiver_id (8B) + timestamp (8B) + content (100B avg) + metadata (20B) ≈ 144 bytes per message # Daily storage: 25B messages × 144B = 3.6 TB/day → 1.3 PB/year # WebSocket connections: 500M DAU, assume 50% active simultaneously = 250M concurrent WebSocket connections Each connection needs a chat server → at 100K connections/server → 2,500 chat servers # Presence updates: Every online user sends heartbeat every 5s: 250M online users ÷ 5 sec = 50M presence events/sec → handled by dedicated presence service
| Protocol | Direction | Latency | Use Case |
|---|---|---|---|
| HTTP Polling | Client → Server only | High (reconnect overhead) | Not suitable — wastes bandwidth, adds latency |
| HTTP Long Polling | Client → Server (hold open) | Medium | Acceptable fallback, but wastes connections |
| Server-Sent Events (SSE) | Server → Client only | Low | Good for one-way feeds (notifications), not chat |
| WebSocket | Full duplex (both ways) | Very low | Chosen — persistent bidirectional connection, sub-10ms overhead |
WebSocket enables the server to push messages to clients immediately without the client polling. Once the WebSocket handshake completes (over HTTP Upgrade), the connection stays open — no TCP handshake overhead per message.
# WebSocket connection lifecycle:
Client → HTTP GET /ws HTTP/1.1
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Server → HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
# After handshake: full-duplex binary/text frames
# Both client and server can send at any time
# Keep-alive: ping/pong frames every 30s to detect dead connections
Chat servers are stateful — each WebSocket connection is pinned to a specific server. The load balancer uses sticky sessions (by user ID) to ensure reconnects go to the same server. This is different from stateless HTTP servers.
The consequence: you can't just "add more chat servers" without routing logic. When User A (on Server 1) sends a message to User B (on Server 7), Server 1 needs a way to reach Server 7.
Solution: pub/sub via Redis or Kafka. Chat servers subscribe to a user-specific channel. When a message arrives for User B, it's published to User B's channel — whichever server holds User B's connection picks it up and delivers it.
# Step 1: Alice sends message to Bob via WebSocket
Alice → ChatServer-A: {
"type": "message",
"to": "bob_id",
"content": "Hey Bob!",
"client_msg_id": "alice-1234" // dedup key — client generates
}
# Step 2: ChatServer-A assigns a global message ID and persists to Cassandra
msg_id = generate_snowflake_id()
cassandra.insert(conversation_id, msg_id, sender="alice", content="Hey Bob!", ...)
# Cassandra write is async — don't block the send path
# Step 3: ChatServer-A publishes to User B's Redis pub/sub channel
redis.publish("user:bob_id:messages", json.dumps({
"msg_id": msg_id,
"from": "alice_id",
"content": "Hey Bob!",
"timestamp": now()
}))
# Step 4: ChatServer-B (which holds Bob's WebSocket) receives from Redis pub/sub
# and immediately pushes to Bob's WebSocket connection
Bob receives message in <10ms
# Step 5: ChatServer-A sends ACK back to Alice
Alice ← ChatServer-A: { "type": "ack", "client_msg_id": "alice-1234", "msg_id": msg_id, "status": "sent" }
# Alice's client changes status: ✓ (sent)
# Step 6: Bob's client sends delivery receipt
Bob → ChatServer-B: { "type": "receipt", "msg_id": msg_id, "status": "delivered" }
ChatServer-B publishes receipt to Alice's channel → Alice sees ✓✓ (delivered)
# Step 7: Bob opens message — read receipt
Bob → ChatServer-B: { "type": "receipt", "msg_id": msg_id, "status": "read" }
→ Alice sees ✓✓ (blue) (read)
client_msg_id is critical for deduplication. If Alice's WebSocket drops after sending but before receiving the ACK, she retries — the server detects the duplicate client_msg_id and doesn't store a second copy.# Same steps 1–3 above. But when Redis pub/sub fires:
ChatServer-B tries to deliver → Bob has no active WebSocket connection
# Fallback flow:
1. Message Delivery Worker receives the Kafka event (all messages go to Kafka for durability)
2. Checks presence service: bob_id → OFFLINE
3. Queries Bob's device push tokens from User Service
4. Sends push notification via:
- APNs (Apple Push Notification service) for iOS
- FCM (Firebase Cloud Messaging) for Android
- Web Push for browsers
# When Bob comes back online:
1. Bob's client connects WebSocket to ChatServer (any server via sticky LB)
2. Client sends "sync" request: { "last_msg_id": "msg-4320" }
3. Chat server fetches all messages with msg_id > 4320 from Cassandra
4. Sends missed messages in order
5. Bob's client sends delivery receipts for all received messages
Group messages are the hardest part of chat system design. A message sent to a 500-member group must be delivered to all 500 members. This is fan-out — 1 write triggers N deliveries.
| Strategy | How It Works | When to Use |
|---|---|---|
| Fan-out on write | When a message is sent, immediately publish to each member's Redis channel (500 pub/sub publishes) | Online-heavy groups — most members are online; direct delivery is fast |
| Fan-out on read | Store message once; each member pulls new messages on sync | Large groups (1000+) — most members are offline; lazy delivery is cheaper |
| Hybrid | Fan-out on write for online members; fan-out on read for offline members | Recommended — WhatsApp/Slack approach |
# Step 1: Alice sends message to Group-123 (500 members)
ChatServer-A receives message → assigns msg_id → persists to Cassandra
# Step 2: Fetch group members and their online status
members = group_service.get_members("group-123") # [alice, bob, carol, ... 500 members]
online_members, offline_members = presence_service.partition(members)
# Step 3: Fan-out to online members via Redis pub/sub
for user_id in online_members:
redis.publish(f"user:{user_id}:messages", message_payload)
# If 100 members are online: 100 Redis publishes in parallel
# Step 4: For offline members: record in "undelivered" table
# Next time they come online, sync fetches messages from Cassandra
cassandra.insert("group_undelivered", group_id="group-123",
offline_members=offline_members, msg_id=msg_id)
# Step 5: Send push notifications to offline members' devices
for user_id in offline_members:
push_notification_service.send(user_id, preview=message[:50])
# Cassandra group messages schema:
CREATE TABLE group_messages (
group_id UUID,
msg_id BIGINT, -- Snowflake ID (ordered by time)
sender_id UUID,
content TEXT,
sent_at TIMESTAMP,
PRIMARY KEY (group_id, msg_id) -- partition by group, cluster by time
) WITH CLUSTERING ORDER BY (msg_id DESC);
Why Cassandra? Message storage has specific characteristics that make Cassandra ideal: append-heavy writes, time-ordered reads (fetch messages in a conversation in time order), linear horizontal scaling, and multi-datacenter replication.
-- Direct messages (1:1 chats): -- conversation_id = deterministic hash of (min(user1_id, user2_id), max(...)) CREATE TABLE messages ( conversation_id UUID, msg_id BIGINT, -- Snowflake ID: naturally time-ordered sender_id UUID, recipient_id UUID, content TEXT, msg_type TINYINT, -- 0=text, 1=image, 2=video, 3=file sent_at TIMESTAMP, status TINYINT, -- 0=sent, 1=delivered, 2=read PRIMARY KEY (conversation_id, msg_id) ) WITH CLUSTERING ORDER BY (msg_id DESC) -- newest first AND gc_grace_seconds = 864000; -- 10-day tombstone window -- Group messages (same table, group_id as conversation_id): -- PRIMARY KEY (group_id, msg_id) -- User last-seen per conversation (for sync on reconnect): CREATE TABLE conversation_cursors ( user_id UUID, conversation_id UUID, last_read_msg_id BIGINT, PRIMARY KEY (user_id, conversation_id) );
-- Relational structure works well for user/group metadata (smaller scale, complex queries)
CREATE TABLE users (
id BIGINT PRIMARY KEY,
username VARCHAR(64) UNIQUE NOT NULL,
phone VARCHAR(20) UNIQUE,
display_name VARCHAR(128),
avatar_url TEXT,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE group_members (
group_id BIGINT NOT NULL,
user_id BIGINT NOT NULL,
role ENUM('member','admin','owner') DEFAULT 'member',
joined_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (group_id, user_id),
INDEX idx_user_groups (user_id) -- "what groups is this user in?"
);
# Images/video are NOT stored in Cassandra — use object storage # Flow: 1. Client requests pre-signed S3 upload URL from Media Service 2. Client uploads directly to S3 (bypasses chat server — no bandwidth bottleneck) 3. Media Service stores S3 key + generates CDN URL 4. Message record stores the CDN URL, not the file itself 5. Client renders image via CDN (global, <10ms) # Media thumbnails: generated async by Lambda function triggered on S3 upload # Retention: user-controlled; media can expire while message metadata remains
# Online presence stored in Redis (TTL-based):
# Key: "presence:{user_id}" → Value: "online" TTL: 10 seconds
# Client sends WebSocket heartbeat every 5 seconds:
Client → ChatServer: { "type": "heartbeat" }
ChatServer → redis.setex(f"presence:{user_id}", 10, "online")
# If no heartbeat for 10s → key expires → user is "offline"
# Checking another user's status:
def get_status(user_id: str) -> str:
if redis.exists(f"presence:{user_id}"):
return "online"
last_seen = redis.get(f"last_seen:{user_id}") # stored on disconnect
return f"last seen {format_time(last_seen)}"
# On WebSocket disconnect:
def on_disconnect(user_id: str):
redis.delete(f"presence:{user_id}")
redis.set(f"last_seen:{user_id}", now().isoformat())
# Publish offline event so friends' clients update status
redis.publish(f"presence_events", json.dumps({"user_id": user_id, "status": "offline"}))
# Scaling presence:
# 50M presence events/sec (heartbeats) → Redis Cluster, 10 shards
# Each shard handles 5M events/sec → Redis handles ~500K ops/sec per node → fine
# Presence is eventually consistent — 10s delay in showing "offline" is acceptable
# Each chat server handles: # - 100,000 concurrent WebSocket connections (typical for Go/Rust; ~50K for Node.js) # - CPU: mostly idle (event-driven, waiting for messages) # - Memory: 10KB per connection × 100K = ~1GB RAM per server # For 250M connections: 250M ÷ 100K connections/server = 2,500 chat servers # Load balancer: L4 (not L7) for WebSockets # L4 sticky sessions: hash(user_id) → always same server # Health check: TCP ping every 5s
# Problem: Alice (ChatServer-A) sends message to Bob (ChatServer-B)
# How does ChatServer-A know which server holds Bob's connection?
# Solution 1: Redis pub/sub (recommended for <10K servers)
# Each user subscribes their chat server to "user:{user_id}" channel in Redis
# Publishing to that channel reaches the right server automatically
# Redis Cluster: 1000 channels per server × 2500 servers = 2.5M channels → Redis handles fine
# Solution 2: Consistent hash ring
# User IDs mapped to servers via consistent hashing
# ChatServer-A looks up "Bob → Server-B" in the hash ring → routes directly
# Solution 3: Service registry (ZooKeeper / etcd)
# Each server registers which user IDs it serves
# Other servers query the registry → direct gRPC call
# Overkill for most chat systems; used at extreme scale (Meta)
# Challenge: messages from multiple senders in a group may arrive out of order # Solution: Snowflake IDs are monotonically increasing (within a single generator node) # For global ordering: use logical clocks (Lamport timestamps) # In practice, "good enough" ordering: # 1. Store Snowflake msg_id in Cassandra (time-ordered per partition) # 2. Client sorts messages by msg_id before display # 3. Clock skew (NTP) can cause occasional out-of-order — accept this (common in WhatsApp) # 4. For "last message" preview in conversation list: use server-side timestamp, not client
client_msg_id deduplication on the client side removes visible duplicates. Exactly-once would require distributed transactions — prohibitively expensive at this scale.