Design a Notification System

Push, email, SMS, and in-app — multi-channel fan-out, retry logic, preferences, and 10M/sec scale

A notification system is a common system design interview question and an even more common real-world component — almost every product (social media, e-commerce, banking, SaaS) needs one. The challenge is reliability: a notification that fails to send (or sends twice) damages user trust.

This design covers a multi-channel notification system at Netflix/Amazon scale: 10M notifications per second across push, email, SMS, and in-app channels.

Pre-reading: Start with the System Design Interview Guide for the general framework.

1 Notification Channels

Mobile Push

iOS via APNs, Android via FCM. Requires device token. User must grant permission.

Web Push

Browser notifications via Web Push Protocol. Service workers required.

Email

Via SendGrid, SES, Mailgun. Best for rich content, receipts, digests.

SMS

Via Twilio, SNS. High delivery rate. Best for OTPs, urgent alerts.

In-App

Notification centre within the app. Stored in DB, fetched on login.

Webhooks

HTTP callbacks to third-party systems. Used by B2B/SaaS integrations.

In the interview, ask: "Which channels are in scope?" Starting with all channels is fine — then you can scope down if time is tight. The architecture is the same; only the third-party provider integration differs.

2 Requirements

Functional

  • Send notifications across push, email, SMS, in-app channels
  • User can set channel preferences per notification type
  • User can opt out (unsubscribe) globally or per type
  • Support template-based notifications with variable substitution
  • Scheduled notifications (send at a specific time)
  • Notification history queryable by user
  • Delivery status tracking (sent / delivered / failed / clicked)

Non-Functional

  • 10M notifications per second (peak, e.g., major sports event, flash sale)
  • At-least-once delivery guarantee
  • Latency: <1 second from event trigger to provider submission
  • High availability: 99.9%
  • No duplicate notifications (idempotency)
  • Graceful degradation: one channel failing should not block others

3 Capacity Estimation

# Notifications per day (at 10M/sec peak, typical sustained rate lower):
Sustained: 1M notifications/sec
Peak: 10M notifications/sec (major live event, Black Friday)

# Channel breakdown (typical):
Mobile push: 60%  → 600K/sec
Email:       25%  → 250K/sec
In-app:      10%  → 100K/sec
SMS:          4%  → 40K/sec
Webhooks:     1%  → 10K/sec

# Storage for notification log (30-day retention):
1M/sec × 86,400 sec × 30 days × 200 bytes/record = ~519 TB
→ Use ClickHouse or a time-series DB; partition by month; archive to S3 after 30 days

# Device tokens stored:
1B users × 2 devices avg × 100B token = 200 GB → easily fits in a distributed cache + DB

# Kafka throughput needed:
1M notifications/sec × 1KB per message = 1 GB/sec Kafka throughput
→ Kafka cluster: 50 brokers × 20 MB/sec each = 1 GB/sec ✓

4 High-Level Design

Event Sources (Order Service, Auth Service, Social Service, Marketing Platform, Scheduler)
    ↓
Notification API (REST — accepts notification requests)
    ↓
Notification Service (validates, resolves recipients, checks preferences/opt-outs)
    ↓
Kafka (topic per channel: push-notifications, emails, sms-messages, in-app-alerts)
    ↓──────────────────────────────────────────────────────┐
Push Worker           Email Worker         SMS Worker      In-App Worker
    ↓                     ↓                    ↓                ↓
APNs / FCM          SendGrid / SES         Twilio / SNS    Cassandra (notification store)
    ↓                     ↓                    ↓
Delivery Status Service (tracks sent/delivered/failed/clicked)
    ↓
Notification DB (PostgreSQL — logs, analytics)
Analytics Service (ClickHouse — open rates, click-through rates, delivery rates)

Why Kafka Between Service and Workers

Without Kafka, the Notification Service would call push/email/SMS providers directly — tightly coupled, no retry, no backpressure, no replay. With Kafka:

  • Decoupling: notification creation and delivery are independent — a slow provider doesn't block notification acceptance
  • Durability: Kafka retains messages for 7 days — if a worker crashes, it resumes from its last committed offset (no notification lost)
  • Scalability: scale push workers independently of email workers
  • Replay: if a batch of notifications was mis-delivered, replay from the Kafka offset

5 Notification Flow — Step by Step

Step 1: Event Source Triggers Notification

# Example: User places an order → Order Service triggers notification
POST /api/v1/notifications
{
  "type": "order_confirmed",
  "user_id": "user-123",
  "template_id": "order-confirmation-v2",
  "variables": {
    "order_id": "ORD-98765",
    "total": "$142.50",
    "estimated_delivery": "June 27"
  },
  "priority": "high",              // high|normal|low — affects queue priority
  "scheduled_at": null             // null = send immediately
}

Step 2: Notification Service Processing

def process_notification(request):
    # 1. Validate request
    template = template_store.get(request.template_id)
    rendered = render_template(template, request.variables)

    # 2. Resolve recipient details
    user = user_service.get(request.user_id)
    devices = device_store.get_tokens(request.user_id)   # push tokens

    # 3. Check user preferences + opt-outs
    prefs = preference_store.get(request.user_id, request.type)
    if prefs.opted_out:
        log_skipped(request, "user opted out")
        return

    # 4. Check global rate limits (don't spam users)
    if rate_limiter.is_throttled(request.user_id, request.type):
        log_skipped(request, "rate limited")
        return

    # 5. Fan-out to enabled channels
    notification_id = generate_uuid()
    for channel in prefs.enabled_channels:    # e.g., ["push", "email"]
        kafka.produce(
            topic=f"{channel}-notifications",
            key=request.user_id,              # same partition = ordered per user
            value={
                "notification_id": notification_id,
                "user_id": request.user_id,
                "channel": channel,
                "content": rendered[channel],
                "devices": devices if channel == "push" else None,
                "email": user.email if channel == "email" else None,
                "priority": request.priority,
                "created_at": now()
            }
        )

Step 3: Channel Workers

# Push Worker (one Kafka consumer group per channel):
def push_worker(message):
    payload = message.value
    for device in payload["devices"]:
        try:
            if device.platform == "ios":
                apns.send(
                    device_token=device.token,
                    payload={
                        "aps": {
                            "alert": {"title": payload["content"]["title"],
                                      "body":  payload["content"]["body"]},
                            "badge": 1,
                            "sound": "default"
                        },
                        "notification_id": payload["notification_id"]   # for dedup
                    }
                )
            elif device.platform == "android":
                fcm.send(
                    registration_token=device.token,
                    notification={"title": payload["content"]["title"],
                                  "body":  payload["content"]["body"]},
                    data={"notification_id": payload["notification_id"]}
                )
            delivery_status.mark_sent(payload["notification_id"], device.id)
        except (InvalidTokenError, UnregisteredDeviceError):
            device_store.remove_token(device.token)   # clean up invalid tokens
        except ProviderThrottleError as e:
            # Re-queue with exponential backoff
            retry_queue.enqueue(payload, delay=e.retry_after)

# Email Worker:
def email_worker(message):
    payload = message.value
    sendgrid.send(
        to=payload["email"],
        subject=payload["content"]["subject"],
        html=payload["content"]["html"],
        custom_args={"notification_id": payload["notification_id"]}
    )
    delivery_status.mark_sent(payload["notification_id"])

6 Reliability — Retry Logic & Deduplication

Retry Strategy (Exponential Backoff)

# Retry on transient failures (provider timeout, 5xx, rate limit)
# Do NOT retry on permanent failures (invalid token, unsubscribed email)

RETRY_DELAYS = [10, 30, 120, 600, 3600]   # seconds: 10s, 30s, 2m, 10m, 1hr

def send_with_retry(notification, attempt=0):
    try:
        provider.send(notification)
        delivery_status.mark_delivered(notification.id)
    except TransientError as e:
        if attempt >= len(RETRY_DELAYS):
            delivery_status.mark_failed(notification.id, reason=str(e))
            alert_oncall(notification)   # alert if high-priority notification fails
            return
        delay = RETRY_DELAYS[attempt] * (1 + random.uniform(0, 0.2))  # jitter
        retry_queue.enqueue(notification, delay=delay, attempt=attempt+1)
    except PermanentError as e:
        delivery_status.mark_failed(notification.id, reason=str(e))
        # Don't retry — update user record (invalid email, unsubscribed)
        handle_permanent_failure(notification, e)

Deduplication

# Problem: Kafka at-least-once delivery → consumer may process same message twice
# (e.g., consumer crashes after sending but before committing offset)

# Solution: idempotency key on every notification
# Before sending to provider, check if notification_id was already sent:

def is_duplicate(notification_id: str) -> bool:
    key = f"notif:sent:{notification_id}"
    result = redis.set(key, "1", nx=True, ex=86400)   # SET if Not eXists
    return result is None   # None = key already existed = duplicate

# In the worker:
if is_duplicate(payload["notification_id"]):
    logger.info(f"Duplicate notification {payload['notification_id']} — skipping")
    return   # don't send again

send_to_provider(payload)

# APNs/FCM also support deduplication:
# APNs: apns-collapse-id header (collapses multiple notifications to same device)
# FCM: collapse_key parameter (replaces older undelivered notification with same key)

User Preference & Opt-Out

CREATE TABLE notification_preferences (
  user_id        BIGINT NOT NULL,
  notification_type VARCHAR(64) NOT NULL,   -- "order_confirmed", "marketing_promo", etc.
  push_enabled   BOOLEAN DEFAULT TRUE,
  email_enabled  BOOLEAN DEFAULT TRUE,
  sms_enabled    BOOLEAN DEFAULT FALSE,     -- opt-in by default for SMS
  in_app_enabled BOOLEAN DEFAULT TRUE,
  quiet_hours_start TIME,                   -- don't send push between 22:00–08:00
  quiet_hours_end   TIME,
  PRIMARY KEY (user_id, notification_type)
);

-- Global unsubscribe (overrides all per-type preferences):
CREATE TABLE unsubscribed_users (
  user_id    BIGINT PRIMARY KEY,
  channel    VARCHAR(32),    -- 'all', 'email', 'sms', 'push'
  reason     TEXT,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Check in Notification Service before publishing to Kafka:
def can_send(user_id, notification_type, channel):
    if unsubscribe_store.is_unsubscribed(user_id, channel):
        return False
    prefs = preference_store.get(user_id, notification_type)
    if not prefs.is_channel_enabled(channel):
        return False
    if prefs.in_quiet_hours():
        # For non-urgent: delay to after quiet hours
        # For urgent (OTP, security): send anyway
        return notification_type not in URGENT_TYPES
    return True

7 Data Model

-- Notification log (PostgreSQL — recent 30 days; archive older to S3):
CREATE TABLE notifications (
  id              UUID PRIMARY KEY,
  user_id         BIGINT NOT NULL,
  type            VARCHAR(64) NOT NULL,
  channel         VARCHAR(32) NOT NULL,
  template_id     VARCHAR(64),
  content_preview VARCHAR(200),
  status          VARCHAR(20) DEFAULT 'pending',  -- pending/sent/delivered/failed/clicked
  provider        VARCHAR(32),                    -- 'apns', 'fcm', 'sendgrid', 'twilio'
  provider_msg_id VARCHAR(128),                   -- provider's message ID (for receipts)
  created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  sent_at         TIMESTAMPTZ,
  delivered_at    TIMESTAMPTZ,
  failed_at       TIMESTAMPTZ,
  failure_reason  TEXT,
  retry_count     SMALLINT DEFAULT 0
);
CREATE INDEX idx_notif_user_time ON notifications(user_id, created_at DESC);
CREATE INDEX idx_notif_status    ON notifications(status) WHERE status = 'pending';

-- Device token store:
CREATE TABLE device_tokens (
  user_id      BIGINT NOT NULL,
  token        VARCHAR(512) NOT NULL,
  platform     VARCHAR(16) NOT NULL,   -- 'ios', 'android', 'web'
  app_version  VARCHAR(32),
  registered_at TIMESTAMPTZ DEFAULT NOW(),
  last_seen    TIMESTAMPTZ,
  is_active    BOOLEAN DEFAULT TRUE,
  PRIMARY KEY (user_id, token)
);

8 Scaling to 10M Notifications/Second

Kafka Partition Strategy

# Topic: push-notifications
# Partitions: 500 (each partition handles ~20K msg/sec)
# Replication factor: 3
# Consumer group: push-workers (500 consumers, one per partition)
# Each consumer: sends to APNs/FCM in parallel (async I/O, 100 concurrent sends)
# Throughput per consumer: 100 concurrent × 1K msg/batch = 100K notifications/sec/consumer
# Total: 500 consumers × 100K = 50M notifications/sec capacity → 5× headroom

# Kafka key = user_id → all notifications for a user go to same partition → ordered

# Topic separation by priority:
# push-notifications-high (flash sale, security alert)
# push-notifications-normal (social updates)
# push-notifications-low (marketing)
# High-priority consumers are more numerous and get dedicated resources

APNs / FCM Throughput Limits

# APNs rate limits per certificate:
# No published hard limit, but best practice: use HTTP/2 multiplexing
# One HTTP/2 connection supports 1500 concurrent streams per second
# For 10M iOS pushes/sec: ~6,700 APNs connections needed
# In practice: use APNs provider token (not cert) + connection pooling

# FCM limits:
# Legacy: 1M messages/sec per project
# HTTP v1 API: higher limits, per-project quotas
# For scale: distribute across multiple Firebase projects (sharded by user ID range)

# AWS SNS as abstraction layer:
# SNS handles routing to APNs/FCM/ADM/BAIDU
# Manages connection pools to all providers
# Throttling: SNS transparently batches and respects provider limits
# Cost: $0.50 per 1M push notifications

# Provider failover:
def send_push_with_fallback(device, payload):
    if device.platform == "ios":
        try:
            return apns_primary.send(device.token, payload)
        except ProviderError:
            return apns_backup.send(device.token, payload)   # secondary APNs connection pool
    # For extreme redundancy: OneSignal or Braze as fallback provider

Scheduled Notifications

# For scheduled notifications (e.g., "send at 9am user's local time"):
# Store in a scheduled_notifications table with send_at timestamp
# Scheduler job (cron-like, runs every minute):
#   SELECT * FROM scheduled_notifications WHERE send_at <= NOW() AND status = 'pending'
#   LIMIT 10000 FOR UPDATE SKIP LOCKED   -- distributed locking, no duplicate processing
#   → publish each to Kafka → mark as 'queued'

# For massive scheduled batches (marketing campaign to 100M users):
# Don't do: insert 100M rows into scheduled_notifications
# Do: store the campaign (template + target segment + send_time) in campaigns table
# Scheduler fetches campaign → queries user segment → fans out to Kafka in streaming fashion
# Rate-limit the fan-out: push to Kafka at 1M/sec to avoid overwhelming providers

9 Trade-offs & Design Decisions

  • At-least-once vs exactly-once: We chose at-least-once with client-side deduplication (notification_id idempotency key). Exactly-once across the Kafka + provider pipeline would require distributed transactions — impractical. The dedup key prevents users from seeing duplicate notifications even if the system delivers twice.
  • Synchronous vs asynchronous delivery: The notification API responds immediately (202 Accepted) after publishing to Kafka. The caller doesn't wait for the notification to actually arrive on the user's device — that could take seconds or minutes (APNs delivery is not instantaneous). Async is the only sensible choice at scale.
  • Single notification service vs per-channel services: A single Notification Service handles fan-out to per-channel Kafka topics. Workers are channel-specific. This gives us one place for preference/opt-out logic while allowing independent scaling of each channel.
  • Provider abstraction: Wrap APNs/FCM/Twilio/SendGrid behind thin provider-adapter interfaces. Swapping SendGrid for SES requires only changing the adapter — the worker code and data model stay the same. Crucial for avoiding vendor lock-in.
  • Fan-out for group notifications: When notifying all members of a group (e.g., "Game starts in 5 minutes" for 500K subscribers), expand recipients in the Notification Service and fan-out to Kafka. Do NOT do DB JOINs inside the worker — batch the expansion at the source where you have full context.

What to Study Next