Design a Payment System

Idempotency, exactly-once guarantees, double-spend prevention, reconciliation, and 100K TPS scale

Payment systems are the most correctness-critical systems you'll ever design. A bug in a social feed might annoy users; a bug in a payment system can cause double charges, lost money, or regulatory fines. This walkthrough covers the design principles that make payments reliable — idempotency, exactly-once semantics, reconciliation, and fraud detection — at 100K transactions per second.

Key mindset shift: In payment systems, "fast and approximate" is never acceptable. Every design decision prioritizes correctness over throughput. If forced to choose between processing speed and preventing double charges — always choose correctness.
Pre-reading: See the System Design Interview Guide for the general framework.

1 Requirements

Functional

  • Accept payments (credit/debit card, bank transfer, digital wallet)
  • Route payments to appropriate payment processors (Stripe, Adyen, Braintree)
  • Process refunds and partial refunds
  • Maintain complete transaction history (immutable ledger)
  • Reconcile transactions with bank/processor statements
  • Fraud detection and risk scoring
  • Support multiple currencies with FX conversion

Non-Functional

  • 100K transactions per second (Black Friday peak)
  • Exactly-once processing — no double charges
  • Transaction latency: <500ms end-to-end (p99)
  • Availability: 99.99% (52 min/year max downtime)
  • Consistency: strong (financial data cannot be eventually consistent)
  • Auditability: every state change logged with timestamp + actor
  • PCI-DSS compliance (no raw card data stored in our system)

2 The Core Problem: Exactly-Once in a Distributed System

The fundamental challenge in payment systems: what happens when a network request to the payment processor times out? You don't know if the charge succeeded or failed. If you retry, you might double-charge. If you don't retry, you might miss a failed charge.

The solution: idempotency keys.

# Without idempotency:
# 1. Client sends "charge $100" to Payment Service
# 2. Payment Service sends to Stripe → network timeout
# 3. Did Stripe charge the card? Unknown.
# 4. If retry: Stripe might process TWO charges → customer billed twice
# 5. If no retry: if Stripe DID succeed but we didn't record it → order never fulfilled

# With idempotency key:
# 1. Client generates a unique idempotency_key (UUID) for this payment attempt
# 2. Sends "charge $100, idempotency_key=abc-123"
# 3. Payment Service records intent in DB: INSERT INTO payment_intents (key='abc-123', status='pending')
# 4. Sends to Stripe with the SAME key: stripe.charge(idempotency_key='abc-123', amount=100)
# 5. Network timeout — retry with SAME key
# 6. Stripe: "I already processed abc-123 — returning same result" (no duplicate charge)
# 7. Payment Service: re-query Stripe for status of abc-123 → gets the result

# Stripe, Adyen, Braintree all support idempotency keys natively
# Your own payment service must ALSO be idempotent (client may retry your API)
Idempotency is not optional. It's the foundational primitive of any payment system. Design your entire payment API to be idempotent from day one.

3 Payment State Machine

A payment goes through well-defined states. Never skip states, never go backwards (except for explicit refunds). State transitions are atomic DB writes.

INITIATED
PENDING
PROCESSING
COMPLETED
↓ (processor declined)
FAILED
  or  
TIMED_OUT
  or  
NEEDS_REVIEW
COMPLETED
REFUND_INITIATED
REFUNDED
-- Payment record (state stored in DB, transitions are audit-logged):
CREATE TABLE payments (
  id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  idempotency_key   VARCHAR(128) UNIQUE NOT NULL,   -- client-provided dedup key
  order_id          UUID NOT NULL,
  user_id           BIGINT NOT NULL,
  amount            NUMERIC(19,4) NOT NULL,          -- cents to avoid float issues
  currency          CHAR(3) NOT NULL DEFAULT 'USD',
  status            VARCHAR(20) NOT NULL DEFAULT 'initiated',
  payment_method_id UUID,                            -- tokenized card/wallet reference
  processor         VARCHAR(32),                     -- 'stripe', 'adyen', 'braintree'
  processor_tx_id   VARCHAR(256),                    -- provider's transaction ID
  processor_status  VARCHAR(64),                     -- raw status from provider
  risk_score        SMALLINT,                        -- 0-100, from fraud detection
  created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  completed_at      TIMESTAMPTZ
);
CREATE INDEX idx_payments_order   ON payments(order_id);
CREATE INDEX idx_payments_user    ON payments(user_id, created_at DESC);
CREATE INDEX idx_payments_idempot ON payments(idempotency_key);

-- Immutable audit log (append-only — never update or delete):
CREATE TABLE payment_events (
  id          BIGSERIAL PRIMARY KEY,
  payment_id  UUID NOT NULL REFERENCES payments(id),
  event_type  VARCHAR(64) NOT NULL,    -- 'state_transition', 'processor_response', etc.
  from_status VARCHAR(20),
  to_status   VARCHAR(20),
  actor       VARCHAR(64),             -- 'user', 'system', 'fraud_engine', 'admin'
  metadata    JSONB,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

4 High-Level Architecture

Client (Web/Mobile)
    ↓  HTTPS
API Gateway (TLS termination, auth, rate limiting — <10 payments/min per user)
    ↓
Payment Service (validates request, creates payment_intent, routes to processor)
    ↓                          ↓
Payment Processor        Fraud Detection Service
(Stripe / Adyen)         (ML model: risk score)
    ↓                          ↓
Webhook Handler ←── Processor callback (async: payment succeeded/failed)
    ↓
Order Service (fulfills order on COMPLETED)
Notification Service (sends receipt)
Ledger Service (records double-entry bookkeeping)

Supporting Services:
- Payment Method Vault (tokenized card storage, PCI-DSS compliant)
- Reconciliation Service (nightly: match internal records vs bank statements)
- FX Rate Service (real-time exchange rates for multi-currency)
- Reporting Service (financial dashboards, tax reporting)

Why Two Notification Paths?

# Synchronous response to client (within 500ms):
# → "Payment submitted successfully, processing..."
# → Order ID, payment reference number

# Async webhook from processor (can take 5s–60s):
# → "Payment COMPLETED" or "Payment FAILED"
# → Update DB status, trigger fulfillment, send receipt

# The client CANNOT wait for the processor response — processors can be slow
# Design your UX for async: show "Processing..." → poll for status → show result
# Never design a payment flow where the user waits 30 seconds staring at a spinner

5 Idempotency — Full Implementation

def create_payment(request: PaymentRequest) -> PaymentResponse:
    # Step 1: Validate idempotency key
    if not request.idempotency_key:
        raise ValidationError("idempotency_key is required")

    # Step 2: Check for existing payment with this key (UPSERT pattern)
    with db.transaction():
        existing = db.query(
            "SELECT * FROM payments WHERE idempotency_key = %s FOR UPDATE",
            request.idempotency_key
        )
        if existing:
            # Client is retrying — return same result as first attempt
            return PaymentResponse.from_payment(existing)

        # Step 3: Create payment record FIRST (before calling processor)
        payment = db.insert(payments, {
            "idempotency_key": request.idempotency_key,
            "order_id": request.order_id,
            "user_id": request.user_id,
            "amount": request.amount,
            "currency": request.currency,
            "status": "initiated",
            "payment_method_id": request.payment_method_id
        })
    # DB insert committed — even if the next steps fail, we know this payment was attempted

    # Step 4: Run fraud check (synchronous, <50ms ML inference)
    risk = fraud_service.score(payment, request)
    if risk.score > 80:
        update_payment_status(payment.id, "needs_review")
        return PaymentResponse(status="needs_review", ...)

    # Step 5: Route to payment processor
    update_payment_status(payment.id, "processing")
    try:
        processor = select_processor(payment)   # load balance across Stripe, Adyen
        processor_response = processor.charge(
            amount=payment.amount,
            currency=payment.currency,
            payment_method=vault.get_token(payment.payment_method_id),
            idempotency_key=payment.id,   # use our internal ID as processor idempotency key
            metadata={"order_id": str(payment.order_id)}
        )
        update_payment_status(payment.id, "pending",
                              processor=processor.name,
                              processor_tx_id=processor_response.tx_id)
    except ProcessorTimeout:
        # Don't know if charge succeeded — mark as timed_out, reconciler will resolve
        update_payment_status(payment.id, "timed_out")
        reconciler.schedule_check(payment.id, delay=30)

    return PaymentResponse.from_payment(payment)
TIMED_OUT is not FAILED. When the processor times out, the charge may or may not have gone through. Never mark it as failed immediately — a reconciler job must query the processor to find the actual status. Prematurely marking as failed could allow a second charge attempt while the first succeeded.

6 Double-Entry Ledger

Every payment system needs a financial ledger. The industry-standard approach is double-entry bookkeeping: every transaction has equal debits and credits. This makes it impossible for money to appear or disappear — it must always come from somewhere and go somewhere.

CREATE TABLE ledger_entries (
  id              BIGSERIAL PRIMARY KEY,
  payment_id      UUID NOT NULL,
  account_id      BIGINT NOT NULL,    -- user account, merchant account, fee account, etc.
  entry_type      CHAR(1) NOT NULL,   -- 'D' (debit) or 'C' (credit)
  amount          NUMERIC(19,4) NOT NULL,
  currency        CHAR(3) NOT NULL,
  description     VARCHAR(256),
  created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Constraint: sum of all entries for a payment_id must be zero (debits = credits)

-- Example: user pays merchant $100, we take 3% ($3) fee
-- Debit user's account:    -$100.00
-- Credit merchant account:  +$97.00
-- Credit fee account:        +$3.00
-- Net: 0 (balanced)

INSERT INTO ledger_entries VALUES
  (payment_id, user_account_id,     'D', 100.00, 'USD', 'Order ORD-123'),
  (payment_id, merchant_account_id, 'C',  97.00, 'USD', 'Order ORD-123 settlement'),
  (payment_id, fee_account_id,      'C',   3.00, 'USD', 'Processing fee 3%');

-- Balance query (never store running balance — compute from ledger):
SELECT account_id,
       SUM(CASE WHEN entry_type = 'C' THEN amount ELSE -amount END) AS balance
FROM ledger_entries
WHERE account_id = $1
GROUP BY account_id;

-- Ledger is APPEND-ONLY. To reverse a transaction (refund): add OPPOSITE entries
-- Never UPDATE or DELETE ledger rows — immutability is the audit trail

7 Reconciliation

No matter how good your system is, discrepancies arise — network errors, processor bugs, clock skew, race conditions. Reconciliation is the process of comparing your internal ledger against the processor's records and the bank statement to find and fix discrepancies.

# Nightly reconciliation job:

def reconcile(date: date):
    # 1. Fetch all our COMPLETED payments for this date
    our_payments = db.query(
        "SELECT id, processor_tx_id, amount, status FROM payments WHERE DATE(completed_at) = %s",
        date
    )

    # 2. Fetch processor's settlement report (CSV/API)
    processor_settlements = stripe.get_settlement_report(date)
    # [{"tx_id": "ch_xxx", "amount": 100.00, "status": "succeeded"}, ...]

    # 3. Match by processor_tx_id
    our_tx_ids = {p.processor_tx_id: p for p in our_payments}
    processor_tx_ids = {s["tx_id"]: s for s in processor_settlements}

    discrepancies = []

    # Case A: In our DB but not in processor's report
    for tx_id, payment in our_tx_ids.items():
        if tx_id not in processor_tx_ids:
            discrepancies.append({
                "type": "missing_from_processor",
                "payment_id": payment.id,
                "action": "investigate"
            })

    # Case B: In processor's report but not in our DB
    for tx_id, settlement in processor_tx_ids.items():
        if tx_id not in our_tx_ids:
            discrepancies.append({
                "type": "missing_from_our_db",
                "processor_tx_id": tx_id,
                "amount": settlement["amount"],
                "action": "create_missing_record_or_initiate_refund"
            })

    # Case C: Amount mismatch
    for tx_id in our_tx_ids.keys() & processor_tx_ids.keys():
        if our_tx_ids[tx_id].amount != processor_tx_ids[tx_id]["amount"]:
            discrepancies.append({
                "type": "amount_mismatch",
                "payment_id": our_tx_ids[tx_id].id,
                "our_amount": our_tx_ids[tx_id].amount,
                "processor_amount": processor_tx_ids[tx_id]["amount"],
                "action": "escalate_to_finance_team"
            })

    reconciliation_report.save(date, discrepancies)
    if discrepancies:
        alert_finance_team(discrepancies)
Aim for <0.01% discrepancy rate. Most discrepancies are timing issues (transactions that cross midnight). A fully reconciled system is the gold standard of payment reliability.

8 Fraud Detection

# Rule-based checks (fast, deterministic, <5ms):
FRAUD_RULES = [
    {"name": "velocity_card",   "rule": "same_card_used > 3 times in 1 hour"},
    {"name": "velocity_device", "rule": "same_device > 10 transactions in 1 hour"},
    {"name": "high_value_new",  "rule": "amount > $1000 AND account_age_days < 7"},
    {"name": "geo_mismatch",    "rule": "billing_country != ip_country AND amount > $200"},
    {"name": "card_testing",    "rule": "3+ declined attempts in 10 minutes"},
]

# ML model scoring (async, <50ms for cached model inference):
# Features: transaction amount, merchant category, time of day, device fingerprint,
#           user's historical behaviour, card BIN (bank identification number), IP risk score
# Output: risk score 0-100
# Threshold: score > 80 → flag for manual review; score > 95 → auto-decline

def fraud_score(payment: Payment, context: RequestContext) -> FraudResult:
    # Rule checks (synchronous)
    for rule in FRAUD_RULES:
        if rule.matches(payment, context):
            return FraudResult(score=95, reason=rule.name, action="block")

    # ML model (synchronous, cached)
    features = feature_extractor.extract(payment, context)
    score = fraud_model.predict(features)   # <5ms for cached model
    action = "allow" if score < 60 else "review" if score < 80 else "block"
    return FraudResult(score=score, action=action)

# Chargeback feedback loop:
# When a chargeback is received → label that transaction as fraud → retrain model weekly

9 Scaling to 100K TPS

Database — The Bottleneck

# Challenge: payment DB needs ACID transactions + strong consistency → SQL
# PostgreSQL single primary: ~5K–10K writes/sec → not enough for 100K TPS

# Solution 1: Read replicas for reporting/reconciliation
# Write path (payment creation) → primary only
# Read path (transaction history, analytics) → 3–5 read replicas

# Solution 2: Horizontal sharding by user_id (or merchant_id)
# Shard 0: users 000–099M
# Shard 1: users 100–199M
# ...
# Each shard is a separate PostgreSQL cluster (primary + replicas)
# Route at application layer via consistent hash of user_id

# Solution 3: Command Query Responsibility Segregation (CQRS)
# Write model: PostgreSQL (ACID, normalized)
# Read model: Elasticsearch (denormalized, fast aggregations for dashboards)
# Sync via Debezium CDC (Change Data Capture) → Kafka → Elasticsearch

# At 100K TPS × 1KB/transaction → 100MB/sec writes
# Single PostgreSQL can handle this with NVMe SSD + enough RAM for WAL buffer

Payment Processor Load Balancing

# Don't rely on a single processor — spread risk and get better rates:
PROCESSOR_ROUTING = {
    "stripe":    {"weight": 0.5, "cards": ["visa", "mastercard", "amex"]},
    "adyen":     {"weight": 0.3, "cards": ["visa", "mastercard"], "regions": ["EU", "APAC"]},
    "braintree": {"weight": 0.2, "cards": ["visa", "mastercard", "paypal"]},
}

def select_processor(payment: Payment) -> Processor:
    # Primary routing: by currency/region for better conversion rates
    if payment.currency in ["EUR", "GBP"] and payment.user_region == "EU":
        return adyen   # Adyen has better EU bank relationship

    # Failover: if primary processor is down, route to secondary
    if not stripe.is_healthy():
        return adyen if payment.amount < 10000 else braintree

    # A/B test: slowly migrate from Stripe to Adyen for cost reduction
    if payment.user_id % 10 < 3:   # 30% of traffic
        return adyen

    return stripe

PCI-DSS Compliance — Never Store Raw Card Data

# NEVER store raw card numbers (PAN), CVV, or track data in your database
# Doing so requires PCI-DSS Level 1 compliance — extremely expensive audit

# Correct approach: tokenization
# 1. Client enters card details in a payment provider's hosted field (iframe from Stripe/Adyen)
# 2. Provider tokenizes the card → returns a token (e.g., "pm_xyz")
# 3. Your API receives only the token — never the raw card number
# 4. Store the token in your DB — this token is useless without Stripe's key
# 5. Future charges: pass token to Stripe — they handle the actual card

# Your system only stores: token, last4, card_brand, expiry_month, expiry_year
# This reduces your PCI scope dramatically (SAQ-A level vs Level 1)

10 Trade-offs & Design Decisions

  • Strong consistency over availability: For payments, CAP theorem choice is clear — consistency wins. A payment DB that's briefly unavailable is better than one that processes the same payment twice. Use synchronous replication (PostgreSQL streaming replication with synchronous_commit=on) even though it's slower.
  • Synchronous fraud check (not async): We run fraud scoring synchronously before sending to the processor, adding ~50ms. The alternative (score async and reverse if fraud detected) creates chargebacks — expensive and disruptive. The latency cost is worth it.
  • Postgres over NoSQL for the core payment table: Financial data has complex relationships (payments → refunds → ledger entries → reconciliation records) that benefit from foreign keys and JOIN queries. NoSQL's flexibility is a liability here — schema enforcement prevents data corruption.
  • Why not Kafka for the payment command itself?: The core payment creation uses synchronous DB writes, not Kafka. Kafka is used for downstream events (fulfillment, notifications, analytics) AFTER the payment is committed to DB. Putting the payment command on Kafka would make the user wait, lose synchronous error handling, and complicate idempotency.
  • Multi-processor strategy: Never be locked into one processor. Stripe has 99.99% uptime but does go down occasionally. Routing 30% to Adyen reduces revenue impact of any single provider outage and gives negotiating power on fees.

What to Study Next