Payment systems are the most correctness-critical systems you'll ever design. A bug in a social feed might annoy users; a bug in a payment system can cause double charges, lost money, or regulatory fines. This walkthrough covers the design principles that make payments reliable — idempotency, exactly-once semantics, reconciliation, and fraud detection — at 100K transactions per second.
The fundamental challenge in payment systems: what happens when a network request to the payment processor times out? You don't know if the charge succeeded or failed. If you retry, you might double-charge. If you don't retry, you might miss a failed charge.
The solution: idempotency keys.
# Without idempotency: # 1. Client sends "charge $100" to Payment Service # 2. Payment Service sends to Stripe → network timeout # 3. Did Stripe charge the card? Unknown. # 4. If retry: Stripe might process TWO charges → customer billed twice # 5. If no retry: if Stripe DID succeed but we didn't record it → order never fulfilled # With idempotency key: # 1. Client generates a unique idempotency_key (UUID) for this payment attempt # 2. Sends "charge $100, idempotency_key=abc-123" # 3. Payment Service records intent in DB: INSERT INTO payment_intents (key='abc-123', status='pending') # 4. Sends to Stripe with the SAME key: stripe.charge(idempotency_key='abc-123', amount=100) # 5. Network timeout — retry with SAME key # 6. Stripe: "I already processed abc-123 — returning same result" (no duplicate charge) # 7. Payment Service: re-query Stripe for status of abc-123 → gets the result # Stripe, Adyen, Braintree all support idempotency keys natively # Your own payment service must ALSO be idempotent (client may retry your API)
A payment goes through well-defined states. Never skip states, never go backwards (except for explicit refunds). State transitions are atomic DB writes.
-- Payment record (state stored in DB, transitions are audit-logged): CREATE TABLE payments ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), idempotency_key VARCHAR(128) UNIQUE NOT NULL, -- client-provided dedup key order_id UUID NOT NULL, user_id BIGINT NOT NULL, amount NUMERIC(19,4) NOT NULL, -- cents to avoid float issues currency CHAR(3) NOT NULL DEFAULT 'USD', status VARCHAR(20) NOT NULL DEFAULT 'initiated', payment_method_id UUID, -- tokenized card/wallet reference processor VARCHAR(32), -- 'stripe', 'adyen', 'braintree' processor_tx_id VARCHAR(256), -- provider's transaction ID processor_status VARCHAR(64), -- raw status from provider risk_score SMALLINT, -- 0-100, from fraud detection created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), completed_at TIMESTAMPTZ ); CREATE INDEX idx_payments_order ON payments(order_id); CREATE INDEX idx_payments_user ON payments(user_id, created_at DESC); CREATE INDEX idx_payments_idempot ON payments(idempotency_key); -- Immutable audit log (append-only — never update or delete): CREATE TABLE payment_events ( id BIGSERIAL PRIMARY KEY, payment_id UUID NOT NULL REFERENCES payments(id), event_type VARCHAR(64) NOT NULL, -- 'state_transition', 'processor_response', etc. from_status VARCHAR(20), to_status VARCHAR(20), actor VARCHAR(64), -- 'user', 'system', 'fraud_engine', 'admin' metadata JSONB, created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() );
Client (Web/Mobile)
↓ HTTPS
API Gateway (TLS termination, auth, rate limiting — <10 payments/min per user)
↓
Payment Service (validates request, creates payment_intent, routes to processor)
↓ ↓
Payment Processor Fraud Detection Service
(Stripe / Adyen) (ML model: risk score)
↓ ↓
Webhook Handler ←── Processor callback (async: payment succeeded/failed)
↓
Order Service (fulfills order on COMPLETED)
Notification Service (sends receipt)
Ledger Service (records double-entry bookkeeping)
Supporting Services:
- Payment Method Vault (tokenized card storage, PCI-DSS compliant)
- Reconciliation Service (nightly: match internal records vs bank statements)
- FX Rate Service (real-time exchange rates for multi-currency)
- Reporting Service (financial dashboards, tax reporting)
# Synchronous response to client (within 500ms): # → "Payment submitted successfully, processing..." # → Order ID, payment reference number # Async webhook from processor (can take 5s–60s): # → "Payment COMPLETED" or "Payment FAILED" # → Update DB status, trigger fulfillment, send receipt # The client CANNOT wait for the processor response — processors can be slow # Design your UX for async: show "Processing..." → poll for status → show result # Never design a payment flow where the user waits 30 seconds staring at a spinner
def create_payment(request: PaymentRequest) -> PaymentResponse:
# Step 1: Validate idempotency key
if not request.idempotency_key:
raise ValidationError("idempotency_key is required")
# Step 2: Check for existing payment with this key (UPSERT pattern)
with db.transaction():
existing = db.query(
"SELECT * FROM payments WHERE idempotency_key = %s FOR UPDATE",
request.idempotency_key
)
if existing:
# Client is retrying — return same result as first attempt
return PaymentResponse.from_payment(existing)
# Step 3: Create payment record FIRST (before calling processor)
payment = db.insert(payments, {
"idempotency_key": request.idempotency_key,
"order_id": request.order_id,
"user_id": request.user_id,
"amount": request.amount,
"currency": request.currency,
"status": "initiated",
"payment_method_id": request.payment_method_id
})
# DB insert committed — even if the next steps fail, we know this payment was attempted
# Step 4: Run fraud check (synchronous, <50ms ML inference)
risk = fraud_service.score(payment, request)
if risk.score > 80:
update_payment_status(payment.id, "needs_review")
return PaymentResponse(status="needs_review", ...)
# Step 5: Route to payment processor
update_payment_status(payment.id, "processing")
try:
processor = select_processor(payment) # load balance across Stripe, Adyen
processor_response = processor.charge(
amount=payment.amount,
currency=payment.currency,
payment_method=vault.get_token(payment.payment_method_id),
idempotency_key=payment.id, # use our internal ID as processor idempotency key
metadata={"order_id": str(payment.order_id)}
)
update_payment_status(payment.id, "pending",
processor=processor.name,
processor_tx_id=processor_response.tx_id)
except ProcessorTimeout:
# Don't know if charge succeeded — mark as timed_out, reconciler will resolve
update_payment_status(payment.id, "timed_out")
reconciler.schedule_check(payment.id, delay=30)
return PaymentResponse.from_payment(payment)
Every payment system needs a financial ledger. The industry-standard approach is double-entry bookkeeping: every transaction has equal debits and credits. This makes it impossible for money to appear or disappear — it must always come from somewhere and go somewhere.
CREATE TABLE ledger_entries (
id BIGSERIAL PRIMARY KEY,
payment_id UUID NOT NULL,
account_id BIGINT NOT NULL, -- user account, merchant account, fee account, etc.
entry_type CHAR(1) NOT NULL, -- 'D' (debit) or 'C' (credit)
amount NUMERIC(19,4) NOT NULL,
currency CHAR(3) NOT NULL,
description VARCHAR(256),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Constraint: sum of all entries for a payment_id must be zero (debits = credits)
-- Example: user pays merchant $100, we take 3% ($3) fee
-- Debit user's account: -$100.00
-- Credit merchant account: +$97.00
-- Credit fee account: +$3.00
-- Net: 0 (balanced)
INSERT INTO ledger_entries VALUES
(payment_id, user_account_id, 'D', 100.00, 'USD', 'Order ORD-123'),
(payment_id, merchant_account_id, 'C', 97.00, 'USD', 'Order ORD-123 settlement'),
(payment_id, fee_account_id, 'C', 3.00, 'USD', 'Processing fee 3%');
-- Balance query (never store running balance — compute from ledger):
SELECT account_id,
SUM(CASE WHEN entry_type = 'C' THEN amount ELSE -amount END) AS balance
FROM ledger_entries
WHERE account_id = $1
GROUP BY account_id;
-- Ledger is APPEND-ONLY. To reverse a transaction (refund): add OPPOSITE entries
-- Never UPDATE or DELETE ledger rows — immutability is the audit trail
No matter how good your system is, discrepancies arise — network errors, processor bugs, clock skew, race conditions. Reconciliation is the process of comparing your internal ledger against the processor's records and the bank statement to find and fix discrepancies.
# Nightly reconciliation job:
def reconcile(date: date):
# 1. Fetch all our COMPLETED payments for this date
our_payments = db.query(
"SELECT id, processor_tx_id, amount, status FROM payments WHERE DATE(completed_at) = %s",
date
)
# 2. Fetch processor's settlement report (CSV/API)
processor_settlements = stripe.get_settlement_report(date)
# [{"tx_id": "ch_xxx", "amount": 100.00, "status": "succeeded"}, ...]
# 3. Match by processor_tx_id
our_tx_ids = {p.processor_tx_id: p for p in our_payments}
processor_tx_ids = {s["tx_id"]: s for s in processor_settlements}
discrepancies = []
# Case A: In our DB but not in processor's report
for tx_id, payment in our_tx_ids.items():
if tx_id not in processor_tx_ids:
discrepancies.append({
"type": "missing_from_processor",
"payment_id": payment.id,
"action": "investigate"
})
# Case B: In processor's report but not in our DB
for tx_id, settlement in processor_tx_ids.items():
if tx_id not in our_tx_ids:
discrepancies.append({
"type": "missing_from_our_db",
"processor_tx_id": tx_id,
"amount": settlement["amount"],
"action": "create_missing_record_or_initiate_refund"
})
# Case C: Amount mismatch
for tx_id in our_tx_ids.keys() & processor_tx_ids.keys():
if our_tx_ids[tx_id].amount != processor_tx_ids[tx_id]["amount"]:
discrepancies.append({
"type": "amount_mismatch",
"payment_id": our_tx_ids[tx_id].id,
"our_amount": our_tx_ids[tx_id].amount,
"processor_amount": processor_tx_ids[tx_id]["amount"],
"action": "escalate_to_finance_team"
})
reconciliation_report.save(date, discrepancies)
if discrepancies:
alert_finance_team(discrepancies)
# Rule-based checks (fast, deterministic, <5ms):
FRAUD_RULES = [
{"name": "velocity_card", "rule": "same_card_used > 3 times in 1 hour"},
{"name": "velocity_device", "rule": "same_device > 10 transactions in 1 hour"},
{"name": "high_value_new", "rule": "amount > $1000 AND account_age_days < 7"},
{"name": "geo_mismatch", "rule": "billing_country != ip_country AND amount > $200"},
{"name": "card_testing", "rule": "3+ declined attempts in 10 minutes"},
]
# ML model scoring (async, <50ms for cached model inference):
# Features: transaction amount, merchant category, time of day, device fingerprint,
# user's historical behaviour, card BIN (bank identification number), IP risk score
# Output: risk score 0-100
# Threshold: score > 80 → flag for manual review; score > 95 → auto-decline
def fraud_score(payment: Payment, context: RequestContext) -> FraudResult:
# Rule checks (synchronous)
for rule in FRAUD_RULES:
if rule.matches(payment, context):
return FraudResult(score=95, reason=rule.name, action="block")
# ML model (synchronous, cached)
features = feature_extractor.extract(payment, context)
score = fraud_model.predict(features) # <5ms for cached model
action = "allow" if score < 60 else "review" if score < 80 else "block"
return FraudResult(score=score, action=action)
# Chargeback feedback loop:
# When a chargeback is received → label that transaction as fraud → retrain model weekly
# Challenge: payment DB needs ACID transactions + strong consistency → SQL # PostgreSQL single primary: ~5K–10K writes/sec → not enough for 100K TPS # Solution 1: Read replicas for reporting/reconciliation # Write path (payment creation) → primary only # Read path (transaction history, analytics) → 3–5 read replicas # Solution 2: Horizontal sharding by user_id (or merchant_id) # Shard 0: users 000–099M # Shard 1: users 100–199M # ... # Each shard is a separate PostgreSQL cluster (primary + replicas) # Route at application layer via consistent hash of user_id # Solution 3: Command Query Responsibility Segregation (CQRS) # Write model: PostgreSQL (ACID, normalized) # Read model: Elasticsearch (denormalized, fast aggregations for dashboards) # Sync via Debezium CDC (Change Data Capture) → Kafka → Elasticsearch # At 100K TPS × 1KB/transaction → 100MB/sec writes # Single PostgreSQL can handle this with NVMe SSD + enough RAM for WAL buffer
# Don't rely on a single processor — spread risk and get better rates:
PROCESSOR_ROUTING = {
"stripe": {"weight": 0.5, "cards": ["visa", "mastercard", "amex"]},
"adyen": {"weight": 0.3, "cards": ["visa", "mastercard"], "regions": ["EU", "APAC"]},
"braintree": {"weight": 0.2, "cards": ["visa", "mastercard", "paypal"]},
}
def select_processor(payment: Payment) -> Processor:
# Primary routing: by currency/region for better conversion rates
if payment.currency in ["EUR", "GBP"] and payment.user_region == "EU":
return adyen # Adyen has better EU bank relationship
# Failover: if primary processor is down, route to secondary
if not stripe.is_healthy():
return adyen if payment.amount < 10000 else braintree
# A/B test: slowly migrate from Stripe to Adyen for cost reduction
if payment.user_id % 10 < 3: # 30% of traffic
return adyen
return stripe
# NEVER store raw card numbers (PAN), CVV, or track data in your database # Doing so requires PCI-DSS Level 1 compliance — extremely expensive audit # Correct approach: tokenization # 1. Client enters card details in a payment provider's hosted field (iframe from Stripe/Adyen) # 2. Provider tokenizes the card → returns a token (e.g., "pm_xyz") # 3. Your API receives only the token — never the raw card number # 4. Store the token in your DB — this token is useless without Stripe's key # 5. Future charges: pass token to Stripe — they handle the actual card # Your system only stores: token, last4, card_brand, expiry_month, expiry_year # This reduces your PCI scope dramatically (SAQ-A level vs Level 1)
synchronous_commit=on) even though it's slower.