Microservices vs Monolith

Architecture trade-offs, migration patterns, and when each approach actually makes sense

Microservices are one of the most over-discussed and under-examined topics in system design. Engineers frequently advocate for them without understanding the operational costs. This guide cuts through the hype and gives you an honest analysis of both architectures — what each is good for, when to switch, and how to migrate if you must.

The short version: start with a monolith, extract services only when you hit specific problems that microservices solve. But the long version is far more interesting.

1 The Two Architectures Defined

Monolith

Browser/App → Load Balancer → [Single Deployable Unit]
                                ├── User module
                                ├── Order module
                                ├── Payment module
                                ├── Notification module
                                └── Shared DB (PostgreSQL)

All code runs in a single process. Modules are separated by packages/namespaces, not network boundaries. One codebase, one deploy, one database.

Microservices

Browser/App → API Gateway → User Service     → users_db (PostgreSQL)
                          → Order Service    → orders_db (PostgreSQL)
                          → Payment Service  → payments_db (PostgreSQL)
                          → Notification Svc → (stateless, uses Kafka)
                          → Kafka (async event bus between services)

Each service is an independently deployable unit with its own database, codebase, and deployment pipeline. Services communicate over the network — REST, gRPC, or async message queues.

What's Between Them: The Modular Monolith

[Single Deployable Unit: Modular Monolith]
├── User Module    → own schema (users.*)
├── Order Module   → own schema (orders.*)
├── Payment Module → own schema (payments.*)
└── Shared DB, but with strict schema ownership per module

Characteristics:
- Modules communicate via in-process interfaces (not network calls)
- No shared data access across module boundaries (enforced by lint rules / ArchUnit)
- Easy to extract into microservices later because the boundaries are already clean

The Modular Monolith is often the best of both worlds: the deployment simplicity of a monolith with the code isolation of microservices. Shopify, Stack Overflow, and Basecamp run this way at massive scale.
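
To make "in-process interfaces" concrete, here is a minimal sketch of one module boundary. The names `UserDirectory`, `UserSummary`, `InMemoryUserModule`, and `OrderModule` are hypothetical and chosen for illustration; the in-memory dictionary stands in for the User module's own schema.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class UserSummary:
    """The only user data other modules are allowed to see."""
    user_id: int
    email: str


class UserDirectory(Protocol):
    """Public interface of the User module (hypothetical name)."""

    def get_user(self, user_id: int) -> UserSummary: ...


class InMemoryUserModule:
    """Stand-in for the User module; a real one would query its own users.* schema."""

    def __init__(self) -> None:
        self._rows = {1: UserSummary(1, "ada@example.com")}

    def get_user(self, user_id: int) -> UserSummary:
        return self._rows[user_id]


class OrderModule:
    """Depends on the UserDirectory interface, never on the users.* tables."""

    def __init__(self, users: UserDirectory) -> None:
        self._users = users

    def place_order(self, user_id: int, sku: str) -> dict:
        user = self._users.get_user(user_id)  # in-process call, no network hop
        return {"user_id": user.user_id, "sku": sku, "notify": user.email}


orders = OrderModule(InMemoryUserModule())
print(orders.place_order(1, "book-42"))
```

If the User module is later extracted, only the object passed into `OrderModule` needs to change (for example to an HTTP or gRPC client that satisfies the same interface); the Order code stays as it is.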

2 Trade-offs Side by Side

✅ Monolith Advantages

  • Simple local development — one repo, one server, one DB
  • No network overhead — in-process function calls are nanoseconds
  • No distributed transactions — ACID transactions across all modules
  • Easy debugging — single process, single trace
  • Straightforward deployment — ship one binary/container
  • Refactoring is cheap — rename across modules with IDE
  • No service discovery or inter-service load balancing to operate
  • One observability stack to manage

✅ Microservices Advantages

  • Independent deployability — ship User Service without touching Payment Service
  • Independent scaling — scale only the bottleneck service
  • Technology diversity — Python for ML service, Go for real-time, Java for core
  • Team autonomy — team owns their service end-to-end
  • Fault isolation — Payment Service crash doesn't kill User Service
  • Smaller codebases per team — easier onboarding
  • Different SLAs — payment service at 99.99%, reporting at 99.9%

❌ Monolith Disadvantages

  • Single point of deployment — one bug can kill everything
  • Scaling the whole app even if only one module needs it
  • Teams step on each other in large codebases
  • Technology lock-in — changing DB/language affects everything
  • Slow builds/tests at scale (10+ engineers on one codebase)

❌ Microservices Disadvantages

  • Network calls between services → latency + failure modes
  • Distributed transactions are hard (typically handled with the Saga pattern and compensating actions)
  • Operational overhead: K8s, service mesh, distributed tracing
  • Data consistency across services is eventual, not immediate
  • Testing end-to-end requires spinning up many services
  • Versioning APIs between services adds maintenance burden
  • Debugging across services requires distributed tracing (Jaeger/Zipkin)
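
The Saga pattern mentioned in the list above replaces one ACID transaction with a sequence of local steps, each paired with a compensating action that undoes it. The sketch below is an in-process illustration only: the step names, the `stock` and `orders` structures, and `run_saga` itself are assumptions made for the example, and a real saga would call other services and persist its progress.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"{name}: failed")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"{done_name}: compensated")
            return log
        completed.append(step)
        log.append(f"{name}: ok")
    return log


stock = {"book-42": 3}
orders: List[str] = []


def reserve_stock() -> None:
    stock["book-42"] -= 1


def release_stock() -> None:
    stock["book-42"] += 1


def create_order() -> None:
    orders.append("order-1")


def cancel_order() -> None:
    orders.pop()


def charge_card() -> None:
    raise RuntimeError("card declined")  # simulated Payment Service failure


def refund_card() -> None:
    pass  # never reached here because the charge itself failed


saga_log = run_saga([
    ("reserve_stock", reserve_stock, release_stock),
    ("create_order", create_order, cancel_order),
    ("charge_card", charge_card, refund_card),
])
print(saga_log)
print(stock, orders)
```

Between the failed charge and the last compensation, other readers can observe the reserved stock and the pending order. That window is the "eventual, not immediate" consistency listed above.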

3 When to Split a Monolith Into Microservices

Don't split for the sake of splitting. Split only when you have a concrete, measurable problem that microservices will solve.

# Signal 1: INDEPENDENT SCALING NEEDS
# "Our image processing module uses 95% CPU during peak, but everything else is idle"
# → Extract ImageProcessingService → scale it to 10 instances while keeping 2 for everything else
# Test: Can you solve this with vertical scaling or async queues first?

# Signal 2: INDEPENDENT DEPLOYMENT VELOCITY
# "Team A (payments) is blocked by Team B's (notifications) release schedule"
# "We ship to prod every 2 weeks because we're afraid of breaking something else"
# → Microservices let teams deploy independently
# Test: Is a modular monolith + better CI/CD sufficient?

# Signal 3: TECHNOLOGY MISMATCH
# "Our ML recommendation engine needs Python + PyTorch but the rest is Java"
# "Our real-time service needs Go's goroutines but our team writes Spring Boot"
# → Extract the service using the right technology

# Signal 4: FAULT ISOLATION IS CRITICAL
# "When our analytics module crashes it takes down checkout — unacceptable"
# → Extract analytics to a separate process so it can fail independently
# Test: Circuit breakers + bulkheads inside a monolith may be sufficient

# Signal 5: ORG STRUCTURE ALIGNMENT (Conway's Law)
# "We have 8 teams of 5 engineers and they can't all change the same codebase without constant conflicts"
# Conway's Law: systems reflect the org structure that builds them
# 8 teams → 8 services makes coordination implicit (API contracts) rather than explicit (meetings)

# DON'T SPLIT because:
# ✗ "Everyone is doing microservices"
# ✗ "It seems more scalable"
# ✗ "We want to be like Netflix/Amazon"
# ✗ The team is <10 engineers
# ✗ The product-market fit is not yet proven
The Fallacy of Scale: Netflix, Amazon, and Uber all started as monoliths. They migrated to microservices after they had the scale problems AND the engineering headcount to manage the complexity. At 10 engineers, microservices create more problems than they solve.
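
The test under Signal 4 suggests trying bulkheads inside the monolith before extracting a service. A minimal sketch, assuming a thread-based server: the `Bulkhead` class and `record_event` function are hypothetical, and the limit of 1 is only there to make the demonstration short. A bulkhead caps how many concurrent calls a module can hold, so a slow analytics module cannot use up every worker; it does not protect against a crash of the whole process, which is the case where extraction is still needed.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slots."""


class Bulkhead:
    """Caps concurrent calls into one module so it cannot starve the others."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing behind a slow module
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("module is at capacity")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


analytics = Bulkhead(max_concurrent=1)


def record_event(name: str) -> str:
    return f"recorded {name}"


print(analytics.call(record_event, "page_view"))
```

Checkout code would catch `BulkheadFullError` and carry on without analytics rather than wait for it.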

The Rule of Thumb: 2-Pizza Team Per Service

# Amazon's "2-pizza team" rule: a team should be small enough to feed with two pizzas
# (roughly 6–8 people). Applied to services: if a service needs more than one such team
# to understand and operate it → it's too large → split
#
# Flip side: if a service only has 1 engineer → too small → merge
# (one engineer is an on-call rotation of one, deployment risk, bus factor 1)
#
# Good microservice:
# - 3–6 engineers own and operate it end-to-end
# - API changes can be proposed by the team autonomously
# - The team can deploy to prod without coordinating with other teams

4 Migrating: The Strangler Fig Pattern

Never do a "big bang" rewrite — it almost always fails. Use the Strangler Fig pattern: gradually route traffic from the monolith to new services while both exist in parallel.

# Strangler Fig — step by step

# Step 1: Identify the bounded context to extract (one module at a time)
# Best first candidates: high-traffic, independently scalable, clear API boundary
# Worst first candidates: deeply coupled modules with many shared DB tables

# Step 2: Define the service API before writing any code
# What endpoints does this service expose?
# What events does it publish/consume?
# What data does it own exclusively?

# Step 3: Deploy the new service alongside the monolith
# New service has its OWN database (even if it starts as a replicated copy)
# No shared DB between the new service and monolith — this is non-negotiable

# Step 4: Place a proxy/facade in front of both
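class ServiceRouter:
    """Facade that sends each request to the new service or to the monolith."""

    def __init__(self, feature_flag_client, user_service_client, monolith_db):
        self.flags = feature_flag_client
        self.user_service = user_service_client  # client for the new microservice
        self.monolith_db = monolith_db           # legacy data access

    def route_get_user(self, user_id: int):
        # The flag is evaluated per user so the rollout percentage stays stable
        if self.flags.get("use_user_service", user_id=user_id):
            return self.user_service.get_user(user_id)  # new microservice
        # Parameters are passed separately, never interpolated into the SQL string
        return self.monolith_db.query("SELECT * FROM users WHERE id = %s", (user_id,))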

# Step 5: Roll out gradually (1% → 5% → 25% → 50% → 100%)
# Feature flags control rollout speed
# Monitor error rates and latency at each stage
# Roll back instantly if degradation detected

# Step 6: Once 100% traffic on new service, delete the monolith code for that module
# Don't leave "zombie code" in the monolith; unused code paths become technical debt immediately

# Step 7: Repeat for next module
# Order: low coupling first → high coupling last
# Typical migration: 12–24 months for a medium-sized monolith
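
The percentage rollout in Step 5 depends on each user landing on the same side of the flag every time, and on users who were moved at 5% staying moved at 25%. One common way to get both properties is to hash the flag name and user ID into a fixed bucket. This is a sketch under that assumption; `rollout_bucket` and `is_enabled` are hypothetical helpers, not the API of any particular feature-flag product.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: int) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name: str, user_id: int, percent: int) -> bool:
    """True when the user falls inside the current rollout percentage."""
    return rollout_bucket(flag_name, user_id) < percent


# Roughly a quarter of users should be routed to the new service at 25%
enabled = sum(is_enabled("use_user_service", uid, 25) for uid in range(10_000))
print(f"{enabled} of 10000 users routed to the new service")
```

Including the flag name in the hash keeps rollouts independent of each other, so the same users are not always first in line for every migration.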

Database Decomposition — The Hardest Part

# Problem: Monolith has ONE shared DB. Services need their OWN DB.
# You can't just split tables — the monolith's JOIN queries will break.

# Strategy 1: Sync via events (recommended for most cases)
# Phase 1: New service writes to its own DB. Also publishes events.
# Phase 2: Monolith consumes events to update its own DB for the transition period.
# Phase 3: Once monolith no longer reads the extracted table, remove it from shared DB.

# Strategy 2: Read replica transition
# New service reads from a read replica of the shared DB (read-only).
# New service writes go to its own new DB.
# Sync writes back to shared DB via event (temporarily).
# Cut over once the shared DB copy is no longer needed.

# Strategy 3: Dual-write (dangerous but sometimes necessary)
# Write to BOTH shared DB and new service DB simultaneously.
# Compare results to verify correctness.
# Switch reads to new service once verified.
# Risk: divergence between the two copies → use checksums/reconciler
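
The checksum-based reconciler mentioned under Strategy 3 can be sketched as follows. The example assumes both copies can be read into dictionaries keyed by primary key; `row_checksum`, `find_divergence`, and the sample rows are illustrative. At real table sizes the comparison would usually run in batches or on per-range checksums.

```python
import hashlib
import json
from typing import Dict, List


def row_checksum(row: dict) -> str:
    """Stable hash of a row; keys are sorted so column order does not matter."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()


def find_divergence(shared: Dict[int, dict], service: Dict[int, dict]) -> Dict[str, List[int]]:
    """Compare two copies keyed by primary key and report where they differ."""
    shared_ids, service_ids = set(shared), set(service)
    common = shared_ids & service_ids
    return {
        "missing_in_service": sorted(shared_ids - service_ids),
        "missing_in_shared": sorted(service_ids - shared_ids),
        "mismatched": sorted(
            pk for pk in common
            if row_checksum(shared[pk]) != row_checksum(service[pk])
        ),
    }


shared_db = {
    1: {"id": 1, "email": "a@example.com"},
    2: {"id": 2, "email": "b@example.com"},
    3: {"id": 3, "email": "c@example.com"},
}
service_db = {
    1: {"email": "a@example.com", "id": 1},      # same data, different column order
    2: {"id": 2, "email": "b@new.example.com"},  # diverged write
}

report = find_divergence(shared_db, service_db)
print(report)
```

Switching reads to the new service is only safe once a report like this stays empty over a sustained period of dual-writing.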

5 Critical Microservices Patterns

API Gateway — Single Entry Point

# API Gateway responsibilities:
# - Request routing (auth requests → AuthService, orders → OrderService)
# - Authentication/Authorization (validate JWT before forwarding)
# - Rate limiting (per-client quotas)
# - Request aggregation (combine responses from multiple services)
# - SSL termination
# - Logging and tracing injection (add X-Request-ID to every request)

# Tools: Kong, AWS API Gateway, Nginx, Envoy, Traefik

Client → API Gateway (Kong / AWS API Gateway)
              ├── /api/auth/* → Auth Service
              ├── /api/users/* → User Service
              ├── /api/orders/* → Order Service
              └── /api/products/* → Product Service
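
The routing table above, plus the request-ID injection from the responsibilities list, can be sketched in a few lines. This is an illustration of the idea, not how Kong or any other gateway is configured: the upstream names and both helper functions are assumptions, and the header lookup is case-sensitive for brevity even though real HTTP header names are not.

```python
import uuid
from typing import Dict, Optional

ROUTES: Dict[str, str] = {
    "/api/auth/": "auth-service",
    "/api/users/": "user-service",
    "/api/orders/": "order-service",
    "/api/products/": "product-service",
}


def resolve_upstream(path: str, routes: Dict[str, str] = ROUTES) -> Optional[str]:
    """Return the upstream for the longest matching prefix, or None if nothing matches."""
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    if not matches:
        return None
    return routes[max(matches, key=len)]


def with_request_id(headers: Dict[str, str]) -> Dict[str, str]:
    """Keep an incoming X-Request-ID, otherwise mint a new one."""
    forwarded = dict(headers)
    forwarded.setdefault("X-Request-ID", str(uuid.uuid4()))
    return forwarded


print(resolve_upstream("/api/orders/42"))
print(resolve_upstream("/health"))
print(with_request_id({"Accept": "application/json"}))
```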

Service-to-Service Communication

# Option A: Synchronous REST/gRPC (request-response)
# Use when: caller needs an immediate response to proceed
# Risk: if Order Service calls Payment Service synchronously,
#        Payment Service downtime → Order Service fails too (cascading failure)
# Mitigation: Circuit Breaker pattern

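# Circuit Breaker (minimal Python sketch; counts consecutive failures, not thread-safe):
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    CLOSED = "closed"        # normal; requests go through
    OPEN = "open"            # fast-fail; no requests sent to downstream
    HALF_OPEN = "half-open"  # probe: send one request to test recovery

    def __init__(self, threshold=5, reset_timeout=30.0):
        self.threshold = threshold          # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before probing
        self.state = self.CLOSED
        self.failure_count = 0
        self.reset_time = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == self.OPEN:
            if time.monotonic() >= self.reset_time:
                self.state = self.HALF_OPEN  # let one probe request through
            else:
                raise CircuitOpenError("Payment Service unavailable")  # fast-fail

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success, including a half-open probe, closes the circuit
        self.failure_count = 0
        self.state = self.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        # A failed probe reopens immediately; otherwise open at the threshold
        if self.state == self.HALF_OPEN or self.failure_count >= self.threshold:
            self.state = self.OPEN
            self.reset_time = time.monotonic() + self.reset_timeout  # try again after the timeout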

# Option B: Asynchronous events via Kafka (fire-and-forget)
# Use when: caller doesn't need immediate response
# Order Service publishes "OrderCreated" event to Kafka
# Payment Service consumes it → processes payment → publishes "PaymentCompleted"
# Order Service consumes "PaymentCompleted" → updates order status
# Decoupled: if Payment Service is down → events queue up in Kafka → processed on recovery
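
The decoupling described in Option B can be shown with a toy in-memory bus. This is not Kafka's API: `InMemoryBus` is a hypothetical stand-in that removes events once they are handled, whereas Kafka retains them and tracks an offset per consumer. It only illustrates that orders are still accepted while the Payment Service is down, and that the backlog is processed on recovery.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class InMemoryBus:
    """Toy stand-in for a durable log: events wait in a queue until consumed."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[dict]] = defaultdict(deque)

    def publish(self, topic: str, event: dict) -> None:
        self._queues[topic].append(event)

    def drain(self, topic: str, handler: Callable[[dict], None]) -> int:
        """Deliver every pending event on the topic; return how many were handled."""
        handled = 0
        queue = self._queues[topic]
        while queue:
            handler(queue.popleft())
            handled += 1
        return handled


bus = InMemoryBus()
order_status: Dict[int, str] = {}


def create_order(order_id: int, amount_cents: int) -> None:
    # Order Service: record the order and announce it, without calling Payment
    order_status[order_id] = "pending"
    bus.publish("OrderCreated", {"order_id": order_id, "amount_cents": amount_cents})


def payment_handler(event: dict) -> None:
    # Payment Service: charge (omitted here), then announce the result
    bus.publish("PaymentCompleted", {"order_id": event["order_id"]})


def order_handler(event: dict) -> None:
    # Order Service: react to the payment result
    order_status[event["order_id"]] = "paid"


# Payment Service is "down": orders are still accepted and events queue up
create_order(1, 1999)
create_order(2, 500)
print(order_status)

# Payment Service recovers and works through the backlog
bus.drain("OrderCreated", payment_handler)
bus.drain("PaymentCompleted", order_handler)
print(order_status)
```

The cost of this decoupling is that the caller sees "pending" first and "paid" later, so the user interface and downstream logic have to tolerate that intermediate state.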

Distributed Tracing

# Problem: User reports "checkout is slow" — which service is the bottleneck?
# In monolith: single thread trace, profiling is straightforward
# In microservices: request touches 6 services — need distributed trace

# Solution: OpenTelemetry (industry standard)
# Every service propagates trace context in HTTP headers:
# traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#               version-trace_id-parent_span_id-flags

# Example trace (Jaeger/Zipkin visualization):
# Request /checkout → total: 450ms
#  ├── API Gateway: 2ms
#  ├── Order Service: 3ms
#  │   ├── DB query: 2ms
#  ├── Payment Service: 380ms   ← BOTTLENECK
#  │   ├── Fraud check: 50ms
#  │   ├── Stripe API call: 320ms  ← root cause: external API slow
#  ├── Notification Service: 5ms (async, doesn't block response)
#  └── Response: 10ms
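
The `traceparent` header shown above has four hyphen-separated hex fields. In practice an OpenTelemetry SDK creates and propagates it for you; the sketch below only illustrates what propagation means at the header level. `parse_traceparent` and `child_traceparent` are hypothetical helpers, and the validation is simplified (for example, it does not reject all-zero IDs).

```python
import re
import secrets
from typing import NamedTuple

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(header: str) -> TraceContext:
    """Split a traceparent header into its four fields, or raise ValueError."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    return TraceContext(*match.groups())


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace_id, fresh span ID as the new parent."""
    ctx = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return f"{ctx.version}-{ctx.trace_id}-{new_span_id}-{ctx.flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(incoming))
print(child_traceparent(incoming))
```

Because every hop keeps the same `trace_id`, a backend such as Jaeger or Zipkin can stitch the spans from all services into the single tree shown in the example trace.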

6 The Decision Framework

| Scenario | Recommendation | Why |
|---|---|---|
| New product / startup / <10 engineers | Monolith | Speed to market. Don't pay microservices tax before you have the problem. |
| 10–50 engineers, product-market fit found | Modular Monolith | Clean boundaries without network overhead. Extract when needed. |
| 50+ engineers, multiple independent teams | Microservices | Conway's Law: teams need independent deploy pipelines. |
| One module needs 10× more resources | Extract that service | Targeted extraction. Keep everything else in the monolith. |
| Frequent deploys blocked by team coordination | Extract services | Independent deployability is microservices' core value prop. |
| Need Python/Go/Rust for specific workload | Polyglot service | Extract only the service that needs a different tech stack. |

What to Say in a System Design Interview

# Interviewer: "Would you use microservices or a monolith?"

# Good answer:
"I'd start with a modular monolith. At the scale we're designing for — say 100K DAU —
 a well-structured monolith with clean module boundaries is faster to build, easier to debug,
 and doesn't require Kubernetes or service mesh infrastructure.

 I'd define bounded contexts now (User, Order, Payment, Notification) — each module owns its
 schema and doesn't share data with others. When we hit a specific problem —
 say the Notification module is causing memory spikes that affect checkout — I'd extract it
 using the Strangler Fig pattern with a Kafka event boundary.

 If this were a 100-engineer org where multiple teams are deploying independently, I'd
 recommend microservices with an API Gateway, circuit breakers between services, and
 distributed tracing with OpenTelemetry from day one."

# Mentioning these signals separates good candidates from great ones:
# ✓ Conway's Law
# ✓ Strangler Fig pattern (shows you know HOW to migrate, not just WHEN)
# ✓ Circuit breaker (shows you understand failure modes)
# ✓ Modular monolith as a middle ground (shows nuance)
