Microservices are one of the most over-discussed and under-examined topics in system design. Engineers frequently advocate for them without understanding the operational costs. This guide cuts through the hype and gives you an honest analysis of both architectures — what each is good for, when to switch, and how to migrate if you must.
The short version: start with a monolith, extract services only when you hit specific problems that microservices solve. But the long version is far more interesting.
In a monolith, all code runs in a single process. Modules are separated by packages/namespaces, not network boundaries. One codebase, one deploy, one database.
In a microservices architecture, each service is an independently deployable unit with its own database, codebase, and deployment pipeline. Services communicate over the network: REST, gRPC, or async message queues.
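To make the difference concrete, here is a minimal sketch of the same lookup crossing a module boundary (monolith) and a network boundary (microservice). The names `UserModule`, `RemoteUserClient`, and `fake_user_service` are hypothetical, and the `transport` callable is only a stand-in for an HTTP or gRPC call so the example runs without a network.

```python
import json
from typing import Callable, Protocol


class UserReader(Protocol):
    def get_user(self, user_id: int) -> dict: ...


class UserModule:
    """Monolith style: an in-process module, called like any other object."""

    def __init__(self, rows: dict):
        self._rows = rows

    def get_user(self, user_id: int) -> dict:
        return self._rows[user_id]


class RemoteUserClient:
    """Microservice style: same interface, but each call is serialized and
    sent over a transport that can be slow or fail."""

    def __init__(self, transport: Callable[[str], str]):
        self._transport = transport  # hypothetical stand-in for HTTP/gRPC

    def get_user(self, user_id: int) -> dict:
        request = json.dumps({"op": "get_user", "user_id": user_id})
        response = self._transport(request)  # may raise on timeout or outage
        return json.loads(response)


def fake_user_service(request: str) -> str:
    """Simulates the remote service's handler for this sketch."""
    payload = json.loads(request)
    return json.dumps({"id": payload["user_id"], "name": "Ada"})


def greeting(users: UserReader, user_id: int) -> str:
    # Caller code is identical; only latency and failure modes differ.
    return f"Hello, {users.get_user(user_id)['name']}"
```

The calling code does not change, which is why a modular monolith with clean interfaces keeps later extraction cheap. What changes is everything around the call: serialization, timeouts, retries, and partial failure.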
Don't split for the sake of splitting. Split only when you have a concrete, measurable problem that microservices will solve.
# Signal 1: INDEPENDENT SCALING NEEDS
# "Our image processing module uses 95% CPU during peak, but everything else is idle"
# → Extract ImageProcessingService → scale it to 10 instances while keeping 2 for everything else
# Test: Can you solve this with vertical scaling or async queues first?
#
# Signal 2: INDEPENDENT DEPLOYMENT VELOCITY
# "Team A (payments) is blocked by Team B's (notifications) release schedule"
# "We ship to prod every 2 weeks because we're afraid of breaking something else"
# → Microservices let teams deploy independently
# Test: Is a modular monolith + better CI/CD sufficient?
#
# Signal 3: TECHNOLOGY MISMATCH
# "Our ML recommendation engine needs Python + PyTorch but the rest is Java"
# "Our real-time service needs Go's goroutines but our team writes Spring Boot"
# → Extract the service using the right technology
#
# Signal 4: FAULT ISOLATION IS CRITICAL
# "When our analytics module crashes it takes down checkout: unacceptable"
# → Extract analytics to a separate process so it can fail independently
# Test: Circuit breakers + bulkheads inside a monolith may be sufficient
#
# Signal 5: ORG STRUCTURE ALIGNMENT (Conway's Law)
# "We have 8 teams of 5 engineers; they can't all commit to the same monorepo without conflicts"
# Conway's Law: systems reflect the org structure that builds them
# 8 teams → 8 services makes coordination implicit (API contracts) rather than explicit (meetings)
#
# DON'T SPLIT because:
# ✗ "Everyone is doing microservices"
# ✗ "It seems more scalable"
# ✗ "We want to be like Netflix/Amazon"
# ✗ The team is <10 engineers
# ✗ The product-market fit is not yet proven
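The "Test" under Signal 4 mentions bulkheads inside a monolith. A minimal sketch of that idea, assuming a thread-based server: cap how many concurrent calls one module may hold so it cannot starve the rest of the process. `Bulkhead` and `BulkheadFullError` are hypothetical names, not a library API.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a module has used up its concurrency budget."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args):
        # Non-blocking acquire: reject immediately instead of queueing,
        # so a saturated module cannot tie up threads that checkout needs.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} is at capacity")
        try:
            return func(*args)
        finally:
            self._slots.release()


# Assumed budget for illustration: analytics may use at most 2 threads at once.
analytics_bulkhead = Bulkhead("analytics", max_concurrent=2)
```

This only limits resource exhaustion (threads, connections). It does not help if the module can crash the whole process, for example through a memory leak or a native-code fault; in that case a separate process is the stronger isolation.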
# Amazon's "2-pizza team" rule: if a service needs more than 2 pizzas worth of engineers
# (6–8 people) to understand and operate it → it's too large → split
#
# Flip side: if a service only has 1 engineer → too small → merge
# (one engineer is an on-call rotation of one, deployment risk, bus factor 1)
#
# Good microservice:
# - 3–6 engineers own and operate it end-to-end
# - API changes can be proposed by the team autonomously
# - The team can deploy to prod without coordinating with other teams
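The sizing heuristic above can be written down as a small check. This is only a sketch of the rule as stated here; `ownership_check` is a hypothetical helper, and the "borderline" branch is an assumption for team sizes the rule does not explicitly cover.

```python
def ownership_check(engineers: int) -> str:
    """Classify a service by the number of engineers who own and operate it."""
    if engineers < 1:
        raise ValueError("a service needs at least one owner")
    if engineers == 1:
        return "too small: merge (bus factor 1, on-call rotation of one)"
    if engineers > 8:
        return "too large: split"
    if 3 <= engineers <= 6:
        return "good: one team can own and operate it end-to-end"
    # Assumption: sizes 2, 7, and 8 are not given a verdict by the rule above.
    return "borderline: the rule gives no verdict for this size"
```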
Never do a "big bang" rewrite — it almost always fails. Use the Strangler Fig pattern: gradually route traffic from the monolith to new services while both exist in parallel.
# Strangler Fig — step by step
# Step 1: Identify the bounded context to extract (one module at a time)
# Best first candidates: high-traffic, independently scalable, clear API boundary
# Worst first candidates: deeply coupled modules with many shared DB tables
# Step 2: Define the service API before writing any code
# What endpoints does this service expose?
# What events does it publish/consume?
# What data does it own exclusively?
# Step 3: Deploy the new service alongside the monolith
# New service has its OWN database (even if it starts as a replicated copy)
# No shared DB between the new service and monolith — this is non-negotiable
# Step 4: Place a proxy/facade in front of both
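class ServiceRouter:
    def __init__(self, feature_flag_client, user_service_client, monolith_db):
        self.flags = feature_flag_client
        self.user_service = user_service_client  # client for the new microservice
        self.monolith_db = monolith_db           # existing monolith data access

    def route_get_user(self, user_id: int):
        # The flag is evaluated per user, so the rollout can be widened gradually
        if self.flags.get("use_user_service", user_id=user_id):
            return self.user_service.get_user(user_id)  # new microservice
        else:
            return self.monolith_db.query("SELECT * FROM users WHERE id=%s", (user_id,))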
# Step 5: Roll out gradually (1% → 5% → 25% → 50% → 100%)
# Feature flags control rollout speed
# Monitor error rates and latency at each stage
# Roll back instantly if degradation detected
# Step 6: Once 100% traffic on new service, delete the monolith code for that module
# Don't leave "zombie code" in the monolith; it becomes technical debt immediately
# Step 7: Repeat for next module
# Order: low coupling first → high coupling last
# Typical migration: 12–24 months for a medium-sized monolith
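The gradual rollout in Step 5 depends on each user landing on the same side of the flag at every request. One common way to get that is deterministic hash bucketing, sketched below. `rollout_bucket` and `is_enabled` are hypothetical helpers; a real feature flag product will have its own API.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: int) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name: str, user_id: int, percent: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return rollout_bucket(flag_name, user_id) < percent
```

Because the bucket is derived from the flag name and user ID, raising the percentage from 5 to 25 only adds users; nobody who was already on the new service is moved back to the monolith. Rolling back is a matter of setting the percentage to 0.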
# Problem: Monolith has ONE shared DB. Services need their OWN DB.
# You can't just split tables: the monolith's JOIN queries will break.
#
# Strategy 1: Sync via events (recommended for most cases)
# Phase 1: New service writes to its own DB. Also publishes events.
# Phase 2: Monolith consumes events to update its own DB for the transition period.
# Phase 3: Once monolith no longer reads the extracted table, remove it from shared DB.
#
# Strategy 2: Read replica transition
# New service reads from a read replica of the shared DB (read-only).
# New service writes go to its own new DB.
# Sync writes back to shared DB via event (temporarily).
# Cut over once the shared DB copy is no longer needed.
#
# Strategy 3: Dual-write (dangerous but sometimes necessary)
# Write to BOTH shared DB and new service DB simultaneously.
# Compare results to verify correctness.
# Switch reads to new service once verified.
# Risk: divergence between the two copies → use checksums/reconciler
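The checksum/reconciler mentioned under Strategy 3 could look roughly like this. It is a sketch that assumes both copies fit in memory as dictionaries keyed by primary key; `row_checksum` and `reconcile` are hypothetical names, and a production reconciler would page through both tables in batches.

```python
import hashlib
import json


def row_checksum(row: dict) -> str:
    """Stable checksum of a row; keys are sorted so field order does not matter."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()


def reconcile(shared_rows: dict, service_rows: dict) -> dict:
    """Compare two copies keyed by primary key and report divergence."""
    shared_keys, service_keys = set(shared_rows), set(service_rows)
    mismatched = sorted(
        key for key in shared_keys & service_keys
        if row_checksum(shared_rows[key]) != row_checksum(service_rows[key])
    )
    return {
        "missing_in_service": sorted(shared_keys - service_keys),
        "missing_in_shared": sorted(service_keys - shared_keys),
        "mismatched": mismatched,
    }
```

Running this on a schedule during the dual-write period gives a measurable signal for the cutover: switch reads only after the report has stayed empty for a period you are comfortable with.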
# API Gateway responsibilities:
# - Request routing (auth requests → AuthService, orders → OrderService)
# - Authentication/Authorization (validate JWT before forwarding)
# - Rate limiting (per-client quotas)
# - Request aggregation (combine responses from multiple services)
# - SSL termination
# - Logging and tracing injection (add X-Request-ID to every request)
# Tools: Kong, AWS API Gateway, Nginx, Envoy, Traefik
Client → API Gateway (Kong / AWS API Gateway)
    ├── /api/auth/*     → Auth Service
    ├── /api/users/*    → User Service
    ├── /api/orders/*   → Order Service
    └── /api/products/* → Product Service
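Two of the gateway responsibilities listed above, request routing and request ID injection, can be sketched in a few lines. This is an illustration only, not how Kong or any other gateway is configured; the upstream names and the helper functions are assumptions.

```python
import uuid

# Hypothetical routing table mirroring the diagram above.
ROUTES = {
    "/api/auth/": "auth-service",
    "/api/users/": "user-service",
    "/api/orders/": "order-service",
    "/api/products/": "product-service",
}


def resolve_upstream(path: str):
    """Return the upstream for the longest matching prefix, or None."""
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    if not matches:
        return None
    return ROUTES[max(matches, key=len)]


def with_request_id(headers: dict) -> dict:
    """Copy headers, adding X-Request-ID only if the client did not send one."""
    forwarded = dict(headers)
    forwarded.setdefault("X-Request-ID", str(uuid.uuid4()))
    return forwarded
```

For example, `resolve_upstream("/api/orders/42")` returns `"order-service"`, and an unknown path returns `None` so the gateway can answer with a 404 itself. The header lookup here is case-sensitive for brevity; real HTTP header names are case-insensitive.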
# Option A: Synchronous REST/gRPC (request-response)
# Use when: caller needs an immediate response to proceed
# Risk: if Order Service calls Payment Service synchronously,
# Payment Service downtime → Order Service fails too (cascading failure)
# Mitigation: Circuit Breaker pattern
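# Circuit Breaker (minimal Python sketch):
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""

class CircuitBreaker:
    CLOSED = "closed"        # normal; requests go through
    OPEN = "open"            # fast-fail; no requests sent to downstream
    HALF_OPEN = "half-open"  # probe: let a request through to test recovery

    def __init__(self, threshold=5, reset_timeout=30):
        self.threshold = threshold          # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before probing
        self.state = self.CLOSED
        self.failure_count = 0
        self.reset_time = 0.0

    def call(self, func, *args):
        if self.state == self.OPEN:
            if time.time() > self.reset_time:
                self.state = self.HALF_OPEN
            else:
                raise CircuitOpenError("Payment Service unavailable")  # fast-fail
        try:
            result = func(*args)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success closes the circuit and clears the failure streak
        self.failure_count = 0
        self.state = self.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        # Open after `threshold` consecutive failures, or at once if the half-open probe fails
        if self.state == self.HALF_OPEN or self.failure_count >= self.threshold:
            self.state = self.OPEN
            self.reset_time = time.time() + self.reset_timeout  # try again after 30s by default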
# Option B: Asynchronous events via Kafka (fire-and-forget)
# Use when: caller doesn't need immediate response
# Order Service publishes "OrderCreated" event to Kafka
# Payment Service consumes it → processes payment → publishes "PaymentCompleted"
# Order Service consumes "PaymentCompleted" → updates order status
# Decoupled: if Payment Service is down → events queue up in Kafka → processed on recovery
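The decoupling described in Option B can be illustrated with an in-memory stand-in for the topics. This is a toy model, not a Kafka client: `InMemoryTopic` and the handler functions are hypothetical, and a real broker adds durability, partitions, and consumer offsets.

```python
from collections import deque


class InMemoryTopic:
    """Toy stand-in for a durable log such as a Kafka topic."""

    def __init__(self):
        self._events = deque()

    def publish(self, event: dict) -> None:
        self._events.append(event)  # producer returns immediately

    def drain(self, handler) -> int:
        """Deliver queued events in order; returns how many were handled."""
        handled = 0
        while self._events:
            handler(self._events.popleft())
            handled += 1
        return handled


order_created = InMemoryTopic()
payment_completed = InMemoryTopic()
order_status = {}


def create_order(order_id: int) -> None:
    order_status[order_id] = "pending"
    order_created.publish({"type": "OrderCreated", "order_id": order_id})


def payment_service(event: dict) -> None:
    # Real payment processing would happen here.
    payment_completed.publish({"type": "PaymentCompleted", "order_id": event["order_id"]})


def order_service(event: dict) -> None:
    order_status[event["order_id"]] = "paid"
```

Orders can be created while nothing is consuming `order_created`; they stay `"pending"` until the payment consumer drains the backlog. With a real broker, events may be delivered more than once, so handlers should be idempotent.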
# Problem: User reports "checkout is slow". Which service is the bottleneck?
# In monolith: single thread trace, profiling is straightforward
# In microservices: request touches 6 services, so you need a distributed trace
#
# Solution: OpenTelemetry (industry standard)
# Every service propagates trace context in HTTP headers:
# traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#              version-trace_id-parent_span_id-flags
#
# Example trace (Jaeger/Zipkin visualization):
# Request /checkout → total: 450ms
# ├── API Gateway: 2ms
# ├── Order Service: 3ms
# │   └── DB query: 2ms
# ├── Payment Service: 380ms ← BOTTLENECK
# │   ├── Fraud check: 50ms
# │   └── Stripe API call: 320ms ← root cause: external API slow
# ├── Notification Service: 5ms (async, doesn't block response)
# └── Response: 10ms
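To show what propagating trace context means at the header level, here is a small sketch that parses a `traceparent` value and builds the header for an outgoing call. It hand-rolls the format purely for illustration; in practice the OpenTelemetry SDK does this for you. `parse_traceparent` and `child_traceparent` are hypothetical helpers.

```python
import re
import secrets

# version (2 hex) - trace_id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    return match.groupdict()


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace_id, fresh span id for this hop."""
    ctx = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"
```

The trace ID stays constant across all services, which is what lets a tracing backend stitch the spans into a single tree. Each hop replaces only the span ID, so the receiving service knows which span is its parent.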
| Scenario | Recommendation | Why |
|---|---|---|
| New product / startup / <10 engineers | Monolith | Speed to market. Don't pay the microservices tax before you have the problem. |
| 10–50 engineers, product-market fit found | Modular Monolith | Clean boundaries without network overhead. Extract when needed. |
| 50+ engineers, multiple independent teams | Microservices | Conway's Law: teams need independent deploy pipelines. |
| One module needs 10× more resources | Extract that service | Targeted extraction. Keep everything else in the monolith. |
| Frequent deploys blocked by team coordination | Extract services | Independent deployability is microservices' core value prop. |
| Need Python/Go/Rust for specific workload | Polyglot service | Extract only the service that needs a different tech stack. |
# Interviewer: "Would you use microservices or a monolith?"
#
# Good answer:
# "I'd start with a modular monolith. At the scale we're designing for (say 100K DAU),
# a well-structured monolith with clean module boundaries is faster to build, easier to debug,
# and doesn't require Kubernetes or service mesh infrastructure.
# I'd define bounded contexts now (User, Order, Payment, Notification): each module owns its
# schema and doesn't share data with others.
# When we hit a specific problem, say the Notification module is causing memory spikes that
# affect checkout, I'd extract it using the Strangler Fig pattern with a Kafka event boundary.
# If this were a 100-engineer org where multiple teams are deploying independently, I'd recommend
# microservices with an API Gateway, circuit breakers between services, and distributed tracing
# with OpenTelemetry from day one."
#
# Mentioning these signals separates good candidates from great ones:
# ✓ Conway's Law
# ✓ Strangler Fig pattern (shows you know HOW to migrate, not just WHEN)
# ✓ Circuit breaker (shows you understand failure modes)
# ✓ Modular monolith as a middle ground (shows nuance)