Microservices Interview Questions 2026

Top 50 Questions & Answers — Architecture, Patterns, Resilience, Observability

Microservices architecture questions probe your ability to design, build, and operate distributed systems at scale. This guide covers the full interview landscape — from core architecture principles to production war stories on resilience, observability, and data consistency across services.

Easy = 2–4 year roles  |  Medium = 4–7 years  |  Hard = Senior / Architect
Architecture Fundamentals
1
What are microservices? How do they differ from a monolith and SOA? (Easy)

Monolith — the entire application is a single deployable unit. Simple to develop and test initially but becomes hard to scale and deploy as it grows.

SOA (Service-Oriented Architecture) — decomposes applications into services, but services are typically coarse-grained, share a database, and communicate over heavyweight protocols (SOAP, ESB).

Microservices — fine-grained services, each:

  • Owns a single bounded context (one business capability)
  • Has its own data store
  • Communicates via lightweight APIs (REST, gRPC, events)
  • Is independently deployable and scalable
  • Can be written in a different language/stack (polyglot)
The goal is autonomous deployability — a team can ship a microservice without coordinating with every other team.
2
What is a Bounded Context and why is it the key to sizing microservices? (Medium)

A bounded context (from Domain-Driven Design) is a self-contained domain model with clearly defined boundaries — a "User" inside the Billing context has different attributes than a "User" in the Social context. Each bounded context is a natural microservice boundary.

Why it matters for sizing:

  • Too coarse (too few services) — one service owns too many concepts, teams step on each other, back to monolith problems.
  • Too fine (nano-services) — network latency, distributed transaction complexity, operational overhead explodes.

Rule: let the team structure drive the service boundary (Conway's Law — your system mirrors your communication structure). A 2-pizza team owning one bounded context is the right starting point.

3
What are the advantages and disadvantages of microservices? (Easy)

Advantages:

  • Independent deployment — teams ship without coordinating
  • Independent scaling — scale only the bottleneck service
  • Technology diversity — each service can use the best tool for its job
  • Fault isolation — a crash in one service doesn't bring down everything
  • Smaller codebases — easier to understand and test

Disadvantages:

  • Distributed system complexity — network failures, latency, partial failures
  • Data consistency — no ACID across service boundaries
  • Operational overhead — service discovery, load balancing, tracing across many services
  • Testing complexity — integration tests span multiple services
  • Higher initial investment — CI/CD pipelines, infrastructure per service
4
What is the Strangler Fig pattern and when do you use it? (Medium)

The Strangler Fig pattern is the safest way to migrate from a monolith to microservices. Named after a fig tree that gradually envelops its host:

  1. Route ALL requests through a facade/proxy in front of the monolith.
  2. Extract one bounded context at a time to a new microservice.
  3. Redirect traffic for that context to the new service via the proxy.
  4. Repeat until the monolith is "strangled" (empty).

The monolith continues running throughout — users see no disruption. You migrate incrementally, validating each extracted service before moving on. This is far safer than a "big bang" rewrite.

Key tool: a reverse proxy (Nginx, Spring Cloud Gateway, APISIX) acts as the routing facade.
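
The routing step can be sketched in a few lines. This is only an illustration of what the facade decides per request, not how any particular proxy implements it; the class name, URLs, and path prefixes are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical routing facade; the upstream URLs and path prefixes are illustrative.
final class StranglerRouter {
    private final String monolithUrl;
    private final Map<String, String> extracted = new LinkedHashMap<>();

    StranglerRouter(String monolithUrl) {
        this.monolithUrl = monolithUrl;
    }

    // Called each time a bounded context has been extracted to its own service.
    void migrate(String pathPrefix, String serviceUrl) {
        extracted.put(pathPrefix, serviceUrl);
    }

    // Routes to the new service when the path matches a migrated prefix, else to the monolith.
    String upstreamFor(String path) {
        for (Map.Entry<String, String> route : extracted.entrySet()) {
            if (path.startsWith(route.getKey())) {
                return route.getValue();
            }
        }
        return monolithUrl;
    }
}
```

Every path goes to the monolith until `migrate("/orders", ...)` is called; after that only `/orders` traffic moves, which is what makes the migration incremental and easy to roll back.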
5
What is an API Gateway and what responsibilities should it own? (Medium)

The API Gateway is the single entry point for all clients. It should handle cross-cutting concerns so services don't have to duplicate them:

  • Routing — map external paths to internal service URLs
  • Authentication / token validation — verify JWT once at the gateway
  • Rate limiting — protect services from abuse
  • SSL termination — handle HTTPS externally, HTTP internally
  • Request/response transformation — add/remove headers, aggregate multiple service responses
  • Logging and tracing — inject trace IDs on every request

What it should NOT do: business logic. A gateway that knows about order states or user accounts has become a service itself and creates a coupling point.

Tools: Spring Cloud Gateway (Java/reactive), APISIX, Kong, AWS API Gateway.

6
What is the Backend for Frontend (BFF) pattern? (Medium)

A BFF is a separate backend service tailored for one specific frontend client (mobile app, web app, third-party API). Instead of one generic API that all clients use:

  • Mobile BFF — returns compact payloads optimised for small screens and mobile networks
  • Web BFF — aggregates data from multiple services to reduce round trips for the browser
  • Partner BFF — exposes a stable versioned API for third-party integrators

Each BFF is owned by the frontend team that uses it. This avoids the "API designed by committee" problem where one API tries to serve all clients poorly. Downsides: more services to maintain, potential duplication across BFFs.

7
What is the Sidecar pattern? (Medium)

A sidecar is a helper container deployed alongside the main service container in the same pod (Kubernetes). It handles infrastructure concerns so the service doesn't have to:

  • Service mesh sidecar (Envoy in Istio, linkerd2-proxy in Linkerd) — handles mTLS, retries, circuit breaking, load balancing, and telemetry collection transparently
  • Log shipping sidecar (Fluentd/Fluent Bit) — tails log files and ships to Elasticsearch
  • Config sync sidecar — syncs secrets from Vault to a shared volume

The sidecar shares the pod's network namespace with the main container, so a mesh can transparently redirect all inbound and outbound traffic through it. This lets you add cross-cutting capabilities to services without changing their code, and works across polyglot services.

8
What is a Service Mesh and what problems does it solve? (Hard)

A service mesh is a dedicated infrastructure layer that handles service-to-service communication. It consists of sidecar proxies (data plane) + a control plane (Istio, Linkerd).

Problems it solves — without changing application code:

  • mTLS — automatic mutual TLS between all services; each gets a certificate
  • Traffic management — canary deployments, A/B testing, weighted routing, retries, timeouts
  • Observability — automatic metrics (Prometheus), distributed traces (Jaeger), service topology graphs
  • Circuit breaking — automatic retry and circuit breaker policies across all services

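Trade-offs: significant operational complexity, plus added latency on every hop through a sidecar proxy (often quoted as a few milliseconds for Istio, but it depends on version, configuration, and load, so measure it in your own environment). Only justified in large deployments with many services where the duplication of cross-cutting code across services would be worse.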

9
What is the difference between orchestration and choreography in microservices? (Hard)

Orchestration — a central coordinator tells each service what to do and when. Like a conductor. The coordinator knows the entire flow.

// Orchestrator calls each service in sequence:
OrderOrchestrator:
  1. Call InventoryService.reserve()
  2. Call PaymentService.charge()
  3. Call ShippingService.schedule()
  4. Call NotificationService.confirm()

Choreography — each service reacts to events published by others. No central coordinator. Each service knows its own responsibilities and publishes events for the next step.

OrderPlaced → Inventory listens → InventoryReserved
InventoryReserved → Payment listens → PaymentCharged
PaymentCharged → Shipping listens → ShipmentScheduled

Tradeoffs: Orchestration is easier to reason about (centralised flow) but creates coupling to the orchestrator. Choreography is more decoupled but harder to trace — you need good distributed tracing to follow a request across events.
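
A minimal sketch of the choreographed flow above, using a hypothetical in-memory `EventBus` in place of a real broker. The class names are assumptions; the event names are the ones from the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory event bus standing in for a broker such as Kafka.
final class EventBus {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();
    final List<String> published = new ArrayList<>();

    void subscribe(String eventType, Consumer<String> handler) {
        subscribers.computeIfAbsent(eventType, k -> new ArrayList<>()).add(handler);
    }

    void publish(String eventType, String orderId) {
        published.add(eventType); // record the order in which events were emitted
        for (Consumer<String> handler : subscribers.getOrDefault(eventType, List.of())) {
            handler.accept(orderId);
        }
    }
}

final class Choreography {
    // Each "service" subscribes to the event it cares about and emits the next one.
    // No component knows the whole flow.
    static EventBus wire() {
        EventBus bus = new EventBus();
        bus.subscribe("OrderPlaced", id -> bus.publish("InventoryReserved", id));
        bus.subscribe("InventoryReserved", id -> bus.publish("PaymentCharged", id));
        bus.subscribe("PaymentCharged", id -> bus.publish("ShipmentScheduled", id));
        return bus;
    }
}
```

Note that the overall flow only exists implicitly in the subscriptions, which is exactly why choreography is harder to trace than an orchestrator that lists the steps in one place.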

10
What is the 12-Factor App methodology and how does it apply to microservices? (Medium)

The 12-factor methodology defines how to build portable, scalable, cloud-native services. Most relevant factors for microservices:

  • III. Config — store config in environment variables, not in code. Enables different config per environment without rebuilds.
  • IV. Backing services — treat databases, message brokers, caches as attached resources, not local.
  • VI. Processes — services must be stateless. Session state in Redis/DB, not in the JVM heap.
  • VIII. Concurrency — scale out via process model (add instances), not up (bigger server).
  • IX. Disposability — fast startup, graceful shutdown. Enables Kubernetes rolling updates.
  • XI. Logs — treat logs as event streams. Write to stdout; let the platform route to Elasticsearch/Splunk.
Service Communication
11
When do you choose synchronous vs asynchronous communication between services? (Medium)

Synchronous (REST/gRPC) — caller waits for the response. Use when:

  • The caller needs the result to proceed (query, validation, real-time response)
  • Simple request/response with a clear owner
  • Acceptable to fail fast if the downstream is down

Asynchronous (Kafka/RabbitMQ) — caller publishes an event and moves on. Use when:

  • The operation can be delayed (email, audit log, report generation)
  • You need to fan out to multiple consumers
  • You need temporal decoupling — producer and consumer don't need to be up simultaneously
  • High throughput with buffering (producer is faster than consumer)
Golden rule: prefer async for inter-service calls that don't need an immediate result. Synchronous chains create availability coupling — if A calls B calls C, A's availability = A × B × C.
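
The availability arithmetic can be checked with a small helper. This is a sketch that assumes failures are independent; the class name and the 99.9% figures in the usage note are illustrative.

```java
// Illustrative only: any availability figures passed to this helper are assumptions.
final class ChainAvailability {
    // Availability of a synchronous chain is the product of every hop's availability,
    // assuming the hops fail independently.
    static double of(double... hops) {
        double result = 1.0;
        for (double hop : hops) {
            result *= hop;
        }
        return result;
    }
}
```

For example, `ChainAvailability.of(0.999, 0.999, 0.999)` is about 0.997: three services at 99.9% each give the chain roughly 99.7%.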
12
What is gRPC and when would you use it over REST? (Medium)

gRPC is a high-performance RPC framework that uses Protocol Buffers (binary) over HTTP/2. Advantages over REST/JSON:

  • Performance — Protobuf's binary encoding is typically several times smaller than the equivalent JSON and faster to serialise/deserialise (the exact gain depends on the payload)
  • Streaming — supports server streaming, client streaming, and bidirectional streaming
  • Strong typing — .proto schema is the contract; code generated in any language
  • Multiplexing — HTTP/2 multiplexes multiple streams on one connection

Use gRPC when: internal service-to-service calls where performance matters, streaming scenarios (real-time updates, large data), polyglot microservices needing a typed contract.

Use REST when: public APIs (browser-friendly), simple CRUD, tooling/ecosystem maturity matters more than raw performance.

13
What is service discovery and how does it work? (Easy)

In dynamic environments, service instances start/stop/move. Hardcoding IPs fails. Service discovery solves this:

Client-side discovery (Eureka plus a client-side load balancer such as Spring Cloud LoadBalancer, which replaced Netflix Ribbon): the client queries the registry and load-balances itself.

Server-side discovery (Kubernetes Service DNS / AWS ALB): the client calls a fixed DNS name; the infrastructure routes to a healthy instance.

# Kubernetes: every service gets a stable DNS name
http://order-service:8080/api/orders
# DNS resolves to a ClusterIP that load-balances across all pods

In modern Kubernetes environments, Kubernetes Services replace Eureka — built-in DNS, health-checked endpoints, no extra component to run. Eureka is primarily used in VM-based deployments with Spring Cloud.

14
What is the Consumer-Driven Contract Testing pattern? (Hard)

In microservices, integration tests across services are slow and fragile. Consumer-driven contracts (CDC) solve this with a faster feedback loop:

  1. Consumer writes a contract describing what it expects from the provider (a Pact file).
  2. Provider verifies its API matches the contract in its own CI pipeline — no real consumer needed.
// Consumer side (Pact-style DSL, shown as pseudocode):
given("user 1 exists")
.upon_receiving("a request for user 1")
.with(method: GET, path: "/users/1")
.will_respond_with(status: 200,
    body: {id: 1, name: "Alice"});  // contract

Benefits: each service validates in isolation; breaking changes are caught before deployment, not in production. Tools: Pact (polyglot), Spring Cloud Contract (Java-native).

15
What is Event-Driven Architecture and how does it enable loose coupling? (Medium)

In Event-Driven Architecture (EDA), services communicate by publishing and subscribing to events. The publisher doesn't know who consumes its events — it just publishes to a topic.

// OrderService publishes — knows nothing about consumers:
kafka.send("order-events", new OrderPlacedEvent(orderId, items, total));

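// InventoryService subscribes independently (group IDs here are illustrative):
@KafkaListener(topics = "order-events", groupId = "inventory-service")
void on(OrderPlacedEvent e) { reserve(e.getItems()); }

// EmailService also subscribes independently. It uses a different consumer
// group, so both services receive every event:
@KafkaListener(topics = "order-events", groupId = "email-service")
void on(OrderPlacedEvent e) { sendConfirmation(e); }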

This enables: adding new consumers without touching the publisher, temporal decoupling (consumer can be offline and catch up), and fan-out to many subscribers. Downside: eventual consistency, harder to trace end-to-end flows.

16
What is the Outbox Pattern and why is it needed? (Hard)

The dual-write problem: when a service must both update its database AND publish an event, these two operations cannot be atomic across different systems. If the DB commit succeeds but Kafka publish fails, data is inconsistent.

Outbox pattern solution:

  1. Write the business entity update AND an outbox event record in the same database transaction.
  2. A separate relay process (Debezium CDC, polling) reads the outbox table and publishes to Kafka.
  3. With a polling relay, mark the outbox record as sent after a successful publish (log-based CDC tracks its position in the transaction log instead).
// In one transaction:
@Transactional
public Order placeOrder(OrderRequest req) {
    Order order = orderRepo.save(new Order(req));
    outboxRepo.save(new OutboxEvent("order-placed", toJson(order)));
    return order;  // both committed atomically
}

Debezium tails the DB transaction log (CDC), so there is no polling overhead and delivery to Kafka is at-least-once. Duplicates are therefore possible, and consumers must be idempotent.
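
For the simpler polling variant of the relay, a minimal sketch might look like the following. The in-memory list stands in for the outbox table, and `OutboxRelay`, its fields, and `pollOnce` are hypothetical names, not part of any library.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical polling relay; the list stands in for the outbox table.
final class OutboxRelay {
    static final class OutboxEvent {
        final String type;
        final String payload;
        boolean sent = false;

        OutboxEvent(String type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    // Publishes every unsent event in insertion order and returns how many were published.
    // An event is marked sent only after publish returns, so a crash in between leads to
    // a re-publish on the next poll (at-least-once delivery).
    static int pollOnce(List<OutboxEvent> outbox, Consumer<OutboxEvent> publisher) {
        int published = 0;
        for (OutboxEvent event : outbox) {
            if (event.sent) {
                continue;
            }
            try {
                publisher.accept(event);
            } catch (RuntimeException brokerUnavailable) {
                break; // preserve ordering: stop here and retry from this event next poll
            }
            event.sent = true;
            published++;
        }
        return published;
    }
}
```

In a real service the "mark as sent" update and the read of unsent rows would be database operations, and the relay would run on a schedule.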

17
What is idempotency and why is it critical in microservices? (Medium)

An operation is idempotent if calling it multiple times produces the same result as calling it once. In microservices, retries are common (network failures, circuit breakers) — if operations aren't idempotent, retries cause duplicates.

// Client sends idempotency key in every request:
POST /payments
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000

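// Server: check if this key was already processed. With more than one instance,
// idempotencyCache must be a shared store (e.g. Redis or a DB table), not a local map: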
@PostMapping("/payments")
public PaymentResult pay(@RequestHeader("Idempotency-Key") String key,
                         @RequestBody PaymentRequest req) {
    return idempotencyCache.computeIfAbsent(key,
        k -> paymentService.process(req));
}

For Kafka consumers: store processed message offsets or use a seen-message cache keyed on message ID. Natural idempotency: PUT /users/42 with the full state is idempotent; POST /add-5-to-balance is not.
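
The seen-message approach for consumers can be sketched as a small wrapper. `IdempotentHandler` is a hypothetical name, and the in-memory set is an assumption made to keep the example self-contained; a real service would keep the IDs in a shared, durable store.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical in-memory deduplicator keyed on message ID.
final class IdempotentHandler<M> {
    private final Set<String> seenIds = ConcurrentHashMap.newKeySet();
    private final Consumer<M> delegate;

    IdempotentHandler(Consumer<M> delegate) {
        this.delegate = delegate;
    }

    // Processes the message only the first time its ID is seen; returns false for duplicates.
    boolean handle(String messageId, M message) {
        if (!seenIds.add(messageId)) {
            return false; // redelivery: already processed
        }
        try {
            delegate.accept(message);
        } catch (RuntimeException e) {
            seenIds.remove(messageId); // processing failed, so allow a later retry
            throw e;
        }
        return true;
    }
}
```

The design choice here is to record the ID before processing and roll it back on failure, so a redelivered message is neither processed twice nor silently dropped.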

18
What is API versioning and what are the strategies for it? (Medium)

API versioning lets you evolve APIs without breaking existing consumers. Main strategies:

  • URL path versioning (/api/v1/users, /api/v2/users) — most visible, most explicit. Easy to test in browser. Breaks REST's "resource URI should be stable" principle.
  • Header versioning (Accept: application/vnd.example.v2+json, where vnd.example is a placeholder vendor media type) — clean URLs, harder to test manually.
  • Query parameter (/api/users?version=2) — simple, but mixes versioning with business parameters.

Best practice: choose URL versioning (most teams do) and run two versions in parallel during transition. Deprecate v1 after all consumers migrate to v2. Never break a published API without a sunset period.

Resilience & Fault Tolerance
19
What is the Circuit Breaker pattern? Explain its three states. (Medium)

A circuit breaker monitors calls to a downstream service and stops calling it when failures exceed a threshold — preventing cascading failures.

  • Closed (normal) — requests flow through. Failures are counted. If failure rate exceeds threshold, trips to Open.
  • Open (tripped) — all requests fail immediately with a fallback. No calls reach the downstream. After a wait period, moves to Half-Open.
  • Half-Open (testing) — a limited number of requests are allowed through. If they succeed, resets to Closed. If they fail, back to Open.
# Resilience4j config:
resilience4j.circuitbreaker.instances.paymentService:
  failure-rate-threshold: 50        # open at 50% failure rate
  slow-call-rate-threshold: 100     # also open if 100% calls are slow
  slow-call-duration-threshold: 2s
  wait-duration-in-open-state: 30s
  permitted-number-of-calls-in-half-open-state: 5
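
The three states can be made concrete with a deliberately simplified state machine. This is a sketch, not how Resilience4j is implemented: it counts consecutive failures instead of a failure rate, it lets any call through while half-open rather than a limited number, and the class and method names are assumptions.

```java
import java.time.Duration;

// Hypothetical, simplified breaker: trips on consecutive failures, not a failure rate.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openNanos;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, Duration waitInOpenState) {
        this.failureThreshold = failureThreshold;
        this.openNanos = waitInOpenState.toNanos();
    }

    // Returns true if a call may proceed at time nowNanos (e.g. System.nanoTime()).
    synchronized boolean allowRequest(long nowNanos) {
        if (state == State.OPEN && nowNanos - openedAt >= openNanos) {
            state = State.HALF_OPEN; // wait period elapsed: let trial calls through
        }
        return state != State.OPEN;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED; // a successful trial call closes the circuit
    }

    synchronized void recordFailure(long nowNanos) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // trip, or re-trip after a failed trial call
            openedAt = nowNanos;
        }
    }

    synchronized State state() {
        return state;
    }
}
```

Callers check `allowRequest` before the downstream call, return a fallback when it is false, and report the outcome with `recordSuccess` or `recordFailure`. Time is passed in explicitly so the transitions are easy to test.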
20
What is bulkhead isolation and why does it prevent cascading failures? (Hard)

Named after the watertight compartments in a ship's hull — a breach in one compartment doesn't sink the ship. In microservices, bulkheads isolate thread pools or connection pools per downstream dependency.

// Without bulkhead: slow InventoryService exhausts ALL threads
// → other endpoints (Payments, Users) also start timing out

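// With bulkhead: a separate concurrency limit per downstream.
// (resilience4j.bulkhead is the semaphore-based bulkhead; a dedicated thread pool
// per downstream is configured under resilience4j.thread-pool-bulkhead.)
resilience4j.bulkhead.instances.inventoryService:
  maxConcurrentCalls: 10    # max 10 concurrent inventory calls
  maxWaitDuration: 100ms    # how long a caller waits for a free slot

resilience4j.bulkhead.instances.paymentService:
  maxConcurrentCalls: 20

Even if InventoryService is slow and all 10 slots fill up, payment and user calls still have their own slots. The failure is contained.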
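
One simple way to cap concurrent calls per dependency is a semaphore. The sketch below illustrates the idea only; `SemaphoreBulkhead` and its `execute` method are hypothetical names, and unlike the config above it rejects immediately instead of waiting for a slot.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical semaphore bulkhead: caps concurrent calls to one downstream dependency.
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the call if a slot is free; otherwise returns the fallback straight away.
    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // bulkhead full: fail fast instead of queueing
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always free the slot, even if the call throws
        }
    }
}
```

Creating one instance per downstream (one for inventory, one for payments) is what keeps a slow dependency from consuming every request thread.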

21
What is the Retry pattern and what is exponential backoff with jitter? (Medium)

Retrying transient failures (network blip, brief service restart) automatically improves resilience. But naive fixed-interval retries can cause a thundering herd — all failed clients retry simultaneously, overwhelming the recovering service.

Exponential backoff: wait time doubles each retry (1s, 2s, 4s, 8s). Jitter: add random noise so clients don't all retry at the same millisecond.

resilience4j.retry.instances.orderService:
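  max-attempts: 3                       # includes the first call, so at most 2 retries
  wait-duration: 500ms
  enable-exponential-backoff: true
  exponential-backoff-multiplier: 2     # waits 500ms, then 1000ms
  enable-randomized-wait: true          # jitter is applied only when this is enabled
  randomized-wait-factor: 0.5           # ±50% jitter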

@Retry(name = "orderService")
public Order getOrder(Long id) {
    return orderClient.findById(id); // retried on exception
}

Only retry idempotent operations. Never retry non-idempotent operations (POST to create, payment charge) without an idempotency key.
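
The delay calculation itself is small enough to sketch by hand. This is an illustration of the arithmetic, not Resilience4j's internal code; `Backoff` and its method names are assumptions.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper showing how backoff delays and jitter can be computed.
final class Backoff {
    // Delay before retry number `retry` (1-based): base * multiplier^(retry - 1), capped.
    static long exponentialMillis(long baseMillis, double multiplier, int retry, long maxMillis) {
        double delay = baseMillis * Math.pow(multiplier, retry - 1);
        return (long) Math.min(delay, (double) maxMillis);
    }

    // Adds random noise of up to +/- jitterFactor so clients do not retry in lockstep.
    static long withJitter(long delayMillis, double jitterFactor) {
        double spread = delayMillis * jitterFactor;
        double offset = (ThreadLocalRandom.current().nextDouble() * 2.0 - 1.0) * spread;
        return Math.max(0L, Math.round(delayMillis + offset));
    }
}
```

With a 500 ms base and multiplier 2, retries 1 to 3 wait 500, 1000, and 2000 ms before jitter; the cap keeps later retries from growing without bound.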

22
What is the Timeout pattern and how do you set timeouts in a microservice chain? (Medium)

Without timeouts, a slow downstream service ties up threads indefinitely, eventually exhausting the thread pool. Every service call must have a timeout.

Timeout budget: set timeouts based on the end user's expected response time, working backwards through the chain.

# If the API Gateway has a 3s timeout for the user:
# Service A → Service B → Service C chain
# A should timeout B at 2.5s (leaving 500ms for A's own work)
# B should timeout C at 1.5s (leaving 1s for B's work)

# Resilience4j TimeLimiter:
resilience4j.timelimiter.instances.inventoryService:
  timeout-duration: 1500ms
  cancel-running-future: true

Pass the remaining timeout budget via headers (deadline propagation) so each hop knows how much time is left. This prevents a service from retrying after the client has already timed out and moved on.
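
Deadline propagation can be sketched as a small value object. The header name `X-Deadline-Remaining-Ms` and the `Deadline` class are assumptions for illustration; gRPC has built-in deadlines, while plain HTTP services need a convention like this one.

```java
import java.time.Duration;

// Hypothetical deadline helper; the header name below is an assumption.
final class Deadline {
    static final String HEADER = "X-Deadline-Remaining-Ms";
    private final long expiresAtNanos;

    private Deadline(long expiresAtNanos) {
        this.expiresAtNanos = expiresAtNanos;
    }

    // Starts a budget at time nowNanos (e.g. System.nanoTime()).
    static Deadline after(Duration budget, long nowNanos) {
        return new Deadline(nowNanos + budget.toNanos());
    }

    // Time left in the budget, never negative.
    Duration remaining(long nowNanos) {
        return Duration.ofNanos(Math.max(0L, expiresAtNanos - nowNanos));
    }

    // Timeout for the next downstream call: what is left minus time reserved for local work.
    Duration downstreamTimeout(long nowNanos, Duration reservedForLocalWork) {
        Duration left = remaining(nowNanos).minus(reservedForLocalWork);
        return left.isNegative() ? Duration.ZERO : left;
    }

    // Value to send in HEADER so the next hop can rebuild its own Deadline.
    String headerValue(long nowNanos, Duration reservedForLocalWork) {
        return Long.toString(downstreamTimeout(nowNanos, reservedForLocalWork).toMillis());
    }
}
```

With a 3 s budget and 500 ms reserved for its own work, Service A would give Service B 2500 ms, matching the example above. A zero result means the caller should fail fast rather than start a call nobody is waiting for.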

23
What is the fallback pattern and what are good fallback strategies? (Medium)

A fallback is what a service returns when the primary call fails or the circuit is open. Good fallbacks:

  • Cached response — return the last successful response (stale but useful)
  • Default/empty response — return a degraded but valid response (empty recommendations list)
  • Queue for async retry — accept the request, process later (useful for writes)
  • Fail fast with clear error — better than hanging; let the client decide
@CircuitBreaker(name = "recommendations",
                fallbackMethod = "defaultRecommendations")
public List<Product> getRecommendations(Long userId) {
    return recommendationService.getFor(userId);
}

public List<Product> defaultRecommendations(Long userId, Exception ex) {
    return cache.getPopularProducts(); // cached bestsellers as fallback
}
24
What is the Saga pattern for distributed transactions? (Hard)

Distributed transactions across microservices cannot use traditional ACID (two-phase commit doesn't scale). The Saga pattern achieves consistency through a sequence of local transactions with compensating actions on failure.

Choreography Saga (event-driven):

OrderPlaced
  → Inventory: reserve stock → StockReserved
  → Payment: charge card   → PaymentCharged
  → Shipping: schedule     → OrderComplete

// On payment failure:
PaymentFailed
  → Inventory: release stock (compensating transaction)

Orchestration Saga: A central coordinator (e.g. using Apache Camel, Temporal, AWS Step Functions) drives each step and triggers compensations on failure.

Key insight: compensating transactions must be idempotent and retriable, because they are the "undo" mechanism of last resort. If one fails, it has to be retried until it succeeds (or escalated for manual intervention), not abandoned.
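
The orchestration variant can be sketched as a loop that remembers which steps completed and undoes them in reverse order on failure. `SagaOrchestrator` and `Step` are hypothetical names; a production orchestrator would also persist progress so it can resume after a crash, which this sketch omits.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical orchestrator: runs steps in order, compensates completed steps in reverse.
final class SagaOrchestrator {
    interface Step {
        void execute() throws Exception;

        void compensate(); // should be idempotent and safe to retry
    }

    // Returns true if every step succeeded, false if the saga was rolled back.
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.execute();
                completed.push(step);
            } catch (Exception failure) {
                while (!completed.isEmpty()) {
                    completed.pop().compensate(); // most recently completed step first
                }
                return false;
            }
        }
        return true;
    }
}
```

The failing step itself is not compensated, only the steps that finished before it, which mirrors the "release stock on payment failure" flow above.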

25
What is graceful degradation in microservices? (Medium)

Graceful degradation means a service continues to provide reduced but useful functionality when a dependency is unavailable — instead of failing completely.

Examples:

  • E-commerce: if the Recommendations service is down, show the product page without "You may also like" — not a 500 error
  • Search: if the ML ranking service is down, fall back to chronological sorting
  • Dashboard: if the real-time metrics service is down, show cached data with a "data may be stale" banner

Design services around what is essential vs optional. Non-essential features should have fallbacks that return empty data or cached results. Essential features need timeouts, bounded retries, and circuit breakers on their synchronous calls.

26
What is the Health Check API pattern? (Easy)

Every service should expose health endpoints that orchestrators and load balancers can poll:

  • Liveness (/actuator/health/liveness) — "is the process alive?" If DOWN, Kubernetes restarts the pod.
  • Readiness (/actuator/health/readiness) — "is the service ready to accept traffic?" If DOWN, Kubernetes removes it from the load balancer. Use this for slow startup, warm-up, or dependency checks.
# application.properties
management.endpoint.health.probes.enabled=true
management.health.livenessState.enabled=true
management.health.readinessState.enabled=true

# Kubernetes deployment.yaml
livenessProbe:
  httpGet: {path: /actuator/health/liveness, port: 8080}
  initialDelaySeconds: 30
readinessProbe:
  httpGet: {path: /actuator/health/readiness, port: 8080}
  initialDelaySeconds: 10
Data Management
27
Why must each microservice own its own database? (Medium)

Sharing a database between services creates tight coupling at the data layer — the opposite of what microservices aim for:

  • One service's schema change breaks another's queries
  • Services can't scale their database independently
  • You can't migrate one service to a different DB type (polyglot persistence)
  • Teams must coordinate database changes across service boundaries

Each service owns its schema and the only way to access another service's data is via its API. This enforces encapsulation at the data level and allows polyglot persistence:

  • Order service → PostgreSQL (relational, ACID needed)
  • Session service → Redis (key-value, TTL needed)
  • Product catalog → MongoDB (flexible schema, hierarchical data)
  • Search → Elasticsearch (full-text search)
28
What is eventual consistency and how do you design for it? (Hard)

Eventual consistency means that after an update, all replicas/services will converge to the same state — but not immediately. The window between "committed in service A" and "visible in service B" is the inconsistency window.

Design strategies:

  • Accept it — show "order confirmed, inventory will be updated shortly." Users tolerate brief delays.
  • Read-your-writes — after a write, route subsequent reads to the writer (not a replica) for a short period.
  • Versioned state — include a version/timestamp with each event so consumers know if their view is stale.
  • Compensating actions — if inconsistency is detected late, trigger a corrective process (refund an over-sold item).
  • CQRS — the read model is explicitly "eventually consistent" and is rebuilt from events.
29
What are CQRS and Event Sourcing? (Hard)

CQRS (Command Query Responsibility Segregation) — separate the write model (commands) from the read model (queries). Write model is optimised for consistency; read model for query performance.

// Write side: normalised, ACID, handles commands
orderCommandService.placeOrder(cmd);

// Read side: denormalised, optimised for queries
OrderSummaryView view = orderQueryService.getOrderSummary(orderId);

Event Sourcing — instead of storing current state, store every state change as an immutable event. Current state is derived by replaying events.

// Events stored (append-only):
OrderPlaced → ItemAdded → ItemRemoved → OrderConfirmed → OrderShipped

// Current state = replay of all events
// (assumes Order.apply(event) returns the updated Order)
Order order = new Order();
for (var event : eventStore.loadEvents(orderId)) {
    order = order.apply(event);
}

Benefits: complete audit trail, replay events to rebuild read models, time travel (what was the state at time T?). Complexity: event versioning, eventual consistency between command and query models.
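
The "time travel" benefit follows directly from replay: stop folding at time T and you have the state as of T. A self-contained sketch, with hypothetical `OrderEvent` and `OrderState` classes that track only an item count and a status:

```java
import java.time.Instant;
import java.util.List;

// Hypothetical event type; the field and enum names are illustrative.
final class OrderEvent {
    enum Type { ORDER_PLACED, ITEM_ADDED, ITEM_REMOVED, ORDER_CONFIRMED, ORDER_SHIPPED }

    final Type type;
    final Instant at;

    OrderEvent(Type type, Instant at) {
        this.type = type;
        this.at = at;
    }
}

// State is never stored directly; it is derived by folding events in order.
final class OrderState {
    int itemCount = 0;
    OrderEvent.Type status = null;

    // Replays events recorded at or before asOf ("what was the state at time T?").
    static OrderState replay(List<OrderEvent> events, Instant asOf) {
        OrderState state = new OrderState();
        for (OrderEvent event : events) {
            if (event.at.isAfter(asOf)) {
                continue; // ignore anything that happened after the requested time
            }
            switch (event.type) {
                case ITEM_ADDED:
                    state.itemCount++;
                    break;
                case ITEM_REMOVED:
                    state.itemCount--;
                    break;
                default:
                    state.status = event.type; // lifecycle events update the status
            }
        }
        return state;
    }
}
```

The same `replay` function can rebuild a read model from scratch, which is why event-sourced systems can add new projections long after the events were written.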

30
How do you handle distributed joins when data is split across services? (Hard)

No database joins across service boundaries — you must use application-level joins. Options:

  • API composition — the caller fetches from each service and joins in memory. Good for small datasets.
  • CQRS read model — build a denormalised view by consuming events from both services and storing pre-joined data locally. Eventual consistency.
  • GraphQL — BFF aggregates data from multiple services in a single query; federation extends this across service boundaries.
  • Data replication — replicate only the fields you need from another service's events into your own database. Ownership stays with the source service.
// API composition (synchronous):
Order order = orderService.getOrder(id);
User user = userService.getUser(order.getUserId());
return new OrderDetailView(order, user);
31
What is the two-phase commit problem and why isn't it used in microservices? (Medium)

Two-Phase Commit (2PC) coordinates an atomic transaction across multiple systems via a coordinator:

  1. Phase 1 (Prepare) — coordinator asks all participants to prepare and lock resources.
  2. Phase 2 (Commit/Rollback) — if all prepared, coordinator tells everyone to commit; else rollback.

Why not in microservices:

  • Blocking — participants hold locks during the entire protocol. If the coordinator crashes between phases, locks are held indefinitely.
  • Single point of failure — coordinator crash can leave the system in an indeterminate state.
  • Latency — two network round-trips plus locking degrades throughput severely at scale.
  • Coupling — all participants must support XA protocol.

Use the Saga pattern instead: compensating transactions replace rollback, and no coordinator holds locks across services while the flow is in progress.

32
What is database migration management in microservices? (Medium)

Each microservice owns its schema and must apply DB migrations as part of its own deployment. Tools: Flyway and Liquibase.

# Spring Boot auto-runs Flyway migrations on startup
spring.flyway.enabled=true
spring.flyway.locations=classpath:db/migration

# Migration file naming: V1__create_users.sql, V2__add_email_index.sql

Key rules for zero-downtime migrations:

  • Never drop a column or rename a column in the same deployment that removes the code using it — always do it in a separate deployment after old instances are gone.
  • Adding a nullable column is safe. Adding a NOT NULL column without a default is not.
  • Expand-contract pattern: add new column → deploy code to write both old and new → backfill → deploy code to read new only → drop old.
33
What is the Cache-Aside pattern? (Medium)
// Cache-aside (lazy loading): the application checks the cache first and
// loads from the DB itself on a miss.
public Product getProduct(Long id) {
    Product cached = cache.get(id);        // 1. Check cache
    if (cached != null) return cached;     // 2. Cache hit → return

    Product product = db.findById(id);     // 3. Cache miss → load DB
    if (product != null) {
        cache.put(id, product, 10, MINUTES);   // 4. Populate cache (skip missing rows)
    }
    return product;
}

Invalidation strategies:

  • TTL expiry — simplest, some staleness accepted
  • Write-through — update the cache on every write; keeps the cache in step with the DB at the cost of slower writes
  • Event-driven invalidation — consume "product updated" events and evict from cache immediately

Cache stampede: when a popular key expires and many requests simultaneously miss the cache and hammer the database. Fix: probabilistic early expiry, mutex locks on cache miss, or background refresh.
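
The mutex-on-miss fix can be sketched with `ConcurrentHashMap.computeIfAbsent`, which runs the loader at most once per missing key. This is an in-process illustration only: `SingleFlightCache` is a hypothetical name, TTL handling is left out, and a distributed cache would need a distributed lock or a similar mechanism.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical in-process cache; expiry is omitted to keep the sketch short.
final class SingleFlightCache<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();

    // computeIfAbsent applies the loader atomically, at most once per missing key.
    // Concurrent callers for the same key wait for that single load instead of
    // all hitting the database. The loader must not modify this cache itself.
    V get(K key, Function<K, V> loader) {
        return cache.computeIfAbsent(key, loader);
    }

    // Called on updates (or by an expiry job) so the next read reloads the value.
    void evict(K key) {
        cache.remove(key);
    }
}
```

If the loader returns null, nothing is cached and the next call loads again, so missing rows are not remembered by this sketch.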

Security
34
How do you secure inter-service communication? (Hard)

Defense in depth approach:

  • mTLS (mutual TLS) — every service presents a certificate; both sides authenticate. Enforced automatically by a service mesh (Istio, Linkerd). No code changes.
  • Service-to-service JWT — services authenticate with their own short-lived JWT (client credentials OAuth2 flow). The receiving service validates the JWT and checks the sub claim is an allowed service.
  • Network policies — Kubernetes NetworkPolicy restricts which pods can talk to which. Only services that need to communicate are allowed at the network level.
  • Zero-trust — assume the internal network is compromised; verify every request regardless of origin.
35
How does JWT authentication flow work across microservices? (Medium)
Client → API Gateway: POST /login {username, password}
API Gateway → Auth Service: validate credentials
Auth Service → API Gateway: JWT (signed with RS256 private key)
API Gateway → Client: JWT

Client → API Gateway: GET /orders (Authorization: Bearer <JWT>)
API Gateway: validate JWT signature (using public key), extract claims
API Gateway → Order Service: forward request + user claims in header
Order Service: trust the gateway-validated claims (no re-validation)

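In this design, downstream services don't validate the JWT signature again because the gateway already did. They read user claims from a trusted header (X-User-Id, X-User-Roles) set by the gateway. This is only safe if services are reachable solely through the gateway (network policies or mTLS) and the gateway strips any such headers sent by the client. Never trust headers set by the client directly. In a zero-trust setup, forward the JWT and have each service validate it as well.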

36
What is secrets management in microservices? (Medium)

Secrets (DB passwords, API keys, TLS certs) must never be hardcoded or stored in source control. Options in increasing security order:

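  • Environment variables — simple, but readable by anything that can inspect the process environment and easily leaked into logs or crash dumps; not encrypted at rest.
  • Kubernetes Secrets — base64-encoded (an encoding, not encryption), access controlled via RBAC. Better, but not encrypted at rest by default: enable encryption at rest for etcd, ideally backed by a KMS provider. Sealed Secrets solves a different problem, storing secrets safely in Git.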
  • HashiCorp Vault — dedicated secret store, dynamic secrets (generates short-lived DB credentials on demand), full audit trail, automatic rotation.
  • AWS Secrets Manager / GCP Secret Manager — managed cloud solution, automatic rotation, fine-grained IAM access control.

Best practice: use Vault or a cloud secrets manager. Rotate secrets regularly. Use short-lived dynamic credentials so a leaked secret expires quickly.

37
What is rate limiting and how do you implement it at the gateway? (Medium)

Rate limiting prevents abuse and protects downstream services from overload. Common algorithms:

  • Fixed window — allow N requests per minute. Simple but allows burst at window boundary.
  • Sliding window — smoother; counts requests in a rolling window.
  • Token bucket — a bucket fills at a constant rate; each request consumes a token. Allows controlled bursting. Used in AWS API Gateway.
  • Leaky bucket — requests are processed at a constant rate; excess is queued or dropped.
# Spring Cloud Gateway rate limiter (Redis-backed token bucket).
# The uri and Path predicate are illustrative; a route needs both to be valid.
spring.cloud.gateway.routes:
  - id: order-service
    uri: lb://order-service
    predicates:
      - Path=/orders/**
    filters:
      - name: RequestRateLimiter
        args:
          redis-rate-limiter.replenishRate: 100    # 100 tokens/sec
          redis-rate-limiter.burstCapacity: 200    # max burst
          key-resolver: "#{@userKeyResolver}"      # per-user limit (KeyResolver bean)
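
To make the token bucket algorithm concrete, here is a minimal in-process sketch. It is not how Spring Cloud Gateway implements it (that version keeps its state in Redis so all gateway instances share one limit); the class name and the injectable `clock` parameter are assumptions made for illustration and testing.

```python
import time


class TokenBucket:
    """Token bucket: refills at a constant rate, each request spends a token."""

    def __init__(self, replenish_rate: float, burst_capacity: float, clock=time.monotonic):
        self.replenish_rate = replenish_rate    # tokens added per second
        self.burst_capacity = burst_capacity    # maximum tokens the bucket holds
        self._clock = clock
        self._tokens = burst_capacity           # start full, so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never beyond the bucket capacity.
        self._tokens = min(self.burst_capacity, self._tokens + elapsed * self.replenish_rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller would typically respond with HTTP 429
```

With `replenish_rate=100` and `burst_capacity=200`, a client can briefly send 200 requests at once but is held to 100 per second on average.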
38
What is the principle of least privilege applied to microservices?Medium

Each service should have only the minimum permissions needed to do its job:

  • Database: create a dedicated DB user per service with only the tables and operations it needs. Order service shouldn't be able to read the User service's password table.
  • Kubernetes: each pod runs with its own ServiceAccount. RBAC grants only the specific API calls it needs (e.g. read secrets, not create ClusterRoles).
  • Cloud IAM: each service has its own IAM role. The S3 bucket for profile pictures is only accessible by the Profile service's IAM role.
  • Network: Kubernetes NetworkPolicies whitelist which pods can talk to which. Default deny all, then explicitly allow needed connections.
Observability & Deployment
39
What are the three pillars of observability?Easy
  • Logs — time-stamped records of events. Good for debugging a specific request or error. Tools: ELK stack (Elasticsearch, Logstash, Kibana), Loki + Grafana.
  • Metrics — numeric measurements over time (request rate, error rate, latency percentiles, JVM heap). Good for alerting on trends and anomalies. Tools: Prometheus + Grafana, Datadog.
  • Traces — end-to-end view of a request as it flows across services, with timing at each hop. Good for diagnosing latency and finding which service in the chain is slow. Tools: Jaeger, Zipkin, AWS X-Ray.
The goal: when something breaks, you can find the root cause without SSHing into a server. Logs for "what happened", metrics for "how bad is it", traces for "where in the system".
40
What is distributed tracing and how do trace IDs propagate?Medium

Distributed tracing follows a single request across service boundaries by tagging every hop with a shared trace ID and a per-hop span ID:

// Incoming request gets a Trace ID:
GET /checkout                    [TraceId: abc123, SpanId: span1]

// Gateway calls Order Service:
POST /orders                     [TraceId: abc123, SpanId: span2, ParentSpan: span1]

// Order Service calls Inventory:
GET /inventory/reserve           [TraceId: abc123, SpanId: span3, ParentSpan: span2]

// Inventory calls DB:
SELECT ...                       [TraceId: abc123, SpanId: span4, ParentSpan: span3]

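The TraceId stays the same through the entire request. Each hop creates a new SpanId and records the caller's span as its parent. W3C Trace Context headers (traceparent, tracestate) are the modern standard for propagating these IDs. Spring Boot with Micrometer Tracing propagates them automatically for RestTemplate and WebClient instances created from the auto-configured builders (RestTemplateBuilder, WebClient.Builder); Kafka clients need observation enabled on the template and listener container.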

In Jaeger/Zipkin you can visualise the full call tree and see exactly where latency is spent.
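
Propagation itself is simple enough to sketch by hand. The W3C traceparent header has the form version-traceid-parentid-flags (2, 32, 16, and 2 lowercase hex characters). The function below is an illustrative assumption, not a library API: it keeps the incoming trace ID, mints a new span ID for the outgoing call, and starts a fresh sampled trace when no valid header arrives. It skips some spec details, such as rejecting all-zero IDs.

```python
import re
import secrets
from typing import Optional

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(incoming: Optional[str]) -> str:
    """Build the traceparent header to send on an outgoing call."""
    match = _TRACEPARENT.match(incoming) if incoming else None
    if match:
        # Same trace, same sampling flags; the caller's span becomes our parent.
        _version, trace_id, _parent_span_id, flags = match.groups()
    else:
        # No usable context: start a new trace, marked as sampled.
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # new 16-hex-char span ID for this hop
    return f"00-{trace_id}-{span_id}-{flags}"
```

In practice a tracing library does this for you; the sketch only shows why the trace ID survives every hop while the span ID changes.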

41
What is structured logging and why is it important in microservices?Medium

Structured logging means outputting logs as machine-parseable JSON instead of free-text strings:

// Unstructured (hard to query):
2026-06-23 12:01:05 INFO  Order 12345 placed by user 42, total $199.99

// Structured JSON (easy to filter and aggregate):
{"timestamp":"2026-06-23T12:01:05Z","level":"INFO","service":"order-service",
 "traceId":"abc123","spanId":"span2","orderId":12345,"userId":42,
 "total":199.99,"event":"order.placed"}

With structured logs, Kibana/Grafana Loki can filter service="order-service" AND event="order.placed" AND total > 100 without regex parsing. The traceId field lets you jump from a trace in Jaeger directly to the logs for that request.

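# Logback JSON config (logstash-logback-encoder); the encoder goes inside an appender:
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
  <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>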
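
The same idea outside the JVM, as a rough sketch using only Python's standard `logging` module. The `JsonFormatter` class and the convention of passing structured fields under an `extra={"fields": ...}` key are assumptions chosen for this example, not a standard.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Structured fields arrive via logger.info(..., extra={"fields": {...}}).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("order-service"))
    log = logging.getLogger("order-service")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("order placed",
             extra={"fields": {"traceId": "abc123", "orderId": 12345, "event": "order.placed"}})
```

Keeping `traceId` as a top-level field is what makes the jump from a trace to its logs a simple equality filter.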
42
What are SLI, SLO, and SLA? How do they guide microservice reliability?Hard
  • SLI (Service Level Indicator) — the actual metric being measured. E.g. "p99 latency", "error rate", "availability".
  • SLO (Service Level Objective) — the target for an SLI. E.g. "p99 latency < 200ms for 99.9% of the time over 30 days".
  • SLA (Service Level Agreement) — the contractual commitment to external customers. Usually less strict than internal SLOs (you need headroom to detect and fix before breaching the SLA).

Error budget: if SLO is 99.9% availability, your error budget is 0.1% downtime per month (~43 minutes). Spend it on risky deployments. When the budget is exhausted, freeze deployments and focus on reliability.

Track SLIs as Prometheus metrics. Alert on SLO burn rate (how fast the error budget is being consumed), not on individual incidents.
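
The arithmetic behind error budgets and burn rate is worth being able to do on a whiteboard. In the sketch below the function names are illustrative; burn rate is taken as the observed error ratio divided by the allowed error ratio (1 minus the SLO), so a burn rate of 1 uses up the budget exactly at the end of the window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    return observed_error_ratio / (1.0 - slo)


def hours_until_budget_exhausted(rate: float, window_days: int = 30) -> float:
    """Time until a full budget is gone if the current burn rate continues."""
    return window_days * 24 / rate
```

For a 99.9% SLO over 30 days the budget is about 43.2 minutes. An error ratio of 1.44% gives a burn rate of roughly 14.4, which would drain the whole month's budget in about 50 hours, so it deserves a page rather than a ticket.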

43
What is a canary deployment and how does it reduce deployment risk?Medium

A canary deployment routes a small percentage of traffic to the new version before rolling out to everyone. Named after the "canary in a coal mine" — if it dies, you know there's a problem:

# Kubernetes with Argo Rollouts:
strategy:
  canary:
    steps:
    - setWeight: 5        # 5% traffic to new version
    - pause: {duration: 5m}
    - analysis: {templates: [{templateName: error-rate-check}]}   # references an AnalysisTemplate
    - setWeight: 50       # 50% if analysis passed
    - pause: {duration: 10m}
    - setWeight: 100      # full rollout

Monitor error rate and latency during each step. If metrics degrade beyond a threshold, roll back automatically. This lets you validate the new version in production with minimal blast radius.
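
The decision made at each analysis step can be sketched as a comparison between the canary and the stable version. This is a simplified illustration and not how Argo Rollouts evaluates an AnalysisTemplate; the thresholds, `WindowStats`, and `canary_verdict` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    stable: WindowStats,
    canary: WindowStats,
    max_error_rate_delta: float = 0.01,   # canary may be at most 1 point worse
    max_latency_ratio: float = 1.2,       # canary p99 may be at most 20% slower
    min_requests: int = 100,              # do not judge on too little traffic
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the current canary step."""
    if canary.requests < min_requests:
        return "wait"
    if canary.error_rate - stable.error_rate > max_error_rate_delta:
        return "rollback"
    if canary.p99_latency_ms > stable.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

Comparing against the stable version over the same window, rather than against a fixed number, avoids rolling back a healthy canary during a platform-wide blip.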

44
What is blue-green deployment?Medium

Blue-green deployment maintains two identical production environments. "Blue" is live. "Green" has the new version deployed and tested:

  1. Deploy new version to green environment
  2. Run smoke tests on green (real infrastructure, no production traffic)
  3. Switch the load balancer from blue to green (instant cutover)
  4. Keep blue running for immediate rollback if issues arise
  5. After confidence period, tear down blue

Advantages: instant rollback (just switch LB back), zero-downtime. Disadvantages: double infrastructure cost, database schema changes must be backward-compatible with both versions.
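
The cutover and rollback steps amount to swapping a single pointer, which a toy model makes clear. `BlueGreenRouter` and the `smoke_test` callback are hypothetical names; in a real setup the "pointer" is a load balancer target, DNS record, or Kubernetes Service selector.

```python
from typing import Callable


class BlueGreenRouter:
    """Tracks which environment receives production traffic."""

    def __init__(self, live: str = "blue", idle: str = "green"):
        self.live = live
        self.idle = idle

    def cutover(self, smoke_test: Callable[[str], bool]) -> bool:
        """Switch traffic to the idle environment only if its smoke test passes."""
        if not smoke_test(self.idle):
            return False  # live traffic is untouched
        self.live, self.idle = self.idle, self.live
        return True

    def rollback(self) -> None:
        """Point traffic back at the previous environment, which is still running."""
        self.live, self.idle = self.idle, self.live
```

Rollback is instant only because the old environment is kept running, which is exactly where the doubled infrastructure cost comes from.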

45
How do you handle service dependencies during deployment ordering?Hard

When Service A calls Service B, and you're deploying a breaking API change in B, deployment order matters.

Safe deployment order for breaking changes:

  1. Deploy B with BOTH old and new API version (parallel run)
  2. Deploy A to use the new version
  3. Verify all consumers of old API are migrated
  4. Deploy B with old API removed

This is the expand-contract pattern applied to APIs. Always deploy the provider change first (backward compatible), then consumers, then remove old behavior.

Consumer-driven contract tests (Pact) catch this before deployment — the provider's CI pipeline verifies it still satisfies all consumer contracts.
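
Step 1 of the sequence (the "expand" phase) might look like the sketch below, where the provider serves both response shapes and counts who still calls the old one. The field names, the `caller` argument, and the `old_api_calls` counter are assumptions made up for this example.

```python
from collections import Counter

# Calls to the old contract per consumer; supports step 3 (verify migration).
old_api_calls = Counter()


def render_customer(customer: dict, api_version: int, caller: str) -> dict:
    """Serve the same record in the old (v1) or new (v2) response shape."""
    if api_version == 1:
        old_api_calls[caller] += 1
        # Old contract: a single combined name field.
        return {
            "id": customer["id"],
            "name": f"{customer['first_name']} {customer['last_name']}",
        }
    if api_version == 2:
        # New contract: the breaking change splits the name.
        return {
            "id": customer["id"],
            "firstName": customer["first_name"],
            "lastName": customer["last_name"],
        }
    raise ValueError(f"unsupported API version: {api_version}")
```

Once `old_api_calls` stays flat for every consumer over an agreed period, the v1 branch can be deleted in the "contract" phase.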

46
What is the difference between horizontal and vertical scaling in microservices?Easy
  • Vertical scaling (scale up) — give the instance more CPU/RAM. Simple, but limited by hardware ceiling. Single point of failure remains. No code change needed.
  • Horizontal scaling (scale out) — add more instances. Cloud-native, theoretically unlimited. Requires the service to be stateless (no per-instance session state; keep shared state in a database or cache) so any instance can handle any request. Better fault tolerance.

Microservices are designed for horizontal scaling — each service scales independently based on its own load. Kubernetes HPA (Horizontal Pod Autoscaler) scales pods automatically based on CPU/memory or custom metrics (e.g. Kafka consumer lag).

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment        # assumes order-service runs as a Deployment
    name: order-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource: {name: cpu, target: {type: Utilization, averageUtilization: 70}}
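
The core rule the HPA applies is documented as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to the min and max. The sketch below implements only that rule; it deliberately leaves out the tolerance band, stabilization windows, and handling of missing metrics that the real controller adds.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,   # e.g. average CPU utilization in percent
    target_utilization: float,    # e.g. 70 for averageUtilization: 70
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Simplified HPA scaling rule: proportional scaling, rounded up, clamped."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 pods at 90% CPU and a 70% target, the rule asks for ceil(4 × 90 / 70) = 6 pods.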
47
What is chaos engineering and how do tools like Chaos Monkey help?Hard

Chaos engineering is the practice of deliberately injecting failures into a system in production to validate that it can withstand real-world disruptions. Netflix coined the term with Chaos Monkey (randomly terminates EC2 instances).

Experiments:

  • Terminate random pods (Chaos Monkey, LitmusChaos)
  • Inject network latency between services (Toxiproxy, Istio fault injection)
  • Kill a database node and verify failover works
  • Saturate CPU on a service and verify circuit breakers trip
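# Istio fault injection (fragment of a VirtualService spec; host name is illustrative):
http:
- fault:
    delay:
      percentage: {value: 50}
      fixedDelay: 3s           # 50% of requests to inventory get 3s delay
  route:
  - destination: {host: inventory}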

Run chaos experiments in staging first, then production during business hours with an on-call engineer present. The goal: find weaknesses before users do.
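
The same kind of latency fault can be injected at the application level when no service mesh is available. The decorator below is a small sketch under that assumption; `inject_delay` and its `rng` and `sleep` parameters are hypothetical and exist so the behaviour can be tested deterministically.

```python
import random
import time
from functools import wraps


def inject_delay(probability: float, delay_seconds: float,
                 rng=random.random, sleep=time.sleep):
    """Delay a fraction of calls to the wrapped function by a fixed amount."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # rng() returns a float in [0, 1); delay roughly `probability` of calls.
            if rng() < probability:
                sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Wrapping a client call with `inject_delay(0.5, 3.0)` lets you check that the caller's timeout and circuit breaker actually fire, which is the hypothesis the experiment is meant to test.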

48
How do you test microservices effectively?Medium

The microservice testing pyramid:

  • Unit tests — test domain logic in isolation. No Spring context, no HTTP, no DB. Fast.
  • Component tests — test the service in isolation over real HTTP, with downstream services stubbed (WireMock) and a real, disposable database (Testcontainers). Tests the full service stack.
  • Contract tests — verify provider API matches consumer expectations (Pact/Spring Cloud Contract). Replaces slow integration tests.
  • End-to-end tests (E2E) — a full system test covering critical paths. Slow, brittle, expensive, so keep them minimal and cover only the most important happy paths.
Avoid large shared integration environments — they create bottlenecks and flaky tests. Use contract tests + Testcontainers instead. Each service tests itself in CI.
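
The idea behind a consumer-driven contract check can be shown without any framework. This is a deliberately tiny sketch and not the Pact or Spring Cloud Contract API: the consumer declares the fields and types it relies on, and the provider's CI verifies a real response against that declaration.

```python
def contract_violations(expected: dict, response: dict) -> list:
    """Compare a provider response with the fields a consumer depends on.

    `expected` maps field name -> required Python type. Extra fields in the
    response are ignored, so the provider stays free to add new ones.
    """
    problems = []
    for field, expected_type in expected.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(response[field]).__name__}"
            )
    return problems
```

An empty list means the provider still satisfies this consumer; anything else should fail the provider's build before the change ships.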
49
What CI/CD practices are essential for microservices at scale?Medium
  • Independent pipelines per service — each service has its own CI/CD pipeline and can be deployed without other services.
  • Trunk-based development — short-lived feature branches, merge to main daily. Long-lived branches create integration debt.
  • Automated tests at every stage — unit → integration (Testcontainers) → contract → deploy to staging → E2E → prod.
  • Immutable artefacts — build the Docker image once, promote the same image through environments. Never rebuild for staging or prod.
  • GitOps — environment state is declared in Git (Helm charts, Kustomize). A GitOps operator (ArgoCD, Flux) syncs Kubernetes to match Git. Rollback = revert Git commit.
  • Feature flags — deploy code dark (off by default), enable for internal users, enable for percentage of users, full rollout. Decouple deploy from release.
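
The percentage rollout behind a feature flag is usually a stable hash of the user, so the same user always gets the same answer and raising the percentage only ever adds users. A minimal sketch, with `is_enabled` and its parameters as illustrative assumptions rather than any flag vendor's API:

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int,
               internal_users=frozenset()) -> bool:
    """Decide whether a flag is on for a user during a gradual rollout."""
    if user_id in internal_users:
        return True  # internal users see the feature first
    # Hash flag + user so each flag buckets users independently.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket 0..99
    return bucket < rollout_percent
```

Setting `rollout_percent` to 0 keeps the code dark after deployment, and moving it to 100 completes the release without another deploy.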
50
If an order service is down, walk through how you diagnose it in a microservices environment.Hard

This is the "real world incident" senior question. A structured runbook:

  1. Assess blast radius — what percentage of users/requests are affected? Check your metrics dashboard (Grafana) for error rate spike and affected services.
  2. Check pod health: kubectl get pods -n production | grep order. Are pods CrashLoopBackOff? OOMKilled? Pending?
  3. Check logs: kubectl logs <pod> --previous. Is there a startup exception? DB connection failure? OutOfMemoryError?
  4. Check readiness/liveness — is Kubernetes marking pods not-ready and pulling them from the load balancer?
  5. Check distributed traces — in Jaeger, find a failing trace. Where does it fail — the order service itself, or a dependency (payment, inventory)?
  6. Check recent deployments — was there a deployment in the last 30 minutes? Roll back if yes.
  7. Check dependencies — is the database up? kubectl get pods -n databases. Is Redis up? Is a downstream service the actual root cause?
  8. Check metrics — CPU spike? Memory exhaustion? Connection pool exhaustion? JVM metaspace full?
Mitigation first (rollback, scale up, failover), then root cause analysis after service is restored. Don't spend 30 minutes diagnosing while customers see errors.

What to Study Next