This guide covers the most frequently asked DevOps interview questions in 2026 — from cultural principles and CI/CD pipelines to production reliability, security, and platform engineering. Applicable to DevOps Engineer, SRE, and Platform Engineer roles.
DevOps is a set of practices and cultural philosophies that combine software development (Dev) and IT operations (Ops) to shorten the development cycle and deliver high-quality software continuously.
Problem it solves — the "wall of confusion":
DevOps principles (CALMS):
DORA (DevOps Research and Assessment) metrics measure software delivery and operational performance:
CI: code → build → unit tests → integration tests
CD (Delivery): → deploy to staging → acceptance tests → [HUMAN GATE] → deploy to prod
CD (Deployment): → deploy to staging → acceptance tests → automatically deploy to prod
Most organisations practice CI + Continuous Delivery. Continuous Deployment requires very mature automated testing and feature flag infrastructure.
Shift-left means moving testing, security, and quality checks earlier in the development lifecycle (to the "left" on a timeline). The earlier a bug is found, the cheaper it is to fix.
Traditional (shift-right):
Developer → Code Review → QA → Security Review → Staging → Prod
↑ bugs found here, expensive to fix
Shift-left:
Developer → Pre-commit hooks → CI tests → SAST scan → Code Review → Staging → Prod
↑ bugs caught here, cheap to fix
Shift-left techniques:
A blameless postmortem is a structured review of a production incident that focuses on systems and processes rather than on blaming individuals. Popularised by Etsy and Google SRE, it is now widely adopted in DevOps culture.
Why blameless: people make mistakes when they are under pressure, have incomplete information, or work with poorly designed systems. If engineers fear blame, they hide problems, don't speak up, and incidents repeat. Psychological safety produces better learning.
Postmortem structure:
Feature flags (feature toggles) are conditional code paths that enable/disable features at runtime without deployment. They decouple deployment from feature release.
// Code:
if (featureFlags.isEnabled("new-checkout-flow", userId)) {
return newCheckout(request);
} else {
return legacyCheckout(request);
}
// Feature flag config (toggleable in UI, no deployment):
{"new-checkout-flow": {"enabled": true, "rollout": 5}} // on for 5% of users
Enables trunk-based development: all developers work on one branch (main/trunk). Incomplete features are merged but hidden behind a flag. No long-lived feature branches → no merge hell. Continuous Integration works correctly.
Use cases: A/B testing, gradual rollout (canary to 1% → 10% → 100%), kill switch (turn off a broken feature without rollback), dark launch (code runs but output not shown to users).
Tools: LaunchDarkly, Unleash, Flagsmith, AWS AppConfig.
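A percentage rollout needs each user to land in the same group on every request. A minimal sketch of one common approach, hashing the flag name and user ID into a bucket from 0 to 99, is below. The function names and the in-memory `flags` dict are assumptions for illustration, not the API of any tool listed above.

```python
import hashlib


def bucket_for(flag_name: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flags: dict, flag_name: str, user_id: str) -> bool:
    """True if the flag is on and the user falls inside the rollout percentage."""
    config = flags.get(flag_name)
    if not config or not config.get("enabled", False):
        return False  # unknown or disabled flags fall back to the legacy path
    return bucket_for(flag_name, user_id) < config.get("rollout", 100)


# Hypothetical config, mirroring the JSON shape shown earlier
flags = {"new-checkout-flow": {"enabled": True, "rollout": 5}}
users = [f"u-{i}" for i in range(10_000)]
share = sum(is_enabled(flags, "new-checkout-flow", u) for u in users) / len(users)
print(f"{share:.1%} of users see the new checkout")
```

Including the flag name in the hash keeps rollouts independent: a user in the first 5% for one flag is not automatically in the first 5% for every other flag.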
Toil is manual, repetitive, operational work that scales with traffic — not engineering work that improves the system. Google SRE policy: keep toil below 50% of each engineer's time; the rest should be engineering (automation, reducing future toil).
Examples of toil:
Why eliminate: toil burns out engineers, blocks feature work, grows linearly with scale (you need to hire more people just to keep up), and is often error-prone.
Elimination approach: if you do it twice, automate it. Write it as code, add it to CI, make it self-service.
As Google SRE says: "SRE is what happens when you ask a software engineer to design an operations function."
| DevOps | SRE |
|---|---|
| Cultural philosophy | Specific practices + job role |
| Focus: speed + collaboration | Focus: reliability + error budgets |
| Broad, flexible | Prescriptive (SLO/SLA/SLI) |
Trigger: push to PR or merge to main
Stage 1: Code Quality
- Linting (ESLint, Checkstyle, flake8)
- Formatting check (Prettier, Black)
- Static analysis (SonarQube, PMD)
Stage 2: Build
- Compile / package (Maven, Gradle, npm build)
- Build Docker image
Stage 3: Test
- Unit tests (JUnit, Jest, pytest)
- Integration tests (against test DB, mocked services)
- Code coverage check (fail if < 80%)
Stage 4: Security
- SAST scan (Semgrep, Checkmarx)
- Dependency vulnerability scan (Snyk, OWASP Dependency-Check)
- Container image scan (Trivy, Grype)
- Secret detection (truffleHog, detect-secrets)
Stage 5: Artifact
- Push image to registry (ECR, GCR, Docker Hub)
- Tag with git SHA
- Publish test reports, coverage report
Total pipeline runtime goal: under 10 minutes. Parallelise independent stages.
GitHub Actions is a CI/CD platform built into GitHub. Workflows are YAML files in .github/workflows/. Events trigger workflow runs on hosted or self-hosted runners.
name: CI
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up JDK 21
        uses: actions/setup-java@v4
        with:
          java-version: '21'
          distribution: 'temurin'
      - name: Run tests
        run: mvn test
      # Pushing on main assumes an earlier registry login step (e.g. docker/login-action)
      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          push: ${{ github.ref == 'refs/heads/main' }}
          tags: myapp:${{ github.sha }}
Actions marketplace has 20,000+ reusable actions. Self-hosted runners for: private networks, GPU builds, larger machines, cost reduction at scale. Use OIDC for keyless authentication to cloud providers (no secrets in GitHub).
Jenkins is an open-source automation server (Java-based) for building CI/CD pipelines. A Jenkinsfile defines the pipeline as code (stored in the repo, version-controlled).
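// Declarative Jenkinsfile:
pipeline {
  agent { docker { image 'maven:3.9-eclipse-temurin-21' } }
  environment {
    // Username/password credentials also expose DOCKER_CREDS_USR and DOCKER_CREDS_PSW
    DOCKER_CREDS = credentials('docker-hub-creds')
  }
  stages {
    stage('Build') {
      steps {
        sh 'mvn clean package -DskipTests'
      }
    }
    stage('Test') {
      steps {
        sh 'mvn test'
      }
      post {
        always {
          junit 'target/surefire-reports/*.xml'
        }
      }
    }
    stage('Docker Push') {
      // Assumption: this agent has the Docker CLI and daemon access (the Maven image does not)
      agent any
      when { branch 'main' }
      steps {
        sh 'echo "$DOCKER_CREDS_PSW" | docker login -u "$DOCKER_CREDS_USR" --password-stdin'
        sh 'docker build -t myapp:${GIT_COMMIT} .'
        sh 'docker push myapp:${GIT_COMMIT}'
      }
    }
  }
}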
Jenkins strengths: very mature (forked from Hudson in 2011), largest plugin ecosystem (1800+ plugins), self-hosted control, complex pipeline support. Weaknesses: high operational overhead, complex config.
Argo CD is a declarative GitOps controller for Kubernetes. It continuously syncs the cluster state with the desired state defined in Git.
# Application definition:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-manifests
    targetRevision: main
    path: apps/my-app/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # delete resources removed from Git
      selfHeal: true  # revert manual changes in cluster
    syncOptions:
      - CreateNamespace=true
GitOps workflow:
Never store secrets in code or CI environment variables in plaintext.
Approaches by security level:
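# GitHub Actions OIDC → AWS (no long-lived secrets needed):
# The job must be allowed to request an OIDC token:
permissions:
  id-token: write
  contents: read
# Step:
- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/github-actions
    aws-region: us-east-1
# The action exchanges GitHub's OIDC token with AWS STS for temporary credentials
# No AWS_ACCESS_KEY_ID stored in GitHub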
| Monorepo Pros | Monorepo Cons |
|---|---|
| Atomic cross-service changes (one PR) | CI must be smart (only build what changed) |
| Shared libraries, consistent tooling | Larger clone, slower full builds |
| Easier code reuse and refactoring | Access control harder (everyone sees all code) |
| Single source of truth for dependency versions | Noisy blame/history for large teams |
Monorepo tools for affected-only CI: Nx (Node), Bazel (polyglot), Turborepo (JS), Pants (Python). On GitHub: path filters in Actions workflows to only trigger jobs for changed paths.
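The idea behind affected-only CI can be sketched in a few lines: map each changed path to the service that owns it, and rebuild everything when shared code changes. The directory layout (`services/<name>/`, `libs/`) and the function name are assumptions for illustration; tools such as Nx or Bazel derive this from a real dependency graph instead.

```python
from pathlib import PurePosixPath

# Hypothetical layout: each service lives under services/<name>/, shared code under libs/
SHARED_PREFIXES = ("libs/",)


def affected_services(changed_files, all_services):
    """Return the sorted list of services that need a CI run for the changed paths."""
    affected = set()
    for path in changed_files:
        if path.startswith(SHARED_PREFIXES):
            return sorted(all_services)  # shared code changed: rebuild everything
        parts = PurePosixPath(path).parts
        if len(parts) >= 2 and parts[0] == "services" and parts[1] in all_services:
            affected.add(parts[1])
    return sorted(affected)


services = {"orders", "payments", "inventory"}
print(affected_services(["services/orders/src/main.py", "README.md"], services))
```

In practice the changed-file list would come from something like `git diff --name-only` against the merge base.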
# Tools: Flyway, Liquibase, Alembic (Python), golang-migrate
# Migration file naming (Flyway):
V1__Create_users_table.sql
V2__Add_email_index.sql
V3__Add_last_login_column.sql
# CI pipeline stage before deployment:
- name: Run DB Migrations
  run: |
    flyway -url=$DB_URL -user=$DB_USER -password=$DB_PASS migrate
# Kubernetes: run migration as an init container before app starts:
initContainers:
  - name: db-migrate
    # <IMAGE_TAG> is a placeholder filled in by your templating step (Helm, Kustomize, envsubst)
    image: my-app:<IMAGE_TAG>
    command: ["./migrate.sh"]
    env:
      - name: DB_URL
        valueFrom:
          secretKeyRef:
            name: db-credentials
            key: url
Rules for zero-downtime migrations:
Semantic versioning (SemVer): MAJOR.MINOR.PATCH
Automated versioning via Conventional Commits:
# Commit message format determines version bump:
feat: add OAuth2 login support → MINOR bump (1.2.0 → 1.3.0)
fix: resolve null pointer in payment → PATCH bump (1.3.0 → 1.3.1)
feat!: redesign API response format → MAJOR bump (1.3.1 → 2.0.0)
# (the ! signals a breaking change)
# Tools:
# semantic-release (Node.js) — reads commits, bumps version, creates tag, publishes changelog
# release-please (Google) — GitHub Action that creates release PRs automatically
For Docker images: tag with git SHA for every build (immutable, traceable), plus semantic version tags (1.3.0, 1.3, 1, latest) on releases.
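The commit-to-version rule above is simple enough to sketch directly. This is an illustration of the Conventional Commits mapping under simplified parsing assumptions, not how semantic-release or release-please are implemented; `bump_for` and `next_version` are hypothetical names.

```python
import re

# type, optional (scope), optional "!" for a breaking change, then ": description"
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<breaking>!)?: .+")


def bump_for(commit_messages):
    """Return 'major', 'minor', 'patch', or None for a list of commit messages."""
    level = None
    for message in commit_messages:
        header = message.splitlines()[0] if message else ""
        match = HEADER.match(header)
        if not match:
            continue  # not a conventional commit: ignored
        if match.group("breaking") or "BREAKING CHANGE:" in message:
            return "major"
        if match.group("type") == "feat":
            level = "minor"
        elif match.group("type") == "fix" and level is None:
            level = "patch"
    return level


def next_version(current, bump):
    """Apply a bump level to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in current.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current


print(next_version("1.2.0", bump_for(["feat: add OAuth2 login support"])))
```

The highest-impact commit since the last release wins: one breaking change forces a MAJOR bump regardless of how many fixes accompany it.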
# Without multi-stage: all build tools end up in the final image (large + insecure)
# With multi-stage: final image only contains runtime artifacts
# Java Spring Boot example:
# Stage 1: Build (JDK + Maven, only in build stage; the plain eclipse-temurin image has no mvn)
FROM maven:3.9-eclipse-temurin-21 AS builder
WORKDIR /app
COPY pom.xml .
# Cache dependencies in their own layer
RUN mvn dependency:go-offline
COPY src ./src
RUN mvn clean package -DskipTests
# Stage 2: Extract layers (Spring Boot's layertools)
FROM builder AS extractor
RUN java -Djarmode=layertools -jar target/*.jar extract
# Stage 3: Runtime (JRE only: no Maven, no source code, no build caches)
FROM eclipse-temurin:21-jre-jammy
WORKDIR /app
# Copy every extracted layer, least frequently changed first (JarLauncher lives in spring-boot-loader)
COPY --from=extractor /app/dependencies/ ./
COPY --from=extractor /app/spring-boot-loader/ ./
COPY --from=extractor /app/snapshot-dependencies/ ./
COPY --from=extractor /app/application/ ./
ENTRYPOINT ["java", "org.springframework.boot.loader.launch.JarLauncher"]
Results: image reduced from ~800MB (JDK+Maven) to ~200MB (JRE only). Smaller attack surface, faster pulls, cheaper storage.
IaC defines infrastructure (servers, networks, databases, permissions) in machine-readable configuration files that can be version-controlled, reviewed, and automated.
Why it's essential:
Cost estimates for a terraform plan before applying (Infracost).
# 1. Write HCL configuration:
resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "t3.medium"
  tags          = { Name = "web-server", Env = "prod" }
}
# 2. terraform init — download provider plugins
terraform init
# 3. terraform plan — compare desired state vs actual (no changes applied)
terraform plan
# Shows: +create, ~update, -destroy
# 4. terraform apply — apply the plan to real infrastructure
terraform apply
# 5. terraform destroy — tear down all resources
terraform destroy
State file (terraform.tfstate) — records the mapping between Terraform resources and real cloud resources. Critical: if lost, Terraform loses track of what it manages. Store in S3 + DynamoDB locking for teams.
terraform {
  backend "s3" {
    bucket         = "my-tfstate"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
A Terraform module is a reusable, parameterised set of resources packaged together. Like a function in programming — write once, call many times with different inputs.
# modules/vpc/main.tf: reusable VPC module
variable "cidr_block" {}
variable "env" {}
# A single-line HCL block may hold only one argument, so this block spans several lines
resource "aws_vpc" "this" {
  cidr_block = var.cidr_block
  tags       = { Env = var.env }
}
output "vpc_id" { value = aws_vpc.this.id }
# root module: call with different params per environment:
module "dev_vpc" {
  source     = "./modules/vpc"
  cidr_block = "10.0.0.0/16"
  env        = "dev"
}
module "prod_vpc" {
  source     = "./modules/vpc"
  cidr_block = "10.1.0.0/16"
  env        = "prod"
}
The Terraform Registry has community modules for all major AWS/GCP/Azure patterns (VPC, EKS, RDS). Use them as starting points. Always constrain the module version so updates don't break your infra unexpectedly: either an exact version, or a pessimistic constraint such as version = "~> 5.0", which accepts any 5.x release but not 6.0.
# Ansible Playbook: install and start Nginx
- hosts: web_servers
  become: yes
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present
        update_cache: yes
    - name: Copy config
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
    - name: Start nginx
      service:
        name: nginx
        state: started
        enabled: yes
Common pattern: Terraform provisions infrastructure (creates EC2, VPC, RDS), Ansible configures the servers (installs software, deploys app). They complement each other. On modern container/K8s deployments, Ansible is often replaced by Docker + Helm.
Configuration drift occurs when a server's actual configuration diverges from the desired state defined in code — due to manual changes, failed automation, different deployments over time.
Prevention strategies:
Mutable: Server-v1 → SSH → apt install nginx → Server-v1-modified (drift)
Immutable: Server-v1 → build new image → Server-v2 → terminate Server-v1
Immutable infrastructure benefits: predictable deploys (tested image is exactly what runs in prod), fast rollback (just deploy previous image), no snowflake servers, horizontal scaling (identical instances).
Containers make this natural — every deploy is a new image. AMI-based infrastructure with Packer achieves the same for EC2.
Policy-as-Code expresses compliance rules (security, cost, governance) as code that can be automatically evaluated in CI/CD pipelines.
# Checkov in CI (fail the pipeline if any check fails):
- name: Run Checkov
  uses: bridgecrewio/checkov-action@master
  with:
    directory: terraform/
    framework: terraform
    soft_fail: false # a failed check fails the pipeline
Packer (HashiCorp) builds identical machine images (AMI, GCP image, Docker image) from a template file. The image is pre-baked with all dependencies — no configuration happens at boot time.
// Packer template (legacy JSON format): build an AMI with Java 21 + app pre-installed
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami_filter": {
      "filters": {"name": "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"},
      "owners": ["099720109477"],
      "most_recent": true
    },
    "instance_type": "t3.micro",
    "ssh_username": "ubuntu",
    "ami_name": "my-app-{{timestamp}}"
  }],
  "provisioners": [
    {"type": "shell", "script": "scripts/install-java.sh"},
    {"type": "ansible", "playbook_file": "deploy-app.yml"}
  ]
}
Pipeline: Packer builds AMI in CI → AMI ID stored in artifact → Terraform references the AMI ID when creating EC2 instances → all instances are identical, pre-baked, start in seconds.
Correlation: when an alert fires (metric), navigate to the logs for that time period, then find the trace ID in the logs, then follow the trace to see which service caused the issue. Modern tools (Grafana) link all three.
Error budget: the amount of failure allowed before the SLO is breached. A 99.9% SLO leaves a 0.1% error budget, which is 43.2 minutes of downtime in a 30-day month (about 43.8 minutes in an average 730-hour month). Track error budget consumption. If it is burning fast → freeze feature work, fix reliability.
Availability SLO = 99.9%
Monthly error budget = 0.1% × 43,200 min = 43.2 min
Used this month = 28 min (≈65% of budget)
Remaining = 15.2 min
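The same arithmetic as a small sketch, useful for plugging in other SLO targets or windows. The function names are illustrative assumptions; the window is the 30-day month used in the example.

```python
def error_budget_minutes(slo: float, window_minutes: float) -> float:
    """Total downtime allowed by an availability SLO over the window."""
    return (1.0 - slo) * window_minutes


def budget_report(slo: float, window_minutes: float, downtime_minutes: float) -> dict:
    """Summarise budget size, share consumed, and minutes remaining."""
    budget = error_budget_minutes(slo, window_minutes)
    return {
        "budget_min": round(budget, 1),
        "used_pct": round(100 * downtime_minutes / budget, 1),
        "remaining_min": round(budget - downtime_minutes, 1),
    }


WINDOW_30_DAYS = 30 * 24 * 60  # 43,200 minutes
print(budget_report(0.999, WINDOW_30_DAYS, 28))
```

Each extra nine shrinks the budget tenfold: at 99.99% the same window allows only about 4.3 minutes.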
Prometheus is a pull-based monitoring system with a built-in time-series database. It scrapes metrics from targets via HTTP at a configured interval (the default scrape_interval is 1m; 15s is a common setting).
# prometheus.yml
scrape_configs:
  - job_name: 'spring-boot-apps'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: (.+)   # only override the path when the annotation is set
        target_label: __metrics_path__
# Application exposes /actuator/prometheus with:
http_server_requests_seconds_count{method="GET",status="200",uri="/api/orders"} 4521
http_server_requests_seconds_sum{...} 45.2
jvm_memory_used_bytes{area="heap"} 157286400
Query language: PromQL
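# Error rate over last 5m (sum both sides; otherwise the label sets never match across statuses):
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
/ sum(rate(http_server_requests_seconds_count[5m]))
# p99 latency (aggregated by le; assumes histogram buckets are exported):
histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket[5m])))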
# ELK: Elasticsearch + Logstash + Kibana
# EFK: Elasticsearch + Fluentd + Kibana (more Kubernetes-native)
# Architecture:
Application pods → stdout
↓ DaemonSet (Fluentd/Filebeat on every node)
↓ tails /var/log/containers/*.log
→ Elasticsearch (indexed, searchable)
→ Kibana (visualise, search, dashboard)
# Structured logging (JSON) — much more queryable than plaintext:
{
"timestamp": "2026-06-23T10:15:30Z",
"level": "ERROR",
"service": "payment-service",
"traceId": "4bf92f3577b34da6",
"userId": "u-12345",
"message": "Payment processing failed",
"errorCode": "CARD_DECLINED"
}
Modern alternative: Grafana Loki — stores logs without indexing content (only labels are indexed, like Prometheus). Much cheaper at scale. Log lines accessed via log labels + time range. Integrates natively with Grafana.
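Emitting structured logs usually takes only a custom formatter. A minimal sketch with Python's standard `logging` module is below, producing one JSON object per line on stdout in the shape of the log entry shown earlier. The `JsonFormatter` class and the choice of context fields are assumptions for illustration.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    CONTEXT_FIELDS = ("traceId", "userId", "errorCode")

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


logger = logging.getLogger("payment-service")
handler = logging.StreamHandler(sys.stdout)  # containers log to stdout
handler.setFormatter(JsonFormatter("payment-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={"traceId": "4bf92f3577b34da6", "userId": "u-12345", "errorCode": "CARD_DECLINED"},
)
```

Carrying the trace ID in every log line is what makes the jump from a log entry to the matching trace possible.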
Distributed tracing tracks a request through multiple services by propagating a trace context (trace ID, span ID) in headers. Each service creates a "span" — a unit of work with start/end time and metadata.
# User request creates a trace across 5 services:
Trace ID: abc123
Span: API Gateway 0-10ms
Span: Auth Service 1-8ms
Span: Order Service 8-45ms
Span: Inventory Service 10-20ms
Span: Database Query 20-40ms (!!!)
Span: Notification Service 45-60ms
# Flame graph shows: DB query takes 20ms of the 60ms trace (about 33% of total latency)
Without tracing: "the /order endpoint is slow" — impossible to know which of the 5 services is responsible. With tracing: immediately see it's the database query in the Order Service.
Instrument with OpenTelemetry SDK → export to Jaeger, Zipkin, AWS X-Ray, or Grafana Tempo.
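Context propagation itself is small. The sketch below builds and forwards a W3C Trace Context `traceparent` header (`version-traceid-spanid-flags`): a child span keeps the incoming trace ID and gets a fresh span ID. In real services the OpenTelemetry SDK does this for you; the helper names here are hypothetical.

```python
import re
import secrets

# 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: random trace ID and span ID, sampled flag set."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming: str) -> str:
    """Header to send downstream: same trace ID, new span ID, same flags."""
    match = TRACEPARENT.match(incoming or "")
    if not match:
        return new_traceparent()  # missing or malformed header: start a new trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


root = new_traceparent()
downstream = child_traceparent(root)
print(root)
print(downstream)
```

Because every hop reuses the trace ID, the backend can stitch spans from all services into one trace.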
The four golden signals, coined in Google's SRE Book, are the metrics that matter most for any user-facing system: latency, traffic, errors, and saturation.
Synthetic monitoring proactively simulates user interactions to detect issues before real users experience them. Scripted tests run at regular intervals from multiple geographic locations.
- GET /health every 30s from 5 regions. Alert if unreachable.
- Call /api/payment with test data, verify response time < 500ms and response body is correct.
Tools: Datadog Synthetic, Pingdom, AWS CloudWatch Synthetics, k6, Playwright + cron.
Pair with real user monitoring (RUM) — JavaScript in the browser collects actual user experience (Core Web Vitals, page load time) from real traffic.
Alert fatigue occurs when too many low-quality alerts desensitise engineers, causing them to miss real incidents.
Principles:
# Multiwindow, multi-burn-rate alert (from the Google SRE Workbook):
# For a 99.9% SLO, alert when the error budget burns at more than 14.4x in both the last 1h and the last 5m:
(sum(rate(http_requests_total{status=~"5.."}[1h]))
/ sum(rate(http_requests_total[1h]))) > (14.4 * 0.001)
and
(sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))) > (14.4 * 0.001)
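Burn rate is the observed error ratio divided by the error budget ratio, so a burn rate of 1 spends the budget exactly over the SLO window. A sketch of that arithmetic follows; the function names are illustrative, and the default threshold of 14.4 is the SRE Workbook's figure for the 1h/5m window pair, which spends 2% of a 30-day budget in one hour.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo)


def should_page(long_ratio: float, short_ratio: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window exceed the threshold."""
    return (burn_rate(long_ratio, slo) > threshold
            and burn_rate(short_ratio, slo) > threshold)


def hours_to_exhaustion(error_ratio: float, slo: float,
                        window_hours: float = 30 * 24) -> float:
    """Time until the whole budget is gone at the current error ratio."""
    rate = burn_rate(error_ratio, slo)
    return float("inf") if rate == 0 else window_hours / rate


print(should_page(long_ratio=0.02, short_ratio=0.02))    # sustained and still happening
print(should_page(long_ratio=0.02, short_ratio=0.0005))  # already recovered
print(round(hours_to_exhaustion(0.0144, 0.999), 1))
```

The short window is what stops the alert from firing long after the problem has recovered.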
DevSecOps integrates security practices throughout the CI/CD pipeline rather than as a gate at the end. "Security as code."
Developer workstation:
- pre-commit: secret detection (gitleaks), dependency audit
Pull Request:
- SAST: Semgrep, SonarQube — code-level vulnerabilities (SQL injection, XSS)
- SCA: Snyk, OWASP Dependency-Check — vulnerable libraries
- IaC scanning: Checkov, tfsec — cloud misconfigurations
- License compliance check
Build:
- Container image scan: Trivy, Grype — CVEs in OS packages + app dependencies
- Image signing: Cosign — prove image wasn't tampered with
Deploy:
- Admission webhook: verify image is signed before deploying
- Policy enforcement: Kyverno/OPA — no privileged containers, must have resource limits
Runtime:
- Falco: detect anomalous container behaviour
- DAST: OWASP ZAP scan against staging environment
- Secrets rotation: External Secrets Operator + Vault
# Layer 1: Pre-commit hooks (catches before commit):
pip install pre-commit
pre-commit install   # registers the git hook in this repository
# .pre-commit-config.yaml:
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks
# Layer 2: CI pipeline scan:
- uses: trufflesecurity/trufflehog@main
  with:
    path: ./
    base: ${{ github.event.pull_request.base.sha }}
# Layer 3: GitHub secret scanning (free on public, paid on private):
# Automatically enabled for many known secret patterns (AWS keys, etc.)
# Layer 4: Audit past commits:
gitleaks detect --source . --log-level warn
# If a secret is found in git history:
git filter-repo --invert-paths --path secrets.env
# AND immediately rotate the credential — it's compromised
An SBOM is a complete inventory of all software components, dependencies, and libraries in an application — like an ingredients list for software.
Why it matters:
# Generate SBOM with Syft in CI:
syft myapp:latest -o spdx-json > sbom.json
# Scan SBOM for vulnerabilities with Grype:
grype sbom:sbom.json
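# Attach SBOM to container image (cosign):
cosign attach sbom --sbom sbom.json myapp:latest
# Note: "cosign attach sbom" is deprecated in Cosign 2.x; a signed attestation is the preferred route:
cosign attest --predicate sbom.json --type spdxjson myapp:latest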
Zero-trust: "never trust, always verify." No implicit trust based on network location. Everything must authenticate and authorise every request.
Traditional model (perimeter security): firewall protects the network perimeter; everything inside the network is trusted. One breach = lateral movement across all internal systems.
Zero-trust implementation:
Implementation tools: service mesh (mTLS between services), SPIFFE/SPIRE (service identity), BeyondCorp/BeyondProd model, K8s NetworkPolicies, Vault for secrets.
Example admission policy: deny if image has critical CVE.
# Trivy in CI (fail on critical):
trivy image --exit-code 1 --severity CRITICAL myapp:latest
Vault is a secrets management platform: stores, encrypts, and controls access to secrets. Features: dynamic secrets, lease/renewal, audit log, multiple auth methods.
Dynamic secrets — Vault generates short-lived database credentials on demand. No static password to steal:
# App requests DB credentials from Vault:
curl -H "X-Vault-Token: s.abc123" \
http://vault:8200/v1/database/creds/my-role
# Returns: username=v-app-xyz123, password=A3f..., ttl=1h
# Vault revokes these credentials after 1 hour
K8s integration via Vault Agent Injector:
# Pod annotation: Vault agent injects secrets as files
annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "my-app"
  vault.hashicorp.com/agent-inject-secret-config: "secret/data/my-app"
# Vault agent sidecar reads secrets and writes to /vault/secrets/config
# App reads file — never handles the Vault token directly
Chaos engineering deliberately introduces controlled failures in production (or staging) to discover weaknesses before they cause unplanned outages. "Break things on purpose before they break on their own."
Principles (Netflix Chaos Engineering):
Tools:
Availability formula:
Availability = MTBF / (MTBF + MTTR)
# Example: MTBF = 720 hours (30 days), MTTR = 2 hours
Availability = 720 / (720 + 2) = 99.72%
# To reach 99.9%: either increase MTBF (fail less) or decrease MTTR (recover faster)
# Both matter — DORA research shows elite teams do both
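The formula can be inverted to answer the practical question: for a given failure frequency, how fast must recovery be to hit a target? A short sketch, with illustrative function names:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_for(target: float, mtbf_hours: float) -> float:
    """Largest MTTR (in hours) that still meets the target for a given MTBF."""
    return mtbf_hours * (1.0 - target) / target


print(f"{availability(720, 2):.2%}")                  # the example above
print(f"{max_mttr_for(0.999, 720) * 60:.1f} minutes")  # MTTR needed for 99.9%
```

With one failure every 30 days, reaching 99.9% means recovering in roughly 43 minutes rather than 2 hours.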
A runbook is step-by-step documentation for responding to a specific alert or incident. It enables anyone on the team (not just the expert) to handle the issue.
Runbook structure:
Use exact commands: "kubectl rollout restart deployment/api" not "restart the API."
Systematic approach:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-api
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: my-api-canary
      stableService: my-api-stable
      trafficRouting:
        istio:
          virtualService:
            name: my-api-vs
      steps:
        - setWeight: 5            # 5% to canary
        - pause: {duration: 5m}   # wait and watch
        - analysis:               # automated analysis
            templates:
              - templateName: error-rate
        - setWeight: 20           # 20% to canary
        - pause: {duration: 5m}
        - setWeight: 100          # 100%: rollout complete
# AnalysisTemplate: automated rollback if error rate > 1%
# Argo Rollouts queries Prometheus for the canary's error rate
Argo Rollouts integrates with Istio, NGINX, ALB for traffic splitting. Automated analysis via Prometheus/Datadog metrics automatically rolls back the canary if KPIs degrade.
Infrastructure HA:
- Multiple AZs: at least 2 (ideally 3) across your primary region
- Auto Scaling Group spans all AZs
- Application Load Balancer in all AZs (cross-zone load balancing)
- Health checks: ALB removes unhealthy instances automatically
Application HA:
- Minimum 2 replicas at all times (PodDisruptionBudget in K8s)
- No state stored on the instance (sessions in Redis, files in S3)
- Graceful shutdown: handle SIGTERM, complete in-flight requests
- Circuit breaker: fail fast when downstream is down (Resilience4j, Istio)
Database HA:
- Multi-AZ RDS / Aurora (synchronous replication)
- Connection pooling (RDS Proxy) to survive DB failover
- Read replicas for read-heavy workloads
DNS / Traffic:
- Route 53 health checks with failover routing
- CloudFront for static assets + API caching
- Health endpoints expose real dependency status (/health/ready)
Deployment HA:
- Rolling update with maxUnavailable: 0
- PodDisruptionBudget prevents all pods going down at once
- Liveness + readiness probes: traffic only to healthy instances
Load testing simulates concurrent user traffic to validate performance characteristics and find breaking points before production.
Types:
# k6 (modern, JS-based load testing):
import http from 'k6/http';
import { sleep } from 'k6';
export const options = {
stages: [
{ duration: '2m', target: 100 }, // ramp up to 100 users
{ duration: '5m', target: 100 }, // hold at 100 users
{ duration: '2m', target: 0 }, // ramp down
],
thresholds: {
http_req_duration: ['p(99)<500'], // 99% under 500ms
http_req_failed: ['rate<0.01'], // <1% errors
},
};
export default function() {
http.get('https://api.myapp.com/orders');
sleep(1);
}
Other tools: JMeter (Java, UI + scripted), Gatling (Scala DSL), Locust (Python).
A circuit breaker stops cascading failures. When a downstream service is unavailable, threads accumulate waiting for timeouts — exhausting the thread pool and taking down the upstream service too.
// Resilience4j circuit breaker (Spring Boot):
@CircuitBreaker(name = "inventoryService", fallbackMethod = "fallbackInventory")
public InventoryResponse checkInventory(String productId) {
return inventoryClient.check(productId);
}
private InventoryResponse fallbackInventory(String productId, Exception ex) {
// Return cached response, default value, or degrade gracefully
return new InventoryResponse(productId, 0, "UNAVAILABLE");
}
# Configuration:
resilience4j.circuitbreaker.instances.inventoryService:
  slidingWindowSize: 10
  failureRateThreshold: 50       # open after 50% failures in window
  waitDurationInOpenState: 30s   # wait 30s before trying again (half-open)
States: Closed (normal) → Open (failing fast, no calls to downstream) → Half-Open (after the wait, allow a limited number of trial calls) → back to Closed if they succeed, or back to Open if they fail.
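The state machine is compact enough to sketch in full. This simplified version opens after a number of consecutive failures rather than a failure rate over a sliding window, and allows a single trial call when half-open. It is an illustration of the pattern, not how Resilience4j is implemented; the class and parameter names are assumptions.

```python
import time


class CircuitBreaker:
    """Closed -> Open after repeated failures -> Half-Open after a wait -> Closed or Open."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests do not need to sleep
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, fallback):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # fail fast: downstream is not called at all
            self.state = "HALF_OPEN"  # wait elapsed: let one trial call through
        try:
            result = func()
        except Exception:
            self._record_failure()
            return fallback()
        self.state = "CLOSED"  # success closes the circuit and resets the count
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

A call site would look like `breaker.call(lambda: inventory_client.check(product_id), lambda: cached_response)`, where both names are placeholders.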
The bulkhead pattern isolates resources (a dedicated thread pool, or a cap on concurrent calls) per downstream dependency so a slow or failed service cannot exhaust the shared thread pool and take down the entire application.
# Without bulkhead:
# One thread pool shared by all services:
# Inventory service slow → 100 threads waiting → Payment calls also blocked
# With bulkhead (Resilience4j semaphore bulkhead):
# A separate cap on concurrent calls per service:
resilience4j.bulkhead.instances:
  inventoryService:
    maxConcurrentCalls: 10   # at most 10 concurrent inventory calls
    maxWaitDuration: 0ms     # fail immediately if all 10 permits are taken
  paymentService:
    maxConcurrentCalls: 20   # payment gets its own 20 permits
# For a dedicated thread pool per service, use resilience4j.thread-pool-bulkhead instead
Named after ship bulkheads — watertight compartments so flooding one compartment doesn't sink the ship. Combined with circuit breakers: circuit breaker prevents calling a failing service; bulkhead limits damage when the service is slow (not yet failing).
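A semaphore-based bulkhead is only a few lines: each dependency gets its own permit pool, and a call that cannot get a permit is rejected immediately instead of queueing. This is a sketch of the idea with hypothetical names, not a port of any library.

```python
import threading


class Bulkhead:
    """Cap the number of concurrent calls to one downstream dependency."""

    def __init__(self, max_concurrent_calls: int):
        self._permits = threading.BoundedSemaphore(max_concurrent_calls)

    def call(self, func, fallback):
        # Non-blocking acquire: reject at once when every permit is taken
        if not self._permits.acquire(blocking=False):
            return fallback()
        try:
            return func()
        finally:
            self._permits.release()


# One bulkhead per dependency, so a slow inventory service cannot starve payments
inventory_bulkhead = Bulkhead(max_concurrent_calls=10)
payment_bulkhead = Bulkhead(max_concurrent_calls=20)

print(inventory_bulkhead.call(lambda: "in stock", lambda: "UNAVAILABLE"))
```

Rejecting immediately corresponds to a zero wait duration: the caller gets the fallback rather than tying up another thread.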
Capacity planning ensures you have enough infrastructure to handle current and future load before it becomes a problem.
Process:
Modern cloud + auto-scaling reduces reactive capacity planning burden but doesn't eliminate it — you still need to set correct ASG minimums, reserved instances for baseline, and understand burst patterns.
OpenTelemetry (OTel) is a CNCF open standard for collecting and exporting telemetry data (metrics, logs, traces) from applications. It replaces vendor-specific SDKs with a single, vendor-neutral instrumentation layer.
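# Java: OTel auto-instrumentation (no code changes)
# Port 4317 is OTLP over gRPC; recent agent versions default to http/protobuf (port 4318),
# so the protocol is set explicitly here
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=payment-service \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -Dotel.exporter.otlp.protocol=grpc \
  -jar app.jar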
# OTel Collector receives data, can export to multiple backends:
# → Prometheus (metrics)
# → Jaeger / Grafana Tempo (traces)
# → Loki (logs)
# → Datadog / Dynatrace / New Relic (commercial)
Why it matters: switch observability backends without changing application code. Instrument once, export anywhere. Avoid vendor lock-in. Adopted by AWS, Google, Microsoft, Datadog — the de facto standard for observability in 2026.
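# 1. Order Dockerfile layers by change frequency (most stable first).
# Dockerfile comments must sit on their own line, not after an instruction.
# Base image rarely changes (a Maven + JDK image is needed to run mvn):
FROM maven:3.9-eclipse-temurin-21
# pom.xml changes rarely, and resolving dependencies is expensive: cache this layer
COPY pom.xml .
RUN mvn dependency:go-offline
# Source changes often; only these layers rebuild when src changes
COPY src ./src
RUN mvn package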
# 2. BuildKit cache mounts (persist Maven/npm cache between builds):
# syntax=docker/dockerfile:1
RUN --mount=type=cache,target=/root/.m2 mvn package
# 3. CI-level caching (GitHub Actions):
- uses: actions/cache@v4
  with:
    path: ~/.m2/repository
    key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
# 4. Parallelise CI jobs that don't depend on each other
# 5. Use a remote cache registry (BuildKit):
docker buildx build \
--cache-from type=registry,ref=myregistry/myapp:cache \
--cache-to type=registry,ref=myregistry/myapp:cache,mode=max \
-t myapp:latest .
Repository strategy: Monorepo with path-based CI triggers
Tool stack: GitHub Actions + Docker + ECR + Argo CD + EKS
Per-service CI (triggered only when service dir changes):
1. Lint + unit tests
2. Build Docker image (multi-stage, cached layers)
3. SAST (Semgrep) + SCA (Snyk) + image scan (Trivy)
4. Push image to ECR: :sha + :latest-dev
5. Update image tag in k8s-manifests repo (separate GitOps repo)
CD via Argo CD:
k8s-manifests repo/
apps/
service-A/overlays/dev → Argo CD auto-syncs to dev cluster
service-A/overlays/staging → Argo CD auto-syncs on PR merge
service-A/overlays/prod → Argo CD syncs only on release tag
Promotion workflow:
Dev: every PR merge → auto-deploy to dev (Argo CD)
Staging: every merge to main → deploy all services to staging
→ automated integration tests (k6 load + Playwright E2E)
Prod: create release tag → Argo Rollouts canary (5% → 20% → 100%)
→ automated analysis (Prometheus error rate gate)
→ auto-rollback if error rate > SLO threshold
Contract testing (Pact): each service PR runs consumer-driven contract tests
→ prevents interface breaking changes before deployment