DevOps Interview Questions 2026

Top 55 Questions & Answers — CI/CD, IaC, Monitoring, Docker, Kubernetes, DevSecOps & SRE

This guide covers the most frequently asked DevOps interview questions in 2026 — from cultural principles and CI/CD pipelines to production reliability, security, and platform engineering. Applicable to DevOps Engineer, SRE, and Platform Engineer roles.

Easy = Concepts, tools, basic CI/CD  |  Medium = Production patterns, configuration  |  Hard = Architecture, SRE, advanced design
DevOps Culture & Principles
1
What is DevOps and what problem does it solve? [Easy]

DevOps is a set of practices and cultural philosophies that combine software development (Dev) and IT operations (Ops) to shorten the development cycle and deliver high-quality software continuously.

Problem it solves — the "wall of confusion":

  • Dev teams optimise for feature velocity; Ops teams optimise for stability — conflicting incentives
  • Long release cycles (months) mean big-bang deployments with high failure rates
  • Manual operations create toil, inconsistency, and slow incident response
  • Blame culture between teams slows improvement

DevOps principles (CALMS):

  • Culture — shared ownership, blameless postmortems
  • Automation — CI/CD, IaC, automated testing
  • Lean — small batch sizes, eliminate waste, continuous flow
  • Measurement — metrics on everything, data-driven decisions
  • Sharing — knowledge, tooling, on-call responsibilities
2
What are the four key DORA metrics? [Easy]

DORA (DevOps Research and Assessment) metrics measure software delivery and operational performance:

  • Deployment Frequency — how often code is deployed to production. Elite: multiple times per day. High: weekly.
  • Lead Time for Changes — time from code committed to running in production. Elite: <1 hour. High: <1 day.
  • Change Failure Rate — % of deployments that cause production failures requiring hotfix/rollback. Elite: 0–5%. High: 5–10%.
  • Mean Time to Restore (MTTR) — how long to recover from a production failure. Elite: <1 hour. High: <1 day.
Improving deployment frequency typically reduces change failure rate (smaller, more frequent changes are safer than large infrequent releases). This is the counterintuitive insight from 9 years of DORA research: speed and stability reinforce each other rather than trade off.
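
To make the four definitions concrete, here is a minimal Python sketch that derives them from a list of deployment records. The `Deployment` record and its field names are assumptions made for illustration, not a DORA-defined schema; real numbers would come from your CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics for the deployments made in a period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
        ),
    }


# Illustrative data: four deployments in one week, one of which failed.
t0 = datetime(2026, 1, 5, 9, 0)
week = [
    Deployment(t0, t0 + timedelta(minutes=30)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=2),
               failed=True, restored_at=t0 + timedelta(days=1, hours=2, minutes=45)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=1)),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=1)),
]
metrics = dora_metrics(week, period_days=7)
```

For this sample week the median lead time is 1 hour, the change failure rate is 25%, and the time to restore is 45 minutes.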
3
What is the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment? [Easy]
  • Continuous Integration (CI) — developers merge code frequently (multiple times per day) to a shared branch. Every merge triggers automated build + test. Finds integration bugs early.
  • Continuous Delivery (CD) — every change that passes CI is automatically prepared and ready to deploy to production. The actual deployment is a manual decision (someone clicks "deploy"). Enables on-demand releases.
  • Continuous Deployment — goes one step further: every change that passes all automated tests is automatically deployed to production with no human gate. Used by companies like Netflix, Amazon, GitHub.
CI:          code → build → unit tests → integration tests
CD (Delivery): → deploy to staging → acceptance tests → [HUMAN GATE] → deploy to prod
CD (Deployment): → deploy to staging → acceptance tests → automatically deploy to prod

Most organisations practice CI + Continuous Delivery. Continuous Deployment requires very mature automated testing and feature flag infrastructure.

4
What is "shift-left" in DevOps? [Easy]

Shift-left means moving testing, security, and quality checks earlier in the development lifecycle (to the "left" on a timeline). The earlier a bug is found, the cheaper it is to fix.

Traditional (checks happen late):
Developer → Code Review → QA → Security Review → Staging → Prod
                                    ↑ bugs found here, expensive to fix

Shift-left:
Developer → Pre-commit hooks → CI tests → SAST scan → Code Review → Staging → Prod
    ↑ bugs caught here, cheap to fix

Shift-left techniques:

  • Pre-commit hooks (linting, formatting, secret detection)
  • Unit tests written by developers alongside code (TDD)
  • SAST (Static Application Security Testing) in CI
  • Container image vulnerability scanning before deployment
  • Policy-as-code validation (Terraform plan checks, K8s admission webhooks)
5
What is a blameless postmortem and why is it important? [Medium]

A blameless postmortem is a structured review of a production incident that focuses on systems and processes, not on blaming individuals. Popularised by Etsy and Google SRE, it is now widely adopted in DevOps culture.

Why blameless: People make mistakes when under pressure, with incomplete information, with poorly designed systems. If engineers fear blame, they hide problems, don't speak up, and incidents repeat. Psychological safety produces better learning.

Postmortem structure:

  1. Timeline — chronological sequence of events (what happened, what was detected, what actions were taken)
  2. Root cause analysis — 5-Whys until you find systemic causes
  3. Contributing factors — not "who made a mistake" but "why was this mistake possible?"
  4. Impact — users affected, duration, revenue impact
  5. Action items — concrete fixes with owners and deadlines
The goal: make it harder to fail next time, not punish the person who happened to be the last one to touch the system.
6
What are feature flags and how do they enable trunk-based development? [Medium]

Feature flags (feature toggles) are conditional code paths that enable/disable features at runtime without deployment. They decouple deployment from feature release.

// Code:
if (featureFlags.isEnabled("new-checkout-flow", userId)) {
  return newCheckout(request);
} else {
  return legacyCheckout(request);
}

// Feature flag config (toggleable in UI, no deployment):
{"new-checkout-flow": {"enabled": true, "rollout": 5}}  // enabled for 5% of users

Enables trunk-based development: all developers work on one branch (main/trunk). Incomplete features are merged but hidden behind a flag. No long-lived feature branches → no merge hell. Continuous Integration works correctly.

Use cases: A/B testing, gradual rollout (canary to 1% → 10% → 100%), kill switch (turn off a broken feature without rollback), dark launch (code runs but output not shown to users).

Tools: LaunchDarkly, Unleash, Flagsmith, AWS AppConfig.
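
A percentage rollout needs each user to get a stable answer, and users enabled at 5% should stay enabled when the rollout grows to 50%. A minimal Python sketch of one common way to do this, hashing the flag name and user ID into a bucket, follows. The config shape mirrors the example above; `bucket` and `is_enabled` are hypothetical helpers, not the API of any of the listed tools.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_config: dict, flag: str, user_id: str) -> bool:
    """Return True if the flag is on for this user under the current rollout."""
    cfg = flag_config.get(flag)
    if cfg is None or not cfg.get("enabled", False):
        return False  # unknown or switched-off flags fall back to the legacy path
    # Users whose bucket is below the rollout percentage see the feature.
    return bucket(flag, user_id) < cfg.get("rollout", 100)


config = {"new-checkout-flow": {"enabled": True, "rollout": 5}}
decision = is_enabled(config, "new-checkout-flow", "user-42")
```

Because the bucket depends only on the flag name and user ID, raising `rollout` from 5 to 50 only adds users; nobody who already had the feature loses it. Setting `enabled` to false acts as the kill switch.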

7
What is toil in SRE and why is it important to eliminate? [Medium]

Toil is manual, repetitive, operational work that scales with traffic — not engineering work that improves the system. Google SRE policy: keep toil below 50% of each engineer's time; the rest should be engineering (automation, reducing future toil).

Examples of toil:

  • Manually restarting services that crash regularly
  • Running the same runbook steps for every deployment
  • Rotating credentials by hand
  • Manually responding to the same alert every Tuesday morning
  • Clearing disk space on the same server repeatedly

Why eliminate: toil burns out engineers, blocks feature work, grows linearly with scale (you need to hire more people just to keep up), and is often error-prone.

Elimination approach: if you do it twice, automate it. Write it as code, add it to CI, make it self-service.

8
What is the difference between DevOps and SRE? [Medium]
  • DevOps — a culture and set of practices: a broad philosophy of collaboration between Dev and Ops, CI/CD, and automation. It is not a job title at Google, though many companies use it as one.
  • SRE (Site Reliability Engineering) — Google's specific implementation of DevOps principles. SRE is a prescriptive approach: error budgets, SLOs, toil budget (50% max), on-call engineering, postmortems. Treats operations as a software engineering problem.

As Google's Ben Treynor Sloss puts it: "SRE is what happens when you ask a software engineer to design an operations team."

DevOps                        | SRE
Cultural philosophy           | Specific practices + job role
Focus: speed + collaboration  | Focus: reliability + error budgets
Broad, flexible               | Prescriptive (SLO/SLA/SLI)
CI/CD Pipelines
9
What should a CI pipeline contain? [Easy]
Trigger: push to PR or merge to main

Stage 1: Code Quality
  - Linting (ESLint, Checkstyle, flake8)
  - Formatting check (Prettier, Black)
  - Static analysis (SonarQube, PMD)

Stage 2: Build
  - Compile / package (Maven, Gradle, npm build)
  - Build Docker image

Stage 3: Test
  - Unit tests (JUnit, Jest, pytest)
  - Integration tests (against test DB, mocked services)
  - Code coverage check (fail if < 80%)

Stage 4: Security
  - SAST scan (Semgrep, Checkmarx)
  - Dependency vulnerability scan (Snyk, OWASP Dependency-Check)
  - Container image scan (Trivy, Grype)
  - Secret detection (truffleHog, detect-secrets)

Stage 5: Artifact
  - Push image to registry (ECR, GCR, Docker Hub)
  - Tag with git SHA
  - Publish test reports, coverage report

Total pipeline runtime goal: under 10 minutes. Parallelise independent stages.

10
What is GitHub Actions and how does it work? [Easy]

GitHub Actions is a CI/CD platform built into GitHub. Workflows are YAML files in .github/workflows/. Events trigger workflow runs on hosted or self-hosted runners.

name: CI
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - name: Set up JDK 21
      uses: actions/setup-java@v4
      with:
        java-version: '21'
        distribution: 'temurin'

    - name: Run tests
      run: mvn test

    - name: Build and push Docker image
      uses: docker/build-push-action@v5
      with:
        push: ${{ github.ref == 'refs/heads/main' }}
        tags: myapp:${{ github.sha }}

The Actions marketplace has 20,000+ reusable actions. Self-hosted runners suit private networks, GPU builds, larger machines, and cost reduction at scale. Use OIDC for keyless authentication to cloud providers (no long-lived cloud credentials stored in GitHub).

11
What is Jenkins and how does a Jenkinsfile work? [Easy]

Jenkins is an open-source automation server (Java-based) for building CI/CD pipelines. A Jenkinsfile defines the pipeline as code (stored in the repo, version-controlled).

// Declarative Jenkinsfile:
pipeline {
  agent { docker { image 'maven:3.9-eclipse-temurin-21' } }

  environment {
    DOCKER_CREDS = credentials('docker-hub-creds')
  }

  stages {
    stage('Build') {
      steps {
        sh 'mvn clean package -DskipTests'
      }
    }
    stage('Test') {
      steps {
        sh 'mvn test'
      }
      post {
        always {
          junit 'target/surefire-reports/*.xml'
        }
      }
    }
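    stage('Docker Push') {
      when { branch 'main' }
      steps {
        // Assumes the agent has the Docker CLI and daemon access (the stock Maven image does not)
        sh 'echo "$DOCKER_CREDS_PSW" | docker login -u "$DOCKER_CREDS_USR" --password-stdin'
        sh 'docker build -t myapp:${GIT_COMMIT} .'
        sh 'docker push myapp:${GIT_COMMIT}'
      }
    }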
  }
}

Jenkins strengths: mature (forked from Hudson in 2011), largest plugin ecosystem (1800+ plugins), self-hosted control, complex pipeline support. Weaknesses: high operational overhead, complex config.

12
What deployment strategies exist and when do you use each? [Medium]
  • Recreate — stop all v1, start all v2. Simple but has downtime. Only for dev/test.
  • Rolling update — gradually replace v1 Pods with v2 (K8s default). Zero downtime. Risk: v1 and v2 run simultaneously during update — must be backward compatible.
  • Blue/Green — run full v2 stack alongside v1. Switch load balancer to v2 instantly. Zero downtime, instant rollback (flip back to blue). Cost: double infrastructure during cutover. Best for: critical apps where you want instant rollback.
  • Canary — route a small % of traffic to v2 (e.g. 5%), monitor metrics (error rate, latency), gradually increase to 100%. Best for: validating performance impact on real users before full rollout. Use with feature flags or weighted routing (Argo Rollouts, Istio).
  • A/B Testing — like canary but route based on user attributes (geography, account tier, user ID) — for business experiments, not just risk mitigation.
  • Shadow deployment — duplicate production traffic to v2 (doesn't affect users). Validate v2 behaviour under real load without user impact.
13
What is ArgoCD and how does it implement GitOps? [Medium]

Argo CD is a declarative GitOps controller for Kubernetes. It continuously syncs the cluster state with the desired state defined in Git.

# Application definition:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-manifests
    targetRevision: main
    path: apps/my-app/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual changes in cluster
    syncOptions:
    - CreateNamespace=true

GitOps workflow:

  1. CI pipeline builds image, updates image tag in Git repo
  2. Argo CD detects the Git change (polling or webhook)
  3. Argo CD applies the new manifests to the cluster
  4. App health status shown in Argo CD UI
14
How do you manage secrets in a CI/CD pipeline? [Medium]

Never store secrets in code or CI environment variables in plaintext.

Approaches by security level:

  • CI platform secrets (GitHub Actions Secrets, GitLab CI Variables) — encrypted at rest, masked in logs. Adequate for small teams. Still requires managing rotation.
  • Vault (HashiCorp) — centralised secrets management. CI retrieves secrets at pipeline start via short-lived tokens. Audit log, auto-rotation, fine-grained policies.
  • OIDC / Keyless auth — best practice for cloud. GitHub Actions has built-in OIDC; exchange for short-lived AWS/GCP credentials without storing any secret in GitHub. No static credentials.
# GitHub Actions OIDC → AWS (no stored AWS keys).
# The job must grant the token permission:  permissions: { id-token: write, contents: read }
- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/github-actions
    aws-region: us-east-1
    # GitHub exchanges OIDC token for temporary AWS credentials
    # No AWS_ACCESS_KEY_ID stored in GitHub
15
What is a monorepo and what are its trade-offs? [Medium]
  • Monorepo — all services/apps in a single repository. Used by Google, Facebook, Twitter, Uber.
  • Polyrepo — one repository per service.
Monorepo Pros                                   | Monorepo Cons
Atomic cross-service changes (one PR)           | CI must be smart (only build what changed)
Shared libraries, consistent tooling            | Larger clone, slower full builds
Easier code reuse and refactoring               | Access control harder (everyone sees all code)
Single source of truth for dependency versions  | Noisy blame/history for large teams

Monorepo tools for affected-only CI: Nx (Node), Bazel (polyglot), Turborepo (JS), Pants (Python). On GitHub: path filters in Actions workflows to only trigger jobs for changed paths.
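
The idea behind "affected-only" CI can be sketched in a few lines of Python: map each changed path to the service that owns it, and fan out to every consumer when a shared library changes. The `services/<name>/` and `libs/<name>/` layout and the `LIB_CONSUMERS` map are assumptions for illustration; tools such as Nx or Bazel derive this dependency graph automatically.

```python
from pathlib import PurePosixPath
from typing import List, Set

# Hypothetical dependency map: which services consume each shared library.
LIB_CONSUMERS = {
    "auth-client": {"orders", "payments"},
    "logging": {"orders", "payments", "search"},
}


def affected_services(changed_files: List[str]) -> Set[str]:
    """Return the services that need a rebuild for a set of changed paths."""
    affected: Set[str] = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        if len(parts) < 3:
            continue  # top-level files (README, etc.) trigger nothing in this sketch
        if parts[0] == "services":
            affected.add(parts[1])
        elif parts[0] == "libs":
            # A shared library change rebuilds every service that depends on it.
            affected |= LIB_CONSUMERS.get(parts[1], set())
    return affected
```

In practice the changed-file list would come from something like `git diff --name-only` against the merge base, and the result would decide which build jobs to schedule.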

16
How do you implement automated database schema migrations in a CI/CD pipeline? [Hard]
# Tools: Flyway, Liquibase, Alembic (Python), golang-migrate

# Migration file naming (Flyway):
V1__Create_users_table.sql
V2__Add_email_index.sql
V3__Add_last_login_column.sql

# CI pipeline stage before deployment:
- name: Run DB Migrations
  run: |
    flyway -url=$DB_URL -user=$DB_USER -password=$DB_PASS migrate

# Kubernetes: run migration as an init container before app starts:
initContainers:
- name: db-migrate
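  image: my-app:<IMAGE_TAG>   # placeholder filled in by Helm/Kustomize/CI templating (${{ }} is GitHub Actions syntax, not valid in a manifest)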
  command: ["./migrate.sh"]
  env:
  - name: DB_URL
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: url

Rules for zero-downtime migrations:

  • Never drop a column that old code still reads (stop reading it first, deploy that change, then drop the column in a later release)
  • Never rename a column in one step (add new, dual-write, migrate, remove old)
  • Always make migrations reversible (down migrations)
  • Test migrations against a copy of production data in staging
17
What is semantic versioning and how do you automate it? [Medium]

Semantic versioning (SemVer): MAJOR.MINOR.PATCH

  • MAJOR — breaking change (existing APIs removed or incompatible)
  • MINOR — new features, backward compatible
  • PATCH — bug fixes, no new features

Automated versioning via Conventional Commits:

# Commit message format determines version bump:
feat: add OAuth2 login support         → MINOR bump (1.2.0 → 1.3.0)
fix: resolve null pointer in payment   → PATCH bump (1.3.0 → 1.3.1)
feat!: redesign API response format    → MAJOR bump (1.3.1 → 2.0.0)
# (the ! signals a breaking change)

# Tools:
# semantic-release (Node.js) — reads commits, bumps version, creates tag, publishes changelog
# release-please (Google) — GitHub Action that creates release PRs automatically

For Docker images: tag with git SHA for every build (immutable, traceable), plus semantic version tags (1.3.0, 1.3, 1, latest) on releases.
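
The commit-to-version mapping above is simple enough to sketch directly. The Python below is a simplified illustration of the rule, not how semantic-release or release-please are implemented; `bump_level` and `next_version` are hypothetical names, and only `feat`, `fix`, the `!` marker, and a `BREAKING CHANGE:` footer are handled.

```python
import re
from typing import List, Optional

# Matches a Conventional Commits header such as "feat(api)!: drop v1 endpoints".
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<breaking>!)?: .+")


def bump_level(message: str) -> Optional[str]:
    """Return 'major', 'minor', 'patch', or None for a single commit message."""
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        return None
    if match["breaking"] or any(line.startswith("BREAKING CHANGE:") for line in lines[1:]):
        return "major"
    return {"feat": "minor", "fix": "patch"}.get(match["type"])


def next_version(current: str, messages: List[str]) -> str:
    """Apply the highest bump found in the commits since the last release."""
    major, minor, patch = (int(part) for part in current.split("."))
    levels = {bump_level(message) for message in messages}
    if "major" in levels:
        return f"{major + 1}.0.0"
    if "minor" in levels:
        return f"{major}.{minor + 1}.0"
    if "patch" in levels:
        return f"{major}.{minor}.{patch + 1}"
    return current  # docs:, chore:, etc. do not trigger a release
```

When several commits land between releases, only the highest bump applies: a `fix:` plus a `feat:` on 1.0.0 yields 1.1.0, not 1.1.1.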

18
What is a multi-stage Docker build and why should you use it? [Medium]
# Without multi-stage: all build tools end up in the final image (large + insecure)
# With multi-stage: final image only contains runtime artifacts

# Java Spring Boot example:
# Stage 1: Build (JDK + Maven — only in build stage)
# (the maven image bundles Maven; the plain eclipse-temurin JDK image has no mvn)
FROM maven:3.9-eclipse-temurin-21 AS builder
WORKDIR /app
COPY pom.xml .
RUN mvn dependency:go-offline    # cache dependencies
COPY src ./src
RUN mvn clean package -DskipTests

# Stage 2: Extract layers (Spring Boot's layertools)
FROM builder AS extractor
RUN java -Djarmode=layertools -jar target/*.jar extract

# Stage 3: Runtime (JRE only — no Maven, no source code, no build caches)
FROM eclipse-temurin:21-jre-jammy
WORKDIR /app
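COPY --from=extractor /app/dependencies/ ./
COPY --from=extractor /app/spring-boot-loader/ ./
COPY --from=extractor /app/snapshot-dependencies/ ./
COPY --from=extractor /app/application/ ./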
ENTRYPOINT ["java", "org.springframework.boot.loader.launch.JarLauncher"]

Results: image reduced from ~800MB (JDK+Maven) to ~200MB (JRE only). Smaller attack surface, faster pulls, cheaper storage.

Infrastructure as Code
19
What is Infrastructure as Code and why is it essential? [Easy]

IaC defines infrastructure (servers, networks, databases, permissions) in machine-readable configuration files that can be version-controlled, reviewed, and automated.

Why it's essential:

  • Reproducibility — same code creates identical environments every time. Eliminates "works in staging, broken in prod" due to config drift.
  • Version control — every infrastructure change is a Git commit. Who changed what, when, why. Rollback = git revert.
  • Code review — infrastructure changes go through PRs like code. Catch security misconfigurations before they reach prod.
  • Disaster recovery — recreate entire infrastructure from code in minutes after a catastrophic failure.
  • Cost visibility — can estimate cost of a terraform plan before applying (Infracost).
  • Documentation — the code IS the documentation of what infrastructure exists.
20
How does Terraform work? Explain the core workflow. [Easy]
# 1. Write HCL configuration:
resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "t3.medium"
  tags = { Name = "web-server", Env = "prod" }
}

# 2. terraform init — download provider plugins
terraform init

# 3. terraform plan — compare desired state vs actual (no changes applied)
terraform plan
# Shows: +create, ~update, -destroy

# 4. terraform apply — apply the plan to real infrastructure
terraform apply

# 5. terraform destroy — tear down all resources
terraform destroy

State file (terraform.tfstate) — records the mapping between Terraform resources and real cloud resources. Critical: if lost, Terraform loses track of what it manages. Store in S3 + DynamoDB locking for teams.

terraform {
  backend "s3" {
    bucket         = "my-tfstate"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
21
What are Terraform modules and why do you use them? [Medium]

A Terraform module is a reusable, parameterised set of resources packaged together. Like a function in programming — write once, call many times with different inputs.

# modules/vpc/main.tf — reusable VPC module
variable "cidr_block" {}
variable "env" {}
resource "aws_vpc" "this" {
  cidr_block = var.cidr_block
  tags       = { Env = var.env }
}
output "vpc_id" { value = aws_vpc.this.id }

# root module — call with different params per environment:
module "dev_vpc" {
  source     = "./modules/vpc"
  cidr_block = "10.0.0.0/16"
  env        = "dev"
}
module "prod_vpc" {
  source     = "./modules/vpc"
  cidr_block = "10.1.0.0/16"
  env        = "prod"
}

The Terraform Registry has community modules for all major AWS/GCP/Azure patterns (VPC, EKS, RDS). Use them as starting points. Always constrain the module version (for example version = "~> 5.0", which accepts 5.x but not 6.0) so module updates don't break your infra unexpectedly.

22
What is Ansible and how does it differ from Terraform? [Medium]
  • Terraform — declarative IaC for provisioning cloud infrastructure (VMs, networks, load balancers). Manages state. Cloud-native resources.
  • Ansible — procedural configuration management. Installs software on servers, configures OS settings, deploys applications. Agentless (SSH). Playbooks are YAML sequences of tasks.
# Ansible Playbook: install and start Nginx
- hosts: web_servers
  become: yes
  tasks:
  - name: Install nginx
    apt:
      name: nginx
      state: present
      update_cache: yes

  - name: Copy config
    template:
      src: nginx.conf.j2
      dest: /etc/nginx/nginx.conf

  - name: Start nginx
    service:
      name: nginx
      state: started
      enabled: yes

Common pattern: Terraform provisions infrastructure (creates EC2, VPC, RDS), Ansible configures the servers (installs software, deploys app). They complement each other. On modern container/K8s deployments, Ansible is often replaced by Docker + Helm.

23
What is configuration drift and how do you prevent it? [Medium]

Configuration drift occurs when a server's actual configuration diverges from the desired state defined in code — due to manual changes, failed automation, different deployments over time.

Prevention strategies:

  • Immutable infrastructure — never modify running servers. Build a new image, deploy it, terminate the old one. No SSH-and-fix, no config changes in place. Golden AMIs, Docker containers.
  • Continuous reconciliation — tools like Terraform Cloud, Flux, Argo CD continuously compare desired state with actual. Alert or auto-fix on drift.
  • AWS Config + drift detection — CloudFormation drift detection finds resources that deviate from the template.
  • Ansible idempotent runs — run Ansible periodically to enforce desired state (not just on deploy).
The most reliable prevention: immutable infrastructure. If servers can't be SSHed into and modified, they can't drift.
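
At its core, the continuous reconciliation described above is a diff between desired and actual state. A minimal Python sketch of that comparison follows; the flat dictionaries and the `detect_drift` helper are illustrative assumptions, whereas real tools compare full resource graphs fetched from the provider's API.

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return {setting: (desired_value, actual_value)} for every setting that differs.

    A value of None means the setting is absent on that side.
    """
    drift = {}
    for key in desired.keys() | actual.keys():
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Desired state as declared in code vs. what is actually running.
desired = {"instance_type": "t3.medium", "port": 443, "tls": "1.3"}
actual = {"instance_type": "t3.large", "port": 443, "tls": "1.3", "debug": True}
report = detect_drift(desired, actual)
```

Here the report flags the resized instance and the `debug` setting that someone added by hand. A reconciler would then either alert or re-apply the desired values.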
24
What is the difference between mutable and immutable infrastructure? [Medium]
  • Mutable — update servers in place. SSH in, run apt upgrade, deploy new code. Snowflake servers — each one accumulates unique changes over time. Hard to reproduce, drifts, "works on my server."
  • Immutable — never modify running servers. New version = new image (AMI, Docker image). Deploy new, remove old. Every instance is identical to the image. No drift possible.
Mutable:   Server-v1 → SSH → apt install nginx → Server-v1-modified (drift)
Immutable: Server-v1 → build new image → Server-v2 → terminate Server-v1

Immutable infrastructure benefits: predictable deploys (tested image is exactly what runs in prod), fast rollback (just deploy previous image), no snowflake servers, horizontal scaling (identical instances).

Containers make this natural — every deploy is a new image. AMI-based infrastructure with Packer achieves the same for EC2.

25
What is Policy-as-Code and which tools implement it? [Medium]

Policy-as-Code expresses compliance rules (security, cost, governance) as code that can be automatically evaluated in CI/CD pipelines.

  • OPA (Open Policy Agent) — general-purpose policy engine with Rego language. Used for: K8s admission (Gatekeeper), Terraform plan validation, API authorisation.
  • Checkov — scans Terraform/CloudFormation/Kubernetes for misconfigurations (open S3 buckets, unencrypted resources, public SSH access). 750+ built-in rules.
  • tfsec / Terrascan — Terraform security scanners.
  • Kyverno — Kubernetes-native policy engine (YAML policies, no Rego).
  • Sentinel — HashiCorp's policy framework for Terraform Cloud.
# Checkov in CI (fail the pipeline on any failed check):
- name: Run Checkov
  uses: bridgecrewio/checkov-action@master   # consider pinning a release tag or commit SHA instead of master
  with:
    directory: terraform/
    framework: terraform
    soft_fail: false   # any failed check fails the job
26
What is Packer and how does it fit in a DevOps pipeline? [Medium]

Packer (HashiCorp) builds identical machine images (AMI, GCP image, Docker image) from a template file. The image is pre-baked with all dependencies — no configuration happens at boot time.

// Packer template: build an AMI with Java 21 + app pre-installed
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami_filter": {
      "filters": {"name": "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"},
      "owners": ["099720109477"],
      "most_recent": true
    },
    "instance_type": "t3.micro",
    "ssh_username": "ubuntu",
    "ami_name": "my-app-{{timestamp}}"
  }],
  "provisioners": [
    {"type": "shell", "script": "scripts/install-java.sh"},
    {"type": "ansible", "playbook_file": "deploy-app.yml"}
  ]
}

Pipeline: Packer builds AMI in CI → AMI ID stored in artifact → Terraform references the AMI ID when creating EC2 instances → all instances are identical and pre-baked, so they boot quickly with no install step.

Monitoring, Logging & Observability
27
What are the three pillars of observability? [Easy]
  • Metrics — numerical measurements over time (CPU usage, request rate, error rate, latency percentiles). Cheap to store, aggregate, and alert on. Tools: Prometheus, CloudWatch, Datadog.
  • Logs — timestamped event records. Detailed context for debugging. Expensive at scale. Tools: ELK stack (Elasticsearch, Logstash, Kibana), Loki, CloudWatch Logs, Splunk.
  • Traces — distributed traces track a request across multiple services. Show the full path, latency breakdown per service/operation. Essential for microservices debugging. Tools: Jaeger, Zipkin, AWS X-Ray, Tempo.

Correlation: when an alert fires (metric), navigate to the logs for that time period, then find the trace ID in the logs, then follow the trace to see which service caused the issue. Modern tools (Grafana) link all three.

OpenTelemetry (OTel) is the open standard for instrumentation — collect metrics, logs, traces with one SDK, export to any backend.
28
What are SLI, SLO, and SLA? [Medium]
  • SLI (Service Level Indicator) — a quantitative measurement of a service aspect. "What are we measuring?" Examples: request success rate, latency at p99, availability percentage.
  • SLO (Service Level Objective) — the target value for an SLI. "What's our goal?" Example: 99.9% of requests succeed; p99 latency < 200ms; 99.95% uptime.
  • SLA (Service Level Agreement) — a business contract with consequences if SLOs are violated. "What happens if we miss?" Examples: service credits, refunds. SLA is typically looser than internal SLO.

Error budget — the allowable amount of failure before the SLO is breached. 99.9% SLO = 0.1% error budget = 43.2 minutes of downtime allowed per 30-day month. Track error budget consumption. If burning fast → freeze feature work, fix reliability.

Availability SLO = 99.9%
Monthly error budget = 0.1% × 43,200 min = 43.2 min
Used this month = 28 min (65% of budget)
Remaining = 15.2 min
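
The same arithmetic as a small Python sketch, assuming a 30-day window; `error_budget_report` is a hypothetical helper name.

```python
def error_budget_report(slo: float, downtime_minutes: float, window_days: int = 30) -> dict:
    """Summarise error budget usage for an availability SLO over a rolling window."""
    window_minutes = window_days * 24 * 60
    budget = (1 - slo) * window_minutes          # total downtime the SLO allows
    return {
        "budget_minutes": budget,
        "consumed_fraction": downtime_minutes / budget,
        "remaining_minutes": budget - downtime_minutes,
    }


report = error_budget_report(slo=0.999, downtime_minutes=28)
```

For a 99.9% SLO with 28 minutes of downtime this gives a budget of about 43.2 minutes, roughly 65% consumed, and about 15.2 minutes remaining. A negative `remaining_minutes` means the SLO has been breached for the window.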
29
What is Prometheus and how does it work? [Medium]

Prometheus is a pull-based time-series metrics database. It scrapes metrics from targets via HTTP at a configured interval (default 1m; 15s is a common setting).

# prometheus.yml
scrape_configs:
  - job_name: 'spring-boot-apps'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      target_label: __metrics_path__

# Application exposes /actuator/prometheus with:
http_server_requests_seconds_count{method="GET",status="200",uri="/api/orders"} 4521
http_server_requests_seconds_sum{...} 45.2
jvm_memory_used_bytes{area="heap"} 157286400

Query language: PromQL

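# Error ratio over last 5m (sum both sides so the label sets match):
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m]))

# p99 latency across all series (aggregate buckets by le first):
histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket[5m])))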
30
What is the ELK/EFK stack and how does centralised logging work? [Medium]
# ELK: Elasticsearch + Logstash + Kibana
# EFK: Elasticsearch + Fluentd + Kibana (more Kubernetes-native)

# Architecture:
Application pods → stdout
  ↓ DaemonSet (Fluentd/Filebeat on every node)
  ↓ tails /var/log/containers/*.log
  → Elasticsearch (indexed, searchable)
  → Kibana (visualise, search, dashboard)

# Structured logging (JSON) — much more queryable than plaintext:
{
  "timestamp": "2026-06-23T10:15:30Z",
  "level": "ERROR",
  "service": "payment-service",
  "traceId": "4bf92f3577b34da6",
  "userId": "u-12345",
  "message": "Payment processing failed",
  "errorCode": "CARD_DECLINED"
}

Modern alternative: Grafana Loki — stores logs without indexing content (only labels are indexed, like Prometheus). Much cheaper at scale. Log lines accessed via log labels + time range. Integrates natively with Grafana.
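
Structured logs like the JSON example above are usually produced by a formatter inside the application. The Python sketch below uses the standard `logging` module to emit one JSON object per line on stdout, where the node-level collector picks it up. The field names follow the example; `JsonFormatter` and `build_logger` are illustrative names, not part of any library.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc)
                                 .strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("traceId", "userId", "errorCode"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout (or the given stream)."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream or sys.stdout)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("payment-service").error("Payment processing failed", extra={"traceId": "4bf92f3577b34da6", "errorCode": "CARD_DECLINED"})` then produces a line that can be filtered by `level`, `service`, or `traceId` without regex parsing.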

31
What is distributed tracing and why is it critical for microservices? [Medium]

Distributed tracing tracks a request through multiple services by propagating a trace context (trace ID, span ID) in headers. Each service creates a "span" — a unit of work with start/end time and metadata.

# User request creates a trace across 5 services:
Trace ID: abc123
  Span: API Gateway              0-60ms
    Span: Auth Service           1-8ms
    Span: Order Service          8-45ms
      Span: Inventory Service    10-20ms
      Span: Database Query       20-40ms (!!!)
    Span: Notification Service   45-60ms

# Flame graph shows: DB query accounts for a third of total latency (20 of 60 ms)

Without tracing: "the /order endpoint is slow" — impossible to know which of the 5 services is responsible. With tracing: immediately see it's the database query in the Order Service.

Instrument with OpenTelemetry SDK → export to Jaeger, Zipkin, AWS X-Ray, or Grafana Tempo.
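
To show what "propagating a trace context in headers" looks like, here is a minimal Python sketch of the W3C `traceparent` header format (`version-traceid-spanid-flags`). It is a simplified illustration: in practice the OpenTelemetry SDK handles this for you, and `TraceContext`, `parse_traceparent`, `start_span`, and `to_header` are hypothetical names.

```python
import re
import secrets
from typing import NamedTuple, Optional, Tuple


class TraceContext(NamedTuple):
    trace_id: str   # 32 hex chars, shared by every span in the trace
    span_id: str    # 16 hex chars, identifies the current span
    sampled: bool


TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = TRACEPARENT.match(header.strip())
    if match is None or set(match[1]) == {"0"} or set(match[2]) == {"0"}:
        return None  # malformed, or all-zero IDs, which are not valid
    return TraceContext(match[1], match[2], bool(int(match[3], 16) & 0x01))


def start_span(incoming: Optional[str]) -> Tuple[TraceContext, Optional[str]]:
    """Continue the caller's trace if a valid header arrived, else start a new one.

    Returns (context for this service's span, parent span id or None).
    """
    parent = parse_traceparent(incoming) if incoming else None
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    sampled = parent.sampled if parent else True
    return TraceContext(trace_id, secrets.token_hex(8), sampled), (
        parent.span_id if parent else None
    )


def to_header(ctx: TraceContext) -> str:
    """Serialise the context for the next outgoing request."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"
```

Each service calls `start_span` on the incoming header, records its own span with the returned parent ID, and sends `to_header(ctx)` on downstream calls. The trace ID stays constant across all hops, which is what lets the backend stitch the spans into one tree.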

32
What are the four golden signals for monitoring a service? [Medium]

Coined in Google's SRE Book — the four metrics that matter most for any serving system:

  • Latency — time to serve a request. Distinguish successful vs failed request latency (a fast error is not a success). Monitor p50, p95, p99 — averages hide tail latency issues.
  • Traffic — amount of demand on the system: requests/second, queries/second, transactions/minute. Gives context to other signals ("errors went up — but so did traffic").
  • Errors — rate of failed requests. Include implicit errors (200 responses with wrong content, slow responses exceeding SLO).
  • Saturation — how "full" the service is: CPU %, memory %, disk I/O, thread pool queue depth. High saturation predicts future errors and latency spikes before they happen.
Start with these four before adding more metrics. Alert on symptoms (latency, errors) not causes (CPU) — CPU at 90% is only a problem if it's causing high latency or errors.
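
The point that averages hide tail latency is easy to demonstrate. The Python sketch below uses the nearest-rank percentile method on made-up sample data: 97 fast requests and 3 very slow ones. `percentile` and `latency_summary` are illustrative helpers; a metrics backend would compute these from histograms instead.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms: List[float]) -> Dict[str, float]:
    """Mean plus the percentiles usually tracked for the latency signal."""
    return {
        "mean": sum(samples_ms) / len(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }


# Made-up data: 97 requests at 20 ms, 3 requests at 2000 ms.
summary = latency_summary([20.0] * 97 + [2000.0] * 3)
```

The mean comes out at about 79 ms and both p50 and p95 are 20 ms, so all three look healthy. Only p99, at 2000 ms, reveals that 3 in every 100 users waited two seconds.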
33
What is synthetic monitoring? [Easy]

Synthetic monitoring proactively simulates user interactions to detect issues before real users experience them. Scripted tests run at regular intervals from multiple geographic locations.

  • Uptime checks — HTTP GET to /health every 30s from 5 regions. Alert if unreachable.
  • Browser tests — headless Chrome runs user flows (sign in → add to cart → checkout). Detects JavaScript errors that don't show up in server logs.
  • API tests — POST to /api/payment with test data, verify response time < 500ms and response body is correct.

Tools: Datadog Synthetic, Pingdom, AWS CloudWatch Synthetics, k6, Playwright + cron.

Pair with real user monitoring (RUM) — JavaScript in the browser collects actual user experience (Core Web Vitals, page load time) from real traffic.

34
How do you design an alerting strategy that avoids alert fatigue? [Hard]

Alert fatigue occurs when too many low-quality alerts desensitise engineers, causing them to miss real incidents.

Principles:

  • Alert on symptoms, not causes — "error rate > 1%" (symptom) not "CPU > 90%" (cause). CPU at 90% is only a problem if users are impacted.
  • Every alert requires human action — if an alert fires and the response is "this is normal, ignore," delete the alert.
  • Different urgency levels: Page (wake someone at 3 AM — only for SLO breaches) vs Ticket (next business day) vs Info (logged, no notification).
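  • Error budget alerts — burn rate alerting: page only when the SLO error budget is being consumed much faster than the sustainable rate. For a 30-day SLO window, spending 2% of the budget in 1h is a burn rate of 14.4 (budget gone in about 2 days): page. Spending 5% in 6h is a burn rate of 6 (budget gone in 5 days): page. Slower burns become tickets. This avoids alerting on transient spikes.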
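# Multiwindow, multi-burn-rate alert (from Google SRE workbook), for a 99.9% SLO (error budget 0.001):
# Alert when the error budget burns at >14.4x the sustainable rate over both the last 1h AND the last 5min.
# sum() aggregates across series so the 5xx rate is divided by the total rate, not matched label by label:
(sum(rate(http_requests_total{status=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))) > (14.4 * 0.001)
and
(sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))) > (14.4 * 0.001)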
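The arithmetic behind burn-rate thresholds can be sketched as follows, assuming a 30-day SLO window. The function names are illustrative only.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)

def budget_consumed(rate, window_hours, period_hours=30 * 24):
    """Fraction of the period's error budget spent if `rate` holds for the window."""
    return rate * window_hours / period_hours

def hours_to_exhaustion(rate, period_hours=30 * 24):
    """Hours until the whole budget is gone if `rate` is sustained."""
    return period_hours / rate

# A 1.44% error ratio against a 99.9% SLO is a burn rate of about 14.4
rate = burn_rate(error_ratio=0.0144, slo=0.999)
print(round(rate, 1))                                   # burn rate
print(round(budget_consumed(rate, window_hours=1), 3))  # share of budget spent in 1h
print(round(hours_to_exhaustion(rate), 1))              # hours until budget is gone
```

A burn rate of 1 means the budget lasts exactly the SLO window; 14.4 means it would be gone in roughly two days, which is why that threshold is paired with a page.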
DevSecOps & Security
35
What is DevSecOps and how do you embed security in the pipeline?Medium

DevSecOps integrates security practices throughout the CI/CD pipeline rather than as a gate at the end. "Security as code."

Developer workstation:
  - pre-commit: secret detection (gitleaks), dependency audit

Pull Request:
  - SAST: Semgrep, SonarQube — code-level vulnerabilities (SQL injection, XSS)
  - SCA: Snyk, OWASP Dependency-Check — vulnerable libraries
  - IaC scanning: Checkov, tfsec — cloud misconfigurations
  - License compliance check

Build:
  - Container image scan: Trivy, Grype — CVEs in OS packages + app dependencies
  - Image signing: Cosign — prove image wasn't tampered with

Deploy:
  - Admission webhook: verify image is signed before deploying
  - Policy enforcement: Kyverno/OPA — no privileged containers, must have resource limits

Runtime:
  - Falco: detect anomalous container behaviour
  - DAST: OWASP ZAP scan against staging environment
  - Secrets rotation: External Secrets Operator + Vault
36
What is SAST vs DAST vs SCA?Easy
  • SAST (Static Application Security Testing) — analyses source code without running it. Finds: SQL injection, XSS, hardcoded secrets, insecure crypto, buffer overflow. Runs in CI on every PR. Fast. False positives common. Tools: Semgrep, SonarQube, Checkmarx, Veracode.
  • DAST (Dynamic Application Security Testing) — attacks a running application like a hacker would. Finds: injection vulnerabilities the app actually allows, misconfigured headers, authentication issues. Slower, runs against staging. Tools: OWASP ZAP, Burp Suite.
  • SCA (Software Composition Analysis) — scans dependencies and libraries for known CVEs. Finds: Log4Shell, Spring4Shell — vulnerabilities in third-party code you imported. Tools: Snyk, OWASP Dependency-Check, GitHub Dependabot.
Use all three: SAST for your code, SCA for your dependencies, DAST for your running application.
37
How do you prevent secrets from being committed to Git?Medium
# Layer 1: Pre-commit hooks (catches before commit):
pip install pre-commit
# .pre-commit-config.yaml:
repos:
- repo: https://github.com/gitleaks/gitleaks
  rev: v8.18.0
  hooks:
  - id: gitleaks
# Register the hook in this clone (without this step nothing runs on commit):
pre-commit install

# Layer 2: CI pipeline scan:
- uses: trufflesecurity/trufflehog@main
  with:
    path: ./
    base: ${{ github.event.pull_request.base.sha }}

# Layer 3: GitHub secret scanning (free on public, paid on private):
# Automatically enabled for many known secret patterns (AWS keys, etc.)

# Layer 4: Audit past commits:
gitleaks detect --source . --log-level warn

# If a secret is found in git history:
git filter-repo --invert-paths --path secrets.env
# AND immediately rotate the credential — it's compromised
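To make the idea of pattern-based detection concrete, here is a toy scanner for one well-known pattern (AWS access key IDs). It is a simplified illustration of the kind of rule such scanners apply, not how gitleaks or TruffleHog are implemented; `find_secrets` is a hypothetical helper, and the key shown is the placeholder value from AWS documentation.

```python
import re

# AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")

def find_secrets(text):
    """Return (line_number, match) pairs for lines that look like they hold an AWS key ID."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in AWS_KEY_ID.findall(line):
            hits.append((number, match))
    return hits

# Hypothetical staged file content
staged = "region = us-east-1\naws_access_key_id = AKIAIOSFODNN7EXAMPLE\n"
print(find_secrets(staged))
```

Real scanners combine hundreds of such rules with entropy checks and allowlists, which is why a maintained tool is preferable to a hand-rolled regex.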
38
What is a Software Bill of Materials (SBOM)?Medium

An SBOM is a complete inventory of all software components, dependencies, and libraries in an application — like an ingredients list for software.

Why it matters:

  • Vulnerability response — when Log4Shell was announced, companies with SBOMs knew within minutes which applications used Log4j and what version. Companies without SBOMs spent weeks searching.
  • Supply chain security — proves what's actually in a container image (can be signed alongside the image)
  • Compliance — US Executive Order 14028 (2021) requires SBOMs for federal software procurement
  • License compliance — identifies open-source licenses in use (GPL, MIT, etc.)
# Generate SBOM with Syft in CI:
syft myapp:latest -o spdx-json > sbom.json

# Scan SBOM for vulnerabilities with Grype:
grype sbom:sbom.json

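# Attach SBOM to container image (cosign):
# (cosign 2.x deprecates "attach sbom"; a signed attestation via "cosign attest" is the suggested replacement)
cosign attach sbom --sbom sbom.json myapp:latest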
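The "which applications use Log4j?" question becomes a simple lookup once an SBOM exists. The sketch below searches an SPDX JSON document for a component by name; the inline SBOM is a trimmed, hand-written stand-in for real Syft output, and `find_component` is a hypothetical helper.

```python
import json

def find_component(sbom_text, name_fragment):
    """Return (name, version) for every SPDX package whose name contains the fragment."""
    document = json.loads(sbom_text)
    return [
        (pkg["name"], pkg.get("versionInfo", "unknown"))
        for pkg in document.get("packages", [])
        if name_fragment.lower() in pkg["name"].lower()
    ]

# Trimmed, hand-written stand-in for the sbom.json that Syft would produce
sbom_text = json.dumps({
    "spdxVersion": "SPDX-2.3",
    "packages": [
        {"name": "log4j-core", "versionInfo": "2.14.1"},
        {"name": "jackson-databind", "versionInfo": "2.15.2"},
    ],
})
print(find_component(sbom_text, "log4j"))
```

Run across the SBOMs of every deployed image, a lookup like this is what turns a weeks-long search into a query.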
39
What is zero-trust security in a DevOps context?Hard

Zero-trust: "never trust, always verify." No implicit trust based on network location. Everything must authenticate and authorise every request.

Traditional model (perimeter security): firewall protects the network perimeter; everything inside the network is trusted. One breach = lateral movement across all internal systems.

Zero-trust implementation:

  • Identity-based access — service A doesn't trust service B just because they're in the same VPC. Every service call requires authentication (mTLS, JWT, service account).
  • Least privilege everywhere — humans and services only have the minimum permissions needed
  • Verify continuously — re-authenticate/re-authorise for every request, not just at login
  • Assume breach — design systems assuming an attacker is already inside; segment, log, detect lateral movement

Implementation tools: service mesh (mTLS between services), SPIFFE/SPIRE (service identity), BeyondCorp/BeyondProd model, K8s NetworkPolicies, Vault for secrets.

40
How do you handle container image vulnerabilities in production?Medium
  • Prevention in CI — scan every image before push. Fail build on critical/high CVEs (configurable threshold). Tools: Trivy, Grype, Snyk Container.
  • Continuous scanning in registry — ECR, Google Artifact Registry (the successor to GCR), Harbor scan stored images continuously. New CVE discovered → alert on images now vulnerable even without new build.
  • Admission control — reject deployment of images with unacceptable vulnerabilities (Kyverno policy: deny if image has critical CVE).
  • Rapid rebuild strategy — automate: new base image released → rebuild + test → deploy. Use Renovate/Dependabot to auto-raise PRs for dependency updates.
  • Minimal base images — distroless images have no shell, no package manager, no utilities. Dramatically fewer CVEs. Google distroless, Chainguard.
# Trivy in CI (fail on critical):
trivy image --exit-code 1 --severity CRITICAL myapp:latest
41
What is HashiCorp Vault and how do you integrate it with Kubernetes?Medium

Vault is a secrets management platform: stores, encrypts, and controls access to secrets. Features: dynamic secrets, lease/renewal, audit log, multiple auth methods.

Dynamic secrets — Vault generates short-lived database credentials on demand. No static password to steal:

# App requests DB credentials from Vault:
curl -H "X-Vault-Token: s.abc123" \
  http://vault:8200/v1/database/creds/my-role
# Returns: username=v-app-xyz123, password=A3f..., ttl=1h
# Vault revokes these credentials after 1 hour

K8s integration via Vault Agent Injector:

# Pod annotation — Vault agent injects secrets as files:
annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "my-app"
  vault.hashicorp.com/agent-inject-secret-config: "secret/data/my-app"
# Vault agent sidecar reads secrets and writes to /vault/secrets/config
# App reads file — never handles the Vault token directly
42
What is chaos engineering?Hard

Chaos engineering deliberately introduces controlled failures in production (or staging) to discover weaknesses before they cause unplanned outages. "Break things on purpose before they break on their own."

Principles (Netflix Chaos Engineering):

  1. Define "steady state" — what normal looks like (metrics, SLOs)
  2. Hypothesise: "if we kill a pod, the service will stay healthy because we have 3 replicas"
  3. Introduce failure in production (or a production-like environment)
  4. Observe: does steady state hold?
  5. Minimise blast radius — start small, have automated stop conditions

Tools:

  • Chaos Monkey (Netflix) — randomly terminates EC2 instances during business hours
  • LitmusChaos — Kubernetes-native, CNCF project. Pod delete, node drain, network delay, memory stress.
  • AWS Fault Injection Simulator (FIS) — managed chaos experiments on AWS resources
  • Gremlin — SaaS platform, fine-grained blast radius controls
SRE, Reliability & Troubleshooting
43
What is the difference between MTTR, MTBF, and MTTF?Easy
  • MTTR (Mean Time to Recovery/Restore) — average time to restore service after failure. Includes detection + diagnosis + fix. Key metric for on-call response. Lower is better: elite teams <1 hour.
  • MTBF (Mean Time Between Failures) — average time between one failure ending and the next beginning. Higher is better (system is reliable).
  • MTTF (Mean Time to Failure) — average time until a component fails (non-repairable systems). Used for hardware/disk lifetime.

Availability formula:

Availability = MTBF / (MTBF + MTTR)

# Example: MTBF = 720 hours (30 days), MTTR = 2 hours
Availability = 720 / (720 + 2) = 99.72%

# To reach 99.9%: either increase MTBF (fail less) or decrease MTTR (recover faster)
# Both matter — DORA research shows elite teams do both
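A small sketch of the availability formula, including the inverse question of how short MTTR must be to hit a target at a given MTBF. The helper names are illustrative.

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def max_mttr_for(target, mtbf_hours):
    """Largest MTTR (hours) that still meets the availability target at a fixed MTBF."""
    return mtbf_hours * (1 - target) / target

current = availability(mtbf_hours=720, mttr_hours=2)
needed_mttr_minutes = max_mttr_for(target=0.999, mtbf_hours=720) * 60

print(round(current * 100, 2))        # availability in percent
print(round(needed_mttr_minutes, 1))  # MTTR budget in minutes for 99.9%
```

With one failure every 30 days, reaching 99.9% means restoring service in roughly 43 minutes rather than 2 hours.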
44
What is an on-call runbook and what should it contain?Medium

A runbook is step-by-step documentation for responding to a specific alert or incident. It enables anyone on the team (not just the expert) to handle the issue.

Runbook structure:

  • Alert description — what does this alert mean? What is it measuring?
  • Severity and impact — how many users affected? Is this P1/P2?
  • Diagnostic steps — what dashboards to check, what queries to run, what logs to look at
  • Common causes — the top 3–5 root causes with their symptoms
  • Remediation steps — numbered, unambiguous steps. "Run kubectl rollout restart deployment/api" not "restart the API."
  • Escalation path — who to call if these steps don't work
  • Post-incident — link to postmortem template
Runbooks should be executable by someone new. Test them: have a new team member follow the runbook during a drill and note every step that was unclear.
45
How do you approach root cause analysis for a production incident?Hard

Systematic approach:

  1. Establish a timeline — correlate deployment times, config changes, traffic spikes with the first sign of the issue (often the root cause happened before the symptoms)
  2. Define the problem precisely — "users in us-east-1 get 504s on /checkout after 5pm" not "site is down"
  3. 5 Whys:
    • Why? → DB queries are slow
    • Why? → Index missing on orders table
    • Why? → Migration ran but index creation was in a separate migration that didn't run
    • Why? → Migration was split across two PRs with a dependency not enforced
    • Why? → No CI check validates migration ordering
  4. Differentiate contributing factors from root cause — the deployment triggered it, but the root cause is the missing migration check
  5. Action items that prevent recurrence — add a CI test that validates all migrations are present and ordered correctly
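The action item in step 5 could look something like the check below. It assumes a hypothetical naming scheme of zero-padded, sequentially numbered files (`0001_description.sql`); real migration tools have their own conventions, so treat this as a sketch of the idea rather than a drop-in CI step.

```python
import re

# Assumed naming scheme: four-digit sequence number, underscore, description, .sql
MIGRATION_NAME = re.compile(r"^(\d{4})_[a-z0-9_]+\.sql$")

def check_migrations(filenames):
    """Return a list of problems; an empty list means the sequence is complete."""
    problems, numbers = [], []
    for name in filenames:
        match = MIGRATION_NAME.match(name)
        if match:
            numbers.append(int(match.group(1)))
        else:
            problems.append(f"unexpected file name: {name}")
    numbers.sort()
    # Sequence must be 1, 2, 3, ... with no gaps or duplicates
    for expected, actual in zip(range(1, len(numbers) + 1), numbers):
        if expected != actual:
            problems.append(f"expected migration {expected:04d}, found {actual:04d}")
            break
    return problems

# The index migration from the example was merged, but the one before it was not
print(check_migrations(["0001_create_orders.sql", "0003_add_orders_index.sql"]))
```

A CI job would fail the build whenever the returned list is non-empty.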
46
What is a service mesh canary deployment with Argo Rollouts?Hard
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-api
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: my-api-canary
      stableService: my-api-stable
      trafficRouting:
        istio:
          virtualService:
            name: my-api-vs
      steps:
      - setWeight: 5           # 5% to canary
      - pause: {duration: 5m}  # wait and watch
      - analysis:              # automated analysis
          templates:
          - templateName: error-rate
      - setWeight: 20          # 20% to canary
      - pause: {duration: 5m}
      - setWeight: 100         # 100% — rollout complete

  # AnalysisTemplate: automated rollback if error rate > 1%
  # Argo Rollouts queries Prometheus for the canary's error rate

Argo Rollouts integrates with Istio, NGINX, ALB for traffic splitting. Automated analysis via Prometheus/Datadog metrics automatically rolls back the canary if KPIs degrade.

47
How do you implement high availability for a stateless web application?Hard
Infrastructure HA:
  - Multiple AZs: at least 2 (ideally 3) across your primary region
  - Auto Scaling Group spans all AZs
  - Application Load Balancer in all AZs (cross-zone load balancing)
  - Health checks: ALB removes unhealthy instances automatically

Application HA:
  - Minimum 2 replicas at all times (PodDisruptionBudget in K8s)
  - No state stored on the instance (sessions in Redis, files in S3)
  - Graceful shutdown: handle SIGTERM, complete in-flight requests
  - Circuit breaker: fail fast when downstream is down (Resilience4j, Istio)

Database HA:
  - Multi-AZ RDS / Aurora (synchronous replication)
  - Connection pooling (RDS Proxy) to survive DB failover
  - Read replicas for read-heavy workloads

DNS / Traffic:
  - Route 53 health checks with failover routing
  - CloudFront for static assets + API caching
  - Health endpoints expose real dependency status (/health/ready)

Deployment HA:
  - Rolling update with maxUnavailable: 0 (and maxSurge of at least 1, so new pods become ready before old ones are removed)
  - PodDisruptionBudget prevents all pods going down at once
  - Liveness + readiness probes: traffic only to healthy instances
48
What is load testing and which tools do you use?Medium

Load testing simulates concurrent user traffic to validate performance characteristics and find breaking points before production.

Types:

  • Load test — simulate expected peak traffic. Verify system meets SLOs under normal high load.
  • Stress test — push beyond expected capacity to find the breaking point.
  • Soak test — run at sustained load for hours/days. Finds memory leaks, connection pool exhaustion, log disk fill.
  • Spike test — sudden 10x traffic increase. Does the system scale fast enough? Does it recover?
// k6 (modern, JS-based load testing):
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },  // ramp up to 100 users
    { duration: '5m', target: 100 },  // hold at 100 users
    { duration: '2m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],  // 99% under 500ms
    http_req_failed: ['rate<0.01'],    // <1% errors
  },
};

export default function() {
  http.get('https://api.myapp.com/orders');
  sleep(1);
}

Other tools: JMeter (Java, UI + scripted), Gatling (Scala DSL), Locust (Python).

49
What is a circuit breaker pattern and why is it essential in microservices?Medium

A circuit breaker stops cascading failures. When a downstream service is unavailable, threads accumulate waiting for timeouts — exhausting the thread pool and taking down the upstream service too.

// Resilience4j circuit breaker (Spring Boot):
@CircuitBreaker(name = "inventoryService", fallbackMethod = "fallbackInventory")
public InventoryResponse checkInventory(String productId) {
  return inventoryClient.check(productId);
}

private InventoryResponse fallbackInventory(String productId, Exception ex) {
  // Return cached response, default value, or degrade gracefully
  return new InventoryResponse(productId, 0, "UNAVAILABLE");
}

# Configuration:
resilience4j.circuitbreaker.instances.inventoryService:
  slidingWindowSize: 10
  failureRateThreshold: 50      # open after 50% failures in window
  waitDurationInOpenState: 30s  # wait 30s before trying again (half-open)

States: Closed (normal) → Open (failing fast, no calls to downstream) → Half-Open (after the wait, allow a limited number of trial calls) → back to Closed if the trial calls succeed, or back to Open if they fail.
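The state transitions can be sketched in a few lines. This is a deliberately simplified model: it counts consecutive failures instead of tracking a failure rate over a sliding window, and lets a single trial call decide the half-open outcome. The class and parameter names are assumptions for this example, not Resilience4j's API.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock            # injectable so the demo below is deterministic
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, fallback):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_seconds:
                return fallback()     # fail fast: downstream is not called at all
            self.state = "HALF_OPEN"  # wait elapsed: let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            return fallback()
        self.state = "CLOSED"         # success closes the breaker and resets the count
        self.failures = 0
        return result

now = [0.0]                           # fake clock
breaker = CircuitBreaker(failure_threshold=2, open_seconds=30, clock=lambda: now[0])

def failing_call():
    raise RuntimeError("inventory service down")

log = []
log.append(breaker.call(failing_call, lambda: "cached"))     # failure 1, still CLOSED
log.append(breaker.call(failing_call, lambda: "cached"))     # failure 2, breaker opens
log.append(breaker.call(lambda: "fresh", lambda: "cached"))  # OPEN: fallback, no call made
now[0] = 31.0                                                # wait duration has passed
log.append(breaker.call(lambda: "fresh", lambda: "cached"))  # HALF_OPEN trial succeeds
print(log, breaker.state)
```

The third call never reaches the downstream function even though it would have succeeded, which is the fail-fast behaviour that protects the caller's threads.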

50
What is a bulkhead pattern?Medium

The bulkhead pattern gives each downstream dependency its own bounded pool of resources (a dedicated thread pool or a cap on concurrent calls), so a slow or failed service cannot exhaust the shared thread pool and take down the entire application.

# Without bulkhead:
# One thread pool shared by all services:
# Inventory service slow → 100 threads waiting → Payment calls also blocked

# With bulkhead (Resilience4j semaphore bulkhead: a per-service cap on concurrent calls):
resilience4j.bulkhead.instances:
  inventoryService:
    maxConcurrentCalls: 10    # at most 10 inventory calls in flight at once
    maxWaitDuration: 0ms      # fail immediately if all 10 slots are busy
  paymentService:
    maxConcurrentCalls: 20    # payment has its own, separate limit of 20
# For a dedicated thread pool per service, Resilience4j has a separate
# thread-pool-bulkhead configuration.

Named after ship bulkheads — watertight compartments so flooding one compartment doesn't sink the ship. Combined with circuit breakers: circuit breaker prevents calling a failing service; bulkhead limits damage when the service is slow (not yet failing).
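A semaphore-based bulkhead can be sketched as below. The `Bulkhead` and `BulkheadFull` names are assumptions for this example; the non-blocking acquire plays the role of a zero wait duration.

```python
import threading

class BulkheadFull(Exception):
    """Raised when every slot for a dependency is already in use."""

class Bulkhead:
    def __init__(self, max_concurrent_calls):
        self._slots = threading.BoundedSemaphore(max_concurrent_calls)

    def call(self, func, *args):
        # Never queue: reject at once when no slot is free
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slot")
        try:
            return func(*args)
        finally:
            self._slots.release()

inventory = Bulkhead(max_concurrent_calls=1)
payment = Bulkhead(max_concurrent_calls=1)

def while_inventory_is_busy():
    """Runs while the only inventory slot is held by the outer call."""
    outcomes = {}
    try:
        inventory.call(lambda: "stock")
        outcomes["inventory"] = "accepted"
    except BulkheadFull:
        outcomes["inventory"] = "rejected"
    outcomes["payment"] = payment.call(lambda: "paid")  # separate compartment, unaffected
    return outcomes

result = inventory.call(while_inventory_is_busy)
print(result)
```

The demo nests calls only to stay single-threaded and deterministic; in a real service the competing calls would come from concurrent request threads.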

51
How do you handle a production outage? Walk through the incident response process.Hard
  1. Declare the incident — create an incident channel (#incident-20260623), assign roles: Incident Commander (owns communication), Tech Lead (drives investigation), Scribe (documents timeline).
  2. Assess and triage — what's the impact? How many users? Is it getting worse? Set severity (P1 = all users, revenue impact; P2 = subset of users; P3 = degraded but functional).
  3. Communicate — status page update within 5 minutes. Internal stakeholders. Don't go dark — "we're investigating" is better than silence.
  4. Mitigate first, fix later — rollback the last deployment if that's the likely cause. Bring the service back up before finding root cause. Restoration over diagnosis.
  5. Investigate in parallel — check recent deployments, config changes, traffic anomalies in dashboards, logs, traces.
  6. Fix and verify — apply fix, watch metrics, confirm steady state restored.
  7. All-clear — update status page, notify stakeholders, close incident channel.
  8. Postmortem — within 48 hours while it's fresh. Blameless. Action items with owners and due dates.
52
What is capacity planning in DevOps?Medium

Capacity planning ensures you have enough infrastructure to handle current and future load before it becomes a problem.

Process:

  • Baseline — measure current resource usage: CPU at peak, memory at peak, DB connections, I/O throughput
  • Trend analysis — graph resource usage over 3–6 months. Extrapolate growth.
  • Load model — "1 user session = 3 API calls × 50 ms of CPU each = 0.15 CPU-seconds." How many concurrent users can one instance serve?
  • Headroom — run at no more than 70% of capacity at peak (30% headroom for spikes and auto-scaling lag)
  • Pre-scaling for events — known traffic spikes (Black Friday, product launch) require proactive scaling: scheduled scaling + load test in advance

Modern cloud + auto-scaling reduces reactive capacity planning burden but doesn't eliminate it — you still need to set correct ASG minimums, reserved instances for baseline, and understand burst patterns.
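The load model and headroom rule combine into a simple sizing calculation. The traffic figure and instance size below are made-up inputs, and the sketch assumes CPU is the bottleneck.

```python
import math

def instances_needed(peak_sessions_per_sec, cpu_seconds_per_session,
                     cores_per_instance, max_utilisation=0.70):
    """Instances required so that peak CPU demand stays under the utilisation cap."""
    cpu_demand = peak_sessions_per_sec * cpu_seconds_per_session  # cores busy at peak
    usable_cores = cores_per_instance * max_utilisation           # keep 30% headroom
    return math.ceil(cpu_demand / usable_cores)

# Hypothetical peak: 200 sessions/s at 0.15 CPU-seconds each, on 4-core instances
count = instances_needed(peak_sessions_per_sec=200,
                         cpu_seconds_per_session=0.15,
                         cores_per_instance=4)
print(count)
```

Here 30 cores of demand divided by 2.8 usable cores per instance rounds up to 11 instances, which would be a sensible floor for the auto-scaling group at peak.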

53
What is OpenTelemetry and why is it important?Medium

OpenTelemetry (OTel) is a CNCF open standard for collecting and exporting telemetry data (metrics, logs, traces) from applications. It replaces vendor-specific SDKs with a single, vendor-neutral instrumentation layer.

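# Java: OTel auto-instrumentation (no code changes)
# Port 4317 is OTLP over gRPC; Java agent 2.x defaults to http/protobuf (port 4318), so set the protocol explicitly
java -javaagent:opentelemetry-javaagent.jar \
     -Dotel.service.name=payment-service \
     -Dotel.exporter.otlp.protocol=grpc \
     -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
     -jar app.jar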

# OTel Collector receives data, can export to multiple backends:
# → Prometheus (metrics)
# → Jaeger / Grafana Tempo (traces)
# → Loki (logs)
# → Datadog / Dynatrace / New Relic (commercial)

Why it matters: switch observability backends without changing application code. Instrument once, export anywhere. Avoid vendor lock-in. Adopted by AWS, Google, Microsoft, Datadog — the de facto standard for observability in 2026.

54
How do you optimise Docker image build times in CI?Medium
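# 1. Order Dockerfile layers by change frequency (most stable first).
#    Comments sit on their own lines: Dockerfiles do not support trailing comments.
# The build stage needs Maven, so use a Maven + JDK base rather than a JRE-only image
FROM maven:3.9-eclipse-temurin-21 AS build
WORKDIR /app
# pom.xml changes rarely, so the expensive dependency layer stays cached
COPY pom.xml .
RUN mvn dependency:go-offline
# src changes often; only the layers from here down are rebuilt
COPY src ./src
RUN mvn package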

# 2. BuildKit cache mounts (persist Maven/npm cache between builds):
# syntax=docker/dockerfile:1
RUN --mount=type=cache,target=/root/.m2 mvn package

# 3. CI-level caching (GitHub Actions):
- uses: actions/cache@v4
  with:
    path: ~/.m2/repository
    key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}

# 4. Parallelise CI jobs that don't depend on each other

# 5. Use a remote cache registry (BuildKit):
docker buildx build \
  --cache-from type=registry,ref=myregistry/myapp:cache \
  --cache-to type=registry,ref=myregistry/myapp:cache,mode=max \
  -t myapp:latest .
55
Design a CI/CD pipeline for a microservices application with 10 services.Hard
Repository strategy: Monorepo with path-based CI triggers
Tool stack: GitHub Actions + Docker + ECR + Argo CD + EKS

Per-service CI (triggered only when service dir changes):
  1. Lint + unit tests
  2. Build Docker image (multi-stage, cached layers)
  3. SAST (Semgrep) + SCA (Snyk) + image scan (Trivy)
  4. Push image to ECR: :sha + :latest-dev
  5. Update image tag in k8s-manifests repo (separate GitOps repo)

CD via Argo CD:
  k8s-manifests repo/
    apps/
      service-A/overlays/dev   → Argo CD auto-syncs to dev cluster
      service-A/overlays/staging → Argo CD auto-syncs on PR merge
      service-A/overlays/prod  → Argo CD syncs only on release tag

Promotion workflow:
  Dev:     every PR merge → auto-deploy to dev (Argo CD)
  Staging: every merge to main → deploy all services to staging
           → automated integration tests (k6 load + Playwright E2E)
  Prod:    create release tag → Argo Rollouts canary (5% → 20% → 100%)
           → automated analysis (Prometheus error rate gate)
           → auto-rollback if error rate > SLO threshold

Contract testing (Pact): each service PR runs consumer-driven contract tests
  → prevents interface breaking changes before deployment
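Path-based triggering in a monorepo comes down to mapping changed files to services. The sketch below assumes a hypothetical layout of `services/<name>/...` plus shared code under `libs/`, where a shared change rebuilds everything; in GitHub Actions the same effect is usually achieved with `paths` filters per workflow.

```python
def services_to_build(changed_paths, all_services, shared_prefixes=("libs/",)):
    """Return the sorted list of services whose pipelines should run."""
    # A change to shared code can affect every service, so rebuild them all
    if any(path.startswith(shared_prefixes) for path in changed_paths):
        return sorted(all_services)
    touched = set()
    for path in changed_paths:
        parts = path.split("/")
        if len(parts) > 2 and parts[0] == "services" and parts[1] in all_services:
            touched.add(parts[1])
    return sorted(touched)

# Hypothetical service names
ALL_SERVICES = {"orders", "payments", "inventory"}

print(services_to_build(["services/orders/src/main.py", "README.md"], ALL_SERVICES))
print(services_to_build(["libs/common/auth.py"], ALL_SERVICES))
```

Only the `orders` pipeline runs for the first change set, while the shared-library change fans out to all three services.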

What to Study Next