Kubernetes Chaos Engineering with Chaos Mesh

Chaos engineering is the practice of deliberately introducing controlled failures into a system to verify that it behaves as expected under adverse conditions. Chaos Mesh is a CNCF project that brings chaos engineering natively to Kubernetes. It offers dozens of fault injection types, including pod kills, network partitions, CPU/memory stress, I/O errors, and JVM exceptions, all controlled through Kubernetes CRDs. By rehearsing failures in staging before they occur in production, teams gather evidence that their services are resilient rather than assuming it.

Why Chaos Engineering in Kubernetes

Kubernetes introduces several failure modes that do not exist in traditional deployments: pod scheduling failures, node NotReady states, network policy misconfigurations, etcd leader elections, and DNS resolution delays during pod startup. Traditional load testing validates throughput under normal conditions but does not surface these failure modes. Chaos engineering fills that gap by answering a different question: not "how fast can my system go?" but "how does my system fail, and does it recover gracefully?"

Common weaknesses that chaos experiments reveal in Kubernetes clusters include:

  • Services with no readiness probes that receive traffic before they are ready
  • Deployments whose PodDisruptionBudget sets minAvailable: 0 (or that have no PodDisruptionBudget at all), which allows total unavailability during voluntary disruptions
  • Missing circuit breakers that cause cascading failures when a downstream service is slow
  • Incorrect retry logic that amplifies load on an already-struggling dependency
  • StatefulSets that keep state on ephemeral volumes, or on PersistentVolumes with a Delete reclaim policy, and lose data when pods or claims are recreated

GameDay principle: Run chaos experiments as a structured "GameDay": announce the experiment window to the team, define a steady-state hypothesis (e.g., "p99 latency stays under 500ms"), run the experiment, and debrief on findings. This builds organisational resilience practices alongside technical ones.
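
As a minimal sketch of closing the PodDisruptionBudget gap listed above, the manifest below keeps at least two payment-service pods available during voluntary disruptions such as node drains. The name payment-service-pdb and the value 2 are illustrative assumptions; pick a value below your replica count so that drains can still make progress.

```yaml
# Hypothetical PodDisruptionBudget for the payment-service pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb      # assumed name
  namespace: production
spec:
  minAvailable: 2                # assumed value; must be lower than the replica count
  selector:
    matchLabels:
      app: payment-service
```

A PodDisruptionBudget only governs evictions, so it should not be expected to block a pod-kill experiment, which deletes pods directly. Pod-kill exercises replica count and rescheduling; node drains exercise the budget.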

Installing Chaos Mesh

Chaos Mesh installs via Helm and deploys a controller manager, a DaemonSet (chaos-daemon) that runs on every node, and a web dashboard for experiment management.

# Add the Chaos Mesh Helm repository
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Create the namespace
kubectl create namespace chaos-mesh

# Install Chaos Mesh
helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set dashboard.create=true \
  --set dashboard.securityMode=false    # disables dashboard auth; set to true in production

# Verify the pods are running
kubectl get pods -n chaos-mesh

After installation, access the dashboard via port-forward:

kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
# Open http://localhost:2333
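
Runtime detection: Specify the container runtime and socket that match your nodes. Use chaosDaemon.runtime=containerd with /run/containerd/containerd.sock for containerd (the default on most clusters since Kubernetes 1.24 removed dockershim), or chaosDaemon.runtime=docker with /var/run/docker.sock for older Docker-based nodes.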
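
The same settings can be kept in a values file instead of --set flags, which is easier to review and version. This is a sketch that reuses the chart keys from the install command above; the file name chaos-mesh-values.yaml is an assumption, and securityMode is switched to true here to show the production-leaning variant.

```yaml
# chaos-mesh-values.yaml (assumed file name)
chaosDaemon:
  runtime: containerd
  socketPath: /run/containerd/containerd.sock
dashboard:
  create: true
  securityMode: true   # require authentication on the dashboard
```

Pass the file to Helm with -f chaos-mesh-values.yaml in place of the individual --set flags.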

Pod Chaos: Kill, Failure, and Container Kills

PodChaos is the most fundamental chaos experiment — it simulates pod failures that occur naturally in production due to OOM kills, preemption, or node evictions.

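# Kill one random payment-service pod in the production namespace every 5 minutes.
# Chaos Mesh 2.x schedules recurring experiments with the Schedule CRD;
# the inline `scheduler` field only exists in 1.x.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: payment-pod-kill
  namespace: chaos-mesh
spec:
  schedule: "*/5 * * * *"
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill      # one-shot action, so no duration is set
    mode: one
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service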

---
# Make 50% of the api-server pods unavailable for 2 minutes to test HA
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-pod-failure-50pct
  namespace: chaos-mesh
spec:
  action: pod-failure
  mode: fixed-percent
  value: "50"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  duration: "2m"

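The pod-failure action makes the selected pods unavailable for the whole duration: the pod object stays in place, but its containers stop serving (Chaos Mesh does this by temporarily swapping their images for a pause image). The pod-kill action deletes the pod outright and relies on its controller to recreate it. Use container-kill to kill a specific container within a multi-container pod; in Chaos Mesh 2.x it requires the containerNames field.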
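
A container-kill experiment might look like the sketch below, which kills one container in a single payment-service pod and leaves the rest of the pod running. The experiment name and the container name envoy-sidecar are hypothetical; substitute a container that actually exists in your pod spec.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-sidecar-kill     # hypothetical name
  namespace: chaos-mesh
spec:
  action: container-kill
  mode: one
  containerNames:
    - envoy-sidecar              # hypothetical container name
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
```

This is a useful way to check that the kubelet restarts the container and that the remaining containers tolerate its brief absence.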

Network Chaos: Partitions, Latency, and Packet Loss

NetworkChaos experiments simulate the network degradation that happens during cloud provider incidents, cross-AZ communication issues, or simply overloaded network interfaces.

# Add 100ms latency with 20ms jitter to all traffic from the frontend
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: frontend-network-delay
  namespace: chaos-mesh
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: frontend
  delay:
    latency: 100ms
    jitter: 20ms
    correlation: "25"
  direction: to
  duration: "3m"

---
# Simulate 5% packet loss on traffic from the API to the database
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: api-db-packet-loss
  namespace: chaos-mesh
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  loss:
    loss: "5"
    correlation: "25"
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - production
      labelSelectors:
        app: postgres
  duration: "5m"

Partition vs delay: Use the partition action to completely block traffic between two services (this tests circuit breakers and fallback logic). Use delay to test timeout handling and retry logic without a full loss of connectivity.
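
For comparison with the delay and loss examples, a partition between the same two services could be sketched as follows. The experiment name api-db-partition and the two-minute duration are assumptions chosen for illustration.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: api-db-partition         # hypothetical name
  namespace: chaos-mesh
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  direction: both                # block traffic in both directions
  target:
    mode: all
    selector:
      namespaces:
        - production
      labelSelectors:
        app: postgres
  duration: "2m"
```

While it runs, the API should fail fast or fall back rather than hang until client timeouts expire.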

Stress Chaos: CPU and Memory Pressure

StressChaos consumes CPU or memory on target pods to simulate resource contention, noisy-neighbour effects, or GC pressure on JVM applications.

# Run 4 CPU stress workers at 80% load in 2 api-server pods for 3 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: api-cpu-stress
  namespace: chaos-mesh
spec:
  mode: fixed
  value: "2"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  stressors:
    cpu:
      workers: 4
      load: 80
  duration: "3m"

---
# Memory pressure to trigger OOM or GC pauses
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: api-memory-stress
  namespace: chaos-mesh
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  stressors:
    memory:
      workers: 4
      size: 512MB
  duration: "2m"

I/O Chaos: Disk Errors and Latency

IOChaos injects file system errors or latency into pods, testing how services handle slow or failing persistent storage — critical for databases and file-processing workloads.

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: postgres-io-delay
  namespace: chaos-mesh
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgres
  volumePath: /var/lib/postgresql/data
  path: "**"
  delay: 50ms
  percent: 30      # apply to 30% of I/O operations
  duration: "2m"
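
The latency example covers slow storage; failing storage can be sketched with the fault action, which returns an error code for a share of file operations. The experiment name, the 10% figure, and the path glob are assumptions, and errno 5 corresponds to EIO (a generic I/O error) on Linux.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: postgres-io-fault        # hypothetical name
  namespace: chaos-mesh
spec:
  action: fault
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgres
  volumePath: /var/lib/postgresql/data
  path: /var/lib/postgresql/data/**/*   # assumed glob covering all files on the volume
  errno: 5                       # EIO
  percent: 10                    # fail 10% of matching operations
  duration: "2m"
```

Run this kind of experiment only against a database whose data you can afford to lose, since injected write errors may cause the server to crash or refuse writes.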

Chaos Workflows and Schedules

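Chaos Mesh Workflows let you compose multiple experiments into a sequential or parallel pipeline, modelling realistic failure scenarios like "network partition followed by pod kill followed by recovery verification". In the example below, the final Suspend step only holds the workflow open until its deadline; it does not check anything itself, so recovery has to be verified against your metrics during that window.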

apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: full-resilience-test
  namespace: chaos-mesh
spec:
  entry: start
  templates:
    - name: start
      templateType: Serial
      deadline: 20m
      children:
        - network-delay-phase
        - pod-kill-phase
        - recovery-check

    - name: network-delay-phase
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces: [production]
          labelSelectors:
            app: api-server
        delay:
          latency: 200ms
        duration: 4m

    - name: pod-kill-phase
      templateType: PodChaos
      deadline: 3m
      podChaos:
        action: pod-kill
        mode: one
        selector:
          namespaces: [production]
          labelSelectors:
            app: api-server
        duration: 1m

    - name: recovery-check
      templateType: Suspend
      deadline: 5m
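
For the parallel case, a Parallel template starts all of its children at the same time. The fragment below is a sketch of an extra entry for the templates list of the workflow above, reusing its two fault templates; the name combined-fault-phase is hypothetical.

```yaml
# Fragment: an additional entry under spec.templates of full-resilience-test
- name: combined-fault-phase     # hypothetical name
  templateType: Parallel
  deadline: 5m
  children:
    - network-delay-phase        # both children start together
    - pod-kill-phase
```

To run it, list combined-fault-phase among the children of the start template in place of the two separate phases.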

Observing Experiments and Steady-State Hypothesis

A chaos experiment without observability is just random destruction. Define a steady-state hypothesis before each experiment: a measurable condition that describes "normal" system behaviour. Verify the hypothesis before the experiment (baseline), monitor it during the experiment, and check it immediately after (recovery validation).

Useful Prometheus queries to validate during chaos experiments:

# HTTP error rate (steady state: below 1%)
sum(rate(http_requests_total{status=~"5.."}[1m]))
  / sum(rate(http_requests_total[1m])) * 100

# p99 response time (steady state: below 500ms)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[1m])) by (le)
)

# Pod restart count spike
increase(kube_pod_container_status_restarts_total[5m]) > 0
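
The first two queries can also be turned into alerting rules so that a violated hypothesis pages the safety operator rather than relying on someone watching a dashboard. This is a sketch of a Prometheus rule file under the same assumptions as the queries (the metric and label names come from them); the group name, alert names, and the 1m hold time are illustrative choices.

```yaml
groups:
  - name: chaos-steady-state     # assumed group name
    rules:
      - alert: SteadyStateErrorRateViolated
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1m]))
            / sum(rate(http_requests_total[1m])) * 100 > 1
        for: 1m
        annotations:
          summary: "HTTP error rate above the 1% steady-state threshold"
      - alert: SteadyStateLatencyViolated
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[1m])) by (le)
          ) > 0.5
        for: 1m
        annotations:
          summary: "p99 latency above the 500ms steady-state threshold"
```

The latency threshold is written as 0.5 because the histogram is measured in seconds.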

Set up Grafana annotations to mark experiment start and end times on your dashboards. Chaos Mesh records Kubernetes events for each experiment; an event exporter or your alerting stack can capture these to mark experiment windows automatically.

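Always have an abort button: Chaos Mesh experiments can be stopped at any time. Deleting the resource, for example with kubectl delete podchaos api-pod-failure-50pct -n chaos-mesh, ends the experiment and triggers recovery of the injected fault; annotating it with experiment.chaos-mesh.org/pause=true pauses it without deleting it. Recurring experiments keep firing until the object that schedules them is deleted as well. For production chaos runs, assign a "safety operator" whose only job is monitoring and aborting the experiment if it exceeds defined blast radius limits.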