Kubernetes Chaos Engineering with Chaos Mesh
Chaos engineering is the practice of deliberately introducing controlled failures into a system to verify that it behaves as expected under adverse conditions. Chaos Mesh is a CNCF project that brings chaos engineering natively to Kubernetes, offering dozens of fault injection types (pod kills, network partitions, CPU/memory stress, I/O errors, and JVM exceptions), all controlled through Kubernetes CRDs. By rehearsing failures in staging before they occur in production, teams gather evidence that their services are resilient.
Table of Contents
- Why Chaos Engineering in Kubernetes
- Installing Chaos Mesh
- Pod Chaos: Kill, Failure, and Container Kills
- Network Chaos: Partitions, Latency, and Packet Loss
- Stress Chaos: CPU and Memory Pressure
- I/O Chaos: Disk Errors and Latency
- Chaos Workflows and Schedules
- Observing Experiments and Steady-State Hypothesis
Why Chaos Engineering in Kubernetes
Kubernetes introduces several failure modes that do not exist in traditional deployments: pod scheduling failures, node NotReady states, network policy misconfigurations, etcd leader elections, and DNS resolution delays during pod startup. Traditional load testing validates throughput under normal conditions but does not surface these failure modes. Chaos engineering fills that gap by answering a different question: not "how fast can my system go?" but "how does my system fail, and does it recover gracefully?"
Common weaknesses that chaos experiments reveal in Kubernetes clusters include:
- Services with no readiness probes that receive traffic before they are ready
- Deployments with minAvailable: 0 PodDisruptionBudgets that allow total unavailability
- Missing circuit breakers that cause cascading failures when a downstream service is slow
- Incorrect retry logic that amplifies load on an already-struggling dependency
- StatefulSets that keep state on ephemeral storage instead of persistent volumes and lose data on pod restart
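As a concrete illustration of the PodDisruptionBudget weakness listed above, the sketch below shows a budget that keeps a floor of pods available during voluntary disruptions such as node drains. The name api-server-pdb, the production namespace, the app: api-server label, and the value 2 are assumptions chosen for illustration, not values from a real cluster.

```yaml
# Hypothetical PodDisruptionBudget: keep at least 2 api-server pods
# available during voluntary disruptions (for example, node drains).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb        # assumed name
  namespace: production       # assumed namespace
spec:
  minAvailable: 2             # assumed floor; must be below the replica count
  selector:
    matchLabels:
      app: api-server         # assumed pod label
```

A budget only constrains evictions that go through the Kubernetes eviction API, so it complements chaos experiments rather than replacing them.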
Installing Chaos Mesh
Chaos Mesh installs via Helm and deploys a controller, a daemon set on every node, and a web dashboard for experiment management.
# Add the Chaos Mesh Helm repository
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
# Create the namespace
kubectl create namespace chaos-mesh
# Install Chaos Mesh
helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set dashboard.create=true \
  --set dashboard.securityMode=false  # disables dashboard auth; keep auth enabled in production
# Verify the pods are running
kubectl get pods -n chaos-mesh
After installation, access the dashboard via port-forward:
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
# Open http://localhost:2333
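Set chaosDaemon.socketPath to match the container runtime on your nodes: /run/containerd/containerd.sock for containerd (most clusters since Kubernetes 1.24), or /var/run/docker.sock together with chaosDaemon.runtime=docker for older Docker-based nodes.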
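The same settings can be kept in a Helm values file instead of repeated --set flags. This is a minimal sketch that mirrors the flags used above; the file name values.yaml is an assumption, and securityMode is set to true here on the assumption that dashboard authentication should be enabled.

```yaml
# values.yaml (assumed file name): mirrors the --set flags shown above
chaosDaemon:
  runtime: containerd
  socketPath: /run/containerd/containerd.sock
dashboard:
  create: true
  securityMode: true   # assumption: keep dashboard authentication on
```

Pass the file with -f values.yaml on the helm upgrade --install command in place of the individual --set flags.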
Pod Chaos: Kill, Failure, and Container Kills
PodChaos is the most fundamental chaos experiment — it simulates pod failures that occur naturally in production due to OOM kills, preemption, or node evictions.
# Kill one random payment-service pod in the production namespace every 5 minutes.
# Note: the inline scheduler field is Chaos Mesh 1.x syntax; Chaos Mesh 2.x
# moved recurring runs to the separate Schedule resource.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-pod-kill
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  scheduler:
    cron: "@every 5m"
  duration: "30s"
---
# Make 50% of api-server pods unavailable for 2 minutes to test HA
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-pod-failure-50pct
  namespace: chaos-mesh
spec:
  action: pod-failure
  mode: fixed-percent
  value: "50"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  duration: "2m"
The pod-failure action makes the selected pods unavailable for the configured duration without deleting them, while pod-kill deletes the pod so that its controller has to create a replacement. Use container-kill to kill a specific container within a multi-container pod.
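A container-kill experiment might look like the sketch below. It assumes Chaos Mesh 2.x, where the target containers are listed under containerNames; the experiment name and the sidecar container name envoy are hypothetical.

```yaml
# Hypothetical example: kill only the sidecar container in one api-server pod
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-sidecar-container-kill   # assumed name
  namespace: chaos-mesh
spec:
  action: container-kill
  mode: one
  containerNames:
    - envoy                          # assumed container name
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
```

The remaining containers in the pod keep running, which makes this useful for checking how the main application behaves while a sidecar restarts.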
Network Chaos: Partitions, Latency, and Packet Loss
NetworkChaos experiments simulate the network degradation that happens during cloud provider incidents, cross-AZ communication issues, or simply overloaded network interfaces.
# Add 100ms latency with 20ms jitter to all traffic from the frontend
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: frontend-network-delay
  namespace: chaos-mesh
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: frontend
  delay:
    latency: 100ms
    jitter: 20ms
    correlation: "25"
  direction: to
  duration: "3m"
---
# Simulate 5% packet loss between API and database
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: api-db-packet-loss
  namespace: chaos-mesh
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  loss:
    loss: "5"
    correlation: "25"
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - production
      labelSelectors:
        app: postgres
  duration: "5m"
Use the partition action to completely block traffic between two services, which tests circuit breakers and fallback logic. Use delay to test timeout handling and retry logic without a full loss of connectivity.
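A partition between the API and database pods might be expressed as follows. This is a sketch that reuses the selectors from the packet-loss example; the experiment name, the both direction, and the 2-minute duration are assumptions.

```yaml
# Hypothetical example: block all traffic between api-server and postgres
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: api-db-partition       # assumed name
  namespace: chaos-mesh
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  direction: both              # assumed: cut traffic in both directions
  target:
    mode: all
    selector:
      namespaces:
        - production
      labelSelectors:
        app: postgres
  duration: "2m"               # assumed duration
```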
Stress Chaos: CPU and Memory Pressure
StressChaos consumes CPU or memory on target pods to simulate resource contention, noisy-neighbour effects, or GC pressure on JVM applications.
# Run 4 CPU stress workers at 80% load in 2 api-server pods for 3 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: api-cpu-stress
  namespace: chaos-mesh
spec:
  mode: fixed
  value: "2"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  stressors:
    cpu:
      workers: 4
      load: 80
  duration: "3m"
---
# Memory pressure to trigger OOM or GC pauses
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: api-memory-stress
  namespace: chaos-mesh
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  stressors:
    memory:
      workers: 4
      size: 512MB
  duration: "2m"
I/O Chaos: Disk Errors and Latency
IOChaos injects file system errors or latency into pods, testing how services handle slow or failing persistent storage — critical for databases and file-processing workloads.
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: postgres-io-delay
  namespace: chaos-mesh
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgres
  volumePath: /var/lib/postgresql/data
  path: "**"
  delay: 50ms
  percent: 30  # apply to 30% of I/O operations
  duration: "2m"
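Besides latency, IOChaos also offers a fault action that makes file operations return an error code instead of slowing them down. The sketch below reuses the selector and volume path from the latency example; the experiment name, the 10% rate, and the choice of errno 5 (EIO on Linux) are assumptions.

```yaml
# Hypothetical example: return an I/O error for 10% of file operations
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: postgres-io-fault      # assumed name
  namespace: chaos-mesh
spec:
  action: fault
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgres
  volumePath: /var/lib/postgresql/data
  path: "**"
  errno: 5                     # assumed: EIO (input/output error)
  percent: 10                  # assumed rate
  duration: "2m"
```

Injected errors on a database volume can cause the database to crash or refuse writes, so this kind of experiment is best tried first on a disposable replica.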
Chaos Workflows and Schedules
Chaos Mesh Workflows let you compose multiple experiments into a sequential or parallel pipeline, modelling realistic failure scenarios like "network partition followed by pod kill followed by recovery verification".
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: full-resilience-test
  namespace: chaos-mesh
spec:
  entry: start
  templates:
    - name: start
      templateType: Serial
      deadline: 20m
      children:
        - network-delay-phase
        - pod-kill-phase
        - recovery-check
    - name: network-delay-phase
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces: [production]
          labelSelectors:
            app: api-server
        delay:
          latency: 200ms
        duration: 4m
    - name: pod-kill-phase
      templateType: PodChaos
      deadline: 3m
      podChaos:
        action: pod-kill
        mode: one
        selector:
          namespaces: [production]
          labelSelectors:
            app: api-server
        duration: 1m
    # Suspend only pauses the workflow for the deadline; run the actual
    # recovery checks (dashboards, alerts) during this window.
    - name: recovery-check
      templateType: Suspend
      deadline: 5m
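For recurring experiments, Chaos Mesh 2.x provides a separate Schedule resource that wraps an experiment spec and runs it on a cron schedule. The following is a minimal sketch under that assumption; the resource name, the five-minute cron expression, and the historyLimit and concurrencyPolicy values are illustrative choices.

```yaml
# Hypothetical example: kill one payment-service pod every 5 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: payment-pod-kill-schedule   # assumed name
  namespace: chaos-mesh
spec:
  schedule: "*/5 * * * *"           # standard cron: every 5 minutes
  type: PodChaos
  historyLimit: 5                   # assumed: keep the last 5 runs
  concurrencyPolicy: Forbid         # assumed: skip a run if one is still active
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service
```

Deleting the Schedule object stops future runs, which gives a single place to switch a recurring experiment off.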
Observing Experiments and Steady-State Hypothesis
A chaos experiment without observability is just random destruction. Define a steady-state hypothesis before each experiment: a measurable condition that describes "normal" system behaviour. Verify the hypothesis before the experiment (baseline), monitor it during the experiment, and check it immediately after (recovery validation).
Useful Prometheus queries to validate during chaos experiments:
# HTTP error rate in percent (steady state: below 1%)
sum(rate(http_requests_total{status=~"5.."}[1m]))
  / sum(rate(http_requests_total[1m])) * 100
# p99 response time in seconds (steady state: below 500ms, i.e. 0.5)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[1m])) by (le)
)
# Pod restart count spike
increase(kube_pod_container_status_restarts_total[5m]) > 0
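A steady-state threshold can also be encoded as a Prometheus alerting rule so that a violation is flagged automatically during an experiment. The sketch below wraps the error-rate query in a rule file; the group name, alert name, severity label, and the one-minute for window are assumptions, and the metric name is the one used in the queries above.

```yaml
# Hypothetical Prometheus rule file: alert when the 1% error-rate
# steady-state threshold is violated for at least one minute.
groups:
  - name: chaos-steady-state              # assumed group name
    rules:
      - alert: SteadyStateErrorRateViolated   # assumed alert name
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1m]))
            / sum(rate(http_requests_total[1m])) * 100 > 1
        for: 1m                            # assumed tolerance window
        labels:
          severity: critical               # assumed label
        annotations:
          summary: "HTTP 5xx error rate is above the 1% steady-state threshold"
```

An alert like this can double as the abort signal for whoever is supervising the experiment.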
Set up Grafana annotations to mark experiment start and end times on your dashboards. Chaos Mesh records Kubernetes events for its experiments, which your alerting stack can capture to mark experiment windows automatically.
To abort a running experiment, delete its resource, for example: kubectl delete podchaos payment-pod-kill -n chaos-mesh. For production chaos runs, assign a "safety operator" whose only job is to monitor the experiment and abort it if it exceeds the defined blast radius limits.