Kubernetes Upgrade Guide: Zero-Downtime Cluster Updates

Upgrading a Kubernetes cluster is one of the highest-risk operational tasks in a production environment: a mistake can take down every workload at once. Kubernetes ships roughly three minor releases per year and supports each for about 14 months, so staying inside the support window means upgrading every 4-5 months. This guide walks through the complete upgrade process for both self-managed kubeadm clusters and managed services (EKS, GKE, and AKS), with a focus on zero downtime for running workloads.

Pre-Upgrade Checklist
Upgrade Strategy: Skew Policy and Order
Upgrading the Control Plane with kubeadm
Draining and Upgrading Worker Nodes
Upgrading Managed Clusters (EKS, GKE, AKS)
Upgrading Add-ons and API Deprecations
Rollback Procedures
Post-Upgrade Validation

Pre-Upgrade Checklist

Preparation prevents the most common upgrade failures. Complete every item in this checklist before touching any cluster component.

Backup etcd: Take a snapshot with etcdctl snapshot save. This is your only rollback path if the control plane upgrade fails.
Review API deprecations: Run kubent (the kube-no-trouble tool) to find resources that use deprecated API versions scheduled for removal in the target version.
Upgrade staging first: Always rehearse on a non-production cluster running the same workloads. Document the actual time taken and any issues encountered.
Check add-on compatibility: CNI plugins (Calico, Cilium), storage drivers (CSI), and ingress controllers all have version compatibility matrices.
Verify PodDisruptionBudgets: Ensure all critical deployments have PDBs that allow at least one replica to be unavailable during node drains.
Notify stakeholders: Schedule a maintenance window even for zero-downtime upgrades. Something can always go wrong.
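
The etcd backup item can be sketched as a small shell function. The certificate paths below are kubeadm's defaults and the snapshot path is an assumption; adjust both for your cluster. backup_etcd is a hypothetical helper name, and the function expects to run as root on a control plane node that has etcdctl installed.

```shell
# Sketch: snapshot etcd before an upgrade, then inspect the snapshot file.
# Assumes kubeadm's default certificate locations and a local etcd member.
backup_etcd() {
  local snapshot="${1:-/backup/etcd-snapshot-pre-upgrade.db}"
  ETCDCTL_API=3 etcdctl snapshot save "$snapshot" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key || return 1
  # Print hash, revision, key count, and size so an empty or truncated file is obvious
  ETCDCTL_API=3 etcdctl snapshot status "$snapshot" --write-out=table
}

# Usage: backup_etcd /backup/etcd-snapshot-pre-upgrade.db
```

Copy the snapshot off the node afterwards; a backup that lives only on the machine being upgraded is of limited use.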

# Install kubent to check deprecated API versions
brew install kubent    # macOS
# or
# or, on Linux, use the project's install script (release archives are versioned,
# so there is no fixed "latest" tarball name to download)
sh -c "$(curl -sSL https://git.io/install-kubent)"
kubent

# Check current client and server versions
# (the --short flag was removed in kubectl 1.28; the default output is now concise)
kubectl version

# List available kubeadm versions
apt-cache madison kubeadm | head -10

One minor version at a time: Kubernetes does not support skipping minor versions during an upgrade. You must upgrade from 1.28 → 1.29 → 1.30, not directly from 1.28 → 1.30. Plan accordingly.
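
As an illustration of the one-minor-at-a-time rule, the following sketch prints the sequence of releases you would have to pass through. upgrade_path is a hypothetical helper, and it assumes versions are given as MAJOR.MINOR within the same major version.

```shell
# Sketch: list each minor-version hop between the current and target release.
# Arguments are "MAJOR.MINOR" strings, e.g. 1.28 and 1.30.
upgrade_path() {
  local major="${1%%.*}"
  local current_minor="${1#*.}"
  local target_minor="${2#*.}"
  local minor=$((current_minor + 1))
  while [ "$minor" -le "$target_minor" ]; do
    echo "${major}.${minor}"
    minor=$((minor + 1))
  done
}

# Usage: upgrade_path 1.28 1.30
# prints 1.29 and 1.30 on separate lines
```

Each line of output is one full upgrade cycle (control plane, then workers, then validation), which is useful when estimating the maintenance window.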

Upgrade Strategy: Skew Policy and Order

Kubernetes defines a strict version skew policy that dictates the safe order of component upgrades. Violating this policy can cause API incompatibility and cluster instability.

kube-apiserver: Must be upgraded first among control plane components. All other components talk to it.
kube-controller-manager and kube-scheduler: Can be up to 1 minor version behind kube-apiserver.
kubelet: Can be up to 3 minor versions behind kube-apiserver as of Kubernetes 1.28 (2 in earlier releases), and must never be newer than it.
kubectl: Can be ±1 minor version from kube-apiserver.
Worker nodes: Always upgraded after the control plane. Never before.

The safest sequence for a kubeadm cluster is: upgrade all control plane nodes → verify cluster health → upgrade worker nodes one at a time via drain/upgrade/uncordon.
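
The kubelet part of the skew policy can be expressed as a quick check. This is a sketch based on the upstream version skew policy as I understand it (a kubelet may trail the API server by up to three minors, or two if the kubelet is older than 1.25, and may never be newer); kubelet_skew_ok is a hypothetical helper that takes minor numbers only.

```shell
# Sketch: is a kubelet minor version within supported skew of the API server?
# Usage: kubelet_skew_ok <apiserver-minor> <kubelet-minor>, e.g. kubelet_skew_ok 30 28
kubelet_skew_ok() {
  local api_minor="$1" kubelet_minor="$2" max_skew=3
  # Kubelets older than 1.25 are only supported up to two minors behind
  if [ "$kubelet_minor" -lt 25 ]; then
    max_skew=2
  fi
  # A kubelet must never be newer than the API server
  if [ "$kubelet_minor" -gt "$api_minor" ]; then
    return 1
  fi
  [ $((api_minor - kubelet_minor)) -le "$max_skew" ]
}

# To see what each node is actually running:
# kubectl get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'
```

Running the check against the target API server version before the control plane upgrade shows whether any worker would fall out of the supported range while it waits its turn.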

Upgrading the Control Plane with kubeadm

The kubeadm upgrade process is largely automated but requires careful execution. Run these steps on the first control plane node, then repeat on additional control plane nodes.

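# Step 1: Upgrade kubeadm on the first control plane node
# (if you install from pkgs.k8s.io, each minor release has its own repository,
# so point /etc/apt/sources.list.d/kubernetes.list at the v1.30 repo first)
sudo apt-mark unhold kubeadm
sudo apt-get update
sudo apt-get install -y kubeadm=1.30.0-1.1
sudo apt-mark hold kubeadm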

# Step 2: Verify the upgrade plan
sudo kubeadm upgrade plan

# Step 3: Apply the upgrade (this upgrades API server, controller manager, scheduler, etcd)
sudo kubeadm upgrade apply v1.30.0

# Step 4: Upgrade kubelet and kubectl on the control plane node
sudo apt-mark unhold kubelet kubectl
sudo apt-get install -y kubelet=1.30.0-1.1 kubectl=1.30.0-1.1
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Step 5: Verify control plane health
# (componentstatus has been deprecated since 1.19; query the readyz endpoint instead)
kubectl get nodes
kubectl get --raw='/readyz?verbose'
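
Beyond node status, it can help to confirm that the control plane static pods are actually running the new images. The sketch below assumes kubeadm's default pod labels (tier=control-plane and component=<name>); stale_control_plane_pods is a hypothetical helper, and etcd is excluded because its image tag follows etcd's own version numbers.

```shell
# Sketch: print control plane pods whose image tag does not match the target version.
# Usage: stale_control_plane_pods v1.30.0
stale_control_plane_pods() {
  local target="$1"
  kubectl get pods -n kube-system \
    -l 'tier=control-plane,component!=etcd' \
    --no-headers \
    -o custom-columns='POD:.metadata.name,IMAGE:.spec.containers[0].image' \
  | awk -v tag=":$target" 'index($2, tag) == 0 {print $1}'
}
```

Empty output means every matching pod carries the target tag. In an HA setup, pods on control plane nodes that have not yet been upgraded will be listed until those nodes are done.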

For additional control plane nodes (HA setup), run a different command that skips the cluster-level upgrade that was already applied:

# On additional control plane nodes only
# (upgrade the kubeadm package on the node first, as in step 1)
sudo kubeadm upgrade node

# Then upgrade kubelet/kubectl same as step 4 above

Draining and Upgrading Worker Nodes

Worker node upgrades must be performed one node at a time to maintain workload availability. The drain → upgrade → uncordon sequence is the standard procedure.

# Step 1: Cordon the node (mark as unschedulable)
kubectl cordon worker-node-01

# Step 2: Drain all pods off the node
# --ignore-daemonsets: DaemonSet pods are managed globally, skip them
# --delete-emptydir-data: evicts pods using emptyDir volumes (ephemeral data will be lost)
kubectl drain worker-node-01 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=300s

# Step 3: SSH to the worker node and upgrade kubeadm
sudo apt-mark unhold kubeadm
sudo apt-get update
sudo apt-get install -y kubeadm=1.30.0-1.1
sudo apt-mark hold kubeadm
sudo kubeadm upgrade node

# Step 4: Upgrade kubelet and kubectl
sudo apt-mark unhold kubelet kubectl
sudo apt-get install -y kubelet=1.30.0-1.1 kubectl=1.30.0-1.1
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Step 5: Uncordon the node (return to schedulable)
kubectl uncordon worker-node-01

# Wait for node to be Ready before proceeding to next node
kubectl wait --for=condition=Ready node/worker-node-01 --timeout=120s
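
For clusters with more than a handful of workers, the steps above can be wrapped in a function and run node by node. This is a sketch only: it assumes passwordless SSH and sudo on each node and that node names resolve as hostnames. upgrade_worker and PKG_VERSION are hypothetical names.

```shell
# Sketch: cordon, drain, upgrade, and uncordon a single worker node.
PKG_VERSION="${PKG_VERSION:-1.30.0-1.1}"

upgrade_worker() {
  local node="$1"
  kubectl cordon "$node" || return 1
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s || return 1
  # Run the package upgrade on the node itself; stop at the first failing command
  ssh "$node" "sudo apt-mark unhold kubeadm kubelet kubectl && \
    sudo apt-get update && \
    sudo apt-get install -y kubeadm=$PKG_VERSION && \
    sudo kubeadm upgrade node && \
    sudo apt-get install -y kubelet=$PKG_VERSION kubectl=$PKG_VERSION && \
    sudo apt-mark hold kubeadm kubelet kubectl && \
    sudo systemctl daemon-reload && \
    sudo systemctl restart kubelet" || return 1
  kubectl uncordon "$node" || return 1
  kubectl wait --for=condition=Ready "node/$node" --timeout=120s
}

# Usage: stop at the first failure so one bad node never becomes two
# for node in worker-node-01 worker-node-02; do upgrade_worker "$node" || break; done
```

The early returns matter more than the automation: a node that fails to drain or upgrade stays cordoned for inspection instead of the loop moving on to the next one.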

Drain timeout: If kubectl drain hangs, a PodDisruptionBudget may be blocking eviction. Check with kubectl get pdb -A. You may need to temporarily patch the PDB to allow eviction, or investigate why the PDB constraint is not satisfiable.
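
To find the PDBs most likely to block a drain, one option is to list those that currently allow zero disruptions. The sketch below reads the status.disruptionsAllowed field of each PodDisruptionBudget; blocking_pdbs is a hypothetical helper name.

```shell
# Sketch: print namespace/name for every PDB that currently allows no evictions.
blocking_pdbs() {
  kubectl get pdb -A --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' \
  | awk '$3 == 0 {print $1 "/" $2}'
}

# Usage: blocking_pdbs
```

Running this before the first drain is cheaper than discovering the problem at the 300-second timeout. A PDB usually shows up here because minAvailable equals the replica count or because some of its pods are already unhealthy.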

Upgrading Managed Clusters (EKS, GKE, AKS)

Managed Kubernetes services abstract away most of the control plane upgrade complexity, but you still control the timing and must handle node group upgrades.

# EKS — upgrade control plane via AWS CLI
aws eks update-cluster-version \
  --name my-cluster \
  --kubernetes-version 1.30 \
  --region us-east-1

# Wait for control plane upgrade to complete
aws eks wait cluster-active --name my-cluster --region us-east-1

# Upgrade node group (creates new nodes, migrates pods, terminates old nodes)
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name production-nodes \
  --kubernetes-version 1.30 \
  --region us-east-1

# GKE: upgrade the control plane (--master)
gcloud container clusters upgrade my-cluster \
  --master \
  --cluster-version 1.30.0-gke.100 \
  --region us-central1

# Configure surge upgrades for the node pool (adds one extra node, takes none offline)
gcloud container node-pools update production-pool \
  --cluster my-cluster \
  --region us-central1 \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0

# Upgrade the node pool to the control plane's version
gcloud container clusters upgrade my-cluster \
  --node-pool production-pool \
  --region us-central1
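
AKS follows the same control-plane-first pattern through the Azure CLI. The sketch below wraps the three calls in a function; aks_upgrade is a hypothetical helper, and the resource group, cluster, and pool names are placeholders you would supply.

```shell
# Sketch: AKS control plane upgrade followed by one node pool.
# Usage: aks_upgrade <resource-group> <cluster> <node-pool> <version>
aks_upgrade() {
  local rg="$1" cluster="$2" pool="$3" version="$4"
  # Show which versions this cluster is allowed to move to
  az aks get-upgrades --resource-group "$rg" --name "$cluster" --output table || return 1
  # Upgrade the control plane only; node pools stay on their current version
  az aks upgrade --resource-group "$rg" --name "$cluster" \
    --kubernetes-version "$version" --control-plane-only --yes || return 1
  # Upgrade a single node pool to the same version
  az aks nodepool upgrade --resource-group "$rg" --cluster-name "$cluster" \
    --name "$pool" --kubernetes-version "$version"
}

# Usage: aks_upgrade my-resource-group my-cluster production 1.30.0
```

Splitting the control plane and node pool steps gives a natural checkpoint for validation in between, mirroring the kubeadm sequence earlier in this guide.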

Upgrading Add-ons and API Deprecations

Core add-ons (CoreDNS, kube-proxy) are upgraded automatically by kubeadm, but third-party add-ons require manual updates. Always update these after the cluster upgrade:

CNI plugin: Check the Calico/Cilium/Flannel compatibility matrix and upgrade the Helm release
Ingress controller: NGINX and Traefik may require API version updates in their Helm charts
cert-manager: Always check for CRD changes between versions
Metrics Server: Upgrade to support the current API version for HPA

# Check deprecated API versions in use (kubent)
kubent --target-version 1.30

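# Example: migrate PodSecurityPolicy (removed in 1.25) to Pod Security Standards
# Find PSP usage (run this while the cluster is still on 1.24 or older;
# the psp resource type no longer exists after that)
kubectl get psp
kubectl get clusterrolebinding | grep psp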

# Update Helm add-ons after cluster upgrade
helm repo update
helm upgrade calico projectcalico/tigera-operator --namespace tigera-operator
helm upgrade ingress-nginx ingress-nginx/ingress-nginx --namespace ingress-nginx

Rollback Procedures

True rollback of a Kubernetes control plane is difficult because etcd stores cluster state in the upgraded format. The practical rollback path is an etcd snapshot restore.

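# Restore etcd from a pre-upgrade snapshot
# Step 1: Stop the API server (on all control plane nodes)
# kubeadm runs kube-apiserver as a static pod, not a systemd unit, so move its
# manifest out of the directory the kubelet watches
sudo mkdir -p /etc/kubernetes/manifests-stopped
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests-stopped/

# Step 2: Restore the etcd snapshot
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-pre-upgrade.db \
  --data-dir=/var/lib/etcd-restore \
  --name master-01 \
  --initial-cluster "master-01=https://10.0.0.1:2380" \
  --initial-cluster-token etcd-cluster-restore \
  --initial-advertise-peer-urls https://10.0.0.1:2380

# Step 3: Update etcd pod manifest to use restored data dir
# (the kubelet restarts etcd when the manifest changes)
sudo sed -i 's|/var/lib/etcd|/var/lib/etcd-restore|' /etc/kubernetes/manifests/etcd.yaml

# Step 4: Downgrade kubeadm/kubelet/kubectl back to the previous version
# (the packages are held, so the downgrade has to be allowed explicitly;
# with pkgs.k8s.io the apt source must point at the v1.29 repository again)
sudo apt-get install -y --allow-downgrades --allow-change-held-packages \
  kubeadm=1.29.0-1.1 kubelet=1.29.0-1.1 kubectl=1.29.0-1.1
sudo systemctl restart kubelet

# Step 5: Start the API server again by putting its manifest back
# (kubeadm keeps the pre-upgrade manifests under /etc/kubernetes/tmp/kubeadm-backup-manifests-*;
# use that copy if the API server should also return to the previous image)
sudo mv /etc/kubernetes/manifests-stopped/kube-apiserver.yaml /etc/kubernetes/manifests/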

For managed clusters: EKS, GKE, and AKS do not support control plane rollback. Node group rollback is possible by reverting to the previous launch template / node pool configuration. This is why testing in staging and having rollback-free deployments (using feature flags) matters more for managed clusters.

Post-Upgrade Validation

After every upgrade, run a structured validation suite before declaring the upgrade complete.

# Node health
kubectl get nodes -o wide

# System pod health
kubectl get pods -n kube-system

# API server health
kubectl cluster-info
kubectl get --raw='/readyz?verbose'   # replaces the deprecated componentstatus API

# Workload health: check all namespaces
# Print deployments whose ready count differs from the desired count
kubectl get deployments -A --no-headers | awk '{split($3, r, "/"); if (r[1] != r[2]) print}'
kubectl get pods -A | grep -v Running | grep -v Completed

# Verify HPAs are functioning (targets should show real values, not <unknown>)
kubectl get hpa -A

# Run a quick smoke test — deploy and delete a test pod
kubectl run upgrade-smoke-test --image=nginx --restart=Never
kubectl wait --for=condition=Ready pod/upgrade-smoke-test --timeout=60s
kubectl delete pod upgrade-smoke-test
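
If the validation runs from a pipeline or runbook script, it helps to have a check that fails loudly rather than relying on someone reading the output. The sketch below covers only the node portion: every node must be Ready and report the target kubelet version. validate_upgrade is a hypothetical helper, and it relies on the default column layout of kubectl get nodes (NAME, STATUS, ROLES, AGE, VERSION).

```shell
# Sketch: exit non-zero if any node is not Ready or not on the target version.
# A cordoned node reports "Ready,SchedulingDisabled" and is flagged too,
# which catches a forgotten uncordon.
# Usage: validate_upgrade v1.30.0
validate_upgrade() {
  local target="$1" bad
  bad="$(kubectl get nodes --no-headers \
    | awk -v t="$target" '$2 != "Ready" || $5 != t {print $1}')"
  if [ -n "$bad" ]; then
    echo "nodes failing validation: $bad" >&2
    return 1
  fi
  echo "all nodes Ready at $target"
}
```

The same pattern extends to the other checks in this section: turn each one into a function that returns non-zero on failure, then chain them with && so the first problem stops the run.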