Kubernetes Debugging: kubectl Techniques and Common Issues

Debugging Kubernetes workloads efficiently requires knowing which kubectl commands to reach for first and understanding what the output means. Most production incidents trace back to one of a handful of root causes: resource exhaustion, image pull failures, misconfigured probes, networking problems, or scheduling constraints. This guide gives you a systematic debugging toolkit — from diagnosing a stuck pod to tracing network connectivity failures between services.

Debugging Workflow Overview
Diagnosing CrashLoopBackOff
Fixing Pending Pods
Investigating OOMKill
Debugging Network Connectivity
kubectl debug: Ephemeral Containers
Diagnosing Slow or Stuck Deployments
Essential kubectl One-Liners

Debugging Workflow Overview

A consistent debugging order prevents wasted time. When something is broken, follow this sequence:

Check pod status: kubectl get pods -n <ns>. Identify which pod is failing and what status it reports.
Describe the pod: kubectl describe pod <name> -n <ns>. The Events section at the bottom usually shows the most recent cause of failure.
Read logs: kubectl logs <pod> -n <ns> --previous. The --previous flag reads logs from the previous (terminated) container instance rather than the current one; omit it if the container has not restarted.
Check resource usage: kubectl top pod <name> -n <ns>. This requires metrics-server to be installed in the cluster.
Inspect the node: kubectl describe node <node-name>. Check node conditions and allocated resources.
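
The sequence above can be wrapped in a small shell function so the steps always run in the same order. This is a minimal sketch: the function name triage_pod and the section banners are illustrative, and it assumes kubectl is already pointed at the right cluster.

```shell
# triage_pod NAMESPACE POD
# Runs the first-look commands in the order listed above.
# The function name and banners are illustrative, not part of kubectl.
triage_pod() (
  ns="$1"
  pod="$2"
  if [ -z "$ns" ] || [ -z "$pod" ]; then
    echo "usage: triage_pod <namespace> <pod>" >&2
    exit 2
  fi

  echo "== 1. status =="
  kubectl get pod "$pod" -n "$ns" -o wide

  echo "== 2. describe (read the Events section) =="
  kubectl describe pod "$pod" -n "$ns"

  echo "== 3. logs from the previous container instance =="
  kubectl logs "$pod" -n "$ns" --previous --tail=50 ||
    echo "(no previous instance; try again without --previous)"

  echo "== 4. resource usage (needs metrics-server) =="
  kubectl top pod "$pod" -n "$ns" || echo "(metrics unavailable)"

  echo "== 5. node =="
  # Look up which node the pod was scheduled to, if any.
  node=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')
  if [ -n "$node" ]; then
    kubectl describe node "$node"
  else
    echo "(pod is not scheduled to a node yet)"
  fi
)
```

Calling triage_pod production my-pod prints all five sections in one pass. The body runs in a subshell, so its variables do not leak into your interactive session.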

# Quick cluster health check
kubectl get nodes -o wide
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Most useful single command when something is broken
kubectl get events -n production --sort-by='.lastTimestamp' | tail -30

Events are time-limited: by default the API server keeps events for 1 hour (the kube-apiserver --event-ttl setting). If a pod crashed hours ago, its events may already be gone. Fall back to kubectl logs --previous, which works as long as the terminated container still exists on the node, and to node-level logs (journalctl -u kubelet).
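
Because events age out, one option is to save them to a file as soon as an incident starts. The sketch below is one way to do that; the function name snapshot_events and the file naming scheme are assumptions made for illustration.

```shell
# snapshot_events NAMESPACE [DIR]
# Saves the namespace's current events as JSON so they outlive the event TTL.
# Prints the path of the file it wrote. Name and file layout are illustrative.
snapshot_events() (
  ns="$1"
  dir="${2:-.}"
  [ -n "$ns" ] || { echo "usage: snapshot_events <namespace> [dir]" >&2; exit 2; }
  mkdir -p "$dir" || exit 1
  file="$dir/events-$ns-$(date -u +%Y%m%dT%H%M%SZ).json"
  if kubectl get events -n "$ns" -o json > "$file"; then
    echo "$file"
  else
    # Do not leave an empty or partial file behind on failure.
    rm -f "$file"
    exit 1
  fi
)
```

For example, snapshot_events production ./incident writes a timestamped JSON file into ./incident that can be inspected later, after the live events have expired.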

Diagnosing CrashLoopBackOff

CrashLoopBackOff means the container started, crashed, and Kubernetes is waiting before restarting it (with exponential backoff). The root cause is usually visible in the logs of the crashed container or in its exit code.

# Get logs from the previous (crashed) container instance
kubectl logs my-pod -n production --previous

# If the pod has multiple containers, specify the container
kubectl logs my-pod -n production -c app --previous

# Describe to see exit code and reason
kubectl describe pod my-pod -n production
# Look for:
# Last State:   Terminated
#   Reason:     Error
#   Exit Code:  1
#   Started:    Tue, 16 Jun 2026 10:00:00
#   Finished:   Tue, 16 Jun 2026 10:00:02

Common exit codes and their meanings:

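Exit 1: application error. Check the application logs for the actual exception or error message.
Exit 137: killed by signal 9 (SIGKILL), since 137 = 128 + 9. Usually an OOMKill (the container exceeded its memory limit), but also what you see when a container is force-killed after ignoring SIGTERM for the whole terminationGracePeriodSeconds.
Exit 143: terminated by signal 15 (SIGTERM), since 143 = 128 + 15. The process was asked to stop (pod deletion, eviction, or a failed liveness probe) and exited on the signal without handling it.
Exit 126: the command was found but could not be executed, typically because of a missing execute permission.
Exit 127: the command was not found. The container's command or args are wrong, or the binary is missing from the image.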
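
The exit codes listed above follow the shell convention that a code above 128 means 128 plus a signal number. A small lookup function makes that explicit; this is a sketch, and the function name explain_exit_code and its message wording are illustrative.

```shell
# explain_exit_code CODE
# Maps a container exit code to a short explanation.
# Codes above 128 are decoded as 128 + signal number.
explain_exit_code() (
  code="$1"
  case "$code" in
    ''|*[!0-9]*)
      echo "usage: explain_exit_code <non-negative integer>" >&2
      exit 2
      ;;
  esac
  case "$code" in
    0)   echo "exit 0: completed successfully" ;;
    1)   echo "exit 1: application error; read the container logs" ;;
    126) echo "exit 126: command found but not executable" ;;
    127) echo "exit 127: command not found" ;;
    137) echo "exit 137: SIGKILL (signal 9); check for OOMKilled" ;;
    143) echo "exit 143: SIGTERM (signal 15)" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "exit $code: terminated by signal $((code - 128))"
      else
        echo "exit $code: application-defined error"
      fi
      ;;
  esac
)
```

The code to feed in can be read from the pod status, for example with kubectl get pod my-pod -n production -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}' for the first container.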

# If the container crashes too fast to exec into, override the command
kubectl run debug-pod --image=my-app:latest \
  --restart=Never \
  --command -- sleep 3600
# Then exec in and manually run the application
kubectl exec -it debug-pod -- /bin/sh

Fixing Pending Pods

A pod stuck in Pending has not been scheduled to any node. The Events section of kubectl describe pod usually tells you why. The most common causes are insufficient resources, node selector mismatches, taints without tolerations, and PVC binding failures.

kubectl describe pod my-pending-pod -n production
# Events section examples:

# Insufficient CPU/memory:
# Warning  FailedScheduling  0/5 nodes are available:
#   5 Insufficient cpu.

# Node selector mismatch:
# Warning  FailedScheduling  0/5 nodes are available:
#   5 node(s) didn't match Pod's node affinity/selector

# PVC not bound:
# Warning  FailedScheduling  0/5 nodes are available:
#   5 pod has unbound immediate PersistentVolumeClaims

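# Check how much of each node's allocatable CPU and memory is already requested
kubectl describe nodes | grep -A5 "Allocated resources"

# Check if any nodes have relevant taints (prints each node name with its taints)
kubectl get nodes -o json | jq '.items[] | {node: .metadata.name, taints: .spec.taints}'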

# Check PVC status
kubectl get pvc -n production
# STATUS should be Bound; if Pending, the StorageClass may not exist or have no provisioner

# Check available StorageClasses
kubectl get storageclass
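
The scheduling messages in this section map fairly directly to a next step, which the following sketch encodes. It reads kubectl describe pod output on standard input. The function name pending_hint and the hint wording are illustrative, and matching on the word "taint" is an assumption about how the scheduler phrases taint failures, since this section does not quote that message.

```shell
# pending_hint: reads `kubectl describe pod` output on stdin and prints one
# suggested next step per kind of scheduling message it recognizes.
pending_hint() {
  awk '
    /Insufficient (cpu|memory)/                { res = 1 }
    /node affinity\/selector/                  { sel = 1 }
    /taint/                                    { taint = 1 }   # assumed wording
    /unbound immediate PersistentVolumeClaims/ { pvc = 1 }
    END {
      if (res)   print "resources: compare pod requests with node Allocated resources"
      if (sel)   print "placement: compare nodeSelector/affinity with node labels"
      if (taint) print "taints: add a toleration or remove the taint"
      if (pvc)   print "storage: check PVC status and StorageClass"
      if (!(res || sel || taint || pvc)) print "no known scheduling message found"
    }
  '
}
```

A typical call is kubectl describe pod my-pending-pod -n production | pending_hint. Each hint is printed once even when the same event repeats.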

Investigating OOMKill

OOMKill (Out of Memory Kill) happens when a container's memory usage reaches its resources.limits.memory value. The Linux kernel's OOM killer sends SIGKILL to the process, and Kubernetes records the termination with reason OOMKilled and exit code 137. Repeated OOMKills cause CrashLoopBackOff.

# Check if a pod was OOMKilled
kubectl describe pod my-pod -n production | grep -A5 "Last State"
# Last State:  Terminated
#   Reason:    OOMKilled
#   Exit Code: 137

# Check current memory usage vs limits
kubectl top pod my-pod -n production --containers

# Get memory limits for all pods in a namespace
kubectl get pods -n production -o json | \
  jq '.items[] | {name: .metadata.name, limits: .spec.containers[].resources.limits}'

To fix OOMKills, either increase the memory limit or reduce the application's memory usage. For JVM applications, set -Xmx to about 75% of the container memory limit to leave room for non-heap memory such as metaspace, thread stacks, and direct buffers. For Node.js apps, set --max-old-space-size (a value in MiB) below the limit by a similar margin.

# Increase memory limits in the deployment
resources:
  requests:
    memory: 512Mi
  limits:
    memory: 1Gi   # Was 512Mi — increase to give headroom
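
The 75% rule of thumb is simple arithmetic on the limit. As a sketch, the helper below turns a memory limit in MiB into a suggested heap size; the function name suggest_heap_mib is illustrative, and app.jar and server.js in the comments are hypothetical file names.

```shell
# suggest_heap_mib LIMIT_MIB [PERCENT]
# Prints a heap size in MiB as a percentage of the container memory limit.
# PERCENT defaults to 75; integer division rounds down.
suggest_heap_mib() (
  limit="$1"
  pct="${2:-75}"
  for value in "$limit" "$pct"; do
    case "$value" in
      ''|*[!0-9]*)
        echo "usage: suggest_heap_mib <limit-MiB> [percent]" >&2
        exit 2
        ;;
    esac
  done
  echo $(( limit * pct / 100 ))
)

# Example use with a 1Gi (1024 MiB) limit:
#   java -Xmx"$(suggest_heap_mib 1024)"m -jar app.jar
#   node --max-old-space-size="$(suggest_heap_mib 1024)" server.js
```

With the 1Gi limit shown above, the helper prints 768, so the JVM flag becomes -Xmx768m.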

Debugging Network Connectivity

Network issues in Kubernetes typically manifest as connection refused, connection timeout, or DNS resolution failures between pods or services. Systematic isolation narrows the root cause quickly.

# Test DNS resolution from within the cluster
kubectl run dnstest --image=busybox:1.28 --restart=Never --rm -it \
  -- nslookup my-service.my-namespace.svc.cluster.local

# Test TCP connectivity to a service
kubectl run nettest --image=nicolaka/netshoot --restart=Never --rm -it \
  -- curl -v http://my-service.production.svc.cluster.local:8080/health

# Check if a service has Endpoints (if empty, no Ready pod matches the selector)
kubectl get endpoints my-service -n production
# NAME         ENDPOINTS                   AGE
# my-service   10.0.1.5:8080,10.0.1.6:8080  5d

# If ENDPOINTS is <none>, either the pod labels don't match the service selector or the matching pods are not Ready
kubectl get svc my-service -n production -o yaml | grep selector -A5
kubectl get pods -n production --show-labels
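
Rather than comparing the selector and the labels by eye, you can feed the Service's selector straight back into a pod query. The sketch below does that with a go-template; the function name pods_for_service is illustrative, and it assumes the Service uses a plain label selector.

```shell
# pods_for_service NAMESPACE SERVICE
# Prints the Service's selector, then lists the pods that selector matches.
# An empty pod list means the labels and the selector disagree.
pods_for_service() (
  ns="$1"
  svc="$2"
  if [ -z "$ns" ] || [ -z "$svc" ]; then
    echo "usage: pods_for_service <namespace> <service>" >&2
    exit 2
  fi
  # Render .spec.selector as key=value pairs, each followed by a comma.
  sel=$(kubectl get svc "$svc" -n "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}') || exit 1
  sel=${sel%,}   # drop the trailing comma
  if [ -z "$sel" ]; then
    echo "service $svc has no selector" >&2
    exit 1
  fi
  echo "selector: $sel"
  kubectl get pods -n "$ns" -l "$sel" -o wide
)
```

If pods_for_service production my-service lists pods that are not Ready, the selector is fine and the readiness probe is the next thing to check.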

Network Policy check: If a Service has correct endpoints but pods still cannot reach each other, check NetworkPolicies. Once a policy selects a pod for ingress, only traffic that some policy explicitly allows is accepted, so a policy with no ingress rules blocks all inbound traffic to that pod. Use kubectl get networkpolicies -n production to list active policies.

kubectl debug: Ephemeral Containers

Ephemeral containers, used through kubectl debug, are enabled by default from Kubernetes 1.23 (beta) and stable from 1.25. The command injects a temporary debug container into a running pod without restarting it, which is invaluable for debugging minimal production images (like distroless) that have no shell or debugging tools.

# Inject a debug container into a running pod
kubectl debug -it my-pod -n production \
  --image=nicolaka/netshoot \
  --target=app-container

# With --target, the debug container shares the process namespace of the target
# container (if the container runtime supports it). You can inspect the app's
# running processes and network, and usually its filesystem via /proc/<pid>/root.

# Debug a node by running a pod in the node's host namespaces
kubectl debug node/my-node-01 -it --image=ubuntu
# This mounts the node filesystem at /host

# Copy a crashing pod and override its command
kubectl debug my-crashing-pod -n production \
  --copy-to=debug-copy \
  --container=app \
  -- sleep 3600

Diagnosing Slow or Stuck Deployments

A rolling deployment that is not progressing is often caused by failing readiness probes, insufficient cluster resources or quota preventing new pods from scheduling, or a misconfigured update strategy.

# Check rollout status
kubectl rollout status deployment/my-app -n production

# If stuck, check the deployment events
kubectl describe deployment my-app -n production

# Check if new pods are created but not becoming Ready
kubectl get pods -n production -l app=my-app -w

# Readiness probe failures show up in pod Events
kubectl describe pod my-app-new-pod-xxx -n production
# Warning  Unhealthy  Readiness probe failed: HTTP probe failed with statuscode: 500

# Pause a bad rollout to stop it spreading
kubectl rollout pause deployment/my-app -n production

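# Roll back to the previous version (if the rollout is paused, run
# kubectl rollout resume first; undo is rejected on a paused deployment)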
kubectl rollout undo deployment/my-app -n production

# Roll back to a specific revision
kubectl rollout history deployment/my-app -n production
kubectl rollout undo deployment/my-app --to-revision=3 -n production
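
The status check and the rollback can be combined so that a rollout that does not finish in time is undone automatically. This is a sketch under assumptions: the function name rollout_or_undo and the 120s default are illustrative, and it relies on kubectl rollout status returning a non-zero exit code when its --timeout expires.

```shell
# rollout_or_undo NAMESPACE DEPLOYMENT [TIMEOUT]
# Waits for the rollout to finish; if it does not finish within TIMEOUT
# (default 120s, an arbitrary choice), rolls back to the previous revision.
rollout_or_undo() (
  ns="$1"
  deploy="$2"
  timeout="${3:-120s}"
  if [ -z "$ns" ] || [ -z "$deploy" ]; then
    echo "usage: rollout_or_undo <namespace> <deployment> [timeout]" >&2
    exit 2
  fi
  if kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"; then
    echo "rollout of $deploy completed"
  else
    echo "rollout of $deploy did not complete within $timeout; rolling back" >&2
    kubectl rollout undo "deployment/$deploy" -n "$ns"
  fi
)
```

A call such as rollout_or_undo production my-app 5m fits naturally at the end of a deploy script. Pick the timeout from how long a healthy rollout of that deployment normally takes.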

Essential kubectl One-Liners

A collection of high-value kubectl commands that experienced operators reach for daily:

# All pods whose phase is neither Running nor Succeeded, across the cluster
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Pods sorted by the first container's restart count (ascending, so the highest counts are the last lines)
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount' | tail -20

# Per-container resource requests and limits for a namespace
kubectl get pods -n production -o json | \
  jq '.items[] | .metadata.name as $pod | .spec.containers[] | {pod: $pod, container: .name, requests: .resources.requests, limits: .resources.limits}'

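# Force delete a stuck Terminating pod (last resort: removes the API object
# without waiting for the kubelet to confirm the containers have stopped)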
kubectl delete pod my-pod -n production --grace-period=0 --force

# Watch events in real time (--sort-by is ignored when watching, so it is omitted)
kubectl get events -n production -w

# Get the image used by each container in a deployment
kubectl get deployment my-app -n production \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'

# Port-forward to a pod for local testing
kubectl port-forward pod/my-pod 8080:8080 -n production

# Execute a one-off command in a pod
kubectl exec my-pod -n production -- env | sort

# Copy a file from a pod (swap the arguments to copy into the pod; requires tar in the container image)
kubectl cp production/my-pod:/app/logs/app.log ./app.log
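
One gap in the phase-based selector above is that a pod in CrashLoopBackOff typically still reports phase Running, so it is not listed. Filtering on the RESTARTS column catches those pods. The sketch below assumes the default column order of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the function name restarting_pods is illustrative.

```shell
# restarting_pods [MIN]
# Lists pods in all namespaces with at least MIN restarts (default 1),
# highest count first, as: <restarts> <namespace>/<pod> <status>
restarting_pods() (
  min="${1:-1}"
  kubectl get pods -A --no-headers |
    # $5 is RESTARTS; adding 0 forces a numeric comparison.
    awk -v min="$min" '$5 + 0 >= min + 0 { print $5 + 0, $1 "/" $2, $4 }' |
    sort -rn
)
```

Running restarting_pods 5 shows only pods that have restarted five or more times, which is usually a short list worth reading in full.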