Kubernetes HPA and VPA: Auto Scaling Your Applications (2026)
The Kubernetes ecosystem offers three complementary pod autoscalers: the Horizontal Pod Autoscaler (HPA), built into Kubernetes, scales replica count; the Vertical Pod Autoscaler (VPA), an add-on, adjusts CPU/memory requests; and KEDA, also an add-on, enables event-driven scaling from queues and other external metric sources. Combined with Cluster Autoscaler, which scales the nodes underneath them, they let both workloads and infrastructure follow demand. This guide covers each of them with example configurations you can adapt for production.
HPA v2: CPU and Memory Metrics
The autoscaling/v2 API (stable since Kubernetes 1.23) supports multiple metrics per HPA, including CPU, memory, and external/custom metrics. Always use v2 — the older v1 API only supports CPU and has no behavior tuning.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    # Scale on CPU utilization
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # Target 60% of request
    # Also scale on memory
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue
          averageValue: 400Mi      # Target 400Mi per pod
# Check HPA status and current metrics
kubectl get hpa my-app-hpa -n production
# Detailed output showing current vs target
kubectl describe hpa my-app-hpa -n production
# Watch HPA in real-time during a load test
kubectl get hpa -n production --watch
HPA calculates the desired replica count with the formula:
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
With 4 replicas at 80% CPU average and a 60% target: ceil(4 * 80/60) = ceil(5.33) = 6 replicas.
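The arithmetic above can be reproduced with a short Python sketch. It is a simplified model, not the controller's code: the function names are hypothetical, the 10% tolerance band is assumed to be the cluster default, and details such as not-ready pods and missing metrics are ignored. When an HPA lists several metrics, the largest proposal wins.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica count the HPA formula yields for one metric (simplified)."""
    ratio = current_value / target_value
    # Assumed default: inside the tolerance band the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def desired_for_all_metrics(current_replicas, metrics):
    """With several metrics, take the largest per-metric proposal."""
    return max(desired_replicas(current_replicas, cur, tgt) for cur, tgt in metrics)


cpu_only = desired_replicas(4, 80, 60)                            # CPU at 80% vs 60% target
within_band = desired_replicas(4, 63, 60)                         # only 5% over target
cpu_and_memory = desired_for_all_metrics(4, [(80, 60), (700, 400)])  # memory 700Mi vs 400Mi

print(cpu_only, within_band, cpu_and_memory)  # 6 4 7
```

In the last call memory asks for 7 replicas and CPU for 6, so the HPA would go to 7.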
Custom Metrics with Prometheus Adapter
The Prometheus Adapter bridges Prometheus metrics into the Kubernetes custom metrics API, allowing HPA to scale on application-specific metrics like request rate, queue depth, or error rate.
# Install Prometheus Adapter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
-n monitoring \
--set prometheus.url=http://prometheus-operated.monitoring.svc \
--set prometheus.port=9090
Configure the adapter to expose your custom metric:
# values.yaml for prometheus-adapter
rules:
  custom:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
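To see which metric name the `name` rule above produces, the rename can be mimicked in Python. This is only an illustration of the regex substitution: the adapter itself uses Go regular expressions (hence `${1}` in the config, where Python writes `\1`), and `exposed_metric_name` is a hypothetical helper.

```python
import re
from typing import Optional


def exposed_metric_name(series: str,
                        matches: str = r"^(.*)_total$",
                        as_template: str = r"\1_per_second") -> Optional[str]:
    """Return the custom-metric name for a Prometheus series, or None if the rule does not match."""
    if re.match(matches, series) is None:
        return None
    return re.sub(matches, as_template, series)


print(exposed_metric_name("http_requests_total"))            # http_requests_per_second
print(exposed_metric_name("http_request_duration_seconds"))  # None
```

The first result is the name an HPA would reference in its `metric.name` field.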
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa-custom
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"   # Target 100 req/s per pod
# Verify the custom metric is available
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq '.resources[].name'
# Check the metric value for your pods
kubectl get --raw \
"/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" \
| jq .
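For a `Pods` metric with an `AverageValue` target, the replica math reduces to total load divided by the per-pod target. The sketch below assumes three pods with made-up request rates and ignores the tolerance band; `replicas_for_pods_metric` is a hypothetical name.

```python
import math


def replicas_for_pods_metric(per_pod_values, target_average):
    """Desired replicas when the HPA averages a per-pod metric across current pods."""
    current = len(per_pod_values)
    average = sum(per_pod_values) / current
    return math.ceil(current * average / target_average)


# Three pods serving 180, 150 and 120 req/s (450 total) against a 100 req/s target.
print(replicas_for_pods_metric([180, 150, 120], 100))  # 5
```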
KEDA: Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with 60+ built-in scalers including SQS, Kafka, RabbitMQ, Redis, Datadog, and more. Critically, KEDA can scale to zero, something native HPA cannot do by default (minReplicas must be at least 1 unless the alpha HPAScaleToZero feature gate is enabled).
# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda \
--namespace keda \
--create-namespace
Scale a worker deployment based on SQS queue depth:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: sqs-worker
  minReplicaCount: 0     # Scale to zero when queue is empty
  maxReplicaCount: 30
  pollingInterval: 15    # Check queue every 15 seconds
  cooldownPeriod: 60     # Wait 60s after the last active trigger before scaling to zero
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-aws-credentials
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/my-queue
        queueLength: "10"        # Target: 1 worker per 10 messages
        awsRegion: us-east-1
        identityOwner: operator  # Use the KEDA operator's IAM identity (e.g. via IRSA)
# TriggerAuthentication for AWS IRSA
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-aws-credentials
  namespace: production
spec:
  podIdentity:
    provider: aws
    # Uses an IAM role via IRSA (the KEDA operator's by default); no static credentials needed
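As a rough mental model of the ScaledObject above, the replica count follows queue depth divided by `queueLength`, clamped to the configured bounds. The sketch below is an approximation under stated assumptions: it ignores timing (`pollingInterval`, `cooldownPeriod`) and in-flight messages, and `sqs_worker_replicas` is a hypothetical helper, not a KEDA API.

```python
import math


def sqs_worker_replicas(visible_messages, queue_length_target=10,
                        min_replicas=0, max_replicas=30):
    """Approximate replica count for a queue-depth trigger."""
    if visible_messages <= 0:
        return min_replicas  # empty queue: fall back to the floor (zero here)
    wanted = math.ceil(visible_messages / queue_length_target)
    # Any backlog needs at least one worker, and never more than the ceiling.
    return min(max(wanted, min_replicas, 1), max_replicas)


for depth in (0, 1, 95, 1000):
    print(depth, sqs_worker_replicas(depth))
# 0 0
# 1 1
# 95 10
# 1000 30
```

KEDA itself handles the step between 0 and 1 replica; between 1 and `maxReplicaCount` it delegates to the HPA it creates.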
KEDA also supports scaling on Kafka consumer lag:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: kafka-consumer
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.production.svc:9092
        consumerGroup: my-consumer-group
        topic: orders
        lagThreshold: "50"   # Target average lag per replica (about 50 messages each)
        offsetResetPolicy: latest
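The effect of `lagThreshold` can be sketched the same way: total consumer lag divided by the threshold gives the wanted replica count. The sketch assumes KEDA's default behavior of not scaling beyond the topic's partition count, since extra consumers in the group would sit idle. The partition count of 12 is an invented example value and `kafka_consumer_replicas` is a hypothetical helper.

```python
import math


def kafka_consumer_replicas(total_lag, partitions, lag_threshold=50,
                            min_replicas=1, max_replicas=20):
    """Approximate replica count for a consumer-lag trigger."""
    wanted = math.ceil(total_lag / lag_threshold)
    wanted = min(wanted, partitions)  # assumed default: no more consumers than partitions
    return max(min_replicas, min(wanted, max_replicas))


print(kafka_consumer_replicas(0, partitions=12))     # 1  (minReplicaCount floor)
print(kafka_consumer_replicas(420, partitions=12))   # 9
print(kafka_consumer_replicas(5000, partitions=12))  # 12 (capped by partitions)
```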
VPA: Vertical Pod Autoscaler Modes
VPA recommends and optionally applies right-sized CPU/memory requests to pods. It has four operating modes that control how aggressively it applies changes.
# Install VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
| VPA Mode | Behavior | Use Case |
|---|---|---|
| Off | Compute recommendations only, apply nothing | Observability / right-sizing analysis |
| Initial | Apply recommendations only to new pods at creation time | Safe production use without live evictions |
| Recreate | Evict pods when the recommendation differs significantly from current requests | StatefulSets, or workloads that can tolerate disruption |
| Auto | Currently the same as Recreate; may use in-place updates in the future | Dev/staging environments |
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Initial"   # Safest mode for production
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 4Gi
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits
# Check VPA recommendations
kubectl describe vpa my-app-vpa -n production
# Sample output
# Recommendation:
# Container Recommendations:
# Container Name: app
# Lower Bound:
# Cpu: 120m
# Memory: 256Mi
# Target:
# Cpu: 250m
# Memory: 512Mi
# Upper Bound:
# Cpu: 1200m
# Memory: 2Gi
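How a recommendation turns into applied values can be sketched as two steps: clamp the target to `minAllowed`/`maxAllowed`, then, with `controlledValues: RequestsAndLimits`, scale the limit so the original limit-to-request ratio is preserved. The starting request and limit (100m / 200m) are invented for illustration, values are CPU millicores, and `apply_vpa_cpu` is a hypothetical helper.

```python
def clamp(value, low, high):
    """Keep value inside [low, high]."""
    return max(low, min(value, high))


def apply_vpa_cpu(recommended_m, old_request_m, old_limit_m,
                  min_allowed_m=100, max_allowed_m=4000):
    """Return (new_request, new_limit) in millicores for one container."""
    new_request = clamp(recommended_m, min_allowed_m, max_allowed_m)
    # RequestsAndLimits: the limit moves proportionally with the request.
    new_limit = new_request * old_limit_m // old_request_m
    return new_request, new_limit


print(apply_vpa_cpu(250, old_request_m=100, old_limit_m=200))   # (250, 500)
print(apply_vpa_cpu(6000, old_request_m=100, old_limit_m=200))  # (4000, 8000)
print(apply_vpa_cpu(40, old_request_m=100, old_limit_m=200))    # (100, 200)
```

The first call uses the 250m target from the sample output; the other two show the policy bounds taking effect.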
Cluster Autoscaler
Cluster Autoscaler (CA) adds or removes nodes when pods are unschedulable due to insufficient resources or when nodes are underutilized. It works at the node group level (AWS Auto Scaling Groups, GCP Managed Instance Groups, etc.).
# cluster-autoscaler Deployment (AWS EKS example, abridged)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler   # Matches the -l app=cluster-autoscaler selector used for logs
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste   # Pick the node group that leaves the least idle CPU/memory (use "price" for cheapest)
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
            - --balance-similar-node-groups
            - --scale-down-enabled=true
            - --scale-down-utilization-threshold=0.5
            - --scale-down-unneeded-time=10m
            - --scale-down-delay-after-add=10m
# Check CA logs for scaling decisions
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50
# Check node autoscaler events
kubectl get events -n kube-system | grep cluster-autoscaler
# Annotate a node to prevent CA from removing it
kubectl annotate node my-critical-node \
cluster-autoscaler.kubernetes.io/scale-down-disabled=true
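The scale-down flags and the annotation above interact roughly as follows. This is a simplified sketch with invented node numbers: utilization is taken as requested over allocatable resources (the larger of CPU and memory), and the real autoscaler additionally checks that every pod on the node can be rescheduled elsewhere and that PodDisruptionBudgets allow it. Both function names are hypothetical.

```python
def node_utilization(requested_cpu_m, allocatable_cpu_m,
                     requested_mem_mi, allocatable_mem_mi):
    """Utilization based on pod requests, not live usage."""
    return max(requested_cpu_m / allocatable_cpu_m,
               requested_mem_mi / allocatable_mem_mi)


def is_scale_down_candidate(utilization, unneeded_minutes,
                            scale_down_disabled=False,
                            threshold=0.5, unneeded_time_minutes=10):
    """Mirror --scale-down-utilization-threshold and --scale-down-unneeded-time."""
    if scale_down_disabled:  # the scale-down-disabled=true annotation
        return False
    return utilization < threshold and unneeded_minutes >= unneeded_time_minutes


util = node_utilization(1200, 4000, 6144, 16384)
print(util)                                                        # 0.375
print(is_scale_down_candidate(util, unneeded_minutes=5))           # False
print(is_scale_down_candidate(util, unneeded_minutes=10))          # True
print(is_scale_down_candidate(util, 10, scale_down_disabled=True)) # False
```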
HPA Behavior Tuning
The behavior field in HPA v2 lets you control scale-up and scale-down rates independently, preventing thrashing and protecting against sudden traffic spikes.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa-tuned
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up immediately
      policies:
        - type: Percent
          value: 100                    # Double replicas at most per period
          periodSeconds: 60
        - type: Pods
          value: 4                      # Or add at most 4 pods per period
          periodSeconds: 60
      selectPolicy: Max                 # Use whichever allows more pods
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25                     # Remove at most 25% of replicas
          periodSeconds: 60
      selectPolicy: Min                 # Use the most conservative policy
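The policies in this manifest cap how far the replica count may move within one 60-second period. A small sketch of those caps follows, assuming the replica count at the start of the period equals the current count and using simple rounding that may differ from the controller at fractional values. Both function names are hypothetical.

```python
import math


def scale_up_limit(current, percent=100, pods=4):
    """Highest replica count reachable in one period with selectPolicy: Max."""
    by_percent = math.ceil(current * (1 + percent / 100))
    by_pods = current + pods
    return max(by_percent, by_pods)


def scale_down_limit(current, percent=25):
    """Lowest replica count reachable in one period with a single Percent policy."""
    return int(current * (1 - percent / 100))


print(scale_up_limit(2))     # 6   (the Pods policy wins: 2 + 4)
print(scale_up_limit(10))    # 20  (the Percent policy wins: 10 * 2)
print(scale_down_limit(20))  # 15
```

At small replica counts the `Pods` policy is the more permissive one; past 4 replicas the `Percent` policy takes over. The result is still bounded by `minReplicas` and `maxReplicas`.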
The stabilizationWindowSeconds for scale-down prevents flapping: HPA looks at the maximum desired replica count over the window and uses that. This means a spike that resolves quickly won't immediately trigger scale-down.
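The window logic can be illustrated with a toy history of recommendations. The timestamps and replica counts below are invented, and `stabilized_scale_down` is a hypothetical helper that models only the "take the maximum over the window" step.

```python
def stabilized_scale_down(history, now, current_recommendation, window_seconds=300):
    """history: list of (timestamp_seconds, desired_replicas) from earlier HPA syncs."""
    recent = [replicas for ts, replicas in history if now - ts <= window_seconds]
    # The highest recommendation inside the window wins, so brief dips are ignored.
    return max([current_recommendation] + recent)


# A spike needed 12 replicas until t=60s; afterwards the raw recommendation drops.
history = [(0, 12), (60, 12), (120, 5), (180, 4)]

print(stabilized_scale_down(history, now=240, current_recommendation=4))  # 12
print(stabilized_scale_down(history, now=365, current_recommendation=4))  # 5
print(stabilized_scale_down(history, now=430, current_recommendation=4))  # 4
```

The replica count only starts to fall once the spike's recommendations have aged out of the 300-second window.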
Combining HPA + Cluster Autoscaler
HPA and Cluster Autoscaler are designed to work together: HPA requests more pods, the scheduler finds no capacity, Cluster Autoscaler adds a node, then the pods schedule. Key settings to make this work smoothly:
# PodDisruptionBudget: ensure CA doesn't remove nodes with critical pods
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: production
spec:
  minAvailable: 2   # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: my-app
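One interaction worth checking with a budget like this: the number of voluntary evictions allowed is the healthy pod count minus `minAvailable`. If the HPA floor is also 2 replicas, as in the earlier HPA examples, no eviction is allowed at minimum scale, so Cluster Autoscaler cannot drain the nodes hosting those pods. A minimal sketch of that arithmetic, with `allowed_disruptions` as a hypothetical helper:

```python
def allowed_disruptions(healthy_pods, min_available=2):
    """Voluntary evictions a minAvailable budget permits right now."""
    return max(0, healthy_pods - min_available)


print(allowed_disruptions(2))  # 0  (at the HPA floor, nothing can be evicted)
print(allowed_disruptions(5))  # 3
```

If node consolidation at low traffic matters, a `minAvailable` below the HPA's `minReplicas` is one option to consider.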
# Ensure pods spread across nodes for CA to add new ones (not just fill existing)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-app
                topologyKey: kubernetes.io/hostname
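The `maxSkew: 1` constraint can be read as: after placing the pod, the chosen node may hold at most one more matching pod than the least-loaded eligible node. The sketch below models only that check, with invented node names and pod counts; the scheduler also considers resources, taints, and which nodes are eligible. `can_schedule_on` is a hypothetical helper.

```python
def can_schedule_on(node, pods_per_node, max_skew=1):
    """True if placing one more matching pod on `node` keeps skew within max_skew."""
    count_after = pods_per_node[node] + 1
    global_minimum = min(pods_per_node.values())
    return count_after - global_minimum <= max_skew


counts = {"node-a": 2, "node-b": 1, "node-c": 1}
print(can_schedule_on("node-a", counts))  # False (3 - 1 = 2 exceeds maxSkew)
print(can_schedule_on("node-b", counts))  # True  (2 - 1 = 1)
```

With `DoNotSchedule`, a pod that fits on no permitted node stays Pending, which is the signal Cluster Autoscaler acts on.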
HPA vs VPA vs KEDA Comparison
| Feature | HPA | VPA | KEDA |
|---|---|---|---|
| Scales | Replica count | CPU/memory requests | Replica count (including 0) |
| Metrics source | CPU, memory, custom, external | Historical CPU/memory usage | 60+ event sources (SQS, Kafka, Redis…) |
| Scale to zero | No (min 1) | N/A | Yes |
| Pod restart needed | No | Yes (unless in-place resize is available) | No |
| Best for | Stateless web/API services | Right-sizing any workload | Queue workers, batch jobs |
| Conflict with HPA | — | Yes, on same metrics | No (replaces HPA internally) |
| Latency to scale | 15–30s (metrics lag) | Seconds–minutes | 15s (configurable) |
Frequently Asked Questions
Why is my HPA stuck at minimum replicas despite high CPU?
Check three things: (1) The pods must have CPU requests defined — HPA cannot compute utilization without a request. (2) The metrics server must be running: kubectl top pods should return values. (3) The HPA's stabilizationWindowSeconds for scale-up may be set too high. Run kubectl describe hpa and look at the Conditions section for the exact failure reason.
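Point (1) follows directly from how utilization is defined: usage divided by the container's request. A minimal sketch with invented numbers, where `cpu_utilization_percent` is a hypothetical helper:

```python
def cpu_utilization_percent(usage_m, request_m):
    """Utilization as the HPA defines it; None when no request is set."""
    if not request_m:
        return None  # no denominator, so the metric cannot be computed
    return 100 * usage_m / request_m


print(cpu_utilization_percent(480, 600))   # 80.0
print(cpu_utilization_percent(480, None))  # None
```

Without a request there is nothing to divide by, so the HPA cannot act on the metric no matter how busy the pod is.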
Can I use KEDA alongside a standard HPA on the same deployment?
No. KEDA creates and manages its own HPA under the hood. If you manually create an HPA targeting the same deployment, they will conflict and produce unpredictable scaling behavior. Delete any manually created HPA before creating a KEDA ScaledObject for the same target.
What's the fastest way to scale up for a known traffic spike (e.g., a flash sale)?
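Raise the HPA floor before the event: kubectl patch hpa my-app-hpa -n production --type merge -p '{"spec":{"minReplicas":30}}'. A bare kubectl scale deployment my-app --replicas=30 is not reliable on its own, because HPA keeps reconciling toward its own recommendation and can scale the Deployment back down within a few sync periods if the load has not arrived yet. Lower minReplicas again once the event is over. Combined with KEDA or custom metrics that detect incoming load in the queue before it hits your pods, you can achieve proactive scaling. Some teams use CronJobs that call the Kubernetes API to adjust the floor at known times.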
How do I prevent Cluster Autoscaler from removing a specific node?
Annotate the node with cluster-autoscaler.kubernetes.io/scale-down-disabled=true. CA also leaves alone nodes that run pods with local storage (emptyDir) when --skip-nodes-with-local-storage is true, which is the default; note that the example Deployment above sets it to false, so that protection is switched off there. For permanent nodes (e.g., dedicated database nodes), place them in a separate node group outside CA's management scope.
What's the difference between HPA scaleDown stabilizationWindowSeconds and cooldown?
The stabilizationWindowSeconds field (default 300s for scale-down) is a sliding window over which HPA tracks the highest desired replica count, which prevents scale-down while a metric fluctuates. The kube-controller-manager flag --horizontal-pod-autoscaler-downscale-stabilization sets the cluster-wide default for HPAs that do not specify the field. There is no separate "cooldown" in HPA v2; the stabilization window serves that purpose. KEDA has its own cooldownPeriod field, which controls how long it waits before scaling to zero.