Kubernetes HPA and VPA: Auto Scaling Your Applications (2026)

Kubernetes autoscaling rests on three complementary tools: the Horizontal Pod Autoscaler (HPA) scales replica count, the Vertical Pod Autoscaler (VPA) adjusts CPU/memory requests, and KEDA adds event-driven scaling from queues and external metrics sources. HPA is built into Kubernetes; VPA and KEDA are add-ons that you install separately. Used together with Cluster Autoscaler, they let both pods and nodes scale with demand. This guide covers all three, plus Cluster Autoscaler, with example configurations you can adapt for production.

HPA v2: CPU and Memory Metrics

The autoscaling/v2 API (stable since Kubernetes 1.23) supports multiple metrics per HPA, including CPU, memory, and external/custom metrics. Always use v2 — the older v1 API only supports CPU and has no behavior tuning.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  # Scale on CPU utilization
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # Target 60% of request
  # Also scale on memory
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 400Mi      # Target 400Mi per pod
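
Note: With a Utilization target, HPA measures usage as a percentage of the pod's resource requests, not its limits. Every container in the target pods must therefore set a CPU request; without one, the controller cannot compute utilization and reports a missing-metrics error in the HPA's conditions. AverageValue targets, like the memory target above, compare absolute usage and do not depend on requests.
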
# Check HPA status and current metrics
kubectl get hpa my-app-hpa -n production

# Detailed output showing current vs target
kubectl describe hpa my-app-hpa -n production

# Watch HPA in real-time during a load test
kubectl get hpa -n production --watch

HPA calculates the desired replica count with the formula:

desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))

With 4 replicas at 80% CPU average and a 60% target: ceil(4 * 80/60) = ceil(5.33) = 6 replicas.
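
As a quick check of the arithmetic above, here is a minimal Python sketch of the formula. The function name `desired_replicas` and its parameters are illustrative and not part of any Kubernetes API. The 10% tolerance band is the controller's documented default; the real controller also accounts for unready pods and missing metrics, which this sketch ignores.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=2, max_replicas=20, tolerance=0.1):
    """Apply the HPA formula, then clamp to the min/max bounds."""
    ratio = current_value / target_value
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(4, 80, 60))  # 6: the worked example above
print(desired_replicas(4, 63, 60))  # 4: ratio 1.05 is within tolerance
print(desired_replicas(4, 20, 60))  # 2: ceil(1.33) = 2, which is also minReplicas
```

When an HPA lists several metrics, as in the manifest above, the controller evaluates each one separately and uses the largest resulting replica count.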

Custom Metrics with Prometheus Adapter

The Prometheus Adapter bridges Prometheus metrics into the Kubernetes custom metrics API, allowing HPA to scale on application-specific metrics like request rate, queue depth, or error rate.

# Install Prometheus Adapter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  -n monitoring \
  --set prometheus.url=http://prometheus-operated.monitoring.svc \
  --set prometheus.port=9090

Configure the adapter to expose your custom metric:

# values.yaml for prometheus-adapter
rules:
  custom:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

Then point an HPA at the exposed metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa-custom
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"   # Target 100 req/s per pod

# Verify the custom metric is available
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq '.resources[].name'

# Check the metric value for your pods
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" \
  | jq .
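
To see why the HPA asks for `http_requests_per_second` while Prometheus stores `http_requests_total`, the adapter's naming rule can be mimicked in a few lines of Python. This is only a sketch: the adapter itself uses Go regular expressions with `${1}` for the capture group, while Python writes the same reference as `\1`, and the helper name `exposed_metric_name` is hypothetical.

```python
import re

def exposed_metric_name(series_name,
                        matches=r"^(.*)_total$",
                        as_template=r"\1_per_second"):
    """Return the custom-metrics API name for a series, or None if the rule does not match."""
    if re.match(matches, series_name) is None:
        return None
    return re.sub(matches, as_template, series_name)

print(exposed_metric_name("http_requests_total"))  # http_requests_per_second
print(exposed_metric_name("process_cpu_seconds"))  # None
```

If the name printed here does not appear in the `kubectl get --raw` listing, the series query or the label overrides are the first things to check.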

KEDA: Event-Driven Autoscaling

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with 60+ built-in scalers including SQS, Kafka, RabbitMQ, Redis, Datadog, and more. Critically, KEDA can scale a workload to zero, which HPA cannot do by default: minReplicas must be at least 1 unless the alpha HPAScaleToZero feature gate is enabled.

# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace

Scale a worker deployment based on SQS queue depth:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: sqs-worker
  minReplicaCount: 0          # Scale to zero when queue is empty
  maxReplicaCount: 30
  pollingInterval: 15         # Check queue every 15 seconds
  cooldownPeriod: 60          # Wait 60s before scaling down to zero
  triggers:
  - type: aws-sqs-queue
    authenticationRef:
      name: keda-aws-credentials
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/my-queue
      queueLength: "10"       # Target: 1 worker per 10 messages
      awsRegion: us-east-1
      # identityOwner is left unset so that credentials come from authenticationRef above

# TriggerAuthentication for AWS IRSA
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-aws-credentials
  namespace: production
spec:
  podIdentity:
    provider: aws
    # Uses the IAM role bound to the KEDA operator's service account via IRSA; no static credentials needed

KEDA also supports scaling on Kafka consumer lag:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: kafka-consumer
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.production.svc:9092
      consumerGroup: my-consumer-group
      topic: orders
      lagThreshold: "50"      # Target average lag per replica, not a hard trigger
      offsetResetPolicy: latest
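
For queue-style triggers such as `queueLength` and `lagThreshold`, the threshold acts as a per-replica target rather than an on/off switch. The following Python sketch models that arithmetic under simplifying assumptions: the function name `queue_replicas` is hypothetical, and it ignores the `cooldownPeriod` delay before scaling to zero, HPA tolerance and behavior policies, and the Kafka scaler's default cap at the topic's partition count.

```python
import math

def queue_replicas(backlog, per_replica_target, min_replicas=0, max_replicas=30):
    """Simplified model: one replica per `per_replica_target` pending items."""
    if backlog <= 0:
        # An idle source falls back to the floor, which may be zero.
        return min_replicas
    wanted = math.ceil(backlog / per_replica_target)
    return max(min_replicas, min(max_replicas, wanted))

# SQS example: queueLength "10", 0..30 replicas
print(queue_replicas(0, 10))      # 0
print(queue_replicas(95, 10))     # 10
print(queue_replicas(1000, 10))   # 30 (capped by maxReplicaCount)

# Kafka example: lagThreshold "50", 1..20 replicas
print(queue_replicas(120, 50, min_replicas=1, max_replicas=20))  # 3
```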

VPA: Vertical Pod Autoscaler Modes

VPA recommends and optionally applies right-sized CPU/memory requests to pods. It has four operating modes that control how aggressively it applies changes.

# Install VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

VPA Mode | Behavior                                                  | Use Case
---------|-----------------------------------------------------------|------------------------------------------------------
Off      | Compute recommendations only, apply nothing               | Observability / right-sizing analysis
Initial  | Apply recommendations only when pods are created          | Safe production use without live evictions
Recreate | Evict pods when the recommendation differs significantly  | StatefulSets, or any workload that tolerates disruption
Auto     | Same as Recreate currently; may add in-place in future    | Dev/staging environments

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Initial"     # Safest mode for production
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 4Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits

# Check VPA recommendations
kubectl describe vpa my-app-vpa -n production

# Sample output
# Recommendation:
#   Container Recommendations:
#     Container Name:  app
#     Lower Bound:
#       Cpu:     120m
#       Memory:  256Mi
#     Target:
#       Cpu:     250m
#       Memory:  512Mi
#     Upper Bound:
#       Cpu:     1200m
#       Memory:  2Gi
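
The `minAllowed` and `maxAllowed` bounds in the manifest act as a clamp on whatever the recommender computes. A small Python sketch of that clamp follows; the helper names are hypothetical, the quantity parsing covers only the suffixes used in this article, and the second call uses made-up raw values purely to show the bounds taking effect.

```python
_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def cpu_millicores(quantity):
    """Parse '250m' or '4' into millicores."""
    text = str(quantity)
    if text.endswith("m"):
        return int(text[:-1])
    return int(float(text) * 1000)

def memory_bytes(quantity):
    """Parse '128Mi' or '4Gi' into bytes (binary suffixes only)."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)

def clamp(value, low, high):
    return max(low, min(high, value))

def apply_policy(raw_cpu, raw_memory):
    """Clamp a raw recommendation to the bounds from the manifest above."""
    cpu = clamp(cpu_millicores(raw_cpu), cpu_millicores("100m"), cpu_millicores("4"))
    mem = clamp(memory_bytes(raw_memory), memory_bytes("128Mi"), memory_bytes("4Gi"))
    return cpu, mem

print(apply_policy("250m", "512Mi"))  # (250, 536870912): inside the bounds, unchanged
print(apply_policy("50m", "8Gi"))     # (100, 4294967296): raised to min CPU, capped at max memory
```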
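
Pro Tip: Do NOT let HPA and VPA act on the same resource metric (CPU or memory) for the same workload. HPA changes the replica count based on utilization of the request, while VPA changes the request itself, so the two controllers fight each other. Either run HPA on CPU or memory with VPA in Off mode for recommendations only, or let VPA manage requests (for example in Initial mode) and drive HPA from custom or external metrics only.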

Cluster Autoscaler

Cluster Autoscaler (CA) adds or removes nodes when pods are unschedulable due to insufficient resources or when nodes are underutilized. It works at the node group level (AWS Auto Scaling Groups, GCP Managed Instance Groups, etc.).

# cluster-autoscaler Deployment (AWS EKS example)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
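spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler   # Required by the selector; also used by kubectl -l
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - name: cluster-autoscaler
        # Use the release that matches your cluster's Kubernetes minor version
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste        # Pick the node group that leaves the least idle CPU/memory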
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups
        - --scale-down-enabled=true
        - --scale-down-utilization-threshold=0.5
        - --scale-down-unneeded-time=10m
        - --scale-down-delay-after-add=10m

# Check CA logs for scaling decisions
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50

# Check node autoscaler events
kubectl get events -n kube-system | grep cluster-autoscaler

# Annotate a node to prevent CA from removing it
kubectl annotate node my-critical-node \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true
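
To make the `--scale-down-utilization-threshold=0.5` setting concrete, here is a rough Python model of how a node becomes a scale-down candidate. It assumes utilization is the larger of the CPU and memory request ratios against allocatable capacity; the function names are hypothetical, and the real autoscaler additionally requires the node to stay unneeded for `--scale-down-unneeded-time` and its pods to be reschedulable elsewhere.

```python
SCALE_DOWN_DISABLED = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

def node_utilization(cpu_requested_m, cpu_allocatable_m,
                     mem_requested_bytes, mem_allocatable_bytes):
    """Utilization is based on pod requests, not on live usage."""
    return max(cpu_requested_m / cpu_allocatable_m,
               mem_requested_bytes / mem_allocatable_bytes)

def is_scale_down_candidate(utilization, threshold=0.5, annotations=None):
    """A node qualifies only if it is under the threshold and not annotated as protected."""
    annotations = annotations or {}
    if annotations.get(SCALE_DOWN_DISABLED) == "true":
        return False
    return utilization < threshold

GIB = 2**30
quiet = node_utilization(1200, 4000, 4 * GIB, 16 * GIB)    # CPU 30%, memory 25%
busy = node_utilization(1200, 4000, 10 * GIB, 16 * GIB)    # CPU 30%, memory 62.5%

print(is_scale_down_candidate(quiet))                                  # True
print(is_scale_down_candidate(busy))                                   # False
print(is_scale_down_candidate(quiet, annotations={SCALE_DOWN_DISABLED: "true"}))  # False
```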

HPA Behavior Tuning

The behavior field in HPA v2 lets you control scale-up and scale-down rates independently, preventing thrashing and protecting against sudden traffic spikes.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa-tuned
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0      # Scale up immediately
      policies:
      - type: Percent
        value: 100                        # Double replicas at most per period
        periodSeconds: 60
      - type: Pods
        value: 4                          # Or add at most 4 pods per period
        periodSeconds: 60
      selectPolicy: Max                   # Use whichever allows more pods
    scaleDown:
      stabilizationWindowSeconds: 300    # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 25                         # Remove at most 25% of replicas
        periodSeconds: 60
      selectPolicy: Min                   # Use the most conservative policy

The stabilizationWindowSeconds for scale-down prevents flapping: HPA looks at the maximum desired replica count over the window and uses that. This means a spike that resolves quickly won't immediately trigger scale-down.
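
The policies in the manifest above translate into simple per-period limits. The Python sketch below shows the arithmetic for a single 60-second period; it is a simplification with hypothetical function names, since the real controller also tracks scaling events that already happened within the period.

```python
import math

def scale_up_limit(current, percent=100, pods=4):
    """Highest replica count reachable in one period with selectPolicy: Max."""
    by_percent = math.ceil(current * (1 + percent / 100))
    by_pods = current + pods
    return max(by_percent, by_pods)

def scale_down_limit(current, percent=25):
    """Lowest replica count reachable in one period with a 25% Percent policy."""
    return math.ceil(current * (1 - percent / 100))

def stabilized_recommendation(recent_recommendations):
    """Scale-down stabilization: use the highest recommendation seen in the window."""
    return max(recent_recommendations)

print(scale_up_limit(2))    # 6: adding 4 pods beats doubling to 4
print(scale_up_limit(10))   # 20: doubling beats adding 4 pods
print(scale_down_limit(20)) # 15
print(stabilized_recommendation([12, 9, 10, 7]))  # 12
```

With small deployments the Pods policy dominates scale-up and with larger ones the Percent policy does, which is the reason for listing both under `selectPolicy: Max`.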

Combining HPA + Cluster Autoscaler

HPA and Cluster Autoscaler are designed to work together: HPA requests more pods, the scheduler finds no capacity, Cluster Autoscaler adds a node, then the pods schedule. Key settings to make this work smoothly:

# PodDisruptionBudget — ensure CA doesn't remove nodes with critical pods
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: production
spec:
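  # Note: at the HPA minimum of 2 replicas this allows zero evictions,
  # so CA cannot drain those nodes. Consider maxUnavailable: 1 if that is too strict.
  minAvailable: 2             # Always keep at least 2 pods running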
  selector:
    matchLabels:
      app: my-app

# Spread pods across nodes so that scale-ups trigger new nodes instead of only filling existing ones
# (fragment: merge these fields into the my-app Deployment)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: my-app
              topologyKey: kubernetes.io/hostname

HPA vs VPA vs KEDA Comparison

Feature            | HPA                           | VPA                                   | KEDA
-------------------|-------------------------------|---------------------------------------|------------------------------------------------
Scales             | Replica count                 | CPU/memory requests                   | Replica count (including 0)
Metrics source     | CPU, memory, custom, external | Historical CPU/memory usage           | 60+ event sources (SQS, Kafka, Redis…)
Scale to zero      | No (min 1)                    | N/A                                   | Yes
Pod restart needed | No                            | Yes (unless in-place resize is used)  | No
Best for           | Stateless web/API services    | Right-sizing any workload             | Queue workers, batch jobs
Conflict with HPA  | N/A                           | Yes, on same metrics                  | No (manages its own HPA; do not add a second one)
Latency to scale   | 15–30s (metrics lag)          | Seconds–minutes                       | 15s (configurable)

Frequently Asked Questions

Why is my HPA stuck at minimum replicas despite high CPU?

Check three things: (1) The pods must have CPU requests defined — HPA cannot compute utilization without a request. (2) The metrics server must be running: kubectl top pods should return values. (3) The HPA's stabilizationWindowSeconds for scale-up may be set too high. Run kubectl describe hpa and look at the Conditions section for the exact failure reason.

Can I use KEDA alongside a standard HPA on the same deployment?

No. KEDA creates and manages its own HPA under the hood. If you manually create an HPA targeting the same deployment, they will conflict and produce unpredictable scaling behavior. Delete any manually created HPA before creating a KEDA ScaledObject for the same target.

What's the fastest way to scale up for a known traffic spike (e.g., a flash sale)?

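Pre-scale before the event by raising the HPA floor rather than the Deployment's replica count: kubectl patch hpa my-app-hpa -n production --type merge -p '{"spec":{"minReplicas":30}}'. A bare kubectl scale deployment my-app --replicas=30 is not reliable on its own, because HPA still owns the replica count and will scale back down once it sees low utilization. Lower minReplicas again after the event. You can also combine this with KEDA or custom metrics that detect load building up in a queue before it hits your pods, which gets you closer to proactive scaling. Some teams use CronJobs that call the Kubernetes API to adjust minReplicas at known times.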

How do I prevent Cluster Autoscaler from removing a specific node?

Annotate the node with cluster-autoscaler.kubernetes.io/scale-down-disabled=true. CA also skips nodes that run pods with local storage (such as emptyDir), but only while --skip-nodes-with-local-storage is true, which is the default; the example Deployment above sets it to false, so that protection is off there. For permanent nodes (e.g., dedicated database nodes), place them in a separate node group outside CA's management scope.

What's the difference between HPA scaleDown stabilizationWindowSeconds and cooldown?

The stabilizationWindowSeconds field (default 300s for scale-down, 0s for scale-up) is a sliding window over which HPA tracks the highest desired replica count and refuses to go below it. This prevents scale-down while a metric fluctuates. The kube-controller-manager flag --horizontal-pod-autoscaler-downscale-stabilization sets the cluster-wide default, and the per-HPA behavior field overrides it. There's no separate "cooldown" in HPA v2; the stabilization window serves that purpose. KEDA has its own cooldownPeriod field, which controls how long it waits before scaling to zero.