Kubernetes Affinity and Anti-Affinity: Advanced Pod Scheduling (2026)
1. Scheduling Hierarchy: nodeSelector → nodeAffinity → podAffinity
Kubernetes provides a layered set of mechanisms to control where pods land. Understanding which tool to reach for first saves time and avoids over-engineering.
- nodeSelector: the simplest approach. A key-value map on the pod spec matched against node labels. Use it when you need a single, exact label match with no fallback logic.
- nodeAffinity: a superset of nodeSelector. Supports rich expressions (In, NotIn, Gt, Lt, Exists, DoesNotExist) and distinguishes between hard (required) and soft (preferred) constraints. Use it whenever nodeSelector feels too rigid.
- podAffinity / podAntiAffinity: scheduling relative to other running pods rather than node labels. Use podAffinity to co-locate services (cache next to app) and podAntiAffinity to spread replicas for high availability.
- topologySpreadConstraints: generally available since Kubernetes 1.19, this is the modern, first-class replacement for naive anti-affinity spreading. It provides fine-grained control over skew across topology domains.
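For comparison with the affinity examples later in this guide, here is a minimal nodeSelector sketch. The pod name, image, and the disktype=ssd label are illustrative assumptions; substitute labels that actually exist on your nodes.

```yaml
# Minimal nodeSelector sketch (names and labels are hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: nodeselector-demo
spec:
  # Exact match only: every key/value pair listed here must be present on the node
  nodeSelector:
    disktype: ssd
  containers:
    - name: app
      image: nginx:1.27
```

If no node carries the label, the pod stays Pending. There is no soft fallback, which is the gap that nodeAffinity's preferred rules fill.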
2. nodeAffinity: Required vs Preferred During Scheduling
Node affinity lives under spec.affinity.nodeAffinity and has two scheduling modes:
- requiredDuringSchedulingIgnoredDuringExecution: hard constraint. Pod will not schedule if no matching node exists.
- preferredDuringSchedulingIgnoredDuringExecution: soft constraint. Scheduler tries matching nodes first but falls back to any node if none match.
The required mode takes a list of nodeSelectorTerms; the preferred mode takes a list of weighted preference objects. Each term or preference contains matchExpressions.
# nodeAffinity: hard + soft combined
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      affinity:
        nodeAffinity:
          # Hard: MUST run on nodes with SSD storage
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: storage-type
                    operator: In
                    values:
                      - ssd
          # Soft: PREFER high-memory nodes (weight 80 out of 100)
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: node-tier
                    operator: In
                    values:
                      - high-memory
      containers:
        - name: web-app
          image: myapp:2.1.0
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
The IgnoredDuringExecution suffix means that if a node's labels change after a pod is already running, the pod is not evicted. A requiredDuringSchedulingRequiredDuringExecution variant that would evict pods when node labels stop matching has been proposed, but it is not part of the stable API, so do not plan around it.
3. nodeAffinity Operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
Kubernetes supports six operators inside matchExpressions. Each serves a distinct filtering purpose:
# All six nodeAffinity operators demonstrated
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # In: node label value must be one of the listed values
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-east-1a
                  - us-east-1b
              # NotIn: exclude nodes in the DR zone
              - key: topology.kubernetes.io/zone
                operator: NotIn
                values:
                  - us-east-1d
              # Exists: node must have this label (any value)
              - key: nvidia.com/gpu
                operator: Exists
              # DoesNotExist: node must NOT have this label
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
              # Gt: node's instance-generation must be > 4
              - key: instance-generation
                operator: Gt
                values:
                  - "4"
              # Lt: node's latency-tier score must be < 3
              - key: latency-tier
                operator: Lt
                values:
                  - "3"
Gt and Lt compare integer values stored as label strings. The values list must contain exactly one element for these operators. Labels are always strings in Kubernetes, so the scheduler parses them as 64-bit integers for comparison.
Multiple expressions inside a single nodeSelectorTerms item are ANDed. Multiple items in nodeSelectorTerms are ORed — giving you full AND/OR flexibility without complex nesting.
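The AND/OR rule is easier to see with two terms side by side. The fragment below is a sketch that reuses label keys from the earlier examples; the particular combination is hypothetical. It goes under the pod spec and reads as "(zone is us-east-1a AND storage-type is ssd) OR (node-tier is high-memory)".

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        # Term 1: both expressions must match (AND)
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-1a
            - key: storage-type
              operator: In
              values:
                - ssd
        # Term 2: an alternative; a node matching this term alone also qualifies (OR)
        - matchExpressions:
            - key: node-tier
              operator: In
              values:
                - high-memory
```

A high-memory node in any zone qualifies through the second term, even without SSD storage.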
4. Use Case: Zone-Aware Scheduling Across Availability Zones
A common production requirement is restricting sensitive workloads to specific AZs — for compliance, latency, or cost reasons. The standard topology label is topology.kubernetes.io/zone, automatically applied by cloud providers.
# Require pods to run only in us-east-1a or us-east-1b (not 1c which is higher cost)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-service
  namespace: production
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payments-service
  template:
    metadata:
      labels:
        app: payments-service
        tier: backend
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  # Only run in cost-optimized zones
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - us-east-1a
                      - us-east-1b
                  # Only on production-grade nodes
                  - key: node-pool
                    operator: In
                    values:
                      - prod-standard
                      - prod-high-mem
          preferredDuringSchedulingIgnoredDuringExecution:
            # Prefer 1a (primary zone) over 1b (secondary)
            - weight: 70
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - us-east-1a
      containers:
        - name: payments-service
          image: payments:3.4.1
          ports:
            - containerPort: 8080
This pattern ensures that even as the cluster auto-scales and adds nodes in us-east-1c, the payments service never lands there. The weight of 70 gives a strong preference for 1a while still allowing overflow to 1b.
5. podAffinity: Co-Locating Pods on the Same Node or Zone
Pod affinity schedules a pod near other pods matching a label selector. The topologyKey field defines the granularity — use kubernetes.io/hostname for same-node co-location, or topology.kubernetes.io/zone for same-zone co-location.
A classic pattern is placing a Redis cache Deployment on the same node as the application, eliminating inter-node network hops for cache lookups.
# Cache co-location: schedule redis-cache pods on the same node as app pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
        role: cache
    spec:
      affinity:
        podAffinity:
          # Hard: must be on the SAME NODE as the web-app pods
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - web-app
              # Same physical node
              topologyKey: kubernetes.io/hostname
      containers:
        - name: redis
          image: redis:7.2-alpine
          ports:
            - containerPort: 6379
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              memory: "512Mi"
Where same-node placement is not strictly required, consider zone-level affinity (topologyKey: topology.kubernetes.io/zone) instead of hostname-level: it reduces the comparison space and still achieves latency benefits.
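A zone-level variant of the redis-cache affinity above might look like the following sketch. Making it a soft (preferred) rule and the weight of 100 are assumptions for illustration; they are not part of the original example.

```yaml
# Sketch: prefer the same zone as web-app pods instead of requiring the same node
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web-app
          # Zone-wide domain: any node in a zone that already runs web-app qualifies
          topologyKey: topology.kubernetes.io/zone
```

Because the rule is soft, redis-cache pods still schedule when no web-app pod is running yet, which the hard same-node rule does not allow.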
6. podAntiAffinity: Spreading Replicas for High Availability
Pod anti-affinity ensures replicas of the same workload do not land on the same node (or zone). This is the foundational HA pattern — if a node fails, not all replicas go down simultaneously.
# HA deployment: hard anti-affinity, no two replicas on the same node
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
        component: gateway
    spec:
      affinity:
        podAntiAffinity:
          # Hard: NEVER schedule two api-gateway pods on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - api-gateway
              topologyKey: kubernetes.io/hostname
      containers:
        - name: api-gateway
          image: envoy:v1.29
          ports:
            - containerPort: 8080
            - containerPort: 9901
          resources:
            requests:
              cpu: "1"
              memory: "1Gi"
requiredDuringScheduling anti-affinity with kubernetes.io/hostname and replicas: N means you need at least N nodes available. If the cluster shrinks below replica count, new pods will be stuck in Pending. For clusters with variable node counts, use preferredDuringScheduling anti-affinity or topologySpreadConstraints instead.
7. Soft Anti-Affinity: preferredDuringScheduling for Best-Effort Spreading
Soft anti-affinity lets the scheduler spread pods when possible but never blocks scheduling. This is the right choice when availability is desirable but not worth risking pod starvation.
# Soft anti-affinity: prefer different nodes but allow co-location if needed
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker-service
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: worker-service
  template:
    metadata:
      labels:
        app: worker-service
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            # Strong preference: avoid same node (weight 100)
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - worker-service
                topologyKey: kubernetes.io/hostname
            # Weaker preference: avoid same zone (weight 50)
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - worker-service
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: worker
          image: worker:1.8.0
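As a simplified model, each preference contributes weight × (1 if preference met, 0 if not) to a node's score. Nodes that satisfy both preferences score 150, nodes satisfying only the hostname preference score 100, and nodes satisfying neither score 0. The scheduler then normalizes this result and adds it to the scores from its other scoring plugins (resource balance, image locality, and so on), so the preference influences the final choice rather than deciding it outright.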
8. topologySpreadConstraints: The Modern Spreading Mechanism
Generally available since Kubernetes 1.19, topologySpreadConstraints is the recommended approach for distributing pods across topology domains in 2026. It is more expressive than anti-affinity and has better scheduler performance characteristics.
Key fields:
- maxSkew: maximum allowed difference in pod count between the most-loaded and least-loaded topology domain.
- topologyKey: the node label that defines the topology domain (zone, node, region, rack).
- whenUnsatisfiable: DoNotSchedule (hard) or ScheduleAnyway (soft).
- labelSelector: which pods count toward the spread calculation.
# topologySpreadConstraints: spread across zones with maxSkew=1
apiVersion: apps/v1
kind: Deployment
metadata:
  name: search-service
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: search-service
  template:
    metadata:
      labels:
        app: search-service
    spec:
      topologySpreadConstraints:
        # At most 1 pod difference between zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: search-service
        # At most 2 pod difference between nodes (best effort, across all eligible nodes)
        - maxSkew: 2
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: search-service
      containers:
        - name: search
          image: elasticsearch:8.13.0
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
The example combines a hard zone-level constraint with DoNotSchedule and a soft node-level constraint with ScheduleAnyway. This guarantees zone balance while making a best effort at node balance without risking pod starvation.
9. Use Case: 3-Replica Deployment Evenly Spread Across 3 AZs
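A three-replica stateless application spread one-per-zone is the minimum viable HA topology for most production services. With topologySpreadConstraints and maxSkew: 1, the scheduler places exactly one replica in each zone of a three-zone cluster, provided every zone has a node with enough free capacity. The constraint is evaluated at scheduling time only; running pods are not rebalanced afterwards.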
# Perfect 1-per-zone spread for a 3-replica, 3-AZ deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
  namespace: production
  labels:
    app: checkout-api
    version: v4.2.0
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
        version: v4.2.0
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: checkout-api
          # minDomains ensures we have pods in ALL zones, not just 2
          minDomains: 3
      # Prefer that pods don't share a node even within the same zone
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: checkout-api
                topologyKey: kubernetes.io/hostname
      containers:
        - name: checkout-api
          image: checkout:4.2.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
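The minDomains: 3 field (GA in Kubernetes 1.30) tells the scheduler to expect at least 3 eligible zones. While fewer than 3 zones have eligible nodes, the scheduler treats the least-loaded domain as holding zero pods, so with maxSkew: 1 no zone can take more than one replica and the remaining pod stays Pending. Without minDomains, the skew is computed over the 2 available zones only, and 2 replicas are allowed in one zone. The field only takes effect with whenUnsatisfiable: DoNotSchedule. Always set minDomains when you have a known, stable zone count.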
10. Combining Affinity with Taints and Tolerations: Multi-Tenant Node Isolation
Taints and tolerations prevent pods from landing on nodes; affinity rules attract pods to specific nodes. Together, they implement complete multi-tenant node isolation: dedicated node pools where only authorized workloads run, and those workloads always land on their designated pool.
# Step 1: Taint the dedicated GPU node pool
# kubectl taint nodes -l node-pool=gpu-exclusive dedicated=gpu:NoSchedule
# Step 2: Label the GPU node pool
# kubectl label nodes -l node-pool=gpu-exclusive gpu-workload=true
# Step 3: Deployment that REQUIRES GPU nodes and TOLERATES the taint
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training-job
  namespace: ai-workloads
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-training-job
  template:
    metadata:
      labels:
        app: ml-training-job
    spec:
      # Tolerate the taint so the pod is allowed on GPU nodes
      tolerations:
        - key: dedicated
          operator: Equal
          value: gpu
          effect: NoSchedule
      affinity:
        nodeAffinity:
          # Hard: only schedule on confirmed GPU nodes
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu-workload
                    operator: In
                    values:
                      - "true"
                  - key: nvidia.com/gpu
                    operator: Exists
        podAntiAffinity:
          # Soft: avoid packing both replicas on the same node
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: ml-training-job
                topologyKey: kubernetes.io/hostname
      containers:
        - name: trainer
          image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
          resources:
            limits:
              nvidia.com/gpu: "1"
The taint alone prevents random pods from landing on GPU nodes. The affinity rule ensures the ML job always lands on a GPU node — without it, the pod could tolerate the taint but still schedule on a non-GPU node if the scheduler found it more optimal.
11. Weight in Preferred Rules: Tuning Scheduler Preference 1–100
Every preferredDuringScheduling entry takes a weight integer from 1 to 100. The scheduler sums weights across all satisfied preferences for each candidate node and adds the result to that node's total priority score.
Weight tuning strategy:
- 100 — use when the preference is nearly as important as a hard rule. You want the scheduler to strongly favor it unless truly no matching node is available.
- 50–80 — balanced preference. The scheduler will choose it when other factors are roughly equal, but won't sacrifice significant other scoring (bin-packing, resource balance) to satisfy it.
- 1–20 — tie-breaker. The preference nudges scheduling only when all other factors are identical. Useful for gradual migration patterns (e.g., preferring new node pool over old).
# Multi-weight preference example: primary region > secondary region > newer instance types
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        # Strong: prefer primary region
        - weight: 100
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/region
                operator: In
                values:
                  - us-east-1
        # Medium: fall back to us-west-2 over eu
        - weight: 60
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/region
                operator: In
                values:
                  - us-west-2
        # Weak: prefer newer generation instances
        - weight: 20
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                  - m6i.xlarge
                  - m6i.2xlarge
                  - m7i.xlarge
12. Performance Impact: When Complex Affinity Rules Slow Scheduling
Affinity rules have measurable scheduling latency implications at scale. Understanding the cost model helps you decide when to simplify.
Pod affinity/anti-affinity is the most expensive. The scheduler must evaluate the new pod's affinity terms against every running pod in the cluster (or namespace). In a 10,000-pod cluster, a single podAffinity rule can add 10–30ms to scheduling latency. With 50 pods being scheduled simultaneously, this can create a scheduling backlog.
Node affinity, by contrast, only evaluates node labels — O(nodes) not O(pods). It is cheap even in large clusters.
Recommendations for high-scale clusters:
- Prefer topologySpreadConstraints over podAntiAffinity for spreading: it uses an optimized indexed data structure internally.
- Keep the namespace scope of podAffinity terms narrow. By default a term only matches pods in the incoming pod's own namespace; when you need cross-namespace matching, use namespaces or namespaceSelector to name only the relevant namespaces rather than selecting all of them.
- Avoid combining podAffinity + podAntiAffinity on the same deployment unless necessary, since each adds its own evaluation work.
- Monitor scheduling latency with the scheduler_scheduling_algorithm_duration_seconds Prometheus metric.
- Consider tuning hardPodAffinityWeight in the InterPodAffinity plugin arguments of the scheduler config for clusters above 500 nodes.
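As an illustration of scoping pod comparisons by namespace, the fragment below restricts a podAffinity term to pods in the production namespace. It is a sketch: the app label and weight are illustrative, and it relies on the kubernetes.io/metadata.name label that Kubernetes sets on every namespace automatically.

```yaml
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web-app
          # Only pods in namespaces matching this selector are considered
          namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production
          topologyKey: topology.kubernetes.io/zone
```

An empty namespaceSelector ({}) matches every namespace, which is the broad scope to avoid in large clusters.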
With the SchedulerQueueingHints feature gate enabled (beta), the scheduler can skip re-evaluating affinity terms when unrelated events occur, reducing CPU usage by up to 40% in high-churn environments.
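The hardPodAffinityWeight setting mentioned in the recommendations lives in the InterPodAffinity plugin arguments of the scheduler configuration. Below is a minimal sketch, assuming you run your own kube-scheduler and can pass it a config file via --config (managed control planes usually do not expose this). The value 5 is an arbitrary illustration, not a recommendation.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: InterPodAffinity
        args:
          # Scoring weight for existing pods whose required affinity terms
          # match the incoming pod (range 0-100, default 1); 5 is hypothetical
          hardPodAffinityWeight: 5
```

Change one value at a time and watch scheduler_scheduling_algorithm_duration_seconds before and after, so any latency shift can be attributed to the change.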
13. Troubleshooting: Pod Pending Due to Affinity Rules
When a pod cannot be scheduled due to affinity constraints, it enters Pending state with a descriptive event. The primary diagnostic tool is kubectl describe pod.
# Inspect a pending pod's scheduling events
kubectl describe pod <pod-name> -n <namespace>
# Look for the Events section at the bottom:
# Events:
# Type Reason Age From Message
# ---- ------ ---- ---- -------
# Warning FailedScheduling 2m43s default-scheduler 0/12 nodes are available:
# 3 node(s) had untolerated taint {dedicated: gpu},
# 4 node(s) didn't match Pod's node affinity/selector,
# 5 node(s) didn't match pod anti-affinity rules.
# Check which nodes match your nodeAffinity labels
kubectl get nodes -l storage-type=ssd
# Check pod affinity: see how existing pods are distributed
kubectl get pods -n production -l app=web-app -o wide
# Use kubectl explain to inspect affinity field structure
kubectl explain pod.spec.affinity.nodeAffinity
# Check scheduler logs for detailed failure reasons (if you have access)
kubectl logs -n kube-system -l component=kube-scheduler --tail=100 | grep <pod-name>
Common root causes and fixes:
- "didn't match Pod's node affinity/selector": the label you're selecting on doesn't exist on any node. Run kubectl get nodes --show-labels and verify the exact key-value. Cloud providers sometimes change topology label formats between Kubernetes versions (e.g., failure-domain.beta.kubernetes.io/zone deprecated in favor of topology.kubernetes.io/zone).
- Pod affinity or anti-affinity conflicts (reported in recent releases as "didn't match pod affinity rules" or "didn't match pod anti-affinity rules"): your podAffinity rule requires co-location with pods that don't exist yet, or your podAntiAffinity rule is too strict for the available node count. Temporarily switch to preferredDuringScheduling to confirm this is the cause.
- Pending after scale-out: new nodes may not have been labeled yet. Managed node groups in EKS/GKE/AKS usually label nodes automatically, but verify with kubectl get nodes --show-labels | grep topology after a new node joins.
- Pending with topologySpreadConstraints: minDomains may be set higher than the number of available zones. Check kubectl get nodes -l topology.kubernetes.io/zone --show-labels and count distinct zone values.
# Quick diagnostic script: find all pending pods and their scheduling messages
kubectl get pods -A --field-selector=status.phase=Pending -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)"' | \
  while IFS=/ read -r ns name; do
    echo "=== $ns/$name ==="
    # Print the Events header and the five lines that follow it
    kubectl describe pod "$name" -n "$ns" | grep -A 5 "Events:"
  done
For deeper investigation, Kubernetes Resource Management covers resource quotas and LimitRanges that can interact with scheduling. The Kubernetes Monitoring with Prometheus guide shows how to alert on scheduling latency metrics before they become production incidents.
Summary
Kubernetes affinity and anti-affinity give you fine-grained control over pod placement beyond simple node selectors. The key takeaways for 2026:
- Use nodeAffinity (not nodeSelector) for all new workloads — the expression syntax handles every use case nodeSelector covers, plus more.
- Use podAntiAffinity with requiredDuringScheduling only when you have guaranteed node count headroom; otherwise use preferredDuringScheduling.
- Prefer topologySpreadConstraints over anti-affinity for replica spreading: it is more performant and expressive in Kubernetes 1.27+.
- Always combine taints/tolerations with nodeAffinity for complete workload isolation — tolerations allow, affinity attracts.
- Monitor scheduling latency and simplify affinity rules in clusters above 500 nodes where pod affinity evaluation becomes measurably expensive.
For related topics, see Kubernetes Deployments for rolling update strategies, HPA Scaling for combining autoscaling with affinity constraints, and Security Best Practices for namespace isolation patterns that pair well with the multi-tenant affinity patterns covered here.