Kubernetes Pod Scheduling: Node Selectors and Affinity Rules
Our database pods kept landing on the same node. When that node went down, we lost all database instances. No redundancy, total outage.
I learned about Kubernetes scheduling. Now our pods are distributed across nodes, with database pods on dedicated high-memory nodes. No more single points of failure.
The Outage
- Tuesday, 2:00 PM: Node 3 crashes (hardware failure)
- Tuesday, 2:01 PM: All 3 PostgreSQL pods down
- Tuesday, 2:02 PM: Application can’t connect to database
- Tuesday, 2:03 PM: Complete outage
All database pods were on the same node. The default Kubernetes scheduler hadn’t spread them out.
Default Scheduling
The Kubernetes scheduler picks nodes based on:
- Resource requests - Does the node have enough CPU and memory?
- Node conditions - Is the node ready?
- Predicates - Does the pod fit on the node?
- Priorities - Which suitable node is best?
But it doesn’t guarantee pod distribution.
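To see what the scheduler is working with, inspect the nodes themselves (node-1 is one of our six nodes; substitute your own names):
kubectl get nodes
kubectl describe node node-1
The describe output lists Capacity, Allocatable, Conditions, and the pods already running on the node, which is the information the scheduler evaluates when placing a pod.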
Node Selectors
Simplest way to control scheduling:
Label nodes:
kubectl label nodes node-1 disktype=ssd
kubectl label nodes node-2 disktype=ssd
kubectl label nodes node-3 disktype=hdd
Use in pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: postgres-pod
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: postgres
    image: postgres:9.5
Pod only runs on nodes with disktype=ssd.
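To check which nodes a selector will match before deploying, filter nodes by label:
kubectl get nodes -l disktype=ssd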
Dedicated Database Nodes
We have 6 nodes:
- 3 high-memory nodes (32GB RAM) for databases
- 3 standard nodes (8GB RAM) for applications
Label database nodes:
kubectl label nodes node-1 node-type=database
kubectl label nodes node-2 node-type=database
kubectl label nodes node-3 node-type=database
PostgreSQL deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      nodeSelector:
        node-type: database
      containers:
      - name: postgres
        image: postgres:9.5
        resources:
          requests:
            memory: "4Gi"
            cpu: "1000m"
          limits:
            memory: "8Gi"
            cpu: "2000m"
Now PostgreSQL only runs on database nodes.
Pod Anti-Affinity
Spread pods across nodes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - postgres
            topologyKey: kubernetes.io/hostname
      containers:
      - name: postgres
        image: postgres:9.5
This ensures no two PostgreSQL pods run on the same node.
Pod Affinity
Run pods together:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - redis
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: web-app:latest
Web app pods run on the same nodes as Redis pods (to reduce latency).
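For this required affinity to be satisfiable, pods labeled app: redis must already be running somewhere. A minimal sketch of such a Deployment (the name and image tag are assumptions, not taken from our cluster):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:3.2  # assumed image tag
If no matching Redis pods exist yet, the web app pods will sit in Pending, which is one reason to prefer the soft variant described next.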
Preferred vs Required
Required (hard constraint):
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
    matchLabels:
      app: postgres
  topologyKey: kubernetes.io/hostname
Pod won’t schedule if constraint can’t be met.
Preferred (soft constraint):
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
  podAffinityTerm:
    labelSelector:
      matchLabels:
        app: redis
    topologyKey: kubernetes.io/hostname
The scheduler tries to satisfy it, but will schedule the pod anyway if it can’t.
Taints and Tolerations
Taints keep pods off certain nodes unless the pods carry a matching toleration.
Taint database nodes:
kubectl taint nodes node-1 dedicated=database:NoSchedule
kubectl taint nodes node-2 dedicated=database:NoSchedule
kubectl taint nodes node-3 dedicated=database:NoSchedule
Now regular pods can’t schedule on these nodes.
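You can confirm the taint took effect by describing the node; the Taints field should now show dedicated=database:NoSchedule:
kubectl describe node node-1 | grep Taints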
Add toleration to database pods:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  template:
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "database"
        effect: "NoSchedule"
      containers:
      - name: postgres
        image: postgres:9.5
Only pods with matching toleration can run on tainted nodes.
Taint Effects
Three effects:
- NoSchedule - Don’t schedule new pods
- PreferNoSchedule - Try not to schedule
- NoExecute - Evict existing pods
Example with NoExecute:
kubectl taint nodes node-1 maintenance=true:NoExecute
All pods without a matching toleration are evicted immediately.
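Two details worth knowing here: a NoExecute toleration can set tolerationSeconds to delay eviction, and a taint is removed by repeating the taint command with a trailing dash.
tolerations:
- key: "maintenance"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  tolerationSeconds: 300  # pod is evicted after 300s instead of immediately
Remove the taint once maintenance is done:
kubectl taint nodes node-1 maintenance=true:NoExecute-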
Real-World Setup
Our final configuration:
Database nodes (high-memory):
kubectl label nodes node-1 node-type=database
kubectl label nodes node-2 node-type=database
kubectl label nodes node-3 node-type=database
kubectl taint nodes node-1 dedicated=database:NoSchedule
kubectl taint nodes node-2 dedicated=database:NoSchedule
kubectl taint nodes node-3 dedicated=database:NoSchedule
PostgreSQL deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      nodeSelector:
        node-type: database
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "database"
        effect: "NoSchedule"
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - postgres
            topologyKey: kubernetes.io/hostname
      containers:
      - name: postgres
        image: postgres:9.5
        resources:
          requests:
            memory: "4Gi"
            cpu: "1000m"
This ensures:
- PostgreSQL runs only on database nodes
- One PostgreSQL pod per node (anti-affinity)
- Regular pods can’t run on database nodes (taint)
Topology Spread Constraints
Kubernetes 1.16+ feature (we’re on 1.3, but good to know):
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web
Evenly distributes pods across topology domains.
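The same mechanism spreads pods across larger topology domains too. A sketch for spreading across availability zones, assuming the nodes carry the standard topology.kubernetes.io/zone label:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway  # soft: prefer an even spread, but never block scheduling
  labelSelector:
    matchLabels:
      app: web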
Checking Pod Placement
See where pods are running:
kubectl get pods -o wide
Output:
NAME                    READY   STATUS    NODE
postgres-7d8f9c-abc12   1/1     Running   node-1
postgres-7d8f9c-def34   1/1     Running   node-2
postgres-7d8f9c-ghi56   1/1     Running   node-3
Perfect! One pod per node.
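For a quicker pod-to-node summary, custom columns also work:
kubectl get pods -l app=postgres -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName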
Debugging Scheduling Issues
Pod stuck in Pending:
kubectl describe pod postgres-7d8f9c-abc12
Look for events:
Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  FailedScheduling  0/6 nodes are available: 3 node(s) didn't match node selector, 3 node(s) had taints that the pod didn't tolerate.
This tells you exactly why the pod can’t schedule.
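Scheduling failures are also recorded as cluster events, so they can be listed directly; reason should be usable as a field selector on any reasonably recent kubectl:
kubectl get events --field-selector reason=FailedScheduling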
Node Affinity
More expressive than nodeSelector:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-type
          operator: In
          values:
          - database
          - cache
Supports operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
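Gt and Lt compare label values as integers, which is handy for numeric hardware labels. A small sketch assuming a cpu-cores label that we would have to apply ourselves (it is not a built-in label):
- matchExpressions:
  - key: cpu-cores
    operator: Gt
    values:
    - "8"  # only nodes whose cpu-cores label value is greater than 8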
Combining Constraints
You can combine multiple constraints:
spec:
  nodeSelector:
    disktype: ssd
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-west-1a
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: postgres
        topologyKey: kubernetes.io/hostname
  tolerations:
  - key: dedicated
    value: database
    effect: NoSchedule
Lessons Learned
- Plan node labels - Label nodes by role, hardware, zone
- Use anti-affinity for HA - Spread critical pods across nodes
- Taint dedicated nodes - Prevent resource contention
- Test failover - Verify pods reschedule correctly (a drain drill is sketched after this list)
- Monitor placement - Ensure pods are where you expect
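A simple failover drill, assuming you can tolerate briefly cordoning a node: drain it, watch the replacement pods get scheduled, then bring it back.
kubectl drain node-1 --ignore-daemonsets
kubectl get pods -o wide -w    # watch evicted pods being rescheduled
kubectl uncordon node-1
Note that with the required anti-affinity above and only three database nodes, a drained PostgreSQL pod stays Pending until the node is uncordoned; that is expected, and still far better than having all replicas share one node.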
Results
Before:
- All database pods on one node
- Node failure = complete outage
- No control over placement
After:
- Database pods spread across 3 nodes
- Node failure = 1/3 capacity loss, no outage
- Full control over pod placement
Conclusion
Kubernetes scheduling is powerful but requires explicit configuration. The default scheduler doesn’t guarantee high availability.
Key takeaways:
- Use node selectors for simple constraints
- Use affinity/anti-affinity for pod distribution
- Use taints/tolerations for dedicated nodes
- Always test failover scenarios
- Monitor pod placement
Don’t wait for an outage to learn about scheduling. Configure it properly from the start.