Kubernetes Pod Scheduling: Node Selectors and Affinity Rules
Our database pods kept landing on the same node. When that node went down, we lost all database instances. No redundancy, total outage.
I learned about Kubernetes scheduling. Now our pods are distributed across nodes, with database pods on dedicated high-memory nodes. No more single points of failure.
The Outage
- Tuesday, 2:00 PM: Node 3 crashes (hardware failure)
- Tuesday, 2:01 PM: All 3 PostgreSQL pods down
- Tuesday, 2:02 PM: Application can’t connect to database
- Tuesday, 2:03 PM: Complete outage
All database pods were on the same node. The default Kubernetes scheduler hadn’t spread them out.
Default Scheduling
The Kubernetes scheduler picks nodes based on:
- Resource requests - Does the node have enough CPU and memory?
- Node conditions - Is the node ready?
- Predicates - Does the pod fit on the node?
- Priorities - Which suitable node is best?
But it doesn’t guarantee pod distribution.
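To see what the scheduler is working with, inspect the nodes themselves (node-1 is one of our six nodes; substitute your own names):
kubectl get nodes
kubectl describe node node-1
The describe output lists Capacity, Allocatable, Conditions, and the pods already running on the node, which is the information the scheduler evaluates when placing a pod.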
Node Selectors
Simplest way to control scheduling:
Label nodes:
kubectl label nodes node-1 disktype=ssd
kubectl label nodes node-2 disktype=ssd
kubectl label nodes node-3 disktype=hdd
Use in pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: postgres-pod
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: postgres
    image: postgres:9.5
Pod only runs on nodes with disktype=ssd.
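To check which nodes a selector will match before deploying, filter nodes by label:
kubectl get nodes -l disktype=ssd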
Dedicated Database Nodes
We have 6 nodes:
- 3 high-memory nodes (32GB RAM) for databases
- 3 standard nodes (8GB RAM) for applications
Label database nodes:
kubectl label nodes node-1 node-type=database
kubectl label nodes node-2 node-type=database
kubectl label nodes node-3 node-type=database
PostgreSQL deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      nodeSelector:
        node-type: database
      containers:
      - name: postgres
        image: postgres:9.5
        resources:
          requests:
            memory: "4Gi"
            cpu: "1000m"
          limits:
            memory: "8Gi"
            cpu: "2000m"
Now PostgreSQL only runs on database nodes.
Pod Anti-Affinity
Spread pods across nodes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - postgres
            topologyKey: kubernetes.io/hostname
      containers:
      - name: postgres
        image: postgres:9.5
This ensures no two PostgreSQL pods run on the same node.
Pod Affinity
Run pods together:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - redis
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: web-app:latest
Web app pods run on the same nodes as Redis pods (to reduce latency).
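For this required affinity to be satisfiable, pods labeled app: redis must already be running somewhere. A minimal sketch of such a Deployment (the name and image tag are assumptions, not taken from our cluster):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:3.2  # assumed image tag
If no matching Redis pods exist yet, the web app pods will sit in Pending, which is one reason to prefer the soft variant described next.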
Preferred vs Required
Required (hard constraint):
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
    matchLabels:
      app: postgres
  topologyKey: kubernetes.io/hostname
Pod won’t schedule if constraint can’t be met.
Preferred (soft constraint):
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
  podAffinityTerm:
    labelSelector:
      matchLabels:
        app: redis
    topologyKey: kubernetes.io/hostname
The scheduler tries to satisfy it, but will schedule the pod anyway if it can’t.
Taints and Tolerations
Taints keep pods off certain nodes unless the pods carry a matching toleration.
Taint database nodes:
kubectl taint nodes node-1 dedicated=database:NoSchedule
kubectl taint nodes node-2 dedicated=database:NoSchedule
kubectl taint nodes node-3 dedicated=database:NoSchedule
Now regular pods can’t schedule on these nodes.
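You can confirm the taint took effect by describing the node; the Taints field should now show dedicated=database:NoSchedule:
kubectl describe node node-1 | grep Taints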
Add toleration to database pods:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  template:
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "database"
        effect: "NoSchedule"
      containers:
      - name: postgres
        image: postgres:9.5
Only pods with matching toleration can run on tainted nodes.
Taint Effects
Three effects:
- NoSchedule - Don’t schedule new pods
- PreferNoSchedule - Try not to schedule
- NoExecute - Evict existing pods
Example with NoExecute:
kubectl taint nodes node-1 maintenance=true:NoExecute
All pods without a matching toleration are evicted immediately.
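Two details worth knowing here: a NoExecute toleration can set tolerationSeconds to delay eviction, and a taint is removed by repeating the taint command with a trailing dash.
tolerations:
- key: "maintenance"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  tolerationSeconds: 300  # pod is evicted after 300s instead of immediately
Remove the taint once maintenance is done:
kubectl taint nodes node-1 maintenance=true:NoExecute-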
Real-World Setup
Our final configuration:
Database nodes (high-memory):
kubectl label nodes node-1 node-type=database
kubectl label nodes node-2 node-type=database
kubectl label nodes node-3 node-type=database
kubectl taint nodes node-1 dedicated=database:NoSchedule
kubectl taint nodes node-2 dedicated=database:NoSchedule
kubectl taint nodes node-3 dedicated=database:NoSchedule
PostgreSQL deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      nodeSelector:
        node-type: database
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "database"
        effect: "NoSchedule"
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - postgres
            topologyKey: kubernetes.io/hostname
      containers:
      - name: postgres
        image: postgres:9.5
        resources:
          requests:
            memory: "4Gi"
            cpu: "1000m"
This ensures:
- PostgreSQL runs only on database nodes
- One PostgreSQL pod per node (anti-affinity)
- Regular pods can’t run on database nodes (taint)
Topology Spread Constraints
Kubernetes 1.16+ feature (we’re on 1.3, but good to know):
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web
Evenly distributes pods across topology domains.
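The same mechanism spreads pods across larger topology domains too. A sketch for spreading across availability zones, assuming the nodes carry the standard topology.kubernetes.io/zone label:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway  # soft: prefer an even spread, but never block scheduling
  labelSelector:
    matchLabels:
      app: web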
Checking Pod Placement
See where pods are running:
kubectl get pods -o wide
Output:
NAME                    READY   STATUS    NODE
postgres-7d8f9c-abc12   1/1     Running   node-1
postgres-7d8f9c-def34   1/1     Running   node-2
postgres-7d8f9c-ghi56   1/1     Running   node-3
Perfect! One pod per node.
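For a quicker pod-to-node summary, custom columns also work:
kubectl get pods -l app=postgres -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName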
Debugging Scheduling Issues
Pod stuck in Pending:
kubectl describe pod postgres-7d8f9c-abc12
Look for events:
Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  FailedScheduling  0/6 nodes are available: 3 node(s) didn't match node selector, 3 node(s) had taints that the pod didn't tolerate.
This tells you exactly why the pod can’t schedule.
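Scheduling failures are also recorded as cluster events, so they can be listed directly; reason should be usable as a field selector on any reasonably recent kubectl:
kubectl get events --field-selector reason=FailedScheduling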
Node Affinity
More expressive than nodeSelector:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-type
          operator: In
          values:
          - database
          - cache
Supports operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
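Gt and Lt compare label values as integers, which is handy for numeric hardware labels. A small sketch assuming a cpu-cores label that we would have to apply ourselves (it is not a built-in label):
- matchExpressions:
  - key: cpu-cores
    operator: Gt
    values:
    - "8"  # only nodes whose cpu-cores label value is greater than 8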
Combining Constraints
You can combine multiple constraints:
spec:
  nodeSelector:
    disktype: ssd
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-west-1a
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: postgres
        topologyKey: kubernetes.io/hostname
  tolerations:
  - key: dedicated
    value: database
    effect: NoSchedule
Lessons Learned
- Plan node labels - Label nodes by role, hardware, zone
- Use anti-affinity for HA - Spread critical pods across nodes
- Taint dedicated nodes - Prevent resource contention
- Test failover - Verify pods reschedule correctly (a drain drill is sketched after this list)
- Monitor placement - Ensure pods are where you expect
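A simple failover drill, assuming you can tolerate briefly cordoning a node: drain it, watch the replacement pods get scheduled, then bring it back.
kubectl drain node-1 --ignore-daemonsets
kubectl get pods -o wide -w    # watch evicted pods being rescheduled
kubectl uncordon node-1
Note that with the required anti-affinity above and only three database nodes, a drained PostgreSQL pod stays Pending until the node is uncordoned; that is expected, and still far better than having all replicas share one node.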
Results
Before:
- All database pods on one node
- Node failure = complete outage
- No control over placement
After:
- Database pods spread across 3 nodes
- Node failure = 1/3 capacity loss, no outage
- Full control over pod placement
Conclusion
Kubernetes scheduling is powerful but requires explicit configuration. The default scheduler doesn’t guarantee high availability.
Key takeaways:
- Use node selectors for simple constraints
- Use affinity/anti-affinity for pod distribution
- Use taints/tolerations for dedicated nodes
- Always test failover scenarios
- Monitor pod placement
Don’t wait for an outage to learn about scheduling. Configure it properly from the start.