All our GPU pods landed on the same node while other GPU nodes sat idle. The default scheduler didn't understand our workload. We needed smarter scheduling.

I implemented advanced scheduling strategies. Now pods distribute evenly, GPU utilization is 85%, and we handle 3x more ML workloads.

The Problem

Default scheduler issues:

  • GPU pods on same node
  • CPU-intensive pods together (noisy neighbors)
  • No consideration for cost optimization
  • Can’t prioritize critical workloads

We needed control.
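
A quick way to see the skew is a wide pod listing, which shows which node each pod landed on:

kubectl get pods -o wide --sort-by=.spec.nodeName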

Node Affinity

Pin pods to specific hardware, with an optional zone preference:

apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - nvidia-tesla-v100
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1a
  containers:
  - name: training
    image: ml-training:latest

Required: must match (hard rule); the pod stays Pending if no node qualifies.
Preferred: try to match (soft rule); scheduling proceeds even if no node matches.
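
Node affinity matches node labels, so the nodes have to carry those labels first (the node name here is an example):

kubectl label nodes gpu-node-1 gpu=nvidia-tesla-v100 zone=us-east-1a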

Pod Anti-Affinity

Spread pods across nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-app
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web-app
        image: web-app:latest

Each replica lands on a different node. With required anti-affinity you need at least as many eligible nodes as replicas, otherwise the extra pods stay Pending.
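
Verify the spread; the NODE column should show three different nodes:

kubectl get pods -l app=web-app -o wide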

Pod Affinity

Co-locate related pods:

apiVersion: v1
kind: Pod
metadata:
  name: cache
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web-app
        topologyKey: kubernetes.io/hostname
  containers:
  - name: redis
    image: redis:latest

The cache pod runs on the same node as a web-app pod. With required affinity it stays Pending until a matching web-app pod exists; switch to preferredDuringSchedulingIgnoredDuringExecution if best effort is enough.

Taints and Tolerations

Reserve nodes for specific workloads:

Taint node:

kubectl taint nodes gpu-node-1 gpu=true:NoSchedule

Tolerate taint:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  tolerations:
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: training
    image: ml-training:latest

Only pods with a matching toleration can schedule onto the tainted node. A toleration allows scheduling there; it does not force it, so pair it with node affinity when GPU jobs must land on GPU nodes.
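
To open the node back up later, remove the taint (note the trailing dash):

kubectl taint nodes gpu-node-1 gpu=true:NoSchedule-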

Taint Effects

NoSchedule: Don’t schedule new pods
PreferNoSchedule: Try not to schedule
NoExecute: Evict existing pods

# NoExecute example
kubectl taint nodes node-1 maintenance=true:NoExecute

Existing pods without a matching toleration are evicted.
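
A NoExecute toleration can also set tolerationSeconds to keep a pod around for a grace period after the taint lands; a minimal sketch (pod name is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: maintenance-tolerant
spec:
  tolerations:
  - key: maintenance
    operator: Equal
    value: "true"
    effect: NoExecute
    tolerationSeconds: 300  # evicted 5 minutes after the taint is applied
  containers:
  - name: app
    image: app:latest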

Priority Classes

Prioritize critical workloads:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
globalDefault: false
description: "High priority for production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low priority for batch jobs"

Use in pod:

apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: critical-app:latest

High-priority pods can preempt (evict) lower-priority pods when the cluster runs out of room.
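
If you want queue ordering without evictions, PriorityClass also supports preemptionPolicy: Never (the class name here is illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 900
preemptionPolicy: Never
globalDefault: false
description: "High priority, but never preempts running pods"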

Topology Spread Constraints

Even distribution:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-app
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web-app
      containers:
      - name: web-app
        image: web-app:latest

Pods spread evenly across zones (hard constraint) and across nodes (best effort).

Resource Requests and Limits

Requests and limits influence scheduling; the scheduler uses requests, while limits only cap runtime usage:

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: app:latest
    resources:
      requests:
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "2Gi"
        cpu: "1000m"

The scheduler only places the pod on nodes with enough free capacity, comparing the pod's requests against each node's allocatable resources minus the requests of pods already there (not actual usage).
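
To see how much of a node's allocatable capacity is already requested (node name is an example):

kubectl describe node gpu-node-1 | grep -A 8 "Allocated resources"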

Custom Scheduler

You can also run your own scheduler that watches for unscheduled pods addressed to it and binds them to nodes:

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    config, err := rest.InClusterConfig()
    if err != nil {
        log.Fatalf("loading in-cluster config: %v", err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatalf("creating clientset: %v", err)
    }

    for {
        // Find pods that asked for this scheduler and have no node yet.
        pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
            FieldSelector: "spec.schedulerName=custom-scheduler,spec.nodeName=",
        })
        if err != nil {
            log.Printf("listing pods: %v", err)
            time.Sleep(5 * time.Second)
            continue
        }

        for i := range pods.Items {
            pod := &pods.Items[i]
            node := selectNode(clientset)
            if node == "" {
                continue // no candidate node found; retry on the next loop
            }
            if err := bindPodToNode(clientset, pod, node); err != nil {
                log.Printf("binding %s/%s: %v", pod.Namespace, pod.Name, err)
            }
        }

        time.Sleep(5 * time.Second) // poll instead of hammering the API server
    }
}

// selectNode implements the custom logic: pick the schedulable node
// currently running the fewest pods.
func selectNode(clientset *kubernetes.Clientset) string {
    nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        return ""
    }

    selectedNode := ""
    minPods := -1

    for _, node := range nodes.Items {
        if node.Spec.Unschedulable {
            continue
        }
        pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
            FieldSelector: fmt.Sprintf("spec.nodeName=%s", node.Name),
        })
        if err != nil {
            continue
        }
        if minPods == -1 || len(pods.Items) < minPods {
            minPods = len(pods.Items)
            selectedNode = node.Name
        }
    }

    return selectedNode
}

// bindPodToNode assigns the pod to the chosen node via the Binding subresource.
func bindPodToNode(clientset *kubernetes.Clientset, pod *v1.Pod, node string) error {
    binding := &v1.Binding{
        ObjectMeta: metav1.ObjectMeta{
            Name:      pod.Name,
            Namespace: pod.Namespace,
        },
        Target: v1.ObjectReference{
            Kind: "Node",
            Name: node,
        },
    }
    return clientset.CoreV1().Pods(pod.Namespace).Bind(context.TODO(), binding, metav1.CreateOptions{})
}

Point pods at the custom scheduler with schedulerName:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: custom-scheduler
  containers:
  - name: app
    image: app:latest
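
The custom scheduler also needs RBAC permissions to list pods and nodes and to create bindings; a rough sketch, assuming it runs under a custom-scheduler ServiceAccount in kube-system:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: custom-scheduler
rules:
- apiGroups: [""]
  resources: ["pods", "nodes"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/binding"]
  verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: custom-scheduler
subjects:
- kind: ServiceAccount
  name: custom-scheduler
  namespace: kube-system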

Scheduler Extender

Instead of replacing the default scheduler, you can extend it with an HTTP webhook:

apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-extender-config
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    extenders:
    - urlPrefix: "http://scheduler-extender:8080"
      filterVerb: "filter"
      prioritizeVerb: "prioritize"
      weight: 1
      enableHTTPS: false
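
kube-scheduler picks this up through its --config flag, typically with the ConfigMap mounted into the scheduler pod (the path is an example):

kube-scheduler --config=/etc/kubernetes/scheduler-config.yaml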

Descheduler

The descheduler evicts poorly placed pods so the scheduler can place them again on better nodes:

apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      RemoveDuplicates:
        enabled: true
      LowNodeUtilization:
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              cpu: 20
              memory: 20
            targetThresholds:
              cpu: 50
              memory: 50
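
The policy is only half the setup; the descheduler itself usually runs as a CronJob that mounts this ConfigMap. A rough sketch modeled on the upstream examples (image tag, schedule, and service account are assumptions):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: descheduler
          restartPolicy: Never
          containers:
          - name: descheduler
            image: registry.k8s.io/descheduler/descheduler:v0.29.0
            command:
            - /bin/descheduler
            - --policy-config-file=/policy/policy.yaml
            volumeMounts:
            - name: policy
              mountPath: /policy
          volumes:
          - name: policy
            configMap:
              name: descheduler-policy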

Real-World Examples

GPU workload distribution:

apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: workload
              operator: In
              values:
              - ml-training
          topologyKey: kubernetes.io/hostname
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: training
    image: ml-training:latest
    resources:
      limits:
        nvidia.com/gpu: 1

Cost optimization (spot instances):

apiVersion: v1
kind: Pod
metadata:
  name: batch-job
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: node-lifecycle
            operator: In
            values:
            - spot
  tolerations:
  - key: spot
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: job
    image: batch-job:latest
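
For this to work, the spot nodes need the matching label and taint, applied by the node provisioner or by hand (node name is an example):

kubectl label nodes spot-node-1 node-lifecycle=spot
kubectl taint nodes spot-node-1 spot=true:NoSchedule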

Results

Before:

  • Uneven pod distribution
  • GPU utilization: 30%
  • Noisy neighbor problems
  • No workload prioritization

After:

  • Even distribution across nodes
  • GPU utilization: 85%
  • Isolated workloads
  • Critical pods always scheduled

Lessons Learned

  1. Use anti-affinity - Spread for HA
  2. Taint special nodes - Reserve for specific workloads
  3. Set priorities - Critical workloads first
  4. Monitor scheduling - Watch pending pods (see the commands after this list)
  5. Test custom schedulers - Before production
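
For point 4, pending pods and their scheduling failures are easy to watch from the CLI:

kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl get events --all-namespaces --field-selector=reason=FailedScheduling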

Conclusion

Advanced scheduling strategies optimize resource utilization and ensure workload placement meets your requirements.

Key takeaways:

  1. Node affinity for hardware requirements
  2. Pod anti-affinity for HA
  3. Taints/tolerations for dedicated nodes
  4. Priority classes for critical workloads
  5. Custom schedulers for special needs

Master Kubernetes scheduling. Your workloads will run optimally.