Advanced Kubernetes Scheduling: Node Affinity, Taints, and Custom Schedulers
All our GPU pods landed on the same node. Other GPU nodes sat idle. The default scheduler didn’t understand our workload. We needed smarter scheduling.
I implemented advanced scheduling strategies. Now pods distribute evenly, GPU utilization is 85%, and we handle 3x more ML workloads.
The Problem
Issues with the default scheduler:
- GPU pods packed onto the same node
- CPU-intensive pods crowded together (noisy neighbors)
- No consideration for cost optimization
- No way to prioritize critical workloads
We needed control.
Node Affinity
Require or prefer specific nodes:
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - nvidia-tesla-v100
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1a
  containers:
  - name: training
    image: ml-training:latest
Required: must match, or the pod stays Pending (hard requirement)
Preferred: the scheduler tries to match, but falls back to other nodes (soft requirement)
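These rules match node labels, so the nodes must actually carry them. For example (node names here are illustrative):
# Label a GPU node so the required rule above can match it
kubectl label nodes gpu-node-1 gpu=nvidia-tesla-v100
# Label its zone so the preferred rule can score it
kubectl label nodes gpu-node-1 zone=us-east-1a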
Pod Anti-Affinity
Spread pods across nodes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-app
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web-app
        image: web-app:latest
Each replica lands on a different node! With a required rule, extra replicas stay Pending if there aren’t enough nodes.
Pod Affinity
Co-locate related pods:
apiVersion: v1
kind: Pod
metadata:
  name: cache
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web-app
        topologyKey: kubernetes.io/hostname
  containers:
  - name: redis
    image: redis:latest
The cache pod runs on the same node as a web-app pod!
Taints and Tolerations
Reserve nodes for specific workloads:
Taint node:
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
Tolerate the taint:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  tolerations:
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: training
    image: ml-training:latest
Only pods with a matching toleration can schedule onto the tainted node! Note that a toleration only allows scheduling; pair it with node affinity if GPU pods must also land on GPU nodes.
Taint Effects
NoSchedule: new pods without a matching toleration are not scheduled
PreferNoSchedule: the scheduler tries to avoid the node (soft)
NoExecute: new pods are not scheduled, and existing pods without a toleration are evicted
# NoExecute example
kubectl taint nodes node-1 maintenance=true:NoExecute
Existing pods without a matching toleration are evicted!
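If a pod should survive a NoExecute taint for a bounded time, say while maintenance finishes, add tolerationSeconds. A sketch reusing the maintenance taint above (the pod name and image are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: maintenance-tolerant
spec:
  tolerations:
  - key: maintenance
    operator: Equal
    value: "true"
    effect: NoExecute
    tolerationSeconds: 300   # evicted 5 minutes after the taint appears
  containers:
  - name: app
    image: app:latest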
Priority Classes
Prioritize critical workloads:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
globalDefault: false
description: "High priority for production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low priority for batch jobs"
Use it in a pod:
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: critical-app:latest
When the cluster is full, high-priority pods can preempt (evict) lower-priority pods!
Topology Spread Constraints
Even distribution:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-app
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web-app
      containers:
      - name: web-app
        image: web-app:latest
Pods spread evenly across zones (hard requirement) and across nodes (best effort)!
Resource Requests and Limits
Influence scheduling:
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: app:latest
    resources:
      requests:
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "2Gi"
        cpu: "1000m"
The scheduler only places the pod on nodes whose unreserved allocatable resources cover the requests; limits are enforced at runtime, not at scheduling time.
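To see what the scheduler sees, inspect a node's allocatable capacity and the requests already reserved on it (the node name is illustrative):
# "Allocatable" shows schedulable capacity; "Allocated resources" shows summed requests
kubectl describe node gpu-node-1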
Custom Scheduler
Build a custom scheduler:
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Find pods that requested this scheduler and are not yet bound to a node.
		pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
			FieldSelector: "spec.schedulerName=custom-scheduler,spec.nodeName=",
		})
		if err != nil {
			log.Printf("listing unscheduled pods: %v", err)
			time.Sleep(5 * time.Second)
			continue
		}
		for i := range pods.Items {
			pod := &pods.Items[i]
			node := selectNode(clientset, pod)
			if node == "" {
				continue
			}
			bindPodToNode(clientset, pod, node)
		}
		time.Sleep(5 * time.Second)
	}
}

// selectNode implements the custom logic: pick the node currently running the fewest pods.
func selectNode(clientset *kubernetes.Clientset, pod *v1.Pod) string {
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Printf("listing nodes: %v", err)
		return ""
	}
	minPods := -1
	selectedNode := ""
	for _, node := range nodes.Items {
		pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
			FieldSelector: fmt.Sprintf("spec.nodeName=%s", node.Name),
		})
		if err != nil {
			continue
		}
		if minPods == -1 || len(pods.Items) < minPods {
			minPods = len(pods.Items)
			selectedNode = node.Name
		}
	}
	return selectedNode
}

// bindPodToNode creates a Binding object, which assigns the pod to the chosen node.
func bindPodToNode(clientset *kubernetes.Clientset, pod *v1.Pod, node string) {
	binding := &v1.Binding{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pod.Name,
			Namespace: pod.Namespace,
		},
		Target: v1.ObjectReference{
			Kind: "Node",
			Name: node,
		},
	}
	if err := clientset.CoreV1().Pods(pod.Namespace).Bind(context.TODO(), binding, metav1.CreateOptions{}); err != nil {
		log.Printf("binding pod %s/%s: %v", pod.Namespace, pod.Name, err)
	}
}
Use the custom scheduler:
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: custom-scheduler
  containers:
  - name: app
    image: app:latest
Scheduler Extender
Extend the default scheduler with an HTTP extender:
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-extender-config
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    extenders:
    - urlPrefix: "http://scheduler-extender:8080"
      filterVerb: "filter"
      prioritizeVerb: "prioritize"
      weight: 1
      enableHTTPS: false
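The extender itself is just an HTTP service the scheduler calls at urlPrefix plus filterVerb. Below is a minimal sketch of a filter endpoint; the structs are simplified stand-ins for ExtenderArgs and ExtenderFilterResult from k8s.io/kube-scheduler/extender/v1, and the ssd=true label rule is only an illustrative policy, not something from this article. A /prioritize handler (matching prioritizeVerb) would follow the same pattern and return per-node scores.
package main

import (
	"encoding/json"
	"log"
	"net/http"

	v1 "k8s.io/api/core/v1"
)

// Simplified stand-ins for ExtenderArgs / ExtenderFilterResult
// (k8s.io/kube-scheduler/extender/v1); only the fields used here.
type extenderArgs struct {
	Pod   *v1.Pod      `json:"pod"`
	Nodes *v1.NodeList `json:"nodes,omitempty"`
}

type extenderFilterResult struct {
	Nodes       *v1.NodeList      `json:"nodes,omitempty"`
	FailedNodes map[string]string `json:"failedNodes,omitempty"`
	Error       string            `json:"error,omitempty"`
}

// filter keeps only nodes labeled ssd=true (an illustrative rule).
func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	result := extenderFilterResult{
		Nodes:       &v1.NodeList{},
		FailedNodes: map[string]string{},
	}
	if args.Nodes != nil {
		for _, node := range args.Nodes.Items {
			if node.Labels["ssd"] == "true" {
				result.Nodes.Items = append(result.Nodes.Items, node)
			} else {
				result.FailedNodes[node.Name] = "node is not labeled ssd=true"
			}
		}
	}
	json.NewEncoder(w).Encode(result)
}

func main() {
	// The default scheduler POSTs candidate nodes to urlPrefix + "/" + filterVerb, i.e. /filter here.
	http.HandleFunc("/filter", filter)
	log.Fatal(http.ListenAndServe(":8080", nil))
}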
Descheduler
Rebalance pods:
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      RemoveDuplicates:
        enabled: true
      LowNodeUtilization:
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              cpu: 20
              memory: 20
            targetThresholds:
              cpu: 50
              memory: 50
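The descheduler runs as its own workload (commonly a CronJob) that reads this policy and evicts pods for the scheduler to re-place. A sketch is below; the image tag, binary path, and the descheduler-sa ServiceAccount (which needs the RBAC shipped with the descheduler project) are assumptions to verify against the release you deploy.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: descheduler-sa        # assumed; needs list/evict permissions
          restartPolicy: Never
          containers:
          - name: descheduler
            image: registry.k8s.io/descheduler/descheduler:v0.29.0   # tag is an assumption
            command:
            - /bin/descheduler
            - --policy-config-file=/policy/policy.yaml
            volumeMounts:
            - name: policy
              mountPath: /policy
          volumes:
          - name: policy
            configMap:
              name: descheduler-policy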
Real-World Examples
GPU workload distribution:
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
  labels:
    workload: ml-training   # so the anti-affinity rule below matches sibling training pods
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: workload
              operator: In
              values:
              - ml-training
          topologyKey: kubernetes.io/hostname
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: training
    image: ml-training:latest
    resources:
      limits:
        nvidia.com/gpu: 1
Cost optimization (spot instances):
apiVersion: v1
kind: Pod
metadata:
  name: batch-job
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: node-lifecycle
            operator: In
            values:
            - spot
  tolerations:
  - key: spot
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: job
    image: batch-job:latest
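For the spot example to work, the spot nodes need the matching label and taint. A managed node group usually applies these; done by hand it looks like this (the node name is illustrative):
kubectl label nodes spot-node-1 node-lifecycle=spot
kubectl taint nodes spot-node-1 spot=true:NoSchedule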
Results
Before:
- Uneven pod distribution
- GPU utilization: 30%
- Noisy neighbor problems
- No workload prioritization
After:
- Even distribution across nodes
- GPU utilization: 85%
- Isolated workloads
- Critical pods always scheduled
Lessons Learned
- Use anti-affinity - Spread for HA
- Taint special nodes - Reserve for specific workloads
- Set priorities - Critical workloads first
- Monitor scheduling - Watch pending pods and FailedScheduling events (see the commands after this list)
- Test custom schedulers - Before production
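For the monitoring point, two commands catch most scheduling problems:
# Pods the scheduler could not place
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# Why they could not be placed
kubectl get events --all-namespaces --field-selector=reason=FailedScheduling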
Conclusion
Advanced scheduling strategies optimize resource utilization and ensure workload placement meets your requirements.
Key takeaways:
- Node affinity for hardware requirements
- Pod anti-affinity for HA
- Taints/tolerations for dedicated nodes
- Priority classes for critical workloads
- Custom schedulers for special needs
Master Kubernetes scheduling. Your workloads will run optimally.