A Black Friday traffic spike crashed our services. We had 3 pods running, traffic increased 10x, and everything fell over.

I implemented Horizontal Pod Autoscaling (HPA). Now our services automatically scale from 3 to 30 pods during traffic spikes. No more crashes.

The Black Friday Disaster

Traffic pattern:

  • Normal: 500 req/s (3 pods)
  • Black Friday: 5000 req/s (still 3 pods)
  • Result: 503 errors, angry customers

We manually scaled to 20 pods, but it took 15 minutes. By then, we’d lost sales.

Horizontal Pod Autoscaler

HPA automatically scales pods based on metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

When average CPU utilization rises above 70%, HPA adds pods. When it falls well below 70% and stays there, HPA scales back down.
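
Under the hood, HPA uses the documented formula desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). A quick worked example with made-up numbers:

# 3 pods averaging 140% of their CPU request, target 70%
desiredReplicas = ceil(3 * 140 / 70)
                = 6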

CPU-Based Autoscaling

Simple and effective:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 5
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

Target 60% CPU utilization. Leaves headroom for spikes.
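
If you just want the CPU-only version without writing YAML, kubectl can generate an equivalent HPA (this creates a CPU-only autoscaler, so switch to the manifest above once you need memory or custom metrics):

kubectl autoscale deployment api-gateway --cpu-percent=60 --min=5 --max=50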

Memory-Based Autoscaling

For memory-intensive services:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: data-processor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: data-processor
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Multiple Metrics

Scale on CPU OR memory:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

HPA computes a desired replica count for each metric and uses the largest, so breaching either target triggers a scale-up.
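
A quick illustration with hypothetical numbers: 3 pods at 140% CPU (target 70%) and 90% memory (target 80%):

cpuReplicas    = ceil(3 * 140 / 70) = 6
memoryReplicas = ceil(3 * 90 / 80)  = 4
desired        = max(6, 4)          = 6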

Custom Metrics

Scale on requests per second:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 5
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"

Requires the custom metrics API, which is served by an adapter such as Prometheus Adapter; metrics-server alone only provides CPU and memory.
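
Once an adapter is installed, you can sanity-check that the custom metrics API is registered and see which metric names it exposes (assumes jq is available):

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'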

Setting Up Metrics Server

Install metrics-server:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Verify:

kubectl top nodes
kubectl top pods
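
On some local or dev clusters (kind, minikube), metrics-server can't verify the kubelet's TLS certificate and kubectl top stays empty. If you hit that, one workaround is adding the --kubelet-insecure-tls flag (fine for dev, not for production):

kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'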

Resource Requests Required

HPA needs resource requests defined:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
    spec:
      containers:
      - name: user-service
        image: user-service:latest
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi

Without requests, HPA can’t calculate utilization.
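
A quick way to confirm requests are actually set on a running pod (the pod name is a placeholder):

kubectl get pod <user-service-pod> -o jsonpath='{.spec.containers[*].resources.requests}'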

Scaling Behavior

Control scale-up/down speed:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      selectPolicy: Min

  • Scale up: aggressive (add 100% more pods or 4 pods every 15 seconds, whichever is larger)
  • Scale down: conservative (wait 5 minutes, then remove at most 50% of pods per minute)

Monitoring HPA

Check HPA status:

kubectl get hpa

NAME               REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS
user-service-hpa   Deployment/user-service   45%/70%   3         30        5

Describe for details:

kubectl describe hpa user-service-hpa

Output:

Metrics:
  resource cpu on pods  (as a percentage of request):  45% (45m) / 70%
Min replicas:  3
Max replicas:  30
Deployment pods:  5 current / 5 desired
Events:
  Normal  SuccessfulRescale  2m   horizontal-pod-autoscaler  New size: 5; reason: cpu resource utilization (percentage of request) above target

Testing Autoscaling

Generate load:

# Install hey
go install github.com/rakyll/hey@latest

# Generate load
hey -z 5m -c 100 http://api-gateway/users

Watch pods scale:

watch kubectl get pods
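
Watching the HPA at the same time shows the metric values and replica counts moving together:

kubectl get hpa --watch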

Real-World Example

Our API gateway during traffic spike:

Time    | Traffic | CPU  | Pods | Status
--------|---------|------|------|--------
10:00   | 500/s   | 40%  | 3    | Normal
10:15   | 2000/s  | 85%  | 6    | Scaling up
10:20   | 4000/s  | 75%  | 12   | Scaling up
10:25   | 5000/s  | 70%  | 15   | Stable
10:45   | 3000/s  | 60%  | 15   | Stable (cooldown)
11:00   | 1000/s  | 45%  | 10   | Scaling down
11:15   | 500/s   | 35%  | 5    | Scaling down
11:30   | 500/s   | 40%  | 3    | Back to normal

HPA handled the spike automatically!

Cost Optimization

Set appropriate min/max:

# Development
minReplicas: 1
maxReplicas: 5

# Staging
minReplicas: 2
maxReplicas: 10

# Production
minReplicas: 5
maxReplicas: 50

Don’t over-provision. Let HPA scale as needed.

Combining with Cluster Autoscaler

HPA scales pods. Cluster Autoscaler scales nodes.

# Cluster Autoscaler is configured via flags on its own Deployment;
# node-group names and cloud provider are environment-specific
command:
- ./cluster-autoscaler
- --cloud-provider=aws            # or gce, azure, ...
- --nodes=3:20:my-node-group      # min:max:node-group-name

When HPA adds pods and nodes are full, Cluster Autoscaler adds nodes.
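
When HPA asks for more pods than the current nodes can hold, the extra pods sit in Pending until Cluster Autoscaler brings up capacity. You can spot this with:

kubectl get pods --field-selector=status.phase=Pending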

Prometheus Custom Metrics

Scale on custom metrics from Prometheus:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 5
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: http_requests_per_second
        selector:
          matchLabels:
            service: api-gateway
      target:
        type: AverageValue
        averageValue: "1000"

Requires the Prometheus Adapter (or another adapter serving the external metrics API).
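
As a rough sketch of the adapter side (rule syntax from prometheus-adapter; the http_requests_total counter name is an assumption about your app's instrumentation), a rule like this turns a request counter into a per-second rate served through the metrics API:

rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

This backs the Pods-type http_requests_per_second metric shown earlier; external metrics are configured in a similar externalRules section.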

Best Practices

  1. Set resource requests - Required for HPA
  2. Conservative scale-down - Avoid flapping
  3. Aggressive scale-up - Handle spikes quickly
  4. Monitor HPA events - Understand scaling behavior
  5. Test under load - Verify HPA works

Common Issues

1. HPA not scaling:

# Check metrics
kubectl top pods

# Check HPA
kubectl describe hpa user-service-hpa

Usually missing resource requests or metrics-server not running.
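
If in doubt, check that metrics-server itself is healthy and its API is registered (assumes the standard install in kube-system):

kubectl get deployment metrics-server -n kube-system
kubectl get apiservice v1beta1.metrics.k8s.io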

2. Flapping (constant scale up/down):

Increase stabilization window:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # 10 minutes

3. Slow scale-up:

Reduce stabilization window and increase scale-up rate:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 200  # allow up to +200% per period (i.e. triple the pods)
      periodSeconds: 15

Vertical Pod Autoscaler

For right-sizing pod resources (only a brief look here):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: user-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  updatePolicy:
    updateMode: "Auto"

VPA adjusts CPU/memory requests and can evict pods to apply them. Combine it with HPA carefully: don't let both act on the same CPU/memory metrics for one workload.
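
If automatic eviction sounds risky, a gentler pattern is recommendation-only mode: set updateMode: "Off" in the spec above and read the suggested requests from the VPA status:

kubectl describe vpa user-service-vpa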

Results

After implementing HPA:

Before:

  • Manual scaling during traffic spikes
  • 15-minute scale-up time
  • Frequent outages during high traffic
  • Over-provisioned during low traffic

After:

  • Automatic scaling
  • 2-minute scale-up time
  • No outages
  • Cost reduced by 40% (fewer idle pods)

Lessons Learned

  1. Start conservative - Don’t scale too aggressively
  2. Monitor closely - Watch HPA behavior
  3. Test thoroughly - Load test before production
  4. Set appropriate limits - Don’t let HPA scale infinitely
  5. Combine with alerts - Know when HPA is scaling

Conclusion

HPA is essential for production Kubernetes. It handles traffic spikes automatically and reduces costs during low traffic.

Key takeaways:

  1. Use HPA for all production services
  2. Set resource requests/limits
  3. Start with CPU-based scaling
  4. Add custom metrics as needed
  5. Test under load

Our services now handle 10x traffic spikes without manual intervention. HPA saved us during Black Friday.

If you’re running Kubernetes in production, implement HPA. Your on-call team will thank you.