Kubernetes in Production: 6 Months Later
Six months ago, we moved our main application to Kubernetes. It’s been a journey. Here’s what we learned.
Why We Chose Kubernetes
We were running Docker Swarm, and it was fine for simple use cases. But we needed:
- Better auto-scaling
- More sophisticated deployment strategies
- Service mesh capabilities
- Better monitoring integration
Kubernetes checked all these boxes.
The Migration
We didn’t do a big-bang migration. Instead:
Month 1: Set up cluster, deploy non-critical services
Month 2: Deploy staging environment
Month 3: Migrate 20% of production traffic
Month 4: Migrate 50% of production traffic
Month 5: Migrate 100% of production traffic
Month 6: Decommission old infrastructure
This gradual approach saved us from major disasters.
What Went Well
1. Auto-Scaling Actually Works
Horizontal Pod Autoscaler is magic:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
During traffic spikes, pods scale up automatically. During quiet periods, they scale down. Our AWS bill dropped 30%.
2. Rolling Updates Are Smooth
Deploying new versions is painless:
```
kubectl set image deployment/api api=api:v2.0
```
Kubernetes gradually replaces old pods with new ones. Zero downtime. If something breaks, rollback is one command:
```
kubectl rollout undo deployment/api
```
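The rollout behavior itself is configured on the Deployment. A minimal sketch of what we mean, not our exact manifest; the surge and unavailability values are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # allow one extra pod above the desired count during a rollout
      maxUnavailable: 0   # never drop below the desired count, which is what gives zero downtime
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: api:v2.0   # the tag that kubectl set image swaps out
        ports:
        - containerPort: 8080
```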
3. Self-Healing
Pods crash? Kubernetes restarts them. Nodes die? Kubernetes reschedules pods. We’ve had nodes fail, and users didn’t notice.
4. Resource Utilization
Before Kubernetes, our servers ran at 20-30% CPU. Now they run at 60-70%. Better bin packing means we need fewer servers.
What Went Wrong
1. Networking Is Hard
We spent two weeks debugging intermittent connection timeouts. Turned out to be a CNI plugin issue. Switched from Flannel to Calico, problem solved.
Lesson: Choose your CNI plugin carefully.
2. Persistent Storage Is Painful
StatefulSets and PersistentVolumes are complex. We had data loss during a node failure because we misconfigured volume reclaim policies.
Now we use managed databases (RDS, ElastiCache) instead of running databases in Kubernetes.
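If you do keep stateful workloads in the cluster, the reclaim policy is worth setting deliberately. A hedged sketch of a StorageClass that retains the underlying volume when its claim is deleted; it assumes the AWS EBS CSI driver, and the name and parameters are illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-retain              # illustrative name
provisioner: ebs.csi.aws.com    # assumes the AWS EBS CSI driver is installed
reclaimPolicy: Retain           # keep the EBS volume after the PVC is deleted, instead of Delete
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
```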
3. Monitoring Complexity
We’re running:
- Prometheus for metrics
- Grafana for dashboards
- ELK stack for logs
- Jaeger for tracing
That’s a lot of moving parts. Setting it all up took a month.
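Part of that month went into wiring each application into the stack. For example, if Prometheus is deployed through the Prometheus Operator, every service needs a ServiceMonitor before its metrics are scraped; a rough sketch with illustrative labels and port names:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api                  # illustrative
  labels:
    release: prometheus      # must match the Prometheus instance's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: api               # matches the labels on the api Service
  endpoints:
  - port: metrics            # named Service port that exposes /metrics
    interval: 30s
```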
4. Learning Curve
Our team of 5 developers spent 3 months getting comfortable with Kubernetes. That’s a significant investment.
Production Incidents
Incident 1: OOMKilled Pods
Pods were getting OOMKilled at random because we hadn’t set memory requests or limits:
```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```
Now we set limits for everything.
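To make that stick even when someone forgets, a LimitRange can supply per-namespace defaults for containers that don’t declare their own. A minimal sketch; the namespace and values are illustrative:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits      # illustrative name
  namespace: production     # illustrative namespace
spec:
  limits:
  - type: Container
    defaultRequest:         # used when a container omits resource requests
      memory: 256Mi
      cpu: 250m
    default:                # used when a container omits resource limits
      memory: 512Mi
      cpu: 500m
```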
Incident 2: DNS Failures
CoreDNS was overwhelmed during traffic spikes. We increased its replica count and added caching to the Corefile:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        cache 30                       # cache responses for 30 seconds
        kubernetes cluster.local in-addr.arpa ip6.arpa
        forward . /etc/resolv.conf     # still needed for names outside the cluster
    }
```
Incident 3: Certificate Expiry
Our Let’s Encrypt certificates expired because cert-manager had failed silently and stopped renewing them. We added monitoring for certificate expiry.
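One way to do that with a Prometheus stack is to alert on cert-manager’s certmanager_certificate_expiration_timestamp_seconds metric. A sketch as a PrometheusRule, which assumes the Prometheus Operator CRDs are installed; with a plain Prometheus deployment the same expression goes into a rules file, and the 14-day threshold is our illustration:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiry     # illustrative name
  labels:
    release: prometheus        # must match the Prometheus instance's ruleSelector
spec:
  groups:
  - name: certificates
    rules:
    - alert: CertificateExpiringSoon
      # cert-manager exports each certificate's expiry time as a Unix timestamp
      expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in under 14 days"
```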
Best Practices We Learned
1. Always Set Resource Limits
```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```
2. Use Liveness and Readiness Probes
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```
3. Use Namespaces
Separate dev, staging, and prod into different namespaces. Use RBAC to control access.
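A hedged sketch of the RBAC side: a RoleBinding that gives a developers group edit rights in staging only. The group name and namespace are illustrative, and assume your cluster’s auth provider maps users into groups:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers-edit
  namespace: staging                 # the binding applies only inside this namespace
subjects:
- kind: Group
  name: developers                   # illustrative group from your auth provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                         # built-in role: read/write most namespaced resources
  apiGroup: rbac.authorization.k8s.io
```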
4. Version Everything
Tag images with version numbers, not latest. Use Helm charts with version numbers.
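In the pod spec that just means an exact tag, never latest; the registry and version below are illustrative:

```yaml
containers:
- name: api
  image: registry.example.com/api:2.4.1   # pinned tag, never :latest
  imagePullPolicy: IfNotPresent            # with a pinned tag, pulls stay predictable
```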
5. Monitor Everything
- Cluster health
- Pod health
- Resource usage
- Application metrics
- Certificate expiry
Tools We Use
- Helm: Package manager for Kubernetes
- Prometheus: Metrics collection
- Grafana: Dashboards
- ELK: Log aggregation
- cert-manager: Automatic TLS certificates
- Istio: Service mesh (still evaluating)
Cost Comparison
Before (EC2 + Docker Swarm):
- 20 m4.large instances
- Cost: $3,000/month
After (EKS + Kubernetes):
- 12 m4.large instances (better utilization)
- EKS control plane: $144/month
- Cost: $2,000/month
Savings: $1,000/month (33%)
Would We Do It Again?
Yes, but with caveats:
Do use Kubernetes if:
- You have multiple microservices
- You need auto-scaling
- You have a team that can learn it
- You’re running on cloud (EKS, GKE, AKS)
Don’t use Kubernetes if:
- You have a simple monolith
- Your team is small (< 3 people)
- You’re running on bare metal
- You don’t have time to learn it
What’s Next
We’re exploring:
- Istio for service mesh
- GitOps with Flux
- Multi-cluster setup for disaster recovery
- Serverless with Knative
Advice for Teams Considering Kubernetes
- Start small: Deploy non-critical services first
- Use managed Kubernetes: EKS, GKE, or AKS. Don’t run your own control plane.
- Invest in training: Send your team to workshops or courses
- Set up monitoring early: You can’t manage what you can’t measure
- Plan for 3-6 months: It takes time to get comfortable
Kubernetes is powerful but complex. Make sure the benefits outweigh the costs for your use case.
Questions? Ask away. Happy to share more details about our setup.