Kubernetes in Production: 6 Months Later
Six months ago, we moved our main application to Kubernetes. It’s been a journey. Here’s what we learned.
Why We Chose Kubernetes
We were running Docker Swarm, and it was fine for simple use cases. But we needed:
- Better auto-scaling
- More sophisticated deployment strategies
- Service mesh capabilities
- Better monitoring integration
Kubernetes checked all these boxes.
The Migration
We didn’t do a big-bang migration. Instead:
Month 1: Set up cluster, deploy non-critical services
Month 2: Deploy staging environment
Month 3: Migrate 20% of production traffic
Month 4: Migrate 50% of production traffic
Month 5: Migrate 100% of production traffic
Month 6: Decommission old infrastructure
This gradual approach saved us from major disasters.
What Went Well
1. Auto-Scaling Actually Works
Horizontal Pod Autoscaler is magic:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
During traffic spikes, pods scale up automatically. During quiet periods, they scale down. Our AWS bill dropped 30%.
2. Rolling Updates Are Smooth
Deploying new versions is painless:
```
kubectl set image deployment/api api=api:v2.0
```
Kubernetes gradually replaces old pods with new ones. Zero downtime. If something breaks, rollback is one command:
```
kubectl rollout undo deployment/api
```
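The rollout behavior itself is configured on the Deployment. A minimal sketch of what we mean, not our exact manifest; the surge and unavailability values are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # allow one extra pod above the desired count during a rollout
      maxUnavailable: 0   # never drop below the desired count, which is what gives zero downtime
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: api:v2.0   # the tag that kubectl set image swaps out
        ports:
        - containerPort: 8080
```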
3. Self-Healing
Pods crash? Kubernetes restarts them. Nodes die? Kubernetes reschedules pods. We’ve had nodes fail, and users didn’t notice.
4. Resource Utilization
Before Kubernetes, our servers ran at 20-30% CPU. Now they run at 60-70%. Better bin packing means we need fewer servers.
What Went Wrong
1. Networking Is Hard
We spent two weeks debugging intermittent connection timeouts. Turned out to be a CNI plugin issue. Switched from Flannel to Calico, problem solved.
Lesson: Choose your CNI plugin carefully.
2. Persistent Storage Is Painful
StatefulSets and PersistentVolumes are complex. We had data loss during a node failure because we misconfigured volume reclaim policies.
Now we use managed databases (RDS, ElastiCache) instead of running databases in Kubernetes.
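If you do keep stateful workloads in the cluster, the reclaim policy is worth setting deliberately. A hedged sketch of a StorageClass that retains the underlying volume when its claim is deleted; it assumes the AWS EBS CSI driver, and the name and parameters are illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-retain              # illustrative name
provisioner: ebs.csi.aws.com    # assumes the AWS EBS CSI driver is installed
reclaimPolicy: Retain           # keep the EBS volume after the PVC is deleted, instead of Delete
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
```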
3. Monitoring Complexity
We’re running:
- Prometheus for metrics
- Grafana for dashboards
- ELK stack for logs
- Jaeger for tracing
That’s a lot of moving parts. Setting it all up took a month.
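Part of that month went into wiring each application into the stack. For example, if Prometheus is deployed through the Prometheus Operator, every service needs a ServiceMonitor before its metrics are scraped; a rough sketch with illustrative labels and port names:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api                  # illustrative
  labels:
    release: prometheus      # must match the Prometheus instance's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: api               # matches the labels on the api Service
  endpoints:
  - port: metrics            # named Service port that exposes /metrics
    interval: 30s
```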
4. Learning Curve
Our team of 5 developers spent 3 months getting comfortable with Kubernetes. That’s a significant investment.
Production Incidents
Incident 1: OOMKilled Pods
Pods were getting OOMKilled at random because we hadn’t set memory requests or limits:
```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```
Now we set limits for everything.
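To make that stick even when someone forgets, a LimitRange can supply per-namespace defaults for containers that don’t declare their own. A minimal sketch; the namespace and values are illustrative:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits      # illustrative name
  namespace: production     # illustrative namespace
spec:
  limits:
  - type: Container
    defaultRequest:         # used when a container omits resource requests
      memory: 256Mi
      cpu: 250m
    default:                # used when a container omits resource limits
      memory: 512Mi
      cpu: 500m
```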
Incident 2: DNS Failures
CoreDNS was overwhelmed during traffic spikes. We increased its replica count and added caching to the Corefile:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        cache 30                       # cache responses for 30 seconds
        kubernetes cluster.local in-addr.arpa ip6.arpa
        forward . /etc/resolv.conf     # still needed for names outside the cluster
    }
```
Incident 3: Certificate Expiry
Our Let’s Encrypt certificates expired because cert-manager had failed silently and stopped renewing them. We added monitoring for certificate expiry.
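One way to do that with a Prometheus stack is to alert on cert-manager’s certmanager_certificate_expiration_timestamp_seconds metric. A sketch as a PrometheusRule, which assumes the Prometheus Operator CRDs are installed; with a plain Prometheus deployment the same expression goes into a rules file, and the 14-day threshold is our illustration:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiry     # illustrative name
  labels:
    release: prometheus        # must match the Prometheus instance's ruleSelector
spec:
  groups:
  - name: certificates
    rules:
    - alert: CertificateExpiringSoon
      # cert-manager exports each certificate's expiry time as a Unix timestamp
      expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in under 14 days"
```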
Best Practices We Learned
1. Always Set Resource Limits
```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```
2. Use Liveness and Readiness Probes
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```
3. Use Namespaces
Separate dev, staging, and prod into different namespaces. Use RBAC to control access.
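A hedged sketch of the RBAC side: a RoleBinding that gives a developers group edit rights in staging only. The group name and namespace are illustrative, and assume your cluster’s auth provider maps users into groups:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers-edit
  namespace: staging                 # the binding applies only inside this namespace
subjects:
- kind: Group
  name: developers                   # illustrative group from your auth provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                         # built-in role: read/write most namespaced resources
  apiGroup: rbac.authorization.k8s.io
```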
4. Version Everything
Tag images with version numbers, not latest. Use Helm charts with version numbers.
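In the pod spec that just means an exact tag, never latest; the registry and version below are illustrative:

```yaml
containers:
- name: api
  image: registry.example.com/api:2.4.1   # pinned tag, never :latest
  imagePullPolicy: IfNotPresent            # with a pinned tag, pulls stay predictable
```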
5. Monitor Everything
- Cluster health
- Pod health
- Resource usage
- Application metrics
- Certificate expiry
Tools We Use
- Helm: Package manager for Kubernetes
- Prometheus: Metrics collection
- Grafana: Dashboards
- ELK: Log aggregation
- cert-manager: Automatic TLS certificates
- Istio: Service mesh (still evaluating)
Cost Comparison
Before (EC2 + Docker Swarm):
- 20 m4.large instances
- Cost: $3,000/month
After (EKS + Kubernetes):
- 12 m4.large instances (better utilization)
- EKS control plane: $144/month
- Cost: $2,000/month
Savings: $1,000/month (33%)
Would We Do It Again?
Yes, but with caveats:
Do use Kubernetes if:
- You have multiple microservices
- You need auto-scaling
- You have a team that can learn it
- You’re running on cloud (EKS, GKE, AKS)
Don’t use Kubernetes if:
- You have a simple monolith
- Your team is small (< 3 people)
- You’re running on bare metal
- You don’t have time to learn it
What’s Next
We’re exploring:
- Istio for service mesh
- GitOps with Flux
- Multi-cluster setup for disaster recovery
- Serverless with Knative
Advice for Teams Considering Kubernetes
- Start small: Deploy non-critical services first
- Use managed Kubernetes: EKS, GKE, or AKS. Don’t run your own control plane.
- Invest in training: Send your team to workshops or courses
- Set up monitoring early: You can’t manage what you can’t measure
- Plan for 3-6 months: It takes time to get comfortable
Kubernetes is powerful but complex. Make sure the benefits outweigh the costs for your use case.
Questions? Ask away. Happy to share more details about our setup.