Six months ago, we moved our main application to Kubernetes. It’s been a journey. Here’s what we learned.

Why We Chose Kubernetes

We were running Docker Swarm, and it was fine for simple use cases. But we needed:

  • Better auto-scaling
  • More sophisticated deployment strategies
  • Service mesh capabilities
  • Better monitoring integration

Kubernetes, together with its ecosystem, checked all these boxes.

The Migration

We didn’t do a big-bang migration. Instead:

Month 1: Set up cluster, deploy non-critical services
Month 2: Deploy staging environment
Month 3: Migrate 20% of production traffic
Month 4: Migrate 50% of production traffic
Month 5: Migrate 100% of production traffic
Month 6: Decommission old infrastructure

This gradual approach saved us from major disasters: when something broke, it only affected the slice of traffic we had already migrated.

What Went Well

1. Auto-Scaling Actually Works

Horizontal Pod Autoscaler is magic:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

During traffic spikes, pods scale up automatically. During quiet periods, they scale down. Our AWS bill dropped 30%.
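
To see what the autoscaler is doing at any moment, kubectl shows current versus target utilization and recent scaling events (names match the manifest above):

kubectl get hpa api-hpa
kubectl describe hpa api-hpa   # current vs. target CPU, plus scale-up/scale-down events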

2. Rolling Updates Are Smooth

Deploying new versions is painless:

kubectl set image deployment/api api=api:v2.0

Kubernetes gradually replaces old pods with new ones. Zero downtime. If something breaks, rollback is one command:

kubectl rollout undo deployment/api
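
Under the hood, how gradual the replacement is comes from the Deployment’s update strategy. A typical setting, with illustrative values rather than our exact ones:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1          # at most one extra pod above the desired count during a rollout
    maxUnavailable: 0    # never drop below the desired count, so capacity never dips

A rollout can also be watched with kubectl rollout status deployment/api, which blocks until it completes.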

3. Self-Healing

Pods crash? Kubernetes restarts them. Nodes die? Kubernetes reschedules pods. We’ve had nodes fail, and users didn’t notice.
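
An easy way to watch this happen (the pod name is a placeholder): delete a pod by hand and the Deployment replaces it within seconds.

kubectl delete pod api-7d9f8b6c4-xk2lp   # simulate a crash
kubectl get pods --watch                 # a replacement pod shows up almost immediately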

4. Resource Utilization

Before Kubernetes, our servers ran at 20-30% CPU. Now they run at 60-70%. Better bin packing means we need fewer servers.

What Went Wrong

1. Networking Is Hard

We spent two weeks debugging intermittent connection timeouts. It turned out to be a CNI plugin issue; switching from Flannel to Calico solved it.

Lesson: Choose your CNI plugin carefully.

2. Persistent Storage Is Painful

StatefulSets and PersistentVolumes are complex. We had data loss during a node failure because we misconfigured volume reclaim policies.

Now we use managed databases (RDS, ElastiCache) instead of running databases in Kubernetes.
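
For workloads that do stay in the cluster, the setting that bit us was the reclaim policy: dynamically provisioned volumes default to Delete. A minimal sketch of a StorageClass that keeps the underlying disk around (the name and provisioner are placeholders for whatever your cluster uses):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-retain
provisioner: kubernetes.io/aws-ebs       # placeholder; use your cluster's provisioner
reclaimPolicy: Retain                    # keep the volume when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer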

3. Monitoring Complexity

We’re running:

  • Prometheus for metrics
  • Grafana for dashboards
  • ELK stack for logs
  • Jaeger for tracing

That’s a lot of moving parts. Setting it all up took a month.

4. Learning Curve

Our team of 5 developers spent 3 months getting comfortable with Kubernetes. That’s a significant investment.

Production Incidents

Incident 1: OOMKilled Pods

Pods were getting OOMKilled at random. We hadn’t set memory requests or limits, so nodes were overcommitted and the kernel’s OOM killer picked victims for us. The fix was to give every container explicit requests and limits:

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Now we set requests and limits on everything.
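
If you are chasing the same symptom, the container’s last termination state tells you whether the OOM killer was responsible (the pod name is a placeholder):

kubectl describe pod api-7d9f8b6c4-xk2lp | grep -A 3 "Last State"
kubectl get pod api-7d9f8b6c4-xk2lp \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'   # prints OOMKilled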

Incident 2: DNS Failures

CoreDNS was overwhelmed during traffic spikes. We increased its replica count and enabled caching:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        cache 30
        kubernetes cluster.local in-addr.arpa ip6.arpa
        forward . /etc/resolv.conf
    }
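
The replica bump itself is an ordinary scale of the kube-system Deployment; the count below is illustrative, not a recommendation:

kubectl -n kube-system scale deployment coredns --replicas=4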

Incident 3: Certificate Expiry

Let’s Encrypt certificates expired because cert-manager failed silently during renewal. We added monitoring for certificate expiry.
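
One way to monitor expiry is to alert on cert-manager’s Prometheus metrics. A sketch, assuming the Prometheus Operator is installed and cert-manager is scraped (verify the metric name against your cert-manager version; the 14-day threshold is arbitrary):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiry
spec:
  groups:
  - name: certificates
    rules:
    - alert: CertificateExpiringSoon
      # Fires when any certificate managed by cert-manager has under 14 days left.
      expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "TLS certificate expires in less than 14 days"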

Best Practices We Learned

1. Always Set Resource Limits

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

2. Use Liveness and Readiness Probes

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

3. Use Namespaces

Separate dev, staging, and prod into different namespaces. Use RBAC to control access.
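
A minimal sketch (namespace and group names are placeholders): create the namespaces, then grant a team admin rights scoped to a single namespace.

kubectl create namespace dev
kubectl create namespace staging
kubectl create namespace prod

# Dev team gets the built-in "admin" ClusterRole, but only inside the dev namespace.
kubectl create rolebinding dev-team-admin \
  --clusterrole=admin --group=dev-team --namespace=dev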

4. Version Everything

Tag images with explicit version numbers, never latest. Pin Helm chart versions the same way.
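
Concretely, that means a pinned image tag in the Deployment and a pinned chart version in Helm (registry, chart, and version below are placeholders):

# In the Deployment's container spec:
image: registry.example.com/api:1.4.2        # never :latest

# When releasing with Helm:
helm upgrade api our-charts/api --version 1.4.2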

5. Monitor Everything

  • Cluster health
  • Pod health
  • Resource usage
  • Application metrics
  • Certificate expiry

Tools We Use

  • Helm: Package manager for Kubernetes
  • Prometheus: Metrics collection
  • Grafana: Dashboards
  • ELK: Log aggregation
  • cert-manager: Automatic TLS certificates
  • Istio: Service mesh (still evaluating)

Cost Comparison

Before (EC2 + Docker Swarm):

  • 20 m4.large instances
  • Cost: $3,000/month

After (EKS + Kubernetes):

  • 12 m4.large instances (better utilization)
  • EKS control plane: $144/month
  • Cost: $2,000/month

Savings: $1,000/month (33%)

Would We Do It Again?

Yes, but with caveats:

Do use Kubernetes if:

  • You have multiple microservices
  • You need auto-scaling
  • You have a team that can learn it
  • You’re running on cloud (EKS, GKE, AKS)

Don’t use Kubernetes if:

  • You have a simple monolith
  • Your team is small (< 3 people)
  • You’re running on bare metal
  • You don’t have time to learn it

What’s Next

We’re exploring:

  • Istio for service mesh
  • GitOps with Flux
  • Multi-cluster setup for disaster recovery
  • Serverless with Knative

Advice for Teams Considering Kubernetes

  1. Start small: Deploy non-critical services first
  2. Use managed Kubernetes: EKS, GKE, or AKS. Don’t run your own control plane.
  3. Invest in training: Send your team to workshops or courses
  4. Set up monitoring early: You can’t manage what you can’t measure
  5. Plan for 3-6 months: It takes time to get comfortable

Kubernetes is powerful but complex. Make sure the benefits outweigh the costs for your use case.

Questions? Ask away. Happy to share more details about our setup.