Migrating from EC2 to Kubernetes: A Postmortem
We just finished migrating our main application from EC2 to Kubernetes. It took 3 months. Here’s what we learned.
Why Migrate?
Our EC2 setup was getting painful:
- Manual scaling
- Inconsistent deployments
- Poor resource utilization (30% CPU average)
- Long deployment times (30 minutes)
Kubernetes promised:
- Auto-scaling
- Declarative deployments
- Better resource utilization
- Faster deployments
The Plan
Phase 1 (Month 1): Set up Kubernetes cluster, deploy non-critical services
Phase 2 (Month 2): Deploy to staging, run parallel with EC2
Phase 3 (Month 3): Gradual production migration, decommission EC2
Phase 1: Setup
We used kops to create a cluster on AWS:
kops create cluster \
  --name=prod.k8s.local \
  --state=s3://kops-state-store \
  --zones=us-east-1a,us-east-1b,us-east-1c \
  --node-count=3 \
  --node-size=m4.large \
  --master-size=m4.large \
  --master-zones=us-east-1a,us-east-1b,us-east-1c
Took 2 weeks to get everything configured (a quick sanity check follows the list):
- VPC setup
- Security groups
- IAM roles
- Monitoring (Prometheus + Grafana)
- Logging (ELK stack)
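Before moving to staging, it’s worth confirming the cluster is actually healthy. A minimal sanity check, assuming the state bucket and cluster name from the kops command above:

# Ask kops to verify the control plane and nodes against the cluster spec
kops validate cluster --name=prod.k8s.local --state=s3://kops-state-store

# Confirm every node registered and is Ready
kubectl get nodes -o wide

# Confirm system pods (CoreDNS/kube-dns, kube-proxy, etc.) are running
kubectl get pods -n kube-system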
Phase 2: Staging
Converted our apps to Kubernetes manifests:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: api:v1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
Ran staging on Kubernetes for a month. Found and fixed several issues (a debugging sketch follows the list):
- Memory leaks (OOMKilled pods)
- Networking problems (DNS timeouts)
- Storage issues (PersistentVolume configs)
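Most of these surfaced through the same handful of commands; a minimal debugging sketch (pod names are placeholders, and kubectl top needs a metrics pipeline such as metrics-server):

# Spot restarts and OOMKilled containers
kubectl get pods -l app=api
kubectl describe pod <pod-name>   # look for Last State: Terminated, Reason: OOMKilled

# Compare real usage against requests/limits
kubectl top pods -l app=api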
Phase 3: Production Migration
We didn’t do a big-bang migration. Instead:
Week 1: 10% of traffic to Kubernetes
Week 2: 25% of traffic
Week 3: 50% of traffic
Week 4: 100% of traffic
Used weighted DNS to split traffic between EC2 and Kubernetes.
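On AWS, weighted DNS in practice means Route 53 weighted record sets. A hedged sketch of what the 10% step could look like (the hosted zone ID, record names, and load balancer targets are all placeholders):

# Two weighted records for the same name: 90% to the EC2 ELB, 10% to the Kubernetes ingress LB
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXXXXXXXXXX \
  --change-batch '{
    "Changes": [
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "app.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "ec2", "Weight": 90,
        "ResourceRecords": [{"Value": "ec2-elb.example.com"}]}},
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "app.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "k8s", "Weight": 10,
        "ResourceRecords": [{"Value": "k8s-ingress-elb.example.com"}]}}
    ]
  }'

Keeping the TTL low means each weight change, and any rollback, takes effect within minutes.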
What Went Wrong
1. Networking Issues
Intermittent connection timeouts. Spent a week debugging. Turned out to be an MTU mismatch between the EC2 and Kubernetes networks.
Fix:
# Set MTU on all nodes
ip link set dev eth0 mtu 1450
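The one-liner doesn’t survive a reboot or a fresh node, so the setting also has to land in node provisioning, and it’s worth verifying where fragmentation actually happens before trusting it. A sketch (interface name and target IP are placeholders):

# Confirm the interface MTU actually changed
ip link show eth0 | grep mtu

# Discover the path MTU toward a node/pod on the other side
tracepath 10.0.12.34

# Force "don't fragment" with a large payload to reproduce the timeouts
ping -M do -s 1400 10.0.12.34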
2. Storage Problems
Lost data during a node failure. We had misconfigured the PersistentVolume reclaim policy.
# Wrong: the PV (and the backing volume) is deleted along with the claim
reclaimPolicy: Delete
# Right: keep the volume and its data for manual recovery
reclaimPolicy: Retain
Now we use RDS for databases instead of running them in Kubernetes.
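For the stateful pieces that do stay in the cluster, the safer default is a StorageClass that retains volumes. A minimal sketch for EBS-backed volumes (the class name is a placeholder; the same idea applies to a PersistentVolume’s persistentVolumeReclaimPolicy field):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-retain
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
# Keep the EBS volume (and its data) even after the claim is deleted
reclaimPolicy: Retain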
3. DNS Failures
CoreDNS crashed under load. We increased the replica count and set resource requests and limits on its Deployment in kube-system (abridged below):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
spec:
  replicas: 5  # Was 2
  template:
    spec:
      containers:
        - name: coredns
          resources:
            limits:
              memory: 170Mi
            requests:
              cpu: 100m
              memory: 70Mi
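After the change, two quick checks tell you whether DNS is healthy: watch the CoreDNS pods in kube-system and resolve a service name from inside the cluster. A sketch (busybox:1.28 is used because later tags ship a less reliable nslookup):

# CoreDNS sits behind the kube-dns label in kube-system
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50

# Resolve a service name from a throwaway pod
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default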
4. Certificate Expiry
Let’s Encrypt certs expired because cert-manager silently failed to renew them. Added a Prometheus alert:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-expiry
spec:
  groups:
    - name: certificates
      rules:
        - alert: CertificateExpiringSoon
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 604800
          annotations:
            summary: "Certificate expiring in less than 7 days"
The Results
Before (EC2):
- 20 m4.large instances
- 30% average CPU utilization
- 30-minute deployments
- Manual scaling
- Cost: $3,000/month
After (Kubernetes):
- 12 m4.large instances
- 60% average CPU utilization
- 5-minute deployments
- Auto-scaling
- Cost: $2,000/month
Savings: $1,000/month (33%)
Lessons Learned
1. Start Small
Don’t migrate everything at once. Start with non-critical services.
2. Monitor Everything
Set up monitoring before migration. You can’t debug what you can’t see.
3. Test Failure Scenarios
Kill pods, kill nodes, simulate network failures. Find problems before production.
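Even crude drills help. A minimal sketch of what we mean, assuming an app=api label selector and run against staging rather than production (the node name is a placeholder):

# Kill a random api pod and watch it get rescheduled
kubectl delete "$(kubectl get pods -l app=api -o name | shuf -n 1)"
kubectl get pods -l app=api -w

# Take a node out of service to see how workloads move
kubectl drain <node-name> --ignore-daemonsets
kubectl uncordon <node-name>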
4. Use Managed Databases
Don’t run databases in Kubernetes. Use RDS, ElastiCache, etc.
5. Gradual Migration
Use weighted DNS or feature flags to gradually shift traffic.
6. Document Everything
We created runbooks for common issues. Saved us during incidents.
Would We Do It Again?
Yes, but we’d do some things differently:
- Use EKS instead of kops (it didn’t exist when we started)
- Set up monitoring earlier
- Spend more time on staging
- Use Helm from the start
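On the Helm point: even a thin chart removes a lot of copy-pasted YAML between environments. A hedged sketch of how we’d start today (chart, release, and values-file names are placeholders):

# Scaffold a chart and template the Deployment/Service from values
helm create api

# Install or upgrade the release with environment-specific values
helm upgrade --install api ./api -f values-production.yaml --namespace production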
Advice for Others
Do migrate if:
- You have multiple services
- You need auto-scaling
- You have a team that can learn Kubernetes
- You’re on cloud (AWS, GCP, Azure)
Don’t migrate if:
- You have a simple monolith
- Your team is small (< 3 people)
- EC2 is working fine
- You don’t have time to learn
Kubernetes is powerful but complex. Make sure the benefits outweigh the costs.
Questions? Ask away!