Migrating from EC2 to Kubernetes: A Postmortem
We just finished migrating our main application from EC2 to Kubernetes. It took 3 months. Here’s what we learned.
Why Migrate?
Our EC2 setup was getting painful:
- Manual scaling
- Inconsistent deployments
- Poor resource utilization (30% CPU average)
- Long deployment times (30 minutes)
Kubernetes promised:
- Auto-scaling
- Declarative deployments
- Better resource utilization
- Faster deployments
The Plan
Phase 1 (Month 1): Set up Kubernetes cluster, deploy non-critical services
Phase 2 (Month 2): Deploy to staging, run parallel with EC2
Phase 3 (Month 3): Gradual production migration, decommission EC2
Phase 1: Setup
We used kops to create a cluster on AWS:
kops create cluster \
  --name=prod.k8s.local \
  --state=s3://kops-state-store \
  --zones=us-east-1a,us-east-1b,us-east-1c \
  --node-count=3 \
  --node-size=m4.large \
  --master-size=m4.large \
  --master-zones=us-east-1a,us-east-1b,us-east-1c
Took 2 weeks to get everything configured (a quick sanity check follows the list):
- VPC setup
- Security groups
- IAM roles
- Monitoring (Prometheus + Grafana)
- Logging (ELK stack)
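Before moving to staging, it’s worth confirming the cluster is actually healthy. A minimal sanity check, assuming the state bucket and cluster name from the kops command above:

# Ask kops to verify the control plane and nodes against the cluster spec
kops validate cluster --name=prod.k8s.local --state=s3://kops-state-store

# Confirm every node registered and is Ready
kubectl get nodes -o wide

# Confirm system pods (CoreDNS/kube-dns, kube-proxy, etc.) are running
kubectl get pods -n kube-system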
Phase 2: Staging
Converted our apps to Kubernetes manifests:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: api:v1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
Ran staging on Kubernetes for a month. Found and fixed several issues (a debugging sketch follows the list):
- Memory leaks (OOMKilled pods)
- Networking problems (DNS timeouts)
- Storage issues (PersistentVolume configs)
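Most of these surfaced through the same handful of commands; a minimal debugging sketch (pod names are placeholders, and kubectl top needs a metrics pipeline such as metrics-server):

# Spot restarts and OOMKilled containers
kubectl get pods -l app=api
kubectl describe pod <pod-name>   # look for Last State: Terminated, Reason: OOMKilled

# Compare real usage against requests/limits
kubectl top pods -l app=api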
Phase 3: Production Migration
We didn’t do a big-bang migration. Instead:
Week 1: 10% of traffic to Kubernetes
Week 2: 25% of traffic
Week 3: 50% of traffic
Week 4: 100% of traffic
Used weighted DNS to split traffic between EC2 and Kubernetes.
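On AWS, weighted DNS in practice means Route 53 weighted record sets. A hedged sketch of what the 10% step could look like (the hosted zone ID, record names, and load balancer targets are all placeholders):

# Two weighted records for the same name: 90% to the EC2 ELB, 10% to the Kubernetes ingress LB
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXXXXXXXXXX \
  --change-batch '{
    "Changes": [
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "app.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "ec2", "Weight": 90,
        "ResourceRecords": [{"Value": "ec2-elb.example.com"}]}},
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "app.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "k8s", "Weight": 10,
        "ResourceRecords": [{"Value": "k8s-ingress-elb.example.com"}]}}
    ]
  }'

Keeping the TTL low means each weight change, and any rollback, takes effect within minutes.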
What Went Wrong
1. Networking Issues
Intermittent connection timeouts. Spent a week debugging. Turned out to be an MTU mismatch between the EC2 and Kubernetes networks.
Fix:
# Set MTU on all nodes
ip link set dev eth0 mtu 1450
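The one-liner doesn’t survive a reboot or a fresh node, so the setting also has to land in node provisioning, and it’s worth verifying where fragmentation actually happens before trusting it. A sketch (interface name and target IP are placeholders):

# Confirm the interface MTU actually changed
ip link show eth0 | grep mtu

# Discover the path MTU toward a node/pod on the other side
tracepath 10.0.12.34

# Force "don't fragment" with a large payload to reproduce the timeouts
ping -M do -s 1400 10.0.12.34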
2. Storage Problems
Lost data during a node failure. We had misconfigured the PersistentVolume reclaim policy.
# Wrong: the PV (and the backing volume) is deleted along with the claim
reclaimPolicy: Delete
# Right: keep the volume and its data for manual recovery
reclaimPolicy: Retain
Now we use RDS for databases instead of running them in Kubernetes.
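For the stateful pieces that do stay in the cluster, the safer default is a StorageClass that retains volumes. A minimal sketch for EBS-backed volumes (the class name is a placeholder; the same idea applies to a PersistentVolume’s persistentVolumeReclaimPolicy field):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-retain
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
# Keep the EBS volume (and its data) even after the claim is deleted
reclaimPolicy: Retain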
3. DNS Failures
CoreDNS crashed under load. We increased the replica count and set resource requests and limits on its Deployment in kube-system (abridged below):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
spec:
  replicas: 5  # Was 2
  template:
    spec:
      containers:
        - name: coredns
          resources:
            limits:
              memory: 170Mi
            requests:
              cpu: 100m
              memory: 70Mi
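After the change, two quick checks tell you whether DNS is healthy: watch the CoreDNS pods in kube-system and resolve a service name from inside the cluster. A sketch (busybox:1.28 is used because later tags ship a less reliable nslookup):

# CoreDNS sits behind the kube-dns label in kube-system
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50

# Resolve a service name from a throwaway pod
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default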
4. Certificate Expiry
Let’s Encrypt certs expired because cert-manager silently failed to renew them. Added a Prometheus alert:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-expiry
spec:
  groups:
    - name: certificates
      rules:
        - alert: CertificateExpiringSoon
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 604800
          annotations:
            summary: "Certificate expiring in less than 7 days"
The Results
Before (EC2):
- 20 m4.large instances
- 30% average CPU utilization
- 30-minute deployments
- Manual scaling
- Cost: $3,000/month
After (Kubernetes):
- 12 m4.large instances
- 60% average CPU utilization
- 5-minute deployments
- Auto-scaling
- Cost: $2,000/month
Savings: $1,000/month (33%)
Lessons Learned
1. Start Small
Don’t migrate everything at once. Start with non-critical services.
2. Monitor Everything
Set up monitoring before migration. You can’t debug what you can’t see.
3. Test Failure Scenarios
Kill pods, kill nodes, simulate network failures. Find problems before production.
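Even crude drills help. A minimal sketch of what we mean, assuming an app=api label selector and run against staging rather than production (the node name is a placeholder):

# Kill a random api pod and watch it get rescheduled
kubectl delete "$(kubectl get pods -l app=api -o name | shuf -n 1)"
kubectl get pods -l app=api -w

# Take a node out of service to see how workloads move
kubectl drain <node-name> --ignore-daemonsets
kubectl uncordon <node-name>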
4. Use Managed Databases
Don’t run databases in Kubernetes. Use RDS, ElastiCache, etc.
5. Gradual Migration
Use weighted DNS or feature flags to gradually shift traffic.
6. Document Everything
We created runbooks for common issues. Saved us during incidents.
Would We Do It Again?
Yes, but we’d do some things differently:
- Use EKS instead of kops (it didn’t exist when we started)
- Set up monitoring earlier
- Spend more time on staging
- Use Helm from the start
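On the Helm point: even a thin chart removes a lot of copy-pasted YAML between environments. A hedged sketch of how we’d start today (chart, release, and values-file names are placeholders):

# Scaffold a chart and template the Deployment/Service from values
helm create api

# Install or upgrade the release with environment-specific values
helm upgrade --install api ./api -f values-production.yaml --namespace production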
Advice for Others
Do migrate if:
- You have multiple services
- You need auto-scaling
- You have a team that can learn Kubernetes
- You’re on cloud (AWS, GCP, Azure)
Don’t migrate if:
- You have a simple monolith
- Your team is small (< 3 people)
- EC2 is working fine
- You don’t have time to learn
Kubernetes is powerful but complex. Make sure the benefits outweigh the costs.
Questions? Ask away!