How We Cut Our EC2 Costs by 60% Without Sacrificing Performance
Last month our AWS bill hit $12,000 and management wasn’t happy. After two weeks of optimization work, we got it down to $4,800. Here’s how we did it.
The Problem
We were running everything on on-demand instances because “we might need to scale quickly.” Spoiler: we never scaled quickly. We just paid 3x more than we needed to.
Strategy 1: Reserved Instances
This was the low-hanging fruit. We analyzed our usage over 3 months and found that we had a baseline of 15 m4.large instances running 24/7. These were perfect candidates for Reserved Instances.
Before: 15 × $0.10/hour × 730 hours = $1,095/month
After: 15 × $0.065/hour × 730 hours = $712/month (1-year RI, partial upfront)
That’s $383/month saved just by committing to instances we were already running. The ROI was immediate.
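If you want to sanity-check RI pricing from the CLI before committing, something like the following lists 1-year, partial-upfront standard offerings for a given instance type. The region, instance type, and output fields here are just the ones we cared about; adjust to taste.

# List 1-year, partial-upfront standard RI offerings for Linux m4.large in us-east-1
aws ec2 describe-reserved-instances-offerings \
  --region us-east-1 \
  --instance-type m4.large \
  --product-description "Linux/UNIX" \
  --offering-class standard \
  --offering-type "Partial Upfront" \
  --filters Name=duration,Values=31536000 \
  --query 'ReservedInstancesOfferings[].[ReservedInstancesOfferingId,FixedPrice,RecurringCharges[0].Amount]' \
  --output table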
The Gotcha
Reserved Instances are per-region and per-instance-type. We bought RIs for us-east-1 m4.large, then realized half our instances were in us-west-2. Oops. Make sure you know where your instances are before buying.
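A quick way to avoid that mistake is to count where the fleet actually lives before buying anything. A rough sketch (it loops over every region, so it’s slow, but you only run it once):

# Count running instances in each region before buying RIs
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
  count=$(aws ec2 describe-instances \
    --region "$region" \
    --filters Name=instance-state-name,Values=running \
    --query 'length(Reservations[].Instances[])' \
    --output text)
  echo "$region: $count running instances"
done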
Strategy 2: Right-Sizing
We had a bunch of m4.xlarge instances (4 vCPUs, 16GB RAM) running applications that barely used 1 vCPU and 4GB RAM. Classic over-provisioning.
I wrote a quick script to pull CloudWatch metrics:
#!/bin/bash
# Pull average hourly CPU utilization over the last 7 days for every running instance
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[*].Instances[*].[InstanceId]' \
  --output text | while read -r instance; do
  echo "Instance: $instance"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value="$instance" \
    --start-time "$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
    --period 3600 \
    --statistics Average \
    --query 'Datapoints[*].Average' \
    --output text
done
Turns out, 60% of our instances were using less than 20% CPU on average. We downsized (the resize commands themselves are sketched after the list):
- 10 m4.xlarge → m4.large (saved $500/month)
- 5 m4.large → t2.medium (saved $180/month)
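Each resize was just a stop/modify/start cycle. This assumes EBS-backed instances; the instance ID and target type below are placeholders.

# Downsize an EBS-backed instance: stop it, change the type, start it again
aws ec2 stop-instances --instance-ids i-0abc123
aws ec2 wait instance-stopped --instance-ids i-0abc123
aws ec2 modify-instance-attribute --instance-id i-0abc123 --instance-type Value=m4.large
aws ec2 start-instances --instance-ids i-0abc123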
Strategy 3: Auto-Scaling
We had instances running at full capacity overnight when traffic was basically zero, so we set up scheduled scaling actions to drop to 50% capacity during off-hours (10 PM - 6 AM).
{
  "ScheduledActions": [
    {
      "ScheduledActionName": "scale-down-night",
      "Recurrence": "0 22 * * *",
      "MinSize": 5,
      "MaxSize": 10,
      "DesiredCapacity": 5
    },
    {
      "ScheduledActionName": "scale-up-morning",
      "Recurrence": "0 6 * * *",
      "MinSize": 10,
      "MaxSize": 20,
      "DesiredCapacity": 10
    }
  ]
}
This saved another $400/month.
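For anyone wanting to replicate this, the config above maps to two put-scheduled-update-group-action calls, roughly like the following. The ASG name is a placeholder, and note that the Recurrence cron is evaluated in UTC unless you specify a time zone.

# Register the night/morning scheduled actions against the ASG ("web-asg" is a placeholder)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name web-asg \
  --scheduled-action-name scale-down-night \
  --recurrence "0 22 * * *" \
  --min-size 5 --max-size 10 --desired-capacity 5

aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name web-asg \
  --scheduled-action-name scale-up-morning \
  --recurrence "0 6 * * *" \
  --min-size 10 --max-size 20 --desired-capacity 10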
Strategy 4: Spot Instances for Non-Critical Workloads
We run batch jobs for data processing. These don’t need to complete immediately, so they’re perfect for Spot Instances.
Moved our batch processing from on-demand m4.large ($0.10/hour) to spot instances (average $0.03/hour). That’s 70% savings on compute for batch jobs.
The catch: Spot Instances can be terminated with two minutes' notice. Make sure your jobs can handle interruption gracefully.
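One way to handle that: the interruption notice shows up in the instance metadata service about two minutes before the reclaim, so a small watcher on each spot instance can trigger a graceful shutdown. A minimal sketch, where checkpoint_job is a stand-in for whatever shutdown hook your batch job has:

#!/bin/bash
# Poll the instance metadata service for a spot interruption notice.
# The path returns 404 until AWS schedules a reclaim, then returns a JSON payload.
while true; do
  if curl -s -f http://169.254.169.254/latest/meta-data/spot/instance-action > /dev/null; then
    echo "Spot interruption notice received, checkpointing..."
    checkpoint_job   # placeholder: flush state, drain in-flight work, etc.
    break
  fi
  sleep 5
done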
Strategy 5: Terminate Zombie Instances
Found 8 instances that nobody knew about. They were created for testing 6 months ago and forgotten. Terminated them immediately.
Savings: $600/month
Pro tip: Tag everything. We now have a policy that any instance without proper tags gets terminated after 7 days.
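Enforcing that policy is easier with a script that flags untagged instances. Something like this works (assumes jq is installed and that "Owner" is the tag your policy requires):

# List instance IDs that are missing an "Owner" tag
aws ec2 describe-instances --output json \
  | jq -r '.Reservations[].Instances[]
           | select(((.Tags // []) | map(.Key) | index("Owner")) | not)
           | .InstanceId'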
The Results
Before: $12,000/month
After: $4,800/month
Savings: 60%
Breakdown:
- Reserved Instances: $2,300/month saved
- Right-sizing: $680/month saved
- Auto-scaling: $400/month saved
- Spot instances: $220/month saved
- Terminated zombies: $600/month saved
- Other optimizations: $3,000/month saved
Lessons Learned
- Monitor Everything: If you’re not tracking it, you can’t optimize it. Set up CloudWatch dashboards.
- Start with the Obvious: Reserved Instances and right-sizing are easy wins. Do these first.
- Automate: Manual scaling doesn’t work. People forget. Automation doesn’t.
- Review Regularly: Set a calendar reminder to review costs monthly. It’s easy for waste to creep back in.
- Tag Everything: Seriously. Tags make it possible to track costs by project, team, environment, etc.
Tools We Use
- AWS Cost Explorer: For analyzing spending patterns
- CloudWatch: For monitoring resource utilization
- Custom scripts: For automated right-sizing recommendations
- Slack bot: Sends daily cost reports to our DevOps channel (a rough sketch is below)
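The Slack bot is nothing fancy; the core of it is roughly this, where SLACK_WEBHOOK_URL is a placeholder for your incoming-webhook URL:

#!/bin/bash
# Pull yesterday's unblended cost from Cost Explorer and post it to Slack
START=$(date -u -d 'yesterday' +%Y-%m-%d)
END=$(date -u +%Y-%m-%d)

COST=$(aws ce get-cost-and-usage \
  --time-period Start=$START,End=$END \
  --granularity DAILY \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Total.UnblendedCost.Amount' \
  --output text)

curl -s -X POST -H 'Content-type: application/json' \
  --data "{\"text\": \"AWS spend for $START: \$${COST}\"}" \
  "$SLACK_WEBHOOK_URL"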
What’s Next
We’re looking into:
- Containerization with ECS to improve resource utilization
- Moving some workloads to Lambda
- Using S3 lifecycle policies to reduce storage costs
If you’re running on AWS and haven’t optimized costs, you’re probably overpaying. Start with Reserved Instances and right-sizing. You’ll see results immediately.
Anyone else have cost optimization wins to share? I’d love to hear them.