Last month our AWS bill hit $12,000 and management wasn’t happy. After two weeks of optimization work, we got it down to $4,800. Here’s how we did it.

The Problem

We were running everything on on-demand instances because “we might need to scale quickly.” Spoiler: we never scaled quickly. We just paid 3x more than we needed to.

Strategy 1: Reserved Instances

This was the low-hanging fruit. We analyzed our usage over 3 months and found that we had a baseline of 15 m4.large instances running 24/7. These were perfect candidates for Reserved Instances.

Before: 15 × $0.10/hour × 730 hours = $1,095/month
After: 15 × $0.065/hour × 730 hours = $712/month (1-year RI, partial upfront)

That’s $383/month saved just by committing to instances we were already running, and that was only the m4.large baseline; buying RIs for the rest of our steady-state fleet is how that line item grows to the $2,300/month figure in the breakdown at the end. The ROI was immediate.
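
If you’d rather not eyeball months of usage by hand, Cost Explorer can generate RI purchase recommendations from recent usage. A minimal sketch; the lookback window, term, and payment option below are just the settings we’d pick, not the only options:

#!/bin/bash
# Ask Cost Explorer for EC2 Reserved Instance purchase recommendations.
aws ce get-reservation-purchase-recommendation \
    --service "Amazon Elastic Compute Cloud - Compute" \
    --lookback-period-in-days SIXTY_DAYS \
    --term-in-years ONE_YEAR \
    --payment-option PARTIAL_UPFRONT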

The Gotcha

Reserved Instances are per-region and per-instance-type. We bought RIs for us-east-1 m4.large, then realized half our instances were in us-west-2. Oops. Make sure you know where your instances are before buying.
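
A quick way to avoid repeating our mistake: inventory running instances by type and Availability Zone in every region before you buy. A rough sketch (slow, since it loops over all regions, but you only need to run it once):

#!/bin/bash
# Count running instances per type and AZ, region by region.
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
    echo "== $region =="
    aws ec2 describe-instances --region "$region" \
        --filters Name=instance-state-name,Values=running \
        --query 'Reservations[].Instances[].[InstanceType,Placement.AvailabilityZone]' \
        --output text | sort | uniq -c
done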

Strategy 2: Right-Sizing

We had a bunch of m4.xlarge instances (4 vCPUs, 16GB RAM) running applications that barely used 1 vCPU and 4GB RAM. Classic over-provisioning.

I wrote a quick script to pull CloudWatch metrics:

#!/bin/bash
# Print the 7-day average CPU utilization for every running instance.
# Note: uses GNU date; on macOS, swap `date -u -d '7 days ago'` for `date -u -v-7d`.

aws ec2 describe-instances \
    --filters Name=instance-state-name,Values=running \
    --query 'Reservations[*].Instances[*].[InstanceId]' \
    --output text | while read -r instance; do
    # Pull hourly CPU averages for the past week, then average them.
    avg=$(aws cloudwatch get-metric-statistics \
        --namespace AWS/EC2 \
        --metric-name CPUUtilization \
        --dimensions Name=InstanceId,Value="$instance" \
        --start-time "$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S)" \
        --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
        --period 3600 \
        --statistics Average \
        --query 'Datapoints[*].Average' \
        --output text | awk '{ for (i = 1; i <= NF; i++) { sum += $i; n++ } } END { if (n) printf "%.1f", sum / n }')
    echo "Instance: $instance  7-day avg CPU: ${avg:-N/A}%"
done

Turns out, 60% of our instances were using less than 20% CPU on average. We downsized:

  • 10 m4.xlarge → m4.large (saved $500/month)
  • 5 m4.large → t2.medium (there’s no m4 smaller than large, so these went to burstables; saved $180/month)
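
Each resize itself is just a stop / modify / start cycle, which means a short outage per instance. A rough sketch of one resize (the instance ID and target type below are placeholders):

#!/bin/bash
# Resize one instance; it must be stopped before its type can be changed.
INSTANCE_ID=i-0123456789abcdef0   # placeholder
NEW_TYPE=m4.large                 # placeholder

aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID"
aws ec2 modify-instance-attribute --instance-id "$INSTANCE_ID" --instance-type Value="$NEW_TYPE"
aws ec2 start-instances --instance-ids "$INSTANCE_ID"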

Strategy 3: Auto-Scaling

We had instances running at full capacity overnight when traffic was basically zero, so we set up scheduled scaling actions to drop to 50% capacity during off-hours (10 PM to 6 AM). The schedule looks like this:

{
  "ScheduledActions": [
    {
      "ScheduledActionName": "scale-down-night",
      "Recurrence": "0 22 * * *",
      "MinSize": 5,
      "MaxSize": 10,
      "DesiredCapacity": 5
    },
    {
      "ScheduledActionName": "scale-up-morning",
      "Recurrence": "0 6 * * *",
      "MinSize": 10,
      "MaxSize": 20,
      "DesiredCapacity": 10
    }
  ]
}
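
Under the hood this is just two scheduled actions on the Auto Scaling group. Roughly (the group name below is a placeholder, and keep in mind the Recurrence cron is evaluated in UTC unless you set a time zone):

# Scale down to half capacity every night at 22:00
aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name my-web-asg \
    --scheduled-action-name scale-down-night \
    --recurrence "0 22 * * *" \
    --min-size 5 --max-size 10 --desired-capacity 5

# Restore full capacity every morning at 06:00
aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name my-web-asg \
    --scheduled-action-name scale-up-morning \
    --recurrence "0 6 * * *" \
    --min-size 10 --max-size 20 --desired-capacity 10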

This saved another $400/month.

Strategy 4: Spot Instances for Non-Critical Workloads

We run batch jobs for data processing. These don’t need to complete immediately, so they’re perfect for Spot Instances.

We moved our batch processing from on-demand m4.large ($0.10/hour) to Spot Instances (averaging around $0.03/hour). That’s roughly 70% savings on compute for batch jobs.

The catch: Spot Instances can be reclaimed with only two minutes’ notice, so make sure your jobs can handle interruption gracefully. AWS surfaces the interruption notice through instance metadata, which makes it straightforward to watch for.
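
A minimal watcher sketch, assuming IMDSv1 is reachable and that your batch framework gives you some way to drain (the two functions called at the end are placeholders):

#!/bin/bash
# Poll instance metadata for a spot interruption notice and drain gracefully.
while true; do
    # The endpoint returns 404 until AWS schedules an interruption for this instance.
    code=$(curl -s -o /dev/null -w '%{http_code}' \
        http://169.254.169.254/latest/meta-data/spot/instance-action)
    if [ "$code" = "200" ]; then
        echo "Spot interruption notice received, draining..."
        stop_taking_new_work   # placeholder: stop pulling new batch work
        checkpoint_job         # placeholder: save progress so another node can resume
        exit 0
    fi
    sleep 5
done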

Strategy 5: Terminate Zombie Instances

Found 8 instances that nobody knew about. They were created for testing 6 months ago and forgotten. Terminated them immediately.

Savings: $600/month

Pro tip: Tag everything. We now have a policy that any instance without proper tags gets terminated after 7 days.
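
Enforcing that policy starts with finding the untagged instances. A sketch that flags running instances missing an "owner" tag (the tag key is just an example; use whatever your tagging policy requires):

#!/bin/bash
# List running instances that have no "owner" tag.
for id in $(aws ec2 describe-instances \
        --filters Name=instance-state-name,Values=running \
        --query 'Reservations[].Instances[].InstanceId' --output text); do
    owner=$(aws ec2 describe-tags \
        --filters "Name=resource-id,Values=$id" "Name=key,Values=owner" \
        --query 'Tags[0].Value' --output text)
    if [ "$owner" = "None" ]; then
        echo "Untagged instance: $id"
    fi
done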

The Results

Before: $12,000/month
After: $4,800/month
Savings: 60%

Breakdown:

  • Reserved Instances: $2,300/month saved
  • Right-sizing: $680/month saved
  • Auto-scaling: $400/month saved
  • Spot instances: $220/month saved
  • Terminated zombies: $600/month saved
  • Other optimizations: $3,000/month saved

Lessons Learned

  1. Monitor Everything: If you’re not tracking it, you can’t optimize it. Set up CloudWatch dashboards.

  2. Start with the Obvious: Reserved Instances and right-sizing are easy wins. Do these first.

  3. Automate: Manual scaling doesn’t work. People forget. Automation doesn’t.

  4. Review Regularly: Set a calendar reminder to review costs monthly. It’s easy for waste to creep back in.

  5. Tag Everything: Seriously. Tags make it possible to track costs by project, team, environment, etc.

Tools We Use

  • AWS Cost Explorer: For analyzing spending patterns
  • CloudWatch: For monitoring resource utilization
  • Custom scripts: For automated right-sizing recommendations
  • Slack bot: Sends daily cost reports to our DevOps channel
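
For anyone curious, the daily report boils down to a single Cost Explorer query for yesterday’s spend per service (a sketch; the Slack webhook part is omitted, and the date commands use GNU syntax):

#!/bin/bash
# Yesterday's spend per service, via Cost Explorer.
START=$(date -u -d 'yesterday' +%Y-%m-%d)
END=$(date -u +%Y-%m-%d)

aws ce get-cost-and-usage \
    --time-period Start=$START,End=$END \
    --granularity DAILY \
    --metrics UnblendedCost \
    --group-by Type=DIMENSION,Key=SERVICE \
    --query 'ResultsByTime[0].Groups[].[Keys[0],Metrics.UnblendedCost.Amount]' \
    --output text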

What’s Next

We’re looking into:

  • Containerization with ECS to improve resource utilization
  • Moving some workloads to Lambda
  • Using S3 lifecycle policies to reduce storage costs

If you’re running on AWS and haven’t optimized costs, you’re probably overpaying. Start with Reserved Instances and right-sizing. You’ll see results immediately.

Anyone else have cost optimization wins to share? I’d love to hear them.