Got paged at 2 AM last night. Production was down. OutOfMemoryError. Fun times.

Here’s how I debugged it, in case you ever find yourself in the same situation.

The Symptoms

Our main API server (Java 8, running on EC2) was crashing every 6 hours with OOM errors. Restarting it would fix the issue temporarily, but it would crash again like clockwork.

Classic memory leak.

Step 1: Get a Heap Dump

First thing: capture a heap dump before the next crash.

# Add this to your JVM args
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/heapdump.hprof

Waited for the next crash (didn’t take long), and got my heap dump.
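If you don't want to wait for a crash, you can also trigger a dump on demand: jmap -dump works from the shell, and the same hook is available programmatically through the HotSpot diagnostic MXBean. A minimal sketch (not what I ran that night, just the idea; the class name is made up):

```java
import java.io.File;
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    // Writes an .hprof snapshot; pass true to dump only live (reachable) objects,
    // which is usually what you want for leak hunting.
    public static File dump(boolean liveOnly) throws Exception {
        HotSpotDiagnosticMXBean mx =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // dumpHeap refuses to overwrite, so pick a fresh path (must end in .hprof)
        File out = new File(System.getProperty("java.io.tmpdir"),
                            "heapdump-" + System.nanoTime() + ".hprof");
        mx.dumpHeap(out.getAbsolutePath(), liveOnly);
        return out;
    }

    public static void main(String[] args) throws Exception {
        File f = dump(true);
        System.out.println("dumped " + f.length() + " bytes to " + f);
    }
}
```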

Step 2: Analyze with MAT

Downloaded the heap dump and opened it in the Eclipse Memory Analyzer (MAT). This tool is a lifesaver.

The “Leak Suspects” report immediately showed the problem:

One instance of "com.example.CacheManager" loaded by "sun.misc.Launcher$AppClassLoader @ 0x7f8a1c0e8"
occupies 1,847,293,952 (89.23%) bytes.

89% of heap used by one object. That’s not normal.

Step 3: Dig Into the Code

Looked at CacheManager:

public class CacheManager {
    private static final Map<String, Object> cache = new HashMap<>();
    
    public static void put(String key, Object value) {
        cache.put(key, value);  // Never removes anything!
    }
    
    public static Object get(String key) {
        return cache.get(key);
    }
}

There it is. We’re caching API responses but never evicting old entries. Every request adds to the cache, and it grows forever.
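A scaled-down repro makes the failure mode obvious: every request contributes a unique key, nothing is ever removed, and every entry stays strongly reachable from a static field, so the GC can never reclaim any of it.

```java
import java.util.HashMap;
import java.util.Map;

public class LeakDemo {
    private static final Map<String, Object> cache = new HashMap<>();

    // Simulate N requests, each caching under a unique key (think request IDs).
    static int simulate(int requests) {
        for (int i = 0; i < requests; i++) {
            cache.put("request-" + i, new byte[128]); // put, never remove
        }
        return cache.size();
    }

    public static void main(String[] args) {
        // prints 100000: every single entry is still there
        System.out.println("entries after 100k requests: " + simulate(100_000));
    }
}
```

Swap the 128-byte payload for a real API response and run it for six hours, and you get my 2 AM page.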

Step 4: The Fix

Quick fix for production:

public class CacheManager {
    // Use a bounded cache: accessOrder=true turns insertion order into LRU order
    private static final Map<String, Object> cache =
        Collections.synchronizedMap(new LinkedHashMap<String, Object>(1000, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                return size() > 1000;
            }
        });

    // ... rest of the code
}

This limits the cache to 1000 entries using LRU eviction.
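Here's the same construction with the cap shrunk to 3 so the eviction is visible. The `true` in the constructor is what makes it LRU rather than FIFO: it switches the map to access order, so get() moves an entry to the back of the line.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class LruDemo {
    // Same construction as the production fix, just with a tiny, configurable cap.
    static Map<String, Object> boundedCache(final int max) {
        return Collections.synchronizedMap(new LinkedHashMap<String, Object>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                return size() > max;
            }
        });
    }

    public static void main(String[] args) {
        Map<String, Object> cache = boundedCache(3);
        cache.put("a", 1); cache.put("b", 2); cache.put("c", 3);
        cache.get("a");    // touch "a"; "b" is now the least recently used entry
        cache.put("d", 4); // pushes the cache over the cap, evicting "b"
        System.out.println(cache.keySet()); // prints [c, a, d]
    }
}
```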

Deployed the fix, and memory usage stabilized.

Step 5: The Proper Fix

The quick fix worked, but we really should use a proper caching library:

// Using Guava Cache
private static final LoadingCache<String, Object> cache = CacheBuilder.newBuilder()
    .maximumSize(1000)
    .expireAfterWrite(10, TimeUnit.MINUTES)
    .build(new CacheLoader<String, Object>() {
        @Override
        public Object load(String key) {
            return fetchFromAPI(key);
        }
    });

This gives us:

  • Size-based eviction
  • Time-based eviction
  • Thread safety without wrapping everything in one lock
  • Better concurrent performance than the synchronizedMap quick fix

Lessons Learned

1. Always Set Max Heap Size

We were running with default heap settings. Now we explicitly set:

-Xms2g -Xmx2g

With an explicit cap, OOM errors surface faster and more predictably, and setting min and max equal avoids heap resizing along the way.

2. Monitor Memory Usage

We added CloudWatch alarms for heap usage:

# Alert if heap usage > 80% for 5 minutes
aws cloudwatch put-metric-alarm \
  --alarm-name high-heap-usage \
  --namespace MyApp \
  --metric-name HeapMemoryUsage \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold
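How the number actually gets shipped to CloudWatch depends on your setup (agent, sidecar, or the app itself), but reading heap utilization in-process is straightforward with the platform MemoryMXBean. A sketch, with a hypothetical class name:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapGauge {
    // Current heap utilization as a percentage of the configured max (-Xmx).
    public static double heapUsedPercent() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        long max = heap.getMax();
        // getMax() can be -1 if undefined; fall back to the committed size
        long limit = (max > 0) ? max : heap.getCommitted();
        return 100.0 * heap.getUsed() / limit;
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %.1f%%%n", heapUsedPercent());
    }
}
```

Publish that on a timer and the alarm above has something to fire on.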

3. Use Proper Data Structures

Don’t roll your own cache. Use Guava, Caffeine, or Ehcache. They handle all the edge cases you’ll forget about.

4. Load Test

We load tested after the fix and found the optimal cache size for our workload (2000 entries, 15-minute TTL).
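Our load tests ran against the real service, but the core idea is easy to sketch: replay a skewed key distribution against the bounded cache and compare hit rates at different sizes. Everything below (the distribution, the numbers) is made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

public class CacheSizingSim {
    // Replays `requests` lookups over `keySpace` possible keys against an
    // LRU cache of `cacheSize` entries and returns the observed hit rate.
    public static double hitRate(final int cacheSize, int keySpace, int requests) {
        Map<Integer, Boolean> cache = new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
        Random rnd = new Random(42); // fixed seed for repeatable runs
        int hits = 0;
        for (int i = 0; i < requests; i++) {
            // skewed access pattern: multiplying two uniform draws makes low keys hot
            int key = (int) (keySpace * rnd.nextDouble() * rnd.nextDouble());
            if (cache.get(key) != null) hits++;
            else cache.put(key, Boolean.TRUE);
        }
        return (double) hits / requests;
    }

    public static void main(String[] args) {
        for (int size : new int[] {100, 1000, 2000, 5000}) {
            System.out.printf("size %5d -> hit rate %.2f%n",
                              size, hitRate(size, 100_000, 200_000));
        }
    }
}
```

The real measurement should use your production access logs, not a synthetic distribution; that's how we landed on 2000 entries.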

5. Document Your Caching Strategy

Added comments explaining:

  • Why we’re caching
  • What the eviction policy is
  • What the expected hit rate is

Tools I Used

  • Eclipse MAT: For heap dump analysis
  • VisualVM: For live monitoring (after the fact)
  • jmap: For capturing heap dumps
  • jstat: For monitoring GC activity

Prevention

To prevent this in the future:

  • Code review caught similar issues in other classes
  • Added unit tests for cache eviction
  • Set up monitoring for all caches
  • Documented caching best practices for the team

The Aftermath

Total downtime: 3 hours
Sleep lost: 4 hours
Lessons learned: Priceless

Memory leaks are sneaky. They don’t show up in development because you’re not running long enough. Always think about what happens when your code runs for days or weeks.

And for the love of all that is holy, use a proper caching library. Don’t be like me at 2 AM.

Anyone else have fun production debugging stories? Share them in the comments. Misery loves company.