Debugging a Production Java Memory Leak at 2 AM
Got paged at 2 AM last night. Production was down. OutOfMemoryError. Fun times.
Here’s how I debugged it, in case you ever find yourself in the same situation.
The Symptoms
Our main API server (Java 8, running on EC2) was crashing every 6 hours with OOM errors. Restarting it would fix the issue temporarily, but it would crash again like clockwork.
Classic memory leak.
Step 1: Get a Heap Dump
First thing: make sure a heap dump gets captured at the next crash.
# Add this to your JVM args
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/heapdump.hprof
Waited for the next crash (didn’t take long), and got my heap dump.
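If you can't afford to wait for a crash, you can also pull a dump from the running process with jmap (swap in your JVM's pid; note that the live option forces a full GC first):
# Capture a heap dump from the live process
jmap -dump:live,format=b,file=/var/log/heapdump.hprof <pid>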
Step 2: Analyze with MAT
Downloaded the heap dump and opened it in Eclipse Memory Analyzer Tool (MAT). This tool is a lifesaver.
The “Leak Suspects” report immediately showed the problem:
One instance of "com.example.CacheManager" loaded by "sun.misc.Launcher$AppClassLoader @ 0x7f8a1c0e8"
occupies 1,847,293,952 (89.23%) bytes.
89% of heap used by one object. That’s not normal.
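Side note: if a dump is too big to open comfortably in the GUI, MAT also ships a headless script (ParseHeapDump.sh in the MAT install directory) that can generate the same report right on the server:
# Parse the dump and produce the Leak Suspects report next to the .hprof
./ParseHeapDump.sh /var/log/heapdump.hprof org.eclipse.mat.api:suspects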
Step 3: Dig Into the Code
Looked at CacheManager:
public class CacheManager {
    private static final Map<String, Object> cache = new HashMap<>();

    public static void put(String key, Object value) {
        cache.put(key, value); // Never removes anything!
    }

    public static Object get(String key) {
        return cache.get(key);
    }
}
There it is. We’re caching API responses but never evicting old entries. Every request adds to the cache, and it grows forever.
Step 4: The Fix
Quick fix for production:
public class CacheManager {
    // Use a bounded cache: access-ordered LinkedHashMap + removeEldestEntry = LRU
    private static final Map<String, Object> cache =
        Collections.synchronizedMap(new LinkedHashMap<String, Object>(1000, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                return size() > 1000;
            }
        });

    // ... rest of the code
}
This limits the cache to 1000 entries using LRU eviction.
Deployed the fix, and memory usage stabilized.
Step 5: The Proper Fix
The quick fix worked, but we really should use a proper caching library:
// Using Guava Cache
private static final LoadingCache<String, Object> cache = CacheBuilder.newBuilder()
.maximumSize(1000)
.expireAfterWrite(10, TimeUnit.MINUTES)
.build(new CacheLoader<String, Object>() {
public Object load(String key) {
return fetchFromAPI(key);
}
});
This gives us:
- Size-based eviction
- Time-based eviction
- Thread safety out of the box
- Better performance under concurrent access than the synchronizedMap wrapper
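Call sites then just ask the cache and let the loader fill misses (getUnchecked is fine here, assuming fetchFromAPI doesn't throw checked exceptions):
Object value = cache.getUnchecked("some-key"); // loads via fetchFromAPI on a miss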
Lessons Learned
1. Always Set Max Heap Size
We were running with default heap settings. Now we explicitly set:
-Xms2g -Xmx2g
This makes OOM errors show up sooner and at a predictable point, instead of the JVM growing toward whatever the instance's default heap limit happens to be.
2. Monitor Memory Usage
We added CloudWatch alarms for heap usage:
# Alert if average heap usage > 80% for 5 minutes
# (HeapMemoryUsage is our own custom metric; adjust --namespace to wherever you publish JVM metrics)
aws cloudwatch put-metric-alarm \
  --alarm-name high-heap-usage \
  --namespace MyApp/JVM \
  --metric-name HeapMemoryUsage \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold
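CloudWatch doesn't know about JVM heap on its own, so the number behind that alarm is something the app publishes itself. A minimal sketch of where the value comes from (the class name and the idea of pushing it as a custom metric are ours, not a built-in AWS API):
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapMetrics {
    // Percent of max heap in use; this is the value we push to CloudWatch
    // as the custom HeapMemoryUsage metric on a one-minute schedule.
    public static double heapUsedPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return 100.0 * heap.getUsed() / heap.getMax(); // getMax() is well-defined once -Xmx is set
    }
}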
3. Use Proper Data Structures
Don’t roll your own cache. Use Guava, Caffeine, or EHCache. They handle all the edge cases you’ll forget about.
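For reference, the Caffeine version of the Step 5 snippet is nearly identical (a sketch, using the same hypothetical fetchFromAPI loader):
// Caffeine equivalent (imports from com.github.benmanes.caffeine.cache)
private static final LoadingCache<String, Object> cache = Caffeine.newBuilder()
    .maximumSize(1000)
    .expireAfterWrite(10, TimeUnit.MINUTES)
    .build(key -> fetchFromAPI(key));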
4. Load Test
We load tested after the fix and found the optimal cache size for our workload (2000 entries, 15-minute TTL).
5. Document Your Caching Strategy
Added comments explaining:
- Why we’re caching
- What the eviction policy is
- What the expected hit rate is
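Even a short class-level comment answers most of the 2 AM questions. Ours looks roughly like this (the numbers come from the load test above; fill in the hit rate from your own metrics):
/**
 * Caches upstream API responses so repeated requests don't hit the API again.
 * Eviction: LRU, max 2000 entries, 15-minute TTL (tuned via load testing).
 * Expected hit rate: <from the cache metrics dashboard>.
 */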
Tools I Used
- Eclipse MAT: For heap dump analysis
- VisualVM: For live monitoring (after the fact)
- jmap: For capturing heap dumps
- jstat: For monitoring GC activity
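The jstat incantation I lean on: if the old-generation column (O) keeps climbing even after full GCs, you're probably leaking:
# GC utilization percentages, sampled every 1000 ms
jstat -gcutil <pid> 1000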
Prevention
To prevent this in the future:
- Code review caught similar issues in other classes
- Added unit tests for cache eviction (sketch after this list)
- Set up monitoring for all caches
- Documented caching best practices for the team
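The eviction test is nothing fancy; a rough sketch, assuming the bounded CacheManager from Step 4 and JUnit 4 on the classpath:
import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertNull;

import org.junit.Test;

public class CacheManagerTest {

    @Test
    public void oldEntriesAreEvictedOnceTheCapIsHit() {
        for (int i = 0; i < 10_000; i++) {
            CacheManager.put("key-" + i, "value-" + i);
        }
        // With a 1000-entry LRU cap, the earliest keys must be gone...
        assertNull(CacheManager.get("key-0"));
        // ...and the most recent ones should still be cached.
        assertNotNull(CacheManager.get("key-9999"));
    }
}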
The Aftermath
Total downtime: 3 hours
Sleep lost: 4 hours
Lessons learned: Priceless
Memory leaks are sneaky. They don’t show up in development because you’re not running long enough. Always think about what happens when your code runs for days or weeks.
And for the love of all that is holy, use a proper caching library. Don’t be like me at 2 AM.
Anyone else have fun production debugging stories? Share them in the comments. Misery loves company.