Go Performance Profiling: Using pprof to Optimize CPU and Memory
Our Go API was using 8 CPU cores at 80% load. Memory usage was 4GB and growing. Response time was 500ms. We needed to optimize but didn’t know where to start.
I used pprof to profile the application. Found the bottlenecks. Optimized them. Now: 3 cores at 40%, 1.2GB memory, 150ms response time. Here’s how.
The Problem
The API was slow, and we didn't know why:
- High CPU usage
- High memory usage
- Slow response times
- No idea where the bottleneck is
We needed data.
CPU Profiling
Enable pprof in your app:
package main

import (
    "log"
    "net/http"

    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Serve pprof on a separate port, bound to localhost only,
    // so the profiling endpoints are never exposed publicly.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // Start your server
    startServer()
}

Once it's running, http://localhost:6060/debug/pprof/ lists every available profile.
Collect CPU profile:
# Profile for 30 seconds
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
Interactive mode:
(pprof) top
Showing nodes accounting for 2.5s, 83.33% of 3s total
      flat  flat%   sum%        cum   cum%
      1.2s 40.00% 40.00%       1.5s 50.00%  encoding/json.(*encodeState).string
      0.8s 26.67% 66.67%       0.8s 26.67%  runtime.mallocgc
      0.5s 16.67% 83.33%       0.5s 16.67%  regexp.(*Regexp).FindStringSubmatch
flat: time spent in the function itself
cum: cumulative time, i.e. the function plus everything it calls
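Beyond top, a few other interactive commands are worth knowing (getUsers here is just an illustrative name; the commands take a regex matching function names):

(pprof) list getUsers   # annotated source for matching functions
(pprof) peek getUsers   # callers and callees of matching functions
(pprof) web             # render the call graph in a browser (requires Graphviz)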
Visualize Profile
Generate graph:
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"
This opens an interactive web UI in the browser, with flame graph, call graph, and annotated source views.
Memory Profiling
Heap profile:
go tool pprof http://localhost:6060/debug/pprof/heap
(pprof) top
Showing nodes accounting for 512MB, 85.33% of 600MB total
      flat  flat%   sum%        cum   cum%
     256MB 42.67% 42.67%      256MB 42.67%  main.loadData
     128MB 21.33% 64.00%      128MB 21.33%  encoding/json.Unmarshal
     128MB 21.33% 85.33%      128MB 21.33%  bytes.(*Buffer).Grow
Allocation profile:
go tool pprof http://localhost:6060/debug/pprof/allocs
Shows cumulative allocations since the program started, including memory that has already been freed. The heap profile, by contrast, defaults to live (in-use) memory.
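The heap endpoint actually carries several sample types in one profile; -sample_index switches between them:

go tool pprof -sample_index=inuse_space http://localhost:6060/debug/pprof/heap   # live memory (default)
go tool pprof -sample_index=alloc_space http://localhost:6060/debug/pprof/heap   # total ever allocated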
Goroutine Profiling
Check for goroutine leaks:
curl "http://localhost:6060/debug/pprof/goroutine?debug=1"
Or:
go tool pprof http://localhost:6060/debug/pprof/goroutine
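A typical leak looks like this: goroutines blocked forever on a channel nobody writes to. In the debug=1 output they show up as an ever-growing group parked at the same chan receive (a minimal sketch):

// Leaks one goroutine per call: nothing ever sends on ch, so the
// receive blocks forever and accumulates in the goroutine profile.
func leaky() {
    ch := make(chan int)
    go func() {
        <-ch // parked in "chan receive" for the life of the process
    }()
}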
Real-World Example
Problem: JSON encoding is slow
Profile shows:
1.2s encoding/json.(*encodeState).string
Before:
func getUsers(w http.ResponseWriter, r *http.Request) {
    users := []User{}
    db.Find(&users)
    // Slow!
    json.NewEncoder(w).Encode(users)
}
After (use a faster JSON library):
import jsoniter "github.com/json-iterator/go"

// Drop-in replacement: shadows encoding/json at package level
var json = jsoniter.ConfigCompatibleWithStandardLibrary

func getUsers(w http.ResponseWriter, r *http.Request) {
    users := []User{}
    db.Find(&users)
    // 3x faster!
    json.NewEncoder(w).Encode(users)
}
Result: 1.2s → 0.4s (67% faster)
Benchmarking
Write benchmarks:
func BenchmarkGetUsers(b *testing.B) {
    // getUsers is the HTTP handler from above; drive it via net/http/httptest.
    req := httptest.NewRequest(http.MethodGet, "/users", nil)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        getUsers(httptest.NewRecorder(), req)
    }
}
Run:
go test -bench=. -benchmem
Output:
BenchmarkGetUsers-8 1000 1200000 ns/op 512000 B/op 1000 allocs/op
ns/op: Nanoseconds per operation
B/op: Bytes allocated per operation
allocs/op: Allocations per operation
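To compare before/after runs with some statistical rigor, benchstat (from golang.org/x/perf/cmd/benchstat) works well:

go install golang.org/x/perf/cmd/benchstat@latest
go test -bench=. -benchmem -count=10 > old.txt
# ...apply the optimization...
go test -bench=. -benchmem -count=10 > new.txt
benchstat old.txt new.txt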
Memory Optimization
Problem: Too many allocations
// Before: 1000 allocations
func processData(items []string) []string {
    result := []string{}
    for _, item := range items {
        result = append(result, strings.ToUpper(item))
    }
    return result
}
After: Pre-allocate slice
// After: 1 allocation
func processData(items []string) []string {
    result := make([]string, 0, len(items))
    for _, item := range items {
        result = append(result, strings.ToUpper(item))
    }
    return result
}
Benchmark:
Before: 1000 allocs/op, 128KB/op
After: 1 allocs/op, 8KB/op
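A benchmark sketch that would produce numbers like these (the 1,000-item fixture is an assumption for illustration):

// items is a hypothetical fixture of 1,000 strings.
var items = make([]string, 1000)

func init() {
    for i := range items {
        items[i] = "some value"
    }
}

func BenchmarkProcessData(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        processData(items)
    }
}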
String Concatenation
Bad:
func buildString(items []string) string {
    result := ""
    for _, item := range items {
        result += item + "," // Creates a new string on every iteration!
    }
    return result
}
Good:
func buildString(items []string) string {
    var builder strings.Builder
    builder.Grow(len(items) * 10) // Pre-allocate, assuming ~10 bytes per item
    for _, item := range items {
        builder.WriteString(item)
        builder.WriteString(",")
    }
    return builder.String()
}
Benchmark:
Bad: 10000 allocs/op, 5MB/op, 50ms
Good: 1 allocs/op, 100KB/op, 1ms
For joining with a separator, strings.Join(items, ",") is simpler still: it pre-sizes its buffer internally and omits the trailing comma the loop above leaves behind.
Reduce Allocations
Use sync.Pool for frequently allocated objects:
var bufferPool = sync.Pool{
    New: func() interface{} {
        return new(bytes.Buffer)
    },
}

func processRequest(data []byte) []byte {
    // Get a buffer from the pool and return it when done
    buf := bufferPool.Get().(*bytes.Buffer)
    defer bufferPool.Put(buf)
    buf.Reset()
    buf.Write(data)
    // Process...

    // Copy the result out: buf.Bytes() aliases the pooled buffer's
    // memory, which will be overwritten once the buffer is reused.
    out := make([]byte, buf.Len())
    copy(out, buf.Bytes())
    return out
}
Escape Analysis
Check if variables escape to heap:
go build -gcflags="-m" main.go
Output:
./main.go:10:6: moved to heap: user
./main.go:15:6: can inline getUser
Heap allocations are slower than stack allocations and add GC pressure.
Avoid escaping:
// Bad: escapes to heap
func getUser() *User {
    user := User{Name: "Alice"}
    return &user // Escapes!
}

// Good: stays on stack
func getUser() User {
    return User{Name: "Alice"}
}
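Running the same -gcflags="-m" build against the two versions shows the difference (abbreviated; exact line/column positions vary, and inlining can change the picture):

# pointer-returning version
./main.go:6:2: moved to heap: user

# value-returning version: no "moved to heap" diagnostic for user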
Continuous Profiling
Production profiling:
import (
    "fmt"
    "log"
    "os"
    "runtime/pprof"
    "time"
)

func startContinuousProfiling() {
    go func() {
        for {
            // CPU profile: sample for 30 seconds
            f, err := os.Create(fmt.Sprintf("cpu-%d.prof", time.Now().Unix()))
            if err != nil {
                log.Println(err)
                continue
            }
            pprof.StartCPUProfile(f)
            time.Sleep(30 * time.Second)
            pprof.StopCPUProfile()
            f.Close()

            // Heap profile: a point-in-time snapshot
            f, err = os.Create(fmt.Sprintf("heap-%d.prof", time.Now().Unix()))
            if err != nil {
                log.Println(err)
                continue
            }
            pprof.WriteHeapProfile(f)
            f.Close()

            time.Sleep(5 * time.Minute)
        }
    }()
}
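The saved files can be inspected later with the same tooling (the timestamped filename here is illustrative):

go tool pprof -http=:8080 cpu-1700000000.prof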
Trace Analysis
Detailed execution trace:
import "runtime/trace"
func main() {
f, _ := os.Create("trace.out")
defer f.Close()
trace.Start(f)
defer trace.Stop()
// Your code
}
Analyze:
go tool trace trace.out
Shows:
- Goroutine execution
- Network blocking
- Syscalls
- GC events
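With the net/http/pprof server from earlier, you can also grab a trace from a running process without touching the code:

curl -o trace.out "http://localhost:6060/debug/pprof/trace?seconds=5"
go tool trace trace.out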
Real Production Optimizations
1. Reduce JSON marshaling:
// Before: Marshal on every request
func getUser(w http.ResponseWriter, r *http.Request) {
    user := getFromDB()
    json.NewEncoder(w).Encode(user) // Slow!
}

// After: Cache marshaled JSON.
// sync.Map is safe for concurrent handlers (a plain map here would race).
var userCache sync.Map // user ID -> marshaled JSON

func getUser(w http.ResponseWriter, r *http.Request) {
    id := getUserID(r)
    if cached, ok := userCache.Load(id); ok {
        w.Write(cached.([]byte))
        return
    }
    user := getFromDB()
    data, err := json.Marshal(user)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    userCache.Store(id, data)
    w.Write(data)
    // Note: no invalidation or TTL here; stale entries live until restart.
}
Result: 500ms → 50ms (10x faster)
2. Use string interning:
// Before: Many duplicate strings
type Event struct {
    Type string // "click", "view", "click", "view"...
}

// After: Intern strings so duplicates share one backing copy.
// The mutex makes the pool safe for concurrent use.
var (
    stringPoolMu sync.Mutex
    stringPool   = make(map[string]string)
)

func intern(s string) string {
    stringPoolMu.Lock()
    defer stringPoolMu.Unlock()
    if interned, ok := stringPool[s]; ok {
        return interned
    }
    stringPool[s] = s
    return s
}

type Event struct {
    Type string
}

func NewEvent(typ string) Event {
    return Event{Type: intern(typ)}
}
Result: 4GB memory → 1.2GB (70% reduction)
3. Optimize regex:
// Before: Compile regex on every call
func validateEmail(email string) bool {
    re := regexp.MustCompile(`^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$`)
    return re.MatchString(email)
}

// After: Compile once at package init
var emailRegex = regexp.MustCompile(`^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$`)

func validateEmail(email string) bool {
    return emailRegex.MatchString(email)
}
Result: 100µs → 1µs (100x faster)
Monitoring in Production
Prometheus metrics:
import "github.com/prometheus/client_golang/prometheus"
var (
heapAlloc = prometheus.NewGauge(prometheus.GaugeOpts{
Name: "go_heap_alloc_bytes",
})
numGoroutines = prometheus.NewGauge(prometheus.GaugeOpts{
Name: "go_goroutines",
})
)
func updateMetrics() {
var m runtime.MemStats
runtime.ReadMemStats(&m)
heapAlloc.Set(float64(m.Alloc))
numGoroutines.Set(float64(runtime.NumGoroutine()))
}
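A wiring sketch for the above (the port and refresh interval are arbitrary choices): register the gauges, refresh them on a ticker, and expose /metrics via promhttp.

// Assumes imports: "log", "net/http", "time",
// "github.com/prometheus/client_golang/prometheus",
// "github.com/prometheus/client_golang/prometheus/promhttp"
func main() {
    prometheus.MustRegister(heapAlloc, numGoroutines)

    // Refresh the runtime gauges periodically.
    go func() {
        for range time.Tick(15 * time.Second) {
            updateMetrics()
        }
    }()

    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":2112", nil))
}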
Results
Before:
- CPU: 8 cores @ 80%
- Memory: 4GB
- Response time: 500ms
- Goroutines: 10,000
After:
- CPU: 3 cores @ 40% (60% reduction)
- Memory: 1.2GB (70% reduction)
- Response time: 150ms (70% faster)
- Goroutines: 1,000 (90% reduction)
Lessons Learned
- Profile before optimizing: don't guess
- Focus on hot paths: the 80/20 rule applies
- Benchmark everything: verify improvements
- Pre-allocate slices: avoid repeated growth
- Use sync.Pool for frequently allocated objects
Conclusion
Go’s profiling tools make it easy to find and fix performance bottlenecks. Always profile before optimizing.
Key takeaways:
- Use pprof for CPU and memory profiling
- Write benchmarks to verify improvements
- Pre-allocate slices and use strings.Builder
- Avoid unnecessary allocations
- Monitor production performance
Profile your Go applications. Find the bottlenecks. Optimize them.