Our Go API was using 8 CPU cores at 80% load. Memory usage was 4GB and growing. Response time was 500ms. We needed to optimize but didn’t know where to start.

I used pprof to profile the application. Found the bottlenecks. Optimized them. Now: 3 cores at 40%, 1.2GB memory, 150ms response time. Here’s how.

The Problem

The API was slow, and we didn't know why:

  • High CPU usage
  • High memory usage
  • Slow response times
  • No idea where the bottleneck is

We needed data.

CPU Profiling

Enable pprof in your app:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
    // Your application code
    
    // pprof endpoint (bind to localhost; don't expose it publicly)
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    
    // Start your server
    startServer()
}

Collect CPU profile:

# Profile for 30 seconds
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'

Interactive mode:

(pprof) top
Showing nodes accounting for 2.5s, 83.33% of 3s total
      flat  flat%   sum%        cum   cum%
     1.2s 40.00% 40.00%      1.5s 50.00%  encoding/json.(*encodeState).string
     0.8s 26.67% 66.67%      0.8s 26.67%  runtime.mallocgc
     0.5s 16.67% 83.33%      0.5s 16.67%  regexp.(*Regexp).FindStringSubmatch

flat: Time spent in function
cum: Time spent in function + callees
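
To drill into a single hot function, the list command takes a function-name regex and annotates its source line by line (using one of the functions from the top output above):

(pprof) list FindStringSubmatch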

Visualize Profile

Generate graph:

go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30'

Opens browser with interactive flame graph!

Memory Profiling

Heap profile:

go tool pprof http://localhost:6060/debug/pprof/heap
(pprof) top
Showing nodes accounting for 512MB, 85.33% of 600MB total
      flat  flat%   sum%        cum   cum%
    256MB 42.67% 42.67%     256MB 42.67%  main.loadData
    128MB 21.33% 64.00%     128MB 21.33%  encoding/json.Unmarshal
    128MB 21.33% 85.33%     128MB 21.33%  bytes.Buffer.Grow
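
By default the heap endpoint reports in-use space. The standard pprof sample-type flags switch views:

# Memory currently held (the default)
go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap

# Everything ever allocated, freed or not
go tool pprof -alloc_space http://localhost:6060/debug/pprof/heap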

Allocation profile:

go tool pprof http://localhost:6060/debug/pprof/allocs

Shows every allocation made since the program started, including memory that has since been freed. Useful for spotting allocation churn even when in-use memory looks fine.

Goroutine Profiling

Check for goroutine leaks (debug=1 groups goroutines by identical stacks; debug=2 dumps every goroutine's full stack):

curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'

Or:

go tool pprof http://localhost:6060/debug/pprof/goroutine
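
A healthy server keeps a roughly stable goroutine count; a count that only grows points to a leak. The classic pattern is a goroutine stuck on a channel send that nobody will ever receive. A minimal sketch (Result and doWork are hypothetical stand-ins):

import "context"

type Result struct{ Data string }

func doWork() Result { return Result{Data: "done"} }

// Leaks: if ctx expires first, nothing ever receives from ch,
// so the goroutine blocks on its send forever.
func fetch(ctx context.Context) (Result, error) {
    ch := make(chan Result) // unbuffered: the send needs a receiver
    go func() { ch <- doWork() }()
    select {
    case res := <-ch:
        return res, nil
    case <-ctx.Done():
        return Result{}, ctx.Err()
    }
}

// Fix: make(chan Result, 1). The buffered send always succeeds,
// so the worker goroutine can exit even after the caller gives up.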

Real-World Example

Problem: JSON encoding is slow

Profile shows:

1.2s  encoding/json.(*encodeState).string

Before:

func getUsers(w http.ResponseWriter, r *http.Request) {
    users := []User{}
    db.Find(&users)
    
    // Slow!
    json.NewEncoder(w).Encode(users)
}

After (use faster JSON library):

import jsoniter "github.com/json-iterator/go"

// Drop-in replacement: shadows encoding/json for this file
var json = jsoniter.ConfigCompatibleWithStandardLibrary

func getUsers(w http.ResponseWriter, r *http.Request) {
    users := []User{}
    db.Find(&users)
    
    // 3x faster!
    json.NewEncoder(w).Encode(users)
}

Result: 1.2s → 0.4s (67% faster)

Benchmarking

Write benchmarks:

func BenchmarkGetUsers(b *testing.B) {
    // getUsers is an http.HandlerFunc, so drive it with net/http/httptest
    for i := 0; i < b.N; i++ {
        w := httptest.NewRecorder()
        r := httptest.NewRequest(http.MethodGet, "/users", nil)
        getUsers(w, r)
    }
}

Run:

go test -bench=. -benchmem

Output:

BenchmarkGetUsers-8   1000   1200000 ns/op   512000 B/op   1000 allocs/op

ns/op: Nanoseconds per operation
B/op: Bytes allocated per operation
allocs/op: Allocations per operation
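
To compare runs reliably, collect several samples of each version and diff them with benchstat (installable via go install golang.org/x/perf/cmd/benchstat@latest):

go test -bench=. -benchmem -count=10 > old.txt
# apply the optimization, then:
go test -bench=. -benchmem -count=10 > new.txt
benchstat old.txt new.txt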

Memory Optimization

Problem: Too many allocations

// Before: the backing array is reallocated over and over as the slice grows
func processData(items []string) []string {
    result := []string{}
    for _, item := range items {
        result = append(result, strings.ToUpper(item))
    }
    return result
}

After: Pre-allocate slice

// After: a single up-front allocation for the backing array
func processData(items []string) []string {
    result := make([]string, 0, len(items))
    for _, item := range items {
        result = append(result, strings.ToUpper(item))
    }
    return result
}

Benchmark:

Before: 1000 allocs/op, 128KB/op
After:  1 allocs/op, 8KB/op
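
A benchmark along these lines (a sketch; the input data is made up) produces numbers like those when run against each version:

func BenchmarkProcessData(b *testing.B) {
    items := make([]string, 1000)
    for i := range items {
        items[i] = "some input string"
    }
    b.ResetTimer() // don't count setup time
    for i := 0; i < b.N; i++ {
        processData(items)
    }
}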

String Concatenation

Bad:

func buildString(items []string) string {
    result := ""
    for _, item := range items {
        result += item + ","  // Creates new string each time!
    }
    return result
}

Good:

func buildString(items []string) string {
    var builder strings.Builder
    builder.Grow(len(items) * 10)  // Pre-allocate (rough estimate of total size)
    for _, item := range items {
        builder.WriteString(item)
        builder.WriteString(",")
    }
    return builder.String()
}

Benchmark:

Bad:  10000 allocs/op, 5MB/op, 50ms
Good: 1 allocs/op, 100KB/op, 1ms

Reduce Allocations

Use sync.Pool for frequently allocated objects:

var bufferPool = sync.Pool{
    New: func() interface{} {
        return new(bytes.Buffer)
    },
}

func processRequest(data []byte) []byte {
    // Get a buffer from the pool (may be fresh or recycled)
    buf := bufferPool.Get().(*bytes.Buffer)
    defer bufferPool.Put(buf)

    buf.Reset() // recycled buffers still hold their old contents
    buf.Write(data)
    // Process...

    // Copy the result out: once the buffer goes back to the pool it
    // will be reused, so returning buf.Bytes() directly would alias
    // memory that is about to be overwritten.
    out := make([]byte, buf.Len())
    copy(out, buf.Bytes())
    return out
}

Escape Analysis

Check if variables escape to heap:

go build -gcflags="-m" main.go

Output:

./main.go:10:6: moved to heap: user
./main.go:15:6: can inline getUser

Heap allocations are slower than stack allocations, and they add garbage-collector pressure.

Avoid escaping:

// Bad: escapes to heap
func getUser() *User {
    user := User{Name: "Alice"}
    return &user  // Escapes!
}

// Good: stays on stack
func getUser() User {
    return User{Name: "Alice"}
}

Continuous Profiling

Production profiling:

import (
    "fmt"
    "os"
    "runtime/pprof"
    "time"
)

func startContinuousProfiling() {
    go func() {
        for {
            // CPU profile
            f, _ := os.Create(fmt.Sprintf("cpu-%d.prof", time.Now().Unix()))
            pprof.StartCPUProfile(f)
            time.Sleep(30 * time.Second)
            pprof.StopCPUProfile()
            f.Close()
            
            // Heap profile
            f, _ = os.Create(fmt.Sprintf("heap-%d.prof", time.Now().Unix()))
            pprof.WriteHeapProfile(f)
            f.Close()
            
            time.Sleep(5 * time.Minute)
        }
    }()
}

Trace Analysis

Detailed execution trace:

import (
    "os"
    "runtime/trace"
)

func main() {
    f, _ := os.Create("trace.out")
    defer f.Close()
    
    trace.Start(f)
    defer trace.Stop()
    
    // Your code
}

Analyze:

go tool trace trace.out

Shows:

  • Goroutine execution
  • Network blocking
  • Syscalls
  • GC events
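
You can also annotate your own code so it shows up in the trace viewer, using runtime/trace regions (the region names here are arbitrary):

import (
    "context"
    "runtime/trace"
)

func handleRequest(ctx context.Context) {
    defer trace.StartRegion(ctx, "handleRequest").End()

    // Nested regions appear as spans in go tool trace
    trace.WithRegion(ctx, "decode", func() {
        // decode the request...
    })
}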

Real Production Optimizations

1. Reduce JSON marshaling:

// Before: Marshal on every request
func getUser(w http.ResponseWriter, r *http.Request) {
    user := getFromDB()
    json.NewEncoder(w).Encode(user)  // Slow!
}

// After: cache the marshaled JSON
// (sync.Map, because handlers run concurrently)
var userCache sync.Map // map[int][]byte

func getUser(w http.ResponseWriter, r *http.Request) {
    id := getUserID(r)
    if cached, ok := userCache.Load(id); ok {
        w.Write(cached.([]byte))
        return
    }

    user := getFromDB()
    data, _ := json.Marshal(user)
    userCache.Store(id, data)
    w.Write(data)
}

Result: 500ms → 50ms (10x faster)

2. Use string interning:

// Before: Many duplicate strings
type Event struct {
    Type string  // "click", "view", "click", "view"...
}

// After: intern strings
// (mutex-guarded, because events arrive concurrently)
var (
    stringPool = make(map[string]string)
    poolMu     sync.Mutex
)

func intern(s string) string {
    poolMu.Lock()
    defer poolMu.Unlock()
    if interned, ok := stringPool[s]; ok {
        return interned
    }
    stringPool[s] = s
    return s
}

type Event struct {
    Type string
}

func NewEvent(typ string) Event {
    return Event{Type: intern(typ)}
}

Result: 4GB memory → 1.2GB (70% reduction)
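
On Go 1.23 and later, the standard library's unique package provides this kind of canonicalization out of the box; a sketch of the same Event using it:

import "unique"

type Event struct {
    Type unique.Handle[string] // call .Value() to read the string back
}

func NewEvent(typ string) Event {
    return Event{Type: unique.Make(typ)}
}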

3. Optimize regex:

// Before: Compile regex on every call
func validateEmail(email string) bool {
    re := regexp.MustCompile(`^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$`)
    return re.MatchString(email)
}

// After: Compile once
var emailRegex = regexp.MustCompile(`^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$`)

func validateEmail(email string) bool {
    return emailRegex.MatchString(email)
}

Result: 100µs → 1µs (100x faster)

Monitoring in Production

Prometheus metrics:

import (
    "runtime"

    "github.com/prometheus/client_golang/prometheus"
)

var (
    heapAlloc = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "go_heap_alloc_bytes",
    })
    
    numGoroutines = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "go_goroutines",
    })
)

func updateMetrics() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    heapAlloc.Set(float64(m.Alloc))
    numGoroutines.Set(float64(runtime.NumGoroutine()))
}

Results

Before:

  • CPU: 8 cores @ 80%
  • Memory: 4GB
  • Response time: 500ms
  • Goroutines: 10,000

After:

  • CPU: 3 cores @ 40% (60% reduction)
  • Memory: 1.2GB (70% reduction)
  • Response time: 150ms (70% faster)
  • Goroutines: 1,000 (90% reduction)

Lessons Learned

  1. Profile before optimizing - Don’t guess
  2. Focus on hot paths - 80/20 rule
  3. Benchmark everything - Verify improvements
  4. Pre-allocate slices - Avoid reallocations
  5. Use sync.Pool - For frequently allocated objects

Conclusion

Go’s profiling tools make it easy to find and fix performance bottlenecks. Always profile before optimizing.

Key takeaways:

  1. Use pprof for CPU and memory profiling
  2. Write benchmarks to verify improvements
  3. Pre-allocate slices and use strings.Builder
  4. Avoid unnecessary allocations
  5. Monitor production performance

Profile your Go applications. Find the bottlenecks. Optimize them.