We had a production incident last week. Our user service was running at 95% CPU for 2 hours before anyone noticed. Users were experiencing slow responses, but we had no visibility.

I spent the weekend setting up Prometheus. Now we have real-time metrics for all 5 microservices. Yesterday it caught a memory leak before it became critical.

The Wake-Up Call

Friday, 3 PM: Users complaining about slow login
Friday, 3:30 PM: We SSH into servers, discover high CPU
Friday, 4 PM: Restart services, everything back to normal
Friday, 4:30 PM: Boss asks “Why didn’t we know about this earlier?”

We had no answer. We needed monitoring.

Why Prometheus?

I evaluated three options:

  1. Nagios - Too complex, agent-based
  2. Graphite - Good but requires StatsD setup
  3. Prometheus - Pull-based, simple, built for microservices

Prometheus won because:

  • No agents needed
  • Built-in time-series database
  • Powerful query language (PromQL)
  • Easy integration with Go/Python services
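
To make the pull model concrete: each service just exposes its current numbers as plain text over HTTP, and Prometheus fetches them on a schedule. A scraped response looks roughly like this (values are illustrative):

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{endpoint="/users",method="GET",status="200"} 1027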

Installing Prometheus

Downloaded Prometheus 1.0.1:

wget https://github.com/prometheus/prometheus/releases/download/v1.0.1/prometheus-1.0.1.linux-amd64.tar.gz
tar xvfz prometheus-1.0.1.linux-amd64.tar.gz
cd prometheus-1.0.1.linux-amd64

Basic config (prometheus.yml):

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Start Prometheus:

./prometheus -config.file=prometheus.yml

Access UI at http://localhost:9090
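
Quick sanity check: run up in the expression browser. With only the self-scrape job configured, it should return a single series along these lines:

up{instance="localhost:9090",job="prometheus"}  1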

Instrumenting Go Services

Our user service (Go):

package main

import (
    "net/http"
    "strconv"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    
    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

// statusRecorder wraps http.ResponseWriter so the middleware can see which
// status code the handler actually wrote.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (rec *statusRecorder) WriteHeader(code int) {
    rec.status = code
    rec.ResponseWriter.WriteHeader(code)
}

func metricsMiddleware(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method, r.URL.Path))
        defer timer.ObserveDuration()

        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next(rec, r)

        // Record the status the handler actually returned instead of
        // hard-coding "200".
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
    }
}

func main() {
    http.HandleFunc("/users", metricsMiddleware(handleUsers))
    http.Handle("/metrics", promhttp.Handler())
    
    http.ListenAndServe(":8080", nil)
}

func handleUsers(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("Users endpoint"))
}

Now the /metrics endpoint exposes Prometheus metrics.
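
A quick way to check the instrumentation (after hitting /users a couple of times) is to curl it; trimmed output should look something like this:

curl -s localhost:8080/metrics | grep http_requests_total

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{endpoint="/users",method="GET",status="200"} 2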

Instrumenting Python Services

Our API gateway (Flask):

import time

from flask import Flask, Response, jsonify, request
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = Flask(__name__)

# Metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    duration = time.time() - request.start_time
    
    http_request_duration_seconds.labels(
        method=request.method,
        endpoint=request.endpoint
    ).observe(duration)
    
    http_requests_total.labels(
        method=request.method,
        endpoint=request.endpoint,
        status=response.status_code
    ).inc()
    
    return response

@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

@app.route('/api/users')
def users():
    return jsonify({'users': []})

if __name__ == '__main__':
    app.run(port=5000)
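
Same sanity check for the gateway. After one request to /api/users, the counter should show up (note the endpoint label is the Flask view name), something like:

curl -s localhost:5000/api/users
curl -s localhost:5000/metrics | grep http_requests_total

http_requests_total{method="GET",endpoint="users",status="200"} 1.0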

Configuring Scrape Targets

Updated prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'user-service'
    static_configs:
      - targets: ['localhost:8080']
  
  - job_name: 'api-gateway'
    static_configs:
      - targets: ['localhost:5000']
  
  - job_name: 'order-service'
    static_configs:
      - targets: ['localhost:8081']
  
  - job_name: 'payment-service'
    static_configs:
      - targets: ['localhost:8082']
  
  - job_name: 'notification-service'
    static_configs:
      - targets: ['localhost:8083']

Reload Prometheus:

kill -HUP $(pgrep prometheus)
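
The Targets page in the UI should now list all five jobs. The same check as a query; every target should report 1:

up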

Useful Queries

Request rate:

rate(http_requests_total[5m])

95th percentile latency:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Error rate:

rate(http_requests_total{status=~"5.."}[5m])
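
The raw rate is hard to judge on its own; as a fraction of all traffic it is easier to read:

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))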

CPU usage (using node_exporter):

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Adding Node Exporter

Monitor system metrics:

wget https://github.com/prometheus/node_exporter/releases/download/v0.12.0/node_exporter-0.12.0.linux-amd64.tar.gz
tar xvfz node_exporter-0.12.0.linux-amd64.tar.gz
cd node_exporter-0.12.0.linux-amd64
./node_exporter

Add to prometheus.yml:

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

Now we have CPU, memory, disk, network metrics.
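
A couple of example queries on top of that. The metric names below assume a recent node_exporter; older releases (including the 0.12.0 above) used slightly different names, so adjust if needed.

Memory usage:

100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

Disk usage on /:

100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})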

Setting Up Alerts

Created alerts.yml:

groups:
  - name: service_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value }} req/s"
      
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.job }}"
          description: "95th percentile latency is {{ $value }}s"
      
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

Update prometheus.yml:

rule_files:
  - "alerts.yml"

First Incident Caught

Monday, 10 AM: Prometheus alert fires
Alert: HighCPU on user-service
Action: Checked the metrics, found a memory leak driving CPU up
Resolution: Deployed a fix before users noticed

This is exactly what we needed!

Dashboards

Created a simple dashboard using the Prometheus UI:

Service Health:

up{job=~".*-service"}

Request Rate by Service:

sum by (job) (rate(http_requests_total[5m]))

Latency Heatmap:

rate(http_request_duration_seconds_bucket[5m])
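
Rough Availability (fraction of the last day each service was up, same job-name pattern as above):

avg_over_time(up{job=~".*-service"}[1d])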

Production Setup

Moved to production with:

  1. Persistent storage:
./prometheus -config.file=prometheus.yml -storage.local.path=/var/lib/prometheus
  2. Systemd service (/etc/systemd/system/prometheus.service), started with the systemctl commands after this list:
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/opt/prometheus/prometheus -config.file=/etc/prometheus/prometheus.yml -storage.local.path=/var/lib/prometheus
Restart=always

[Install]
WantedBy=multi-user.target
  3. Retention policy (this flag is appended to the ExecStart command above):
-storage.local.retention=720h  # 30 days
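
With the unit file in place (assuming a prometheus user exists and owns /var/lib/prometheus), the usual systemd steps apply:

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus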

Lessons Learned

  1. Start simple - Basic metrics are better than no metrics
  2. Instrument early - Add metrics when writing code, not after incidents
  3. Use labels wisely - Don’t create too many unique label combinations (see the sketch after this list)
  4. Set meaningful alerts - Too many alerts = alert fatigue
  5. Monitor the monitor - Make sure Prometheus itself is healthy
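
Lesson 3 applies directly to our own Go middleware above: using r.URL.Path as the endpoint label means every distinct URL (say /users/12345) becomes its own time series. A minimal sketch of one way to cap that, using a hypothetical normalizeEndpoint helper that maps raw paths onto a small fixed set before labelling:

package main

import (
    "fmt"
    "strings"
)

// normalizeEndpoint maps raw request paths onto a small, fixed set of label
// values so the "endpoint" label stays low-cardinality.
func normalizeEndpoint(path string) string {
    switch {
    case strings.HasPrefix(path, "/users"):
        return "/users"
    case path == "/metrics":
        return "/metrics"
    default:
        return "other"
    }
}

func main() {
    // /users/12345 and /users/67890 collapse into a single series.
    fmt.Println(normalizeEndpoint("/users/12345")) // /users
    fmt.Println(normalizeEndpoint("/healthz"))     // other
}

In the middleware, the label value would then be normalizeEndpoint(r.URL.Path) rather than r.URL.Path.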

What’s Next

Planning to add:

  • Grafana for better dashboards
  • Alertmanager for alert routing
  • Service discovery (instead of static configs) - see the sketch after this list
  • More custom metrics (business metrics, not just technical)
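
For the service-discovery item, file-based discovery is probably the smallest step up from static configs: deploy tooling writes target files and Prometheus watches them. A sketch (the paths here are just an assumption):

scrape_configs:
  - job_name: 'services'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'

And a target file such as /etc/prometheus/targets/user-service.json:

[
  { "targets": ["localhost:8080"], "labels": { "service": "user-service" } }
]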

Conclusion

Prometheus transformed our operations. We went from flying blind to full visibility in one weekend.

Key takeaways:

  1. Monitoring is essential for microservices
  2. Prometheus is easy to set up
  3. Start with basic metrics (requests, latency, errors)
  4. Add alerts for critical issues
  5. Iterate and improve over time

No more surprise incidents. We now know what’s happening in production, in real-time.

If you’re running microservices without monitoring, stop reading and set up Prometheus. Your future self will thank you.