Getting Started with Prometheus: Monitoring Our Microservices
We had a production incident last week. Our user service was running at 95% CPU for 2 hours before anyone noticed. Users were experiencing slow responses, but we had no visibility.
I spent the weekend setting up Prometheus. Now we have real-time metrics for all 5 microservices. Yesterday it caught a memory leak before it became critical.
The Wake-Up Call
Friday, 3 PM: Users complaining about slow login
Friday, 3:30 PM: We SSH into servers, discover high CPU
Friday, 4 PM: Restart services, everything back to normal
Friday, 4:30 PM: Boss asks “Why didn’t we know about this earlier?”
We had no answer. We needed monitoring.
Why Prometheus?
I evaluated three options:
- Nagios - Too complex, agent-based
- Graphite - Good but requires StatsD setup
- Prometheus - Pull-based, simple, built for microservices
Prometheus won because:
- No agents needed
- Built-in time-series database
- Powerful query language (PromQL)
- Easy integration with Go/Python services
Installing Prometheus
Downloaded Prometheus 1.0.1:
wget https://github.com/prometheus/prometheus/releases/download/v1.0.1/prometheus-1.0.1.linux-amd64.tar.gz
tar xvfz prometheus-1.0.1.linux-amd64.tar.gz
cd prometheus-1.0.1.linux-amd64
Basic config (prometheus.yml):
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
Start Prometheus:
./prometheus -config.file=prometheus.yml
Access the UI at http://localhost:9090.
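A quick sanity check: the server scrapes itself, so the up metric should already be 1. Run up in the expression browser, or hit the HTTP API (default port and the v1 query API assumed here):

curl 'http://localhost:9090/api/v1/query?query=up'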
Instrumenting Go Services
Our user service (Go):
package main

import (
    "log"
    "net/http"
    "strconv"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

// statusRecorder captures the status code the handler writes, so the counter
// is labeled with the real status instead of a hard-coded "200". Without this,
// the 5xx error-rate queries later on would never see errors from this service.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (rec *statusRecorder) WriteHeader(code int) {
    rec.status = code
    rec.ResponseWriter.WriteHeader(code)
}

func metricsMiddleware(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // Time the request; the observation happens when the handler returns.
        timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method, r.URL.Path))
        defer timer.ObserveDuration()

        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next(rec, r)

        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
    }
}

func main() {
    http.HandleFunc("/users", metricsMiddleware(handleUsers))
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}

func handleUsers(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("Users endpoint"))
}
Now the /metrics endpoint exposes Prometheus metrics.
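After a few requests to /users, the scrape output looks roughly like this (counts are illustrative, histogram buckets are trimmed, and the Go client adds its own go_* and process_* series on top):

curl http://localhost:8080/metrics

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{endpoint="/users",method="GET",status="200"} 42
# HELP http_request_duration_seconds HTTP request duration in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{endpoint="/users",method="GET",le="0.005"} 40
http_request_duration_seconds_bucket{endpoint="/users",method="GET",le="+Inf"} 42
http_request_duration_seconds_sum{endpoint="/users",method="GET"} 0.113
http_request_duration_seconds_count{endpoint="/users",method="GET"} 42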
Instrumenting Python Services
Our API gateway (Flask):
import time

from flask import Flask, Response, jsonify, request
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Histogram,
    generate_latest,
)

app = Flask(__name__)

# Metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

@app.before_request
def before_request():
    # Stash the start time on the request so after_request can compute duration.
    request.start_time = time.time()

@app.after_request
def after_request(response):
    duration = time.time() - request.start_time
    http_request_duration_seconds.labels(
        method=request.method,
        endpoint=request.endpoint
    ).observe(duration)
    http_requests_total.labels(
        method=request.method,
        endpoint=request.endpoint,
        status=response.status_code
    ).inc()
    return response

@app.route('/metrics')
def metrics():
    # Serve metrics in the Prometheus text exposition format.
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

@app.route('/api/users')
def users():
    return jsonify({'users': []})

if __name__ == '__main__':
    app.run(port=5000)
Configuring Scrape Targets
Updated prometheus.yml:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'user-service'
    static_configs:
      - targets: ['localhost:8080']

  - job_name: 'api-gateway'
    static_configs:
      - targets: ['localhost:5000']

  - job_name: 'order-service'
    static_configs:
      - targets: ['localhost:8081']

  - job_name: 'payment-service'
    static_configs:
      - targets: ['localhost:8082']

  - job_name: 'notification-service'
    static_configs:
      - targets: ['localhost:8083']
Reload Prometheus:
kill -HUP $(pgrep prometheus)
Useful Queries
Request rate:
rate(http_requests_total[5m])
95th percentile latency:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Error rate:
rate(http_requests_total{status=~"5.."}[5m])
CPU usage (using node_exporter; note that v0.12 still names the metric node_cpu, not node_cpu_seconds_total):
100 - (avg by (instance) (rate(node_cpu{mode="idle"}[5m])) * 100)
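Error ratio: the raw 5xx rate above is easier to interpret as a share of total traffic, per service:
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (job) (rate(http_requests_total[5m]))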
Adding Node Exporter
Monitor system metrics:
wget https://github.com/prometheus/node_exporter/releases/download/v0.12.0/node_exporter-0.12.0.linux-amd64.tar.gz
tar xvfz node_exporter-0.12.0.linux-amd64.tar.gz
cd node_exporter-0.12.0.linux-amd64
./node_exporter
Add to prometheus.yml:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
Now we have CPU, memory, disk, network metrics.
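For example, a rough memory-usage percentage per host can be built from the memory gauges; the names below assume the older pre-0.16 naming this node_exporter release uses (no _bytes suffix):

100 * (1 - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached) / node_memory_MemTotal)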
Setting Up Alerts
Created alerts.rules (Prometheus 1.x has its own rule syntax; the YAML rule format only arrived in 2.0):
ALERT HighErrorRate
  IF rate(http_requests_total{status=~"5.."}[5m]) > 0.05
  FOR 5m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "High error rate on {{ $labels.job }}",
    description = "Error rate is {{ $value }} req/s"
  }

ALERT HighLatency
  IF histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
  FOR 5m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "High latency on {{ $labels.job }}",
    description = "95th percentile latency is {{ $value }}s"
  }

ALERT HighCPU
  IF 100 - (avg by (instance) (rate(node_cpu{mode="idle"}[5m])) * 100) > 80
  FOR 10m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "High CPU on {{ $labels.instance }}",
    description = "CPU usage is {{ $value }}%"
  }
Update prometheus.yml:
rule_files:
  - "alerts.rules"
First Incident Caught
Monday, 10 AM: Prometheus alert fires
Alert: HighCPU on user-service
Action: Checked metrics, found memory leak
Resolution: Deployed fix before users noticed
This is exactly what we needed!
Dashboards
Created a simple dashboard using the Prometheus UI:
Service Health:
up{job=~".*-service"}
Request Rate by Service:
sum by (job) (rate(http_requests_total[5m]))
Latency Heatmap:
rate(http_request_duration_seconds_bucket[5m])
Production Setup
Moved to production with:
- Persistent storage:
./prometheus -config.file=prometheus.yml -storage.local.path=/var/lib/prometheus
- Systemd service (/etc/systemd/system/prometheus.service):
[Unit]
Description=Prometheus
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/opt/prometheus/prometheus -config.file=/etc/prometheus/prometheus.yml -storage.local.path=/var/lib/prometheus
Restart=always
[Install]
WantedBy=multi-user.target
- Retention policy (added to the ExecStart flags above):
-storage.local.retention=720h  # 30 days
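With the unit file and flags in place, the usual systemd steps bring it up (the useradd line is only needed if a prometheus user doesn't already exist):

sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /var/lib/prometheus
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus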
Lessons Learned
- Start simple - Basic metrics are better than no metrics
- Instrument early - Add metrics when writing code, not after incidents
- Use labels wisely - Don’t create too many unique label combinations
- Set meaningful alerts - Too many alerts = alert fatigue
- Monitor the monitor - Make sure Prometheus itself is healthy
What’s Next
Planning to add:
- Grafana for better dashboards
- Alertmanager for alert routing
- Service discovery instead of static configs (see the sketch after this list)
- More custom metrics (business metrics, not just technical)
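For the service-discovery item, the easiest first step is Prometheus's built-in file-based discovery, where the scrape config reads target lists from JSON files we can regenerate on deploy. A minimal sketch (job name and paths are placeholders, not our real setup):

  - job_name: 'microservices'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'

And a target file like /etc/prometheus/targets/user-service.json:

[
  { "targets": ["localhost:8080"], "labels": { "service": "user-service" } }
]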
Conclusion
Prometheus transformed our operations. We went from flying blind to full visibility in one weekend.
Key takeaways:
- Monitoring is essential for microservices
- Prometheus is easy to set up
- Start with basic metrics (requests, latency, errors)
- Add alerts for critical issues
- Iterate and improve over time
No more surprise incidents. We now know what’s happening in production, in real-time.
If you’re running microservices without monitoring, stop reading and set up Prometheus. Your future self will thank you.