Prometheus gave us metrics, but the built-in UI was basic. We needed better visualization for our team and management.

I set up Grafana and built dashboards for all our services. Now everyone can see system health at a glance. Our CEO checks the dashboard daily.

Why Grafana?

The Prometheus UI is good for ad-hoc queries, but it falls short as a team dashboard:

  • No dashboard persistence
  • Limited visualization options
  • Not suitable for NOC displays
  • No user management

Grafana solves all these problems.

Installing Grafana

I downloaded and installed Grafana 4.1.2:

wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana_4.1.2_amd64.deb
sudo dpkg -i grafana_4.1.2_amd64.deb
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

Access the UI at http://localhost:3000 and log in with the default credentials (admin/admin). Change the admin password after the first login.

Adding Prometheus Data Source

  1. Configuration → Data Sources → Add data source
  2. Name: Prometheus
  3. Type: Prometheus
  4. URL: http://localhost:9090
  5. Access: Server (default)
  6. Save & Test

Green checkmark = success!
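
The same data source can also be registered through Grafana's HTTP API instead of the UI. Here is a minimal sketch that only builds the request for POST /api/datasources and does not send it. The helper name build_datasource_request is made up for illustration, and the URLs and admin/admin credentials are the defaults from the steps above.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password):
    """Build (but do not send) a request that registers Prometheus as a data source."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://localhost:9090",
        "access": "proxy",  # the API value behind "Server" in the UI
    }
    # HTTP basic auth header built from the Grafana login.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_datasource_request("http://localhost:3000", "admin", "admin")
    print(req.get_method(), req.full_url)
    # To actually create the data source, send it:
    # urllib.request.urlopen(req)
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before pointing it at a real Grafana instance.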

First Dashboard: Service Overview

Created dashboard with 4 panels:

Panel 1: Request Rate

Query:

sum(rate(http_requests_total[5m])) by (service)

Visualization: Graph
Legend: {{service}}

Shows requests per second for each service.

Panel 2: Error Rate

Query:

sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

Visualization: Graph
Color: Red
Alert threshold: > 10 errors/sec

Panel 3: Latency (p95)

Query:

histogram_quantile(0.95, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

Visualization: Graph
Unit: seconds (s)
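
To get a feel for what histogram_quantile returns, here is a simplified Python sketch of the linear interpolation it performs inside a bucket. The bucket bounds and counts are invented for illustration, and the real PromQL function handles more edge cases than this.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            # Assume observations are spread evenly inside the bucket.
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical cumulative counts for le = 0.1, 0.5, 1.0, +Inf
    sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

With these sample counts, the 95th request out of 100 falls halfway through the 0.5 to 1.0 bucket, so the estimate is 0.75 seconds. The result is only as precise as the bucket boundaries, which is worth remembering when choosing buckets.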

Panel 4: Service Status

Query:

up{job=~".*-service"}

Visualization: Stat
Thresholds: 0 = red, 1 = green

Dashboard Variables

Make dashboards dynamic:

Variable: service

  • Name: service
  • Type: Query
  • Query: label_values(http_requests_total, service)
  • Multi-value: Yes

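Use in queries. Because the variable is multi-value, use the regex matcher =~ (a multi-value selection expands to a pattern such as (a|b), which an exact match with = would never hit):

rate(http_requests_total{service=~"$service"}[5m])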

Now you can filter by service!
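
To see why the matcher matters for a multi-value variable, here is a rough Python sketch of the substitution. It is a simplification: as far as I know, Grafana also escapes regex metacharacters in the selected values, which this sketch skips. The service names user and order are hypothetical.

```python
def interpolate(query, name, selected):
    """Replace $name in a query with the selected value(s), multi-value style."""
    if len(selected) == 1:
        value = selected[0]
    else:
        # Several selections become a regex alternation.
        value = "(" + "|".join(selected) + ")"
    return query.replace("$" + name, value)


if __name__ == "__main__":
    q = 'rate(http_requests_total{service=~"$service"}[5m])'
    print(interpolate(q, "service", ["user"]))
    print(interpolate(q, "service", ["user", "order"]))
```

With two services selected the label matcher becomes service=~"(user|order)". The same text with an exact = matcher would look for a service literally named (user|order) and return nothing.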

Template Dashboard

Created reusable template for all services:

{
  "dashboard": {
    "title": "$service Overview",
    "templating": {
      "list": [
        {
          "name": "service",
          "type": "query",
          "query": "label_values(http_requests_total, service)"
        }
      ]
    },
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{service=~\"$service\"}[5m])"
          }
        ]
      }
    ]
  }
}
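
Hand-editing JSON gets tedious once there are more than a couple of panels. Here is a small sketch that generates the same structure from a list of panels. The function name build_dashboard is hypothetical, and a real dashboard needs more fields (panel type, ids, layout) than the trimmed template above shows.

```python
import json


def build_dashboard(title, panels, variable_query):
    """Return a payload shaped like the template above.

    panels is a list of (panel_title, promql_expr) tuples.
    """
    return {
        "dashboard": {
            "title": title,
            "templating": {
                "list": [
                    {"name": "service", "type": "query", "query": variable_query}
                ]
            },
            "panels": [
                {"title": panel_title, "targets": [{"expr": expr}]}
                for panel_title, expr in panels
            ],
        }
    }


if __name__ == "__main__":
    payload = build_dashboard(
        "$service Overview",
        [("Request Rate", 'rate(http_requests_total{service=~"$service"}[5m])')],
        "label_values(http_requests_total, service)",
    )
    print(json.dumps(payload, indent=2))
```

Adding a panel is then one more tuple in the list rather than another nested JSON block.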

Advanced Queries

Success Rate

sum(rate(http_requests_total{service=~"$service",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service=~"$service"}[5m]))
* 100

Unit: percent (0-100)

Apdex Score

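(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="2"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))

Here 0.5s is the "satisfied" threshold and 2s the "tolerated" one. Histogram buckets are cumulative, so le="2" already includes the le="0.5" requests. That is why the two buckets are added first and the sum is halved as a whole; halving only the le="2" term would push the score above 1.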
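
A quick way to sanity-check an Apdex query built from cumulative histogram buckets is to run the arithmetic on made-up numbers. The sketch below assumes the counts come from cumulative le buckets, so the le="2" count already contains the le="0.5" requests. The figures are invented.

```python
def apdex(satisfied_le, tolerated_le, total):
    """Apdex from cumulative histogram counts.

    satisfied_le: requests at or under the target (the le="0.5" bucket)
    tolerated_le: requests at or under the tolerable limit (the le="2" bucket),
                  which already includes the satisfied ones
    total:        all requests (the _count series)
    """
    if total == 0:
        return None
    return (satisfied_le + tolerated_le) / 2 / total


if __name__ == "__main__":
    # 800 requests under 0.5s, 950 under 2s, 1000 in total (made-up numbers)
    print(apdex(800, 950, 1000))
```

This matches the textbook definition: 800 satisfied plus half of the 150 tolerating requests, divided by 1000, is 0.875. A correct score always stays between 0 and 1.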

CPU Usage by Pod

sum(rate(container_cpu_usage_seconds_total{pod=~"$service.*"}[5m])) by (pod)

Alerting in Grafana

Set up alerts on panels:

  1. Edit panel
  2. Alert tab
  3. Create Alert
  4. Conditions:
    • WHEN avg() OF query(A, 5m, now) IS ABOVE 100
  5. Notifications: Slack channel

The alert fires when the average of query A over the last 5 minutes is above 100.
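
The condition reads more easily as code. Here is a rough sketch of what "avg() OF query(A, 5m, now) IS ABOVE 100" evaluates, using made-up samples. Grafana handles the no-data case through a separate setting, so returning False here is my own simplification.

```python
def is_alerting(samples, now, window_seconds=300, threshold=100.0):
    """Return True if the average of samples in the last window exceeds the threshold.

    samples is a list of (unix_timestamp, value) pairs.
    """
    recent = [value for ts, value in samples if now - window_seconds <= ts <= now]
    if not recent:
        return False  # simplification: treat "no data" as not alerting
    return sum(recent) / len(recent) > threshold


if __name__ == "__main__":
    data = [(0, 500), (700, 90), (800, 120), (900, 150)]
    print(is_alerting(data, now=1000))
```

The old spike at timestamp 0 is ignored because it falls outside the 5-minute window. Only the three recent samples are averaged (120), which is above the threshold.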

Notification Channels

Slack Integration

  1. Alerting → Notification channels → New channel
  2. Type: Slack
  3. Webhook URL: https://hooks.slack.com/services/...
  4. Channel: #alerts
  5. Test

Now alerts go to Slack!
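
Before relying on the channel, it can help to check the webhook itself outside Grafana. This sketch builds the kind of JSON POST a Slack incoming webhook accepts, without sending it. The helper name is made up, and the webhook URL is the same elided placeholder as above, so substitute your own.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build (but do not send) a Slack incoming-webhook message."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_slack_request(
        "https://hooks.slack.com/services/...",  # placeholder, not a real webhook
        "Test message from the monitoring setup",
    )
    print(req.get_method(), json.loads(req.data))
    # With a real webhook URL, send it:
    # urllib.request.urlopen(req)
```

If the message shows up in #alerts, any remaining delivery problem is on the Grafana side rather than in Slack.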

Email Notifications

Configure SMTP in /etc/grafana/grafana.ini, then restart grafana-server so the change takes effect:

[smtp]
enabled = true
host = smtp.gmail.com:587
user = alerts@company.com
password = ***
from_address = alerts@company.com
from_name = Grafana

Dashboard Organization

Created folder structure:

  • Services - Individual service dashboards
  • Infrastructure - Kubernetes, databases
  • Business - Revenue, signups, active users
  • SLOs - Service level objectives

Our Main Dashboard

NOC Dashboard (displayed on TV):

Row 1: Overall Health

  • Total requests/sec
  • Error rate
  • Average latency
  • Services up/down

Row 2: Service Status

  • User Service (green/red)
  • Order Service (green/red)
  • Payment Service (green/red)
  • Notification Service (green/red)

Row 3: Infrastructure

  • CPU usage
  • Memory usage
  • Disk usage
  • Network traffic

Row 4: Business Metrics

  • Active users
  • Orders/hour
  • Revenue/hour

Auto-refresh: 30 seconds

Heatmaps

Visualize latency distribution:

Query:

sum(increase(http_request_duration_seconds_bucket[5m])) by (le)

Visualization: Heatmap
Data format: Time series buckets

Shows latency patterns over time.
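
Prometheus buckets are cumulative, while a heatmap cell shows how many requests landed in each individual bucket. As I understand it, the "Time series buckets" format does that conversion for you. Here is a small sketch of the idea with invented counts.

```python
def per_bucket_counts(cumulative):
    """Convert cumulative (le, count) pairs into (le, count_in_bucket) pairs.

    Input must be sorted by upper bound, smallest first.
    """
    result = []
    previous = 0
    for le, count in cumulative:
        result.append((le, count - previous))
        previous = count
    return result


if __name__ == "__main__":
    cumulative = [("0.1", 50), ("0.5", 90), ("1", 100), ("+Inf", 100)]
    for le, count in per_bucket_counts(cumulative):
        print(f"le={le}: {count}")
```

Here the 0.1 to 0.5 bucket holds 40 requests and nothing exceeded 1 second, which is the information a heatmap encodes as color intensity.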

Single Stat Panels

Big numbers for key metrics:

Active Users:

sum(active_users)

Visualization: Stat
Font size: 72pt
Color: Green if > 1000

Revenue Today (rolling 24 hours):

sum(increase(revenue_total[24h]))

Prefix: $
Decimals: 2

Table Panels

List top endpoints by latency:

Query:

topk(10, 
  histogram_quantile(0.95, 
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
  )
)

Visualization: Table
Columns: Endpoint, Latency
Sort: Latency DESC

Annotations

Mark deployments on graphs:

  1. Dashboard settings → Annotations
  2. Add annotation query
  3. Data source: Prometheus
  4. Query: deployment_timestamp

Vertical lines show when deployments happened.

Sharing Dashboards

Snapshot:

  • Share → Snapshot
  • Publish to snapshots.raintank.io
  • Get shareable link

Export JSON:

  • Dashboard settings → JSON Model
  • Copy JSON
  • Import on another Grafana instance
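
When moving exported JSON between instances, the numeric id from the source instance can clash with an existing dashboard on the target. A common workaround is to clear it before importing. Here is a sketch with a hypothetical helper name; whether other fields need resetting depends on your Grafana version.

```python
import copy


def prepare_for_import(dashboard):
    """Return a copy of an exported dashboard with its instance-specific id cleared."""
    portable = copy.deepcopy(dashboard)
    portable["id"] = None  # serialized as null, so the target instance assigns a new id
    return portable


if __name__ == "__main__":
    exported = {"id": 42, "title": "User Service", "panels": []}
    print(prepare_for_import(exported))
```

Working on a deep copy keeps the original export untouched, so the same file can still be re-imported into the instance it came from.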

Embed:

  • Share → Embed
  • Copy iframe code
  • Embed in wiki/docs

Dashboard Best Practices

  1. Use consistent colors - Red for errors, green for success
  2. Group related panels - Use rows
  3. Add descriptions - Help text for each panel
  4. Set appropriate time ranges - Last 1h, 6h, 24h
  5. Use variables - Make dashboards reusable
  6. Set refresh rate - 30s for NOC, 5m for analysis
  7. Add links - Link to runbooks, logs

Our Dashboard Library

Created 15 dashboards:

Services (5):

  • User Service
  • Order Service
  • Payment Service
  • Notification Service
  • API Gateway

Infrastructure (4):

  • Kubernetes Cluster
  • PostgreSQL
  • Redis
  • Nginx

Business (3):

  • Revenue Dashboard
  • User Engagement
  • Conversion Funnel

SLOs (3):

  • Availability SLO
  • Latency SLO
  • Error Budget

Mobile App

Grafana has a mobile app:

  • iOS/Android
  • View dashboards on phone
  • Get alert notifications
  • Great for on-call

Provisioning Dashboards

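Automate dashboard creation with file-based provisioning (available from Grafana 5.0, so upgrade first if you are still on 4.x):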

/etc/grafana/provisioning/dashboards/dashboard.yml:

apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    options:
      path: /var/lib/grafana/dashboards

Put JSON files in /var/lib/grafana/dashboards/

Dashboards auto-load on startup!
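
If the dashboards are generated by a script, the last step is writing them into that directory. Here is a minimal sketch, assuming each file holds the dashboard model itself (not wrapped in an API-style "dashboard" key) and that a file name derived from the title is acceptable. The helper name write_dashboards is made up.

```python
import json
import tempfile
from pathlib import Path


def write_dashboards(dashboards, directory):
    """Write each dashboard model to <directory>/<slug>.json and return the paths."""
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for dash in dashboards:
        # "User Service" -> "user-service.json"
        slug = dash["title"].lower().replace(" ", "-")
        path = out / f"{slug}.json"
        path.write_text(json.dumps(dash, indent=2))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # A temporary directory stands in for /var/lib/grafana/dashboards here.
    with tempfile.TemporaryDirectory() as tmp:
        written = write_dashboards([{"title": "User Service", "panels": []}], tmp)
        print([p.name for p in written])
```

Checking the generated files into version control alongside dashboard.yml gives a reviewable history of dashboard changes.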

Results

Before Grafana:

  • Prometheus UI only
  • No persistent dashboards
  • Hard to share metrics
  • No alerting

After Grafana:

  • 15 beautiful dashboards
  • Team has visibility
  • Alerts to Slack
  • CEO checks dashboard daily

Lessons Learned

  1. Start with service overview - Request rate, errors, latency
  2. Use variables - Make dashboards reusable
  3. Set up alerting - Don’t just visualize, alert
  4. Organize dashboards - Use folders
  5. Share with team - Everyone should see metrics

Conclusion

Grafana transformed our monitoring. Metrics are now accessible to everyone, not just the ops team.

Key takeaways:

  1. Grafana makes Prometheus data beautiful
  2. Use variables for flexible dashboards
  3. Set up alerting early
  4. Create dashboards for different audiences
  5. Automate dashboard provisioning

If you have Prometheus, add Grafana. Your team will love you.