Building Grafana Dashboards for Microservices Monitoring

Prometheus gave us metrics, but the built-in UI was basic. We needed better visualization for our team and management.

I set up Grafana and built dashboards for all our services. Now everyone can see system health at a glance. Our CEO checks the dashboard daily.

Why Grafana?

Prometheus UI is good for ad-hoc queries, but:

No dashboard persistence
Limited visualization options
Not suitable for NOC displays
No user management

Grafana solves all these problems.

Installing Grafana

Downloaded Grafana 4.1.2:

wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana_4.1.2_amd64.deb
sudo dpkg -i grafana_4.1.2_amd64.deb
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

Access it at http://localhost:3000 (default login admin/admin; change the password after the first login).

Adding Prometheus Data Source

Configuration → Data Sources → Add data source
Name: Prometheus
Type: Prometheus
URL: http://localhost:9090
Access: Server (default)
Save & Test

Green checkmark = success!

First Dashboard: Service Overview

Created a dashboard with 4 panels:

Panel 1: Request Rate

Query:

sum(rate(http_requests_total[5m])) by (service)

Visualization: Graph
Legend: {{service}}

Shows requests per second for each service.
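To make the query concrete, here is a rough Python sketch of what a per-second rate over a window means for a counter. It is a simplification, not how Prometheus implements rate(): the real function also handles counter resets and extrapolates to the window boundaries. The function name and the sample values are made up for illustration.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter over a window.

    samples: list of (unix_timestamp_seconds, counter_value), oldest first.
    Simplification: ignores counter resets and window-edge extrapolation.
    """
    if len(samples) < 2:
        return None  # not enough points to compute a rate
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    return (v1 - v0) / (t1 - t0)


# Hypothetical scrapes of http_requests_total, one per minute.
samples = [(0, 1000), (60, 1300), (120, 1600), (180, 1900), (240, 2200)]
print(simple_rate(samples))  # 5.0 requests per second
```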

Panel 2: Error Rate

Query:

sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

Visualization: Graph
Color: Red
Alert threshold: > 10 errors/sec

Panel 3: Latency (p95)

Query:

histogram_quantile(0.95, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

Visualization: Graph
Unit: seconds (s)
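For intuition about what histogram_quantile does with the le buckets, here is a simplified Python sketch of the estimate: find the bucket that contains the target rank, then interpolate linearly inside it. The bucket bounds and counts are invented, and the function is an illustration rather than Prometheus's actual code.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that holds the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for le="0.1", "0.5", "1", "+Inf".
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # approximately 0.8125 seconds
```

The result can never be more precise than the bucket layout, so p95 values on the graph are estimates between bucket bounds.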

Panel 4: Service Status

Query:

up{job=~".*-service"}

Visualization: Singlestat
Thresholds: 0 = red, 1 = green

Dashboard Variables

Make dashboards dynamic:

Variable: service

Name: service
Type: Query
Query: label_values(http_requests_total, service)
Multi-value: Yes

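Use in queries (with Multi-value enabled, use the regex matcher =~, because several selected values expand to a pattern like (a|b)):

rate(http_requests_total{service=~"$service"}[5m])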

Now you can filter by service!
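A multi-value variable is substituted into the query text before the query is sent to Prometheus, which is why it needs the regex matcher =~ rather than =. The sketch below imitates that substitution in Python. It only approximates Grafana's behavior: the function name is hypothetical, and it assumes service names contain no regex metacharacters, so no escaping is done.

```python
def interpolate_service(query, selected):
    """Replace $service with one value or a regex alternation of several.

    Assumption: values contain no regex metacharacters (no escaping here).
    """
    if len(selected) == 1:
        value = selected[0]
    else:
        value = "(" + "|".join(selected) + ")"
    return query.replace("$service", value)


query = 'rate(http_requests_total{service=~"$service"}[5m])'
print(interpolate_service(query, ["user", "order"]))
# rate(http_requests_total{service=~"(user|order)"}[5m])
```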

Template Dashboard

Created a reusable template for all services:

{
  "dashboard": {
    "title": "$service Overview",
    "templating": {
      "list": [
        {
          "name": "service",
          "type": "query",
          "query": "label_values(http_requests_total, service)"
        }
      ]
    },
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{service=\"$service\"}[5m])"
          }
        ]
      }
    ]
  }
}
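Writing this JSON by hand gets tedious once there are many panels. A minimal sketch of generating the same structure in Python follows. The helper name build_payload is hypothetical, and the "id" and "overwrite" fields are additions for illustration; the remaining fields mirror the template above. Grafana's HTTP API accepts this kind of wrapper at POST /api/dashboards/db, but check the API documentation for your version before relying on it.

```python
import json


def build_payload(title="$service Overview"):
    """Build the dashboard wrapper shown in the template above."""
    dashboard = {
        "id": None,  # None asks Grafana to create a new dashboard
        "title": title,
        "templating": {
            "list": [
                {
                    "name": "service",
                    "type": "query",
                    "query": "label_values(http_requests_total, service)",
                }
            ]
        },
        "panels": [
            {
                "title": "Request Rate",
                "targets": [
                    {"expr": 'rate(http_requests_total{service="$service"}[5m])'}
                ],
            }
        ],
    }
    # "overwrite" is False so an existing dashboard is not replaced silently.
    return {"dashboard": dashboard, "overwrite": False}


print(json.dumps(build_payload(), indent=2))
```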

Advanced Queries

Success Rate

sum(rate(http_requests_total{service=~"$service",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service=~"$service"}[5m]))
* 100

Unit: percent (0-100)

Apdex Score

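(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="2"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))

Histogram buckets are cumulative, so the le="2" bucket already contains the le="0.5" requests. Dividing the sum of both buckets by 2 counts satisfied requests (under 0.5s) fully and tolerated requests (0.5s to 2s) at half weight.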
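A small worked example helps check the Apdex arithmetic. Prometheus histogram buckets are cumulative, so the le="2" count includes everything in le="0.5"; adding the two cumulative counts and halving the sum is the same as counting satisfied requests fully and tolerated ones at half weight. The request counts below are invented, and both function names are hypothetical.

```python
def apdex_from_buckets(le_satisfied, le_tolerated, total):
    """Apdex from cumulative bucket counts.

    le_satisfied: requests at or below the target (the le="0.5" bucket).
    le_tolerated: requests at or below 4x the target (the le="2" bucket,
                  which includes the satisfied requests).
    """
    if total == 0:
        return None
    return (le_satisfied + le_tolerated) / 2 / total


def apdex_textbook(satisfied, tolerating, total):
    """Textbook form: satisfied count plus half of the tolerating-only count."""
    if total == 0:
        return None
    return (satisfied + tolerating / 2) / total


# Hypothetical window: 800 requests <= 0.5s, 950 requests <= 2s, 1000 total.
print(apdex_from_buckets(800, 950, 1000))  # 0.875
print(apdex_textbook(800, 150, 1000))      # 0.875
```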

CPU Usage by Pod

sum(rate(container_cpu_usage_seconds_total{pod=~"$service.*"}[5m])) by (pod)

Alerting in Grafana

Set up alerts on panels:

Edit panel
Alert tab
Create Alert
Conditions:
- WHEN avg() OF query(A, 5m, now) IS ABOVE 100
Notifications: Slack channel

The alert fires when the average of query A over the last 5 minutes rises above 100.
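The condition "WHEN avg() OF query(A, 5m, now) IS ABOVE 100" can be read as: take the points of query A from the last 5 minutes, average them, and compare against the threshold. Here is a sketch of that logic in Python. It is not Grafana's code; the function name is hypothetical, and returning False when there is no data is an assumption (Grafana lets you choose the no-data state).

```python
def alert_firing(points, now, window_seconds=300, threshold=100.0):
    """Evaluate avg() over the last window_seconds against a threshold.

    points: list of (unix_timestamp_seconds, value); value may be None for gaps.
    """
    recent = [
        value
        for ts, value in points
        if now - window_seconds < ts <= now and value is not None
    ]
    if not recent:
        return False  # assumption: treat "no data" as not firing
    return sum(recent) / len(recent) > threshold


# Hypothetical series: the old spike at t=0 falls outside the 5-minute window.
points = [(0, 500), (700, 90), (800, 120), (900, 130)]
print(alert_firing(points, now=900))  # True, the average is about 113.3
```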

Notification Channels

Slack Integration

Alerting → Notification channels → New channel
Type: Slack
Webhook URL: https://hooks.slack.com/services/...
Channel: #alerts
Test

Now alerts go to Slack!
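If the Test button fails, it can help to check the webhook independently of Grafana. Slack incoming webhooks accept a JSON body with a "text" field, so a short Python script is enough. This is a sketch: the function names are hypothetical, and the webhook URL must be your own (it is not filled in here).

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_message(webhook_url, text="Test message from the monitoring setup"):
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling send_test_message with the webhook URL from the channel settings should produce a message in #alerts; an HTTP error here points at the webhook rather than at Grafana.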

Email Notifications

Configure SMTP in /etc/grafana/grafana.ini:

[smtp]
enabled = true
host = smtp.gmail.com:587
user = alerts@company.com
password = ***
from_address = alerts@company.com
from_name = Grafana

Dashboard Organization

Created folder structure:

Services - Individual service dashboards
Infrastructure - Kubernetes, databases
Business - Revenue, signups, active users
SLOs - Service level objectives

Our Main Dashboard

NOC Dashboard (displayed on TV):

Row 1: Overall Health

Total requests/sec
Error rate
Average latency
Services up/down

Row 2: Service Status

User Service (green/red)
Order Service (green/red)
Payment Service (green/red)
Notification Service (green/red)

Row 3: Infrastructure

CPU usage
Memory usage
Disk usage
Network traffic

Row 4: Business Metrics

Active users
Orders/hour
Revenue/hour

Auto-refresh: 30 seconds

Heatmaps

Visualize latency distribution:

Query:

sum(increase(http_request_duration_seconds_bucket[5m])) by (le)

Visualization: Heatmap
Data format: Time series buckets

Shows latency patterns over time.
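Because Prometheus buckets are cumulative, each le series has to be turned into a per-bucket count before it represents one heatmap cell. The Python sketch below shows that subtraction for a single time step. The bucket values are invented and the function name is hypothetical.

```python
def per_bucket_counts(cumulative):
    """Convert cumulative bucket counts to counts per bucket.

    cumulative: list of (le_label, cumulative_count), sorted by upper bound.
    """
    result = []
    previous = 0
    for le, count in cumulative:
        # Clamp at zero in case scrapes are slightly inconsistent.
        result.append((le, max(count - previous, 0)))
        previous = count
    return result


# Hypothetical increase over 5 minutes for each le bucket.
cumulative = [("0.1", 50), ("0.5", 90), ("1", 98), ("+Inf", 100)]
print(per_bucket_counts(cumulative))
# [('0.1', 50), ('0.5', 40), ('1', 8), ('+Inf', 2)]
```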

Single Stat Panels

Big numbers for key metrics:

Active Users:

sum(active_users)

Visualization: Singlestat
Font size: 72pt
Color: Green if > 1000

Revenue (last 24 hours):

sum(increase(revenue_total[24h]))

Prefix: $
Decimals: 2

Table Panels

List top endpoints by latency:

Query:

topk(10, 
  histogram_quantile(0.95, 
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
  )
)

Visualization: Table
Columns: Endpoint, Latency
Sort: Latency DESC

Annotations

Mark deployments on graphs:

Dashboard settings → Annotations
Add annotation query
Data source: Prometheus
Query: deployment_timestamp

Vertical lines show when deployments happened.

Sharing Dashboards

Snapshot:

Share → Snapshot
Publish to snapshots.raintank.io
Get shareable link

Export JSON:

Dashboard settings → JSON Model
Copy JSON
Import on another Grafana instance

Embed:

Share → Embed
Copy iframe code
Embed in wiki/docs

Dashboard Best Practices

Use consistent colors - Red for errors, green for success
Group related panels - Use rows
Add descriptions - Help text for each panel
Set appropriate time ranges - Last 1h, 6h, 24h
Use variables - Make dashboards reusable
Set refresh rate - 30s for NOC, 5m for analysis
Add links - Link to runbooks, logs

Our Dashboard Library

Created 15 dashboards:

Services (5):

User Service
Order Service
Payment Service
Notification Service
API Gateway

Infrastructure (4):

Kubernetes Cluster
PostgreSQL
Redis
Nginx

Business (3):

Revenue Dashboard
User Engagement
Conversion Funnel

SLOs (3):

Availability SLO
Latency SLO
Error Budget

Mobile App

Grafana has a mobile app:

iOS/Android
View dashboards on phone
Get alert notifications
Great for on-call

Provisioning Dashboards

Automate dashboard creation:

/etc/grafana/provisioning/dashboards/dashboard.yml:

apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    options:
      path: /var/lib/grafana/dashboards

Put JSON files in /var/lib/grafana/dashboards/

Dashboards auto-load on startup!
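The JSON files themselves can be generated instead of exported by hand. Below is a minimal Python sketch that writes one dashboard file per service into a directory. The service names, file naming scheme, and function names are hypothetical, and it assumes provisioned files hold the bare dashboard model rather than the API wrapper with a "dashboard" key.

```python
import json
from pathlib import Path

# Hypothetical service label values; replace with your own.
SERVICES = ["user", "order", "payment", "notification"]


def dashboard_model(service):
    """Bare dashboard model for one service (no API wrapper)."""
    return {
        "title": f"{service} Overview",
        "panels": [
            {
                "title": "Request Rate",
                "targets": [
                    {"expr": f'rate(http_requests_total{{service="{service}"}}[5m])'}
                ],
            }
        ],
    }


def write_dashboards(out_dir, services=SERVICES):
    """Write one JSON file per service and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for service in services:
        path = out / f"{service}-overview.json"
        path.write_text(json.dumps(dashboard_model(service), indent=2))
        written.append(path)
    return written
```

Running write_dashboards("/var/lib/grafana/dashboards") as a user with write access to that directory would place the files where the provider above looks for them.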

Results

Before Grafana:

Prometheus UI only
No persistent dashboards
Hard to share metrics
No alerting

After Grafana:

15 beautiful dashboards
Team has visibility
Alerts to Slack
CEO checks dashboard daily

Lessons Learned

Start with service overview - Request rate, errors, latency
Use variables - Make dashboards reusable
Set up alerting - Don’t just visualize, alert
Organize dashboards - Use folders
Share with team - Everyone should see metrics

Conclusion

Grafana transformed our monitoring. Metrics are now accessible to everyone, not just the ops team.

Key takeaways:

Grafana makes Prometheus data beautiful
Use variables for flexible dashboards
Set up alerting early
Create dashboards for different audiences
Automate dashboard provisioning

If you have Prometheus, add Grafana. Your team will love you.
