Building Grafana Dashboards for Microservices Monitoring
Prometheus gave us metrics, but the built-in UI was basic. We needed better visualization for our team and management.
I set up Grafana and built dashboards for all our services. Now everyone can see system health at a glance. Our CEO checks the dashboard daily.
Why Grafana?
The Prometheus UI is good for ad-hoc queries, but:
- No dashboard persistence
- Limited visualization options
- Not suitable for NOC displays
- No user management
Grafana solves all these problems.
Installing Grafana
Downloaded Grafana 4.1.2:
wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana_4.1.2_amd64.deb
sudo dpkg -i grafana_4.1.2_amd64.deb
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
Access Grafana at http://localhost:3000 (default login: admin/admin)
Adding Prometheus Data Source
- Configuration → Data Sources → Add data source
- Name: Prometheus
- Type: Prometheus
- URL: http://localhost:9090
- Access: Server (default)
- Save & Test
Green checkmark = success!
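The same data source can also be created without clicking through the UI. Here is a minimal sketch, assuming Grafana's HTTP API endpoint POST /api/datasources and the default admin credentials mentioned above. The helper names are hypothetical, and "proxy" is the API value that I understand corresponds to the "Server" access mode:

```python
import base64
import json
import urllib.request


def build_datasource_payload(name="Prometheus", url="http://localhost:9090"):
    """Return the JSON body for creating a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",  # assumed API value for "Server" access
        "isDefault": True,
    }


def build_request(grafana_url, user, password, payload):
    """Build (but do not send) the POST request with basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_request("http://localhost:3000", "admin", "admin",
                        build_datasource_payload())
    # Sending is left commented out so the sketch has no side effects:
    # urllib.request.urlopen(req)
    print(req.full_url)
```

Building the request separately from sending it makes it easy to inspect the payload before touching a real Grafana instance.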
First Dashboard: Service Overview
Created dashboard with 4 panels:
Panel 1: Request Rate
Query:
sum(rate(http_requests_total[5m])) by (service)
Visualization: Graph
Legend: {{service}}
Shows requests per second for each service.
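To make it clearer what rate(...[5m]) gives you, here is a simplified sketch of the idea: take the counter samples inside the window, add up the increases (treating a drop as a counter reset), and divide by the elapsed seconds. This is an illustration only. Prometheus additionally extrapolates to the window boundaries, which this hypothetical simple_rate helper does not do:

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Unlike Prometheus, this does not extrapolate to the window edges.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a process restart);
        # count the new value as the increase since the reset.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0
```

For example, a counter that goes from 100 to 220 over 120 seconds yields 1 request per second.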
Panel 2: Error Rate
Query:
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
Visualization: Graph
Color: Red
Alert threshold: > 10 errors/sec
Panel 3: Latency (p95)
Query:
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
Visualization: Graph
Unit: seconds (s)
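The p95 value is an estimate, not an exact measurement: the quantile is interpolated linearly inside the bucket that contains the 95th-percentile request. A rough sketch of that calculation, assuming cumulative bucket counts as Prometheus exposes them (the function below is my own simplified version, not Prometheus source code):

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    ending with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest
                # finite bound, since there is nothing to interpolate to.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

The practical takeaway is that accuracy depends on bucket boundaries: if most requests land in one wide bucket, the p95 line on the graph is only as precise as that bucket is narrow.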
Panel 4: Service Status
Query:
up{job=~".*-service"}
Visualization: Stat
Thresholds: 0 = red, 1 = green
Dashboard Variables
Make dashboards dynamic:
Variable: service
- Name: service
- Type: Query
- Query: label_values(http_requests_total, service)
- Multi-value: Yes
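Use in queries:
rate(http_requests_total{service=~"$service"}[5m])
The regex matcher (=~) is needed because a multi-value variable expands to several alternatives when more than one service is selected. With a single-value variable, service="$service" works too.
Now you can filter by service!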
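With Multi-value enabled, the matcher type matters: an equality matcher can only ever match one exact label value, while several selected services have to become a regex alternation. A rough sketch of that expansion, purely as an illustration (this is not Grafana's actual code, the function names are hypothetical, and it assumes service names contain no regex metacharacters):

```python
def service_matcher(selected):
    """Build a PromQL label matcher for the selected service names.

    Assumes names contain no regex metacharacters (no escaping is done).
    """
    if not selected:
        return 'service=~".*"'  # nothing selected: match every service
    if len(selected) == 1:
        return f'service="{selected[0]}"'
    return 'service=~"' + "|".join(selected) + '"'


def request_rate_query(selected):
    """Return the request-rate query filtered to the selected services."""
    return f"rate(http_requests_total{{{service_matcher(selected)}}}[5m])"
```

Selecting "user" and "order" would produce a query with service=~"user|order", which is what the regex matcher needs to see.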
Template Dashboard
Created reusable template for all services:
{
  "dashboard": {
    "title": "$service Overview",
    "templating": {
      "list": [
        {
          "name": "service",
          "type": "query",
          "query": "label_values(http_requests_total, service)"
        }
      ]
    },
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{service=\"$service\"}[5m])"
          }
        ]
      }
    ]
  }
}
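Writing this JSON by hand gets tedious once there are more than a couple of panels, so it can be generated instead. A minimal sketch that builds the same skeleton from a list of (title, query) pairs. PANELS and build_dashboard are hypothetical names, and a real dashboard needs more fields per panel (panel type, position, data source) than this skeleton shows:

```python
import json

# Panel title and PromQL expression pairs; extend as needed.
PANELS = [
    ("Request Rate",
     'sum(rate(http_requests_total{service="$service"}[5m]))'),
    ("Error Rate",
     'sum(rate(http_requests_total{service="$service",status=~"5.."}[5m]))'),
]


def build_dashboard(title, panels):
    """Return a dashboard skeleton in the same shape as the JSON above."""
    return {
        "dashboard": {
            "title": title,
            "templating": {
                "list": [
                    {
                        "name": "service",
                        "type": "query",
                        "query": "label_values(http_requests_total, service)",
                    }
                ]
            },
            "panels": [
                {"title": panel_title, "targets": [{"expr": expr}]}
                for panel_title, expr in panels
            ],
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_dashboard("$service Overview", PANELS), indent=2))
```

Keeping the queries in one list means a fix to a query lands in every generated dashboard at once.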
Advanced Queries
Success Rate
sum(rate(http_requests_total{service="$service",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="$service"}[5m]))
* 100
Unit: percent (0-100)
Apdex Score
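(
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
+
sum(rate(http_request_duration_seconds_bucket{le="2"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))
Satisfied: ≤ 0.5s, tolerating: ≤ 2s. Buckets are cumulative, so the le="2" bucket already includes the le="0.5" requests. Halving the sum of both buckets gives satisfied + tolerating / 2; halving only the le="2" term would over-count and can push the score above 1.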
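Prometheus histogram buckets are cumulative, which is easy to trip over when computing Apdex: the le="2" bucket already contains every request counted in le="0.5". A small sketch of the arithmetic with made-up numbers (the apdex function is only an illustration of the formula, satisfied + tolerating / 2, divided by total):

```python
def apdex(satisfied_cumulative, tolerable_cumulative, total):
    """Apdex score from cumulative histogram bucket counts.

    satisfied_cumulative: requests with duration <= target (le="0.5")
    tolerable_cumulative: requests with duration <= 4x target (le="2"),
        which already includes the satisfied requests
    total: all requests (the _count series)
    """
    if total == 0:
        return None
    # (S + (S + T)) / 2 == S + T / 2, the usual Apdex numerator.
    return (satisfied_cumulative + tolerable_cumulative) / 2 / total


if __name__ == "__main__":
    # 80 requests under 0.5s, 95 under 2s (so 15 tolerating), 100 total.
    print(apdex(80, 95, 100))
```

With 80 satisfied and 15 tolerating requests out of 100, the score is (80 + 15 / 2) / 100 = 0.875.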
CPU Usage by Pod
sum(rate(container_cpu_usage_seconds_total{pod=~"$service.*"}[5m])) by (pod)
Alerting in Grafana
Set up alerts on panels:
- Edit panel
- Alert tab
- Create Alert
- Conditions:
  - WHEN avg() OF query(A, 5m, now) IS ABOVE 100
- Notifications: Slack channel
The alert fires when the average of query A over the last 5 minutes is above 100.
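To spell out what a condition like avg() OF query(A, 5m, now) IS ABOVE 100 checks, here is a small sketch of the evaluation: keep only the samples from the last 5 minutes, average them, and compare against the threshold. The should_alert function is hypothetical, and how missing data is handled here is my own simplification:

```python
def should_alert(samples, now, window=300, threshold=100.0):
    """Return True if the windowed average exceeds the threshold.

    samples: list of (timestamp_seconds, value) for query A.
    window: look-back period in seconds (5m = 300).
    """
    recent = [value for ts, value in samples if now - window <= ts <= now]
    if not recent:
        return False  # no data: treated as "not firing" in this sketch
    return sum(recent) / len(recent) > threshold
```

Because the condition averages over the window, a single short spike may not fire the alert, while a sustained elevation will.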
Notification Channels
Slack Integration
- Alerting → Notification channels → New channel
- Type: Slack
- Webhook URL: https://hooks.slack.com/services/...
- Channel: #alerts
- Test
Now alerts go to Slack!
Email Notifications
Configure SMTP in /etc/grafana/grafana.ini:
[smtp]
enabled = true
host = smtp.gmail.com:587
user = alerts@company.com
password = ***
from_address = alerts@company.com
from_name = Grafana
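A typo in this section means alert emails silently never arrive, so a quick sanity check before restarting Grafana can help. A minimal sketch using Python's configparser; the smtp_problems helper and the set of checks are my own assumptions about what is worth verifying, not anything Grafana ships:

```python
import configparser


def smtp_problems(ini_text):
    """Return a list of problems found in the [smtp] section."""
    parser = configparser.ConfigParser(
        interpolation=None,   # passwords may contain '%'
        strict=False,         # tolerate repeated sections or keys
        delimiters=("=",),    # keep ':' in values such as host:port
    )
    # Prepend a dummy section so keys before the first header still parse.
    parser.read_string("[__top__]\n" + ini_text)
    if not parser.has_section("smtp"):
        return ["missing [smtp] section"]
    smtp = parser["smtp"]
    problems = []
    if smtp.get("enabled", "false").strip().lower() != "true":
        problems.append("enabled is not true")
    for key in ("host", "from_address"):
        if not smtp.get(key, "").strip():
            problems.append(f"{key} is empty")
    return problems
```

Usage would be something like smtp_problems(open("/etc/grafana/grafana.ini").read()), where an empty list means the basic settings are present.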
Dashboard Organization
Created folder structure:
- Services - Individual service dashboards
- Infrastructure - Kubernetes, databases
- Business - Revenue, signups, active users
- SLOs - Service level objectives
Our Main Dashboard
NOC Dashboard (displayed on TV):
Row 1: Overall Health
- Total requests/sec
- Error rate
- Average latency
- Services up/down
Row 2: Service Status
- User Service (green/red)
- Order Service (green/red)
- Payment Service (green/red)
- Notification Service (green/red)
Row 3: Infrastructure
- CPU usage
- Memory usage
- Disk usage
- Network traffic
Row 4: Business Metrics
- Active users
- Orders/hour
- Revenue/hour
Auto-refresh: 30 seconds
Heatmaps
Visualize latency distribution:
Query:
sum(increase(http_request_duration_seconds_bucket[5m])) by (le)
Visualization: Heatmap
Data format: Time series buckets
Shows latency patterns over time.
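One detail that explains the "Time series buckets" setting: the le buckets are cumulative, so each series includes all faster requests, while a heatmap cell should show only the requests that fall into that bucket. As far as I understand, Grafana does this subtraction for you in that mode. A sketch of the idea (per_bucket_counts is a hypothetical helper):

```python
def per_bucket_counts(cumulative):
    """Convert cumulative bucket counts into per-bucket counts.

    cumulative: list of (upper_bound, cumulative_count), sorted by bound.
    """
    result = []
    previous = 0.0
    for upper_bound, count in cumulative:
        # Clamp at zero in case scraped counts are briefly inconsistent.
        result.append((upper_bound, max(count - previous, 0.0)))
        previous = count
    return result
```

For cumulative counts of 50, 90 and 100, the heatmap cells would hold 50, 40 and 10 requests.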
Single Stat Panels
Big numbers for key metrics:
Active Users:
sum(active_users)
Visualization: Stat
Font size: 72pt
Color: Green if > 1000
Revenue Today:
sum(increase(revenue_total[24h]))
Prefix: $
Decimals: 2
Table Panels
List top endpoints by latency:
Query:
topk(10,
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)
)
Visualization: Table
Columns: Endpoint, Latency
Sort: Latency DESC
Annotations
Mark deployments on graphs:
- Dashboard settings → Annotations
- Add annotation query
- Data source: Prometheus
- Query:
deployment_timestamp
Vertical lines show when deployments happened.
Sharing Dashboards
Snapshot:
- Share → Snapshot
- Publish to snapshots.raintank.io
- Get shareable link
Export JSON:
- Dashboard settings → JSON Model
- Copy JSON
- Import on another Grafana instance
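The copy-and-import step can also be scripted. A minimal sketch that prepares an exported JSON model for another instance, assuming Grafana's POST /api/dashboards/db endpoint, which expects the model wrapped in a "dashboard" key. The prepare_for_import name is hypothetical:

```python
import copy


def prepare_for_import(exported, overwrite=False):
    """Wrap an exported dashboard model for import on another instance."""
    dashboard = copy.deepcopy(exported)
    # The numeric id belongs to the source instance; clear it so the
    # target instance assigns its own.
    dashboard["id"] = None
    return {"dashboard": dashboard, "overwrite": overwrite}
```

The deep copy keeps the original export untouched, so the same file can be imported into several instances.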
Embed:
- Share → Embed
- Copy iframe code
- Embed in wiki/docs
Dashboard Best Practices
- Use consistent colors - Red for errors, green for success
- Group related panels - Use rows
- Add descriptions - Help text for each panel
- Set appropriate time ranges - Last 1h, 6h, 24h
- Use variables - Make dashboards reusable
- Set refresh rate - 30s for NOC, 5m for analysis
- Add links - Link to runbooks, logs
Our Dashboard Library
Created 15 dashboards:
Services (5):
- User Service
- Order Service
- Payment Service
- Notification Service
- API Gateway
Infrastructure (4):
- Kubernetes Cluster
- PostgreSQL
- Redis
- Nginx
Business (3):
- Revenue Dashboard
- User Engagement
- Conversion Funnel
SLOs (3):
- Availability SLO
- Latency SLO
- Error Budget
Mobile App
Grafana has a mobile app:
- iOS/Android
- View dashboards on phone
- Get alert notifications
- Great for on-call
Provisioning Dashboards
Automate dashboard creation:
/etc/grafana/provisioning/dashboards/dashboard.yml:
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    options:
      path: /var/lib/grafana/dashboards
Put JSON files in /var/lib/grafana/dashboards/
Dashboards auto-load on startup!
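Getting the JSON files into that directory can be automated too. A small sketch that writes one file per dashboard, with the file name derived from the title. The write_dashboards helper is hypothetical, and it assumes each file holds the dashboard model directly (the same thing the JSON Model view shows), without the "dashboard" wrapper used by the HTTP API:

```python
import json
import re
import tempfile
from pathlib import Path


def write_dashboards(dashboards, directory):
    """Write each dashboard model to <directory>/<slug-of-title>.json."""
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for dash in dashboards:
        # "User Service" -> "user-service"
        slug = re.sub(r"[^a-z0-9]+", "-", dash["title"].lower()).strip("-")
        path = out / f"{slug}.json"
        path.write_text(json.dumps(dash, indent=2))
        written.append(path)
    return written


if __name__ == "__main__":
    # Demo against a temporary directory instead of /var/lib/grafana.
    target = tempfile.mkdtemp()
    for p in write_dashboards([{"title": "User Service", "panels": []}], target):
        print(p)
```

Pointing the directory argument at /var/lib/grafana/dashboards (with suitable permissions) would put the files where the provider above looks for them.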
Results
Before Grafana:
- Prometheus UI only
- No persistent dashboards
- Hard to share metrics
- No alerting
After Grafana:
- 15 beautiful dashboards
- Team has visibility
- Alerts to Slack
- CEO checks dashboard daily
Lessons Learned
- Start with service overview - Request rate, errors, latency
- Use variables - Make dashboards reusable
- Set up alerting - Don’t just visualize, alert
- Organize dashboards - Use folders
- Share with team - Everyone should see metrics
Conclusion
Grafana transformed our monitoring. Metrics are now accessible to everyone, not just the ops team.
Key takeaways:
- Grafana makes Prometheus data beautiful
- Use variables for flexible dashboards
- Set up alerting early
- Create dashboards for different audiences
- Automate dashboard provisioning
If you have Prometheus, add Grafana. Your team will love you.