Prometheus Alertmanager: Smart Alert Routing and Deduplication
We had Prometheus collecting metrics, but alerts were a mess. Every alert went to everyone. 3 AM pages for non-critical issues. Alert fatigue was real.
I configured Alertmanager properly. Now alerts route to the right team, duplicates are suppressed, and we only get paged for real emergencies.
Table of Contents
- The Problem
- Installing Alertmanager
- Basic Configuration
- Routing Rules
- Grouping Alerts
- Inhibition Rules
- Silencing Alerts
- Alert Templates
- Prometheus Alert Rules
- PagerDuty Integration
- Email Integration
- Webhook Integration
- High Availability
- Our Production Setup
- Results
- Lessons Learned
- Conclusion
The Problem
Before Alertmanager:
- All alerts to #general Slack channel
- No deduplication (same alert 50 times)
- No routing (everyone gets everything)
- No inhibition (cascading alerts)
- 3 AM pages for warnings
The team had started ignoring alerts.
Installing Alertmanager
Download and run Alertmanager 0.15:
wget https://github.com/prometheus/alertmanager/releases/download/v0.15.0/alertmanager-0.15.0.linux-amd64.tar.gz
tar xvf alertmanager-0.15.0.linux-amd64.tar.gz
cd alertmanager-0.15.0.linux-amd64
./alertmanager --config.file=alertmanager.yml
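Running the binary in a terminal is fine for testing; for a long-lived service you'll probably want a systemd unit. A minimal sketch, assuming the binary is in /usr/local/bin and the config in /etc/alertmanager (adjust paths and user for your layout):
# /etc/systemd/system/alertmanager.service (paths are assumptions)
[Unit]
Description=Prometheus Alertmanager
After=network.target

[Service]
User=alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager
Restart=on-failure

[Install]
WantedBy=multi-user.target
Enable it with systemctl enable --now alertmanager.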
Basic Configuration
alertmanager.yml:
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
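Before reloading Alertmanager, validate the file. The amtool binary that ships in the same tarball checks config syntax and prints a summary of the routes and receivers it found:
./amtool check-config alertmanager.yml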
Routing Rules
Route alerts to different teams:
route:
  receiver: 'default'
  group_by: ['alertname']
  routes:
    # Database alerts to the DB team
    - match:
        service: database
      receiver: 'database-team'
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true  # Keep matching later routes too
    # Frontend alerts to the frontend team
    - match_re:
        service: ^(web|api)$
      receiver: 'frontend-team'

receivers:
  - name: 'database-team'
    slack_configs:
      - channel: '#db-alerts'
  - name: 'frontend-team'
    slack_configs:
      - channel: '#frontend-alerts'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
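Routes are evaluated top to bottom and an alert stops at the first match unless continue: true is set, so ordering matters. Newer amtool releases can show and exercise the routing tree against a local config file (these subcommands may not exist in 0.15-era amtool):
# Which receiver would these labels hit?
amtool config routes test --config.file=alertmanager.yml service=database severity=critical
amtool config routes show --config.file=alertmanager.yml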
Grouping Alerts
Group related alerts:
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s       # Wait 30s to collect related alerts before the first notification
  group_interval: 5m    # Wait 5m before notifying about new alerts added to a group
  repeat_interval: 4h   # Resend if still firing after 4h
Example: 10 pods down in the same cluster → 1 grouped alert instead of 10.
Inhibition Rules
Suppress dependent alerts:
inhibit_rules:
  # If the cluster is down, don't alert on individual services
  - source_match:
      alertname: 'ClusterDown'
    target_match:
      alertname: 'ServiceDown'
    equal: ['cluster']
  # If a node is down, don't alert on pods on that node
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: '(PodDown|ContainerDown)'
    equal: ['node']
Prevents alert storms.
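For equal: ['cluster'] to have any effect, both the source and target alerts must actually carry a cluster label, whether from rule labels, target labels, or external_labels. A hypothetical sketch (the ClusterDown expression is illustrative, not from our rules):
groups:
  - name: inhibition-example
    rules:
      - alert: ClusterDown
        # Hypothetical: no API server targets up in a cluster
        expr: sum by (cluster) (up{job="kubernetes-apiservers"}) == 0
        for: 2m
        labels:
          severity: critical
      - alert: ServiceDown
        expr: up{job="web"} == 0   # cluster comes from target/external labels here
        for: 1m
        labels:
          severity: critical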
Silencing Alerts
Silence during maintenance:
# Silence for 2 hours
amtool silence add alertname=HighMemory --duration=2h --comment="Planned maintenance"
# Silence specific instance
amtool silence add instance=web-01 --duration=1h
# List active silences
amtool silence query
# Expire silence
amtool silence expire <silence-id>
Or use the web UI at http://alertmanager:9093.
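Silences can also be created over the HTTP API, which is handy in deploy scripts. A rough sketch against the v1 API that 0.15 exposes (newer releases use /api/v2/silences; timestamps and names here are made up):
curl -XPOST http://alertmanager:9093/api/v1/silences -d '{
  "matchers": [
    {"name": "alertname", "value": "HighMemory", "isRegex": false}
  ],
  "startsAt": "2018-01-15T11:00:00Z",
  "endsAt": "2018-01-15T13:00:00Z",
  "createdBy": "deploy-script",
  "comment": "Planned maintenance"
}'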
Alert Templates
Custom Slack message:
receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Severity:* {{ .Labels.severity }}
          *Instance:* {{ .Labels.instance }}
          *Description:* {{ .Annotations.description }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
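Once the inline template grows, it can be moved into a template file and referenced by name. A sketch, assuming the file lives at /etc/alertmanager/templates/slack.tmpl:
# alertmanager.yml
templates:
  - '/etc/alertmanager/templates/*.tmpl'

receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.custom.title" . }}'
        text: '{{ template "slack.custom.text" . }}'
And the template file defines those names:
{{ define "slack.custom.title" }}[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}{{ end }}
{{ define "slack.custom.text" }}{{ range .Alerts }}{{ .Annotations.description }}
{{ end }}{{ end }}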
Prometheus Alert Rules
Define alerts in Prometheus:
alerts.yml:
groups:
  - name: example
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
          service: api
        annotations:
          description: "Error rate is {{ $value }} on {{ $labels.instance }}"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      # High memory usage
      - alert: HighMemory
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
        for: 10m
        labels:
          severity: warning
          service: infrastructure
        annotations:
          description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

      # Service down
      - alert: ServiceDown
        expr: up{job="web"} == 0
        for: 1m
        labels:
          severity: critical
          service: web
        annotations:
          description: "{{ $labels.instance }} is down"
Load in Prometheus:
# prometheus.yml
rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
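Before reloading Prometheus, the rule file and the main config can be validated with promtool, which ships with Prometheus:
promtool check rules alerts.yml
promtool check config prometheus.yml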
PagerDuty Integration
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.description }}'
        details:
          num_firing: '{{ .Alerts.Firing | len }}'
          num_resolved: '{{ .Alerts.Resolved | len }}'
Email Integration
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'  # use an app password or pull this from a secrets store

receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        headers:
          Subject: '[ALERT] {{ .GroupLabels.alertname }}'
Webhook Integration
Custom webhook:
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://internal-service/alerts'
        send_resolved: true
The webhook receives a JSON payload like this:
{
  "status": "firing",
  "alerts": [
    {
      "labels": {
        "alertname": "HighErrorRate",
        "severity": "critical"
      },
      "annotations": {
        "description": "Error rate is 0.08"
      },
      "startsAt": "2018-01-15T11:00:00Z"
    }
  ]
}
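On the receiving side, it's plain JSON over HTTP POST, so a receiver can be tiny. A minimal sketch in Python using only the standard library (the filename, port, and what you do with each alert are up to you):
# minimal_alert_receiver.py - sketch of a webhook endpoint for Alertmanager
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the notification body
        length = int(self.headers.get('Content-Length', 0))
        payload = json.loads(self.rfile.read(length))
        # Each notification can contain several grouped alerts
        for alert in payload.get('alerts', []):
            labels = alert.get('labels', {})
            print(payload.get('status'), labels.get('alertname'), labels.get('severity'))
        self.send_response(200)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8080), AlertHandler).serve_forever()
Point the webhook url at this service and it will log every firing and resolved notification.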
High Availability
Run multiple Alertmanager instances and list all of them in Prometheus:
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-01:9093'
            - 'alertmanager-02:9093'
            - 'alertmanager-03:9093'
The Alertmanager instances gossip notification state to each other and deduplicate, so even though every Prometheus sends alerts to all of them, each notification goes out only once.
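For the gossip to work, each instance needs to know about its peers. With the 0.15 cluster flags this looks roughly like the following (hostnames and the gossip port are assumptions):
# on alertmanager-02
./alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-01:9094 \
  --cluster.peer=alertmanager-03:9094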
Our Production Setup
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/...'

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical to PagerDuty + Slack
    - match:
        severity: critical
      receiver: 'critical'
      continue: true
    # Database team
    - match:
        team: database
      receiver: 'database-team'
    # Infrastructure team
    - match:
        team: infrastructure
      receiver: 'infra-team'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
  - name: 'critical'
    pagerduty_configs:
      - service_key: 'PAGERDUTY_KEY'
    slack_configs:
      - channel: '#critical-alerts'
        color: 'danger'
  - name: 'database-team'
    slack_configs:
      - channel: '#db-alerts'
  - name: 'infra-team'
    slack_configs:
      - channel: '#infra-alerts'

inhibit_rules:
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: '(ServiceDown|PodDown)'
    equal: ['cluster']
Results
Before:
- All alerts to everyone
- 50+ duplicate alerts
- Alert fatigue
- 3 AM pages for warnings
After:
- Alerts routed to right team
- Grouped and deduplicated
- Only critical alerts page
- 90% reduction in noise
Lessons Learned
- Route alerts properly - Right alert to right team
- Group related alerts - Reduce noise
- Use inhibition - Prevent cascading alerts
- Silence during maintenance - Avoid false alarms
- Test alert rules - Validate them before deploying to production (see the sketch below)
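For that last point, newer Prometheus releases (2.5+) let you unit-test rules with promtool test rules. A sketch against the ServiceDown rule above (the file name and series values are made up):
# test_alerts.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="web", instance="web-01"}'
        values: '0 0 0'
    alert_rule_test:
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              service: web
              job: web
              instance: web-01
            exp_annotations:
              description: "web-01 is down"
Run it with promtool test rules test_alerts.yml.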
Conclusion
Alertmanager transforms Prometheus alerts from noise to actionable signals.
Key takeaways:
- Route alerts to appropriate teams
- Group and deduplicate alerts
- Inhibit dependent alerts
- Integrate with Slack/PagerDuty
- Silence during maintenance
Configure Alertmanager properly. Your on-call team will thank you.