We had Prometheus collecting metrics, but alerts were a mess. Every alert went to everyone. 3 AM pages for non-critical issues. Alert fatigue was real.

I configured Alertmanager properly. Now alerts route to the right team, duplicates are suppressed, and we only get paged for real emergencies.

The Problem

Before Alertmanager:

  • All alerts to #general Slack channel
  • No deduplication (same alert 50 times)
  • No routing (everyone gets everything)
  • No inhibition (cascading alerts)
  • 3 AM pages for warnings

The team had started ignoring alerts entirely.

Installing Alertmanager

Download and run Alertmanager 0.15:

wget https://github.com/prometheus/alertmanager/releases/download/v0.15.0/alertmanager-0.15.0.linux-amd64.tar.gz
tar xvf alertmanager-0.15.0.linux-amd64.tar.gz
cd alertmanager-0.15.0.linux-amd64
./alertmanager --config.file=alertmanager.yml
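
For anything longer-lived than a quick test, I run it under systemd so it survives reboots. A minimal unit sketch; the paths (/usr/local/bin, /etc/alertmanager, /var/lib/alertmanager) are my own choices, adjust to your layout:

# /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
After=network-online.target

[Service]
User=alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then systemctl daemon-reload && systemctl enable --now alertmanager.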

Basic Configuration

alertmanager.yml:

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
- name: 'default'
  slack_configs:
  - channel: '#alerts'
    text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
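
amtool ships in the same tarball and can validate the config before you reload anything; it flags YAML and schema mistakes without touching the running instance:

./amtool check-config alertmanager.yml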

Routing Rules

Route alerts to different teams:

route:
  receiver: 'default'
  group_by: ['alertname']
  
  routes:
  # Database alerts to DB team
  - match:
      service: database
    receiver: 'database-team'
    
  # Critical alerts to PagerDuty
  - match:
      severity: critical
    receiver: 'pagerduty'
    continue: true  # Keep evaluating the routes below as well
    
  # Frontend alerts to frontend team
  - match_re:
      service: ^(web|api)$
    receiver: 'frontend-team'

receivers:
- name: 'database-team'
  slack_configs:
  - channel: '#db-alerts'

- name: 'frontend-team'
  slack_configs:
  - channel: '#frontend-alerts'

- name: 'pagerduty'
  pagerduty_configs:
  - service_key: 'YOUR_PAGERDUTY_KEY'
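
To check that a given label set really lands on the receiver you expect, push a synthetic alert at Alertmanager's v1 API and watch which channel it turns up in. The labels below are just an example:

curl -XPOST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "RoutingTest",
      "service": "database",
      "severity": "warning"
    },
    "annotations": {
      "description": "Synthetic alert - should land in #db-alerts"
    }
  }]'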

Grouping Alerts

Group related alerts:

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s       # Wait 30s for more alerts
  group_interval: 5m    # Send grouped alerts every 5m
  repeat_interval: 4h   # Resend if still firing after 4h

Example: 10 pods down in the same cluster → 1 grouped alert instead of 10.

Inhibition Rules

Suppress dependent alerts:

inhibit_rules:
# If cluster is down, don't alert on individual services
- source_match:
    alertname: 'ClusterDown'
  target_match:
    alertname: 'ServiceDown'
  equal: ['cluster']

# If node is down, don't alert on pods on that node
- source_match:
    alertname: 'NodeDown'
  target_match_re:
    alertname: '(PodDown|ContainerDown)'
  equal: ['node']

Prevents alert storms.

Silencing Alerts

Silence during maintenance:

# Silence for 2 hours
amtool silence add alertname=HighMemory --duration=2h --comment="Planned maintenance"

# Silence specific instance
amtool silence add instance=web-01 --duration=1h

# List active silences
amtool silence query

# Expire silence
amtool silence expire <silence-id>

Or use the web UI at http://alertmanager:9093.
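
These amtool commands assume the tool knows where Alertmanager lives. Rather than passing --alertmanager.url on every call, a small config file does it; a sketch (the author and comment_required keys are optional):

# ~/.config/amtool/config.yml
alertmanager.url: "http://localhost:9093"
author: oncall@example.com
comment_required: true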

Alert Templates

Custom Slack message:

receivers:
- name: 'slack'
  slack_configs:
  - channel: '#alerts'
    title: '{{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      *Alert:* {{ .Labels.alertname }}
      *Severity:* {{ .Labels.severity }}
      *Instance:* {{ .Labels.instance }}
      *Description:* {{ .Annotations.description }}
      *Runbook:* {{ .Annotations.runbook_url }}
      {{ end }}
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
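
Larger templates are easier to maintain in their own files and referenced by name from the config. A sketch, with /etc/alertmanager/templates as an assumed path:

# alertmanager.yml
templates:
- '/etc/alertmanager/templates/*.tmpl'

# /etc/alertmanager/templates/slack.tmpl
{{ define "slack.custom.text" }}
{{ range .Alerts }}*{{ .Labels.alertname }}* ({{ .Labels.severity }}): {{ .Annotations.description }}
{{ end }}{{ end }}

The receiver then sets text: '{{ template "slack.custom.text" . }}' instead of an inline template.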

Prometheus Alert Rules

Define alerts in Prometheus:

alerts.yml:

groups:
- name: example
  rules:
  # High error rate
  - alert: HighErrorRate
    expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
    for: 5m
    labels:
      severity: critical
      service: api
    annotations:
      description: "Error rate is {{ $value }} on {{ $labels.instance }}"
      runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
  
  # High memory usage
  - alert: HighMemory
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
    for: 10m
    labels:
      severity: warning
      service: infrastructure
    annotations:
      description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
  
  # Service down
  - alert: ServiceDown
    expr: up{job="web"} == 0
    for: 1m
    labels:
      severity: critical
      service: web
    annotations:
      description: "{{ $labels.instance }} is down"

Load the rules in Prometheus and point it at Alertmanager:

# prometheus.yml
rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']
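
promtool (bundled with Prometheus) catches rule syntax mistakes before you reload, and a SIGHUP makes the running server pick up the new file, assuming Prometheus runs directly on the host:

promtool check rules alerts.yml
kill -HUP $(pidof prometheus)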

PagerDuty Integration

receivers:
- name: 'pagerduty'
  pagerduty_configs:
  - service_key: 'YOUR_SERVICE_KEY'
    description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.description }}'
    details:
      firing: '{{ .Alerts.Firing | len }}'
      resolved: '{{ .Alerts.Resolved | len }}'

Email Integration

global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'

receivers:
- name: 'email'
  email_configs:
  - to: 'team@example.com'
    headers:
      Subject: '[ALERT] {{ .GroupLabels.alertname }}'

Webhook Integration

Custom webhook:

receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://internal-service/alerts'
    send_resolved: true

The webhook receives a JSON payload (abbreviated here to the interesting fields):

{
  "status": "firing",
  "alerts": [
    {
      "labels": {
        "alertname": "HighErrorRate",
        "severity": "critical"
      },
      "annotations": {
        "description": "Error rate is 0.08"
      },
      "startsAt": "2018-01-15T11:00:00Z"
    }
  ]
}
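
The real payload carries a few more fields (groupLabels, commonAnnotations, externalURL, and so on), but for developing the receiving service you can save the sample above and replay it by hand without firing a real alert:

# sample-alert.json contains the JSON shown above
curl -XPOST http://internal-service/alerts \
  -H 'Content-Type: application/json' \
  -d @sample-alert.json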

High Availability

Run multiple Alertmanager instances and point Prometheus at all of them:

# prometheus.yml
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 'alertmanager-01:9093'
      - 'alertmanager-02:9093'
      - 'alertmanager-03:9093'

The instances gossip notification state to each other, so Prometheus can send every alert to all of them and you still get a single notification.
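
Prometheus only needs the list of targets; the Alertmanager instances themselves form the cluster via the --cluster.* flags introduced in 0.15. Hostnames and ports below are examples; each instance lists the others as peers:

./alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-01:9094 \
  --cluster.peer=alertmanager-02:9094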

Our Production Setup

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/...'

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  
  routes:
  # Critical to PagerDuty + Slack
  - match:
      severity: critical
    receiver: 'critical'
    continue: true
  
  # Database team
  - match:
      team: database
    receiver: 'database-team'
  
  # Infrastructure team
  - match:
      team: infrastructure
    receiver: 'infra-team'

receivers:
- name: 'default'
  slack_configs:
  - channel: '#alerts'

- name: 'critical'
  pagerduty_configs:
  - service_key: 'PAGERDUTY_KEY'
  slack_configs:
  - channel: '#critical-alerts'
    color: 'danger'

- name: 'database-team'
  slack_configs:
  - channel: '#db-alerts'

- name: 'infra-team'
  slack_configs:
  - channel: '#infra-alerts'

inhibit_rules:
- source_match:
    alertname: 'ClusterDown'
  target_match_re:
    alertname: '(ServiceDown|PodDown)'
  equal: ['cluster']

Results

Before:

  • All alerts to everyone
  • 50+ duplicate alerts
  • Alert fatigue
  • 3 AM pages for warnings

After:

  • Alerts routed to right team
  • Grouped and deduplicated
  • Only critical alerts page
  • 90% reduction in noise

Lessons Learned

  1. Route alerts properly - Right alert to right team
  2. Group related alerts - Reduce noise
  3. Use inhibition - Prevent cascading alerts
  4. Silence during maintenance - Avoid false alarms
  5. Test alert rules - Before deploying to production

Conclusion

Alertmanager transforms Prometheus alerts from noise to actionable signals.

Key takeaways:

  1. Route alerts to appropriate teams
  2. Group and deduplicate alerts
  3. Inhibit dependent alerts
  4. Integrate with Slack/PagerDuty
  5. Silence during maintenance

Configure Alertmanager properly. Your on-call team will thank you.