Solving Prometheus High Cardinality: From 10M to 100K Series
Our Prometheus was dying: 10 million time series, queries timing out, and OOM crashes every day.
We fixed the high cardinality and went from 10M to 100K series with 60x faster queries. Here's how.
The Problem
Symptoms:
- Time series: 10 million
- Memory usage: 64GB (OOM crashes)
- Query time: 30s+ (timeouts)
- Scrape duration: 2min
- Disk usage: 500GB
Root Cause: High cardinality labels
# BAD: User ID in label (millions of unique values)
http_requests_total{user_id="12345", endpoint="/api/users"}
# BAD: Timestamp in label
cache_hits_total{timestamp="2020-03-20T10:30:45Z"}
# BAD: Full URL in label
api_calls_total{url="https://api.example.com/users/12345/orders/67890"}
Solution 1: Remove High Cardinality Labels
Before:
from prometheus_client import Counter

# BAD
request_counter = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'user_id', 'session_id']  # user_id and session_id are high cardinality
)

request_counter.labels(
    method='GET',
    endpoint='/api/users',
    user_id='12345',
    session_id='abc-def-ghi'
).inc()
After:
# GOOD
request_counter = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']  # Low cardinality only
)

request_counter.labels(
    method='GET',
    endpoint='/api/users',
    status='200'
).inc()
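One subtle point: the endpoint label is only low cardinality if it holds the route template, not the raw request path, otherwise IDs in the URL sneak back in exactly as in the "full URL in label" example above. A minimal sketch of what that looks like, assuming a Flask app; the route, handler, and port of call here are illustrative, not taken from our service:

from flask import Flask, request
from prometheus_client import Counter

app = Flask(__name__)

request_counter = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

@app.route('/api/users/<user_id>')
def get_user(user_id):
    # ... handle the request ...
    return {'id': user_id}, 200

@app.after_request
def count_request(response):
    # request.url_rule is the route template ('/api/users/<user_id>'),
    # not the concrete path ('/api/users/12345'), so cardinality stays bounded
    endpoint = request.url_rule.rule if request.url_rule else 'unmatched'
    request_counter.labels(
        method=request.method,
        endpoint=endpoint,
        status=str(response.status_code)
    ).inc()
    return response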
Results:
- Time series: 10M → 500K (-95%)
- Memory: 64GB → 8GB (-87%)
Solution 2: Aggregate High Cardinality Data
from prometheus_client import Histogram
# Use histogram for high cardinality values
response_time = Histogram(
    'http_response_time_seconds',
    'HTTP response time',
    ['method', 'endpoint'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]
)

# Record individual values without creating new series
response_time.labels(method='GET', endpoint='/api/users').observe(0.234)
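In application code you rarely call observe() by hand; prometheus_client histograms expose a time() helper that works as a context manager or decorator. A small self-contained sketch (the handler and sleep are stand-ins for real work):

import time
from prometheus_client import Histogram

# Same histogram as above, repeated so this sketch runs on its own
response_time = Histogram(
    'http_response_time_seconds',
    'HTTP response time',
    ['method', 'endpoint'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]
)

def handle_get_users():
    # time() observes the elapsed seconds when the block exits
    with response_time.labels(method='GET', endpoint='/api/users').time():
        time.sleep(0.1)  # stand-in for the real handler work

handle_get_users()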
Query Aggregated Data:
# 95th percentile response time
histogram_quantile(0.95,
  rate(http_response_time_seconds_bucket[5m])
)
# Average response time
rate(http_response_time_seconds_sum[5m]) /
rate(http_response_time_seconds_count[5m])
Solution 3: Relabel Configs
# prometheus.yml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod

    # Drop high cardinality metrics and labels
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'container_network_.*'
        action: drop

      # Remove pod_id (high cardinality)
      - regex: 'pod_id'
        action: labeldrop

      # Keep namespace-level granularity only: drop the pod label
      # (only safe where the remaining labels keep series unique)
      - regex: 'pod'
        action: labeldrop
Solution 4: Recording Rules
# rules.yml
groups:
  - name: aggregation
    interval: 1m
    rules:
      # Pre-aggregate high cardinality metrics
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)

      - record: job:http_response_time:p95
        expr: histogram_quantile(0.95, sum(rate(http_response_time_seconds_bucket[5m])) by (job, le))
Query Pre-Aggregated Data:
# Use recording rule (fast)
job:http_requests:rate5m{job="api"}
# Instead of raw metric (slow)
sum(rate(http_requests_total{job="api"}[5m])) by (method, status)
Solution 5: Metric Lifecycle
from prometheus_client import Counter, CollectorRegistry
import time

class MetricManager:
    """Creates counters on demand and expires label sets that go unused."""

    def __init__(self):
        self.registry = CollectorRegistry()
        self.metrics = {}      # metric name -> (Counter, label name order)
        self.last_used = {}    # (metric name, label values) -> last use timestamp

    def get_counter(self, name, labels):
        """Get or create a counter child, tracking last use for TTL cleanup."""
        # Drop anything stale before handing out a new child
        self._cleanup_old_metrics()

        if name not in self.metrics:
            counter = Counter(
                name,
                'Description',
                list(labels.keys()),
                registry=self.registry
            )
            self.metrics[name] = (counter, list(labels.keys()))

        counter, label_names = self.metrics[name]
        label_values = tuple(labels[key] for key in label_names)
        self.last_used[(name, label_values)] = time.time()
        return counter.labels(**labels)

    def _cleanup_old_metrics(self):
        """Remove label combinations (time series) not used in 1 hour."""
        now = time.time()
        expired = [
            key for key, last_used in self.last_used.items()
            if now - last_used > 3600
        ]
        for name, label_values in expired:
            counter, _ = self.metrics[name]
            # remove() deletes just this label combination from the counter
            counter.remove(*label_values)
            del self.last_used[(name, label_values)]
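A usage sketch, assuming the manager is created once per process and its registry is exposed with prometheus_client's start_http_server; the port, metric name, and labels are illustrative:

from prometheus_client import start_http_server

manager = MetricManager()

# Serve this process's registry on :8000/metrics
start_http_server(8000, registry=manager.registry)

manager.get_counter(
    'jobs_processed_total',
    {'queue': 'default', 'status': 'ok'}
).inc()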
Monitoring Cardinality
# Check number of time series per metric
count({__name__=~".+"}) by (__name__)
# Top 10 metrics by cardinality
topk(10, count by (__name__)({__name__=~".+"}))
# Prometheus doesn't report memory per metric, but series count is the best
# proxy; also watch total head series, which tracks memory usage closely
prometheus_tsdb_head_series
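The same numbers are also available outside PromQL: the /api/v1/status/tsdb endpoint reports the top metric names and label names by series count. A small sketch with requests, assuming the server is reachable at http://prometheus:9090 (placeholder URL):

import requests

# Head-block statistics: top metrics and labels by series count
resp = requests.get('http://prometheus:9090/api/v1/status/tsdb', timeout=10)
resp.raise_for_status()
data = resp.json()['data']

print('Top metrics by series count:')
for entry in data['seriesCountByMetricName']:
    print(f"  {entry['name']}: {entry['value']}")

print('Top labels by distinct value count:')
for entry in data['labelValueCountByLabelName']:
    print(f"  {entry['name']}: {entry['value']}")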
Alert on High Cardinality:
groups:
  - name: cardinality
    rules:
      - alert: HighCardinality
        expr: count({__name__=~".+"}) > 100000
        for: 5m
        annotations:
          summary: "High cardinality detected"
          description: "{{ $value }} time series"
Results
Before:
| Metric | Value |
|---|---|
| Time series | 10M |
| Memory | 64GB |
| Query time | 30s |
| Scrape duration | 2min |
| Disk usage | 500GB |
After:
| Metric | Value | Improvement |
|---|---|---|
| Time series | 100K | -99% |
| Memory | 4GB | -94% |
| Query time | 0.5s | -98% |
| Scrape duration | 5s | -96% |
| Disk usage | 50GB | -90% |
Cost Savings:
- Infrastructure: $2000/month → $200/month (-90%)
- No more OOM crashes
- Stable performance
Lessons Learned
- Cardinality kills: 10M series will take Prometheus down
- Labels matter: only low-cardinality values belong in labels
- Histograms for values: put continuous data in buckets, not labels
- Recording rules help: pre-aggregate the queries you run often
- Monitor cardinality: catch growth before it becomes an outage
Conclusion
We fixed Prometheus high cardinality: 10M → 100K series, 60x faster queries, 90% cost reduction.
Key takeaways:
- Time series: 10M → 100K (-99%)
- Query time: 30s → 0.5s (-98%)
- Memory: 64GB → 4GB (-94%)
- Cost: $2000 → $200/month (-90%)
- Zero OOM crashes
Monitor your cardinality. It will save you.