Our Prometheus was dying. 10 million time series, queries timing out, OOM crashes every day.

We fixed it by eliminating high-cardinality labels: 10M → 100K series, 50x faster queries. Here’s how.

The Problem

Symptoms:

  • Time series: 10 million
  • Memory usage: 64GB (OOM crashes)
  • Query time: 30s+ (timeouts)
  • Scrape duration: 2min
  • Disk usage: 500GB

Root Cause: High cardinality labels

# BAD: User ID in label (millions of unique values)
http_requests_total{user_id="12345", endpoint="/api/users"}

# BAD: Timestamp in label
cache_hits_total{timestamp="2020-03-20T10:30:45Z"}

# BAD: Full URL in label
api_calls_total{url="https://api.example.com/users/12345/orders/67890"}
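Why these labels hurt: the number of series is the product of the distinct values of every label on a metric. A quick back-of-the-envelope check makes the explosion obvious (illustrative numbers, not our exact label sets):

methods   = 5            # GET, POST, ...
endpoints = 50
statuses  = 8
users     = 100_000      # distinct user_id values

print(methods * endpoints * statuses)           # 2,000 series without user_id
print(methods * endpoints * statuses * users)   # 200,000,000 series with user_id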

Solution 1: Remove High Cardinality Labels

Before:

from prometheus_client import Counter

# BAD
request_counter = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'user_id', 'session_id']  # user_id and session_id are high cardinality
)

request_counter.labels(
    method='GET',
    endpoint='/api/users',
    user_id='12345',
    session_id='abc-def-ghi'
).inc()

After:

# GOOD
request_counter = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']  # Low cardinality only
)

request_counter.labels(
    method='GET',
    endpoint='/api/users',
    status='200'
).inc()

Results:

  • Time series: 10M → 500K (-95%)
  • Memory: 64GB → 8GB (-87%)
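A related trap is the endpoint label itself: if you put raw request paths into it (/api/users/12345), every ID creates a new series. A minimal sketch of normalizing paths to route templates before labeling (the helper and regexes below are ours, not part of the original code):

import re

def normalize_endpoint(path: str) -> str:
    """Collapse IDs so the endpoint label stays low cardinality."""
    path = re.sub(r'/[0-9a-f-]{36}', '/:uuid', path)  # UUIDs
    path = re.sub(r'/\d+', '/:id', path)              # numeric IDs
    return path

request_counter.labels(
    method='GET',
    endpoint=normalize_endpoint('/api/users/12345'),  # -> /api/users/:id
    status='200'
).inc()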

Solution 2: Aggregate High Cardinality Data

from prometheus_client import Histogram

# Use a histogram for high-cardinality values (e.g. latencies): fixed buckets instead of per-value labels
response_time = Histogram(
    'http_response_time_seconds',
    'HTTP response time',
    ['method', 'endpoint'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]
)

# Record individual values without creating new series
response_time.labels(method='GET', endpoint='/api/users').observe(0.234)

Query Aggregated Data:

# 95th percentile response time per endpoint
histogram_quantile(0.95,
  sum(rate(http_response_time_seconds_bucket[5m])) by (le, method, endpoint)
)

# Average response time
rate(http_response_time_seconds_sum[5m]) /
rate(http_response_time_seconds_count[5m])
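The payoff: a histogram with these five buckets emits a fixed set of series per (method, endpoint) pair, no matter how many distinct latency values are observed. Roughly (our own estimate, not measured from the original setup):

methods, endpoints = 5, 50
bucket_series = 5 + 1                 # explicit buckets plus the +Inf bucket
series_per_pair = bucket_series + 2   # _bucket series + _sum + _count
print(methods * endpoints * series_per_pair)  # 2,000 series, independent of traffic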

Solution 3: Relabel Configs

# prometheus.yml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    
    # Drop high cardinality labels
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'container_network_.*'
        action: drop
      
      # Remove pod_id (high cardinality)
      - regex: 'pod_id'
        action: labeldrop
      
      # Drop the pod label (high cardinality); aggregate by namespace at query time
      - regex: 'pod'
        action: labeldrop

Solution 4: Recording Rules

# rules.yml
groups:
  - name: aggregation
    interval: 1m
    rules:
      # Pre-aggregate high cardinality metrics
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)
      
      - record: job:http_response_time:p95
        expr: histogram_quantile(0.95, sum(rate(http_response_time_seconds_bucket[5m])) by (job, le))

Query Pre-Aggregated Data:

# Use recording rule (fast)
job:http_requests:rate5m{job="api"}

# Instead of raw metric (slow)
sum(rate(http_requests_total{job="api"}[5m])) by (method, status)

Solution 5: Metric Lifecycle

from prometheus_client import Counter, CollectorRegistry
import time

class MetricManager:
    def __init__(self):
        self.registry = CollectorRegistry()
        self.counters = {}    # metric name -> Counter
        self.last_used = {}   # (metric name, label values) -> last access time

    def get_counter(self, name, labels):
        """Get or create a counter child and refresh its TTL."""
        # Clean up label combinations that haven't been touched recently
        self._cleanup_old_metrics()

        # One Counter per metric name; registering the same name twice would raise
        if name not in self.counters:
            self.counters[name] = Counter(
                name,
                'Description',
                list(labels.keys()),
                registry=self.registry
            )

        key = (name, tuple(labels.values()))
        self.last_used[key] = time.time()
        return self.counters[name].labels(**labels)

    def _cleanup_old_metrics(self):
        """Drop label combinations not used in the last hour."""
        now = time.time()
        to_remove = [
            key for key, last_used in self.last_used.items()
            if now - last_used > 3600
        ]

        for name, label_values in to_remove:
            # Remove the stale child series; the parent Counter stays registered
            self.counters[name].remove(*label_values)
            del self.last_used[(name, label_values)]
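Usage looks the same as a plain Counter; the manager only tracks when each label combination was last touched. A hypothetical wiring example (the port and metric names are ours):

from prometheus_client import start_http_server

manager = MetricManager()
start_http_server(8000, registry=manager.registry)  # expose only managed metrics

manager.get_counter(
    'jobs_processed_total',
    {'queue': 'email', 'status': 'ok'}
).inc()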

Monitoring Cardinality

# Check number of time series per metric
count({__name__=~".+"}) by (__name__)

# Top 10 metrics by cardinality
topk(10, count by (__name__)({__name__=~".+"}))

# Total series currently held in the TSDB head block
prometheus_tsdb_head_series
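Prometheus also exposes per-metric cardinality through its TSDB status endpoint (/api/v1/status/tsdb). A hedged sketch of pulling the top offenders from it in Python (the URL and field names assume a reasonably recent Prometheus; adjust to your deployment):

import requests

resp = requests.get('http://localhost:9090/api/v1/status/tsdb', timeout=5)
stats = resp.json()['data']

# Highest-cardinality metrics as reported by the TSDB head
for entry in stats.get('seriesCountByMetricName', []):
    print(f"{entry['name']}: {entry['value']} series")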

Alert on High Cardinality:

groups:
  - name: cardinality
    rules:
      - alert: HighCardinality
        expr: count({__name__=~".+"}) > 100000
        for: 5m
        annotations:
          summary: "High cardinality detected"
          description: "{{ $value }} time series"

Results

Before:

  • Time series: 10M
  • Memory: 64GB
  • Query time: 30s
  • Scrape duration: 2min
  • Disk usage: 500GB

After:

  • Time series: 100K (-99%)
  • Memory: 4GB (-94%)
  • Query time: 0.5s (-98%)
  • Scrape duration: 5s (-96%)
  • Disk usage: 50GB (-90%)

Cost Savings:

  • Infrastructure: $2000/month → $200/month (-90%)
  • No more OOM crashes
  • Stable performance

Lessons Learned

  1. Cardinality kills: 10M series will crash a single server
  2. Labels matter: keep them low cardinality
  3. Histograms for values: don’t encode measurements as labels
  4. Recording rules help: pre-aggregate expensive queries
  5. Monitor cardinality: catch explosions before they become outages

Conclusion

We fixed our Prometheus high-cardinality problem: 10M → 100K series, 50x faster queries, 90% lower cost.

Key takeaways:

  1. Time series: 10M → 100K (-99%)
  2. Query time: 30s → 0.5s (-98%)
  3. Memory: 64GB → 4GB (-94%)
  4. Cost: $2000 → $200/month (-90%)
  5. Zero OOM crashes

Monitor your cardinality. It will save you.