Our Prometheus server died. Lost all metrics. No historical data for incident analysis. Retention was only 15 days due to disk space.

I implemented Thanos. Now we have unlimited retention in S3, multiple Prometheus instances, and global queries across all clusters. Last server failure? Didn’t even notice.

The Problem

Single Prometheus instance:

  • Single point of failure
  • Limited retention (15 days)
  • No global view across clusters
  • Disk space constraints

We needed better.

Thanos Architecture

Components:

  • Sidecar: Uploads Prometheus TSDB blocks to object storage and serves recent data to the Querier
  • Store Gateway: Queries historical data from storage
  • Querier: Provides global query view
  • Compactor: Downsamples and compacts data
  • Ruler: Evaluates recording/alerting rules

Installing Thanos

Download:

wget https://github.com/thanos-io/thanos/releases/download/v0.8.1/thanos-0.8.1.linux-amd64.tar.gz
tar xvf thanos-0.8.1.linux-amd64.tar.gz
sudo mv thanos-0.8.1.linux-amd64/thanos /usr/local/bin/

Prometheus Configuration

Enable external labels:

prometheus.yml:

global:
  external_labels:
    cluster: us-east-1
    replica: A

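# Block durations are not prometheus.yml settings. Pass them to Prometheus as flags:
#   --storage.tsdb.min-block-duration=2h
#   --storage.tsdb.max-block-duration=2h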
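
Thanos relies on every Prometheus instance having a unique set of external labels, because those labels identify where each block came from. A minimal sketch of a pre-deployment check might look like this; the function name and the instance list are hypothetical and not part of Prometheus or Thanos.

```python
from collections import Counter


def find_duplicate_label_sets(instances):
    """Return external-label sets shared by more than one Prometheus instance."""
    counts = Counter(tuple(sorted(labels.items())) for labels in instances)
    return [dict(key) for key, n in counts.items() if n > 1]


# Hypothetical inventory: one dict of external_labels per Prometheus instance.
instances = [
    {"cluster": "us-east-1", "replica": "A"},
    {"cluster": "us-east-1", "replica": "B"},
    {"cluster": "us-west-1", "replica": "A"},
]

print(find_duplicate_label_sets(instances))  # []
```

An empty list means every instance can be told apart by its labels.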

Thanos Sidecar

Run alongside Prometheus:

thanos sidecar \
  --tsdb.path /var/lib/prometheus \
  --prometheus.url http://localhost:9090 \
  --objstore.config-file /etc/thanos/bucket.yml \
  --http-address 0.0.0.0:19191 \
  --grpc-address 0.0.0.0:19090

S3 configuration (bucket.yml):

type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.amazonaws.com
  access_key: YOUR_ACCESS_KEY
  secret_key: YOUR_SECRET_KEY

Thanos Query

Global query interface:

# Stores, in order: sidecar 1, sidecar 2, store gateway
thanos query \
  --http-address 0.0.0.0:9090 \
  --query.replica-label replica \
  --store 10.0.1.1:19090 \
  --store 10.0.1.2:19090 \
  --store 10.0.1.3:19091

Access at http://localhost:9090
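
Thanos Query serves the Prometheus HTTP API, with extra parameters such as dedup and partial_response. As a rough sketch, a client could build an instant-query URL like this; the helper name is hypothetical, the host is the one used above, and no request is actually sent.

```python
from urllib.parse import urlencode


def build_query_url(base_url, promql, dedup=True, partial_response=False):
    """Build an instant-query URL for the Thanos Query HTTP API."""
    params = {
        "query": promql,
        "dedup": "true" if dedup else "false",
        "partial_response": "true" if partial_response else "false",
    }
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


url = build_query_url("http://localhost:9090", 'up{job="api"}')
print(url)
```

Setting partial_response to false makes a query fail rather than silently return incomplete data when a store is unreachable.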

Thanos Store Gateway

Query historical data:

thanos store \
  --data-dir /var/lib/thanos/store \
  --objstore.config-file /etc/thanos/bucket.yml \
  --http-address 0.0.0.0:19192 \
  --grpc-address 0.0.0.0:19091

Thanos Compactor

Downsample and compact:

thanos compact \
  --data-dir /var/lib/thanos/compact \
  --objstore.config-file /etc/thanos/bucket.yml \
  --wait

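With --wait, the compactor keeps running and handles both compaction and downsampling:

  • Raw data: kept at scrape resolution
  • 5m downsampled: created for blocks spanning more than 40 hours
  • 1h downsampled: created for blocks spanning more than 10 days

Retention per resolution is set with --retention.resolution-raw, --retention.resolution-5m and --retention.resolution-1h. The default of 0d keeps data forever.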
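
As a rough sketch of when downsampled data appears, assume the thresholds from the Thanos compactor documentation: 5m data for blocks spanning at least 40 hours, and 1h data for blocks spanning at least 10 days. The function name is hypothetical.

```python
from datetime import timedelta

# Assumed thresholds, taken from the Thanos compactor documentation.
FIVE_MINUTE_THRESHOLD = timedelta(hours=40)
ONE_HOUR_THRESHOLD = timedelta(days=10)


def resolutions_for_block(span):
    """List the resolutions the compactor would hold for a block of this time span."""
    resolutions = ["raw"]
    if span >= FIVE_MINUTE_THRESHOLD:
        resolutions.append("5m")
    if span >= ONE_HOUR_THRESHOLD:
        resolutions.append("1h")
    return resolutions


print(resolutions_for_block(timedelta(hours=2)))  # ['raw']
print(resolutions_for_block(timedelta(days=2)))   # ['raw', '5m']
print(resolutions_for_block(timedelta(days=14)))  # ['raw', '5m', '1h']
```

So a fresh 2h block from the sidecar has only raw data, and downsampled copies appear once the compactor has merged blocks into larger ones.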

Kubernetes Deployment

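Prometheus with Thanos sidecar (each replica needs its own replica external label, for example derived from the pod name, or deduplication cannot tell the replicas apart):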

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  serviceName: prometheus
  replicas: 2
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:v2.14.0
        args:
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        - --storage.tsdb.min-block-duration=2h
        - --storage.tsdb.max-block-duration=2h
        - --web.enable-lifecycle
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
        - name: data
          mountPath: /prometheus
      
      - name: thanos-sidecar
        image: thanosio/thanos:v0.8.1
        args:
        - sidecar
        - --tsdb.path=/prometheus
        - --prometheus.url=http://localhost:9090
        - --objstore.config-file=/etc/thanos/bucket.yml
        - --grpc-address=0.0.0.0:10901
        volumeMounts:
        - name: data
          mountPath: /prometheus
        - name: thanos-config
          mountPath: /etc/thanos
      
      volumes:
      - name: config
        configMap:
          name: prometheus-config
      - name: thanos-config
        secret:
          secretName: thanos-objstore
  
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi

Thanos Query:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
      - name: thanos-query
        image: thanosio/thanos:v0.8.1
        args:
        - query
        - --http-address=0.0.0.0:9090
        - --query.replica-label=replica
        - --store=prometheus-0.prometheus:10901
        - --store=prometheus-1.prometheus:10901
        - --store=thanos-store:10901
        ports:
        - containerPort: 9090
---
apiVersion: v1
kind: Service
metadata:
  name: thanos-query
spec:
  selector:
    app: thanos-query
  ports:
  - port: 9090
    targetPort: 9090

Deduplication

With --query.replica-label set to replica, Thanos Query merges series that differ only in that label:

# Query returns deduplicated results
up{job="api"}

Even with multiple Prometheus replicas!
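
To illustrate the idea, here is a deliberately simplified sketch of deduplication: drop the replica label, group series by the remaining labels, and keep one sample per timestamp. Thanos itself uses a more involved algorithm that tolerates unaligned scrape timestamps, so treat this only as a mental model. All names are hypothetical.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that differ only by the replica label (simplified model)."""
    merged = {}
    for labels, samples in series_list:
        # Series identity is every label except the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        points = merged.setdefault(key, {})
        for timestamp, value in samples:
            # The first replica to report a timestamp wins.
            points.setdefault(timestamp, value)
    return [(dict(key), sorted(points.items())) for key, points in merged.items()]


series = [
    ({"__name__": "up", "job": "api", "replica": "A"}, [(0, 1.0), (15, 1.0)]),
    ({"__name__": "up", "job": "api", "replica": "B"}, [(15, 1.0), (30, 1.0)]),
]

result = deduplicate(series)
print(result)
# [({'__name__': 'up', 'job': 'api'}, [(0, 1.0), (15, 1.0), (30, 1.0)])]
```

Replica B fills the gap at timestamp 30 that replica A missed, which is why losing a single Prometheus instance leaves no hole in the graph.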

Global Queries

Query across all clusters:

# All clusters
sum(rate(http_requests_total[5m])) by (cluster)

# Specific cluster
sum(rate(http_requests_total{cluster="us-east-1"}[5m]))

# Cross-cluster aggregation
sum(rate(http_requests_total[5m]))
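
The difference between the first and last query is only the grouping label. A small sketch of what sum ... by (cluster) does with per-series rates, using made-up sample values and a hypothetical helper:

```python
from collections import defaultdict


def sum_by(samples, label):
    """Sum sample values grouped by one label, like PromQL's sum by (label)."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical per-series request rates from two clusters.
samples = [
    ({"cluster": "us-east-1", "instance": "api-1"}, 120.0),
    ({"cluster": "us-east-1", "instance": "api-2"}, 80.0),
    ({"cluster": "us-west-1", "instance": "api-1"}, 50.0),
]

print(sum_by(samples, "cluster"))  # {'us-east-1': 200.0, 'us-west-1': 50.0}
print(sum(value for _, value in samples))  # 250.0
```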

Downsampling

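Downsampled blocks make long-range queries faster. They are stored in addition to the raw blocks, so the storage savings come from expiring raw data sooner. Our retention per resolution: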

  • Raw (5s): 2 weeks
  • 5m resolution: 6 weeks
  • 1h resolution: forever

With --query.auto-downsampling enabled on Thanos Query, each query picks a resolution based on its step.
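
A simplified sketch of that choice, assuming the step / 5 heuristic that the Thanos documentation describes for --query.auto-downsampling. The function name is hypothetical, and the real result also depends on which blocks exist for the queried time range.

```python
# Available resolutions in seconds; raw data is treated as resolution 0.
RESOLUTIONS = [("raw", 0), ("5m", 300), ("1h", 3600)]


def pick_resolution(step_seconds):
    """Pick the coarsest resolution no larger than step / 5."""
    max_source_resolution = step_seconds / 5
    chosen = "raw"
    for name, resolution in RESOLUTIONS:
        if resolution <= max_source_resolution:
            chosen = name
    return chosen


print(pick_resolution(60))     # raw
print(pick_resolution(3600))   # 5m
print(pick_resolution(21600))  # 1h
```

A dashboard zoomed out to months uses a large step and reads 1h data, while a one-hour panel with a 15s step still reads raw samples.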

Monitoring Thanos

Thanos exposes metrics:

# Sidecar upload rate
rate(thanos_objstore_bucket_operations_total{operation="upload"}[5m])

# Query latency (p99)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{handler="query"}[5m])))

# Store Gateway index cache hit ratio
sum(rate(thanos_store_index_cache_hits_total[5m])) / sum(rate(thanos_store_index_cache_requests_total[5m]))
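
Both cache metrics are counters, so a ratio is most useful when computed over a window rather than over the lifetime totals. A small worked example with made-up counter values and a hypothetical helper:

```python
def windowed_hit_ratio(hits_start, hits_end, requests_start, requests_end):
    """Cache hit ratio over a window, from counter values at its start and end."""
    delta_requests = requests_end - requests_start
    if delta_requests <= 0:
        return None  # no traffic in the window, or a counter reset
    return (hits_end - hits_start) / delta_requests


# Lifetime totals look healthy (9050 / 10100 is roughly 0.90) ...
print(windowed_hit_ratio(9000, 9050, 10000, 10100))  # 0.5
# ... but only half of the most recent 100 requests hit the cache.
```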

Backup and Restore

S3 versioning:

aws s3api put-bucket-versioning \
  --bucket thanos-metrics \
  --versioning-configuration Status=Enabled

Lifecycle policy:

{
  "Rules": [
    {
      "ID": "DeleteOldVersions",
      "Filter": {"Prefix": ""},
      "Status": "Enabled",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 90
      }
    }
  ]
}

Cost Optimization

S3 storage classes:

type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.amazonaws.com
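  # Thanos has no storage_class key. The S3 storage class is set through object
  # metadata, which needs a Thanos release that supports put_user_metadata.
  put_user_metadata:
    X-Amz-Storage-Class: INTELLIGENT_TIERING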

Compaction reduces storage:

  • Before: 10TB/month
  • After: 2TB/month (80% reduction)
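
The arithmetic behind that figure, as a small sketch. The per-GB price is a placeholder argument, not a quoted AWS rate, so plug in your own number.

```python
def storage_reduction_percent(before_tb, after_tb):
    """Percentage by which monthly storage shrank."""
    return (before_tb - after_tb) * 100 / before_tb


def monthly_savings(before_tb, after_tb, price_per_gb_month):
    """Monthly savings for a given (hypothetical) price per GB-month."""
    return (before_tb - after_tb) * 1024 * price_per_gb_month


print(storage_reduction_percent(10, 2))  # 80.0
print(monthly_savings(10, 2, price_per_gb_month=0.02))  # placeholder price
```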

High Availability Setup

Multiple Prometheus instances:

Cluster US-East-1:
  - Prometheus A (replica: A)
  - Prometheus B (replica: B)
  - Thanos Sidecar A
  - Thanos Sidecar B

Cluster US-West-1:
  - Prometheus A (replica: A)
  - Prometheus B (replica: B)
  - Thanos Sidecar A
  - Thanos Sidecar B

Global:
  - Thanos Query (3 replicas)
  - Thanos Store Gateway (3 replicas)
  - Thanos Compactor (1 instance)

Grafana Integration

Configure Thanos as datasource:

apiVersion: 1
datasources:
- name: Thanos
  type: prometheus
  access: proxy
  url: http://thanos-query:9090
  isDefault: true

Results

Before:

  • Single Prometheus
  • 15-day retention
  • No HA
  • Lost data on failure

After:

  • Multiple Prometheus instances
  • Unlimited retention
  • Automatic failover
  • Global query view
  • 80% storage cost reduction

Lessons Learned

  1. Start with Thanos early - Migration is harder later
  2. Use downsampling - Saves massive storage
  3. Monitor Thanos itself - It’s critical infrastructure
  4. Test failover - Regularly
  5. S3 lifecycle policies - Control costs

Conclusion

Thanos transforms Prometheus into a highly available, globally queryable monitoring system with unlimited retention.

Key takeaways:

  1. Sidecar uploads to object storage
  2. Query provides global view
  3. Automatic deduplication
  4. Downsampling saves storage
  5. Survives datacenter failures

Don’t lose your metrics. Implement Thanos for Prometheus HA.