Debugging issues across 15 microservices was a nightmare. SSH into each server, grep through logs, correlate timestamps manually. Finding a bug took hours.

I set up the ELK stack. Now all logs are in one place, searchable in seconds. The last production bug? Root cause found in 5 minutes.

The Problem

15 microservices, logs scattered everywhere:

  • SSH into each server
  • tail -f /var/log/app.log
  • Grep for errors
  • Correlate timestamps across services
  • No retention beyond 7 days (logs rotated out)

Debugging a request across services: 2-3 hours.

ELK Stack Components

  • Elasticsearch: Store and search logs
  • Logstash: Collect and parse logs
  • Kibana: Visualize and search logs

Installing Elasticsearch

Elasticsearch 5.6:

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.0.deb
sudo dpkg -i elasticsearch-5.6.0.deb

Configure /etc/elasticsearch/elasticsearch.yml:

cluster.name: production-logs
node.name: es-node-1
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["es-node-1", "es-node-2", "es-node-3"]

Start:

sudo systemctl start elasticsearch
sudo systemctl enable elasticsearch

Test:

curl http://localhost:9200
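
The same check can be scripted. A minimal sketch using Python and the requests library (an assumed dependency, not part of the stack itself) that confirms the node responds and the cluster reports green or yellow health:

# check_es.py - quick health check against the new node
import sys
import requests

ES_URL = "http://localhost:9200"  # default port from elasticsearch.yml

info = requests.get(ES_URL, timeout=5).json()
print("Cluster:", info["cluster_name"], "- version:", info["version"]["number"])

# /_cluster/health reports green, yellow, or red for the whole cluster
health = requests.get(ES_URL + "/_cluster/health", timeout=5).json()
print("Status:", health["status"])
sys.exit(0 if health["status"] in ("green", "yellow") else 1)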

Installing Logstash

Logstash 5.6:

wget https://artifacts.elastic.co/downloads/logstash/logstash-5.6.0.deb
sudo dpkg -i logstash-5.6.0.deb

Configure /etc/logstash/conf.d/logstash.conf:

input {
  beats {
    port => 5044
  }
}

filter {
  if [type] == "app-log" {
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}" }
      overwrite => [ "message" ]
    }
    
    date {
      match => [ "timestamp", "ISO8601" ]
    }
    
    if [level] == "ERROR" {
      mutate {
        add_tag => ["error"]
      }
    }
  }
  
  if [type] == "nginx-access" {
    grok {
      match => { "message" => '%{IPORHOST:clientip} - - \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:status} %{NUMBER:bytes}' }
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}

Start:

sudo systemctl start logstash
sudo systemctl enable logstash

Installing Kibana

Kibana 5.6:

wget https://artifacts.elastic.co/downloads/kibana/kibana-5.6.0-amd64.deb
sudo dpkg -i kibana-5.6.0-amd64.deb

Configure /etc/kibana/kibana.yml:

server.port: 5601
server.host: "0.0.0.0"
elasticsearch.url: "http://localhost:9200"

Start:

sudo systemctl start kibana
sudo systemctl enable kibana

Access Kibana at http://localhost:5601 (or the server's IP, since server.host is 0.0.0.0)

Shipping Logs with Filebeat

Install Filebeat on each server:

wget https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-5.6.0-amd64.deb
sudo dpkg -i filebeat-5.6.0-amd64.deb

Configure /etc/filebeat/filebeat.yml:

filebeat.prospectors:
- input_type: log
  enabled: true
  paths:
    - /var/log/app/*.log
  document_type: app-log
  fields:
    service: user-service
    environment: production
  fields_under_root: true

output.logstash:
  hosts: ["logstash.example.com:5044"]

Start:

sudo systemctl start filebeat
sudo systemctl enable filebeat
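
To confirm logs are flowing end to end (Filebeat -> Logstash -> Elasticsearch), a quick count against the current day's index helps. A rough sketch, reusing the index pattern from the Logstash output and the service field from filebeat.yml:

# verify_pipeline.py - rough end-to-end check of the shipping pipeline
import time
import requests

ES_URL = "http://localhost:9200"
index = "logs-" + time.strftime("%Y.%m.%d")  # matches index => "logs-%{+YYYY.MM.dd}"

# _count returns {"count": N} for documents matching the query
total = requests.get("%s/%s/_count" % (ES_URL, index), timeout=5).json()["count"]
per_service = requests.get(
    "%s/%s/_count" % (ES_URL, index),
    params={"q": 'service:"user-service"'},
    timeout=5,
).json()["count"]

print("Indexed today:", total, "- from user-service:", per_service)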

Structured Logging

Changed application logging format:

Before (unstructured):

logger.info("User john logged in from 192.168.1.1")

After (structured JSON):

logger.info("User logged in", extra={
    "event": "user_login",
    "user_id": "john",
    "ip_address": "192.168.1.1",
    "timestamp": datetime.utcnow().isoformat()
})

Output:

{
  "timestamp": "2017-09-12T13:30:00Z",
  "level": "INFO",
  "event": "user_login",
  "user_id": "john",
  "ip_address": "192.168.1.1",
  "service": "user-service"
}

Much easier to parse and search!
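
The JSON output above does not come from logger.info alone; it needs a formatter. A minimal sketch of one using only the standard logging module (field names match the example output; the real production setup may differ):

# json_logging.py - minimal JSON formatter for the structured format shown above
import json
import logging
from datetime import datetime

class JsonFormatter(logging.Formatter):
    # Attributes present on every LogRecord; anything else arrived via extra={...}
    _STANDARD = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message"}

    def format(self, record):
        doc = {
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "user-service",  # hypothetical: usually injected per service
        }
        # Merge the extra fields (event, user_id, ip_address, ...)
        doc.update({k: v for k, v in record.__dict__.items() if k not in self._STANDARD})
        return json.dumps(doc)

handler = logging.StreamHandler()  # production would write to a file under /var/log/app/
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User logged in", extra={"event": "user_login", "user_id": "john",
                                     "ip_address": "192.168.1.1"})

Libraries such as python-json-logger do the same job with less code; the point is that the formatter, not the log call, produces the JSON.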

Logstash Parsing Filters

Parse JSON logs:

filter {
  json {
    source => "message"
  }
  
  mutate {
    remove_field => ["message"]
  }
}

Parse custom format:

filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{DATA:service}\] %{LOGLEVEL:level} %{GREEDYDATA:msg}"
    }
  }
}

Kibana Discover

Search logs:

Find all errors:

level:ERROR

Find errors in user-service:

level:ERROR AND service:user-service

Find slow requests:

duration:>1000

Find user activity:

user_id:john AND event:user_login

Time range: Last 15 minutes, 1 hour, 24 hours, 7 days
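
These query strings are not Kibana-specific; they are Lucene syntax, so the same searches can be scripted against Elasticsearch directly. A sketch (requests assumed, index pattern from the Logstash output):

# search_logs.py - run a Kibana-style Lucene query from a script
import requests

ES_URL = "http://localhost:9200"

def search(query, size=10):
    # The q parameter takes the same Lucene syntax as the Kibana search bar
    resp = requests.get(
        ES_URL + "/logs-*/_search",
        params={"q": query, "size": size, "sort": "@timestamp:desc"},
        timeout=10,
    )
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]

for doc in search("level:ERROR AND service:user-service"):
    print(doc.get("@timestamp"), doc.get("service"), doc.get("message"))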

Kibana Visualizations

Created visualizations:

1. Error Rate Over Time

  • Type: Line chart
  • Y-axis: Count
  • X-axis: @timestamp
  • Filter: level:ERROR

2. Errors by Service

  • Type: Pie chart
  • Slice by: service.keyword
  • Filter: level:ERROR

3. Top Error Messages

  • Type: Data table
  • Rows: message.keyword
  • Metrics: Count
  • Filter: level:ERROR

4. Request Duration Heatmap

  • Type: Heatmap
  • Y-axis: duration buckets
  • X-axis: @timestamp
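
Each of these is an aggregation under the hood. The error-rate-over-time chart, for example, boils down to a date_histogram query like this sketch (field names assume Elasticsearch's default dynamic mapping, which adds the .keyword subfields):

# error_rate.py - roughly the aggregation behind "Error Rate Over Time"
import requests

query = {
    "size": 0,  # only the buckets are needed, not the matching documents
    "query": {"term": {"level.keyword": "ERROR"}},
    "aggs": {
        "errors_over_time": {
            "date_histogram": {"field": "@timestamp", "interval": "1h"}
        }
    },
}

resp = requests.post("http://localhost:9200/logs-*/_search", json=query, timeout=10)
for bucket in resp.json()["aggregations"]["errors_over_time"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])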

Kibana Dashboard

Combined visualizations into dashboard:

  • Error rate over time
  • Errors by service
  • Top 10 error messages
  • Request duration p95
  • Active users
  • Requests per second

Auto-refresh: 30 seconds

Alerting with ElastAlert

Installed ElastAlert:

pip install elastalert

Configure alert rule (error_spike.yaml):

name: Error Spike Alert
type: spike
index: logs-*

threshold_ref: 10
threshold_cur: 10
spike_height: 2
spike_type: up
timeframe:
  minutes: 5

filter:
- term:
    level: "ERROR"

alert:
- slack:
    slack_webhook_url: "https://hooks.slack.com/services/..."

Run ElastAlert:

elastalert --config config.yaml --rule error_spike.yaml

Tracing Requests Across Services

Added correlation ID:

# API Gateway
import uuid
import requests

correlation_id = str(uuid.uuid4())
logger.info("Request received", extra={
    "correlation_id": correlation_id,
    "path": request.path
})

# Pass to downstream services
headers = {"X-Correlation-ID": correlation_id}
response = requests.get("http://user-service/users", headers=headers)

# User Service
correlation_id = request.headers.get("X-Correlation-ID")
logger.info("Processing request", extra={
    "correlation_id": correlation_id,
    "action": "get_user"
})

Search in Kibana:

correlation_id:"abc-123-def"

See entire request flow across all services!
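
Setting the header and the extra= field by hand on every call gets tedious. One way to automate it: a logging filter plus a thin wrapper around requests. A sketch, assuming a threaded WSGI app (helper names are made up):

# correlation.py - hypothetical helpers to propagate X-Correlation-ID automatically
import logging
import threading
import uuid

import requests

_local = threading.local()  # one correlation ID per request-handling thread

def set_correlation_id(value=None):
    # Call when a request arrives: reuse the incoming header or mint a new ID
    _local.correlation_id = value or str(uuid.uuid4())
    return _local.correlation_id

class CorrelationIdFilter(logging.Filter):
    # Attaches the current correlation ID to every log record
    def filter(self, record):
        record.correlation_id = getattr(_local, "correlation_id", None)
        return True

def call_service(url, **kwargs):
    # Forwards the correlation ID to downstream services on every outgoing call
    headers = kwargs.pop("headers", {})
    headers["X-Correlation-ID"] = getattr(_local, "correlation_id", "")
    return requests.get(url, headers=headers, **kwargs)

Attach CorrelationIdFilter to the handler that writes the JSON logs (handler.addFilter(CorrelationIdFilter())); combined with a JSON formatter like the sketch above, every log line then carries correlation_id without repeating it in each extra= dict.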

Log Retention

Configured index retention with Curator. The delete_old_indices.yml action file removes indices older than 30 days:

# Delete indices older than 30 days
curator --config /etc/curator/curator.yml /etc/curator/delete_old_indices.yml

Cron job:

0 2 * * * curator --config /etc/curator/curator.yml /etc/curator/delete_old_indices.yml

Performance Tuning

Elasticsearch cluster:

  • 3 nodes
  • 32GB RAM each
  • SSD storage
  • Heap size: 16GB

Logstash:

  • Pipeline workers: 4
  • Batch size: 125

Filebeat:

  • Bulk max size: 2048
  • Worker: 2

Real-World Debugging

Scenario: Users reporting slow checkout

Before ELK (2-3 hours):

  1. SSH into API gateway
  2. Grep for checkout requests
  3. SSH into order service
  4. Grep for order creation
  5. SSH into payment service
  6. Find slow payment API call

With ELK (5 minutes):

  1. Search: path:"/checkout" AND duration:>1000
  2. Find correlation_id
  3. Search: correlation_id:"abc-123"
  4. See all logs across services
  5. Identify slow payment API call

Results

Before:

  • Logs scattered across 15 servers
  • Debugging takes 2-3 hours
  • No log retention
  • No correlation across services

After:

  • All logs centralized
  • Debugging takes 5-10 minutes
  • 30-day retention
  • Full request tracing

Lessons Learned

  1. Structured logging - JSON format is essential
  2. Correlation IDs - Track requests across services
  3. Index lifecycle - Manage disk space
  4. Alerting - Don’t just collect, alert on anomalies
  5. Performance - ELK needs resources (RAM, SSD)

Conclusion

The ELK stack transformed our debugging process. Centralized logging is essential for microservices.

Key takeaways:

  1. Elasticsearch for storage and search
  2. Logstash for collection and parsing
  3. Kibana for visualization
  4. Filebeat for log shipping
  5. Structured logging with correlation IDs

If you have more than 3 services, set up centralized logging. Your future self will thank you.