Debugging issues across 15 microservices was a nightmare. SSH into each server, grep through logs, correlate timestamps manually. Finding a bug took hours.

I set up the ELK stack. Now all logs are in one place, searchable in seconds. The last production bug? Root cause found in 5 minutes.

The Problem

15 microservices, logs scattered everywhere:

  • SSH into each server
  • tail -f /var/log/app.log
  • Grep for errors
  • Correlate timestamps across services
  • No retention beyond 7 days (logs rotated out)

Debugging a request across services: 2-3 hours.

ELK Stack Components

  • Elasticsearch: Store and search logs
  • Logstash: Collect and parse logs
  • Kibana: Visualize and search logs

Installing Elasticsearch

Elasticsearch 5.6:

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.0.deb
sudo dpkg -i elasticsearch-5.6.0.deb

Configure /etc/elasticsearch/elasticsearch.yml:

cluster.name: production-logs
node.name: es-node-1
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["es-node-1", "es-node-2", "es-node-3"]

Start:

sudo systemctl start elasticsearch
sudo systemctl enable elasticsearch

Test:

curl http://localhost:9200
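
The same check can be scripted. A minimal sketch using Python and the requests library (an assumed dependency, not part of the stack itself) that confirms the node responds and the cluster reports green or yellow health:

# check_es.py - quick health check against the new node
import sys
import requests

ES_URL = "http://localhost:9200"  # default port from elasticsearch.yml

info = requests.get(ES_URL, timeout=5).json()
print("Cluster:", info["cluster_name"], "- version:", info["version"]["number"])

# /_cluster/health reports green, yellow, or red for the whole cluster
health = requests.get(ES_URL + "/_cluster/health", timeout=5).json()
print("Status:", health["status"])
sys.exit(0 if health["status"] in ("green", "yellow") else 1)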

Installing Logstash

Logstash 5.6:

wget https://artifacts.elastic.co/downloads/logstash/logstash-5.6.0.deb
sudo dpkg -i logstash-5.6.0.deb

Configure /etc/logstash/conf.d/logstash.conf:

input {
  beats {
    port => 5044
  }
}

filter {
  if [type] == "app-log" {
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}" }
      overwrite => [ "message" ]
    }
    
    date {
      match => [ "timestamp", "ISO8601" ]
    }
    
    if [level] == "ERROR" {
      mutate {
        add_tag => ["error"]
      }
    }
  }
  
  if [type] == "nginx-access" {
    grok {
      match => { "message" => '%{IPORHOST:clientip} - - \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:status} %{NUMBER:bytes}' }
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}

Start:

sudo systemctl start logstash
sudo systemctl enable logstash

Installing Kibana

Kibana 5.6:

wget https://artifacts.elastic.co/downloads/kibana/kibana-5.6.0-amd64.deb
sudo dpkg -i kibana-5.6.0-amd64.deb

Configure /etc/kibana/kibana.yml:

server.port: 5601
server.host: "0.0.0.0"
elasticsearch.url: "http://localhost:9200"

Start:

sudo systemctl start kibana
sudo systemctl enable kibana

Access Kibana at http://localhost:5601 (or the server's IP, since server.host is 0.0.0.0)

Shipping Logs with Filebeat

Install Filebeat on each server:

wget https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-5.6.0-amd64.deb
sudo dpkg -i filebeat-5.6.0-amd64.deb

Configure /etc/filebeat/filebeat.yml:

filebeat.prospectors:
- input_type: log
  enabled: true
  paths:
    - /var/log/app/*.log
  document_type: app-log
  fields:
    service: user-service
    environment: production
  fields_under_root: true

output.logstash:
  hosts: ["logstash.example.com:5044"]

Start:

sudo systemctl start filebeat
sudo systemctl enable filebeat
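
To confirm logs are flowing end to end (Filebeat -> Logstash -> Elasticsearch), a quick count against the current day's index helps. A rough sketch, reusing the index pattern from the Logstash output and the service field from filebeat.yml:

# verify_pipeline.py - rough end-to-end check of the shipping pipeline
import time
import requests

ES_URL = "http://localhost:9200"
index = "logs-" + time.strftime("%Y.%m.%d")  # matches index => "logs-%{+YYYY.MM.dd}"

# _count returns {"count": N} for documents matching the query
total = requests.get("%s/%s/_count" % (ES_URL, index), timeout=5).json()["count"]
per_service = requests.get(
    "%s/%s/_count" % (ES_URL, index),
    params={"q": 'service:"user-service"'},
    timeout=5,
).json()["count"]

print("Indexed today:", total, "- from user-service:", per_service)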

Structured Logging

Changed application logging format:

Before (unstructured):

logger.info("User john logged in from 192.168.1.1")

After (structured JSON):

logger.info("User logged in", extra={
    "event": "user_login",
    "user_id": "john",
    "ip_address": "192.168.1.1",
    "timestamp": datetime.utcnow().isoformat()
})

Output:

{
  "timestamp": "2017-09-12T13:30:00Z",
  "level": "INFO",
  "event": "user_login",
  "user_id": "john",
  "ip_address": "192.168.1.1",
  "service": "user-service"
}

Much easier to parse and search!
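
The JSON output above does not come from logger.info alone; it needs a formatter. A minimal sketch of one using only the standard logging module (field names match the example output; the real production setup may differ):

# json_logging.py - minimal JSON formatter for the structured format shown above
import json
import logging
from datetime import datetime

class JsonFormatter(logging.Formatter):
    # Attributes present on every LogRecord; anything else arrived via extra={...}
    _STANDARD = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message"}

    def format(self, record):
        doc = {
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "user-service",  # hypothetical: usually injected per service
        }
        # Merge the extra fields (event, user_id, ip_address, ...)
        doc.update({k: v for k, v in record.__dict__.items() if k not in self._STANDARD})
        return json.dumps(doc)

handler = logging.StreamHandler()  # production would write to a file under /var/log/app/
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User logged in", extra={"event": "user_login", "user_id": "john",
                                     "ip_address": "192.168.1.1"})

Libraries such as python-json-logger do the same job with less code; the point is that the formatter, not the log call, produces the JSON.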

Logstash Parsing Filters

Parse JSON logs:

filter {
  json {
    source => "message"
  }
  
  mutate {
    remove_field => ["message"]
  }
}

Parse custom format:

filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{DATA:service}\] %{LOGLEVEL:level} %{GREEDYDATA:msg}"
    }
  }
}

Kibana Discover

Search logs:

Find all errors:

level:ERROR

Find errors in user-service:

level:ERROR AND service:user-service

Find slow requests:

duration:>1000

Find user activity:

user_id:john AND event:user_login

Time range: Last 15 minutes, 1 hour, 24 hours, 7 days
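
These query strings are not Kibana-specific; they are Lucene syntax, so the same searches can be scripted against Elasticsearch directly. A sketch (requests assumed, index pattern from the Logstash output):

# search_logs.py - run a Kibana-style Lucene query from a script
import requests

ES_URL = "http://localhost:9200"

def search(query, size=10):
    # The q parameter takes the same Lucene syntax as the Kibana search bar
    resp = requests.get(
        ES_URL + "/logs-*/_search",
        params={"q": query, "size": size, "sort": "@timestamp:desc"},
        timeout=10,
    )
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]

for doc in search("level:ERROR AND service:user-service"):
    print(doc.get("@timestamp"), doc.get("service"), doc.get("message"))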

Kibana Visualizations

Created visualizations:

1. Error Rate Over Time

  • Type: Line chart
  • Y-axis: Count
  • X-axis: @timestamp
  • Filter: level:ERROR

2. Errors by Service

  • Type: Pie chart
  • Slice by: service.keyword
  • Filter: level:ERROR

3. Top Error Messages

  • Type: Data table
  • Rows: message.keyword
  • Metrics: Count
  • Filter: level:ERROR

4. Request Duration Heatmap

  • Type: Heatmap
  • Y-axis: duration buckets
  • X-axis: @timestamp
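
Each of these is an aggregation under the hood. The error-rate-over-time chart, for example, boils down to a date_histogram query like this sketch (field names assume Elasticsearch's default dynamic mapping, which adds the .keyword subfields):

# error_rate.py - roughly the aggregation behind "Error Rate Over Time"
import requests

query = {
    "size": 0,  # only the buckets are needed, not the matching documents
    "query": {"term": {"level.keyword": "ERROR"}},
    "aggs": {
        "errors_over_time": {
            "date_histogram": {"field": "@timestamp", "interval": "1h"}
        }
    },
}

resp = requests.post("http://localhost:9200/logs-*/_search", json=query, timeout=10)
for bucket in resp.json()["aggregations"]["errors_over_time"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])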

Kibana Dashboard

Combined visualizations into dashboard:

  • Error rate over time
  • Errors by service
  • Top 10 error messages
  • Request duration p95
  • Active users
  • Requests per second

Auto-refresh: 30 seconds

Alerting with ElastAlert

Installed ElastAlert:

pip install elastalert

Configure alert rule (error_spike.yaml):

name: Error Spike Alert
type: spike
index: logs-*

threshold_ref: 10
threshold_cur: 10
spike_height: 2
spike_type: up
timeframe:
  minutes: 5

filter:
- term:
    level: "ERROR"

alert:
- slack:
    slack_webhook_url: "https://hooks.slack.com/services/..."

Run ElastAlert:

elastalert --config config.yaml --rule error_spike.yaml

Tracing Requests Across Services

Added correlation ID:

# API Gateway
import uuid
import requests

correlation_id = str(uuid.uuid4())
logger.info("Request received", extra={
    "correlation_id": correlation_id,
    "path": request.path
})

# Pass to downstream services
headers = {"X-Correlation-ID": correlation_id}
response = requests.get("http://user-service/users", headers=headers)

# User Service
correlation_id = request.headers.get("X-Correlation-ID")
logger.info("Processing request", extra={
    "correlation_id": correlation_id,
    "action": "get_user"
})

Search in Kibana:

correlation_id:"abc-123-def"

See entire request flow across all services!
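
Setting the header and the extra= field by hand on every call gets tedious. One way to automate it: a logging filter plus a thin wrapper around requests. A sketch, assuming a threaded WSGI app (helper names are made up):

# correlation.py - hypothetical helpers to propagate X-Correlation-ID automatically
import logging
import threading
import uuid

import requests

_local = threading.local()  # one correlation ID per request-handling thread

def set_correlation_id(value=None):
    # Call when a request arrives: reuse the incoming header or mint a new ID
    _local.correlation_id = value or str(uuid.uuid4())
    return _local.correlation_id

class CorrelationIdFilter(logging.Filter):
    # Attaches the current correlation ID to every log record
    def filter(self, record):
        record.correlation_id = getattr(_local, "correlation_id", None)
        return True

def call_service(url, **kwargs):
    # Forwards the correlation ID to downstream services on every outgoing call
    headers = kwargs.pop("headers", {})
    headers["X-Correlation-ID"] = getattr(_local, "correlation_id", "")
    return requests.get(url, headers=headers, **kwargs)

Attach CorrelationIdFilter to the handler that writes the JSON logs (handler.addFilter(CorrelationIdFilter())); combined with a JSON formatter like the sketch above, every log line then carries correlation_id without repeating it in each extra= dict.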

Log Retention

Configured index retention with Curator. The delete_old_indices.yml action file removes indices older than 30 days:

# Delete indices older than 30 days
curator --config /etc/curator/curator.yml /etc/curator/delete_old_indices.yml

Cron job:

0 2 * * * curator --config /etc/curator/curator.yml /etc/curator/delete_old_indices.yml

Performance Tuning

Elasticsearch cluster:

  • 3 nodes
  • 32GB RAM each
  • SSD storage
  • Heap size: 16GB

Logstash:

  • Pipeline workers: 4
  • Batch size: 125

Filebeat:

  • Bulk max size: 2048
  • Worker: 2

Real-World Debugging

Scenario: Users reporting slow checkout

Before ELK (2-3 hours):

  1. SSH into API gateway
  2. Grep for checkout requests
  3. SSH into order service
  4. Grep for order creation
  5. SSH into payment service
  6. Find slow payment API call

With ELK (5 minutes):

  1. Search: path:"/checkout" AND duration:>1000
  2. Find correlation_id
  3. Search: correlation_id:"abc-123"
  4. See all logs across services
  5. Identify slow payment API call

Results

Before:

  • Logs scattered across 15 servers
  • Debugging takes 2-3 hours
  • No log retention
  • No correlation across services

After:

  • All logs centralized
  • Debugging takes 5-10 minutes
  • 30-day retention
  • Full request tracing

Lessons Learned

  1. Structured logging - JSON format is essential
  2. Correlation IDs - Track requests across services
  3. Index lifecycle - Manage disk space
  4. Alerting - Don’t just collect, alert on anomalies
  5. Performance - ELK needs resources (RAM, SSD)

Conclusion

The ELK stack transformed our debugging process. Centralized logging is essential for microservices.

Key takeaways:

  1. Elasticsearch for storage and search
  2. Logstash for collection and parsing
  3. Kibana for visualization
  4. Filebeat for log shipping
  5. Structured logging with correlation IDs

If you have more than 3 services, set up centralized logging. Your future self will thank you.