ELK Stack: Centralized Logging for Microservices
Debugging issues across 15 microservices was a nightmare. SSH into each server, grep through logs, correlate timestamps manually. Finding a bug took hours.
I set up the ELK stack. Now all logs are in one place, searchable in seconds. Last production bug? Found the root cause in 5 minutes.
The Problem
15 microservices, logs scattered everywhere:
- SSH into each server
- tail -f /var/log/app.log
- Grep for errors
- Correlate timestamps across services
- No retention (logs rotated after 7 days)
Debugging a request across services: 2-3 hours.
ELK Stack Components
- Elasticsearch: Store and search logs
- Logstash: Collect and parse logs
- Kibana: Visualize and search logs
Installing Elasticsearch
Elasticsearch 5.6:
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.0.deb
sudo dpkg -i elasticsearch-5.6.0.deb
Configure /etc/elasticsearch/elasticsearch.yml:
cluster.name: production-logs
node.name: es-node-1
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["es-node-1", "es-node-2", "es-node-3"]
Start:
sudo systemctl start elasticsearch
sudo systemctl enable elasticsearch
Test:
curl http://localhost:9200
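The same check scripted in Python, for use in a health-check job (a minimal sketch, assuming the requests library and the localhost:9200 address above):
# Quick Elasticsearch sanity check over the HTTP API
import requests

info = requests.get("http://localhost:9200").json()
print(info["version"]["number"])  # should print 5.6.0

health = requests.get("http://localhost:9200/_cluster/health").json()
print(health["status"], health["number_of_nodes"])  # aim for "green" with 3 nodes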
Installing Logstash
Logstash 5.6:
wget https://artifacts.elastic.co/downloads/logstash/logstash-5.6.0.deb
sudo dpkg -i logstash-5.6.0.deb
Configure /etc/logstash/conf.d/logstash.conf:
input {
  beats {
    port => 5044
  }
}

filter {
  if [type] == "app-log" {
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}" }
      overwrite => ["message"]  # keep the parsed message instead of an array of old + new
    }
    date {
      match => [ "timestamp", "ISO8601" ]
    }
    if [level] == "ERROR" {
      mutate {
        add_tag => ["error"]
      }
    }
  }

  if [type] == "nginx-access" {
    grok {
      match => { "message" => '%{IPORHOST:clientip} - - \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:status} %{NUMBER:bytes}' }
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
Start:
sudo systemctl start logstash
sudo systemctl enable logstash
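For reference, the app-log grok pattern above expects plain-text lines of the form timestamp level message. A minimal sketch of a Python logger that produces that shape (the file path and logger name are just illustrative):
# Emits lines like: 2017-09-12T13:30:00 INFO User logged in
import logging

logging.basicConfig(
    filename="/var/log/app/user-service.log",   # the path Filebeat ships from (see below)
    format="%(asctime)s %(levelname)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",                 # ISO8601, so TIMESTAMP_ISO8601 and the date filter both match
    level=logging.INFO,
)

logger = logging.getLogger("user-service")
logger.info("User logged in")    # parsed into timestamp, level, message
logger.error("Payment failed")   # also picks up the "error" tag from the filter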
Installing Kibana
Kibana 5.6:
wget https://artifacts.elastic.co/downloads/kibana/kibana-5.6.0-amd64.deb
sudo dpkg -i kibana-5.6.0-amd64.deb
Configure /etc/kibana/kibana.yml:
server.port: 5601
server.host: "0.0.0.0"
elasticsearch.url: "http://localhost:9200"
Start:
sudo systemctl start kibana
sudo systemctl enable kibana
Access at http://localhost:5601
Shipping Logs with Filebeat
Install Filebeat on each server (set fields.service to whichever service runs on that box):
wget https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-5.6.0-amd64.deb
sudo dpkg -i filebeat-5.6.0-amd64.deb
Configure /etc/filebeat/filebeat.yml:
filebeat.prospectors:
- input_type: log
  enabled: true
  paths:
    - /var/log/app/*.log
  document_type: app-log   # sets [type], which the Logstash filter matches on
  fields:
    service: user-service
    environment: production
  fields_under_root: true

output.logstash:
  hosts: ["logstash.example.com:5044"]
Start:
sudo systemctl start filebeat
sudo systemctl enable filebeat
Structured Logging
Changed application logging format:
Before (unstructured):
logger.info("User john logged in from 192.168.1.1")
After (structured JSON):
logger.info("User logged in", extra={
"event": "user_login",
"user_id": "john",
"ip_address": "192.168.1.1",
"timestamp": datetime.utcnow().isoformat()
})
Output:
{
"timestamp": "2017-09-12T13:30:00Z",
"level": "INFO",
"event": "user_login",
"user_id": "john",
"ip_address": "192.168.1.1",
"service": "user-service"
}
Much easier to parse and search!
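This needs a JSON formatter on the logger. A minimal sketch using only the stdlib (the hard-coded service name is an assumption for illustration; in a real service it would come from config):
# Minimal JSON formatter: merges anything passed via extra={...} into the output
import json
import logging
from datetime import datetime

class JsonFormatter(logging.Formatter):
    # Attributes every LogRecord has by default; anything else came from extra={...}
    RESERVED = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__)

    def format(self, record):
        log = {
            "timestamp": datetime.utcfromtimestamp(record.created).isoformat() + "Z",
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "user-service",  # illustrative; read from config/env in practice
        }
        log.update({k: v for k, v in record.__dict__.items() if k not in self.RESERVED})
        return json.dumps(log)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User logged in", extra={"event": "user_login", "user_id": "john"})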
Logstash Grok Patterns
Parse JSON logs:
filter {
  json {
    source => "message"
  }
  mutate {
    remove_field => ["message"]
  }
}
Parse custom format (e.g. a line like 2017-09-12T13:30:00 [user-service] INFO Payment failed):
filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{DATA:service}\] %{LOGLEVEL:level} %{GREEDYDATA:msg}"
    }
  }
}
Kibana Discover
Search logs:
Find all errors:
level:ERROR
Find errors in user-service:
level:ERROR AND service:user-service
Find slow requests:
duration:>1000
Find user activity:
user_id:john AND event:user_login
Time range: Last 15 minutes, 1 hour, 24 hours, 7 days
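The same Lucene queries also work directly against Elasticsearch, which is handy for scripting checks. A rough sketch using the query_string query (index pattern and fields as set up above):
# Run a Kibana-style Lucene query straight against Elasticsearch
import json
import requests

query = {
    "query": {"query_string": {"query": "level:ERROR AND service:user-service"}},
    "sort": [{"@timestamp": {"order": "desc"}}],
    "size": 20,
}
resp = requests.post(
    "http://localhost:9200/logs-*/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
).json()

for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    print(src.get("@timestamp"), src.get("service"), src.get("message"))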
Kibana Visualizations
Created visualizations:
1. Error Rate Over Time
- Type: Line chart
- Y-axis: Count
- X-axis: @timestamp
- Filter: level:ERROR
2. Errors by Service
- Type: Pie chart
- Slice by: service.keyword
- Filter: level:ERROR
3. Top Error Messages
- Type: Data table
- Rows: message.keyword
- Metrics: Count
- Filter: level:ERROR
4. Request Duration Heatmap
- Type: Heatmap
- Y-axis: duration buckets
- X-axis: @timestamp
Kibana Dashboard
Combined visualizations into dashboard:
- Error rate over time
- Errors by service
- Top 10 error messages
- Request duration p95 (equivalent query sketched below)
- Active users
- Requests per second
Auto-refresh: 30 seconds
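The p95 panel is a percentiles aggregation under the hood. A rough equivalent in Python, assuming duration is indexed as a number of milliseconds:
# p95 request duration over the last hour via a percentiles aggregation
import json
import requests

query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
    "aggs": {"duration_p95": {"percentiles": {"field": "duration", "percents": [95]}}},
}
resp = requests.post(
    "http://localhost:9200/logs-*/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
).json()

print(resp["aggregations"]["duration_p95"]["values"]["95.0"])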
Alerting with ElastAlert
Installed ElastAlert:
pip install elastalert
Configure alert rule (error_spike.yaml):
name: Error Spike Alert
type: spike
index: logs-*
threshold_ref: 10
threshold_cur: 10
spike_height: 2
spike_type: up
timeframe:
  minutes: 5
filter:
- term:
    level: "ERROR"
alert:
- "slack"
slack_webhook_url: "https://hooks.slack.com/services/..."
Run ElastAlert:
elastalert --config config.yaml --rule error_spike.yaml
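For intuition, the spike rule boils down to comparing ERROR counts in the current five minutes against the previous five. A rough sketch of that check in Python (illustration only; ElastAlert does this for you, and the webhook URL is a placeholder):
# Count errors in a time window and alert on a 2x jump (mirrors spike_height: 2)
import json
import requests

ES_COUNT = "http://localhost:9200/logs-*/_count"
SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder

def error_count(gte, lt):
    body = {"query": {"bool": {"filter": [
        {"match": {"level": "ERROR"}},
        {"range": {"@timestamp": {"gte": gte, "lt": lt}}},
    ]}}}
    resp = requests.post(ES_COUNT, headers={"Content-Type": "application/json"}, data=json.dumps(body))
    return resp.json()["count"]

current = error_count("now-5m", "now")
reference = error_count("now-10m", "now-5m")

# threshold_cur / threshold_ref of 10, spike_height of 2
if current >= 10 and reference >= 10 and current >= 2 * reference:
    text = "Error spike: {} errors in the last 5m (previous 5m: {})".format(current, reference)
    requests.post(SLACK_WEBHOOK, json={"text": text})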
Tracing Requests Across Services
Added correlation ID:
# API Gateway
correlation_id = str(uuid.uuid4())
logger.info("Request received", extra={
    "correlation_id": correlation_id,
    "path": request.path
})

# Pass to downstream services
headers = {"X-Correlation-ID": correlation_id}
response = requests.get("http://user-service/users", headers=headers)

# User Service
correlation_id = request.headers.get("X-Correlation-ID")
logger.info("Processing request", extra={
    "correlation_id": correlation_id,
    "action": "get_user"
})
Search in Kibana:
correlation_id:"abc-123-def"
See entire request flow across all services!
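To avoid passing correlation_id by hand in every extra={...}, a logging filter can inject it automatically from per-request state. A rough sketch (names here are illustrative, not from the services above):
# Stash the correlation ID once per request and add it to every log record
import logging
import threading
import uuid

_request_ctx = threading.local()

class CorrelationIdFilter(logging.Filter):
    def filter(self, record):
        record.correlation_id = getattr(_request_ctx, "correlation_id", "-")
        return True

def start_request(incoming_header=None):
    # Reuse the upstream X-Correlation-ID if one was sent, otherwise mint a new one
    _request_ctx.correlation_id = incoming_header or str(uuid.uuid4())
    return _request_ctx.correlation_id

handler = logging.StreamHandler()
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(correlation_id)s %(message)s"))

logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

cid = start_request()                           # at the top of a request handler
logger.info("Processing request")               # the line now carries the correlation ID
downstream_headers = {"X-Correlation-ID": cid}  # forward it on outgoing calls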
Log Retention
Configured index cleanup with Elasticsearch Curator. The action file /etc/curator/delete_old_indices.yml defines a delete_indices action that removes logs-* indices older than 30 days:
curator --config /etc/curator/curator.yml /etc/curator/delete_old_indices.yml
Cron job (runs nightly at 02:00):
0 2 * * * curator --config /etc/curator/curator.yml /etc/curator/delete_old_indices.yml
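If you would rather not run Curator, the same cleanup is a short script against the Elasticsearch REST API. A rough alternative sketch, assuming the daily logs-YYYY.MM.dd naming used above:
# Delete logs-* indices older than 30 days via the REST API
from datetime import datetime, timedelta
import requests

ES = "http://localhost:9200"
RETENTION_DAYS = 30
cutoff = datetime.utcnow() - timedelta(days=RETENTION_DAYS)

# One JSON object per index, e.g. {"index": "logs-2017.09.12"}
for entry in requests.get(ES + "/_cat/indices/logs-*?h=index&format=json").json():
    name = entry["index"]
    try:
        index_date = datetime.strptime(name, "logs-%Y.%m.%d")
    except ValueError:
        continue  # skip anything that doesn't follow the daily naming scheme
    if index_date < cutoff:
        print("Deleting " + name)
        requests.delete(ES + "/" + name)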
Performance Tuning
Elasticsearch cluster:
- 3 nodes
- 32GB RAM each
- SSD storage
- Heap size: 16GB (half of RAM, set via -Xms16g/-Xmx16g in /etc/elasticsearch/jvm.options)
Logstash:
- Pipeline workers: 4
- Batch size: 125
Filebeat:
- Bulk max size: 2048
- Worker: 2
Real-World Debugging
Scenario: Users reporting slow checkout
Before ELK (2-3 hours):
- SSH into API gateway
- Grep for checkout requests
- SSH into order service
- Grep for order creation
- SSH into payment service
- Find slow payment API call
With ELK (5 minutes):
- Search: path:"/checkout" AND duration:>1000
- Find the correlation_id
- Search: correlation_id:"abc-123"
- See all logs across services
- Identify the slow payment API call
Results
Before:
- Logs scattered across 15 servers
- Debugging takes 2-3 hours
- No log retention
- No correlation across services
After:
- All logs centralized
- Debugging takes 5-10 minutes
- 30-day retention
- Full request tracing
Lessons Learned
- Structured logging - JSON format is essential
- Correlation IDs - Track requests across services
- Index lifecycle - Manage disk space
- Alerting - Don’t just collect, alert on anomalies
- Performance - ELK needs resources (RAM, SSD)
Conclusion
The ELK stack transformed our debugging process. Centralized logging is essential for microservices.
Key takeaways:
- Elasticsearch for storage and search
- Logstash for collection and parsing
- Kibana for visualization
- Filebeat for log shipping
- Structured logging with correlation IDs
If you have more than 3 services, set up centralized logging. Your future self will thank you.