Distributed Tracing with Zipkin: Debugging Microservices
With 12 microservices in production, debugging has become a nightmare. A single user request touches 5-7 services, and when something is slow, I have no idea where the time is going.
I spent two weeks implementing distributed tracing with Zipkin. Now I can see exactly what’s happening across our entire system.
The Problem: Lost in Microservices
A typical request flow:
User → API Gateway → Auth Service → User Service → Order Service → Payment Service → Email Service
When a request takes 3 seconds, which service is slow? I had to:
- Check logs in 7 different services
- Correlate timestamps manually
- Hope the request IDs match
This took hours. There had to be a better way.
Enter Distributed Tracing
Distributed tracing tracks requests across service boundaries. Each request gets a unique trace ID that follows it through the entire system.
Key concepts:
- Trace: End-to-end journey of a request
- Span: Single operation within a trace
- Tags: Metadata about a span
- Logs: Events within a span
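To make these terms concrete, here is roughly what a single span looks like in Zipkin's v2 JSON model. The field names are Zipkin's; the values below are invented for illustration:
import time

# One span from a trace, as Zipkin's v2 API represents it (illustrative values)
span = {
    "traceId": "7f8a9b2c3d4e5f6a",                   # shared by every span in the trace
    "id": "bcd1a2e3f4a5b6c7",                        # this span's ID
    "parentId": "a1b2c3d4e5f6a7b8",                  # the calling span (absent for the root)
    "name": "get_user",                              # the operation this span covers
    "timestamp": int(time.time() * 1_000_000),       # start time, microseconds since epoch
    "duration": 1_100_000,                           # microseconds
    "localEndpoint": {"serviceName": "user-service"},
    "tags": {"user.id": "42"},                       # key/value metadata ("tags")
    "annotations": [                                 # timestamped events ("logs")
        {"timestamp": int(time.time() * 1_000_000), "value": "cache miss"}
    ],
}
Every span carries the trace ID, so the UI can stitch spans from different services back into one timeline.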
Why Zipkin?
I evaluated three options:
- Zipkin - Open source, mature, easy to set up
- Jaeger - Newer, more features, more complex
- AWS X-Ray - Managed, but vendor lock-in
I chose Zipkin because:
- Simple to deploy
- Good Python/Go libraries
- Active community
- Works with our existing infrastructure
Setting Up Zipkin
Deploy Zipkin with Docker:
# docker-compose.yml
version: '3'
services:
  zipkin:
    image: openzipkin/zipkin:2.4
    ports:
      - "9411:9411"
    environment:
      - STORAGE_TYPE=elasticsearch
      - ES_HOSTS=elasticsearch:9200
  elasticsearch:
    image: elasticsearch:5.6
    ports:
      - "9200:9200"
Start it:
docker-compose up -d
Zipkin UI is now at http://localhost:9411.
Instrumenting Python Services
Install the library:
pip install py_zipkin
Add tracing to Flask:
from flask import Flask, request, jsonify
from py_zipkin.zipkin import zipkin_span, ZipkinAttrs
from py_zipkin.transport import BaseTransportHandler
import requests

app = Flask(__name__)

class HTTPTransport(BaseTransportHandler):
    def get_max_payload_bytes(self):
        return None

    def send(self, encoded_span):
        # Ship encoded spans to the Zipkin collector
        requests.post(
            'http://zipkin:9411/api/v1/spans',
            data=encoded_span,
            headers={'Content-Type': 'application/x-thrift'}
        )
def extract_zipkin_attrs():
    """Extract trace context from incoming request headers, if any."""
    if 'X-B3-TraceId' not in request.headers:
        # No incoming trace context: let zipkin_span start a new trace
        return None
    return ZipkinAttrs(
        trace_id=request.headers.get('X-B3-TraceId'),
        span_id=request.headers.get('X-B3-SpanId'),
        parent_span_id=request.headers.get('X-B3-ParentSpanId'),
        flags=request.headers.get('X-B3-Flags', '0'),
        is_sampled=request.headers.get('X-B3-Sampled') == '1',
    )
@app.route('/users/<int:user_id>')
def get_user(user_id):
    with zipkin_span(
        service_name='user-service',
        span_name='get_user',
        transport_handler=HTTPTransport(),
        zipkin_attrs=extract_zipkin_attrs(),
        port=5000,
        sample_rate=100.0  # Sample 100% of requests
    ):
        user = fetch_user_from_db(user_id)
        return jsonify(user)
def fetch_user_from_db(user_id):
    with zipkin_span(
        service_name='user-service',
        span_name='db_query',
        # Key/value span tags are passed as binary_annotations in py_zipkin
        binary_annotations={'query': f'SELECT * FROM users WHERE id={user_id}'}
    ):
        # Database query
        return db.query(f'SELECT * FROM users WHERE id={user_id}')
Propagating Trace Context
When calling other services, propagate the trace context:
def call_order_service(user_id):
    with zipkin_span(
        service_name='user-service',
        span_name='call_order_service'
    ) as span:
        # Forward the B3 headers so the downstream service joins this trace
        headers = {
            'X-B3-TraceId': span.zipkin_attrs.trace_id,
            'X-B3-SpanId': span.zipkin_attrs.span_id,
            'X-B3-ParentSpanId': span.zipkin_attrs.parent_span_id,
            'X-B3-Sampled': '1'
        }
        response = requests.get(
            f'http://order-service:5000/orders?user_id={user_id}',
            headers=headers
        )
        return response.json()
Now the trace continues across service boundaries.
Instrumenting Go Services
Install the library:
go get github.com/openzipkin/zipkin-go
Add tracing:
package main

import (
    "net/http"

    "github.com/openzipkin/zipkin-go"
    zipkinhttp "github.com/openzipkin/zipkin-go/middleware/http"
    reporterhttp "github.com/openzipkin/zipkin-go/reporter/http"
)

// Package-level tracer so handlers can create child spans
var tracer *zipkin.Tracer

func main() {
    // Create reporter
    reporter := reporterhttp.NewReporter("http://zipkin:9411/api/v2/spans")
    defer reporter.Close()

    // Create tracer
    endpoint, _ := zipkin.NewEndpoint("payment-service", "localhost:8080")
    tracer, _ = zipkin.NewTracer(reporter, zipkin.WithLocalEndpoint(endpoint))

    // Wrap HTTP handler
    mux := http.NewServeMux()
    mux.HandleFunc("/payment", handlePayment)
    handler := zipkinhttp.NewServerMiddleware(
        tracer,
        zipkinhttp.TagResponseSize(true),
    )(mux)

    http.ListenAndServe(":8080", handler)
}
func handlePayment(w http.ResponseWriter, r *http.Request) {
    // Span is automatically created by the middleware
    span := zipkin.SpanFromContext(r.Context())

    // Add custom tags
    span.Tag("payment.amount", r.FormValue("amount"))
    span.Tag("payment.method", r.FormValue("method"))

    // Create a child span for the database operation
    childSpan := tracer.StartSpan("db_insert", zipkin.Parent(span.Context()))
    defer childSpan.Finish()

    // Database operation
    insertPayment(r.FormValue("amount"))

    w.Write([]byte("Payment processed"))
}
Viewing Traces in Zipkin UI
After instrumenting services, traces appear in Zipkin:
- Search traces - Filter by service, duration, tags
- View timeline - See which spans took longest
- Inspect details - Tags, logs, errors
Example trace:
Trace ID: 7f8a9b2c3d4e5f6a
Duration: 2.3s
├─ api-gateway (50ms)
├─ auth-service (30ms)
├─ user-service (1.2s)        ← SLOW!
│  ├─ db_query (1.1s)         ← Problem here
│  └─ cache_check (10ms)
├─ order-service (800ms)
│  └─ payment-service (750ms)
└─ email-service (100ms)
Immediately I can see the user-service database query is the bottleneck.
Adding Custom Tags
Tags help filter and analyze traces:
with zipkin_span(
    service_name='user-service',
    span_name='get_user'
) as span:
    # Tag values are strings in Zipkin's data model
    span.update_binary_annotations({
        'user.id': str(user_id),
        'user.role': user.role,
        'cache.hit': str(cache_hit),
        'db.query_time': str(query_time)
    })
Now I can search for:
- All requests for a specific user
- All cache misses
- All slow database queries
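These searches can be run from the UI or against the Zipkin API directly. A rough sketch, assuming the v2 /api/v2/traces endpoint with its annotationQuery and minDuration parameters, and the tag names added above:
import requests
from datetime import datetime

# Find recent user-service traces that missed the cache and took over 500ms
params = {
    'serviceName': 'user-service',
    'annotationQuery': 'cache.hit=False and user.id=42',  # tag filters, "and"-separated
    'minDuration': 500000,                                 # microseconds
    'endTs': int(datetime.now().timestamp() * 1000),       # milliseconds
    'lookback': 3600000,                                    # last hour, in milliseconds
}
traces = requests.get('http://zipkin:9411/api/v2/traces', params=params).json()
print(f'{len(traces)} matching traces')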
Sampling Strategy
Tracing every request creates overhead. Use sampling:
# Sample 10% of requests by default
sample_rate = 10.0

# Sample 100% of requests on critical endpoints. Note that the sampling
# decision is made when the span starts, so it cannot depend on the request's
# duration; duration-based ("tail") sampling has to happen downstream.
if request.path.startswith('/checkout'):
    sample_rate = 100.0

with zipkin_span(
    service_name='user-service',
    span_name='get_user',
    sample_rate=sample_rate
):
    # ...
Real-World Debugging Example
Problem: Checkout taking 5+ seconds
Investigation:
- Search Zipkin for slow checkout traces
- Find trace showing 4.5s in payment-service
- Drill down: 4.2s in external payment API call
- Check tags: payment API returning 503 errors
- Implement retry logic with exponential backoff (sketched below)
Result: Checkout time reduced to 1.2s
Without tracing, this would have taken hours to debug.
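For reference, the retry logic from step 5 was nothing fancy. A minimal sketch; the payment URL and the span names here are illustrative, not our exact code:
import time
import requests
from py_zipkin.zipkin import zipkin_span

def call_payment_api(payload, max_retries=3):
    # Retry transient failures (e.g. 503s) with exponential backoff,
    # wrapped in a span so the attempts show up in the trace.
    with zipkin_span(service_name='payment-service', span_name='external_payment_api') as span:
        for attempt in range(max_retries):
            response = requests.post('https://payments.example.com/charge', json=payload)
            if response.status_code < 500:
                span.update_binary_annotations({'retry.attempts': str(attempt + 1)})
                return response.json()
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s between attempts
        raise RuntimeError('payment API unavailable after retries')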
Integration with Logging
Correlate traces with logs by adding the trace ID to log messages:
import logging

logger = logging.getLogger(__name__)

@app.route('/users/<int:user_id>')
def get_user(user_id):
    with zipkin_span(
        service_name='user-service',
        span_name='get_user',
        # transport_handler, zipkin_attrs, sample_rate as in the earlier example
    ) as span:
        trace_id = span.zipkin_attrs.trace_id
        logger.info(f'[{trace_id}] Fetching user {user_id}')
        try:
            user = fetch_user_from_db(user_id)
            logger.info(f'[{trace_id}] User found: {user.name}')
            return jsonify(user)
        except Exception as e:
            logger.error(f'[{trace_id}] Error fetching user: {e}')
            raise
Now I can find all logs for a specific trace.
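Prefixing every message by hand gets tedious. One way to automate it is a logging.Filter that reads the trace ID from a context variable; this is a standard-library sketch, and the contextvar is something you have to set yourself when the span is opened, it is not part of py_zipkin:
import contextvars
import logging

# Set this right after entering the root zipkin_span, e.g.:
#   current_trace_id.set(span.zipkin_attrs.trace_id)
current_trace_id = contextvars.ContextVar('current_trace_id', default='-')

class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('[%(trace_id)s] %(levelname)s %(message)s'))
handler.addFilter(TraceIdFilter())
logging.getLogger().addHandler(handler)
With the filter in place, plain logger.info('Fetching user ...') calls pick up the trace ID without manual f-string prefixes.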
Alerting on Slow Traces
Set up alerts for slow traces:
# alert_slow_traces.py
import requests
from datetime import datetime

def check_slow_traces():
    # Query the Zipkin API for traces from the last 5 minutes
    end_time = int(datetime.now().timestamp() * 1000)  # endTs is in milliseconds
    response = requests.get(
        'http://zipkin:9411/api/v2/traces',
        params={
            'endTs': end_time,
            'lookback': 300000,     # 5 minutes, in milliseconds
            'minDuration': 3000000  # 3 seconds, in microseconds
        }
    )
    slow_traces = response.json()
    if len(slow_traces) > 10:
        send_alert(f'Found {len(slow_traces)} slow traces in the last 5 minutes')

# Run every 5 minutes (cron, scheduler, etc.)
Performance Impact
Tracing adds overhead. Measurements:
| Metric | Without Tracing | With Tracing | Overhead |
|---|---|---|---|
| Latency (p50) | 45ms | 47ms | 4% |
| Latency (p99) | 250ms | 265ms | 6% |
| CPU usage | 15% | 17% | 13% |
| Memory | 120MB | 135MB | 12% |
The overhead is acceptable for the visibility gained.
Best Practices
- Sample intelligently - Don’t trace everything
- Add meaningful tags - User ID, request type, etc.
- Keep spans focused - One operation per span
- Propagate context - Always pass trace headers
- Monitor trace volume - Don’t overwhelm Zipkin
Lessons Learned
What worked:
- Zipkin is easy to set up and use
- Distributed tracing saves hours of debugging
- Custom tags make traces searchable
- Integration with logs is powerful
Challenges:
- Instrumenting all services took time
- Some libraries don’t support tracing
- Sampling strategy requires tuning
- Zipkin storage fills up quickly
What I’d do differently:
- Start with tracing from day one
- Use automatic instrumentation where possible
- Set up retention policies earlier
- Document trace ID format for the team
Conclusion
Distributed tracing transformed how we debug microservices. What used to take hours now takes minutes.
Key takeaways:
- Implement tracing early in your microservices journey
- Propagate trace context across all service calls
- Add custom tags for better searchability
- Integrate with logging for complete picture
- Sample intelligently to reduce overhead
Zipkin has become an essential tool in our observability stack. I can’t imagine debugging microservices without it.
If you’re running microservices and don’t have distributed tracing, implement it now. Your future debugging self will thank you.