Distributed Tracing with Zipkin: Debugging Microservices
With 12 microservices in production, debugging has become a nightmare. A single user request touches 5-7 services, and when something is slow, I have no idea where the time is going.
I spent two weeks implementing distributed tracing with Zipkin. Now I can see exactly what’s happening across our entire system.
The Problem: Lost in Microservices
A typical request flow:
User → API Gateway → Auth Service → User Service → Order Service → Payment Service → Email Service
When a request takes 3 seconds, which service is slow? I had to:
- Check logs in 7 different services
- Correlate timestamps manually
- Hope the request IDs match
This took hours. There had to be a better way.
Enter Distributed Tracing
Distributed tracing tracks requests across service boundaries. Each request gets a unique trace ID that follows it through the entire system.
Key concepts:
- Trace: End-to-end journey of a request
- Span: Single operation within a trace
- Tags: Metadata about a span
- Logs: Events within a span
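To make these terms concrete, here is roughly what a single span looks like in Zipkin's v2 JSON model. The field names are Zipkin's; the values below are invented for illustration:
import time

# One span from a trace, as Zipkin's v2 API represents it (illustrative values)
span = {
    "traceId": "7f8a9b2c3d4e5f6a",                   # shared by every span in the trace
    "id": "bcd1a2e3f4a5b6c7",                        # this span's ID
    "parentId": "a1b2c3d4e5f6a7b8",                  # the calling span (absent for the root)
    "name": "get_user",                              # the operation this span covers
    "timestamp": int(time.time() * 1_000_000),       # start time, microseconds since epoch
    "duration": 1_100_000,                           # microseconds
    "localEndpoint": {"serviceName": "user-service"},
    "tags": {"user.id": "42"},                       # key/value metadata ("tags")
    "annotations": [                                 # timestamped events ("logs")
        {"timestamp": int(time.time() * 1_000_000), "value": "cache miss"}
    ],
}
Every span carries the trace ID, so the UI can stitch spans from different services back into one timeline.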
Why Zipkin?
I evaluated three options:
- Zipkin - Open source, mature, easy to set up
- Jaeger - Newer, more features, more complex
- AWS X-Ray - Managed, but vendor lock-in
I chose Zipkin because:
- Simple to deploy
- Good Python/Go libraries
- Active community
- Works with our existing infrastructure
Setting Up Zipkin
Deploy Zipkin with Docker:
# docker-compose.yml
version: '3'
services:
  zipkin:
    image: openzipkin/zipkin:2.4
    ports:
      - "9411:9411"
    environment:
      - STORAGE_TYPE=elasticsearch
      - ES_HOSTS=elasticsearch:9200
  elasticsearch:
    image: elasticsearch:5.6
    ports:
      - "9200:9200"
Start it:
docker-compose up -d
Zipkin UI is now at http://localhost:9411.
Instrumenting Python Services
Install the library:
pip install py_zipkin
Add tracing to Flask:
from flask import Flask, request, jsonify
from py_zipkin.zipkin import zipkin_span, ZipkinAttrs
from py_zipkin.transport import BaseTransportHandler
import requests

app = Flask(__name__)

class HTTPTransport(BaseTransportHandler):
    def get_max_payload_bytes(self):
        return None

    def send(self, encoded_span):
        # Ship encoded spans to the Zipkin collector
        requests.post(
            'http://zipkin:9411/api/v1/spans',
            data=encoded_span,
            headers={'Content-Type': 'application/x-thrift'}
        )
def extract_zipkin_attrs():
    """Extract trace context from incoming request headers, if any."""
    if 'X-B3-TraceId' not in request.headers:
        # No incoming trace context: let zipkin_span start a new trace
        return None
    return ZipkinAttrs(
        trace_id=request.headers.get('X-B3-TraceId'),
        span_id=request.headers.get('X-B3-SpanId'),
        parent_span_id=request.headers.get('X-B3-ParentSpanId'),
        flags=request.headers.get('X-B3-Flags', '0'),
        is_sampled=request.headers.get('X-B3-Sampled') == '1',
    )
@app.route('/users/<int:user_id>')
def get_user(user_id):
    with zipkin_span(
        service_name='user-service',
        span_name='get_user',
        transport_handler=HTTPTransport(),
        zipkin_attrs=extract_zipkin_attrs(),
        port=5000,
        sample_rate=100.0  # Sample 100% of requests
    ):
        user = fetch_user_from_db(user_id)
        return jsonify(user)
def fetch_user_from_db(user_id):
    with zipkin_span(
        service_name='user-service',
        span_name='db_query',
        # Key/value span tags are passed as binary_annotations in py_zipkin
        binary_annotations={'query': f'SELECT * FROM users WHERE id={user_id}'}
    ):
        # Database query
        return db.query(f'SELECT * FROM users WHERE id={user_id}')
Propagating Trace Context
When calling other services, propagate the trace context:
def call_order_service(user_id):
    with zipkin_span(
        service_name='user-service',
        span_name='call_order_service'
    ) as span:
        # Forward the B3 headers so the downstream service joins this trace
        headers = {
            'X-B3-TraceId': span.zipkin_attrs.trace_id,
            'X-B3-SpanId': span.zipkin_attrs.span_id,
            'X-B3-ParentSpanId': span.zipkin_attrs.parent_span_id,
            'X-B3-Sampled': '1'
        }
        response = requests.get(
            f'http://order-service:5000/orders?user_id={user_id}',
            headers=headers
        )
        return response.json()
Now the trace continues across service boundaries.
Instrumenting Go Services
Install the library:
go get github.com/openzipkin/zipkin-go
Add tracing:
package main

import (
    "net/http"

    "github.com/openzipkin/zipkin-go"
    zipkinhttp "github.com/openzipkin/zipkin-go/middleware/http"
    reporterhttp "github.com/openzipkin/zipkin-go/reporter/http"
)

// Package-level tracer so handlers can create child spans
var tracer *zipkin.Tracer

func main() {
    // Create reporter
    reporter := reporterhttp.NewReporter("http://zipkin:9411/api/v2/spans")
    defer reporter.Close()

    // Create tracer
    endpoint, _ := zipkin.NewEndpoint("payment-service", "localhost:8080")
    tracer, _ = zipkin.NewTracer(reporter, zipkin.WithLocalEndpoint(endpoint))

    // Wrap HTTP handler
    mux := http.NewServeMux()
    mux.HandleFunc("/payment", handlePayment)
    handler := zipkinhttp.NewServerMiddleware(
        tracer,
        zipkinhttp.TagResponseSize(true),
    )(mux)

    http.ListenAndServe(":8080", handler)
}
func handlePayment(w http.ResponseWriter, r *http.Request) {
    // Span is automatically created by the middleware
    span := zipkin.SpanFromContext(r.Context())

    // Add custom tags
    span.Tag("payment.amount", r.FormValue("amount"))
    span.Tag("payment.method", r.FormValue("method"))

    // Create a child span for the database operation
    childSpan := tracer.StartSpan("db_insert", zipkin.Parent(span.Context()))
    defer childSpan.Finish()

    // Database operation
    insertPayment(r.FormValue("amount"))

    w.Write([]byte("Payment processed"))
}
Viewing Traces in Zipkin UI
After instrumenting services, traces appear in Zipkin:
- Search traces - Filter by service, duration, tags
- View timeline - See which spans took longest
- Inspect details - Tags, logs, errors
Example trace:
Trace ID: 7f8a9b2c3d4e5f6a
Duration: 2.3s
├─ api-gateway (50ms)
├─ auth-service (30ms)
├─ user-service (1.2s)        ← SLOW!
│  ├─ db_query (1.1s)         ← Problem here
│  └─ cache_check (10ms)
├─ order-service (800ms)
│  └─ payment-service (750ms)
└─ email-service (100ms)
Immediately I can see the user-service database query is the bottleneck.
Adding Custom Tags
Tags help filter and analyze traces:
with zipkin_span(
    service_name='user-service',
    span_name='get_user'
) as span:
    # Tag values are strings in Zipkin's data model
    span.update_binary_annotations({
        'user.id': str(user_id),
        'user.role': user.role,
        'cache.hit': str(cache_hit),
        'db.query_time': str(query_time)
    })
Now I can search for:
- All requests for a specific user
- All cache misses
- All slow database queries
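These searches can be run from the UI or against the Zipkin API directly. A rough sketch, assuming the v2 /api/v2/traces endpoint with its annotationQuery and minDuration parameters, and the tag names added above:
import requests
from datetime import datetime

# Find recent user-service traces that missed the cache and took over 500ms
params = {
    'serviceName': 'user-service',
    'annotationQuery': 'cache.hit=False and user.id=42',  # tag filters, "and"-separated
    'minDuration': 500000,                                 # microseconds
    'endTs': int(datetime.now().timestamp() * 1000),       # milliseconds
    'lookback': 3600000,                                    # last hour, in milliseconds
}
traces = requests.get('http://zipkin:9411/api/v2/traces', params=params).json()
print(f'{len(traces)} matching traces')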
Sampling Strategy
Tracing every request creates overhead. Use sampling:
# Sample 10% of requests by default
sample_rate = 10.0

# Sample 100% of requests on critical endpoints. Note that the sampling
# decision is made when the span starts, so it cannot depend on the request's
# duration; duration-based ("tail") sampling has to happen downstream.
if request.path.startswith('/checkout'):
    sample_rate = 100.0

with zipkin_span(
    service_name='user-service',
    span_name='get_user',
    sample_rate=sample_rate
):
    # ...
Real-World Debugging Example
Problem: Checkout taking 5+ seconds
Investigation:
- Search Zipkin for slow checkout traces
- Find trace showing 4.5s in payment-service
- Drill down: 4.2s in external payment API call
- Check tags: payment API returning 503 errors
- Implement retry logic with exponential backoff (sketched below)
Result: Checkout time reduced to 1.2s
Without tracing, this would have taken hours to debug.
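For reference, the retry logic from step 5 was nothing fancy. A minimal sketch; the payment URL and the span names here are illustrative, not our exact code:
import time
import requests
from py_zipkin.zipkin import zipkin_span

def call_payment_api(payload, max_retries=3):
    # Retry transient failures (e.g. 503s) with exponential backoff,
    # wrapped in a span so the attempts show up in the trace.
    with zipkin_span(service_name='payment-service', span_name='external_payment_api') as span:
        for attempt in range(max_retries):
            response = requests.post('https://payments.example.com/charge', json=payload)
            if response.status_code < 500:
                span.update_binary_annotations({'retry.attempts': str(attempt + 1)})
                return response.json()
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s between attempts
        raise RuntimeError('payment API unavailable after retries')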
Integration with Logging
Correlate traces with logs by adding the trace ID to log messages:
import logging

logger = logging.getLogger(__name__)

@app.route('/users/<int:user_id>')
def get_user(user_id):
    with zipkin_span(
        service_name='user-service',
        span_name='get_user',
        # transport_handler, zipkin_attrs, sample_rate as in the earlier example
    ) as span:
        trace_id = span.zipkin_attrs.trace_id
        logger.info(f'[{trace_id}] Fetching user {user_id}')
        try:
            user = fetch_user_from_db(user_id)
            logger.info(f'[{trace_id}] User found: {user.name}')
            return jsonify(user)
        except Exception as e:
            logger.error(f'[{trace_id}] Error fetching user: {e}')
            raise
Now I can find all logs for a specific trace.
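Prefixing every message by hand gets tedious. One way to automate it is a logging.Filter that reads the trace ID from a context variable; this is a standard-library sketch, and the contextvar is something you have to set yourself when the span is opened, it is not part of py_zipkin:
import contextvars
import logging

# Set this right after entering the root zipkin_span, e.g.:
#   current_trace_id.set(span.zipkin_attrs.trace_id)
current_trace_id = contextvars.ContextVar('current_trace_id', default='-')

class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('[%(trace_id)s] %(levelname)s %(message)s'))
handler.addFilter(TraceIdFilter())
logging.getLogger().addHandler(handler)
With the filter in place, plain logger.info('Fetching user ...') calls pick up the trace ID without manual f-string prefixes.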
Alerting on Slow Traces
Set up alerts for slow traces:
# alert_slow_traces.py
import requests
from datetime import datetime

def check_slow_traces():
    # Query the Zipkin API for traces from the last 5 minutes
    end_time = int(datetime.now().timestamp() * 1000)  # endTs is in milliseconds
    response = requests.get(
        'http://zipkin:9411/api/v2/traces',
        params={
            'endTs': end_time,
            'lookback': 300000,     # 5 minutes, in milliseconds
            'minDuration': 3000000  # 3 seconds, in microseconds
        }
    )
    slow_traces = response.json()
    if len(slow_traces) > 10:
        send_alert(f'Found {len(slow_traces)} slow traces in the last 5 minutes')

# Run every 5 minutes (cron, scheduler, etc.)
Performance Impact
Tracing adds overhead. Measurements:
| Metric | Without Tracing | With Tracing | Overhead |
|---|---|---|---|
| Latency (p50) | 45ms | 47ms | 4% |
| Latency (p99) | 250ms | 265ms | 6% |
| CPU usage | 15% | 17% | 13% |
| Memory | 120MB | 135MB | 12% |
The overhead is acceptable for the visibility gained.
Best Practices
- Sample intelligently - Don’t trace everything
- Add meaningful tags - User ID, request type, etc.
- Keep spans focused - One operation per span
- Propagate context - Always pass trace headers
- Monitor trace volume - Don’t overwhelm Zipkin
Lessons Learned
What worked:
- Zipkin is easy to set up and use
- Distributed tracing saves hours of debugging
- Custom tags make traces searchable
- Integration with logs is powerful
Challenges:
- Instrumenting all services took time
- Some libraries don’t support tracing
- Sampling strategy requires tuning
- Zipkin storage fills up quickly
What I’d do differently:
- Start with tracing from day one
- Use automatic instrumentation where possible
- Set up retention policies earlier
- Document trace ID format for the team
Conclusion
Distributed tracing transformed how we debug microservices. What used to take hours now takes minutes.
Key takeaways:
- Implement tracing early in your microservices journey
- Propagate trace context across all service calls
- Add custom tags for better searchability
- Integrate with logging for complete picture
- Sample intelligently to reduce overhead
Zipkin has become an essential tool in our observability stack. I can’t imagine debugging microservices without it.
If you’re running microservices and don’t have distributed tracing, implement it now. Your future debugging self will thank you.