With 12 microservices in production, debugging has become a nightmare. A single user request touches 5-7 services, and when something is slow, I have no idea where the time is going.

I spent two weeks implementing distributed tracing with Zipkin. Now I can see exactly what’s happening across our entire system.

The Problem: Lost in Microservices

A typical request flow:

User → API Gateway → Auth Service → User Service → Order Service → Payment Service → Email Service

When a request takes 3 seconds, which service is slow? I had to:

  1. Check logs in 7 different services
  2. Correlate timestamps manually
  3. Hope the request IDs match

This took hours. There had to be a better way.

Enter Distributed Tracing

Distributed tracing tracks requests across service boundaries. Each request gets a unique trace ID that follows it through the entire system.

Key concepts:

  • Trace: End-to-end journey of a request
  • Span: Single operation within a trace
  • Tags: Metadata about a span
  • Logs: Events within a span
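
Concretely, a reported span is just a small JSON document. Here's roughly what a Zipkin v2 span looks like (the IDs and values below are illustrative):

{
  "traceId": "7f8a9b2c3d4e5f6a",
  "id": "b3c4d5e6f7a8b9c0",
  "parentId": "a1b2c3d4e5f6a7b8",
  "name": "get_user",
  "timestamp": 1510256000000000,
  "duration": 1200000,
  "localEndpoint": { "serviceName": "user-service" },
  "tags": { "user.id": "42", "cache.hit": "false" }
}

Timestamps and durations are in microseconds; the parentId links each span to its caller, which is how Zipkin assembles the tree.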

Why Zipkin?

I evaluated three options:

  1. Zipkin - Open source, mature, easy to set up
  2. Jaeger - Newer, more features, more complex
  3. AWS X-Ray - Managed, but vendor lock-in

I chose Zipkin because:

  • Simple to deploy
  • Good Python/Go libraries
  • Active community
  • Works with our existing infrastructure

Setting Up Zipkin

Deploy Zipkin with Docker:

# docker-compose.yml
version: '3'

services:
  zipkin:
    image: openzipkin/zipkin:2.4
    ports:
      - "9411:9411"
    environment:
      - STORAGE_TYPE=elasticsearch
      - ES_HOSTS=elasticsearch:9200
  
  elasticsearch:
    image: elasticsearch:5.6
    ports:
      - "9200:9200"

Start it:

docker-compose up -d

Zipkin UI is now at http://localhost:9411.
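
Before instrumenting anything, it's worth confirming the collector is actually up. Zipkin exposes a /health endpoint; a quick check in Python:

import requests

# Expect HTTP 200 once Zipkin is ready to receive spans
resp = requests.get('http://localhost:9411/health')
print(resp.status_code, resp.text)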

Instrumenting Python Services

Install the library:

pip install py_zipkin

Add tracing to Flask:

from flask import Flask, jsonify, request
from py_zipkin.zipkin import zipkin_span, ZipkinAttrs
from py_zipkin.transport import BaseTransportHandler
import requests

app = Flask(__name__)

class HTTPTransport(BaseTransportHandler):
    def get_max_payload_bytes(self):
        return None
    
    def send(self, encoded_span):
        requests.post(
            'http://zipkin:9411/api/v1/spans',
            data=encoded_span,
            headers={'Content-Type': 'application/x-thrift'}
        )

def extract_zipkin_attrs():
    """Extract trace context from incoming B3 headers.

    Returns None when no trace is in progress, so py_zipkin
    starts a fresh trace instead of using empty attributes.
    """
    if 'X-B3-TraceId' not in request.headers:
        return None
    return ZipkinAttrs(
        trace_id=request.headers.get('X-B3-TraceId'),
        span_id=request.headers.get('X-B3-SpanId'),
        parent_span_id=request.headers.get('X-B3-ParentSpanId'),
        flags=request.headers.get('X-B3-Flags'),
        is_sampled=request.headers.get('X-B3-Sampled') == '1'
    )

@app.route('/users/<int:user_id>')
def get_user(user_id):
    with zipkin_span(
        service_name='user-service',
        span_name='get_user',
        transport_handler=HTTPTransport(),
        zipkin_attrs=extract_zipkin_attrs(),
        port=5000,
        sample_rate=100.0  # Sample 100% of requests
    ):
        user = fetch_user_from_db(user_id)
        return jsonify(user)

def fetch_user_from_db(user_id):
    # Nested spans pick up the ambient trace context, so no
    # transport_handler or zipkin_attrs are needed here.
    with zipkin_span(
        service_name='user-service',
        span_name='db_query',
        binary_annotations={'db.statement': 'SELECT * FROM users WHERE id=?'}
    ):
        # Parameterized query avoids SQL injection
        return db.query('SELECT * FROM users WHERE id=?', user_id)
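
With the service instrumented, a quick smoke test confirms spans are arriving (the URL and user ID here are illustrative):

import requests

# Hit the instrumented endpoint...
requests.get('http://localhost:5000/users/42')

# ...then ask Zipkin whether it recorded spans for user-service
resp = requests.get(
    'http://zipkin:9411/api/v2/traces',
    params={'serviceName': 'user-service', 'limit': 1}
)
print(resp.json())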

Propagating Trace Context

When calling other services, propagate the trace context. py_zipkin ships a helper that generates the outgoing B3 headers from the active span, which is safer than building them by hand (hand-rolled headers that reuse the current span ID produce malformed traces):

from py_zipkin.zipkin import create_http_headers_for_new_span

def call_order_service(user_id):
    with zipkin_span(
        service_name='user-service',
        span_name='call_order_service'
    ):
        # Generates X-B3-* headers with a fresh span ID and the
        # current span as parent, from the active trace context
        headers = create_http_headers_for_new_span()

        response = requests.get(
            f'http://order-service:5000/orders?user_id={user_id}',
            headers=headers
        )
        return response.json()

Now the trace continues across service boundaries.

Instrumenting Go Services

Install the library:

go get github.com/openzipkin/zipkin-go

Add tracing:

package main

import (
    "log"
    "net/http"

    "github.com/openzipkin/zipkin-go"
    zipkinhttp "github.com/openzipkin/zipkin-go/middleware/http"
    reporterhttp "github.com/openzipkin/zipkin-go/reporter/http"
)

// Package-level tracer so handlers can create child spans
var tracer *zipkin.Tracer

func main() {
    // Create reporter that sends spans to the Zipkin v2 API
    reporter := reporterhttp.NewReporter("http://zipkin:9411/api/v2/spans")
    defer reporter.Close()

    // Create tracer
    endpoint, err := zipkin.NewEndpoint("payment-service", "localhost:8080")
    if err != nil {
        log.Fatal(err)
    }
    tracer, err = zipkin.NewTracer(reporter, zipkin.WithLocalEndpoint(endpoint))
    if err != nil {
        log.Fatal(err)
    }

    // Wrap the HTTP handler so every request gets a server span
    mux := http.NewServeMux()
    mux.HandleFunc("/payment", handlePayment)

    handler := zipkinhttp.NewServerMiddleware(
        tracer,
        zipkinhttp.TagResponseSize(true),
    )(mux)

    log.Fatal(http.ListenAndServe(":8080", handler))
}

func handlePayment(w http.ResponseWriter, r *http.Request) {
    // Span is automatically created by the middleware
    span := zipkin.SpanFromContext(r.Context())

    // Add custom tags
    span.Tag("payment.amount", r.FormValue("amount"))
    span.Tag("payment.method", r.FormValue("method"))

    // Create a child span for the database operation
    childSpan := tracer.StartSpan("db_insert", zipkin.Parent(span.Context()))
    defer childSpan.Finish()

    // Database operation
    insertPayment(r.FormValue("amount"))

    w.Write([]byte("Payment processed"))
}

// insertPayment is the real DB write; stubbed here for brevity
func insertPayment(amount string) {}

Viewing Traces in Zipkin UI

After instrumenting services, traces appear in Zipkin:

  1. Search traces - Filter by service, duration, tags
  2. View timeline - See which spans took longest
  3. Inspect details - Tags, logs, errors

Example trace:

Trace ID: 7f8a9b2c3d4e5f6a
Duration: 2.3s

└─ api-gateway (2.3s)
   ├─ auth-service (30ms)
   ├─ user-service (1.2s) ← SLOW!
   │  ├─ db_query (1.1s) ← Problem here
   │  └─ cache_check (10ms)
   └─ order-service (800ms)
      └─ payment-service (750ms)
         └─ email-service (100ms)

Immediately I can see the user-service database query is the bottleneck.

Adding Custom Tags

Tags help filter and analyze traces:

with zipkin_span(
    service_name='user-service',
    span_name='get_user'
) as span:
    # Binary annotation values must be strings
    span.update_binary_annotations({
        'user.id': str(user_id),
        'user.role': user.role,
        'cache.hit': str(cache_hit),
        'db.query_time': str(query_time)
    })

Now I can search for:

  • All requests for a specific user
  • All cache misses
  • All slow database queries
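
The same searches work against the Zipkin API directly. For example, filtering by custom tags with annotationQuery (the tag values here are illustrative):

import requests

# Find recent user-service traces matching our custom tags;
# minDuration is in microseconds
resp = requests.get(
    'http://zipkin:9411/api/v2/traces',
    params={
        'serviceName': 'user-service',
        'annotationQuery': 'user.id=42 and cache.hit=false',
        'minDuration': 500000  # spans slower than 500ms
    }
)
for trace in resp.json():
    print(trace[0]['traceId'])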

Sampling Strategy

Tracing every request creates overhead, so sample a subset. Note that the sampling decision is made when the root span is created, before you know how long the request will take, so you can't retroactively sample only slow requests. Instead, vary the rate by endpoint:

# Sample 10% of requests by default
sample_rate = 10.0

# Always trace critical endpoints
if request.path.startswith('/checkout'):
    sample_rate = 100.0

with zipkin_span(
    service_name='user-service',
    span_name='get_user',
    transport_handler=HTTPTransport(),
    sample_rate=sample_rate
):
    # ...

Real-World Debugging Example

Problem: Checkout taking 5+ seconds

Investigation:

  1. Search Zipkin for slow checkout traces
  2. Find trace showing 4.5s in payment-service
  3. Drill down: 4.2s in external payment API call
  4. Check tags: payment API returning 503 errors
  5. Implement retry logic with exponential backoff (see the sketch below)

Result: Checkout time reduced to 1.2s

Without tracing, this would have taken hours to debug.
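
The fix itself was small. A minimal sketch of the retry logic (the function name and endpoint are illustrative, not our actual code):

import time
import requests

def charge_with_retry(payload, max_attempts=4):
    """Call the external payment API, retrying 5xx responses
    with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_attempts):
        resp = requests.post('https://payments.example.com/charge',
                             json=payload, timeout=5)
        if resp.status_code < 500:
            return resp
        time.sleep(2 ** attempt)  # back off before retrying
    resp.raise_for_status()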

Integration with Logging

Correlate traces with logs by adding trace ID to log messages:

import logging

logger = logging.getLogger(__name__)

@app.route('/users/<int:user_id>')
def get_user(user_id):
    with zipkin_span(
        service_name='user-service',
        span_name='get_user'
    ) as span:
        trace_id = span.zipkin_attrs.trace_id
        
        logger.info(f'[{trace_id}] Fetching user {user_id}')
        
        try:
            user = fetch_user_from_db(user_id)
            logger.info(f'[{trace_id}] User found: {user.name}')
            return jsonify(user)
        except Exception as e:
            logger.error(f'[{trace_id}] Error fetching user: {e}')
            raise

Now I can find all logs for a specific trace.
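
To avoid threading the trace ID through every log call by hand, a logging.Filter can inject it automatically. A minimal sketch using a contextvar that the request handler sets (this helper is mine, not part of py_zipkin):

import contextvars
import logging

# Set by the request handler once the span is open
current_trace_id = contextvars.ContextVar('trace_id', default='-')

class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter('[%(trace_id)s] %(levelname)s %(message)s'))
logging.getLogger().addHandler(handler)

# In the route, after entering the span:
#     current_trace_id.set(span.zipkin_attrs.trace_id)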

Alerting on Slow Traces

Set up alerts for slow traces:

# alert_slow_traces.py
import time
from datetime import datetime

import requests

def check_slow_traces():
    # Query the Zipkin API for slow traces from the last 5 minutes
    end_time = int(datetime.now().timestamp() * 1000)

    response = requests.get(
        'http://zipkin:9411/api/v2/traces',
        params={
            'endTs': end_time,
            'lookback': 300000,      # 5 minutes in milliseconds
            'minDuration': 3000000   # 3 seconds in microseconds
        }
    )

    slow_traces = response.json()

    if len(slow_traces) > 10:
        # send_alert is our notification helper (Slack, PagerDuty, etc.)
        send_alert(f'Found {len(slow_traces)} slow traces in last 5 minutes')

# Run every 5 minutes (via cron, or a simple loop)
while True:
    check_slow_traces()
    time.sleep(300)

Performance Impact

Tracing adds overhead. Measurements:

Metric          Without Tracing   With Tracing   Overhead
Latency (p50)   45ms              47ms           +4%
Latency (p99)   250ms             265ms          +6%
CPU usage       15%               17%            +13%
Memory          120MB             135MB          +12%

The overhead is acceptable for the visibility gained.

Best Practices

  1. Sample intelligently - Don’t trace everything
  2. Add meaningful tags - User ID, request type, etc.
  3. Keep spans focused - One operation per span
  4. Propagate context - Always pass trace headers
  5. Monitor trace volume - Don’t overwhelm Zipkin

Lessons Learned

What worked:

  • Zipkin is easy to set up and use
  • Distributed tracing saves hours of debugging
  • Custom tags make traces searchable
  • Integration with logs is powerful

Challenges:

  • Instrumenting all services took time
  • Some libraries don’t support tracing
  • Sampling strategy requires tuning
  • Zipkin storage fills up quickly

What I’d do differently:

  • Start with tracing from day one
  • Use automatic instrumentation where possible
  • Set up retention policies earlier
  • Document trace ID format for the team

Conclusion

Distributed tracing transformed how we debug microservices. What used to take hours now takes minutes.

Key takeaways:

  1. Implement tracing early in microservices journey
  2. Propagate trace context across all service calls
  3. Add custom tags for better searchability
  4. Integrate with logging for complete picture
  5. Sample intelligently to reduce overhead

Zipkin has become an essential tool in our observability stack. I can’t imagine debugging microservices without it.

If you’re running microservices and don’t have distributed tracing, implement it now. Your future debugging self will thank you.