A user reported: “Checkout is slow.” Which of our 20 microservices was the bottleneck? Checking logs across all of them took 3 hours. The culprit: a slow database query in the inventory service.

I implemented Jaeger distributed tracing. Now I see the entire request flow in one view. Last slow request? Found the bottleneck in 30 seconds.

The Problem

20 microservices, one slow request:

  • API Gateway → Auth → User → Cart → Inventory → Pricing → Payment → Order
  • Which service is slow?
  • Checking logs manually: 3 hours
  • No visibility into service dependencies

We were blind.

Jaeger Overview

Components:

  • Agent: Receives traces from apps
  • Collector: Processes and stores traces
  • Query: Serves UI and API
  • Storage: Cassandra or Elasticsearch

Installing Jaeger

All-in-one (development):

docker run -d --name jaeger \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  jaegertracing/all-in-one:1.6

Access UI: http://localhost:16686
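
Before wiring up any services, you can confirm the container is reachable by hitting the query service's HTTP API (the same API the UI is built on). A minimal check in Python, assuming the all-in-one container above is running locally:

import requests

# Port 16686 serves both the UI and its HTTP API.
# /api/services lists every service that has reported at least one span.
resp = requests.get('http://localhost:16686/api/services', timeout=5)
resp.raise_for_status()
print(resp.json())  # e.g. {"data": ["user-service", ...], "total": 1}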

Production Deployment

Kubernetes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: jaeger-collector
  template:
    metadata:
      labels:
        app: jaeger-collector
    spec:
      containers:
      - name: jaeger-collector
        image: jaegertracing/jaeger-collector:1.6
        env:
        - name: SPAN_STORAGE_TYPE
          value: elasticsearch
        - name: ES_SERVER_URLS
          value: http://elasticsearch:9200
        ports:
        - containerPort: 14268
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector
spec:
  selector:
    app: jaeger-collector
  ports:
  - port: 14268
    targetPort: 14268
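
This manifest deploys only the collector. Application pods still need a path to it: the usual pattern is to run jaeger-agent as a sidecar or DaemonSet that forwards spans to the collector, and to point each service's tracer at that agent. On the client side that is one extra block in the jaeger_client config shown in the next section (the jaeger-agent hostname is an assumption about your cluster's service naming):

'local_agent': {
    'reporting_host': 'jaeger-agent',  # agent sidecar/DaemonSet address (assumed name)
    'reporting_port': 6831,            # compact Thrift over UDP
},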

Instrumenting Python Service

Install:

pip install jaeger-client opentracing-instrumentation

Initialize tracer:

from jaeger_client import Config

def init_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
            'reporter_batch_size': 1,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()

tracer = init_tracer('user-service')
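
One caveat: the reporter ships spans asynchronously, so a short-lived process can exit before the buffer is flushed. Closing the tracer on shutdown avoids losing the final spans; a minimal way to hook that up:

import atexit

# Flush any buffered spans before the process exits.
atexit.register(lambda: tracer.close())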

Trace a function:

from opentracing.ext import tags

def get_user(user_id):
    with tracer.start_span('get_user') as span:
        span.set_tag(tags.SPAN_KIND, tags.SPAN_KIND_RPC_SERVER)
        span.set_tag('user_id', user_id)
        
        # Database query
        with tracer.start_span('db_query', child_of=span) as db_span:
            db_span.set_tag(tags.DATABASE_TYPE, 'postgresql')
            user = db.query(User).filter_by(id=user_id).first()
        
        if not user:
            span.set_tag(tags.ERROR, True)
            span.log_kv({'event': 'user_not_found', 'user_id': user_id})
            raise ValueError(f'User {user_id} not found')
        
        return user

Flask Integration

from flask import Flask, jsonify
from flask_opentracing import FlaskTracing  # pip install Flask-OpenTracing

app = Flask(__name__)
tracing = FlaskTracing(init_tracer('api-gateway'), True, app)

@app.route('/api/users/<int:user_id>')
def get_user_api(user_id):
    # Automatically traced!
    user = get_user(user_id)
    return jsonify(user)

Propagating Context

Pass trace context to downstream services:

import requests
from opentracing.propagation import Format

def call_user_service(user_id):
    with tracer.start_span('call_user_service') as span:
        headers = {}
        tracer.inject(span.context, Format.HTTP_HEADERS, headers)
        
        response = requests.get(
            f'http://user-service/users/{user_id}',
            headers=headers
        )
        
        return response.json()

Receive context:

from flask import request, jsonify
from opentracing.propagation import Format

@app.route('/users/<int:user_id>')
def get_user_endpoint(user_id):
    span_ctx = tracer.extract(Format.HTTP_HEADERS, request.headers)
    
    with tracer.start_span('get_user', child_of=span_ctx) as span:
        user = get_user(user_id)
        return jsonify(user)
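
If you are not using FlaskTracing, the extract/start_span boilerplate ends up in every handler. One way to keep it in one place is a small decorator; this is a sketch (not part of flask_opentracing) that reuses the tracer and app defined above:

from functools import wraps

from flask import request, jsonify
from opentracing.propagation import Format

def traced(operation_name):
    """Extract the upstream trace context and wrap the handler in a span."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            span_ctx = tracer.extract(Format.HTTP_HEADERS, request.headers)
            with tracer.start_span(operation_name, child_of=span_ctx) as span:
                span.set_tag('http.url', request.path)
                return handler(*args, **kwargs)
        return wrapper
    return decorator

# The endpoint above can then be written as:
@app.route('/users/<int:user_id>')
@traced('get_user')
def get_user_handler(user_id):
    return jsonify(get_user(user_id))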

Go Service Instrumentation

package main

import (
    "context"

    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go/config"
)

func initTracer(serviceName string) opentracing.Tracer {
    cfg := config.Configuration{
        ServiceName: serviceName,
        Sampler: &config.SamplerConfig{
            Type:  "const",
            Param: 1,
        },
        Reporter: &config.ReporterConfig{
            LogSpans: true,
        },
    }

    // Closer and error are ignored here for brevity; close the tracer on shutdown in real code.
    tracer, _, _ := cfg.NewTracer()

    // StartSpanFromContext (used below) relies on the global tracer being set.
    opentracing.SetGlobalTracer(tracer)
    return tracer
}

func getUser(ctx context.Context, userID int) (*User, error) {
    span, ctx := opentracing.StartSpanFromContext(ctx, "get_user")
    defer span.Finish()
    
    span.SetTag("user_id", userID)
    
    // Database query
    dbSpan := opentracing.StartSpan("db_query", opentracing.ChildOf(span.Context()))
    user, err := db.GetUser(userID)
    dbSpan.Finish()
    
    if err != nil {
        span.SetTag("error", true)
        span.LogKV("event", "error", "message", err.Error())
        return nil, err
    }
    
    return user, nil
}

Sampling Strategies

Constant (sample all):

'sampler': {
    'type': 'const',
    'param': 1,  # 1 = 100%, 0 = 0%
}

Probabilistic (sample percentage):

'sampler': {
    'type': 'probabilistic',
    'param': 0.1,  # 10% of traces
}

Rate limiting (max traces per second):

'sampler': {
    'type': 'ratelimiting',
    'param': 100,  # 100 traces/sec
}
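
In practice the sampler is usually chosen per environment: constant sampling in development, probabilistic in production. A small sketch of wiring that choice into the tracer config (the ENVIRONMENT variable name is my own convention, not something Jaeger reads):

import os

from jaeger_client import Config

# 100% sampling locally, 10% everywhere else (adjust to taste).
if os.environ.get('ENVIRONMENT', 'dev') == 'dev':
    sampler = {'type': 'const', 'param': 1}
else:
    sampler = {'type': 'probabilistic', 'param': 0.1}

config = Config(
    config={'sampler': sampler, 'logging': True},
    service_name='user-service',
)
tracer = config.initialize_tracer()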

Analyzing Traces

Jaeger UI shows:

  • Timeline: Visual representation of spans
  • Duration: Time spent in each service
  • Dependencies: Service call graph
  • Errors: Failed spans highlighted

Example trace:

API Gateway (200ms)
├── Auth Service (50ms)
├── User Service (30ms)
├── Cart Service (20ms)
└── Inventory Service (100ms)  ← Bottleneck!
    └── Database Query (95ms)  ← Root cause!

Custom Tags and Logs

Add context:

span.set_tag('user_id', user_id)
span.set_tag('cart_size', len(cart.items))
span.set_tag('payment_method', 'credit_card')

span.log_kv({
    'event': 'payment_processed',
    'amount': 99.99,
    'currency': 'USD'
})

Error Tracking

try:
    result = process_payment(amount)
except PaymentError as e:
    span.set_tag(tags.ERROR, True)
    span.log_kv({
        'event': 'error',
        'error.kind': type(e).__name__,
        'error.object': str(e),
        'message': 'Payment processing failed'
    })
    raise
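
Repeating this try/except in every service makes it easy to miss a tag. A small context manager can apply the same error convention everywhere; a sketch that reuses the tracer from earlier:

from contextlib import contextmanager

from opentracing.ext import tags

@contextmanager
def traced_span(operation_name, **span_tags):
    """Start a span, apply tags, and record any exception as an error."""
    with tracer.start_span(operation_name) as span:
        for key, value in span_tags.items():
            span.set_tag(key, value)
        try:
            yield span
        except Exception as e:
            span.set_tag(tags.ERROR, True)
            span.log_kv({
                'event': 'error',
                'error.kind': type(e).__name__,
                'error.object': str(e),
            })
            raise

# Usage:
with traced_span('process_payment', payment_method='credit_card') as span:
    result = process_payment(amount)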

Service Dependencies

Jaeger automatically builds a dependency graph from the collected traces:

API Gateway
├── Auth Service
├── User Service
│   └── Database
├── Cart Service
│   ├── Redis
│   └── Inventory Service
│       └── Database
└── Payment Service
    └── External Payment API
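
The same graph is available over HTTP if you want to feed it into other tooling. The query service exposes a dependencies endpoint; the path and parameters below are my understanding of the UI's internal API, so verify them against your Jaeger version:

import time

import requests

end_ts = int(time.time() * 1000)   # milliseconds since epoch
lookback = 60 * 60 * 1000          # last hour

resp = requests.get(
    'http://localhost:16686/api/dependencies',
    params={'endTs': end_ts, 'lookback': lookback},
    timeout=5,
)
for link in resp.json().get('data', []):
    print(f"{link['parent']} -> {link['child']}: {link['callCount']} calls")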

Performance Analysis

Find slow spans:

  1. Filter by duration: > 1s
  2. Sort by duration
  3. Identify common patterns
  4. Optimize bottlenecks
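
The same filtering can be scripted against the query API, which is handy for a quick report of the slowest traces. A sketch; the parameter names mirror the UI's search form and may vary between Jaeger versions:

import time

import requests

end = int(time.time() * 1_000_000)   # microseconds since epoch
start = end - 60 * 60 * 1_000_000    # last hour

resp = requests.get(
    'http://localhost:16686/api/traces',
    params={
        'service': 'api-gateway',
        'minDuration': '1s',         # only traces slower than one second
        'limit': 20,
        'start': start,
        'end': end,
    },
    timeout=5,
)
for trace in resp.json().get('data', []):
    slowest = max(trace['spans'], key=lambda s: s['duration'])  # span durations are in microseconds
    print(trace['traceID'], slowest['operationName'], slowest['duration'] / 1000, 'ms')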

Example findings:

  • Database queries: 60% of request time
  • External API calls: 25%
  • Business logic: 15%

Integration with Prometheus

Export metrics:

from prometheus_client import Histogram

request_duration = Histogram(
    'request_duration_seconds',
    'Request duration',
    ['service', 'endpoint']
)

def get_user(user_id):
    with tracer.start_span('get_user') as span:
        with request_duration.labels('user-service', 'get_user').time():
            # ... implementation
            pass

Real-World Debugging

Scenario: Checkout taking 5 seconds

Before Jaeger (3 hours):

  1. Check API Gateway logs
  2. Check Auth Service logs
  3. Check User Service logs
  4. … (repeat for 20 services)
  5. Finally find slow query in Inventory Service

With Jaeger (30 seconds):

  1. Search for slow traces (> 5s)
  2. Open trace
  3. See Inventory Service taking 4.8s
  4. See database query taking 4.7s
  5. Optimize query

Results

Before:

  • Debugging: 2-3 hours per issue
  • No visibility into service dependencies
  • Blind to performance bottlenecks

After:

  • Debugging: 5-10 minutes per issue
  • Complete service dependency map
  • Identify bottlenecks instantly

Lessons Learned

  1. Instrument early - Add tracing from day one
  2. Sample wisely - 100% in dev, 10% in prod
  3. Add context - Tags and logs are invaluable
  4. Propagate context - Essential for distributed tracing
  5. Use with metrics - Tracing + metrics = complete picture

Conclusion

Distributed tracing is essential for microservices. Jaeger makes it easy.

Key takeaways:

  1. Trace requests across all services
  2. Identify bottlenecks visually
  3. Add custom tags and logs
  4. Sample appropriately for production
  5. Combine with metrics and logs

Stop debugging blind. Implement distributed tracing today.