Distributed Tracing with Jaeger: Finding the Needle in the Microservices Haystack
A user reported: “Checkout is slow.” Which of our 20 microservices was the bottleneck? Checking logs across every service took 3 hours; the culprit turned out to be a slow database query in the inventory service.
So I implemented distributed tracing with Jaeger. Now I can see the entire request flow in one view. The last slow request? I found the bottleneck in 30 seconds.
The Problem
20 microservices, one slow request:
- API Gateway → Auth → User → Cart → Inventory → Pricing → Payment → Order
- Which service is slow?
- Checking logs manually: 3 hours
- No visibility into service dependencies
We were blind.
Jaeger Overview
Components:
- Agent: Receives spans from instrumented services over UDP and forwards them to the collector
- Collector: Validates and processes spans, then writes them to storage
- Query: Serves the UI and API for searching traces
- Storage: Cassandra or Elasticsearch
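In practice each instrumented service sends spans over UDP to a nearby agent, which forwards them to the collector. A minimal sketch of pointing the Python client at a non-default agent (the `local_agent` keys are standard jaeger-client config; the hostname below is an assumption for illustration):
from jaeger_client import Config
config = Config(
    config={
        'sampler': {'type': 'const', 'param': 1},
        'local_agent': {
            'reporting_host': 'jaeger-agent.internal',  # assumed agent hostname; defaults to localhost
            'reporting_port': 6831,  # agent's compact Thrift UDP port
        },
    },
    service_name='user-service',
)
tracer = config.initialize_tracer()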
Installing Jaeger
All-in-one (development):
docker run -d --name jaeger \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  jaegertracing/all-in-one:1.6
Access UI: http://localhost:16686
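To confirm spans are arriving, emit a single test span and look for the service in the UI; a minimal sketch (the `smoke-test` service name and span name are arbitrary):
import time
from jaeger_client import Config
tracer = Config(
    config={'sampler': {'type': 'const', 'param': 1}, 'logging': True},
    service_name='smoke-test',
).initialize_tracer()
with tracer.start_span('hello-jaeger') as span:
    span.set_tag('test', True)
time.sleep(2)   # brief pause so buffered spans get sent over UDP
tracer.close()  # flush remaining spans and shut down the tracer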
Production Deployment
Kubernetes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: jaeger-collector
  template:
    metadata:
      labels:
        app: jaeger-collector
    spec:
      containers:
        - name: jaeger-collector
          image: jaegertracing/jaeger-collector:1.6
          env:
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200
          ports:
            - containerPort: 14268
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector
spec:
  selector:
    app: jaeger-collector
  ports:
    - port: 14268
      targetPort: 14268
Instrumenting Python Service
Install:
pip install jaeger-client opentracing-instrumentation
Initialize tracer:
from jaeger_client import Config
def init_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
            'reporter_batch_size': 1,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()
tracer = init_tracer('user-service')
Trace a function:
from opentracing.ext import tags
def get_user(user_id):
    with tracer.start_span('get_user') as span:
        span.set_tag(tags.SPAN_KIND, tags.SPAN_KIND_RPC_SERVER)
        span.set_tag('user_id', user_id)
        # Database query in its own child span
        with tracer.start_span('db_query', child_of=span) as db_span:
            db_span.set_tag(tags.DATABASE_TYPE, 'postgresql')
            user = db.query(User).filter_by(id=user_id).first()
        if not user:
            span.set_tag(tags.ERROR, True)
            span.log_kv({'event': 'user_not_found', 'user_id': user_id})
            raise ValueError(f'User {user_id} not found')
        return user
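Wrapping every function body in `with tracer.start_span(...)` gets repetitive. A hypothetical decorator (my own helper, not part of jaeger-client) built on OpenTracing's `start_active_span` can factor the pattern out:
import functools
def traced(operation_name):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # start_active_span registers the span with the scope manager
            with tracer.start_active_span(operation_name) as scope:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    scope.span.set_tag('error', True)
                    scope.span.log_kv({'event': 'error', 'message': str(exc)})
                    raise
        return wrapper
    return decorator
@traced('get_user')
def get_user_decorated(user_id):  # illustrative variant of get_user above
    return db.query(User).filter_by(id=user_id).first()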
Flask Integration
from flask import Flask, jsonify
from flask_opentracing import FlaskTracing
app = Flask(__name__)
tracing = FlaskTracing(init_tracer('api-gateway'), True, app)
@app.route('/api/users/<int:user_id>')
def get_user_api(user_id):
    # Automatically traced!
    user = get_user(user_id)
    return jsonify(user)
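With `trace_all_requests=True`, every route already gets a server-side span; flask_opentracing exposes it via `FlaskTracing.get_span`, so you can add tags from inside a handler. A sketch, assuming that API and an illustrative extra endpoint:
from flask import request
@app.route('/api/users/<int:user_id>/details')
def get_user_details(user_id):
    span = tracing.get_span(request)  # span created automatically by FlaskTracing
    if span is not None:
        span.set_tag('user_id', user_id)
    return jsonify(get_user(user_id))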
Propagating Context
Pass trace context to downstream services:
import requests
from opentracing.propagation import Format
def call_user_service(user_id):
    with tracer.start_span('call_user_service') as span:
        headers = {}
        tracer.inject(span.context, Format.HTTP_HEADERS, headers)
        response = requests.get(
            f'http://user-service/users/{user_id}',
            headers=headers
        )
        return response.json()
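Injecting headers by hand at every call site is easy to forget, so a small wrapper around requests can do it consistently. A sketch (the `traced_get` helper is mine, not a library function; it assumes the module-level `tracer`):
import requests
from opentracing.ext import tags
from opentracing.propagation import Format
def traced_get(operation_name, url, **kwargs):
    with tracer.start_span(operation_name) as span:
        span.set_tag(tags.SPAN_KIND, tags.SPAN_KIND_RPC_CLIENT)
        span.set_tag(tags.HTTP_URL, url)
        headers = kwargs.pop('headers', {})
        tracer.inject(span.context, Format.HTTP_HEADERS, headers)
        response = requests.get(url, headers=headers, **kwargs)
        span.set_tag(tags.HTTP_STATUS_CODE, response.status_code)
        return response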
Receive context:
from flask import request, jsonify
from opentracing.propagation import Format
@app.route('/users/<int:user_id>')
def get_user_endpoint(user_id):
    span_ctx = tracer.extract(Format.HTTP_HEADERS, request.headers)
    with tracer.start_span('get_user', child_of=span_ctx) as span:
        user = get_user(user_id)
        return jsonify(user)
Go Service Instrumentation
package main
import (
    "context"
    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go"
    "github.com/uber/jaeger-client-go/config"
)
func initTracer(serviceName string) opentracing.Tracer {
    cfg := config.Configuration{
        ServiceName: serviceName,
        Sampler: &config.SamplerConfig{
            Type:  "const",
            Param: 1,
        },
        Reporter: &config.ReporterConfig{
            LogSpans: true,
        },
    }
    // Ignoring the closer and error for brevity; close the tracer on shutdown in real code.
    tracer, _, _ := cfg.NewTracer(config.Logger(jaeger.StdLogger))
    // Register globally so StartSpanFromContext below can find it.
    opentracing.SetGlobalTracer(tracer)
    return tracer
}
// User and db are application-specific and omitted here.
func getUser(ctx context.Context, userID int) (*User, error) {
    span, ctx := opentracing.StartSpanFromContext(ctx, "get_user")
    defer span.Finish()
    span.SetTag("user_id", userID)
    // Database query in its own child span
    dbSpan := opentracing.StartSpan("db_query", opentracing.ChildOf(span.Context()))
    user, err := db.GetUser(userID)
    dbSpan.Finish()
    if err != nil {
        span.SetTag("error", true)
        span.LogKV("event", "error", "message", err.Error())
        return nil, err
    }
    return user, nil
}
Sampling Strategies
Constant (sample all):
'sampler': {
    'type': 'const',
    'param': 1,  # 1 = 100%, 0 = 0%
}
Probabilistic (sample percentage):
'sampler': {
    'type': 'probabilistic',
    'param': 0.1,  # 10% of traces
}
Rate limiting (max traces per second):
'sampler': {
    'type': 'ratelimiting',
    'param': 100,  # 100 traces/sec
}
Analyzing Traces
Jaeger UI shows:
- Timeline: Visual representation of spans
- Duration: Time spent in each service
- Dependencies: Service call graph
- Errors: Failed spans highlighted
Example trace:
API Gateway (200ms)
├── Auth Service (50ms)
├── User Service (30ms)
├── Cart Service (20ms)
└── Inventory Service (100ms) ← Bottleneck!
└── Database Query (95ms) ← Root cause!
Custom Tags and Logs
Add context:
span.set_tag('user_id', user_id)
span.set_tag('cart_size', len(cart.items))
span.set_tag('payment_method', 'credit_card')
span.log_kv({
    'event': 'payment_processed',
    'amount': 99.99,
    'currency': 'USD'
})
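Both calls assume an active span in scope; pulled together, a checkout handler might look like this sketch (`process_payment` and the cart fields are illustrative, not from the original code):
def checkout(user_id, cart):
    with tracer.start_span('checkout') as span:
        span.set_tag('user_id', user_id)
        span.set_tag('cart_size', len(cart.items))
        span.set_tag('payment_method', 'credit_card')
        result = process_payment(cart.total)  # illustrative payment call
        span.log_kv({
            'event': 'payment_processed',
            'amount': cart.total,
            'currency': 'USD',
        })
        return result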
Error Tracking
try:
    result = process_payment(amount)
except PaymentError as e:
    # `span` is the current active span (e.g. the one opened in get_user above)
    span.set_tag(tags.ERROR, True)
    span.log_kv({
        'event': 'error',
        'error.kind': type(e).__name__,
        'error.object': str(e),
        'message': 'Payment processing failed'
    })
    raise
Service Dependencies
Jaeger automatically builds dependency graph:
API Gateway
├── Auth Service
├── User Service
│ └── Database
├── Cart Service
│ ├── Redis
│ └── Inventory Service
│ └── Database
└── Payment Service
└── External Payment API
Performance Analysis
Find slow spans:
- Filter by duration (e.g., > 1s)
- Sort by duration
- Identify common patterns
- Optimize bottlenecks
Example findings:
- Database queries: 60% of request time
- External API calls: 25%
- Business logic: 15%
Integration with Prometheus
Export metrics:
from prometheus_client import Histogram
request_duration = Histogram(
    'request_duration_seconds',
    'Request duration',
    ['service', 'endpoint']
)
def get_user(user_id):
    with tracer.start_span('get_user') as span:
        with request_duration.labels('user-service', 'get_user').time():
            # ... implementation
            pass
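Prometheus still needs an endpoint to scrape those metrics from; prometheus_client's built-in HTTP server is the simplest option (a sketch; port 8000 is arbitrary):
from prometheus_client import start_http_server
start_http_server(8000)  # exposes /metrics for Prometheus to scrape; call once at service startup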
Real-World Debugging
Scenario: Checkout taking 5 seconds
Before Jaeger (3 hours):
- Check API Gateway logs
- Check Auth Service logs
- Check User Service logs
- … (repeat for 20 services)
- Finally find slow query in Inventory Service
With Jaeger (30 seconds):
- Search for slow traces (> 5s)
- Open trace
- See Inventory Service taking 4.8s
- See database query taking 4.7s
- Optimize query
Results
Before:
- Debugging: 2-3 hours per issue
- No visibility into service dependencies
- Blind to performance bottlenecks
After:
- Debugging: 5-10 minutes per issue
- Complete service dependency map
- Identify bottlenecks instantly
Lessons Learned
- Instrument early - Add tracing from day one
- Sample wisely - 100% in dev, 10% in prod
- Add context - Tags and logs are invaluable
- Propagate context - Essential for distributed tracing
- Use with metrics - Tracing + metrics = complete picture
Conclusion
Distributed tracing is essential for microservices. Jaeger makes it easy.
Key takeaways:
- Trace requests across all services
- Identify bottlenecks visually
- Add custom tags and logs
- Sample appropriately for production
- Combine with metrics and logs
Stop debugging blind. Implement distributed tracing today.