Distributed Tracing with Jaeger: Finding the Needle in the Microservices Haystack
A user reported: “Checkout is slow.” Which of our 20 microservices was the bottleneck? Checking logs across every service took 3 hours; the culprit turned out to be a slow database query in the inventory service.
So I implemented distributed tracing with Jaeger. Now I can see the entire request flow in one view. The last slow request? I found the bottleneck in 30 seconds.
The Problem
20 microservices, one slow request:
- API Gateway → Auth → User → Cart → Inventory → Pricing → Payment → Order
- Which service is slow?
- Checking logs manually: 3 hours
- No visibility into service dependencies
We were blind.
Jaeger Overview
Components:
- Agent: Receives spans from instrumented services over UDP and forwards them to the collector
- Collector: Validates and processes spans, then writes them to storage
- Query: Serves the UI and API for searching traces
- Storage: Cassandra or Elasticsearch
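In practice each instrumented service sends spans over UDP to a nearby agent, which forwards them to the collector. A minimal sketch of pointing the Python client at a non-default agent (the `local_agent` keys are standard jaeger-client config; the hostname below is an assumption for illustration):
from jaeger_client import Config
config = Config(
    config={
        'sampler': {'type': 'const', 'param': 1},
        'local_agent': {
            'reporting_host': 'jaeger-agent.internal',  # assumed agent hostname; defaults to localhost
            'reporting_port': 6831,  # agent's compact Thrift UDP port
        },
    },
    service_name='user-service',
)
tracer = config.initialize_tracer()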
Installing Jaeger
All-in-one (development):
docker run -d --name jaeger \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  jaegertracing/all-in-one:1.6
Access UI: http://localhost:16686
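To confirm spans are arriving, emit a single test span and look for the service in the UI; a minimal sketch (the `smoke-test` service name and span name are arbitrary):
import time
from jaeger_client import Config
tracer = Config(
    config={'sampler': {'type': 'const', 'param': 1}, 'logging': True},
    service_name='smoke-test',
).initialize_tracer()
with tracer.start_span('hello-jaeger') as span:
    span.set_tag('test', True)
time.sleep(2)   # brief pause so buffered spans get sent over UDP
tracer.close()  # flush remaining spans and shut down the tracer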
Production Deployment
Kubernetes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: jaeger-collector
  template:
    metadata:
      labels:
        app: jaeger-collector
    spec:
      containers:
        - name: jaeger-collector
          image: jaegertracing/jaeger-collector:1.6
          env:
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200
          ports:
            - containerPort: 14268
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector
spec:
  selector:
    app: jaeger-collector
  ports:
    - port: 14268
      targetPort: 14268
Instrumenting Python Service
Install:
pip install jaeger-client opentracing-instrumentation
Initialize tracer:
from jaeger_client import Config
def init_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
            'reporter_batch_size': 1,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()
tracer = init_tracer('user-service')
Trace a function:
from opentracing.ext import tags
def get_user(user_id):
    with tracer.start_span('get_user') as span:
        span.set_tag(tags.SPAN_KIND, tags.SPAN_KIND_RPC_SERVER)
        span.set_tag('user_id', user_id)
        # Database query in its own child span
        with tracer.start_span('db_query', child_of=span) as db_span:
            db_span.set_tag(tags.DATABASE_TYPE, 'postgresql')
            user = db.query(User).filter_by(id=user_id).first()
        if not user:
            span.set_tag(tags.ERROR, True)
            span.log_kv({'event': 'user_not_found', 'user_id': user_id})
            raise ValueError(f'User {user_id} not found')
        return user
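Wrapping every function body in `with tracer.start_span(...)` gets repetitive. A hypothetical decorator (my own helper, not part of jaeger-client) built on OpenTracing's `start_active_span` can factor the pattern out:
import functools
def traced(operation_name):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # start_active_span registers the span with the scope manager
            with tracer.start_active_span(operation_name) as scope:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    scope.span.set_tag('error', True)
                    scope.span.log_kv({'event': 'error', 'message': str(exc)})
                    raise
        return wrapper
    return decorator
@traced('get_user')
def get_user_decorated(user_id):  # illustrative variant of get_user above
    return db.query(User).filter_by(id=user_id).first()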
Flask Integration
from flask import Flask, jsonify
from flask_opentracing import FlaskTracing
app = Flask(__name__)
tracing = FlaskTracing(init_tracer('api-gateway'), True, app)
@app.route('/api/users/<int:user_id>')
def get_user_api(user_id):
    # Automatically traced!
    user = get_user(user_id)
    return jsonify(user)
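With `trace_all_requests=True`, every route already gets a server-side span; flask_opentracing exposes it via `FlaskTracing.get_span`, so you can add tags from inside a handler. A sketch, assuming that API and an illustrative extra endpoint:
from flask import request
@app.route('/api/users/<int:user_id>/details')
def get_user_details(user_id):
    span = tracing.get_span(request)  # span created automatically by FlaskTracing
    if span is not None:
        span.set_tag('user_id', user_id)
    return jsonify(get_user(user_id))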
Propagating Context
Pass trace context to downstream services:
import requests
from opentracing.propagation import Format
def call_user_service(user_id):
    with tracer.start_span('call_user_service') as span:
        headers = {}
        tracer.inject(span.context, Format.HTTP_HEADERS, headers)
        response = requests.get(
            f'http://user-service/users/{user_id}',
            headers=headers
        )
        return response.json()
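Injecting headers by hand at every call site is easy to forget, so a small wrapper around requests can do it consistently. A sketch (the `traced_get` helper is mine, not a library function; it assumes the module-level `tracer`):
import requests
from opentracing.ext import tags
from opentracing.propagation import Format
def traced_get(operation_name, url, **kwargs):
    with tracer.start_span(operation_name) as span:
        span.set_tag(tags.SPAN_KIND, tags.SPAN_KIND_RPC_CLIENT)
        span.set_tag(tags.HTTP_URL, url)
        headers = kwargs.pop('headers', {})
        tracer.inject(span.context, Format.HTTP_HEADERS, headers)
        response = requests.get(url, headers=headers, **kwargs)
        span.set_tag(tags.HTTP_STATUS_CODE, response.status_code)
        return response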
Receive context:
from flask import request, jsonify
from opentracing.propagation import Format
@app.route('/users/<int:user_id>')
def get_user_endpoint(user_id):
    span_ctx = tracer.extract(Format.HTTP_HEADERS, request.headers)
    with tracer.start_span('get_user', child_of=span_ctx) as span:
        user = get_user(user_id)
        return jsonify(user)
Go Service Instrumentation
package main
import (
    "context"
    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go"
    "github.com/uber/jaeger-client-go/config"
)
func initTracer(serviceName string) opentracing.Tracer {
    cfg := config.Configuration{
        ServiceName: serviceName,
        Sampler: &config.SamplerConfig{
            Type:  "const",
            Param: 1,
        },
        Reporter: &config.ReporterConfig{
            LogSpans: true,
        },
    }
    // Ignoring the closer and error for brevity; close the tracer on shutdown in real code.
    tracer, _, _ := cfg.NewTracer(config.Logger(jaeger.StdLogger))
    // Register globally so StartSpanFromContext below can find it.
    opentracing.SetGlobalTracer(tracer)
    return tracer
}
// User and db are application-specific and omitted here.
func getUser(ctx context.Context, userID int) (*User, error) {
    span, ctx := opentracing.StartSpanFromContext(ctx, "get_user")
    defer span.Finish()
    span.SetTag("user_id", userID)
    // Database query in its own child span
    dbSpan := opentracing.StartSpan("db_query", opentracing.ChildOf(span.Context()))
    user, err := db.GetUser(userID)
    dbSpan.Finish()
    if err != nil {
        span.SetTag("error", true)
        span.LogKV("event", "error", "message", err.Error())
        return nil, err
    }
    return user, nil
}
Sampling Strategies
Constant (sample all):
'sampler': {
    'type': 'const',
    'param': 1,  # 1 = 100%, 0 = 0%
}
Probabilistic (sample percentage):
'sampler': {
    'type': 'probabilistic',
    'param': 0.1,  # 10% of traces
}
Rate limiting (max traces per second):
'sampler': {
    'type': 'ratelimiting',
    'param': 100,  # 100 traces/sec
}
Analyzing Traces
Jaeger UI shows:
- Timeline: Visual representation of spans
- Duration: Time spent in each service
- Dependencies: Service call graph
- Errors: Failed spans highlighted
Example trace:
API Gateway (200ms)
├── Auth Service (50ms)
├── User Service (30ms)
├── Cart Service (20ms)
└── Inventory Service (100ms) ← Bottleneck!
└── Database Query (95ms) ← Root cause!
Custom Tags and Logs
Add context:
span.set_tag('user_id', user_id)
span.set_tag('cart_size', len(cart.items))
span.set_tag('payment_method', 'credit_card')
span.log_kv({
    'event': 'payment_processed',
    'amount': 99.99,
    'currency': 'USD'
})
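Both calls assume an active span in scope; pulled together, a checkout handler might look like this sketch (`process_payment` and the cart fields are illustrative, not from the original code):
def checkout(user_id, cart):
    with tracer.start_span('checkout') as span:
        span.set_tag('user_id', user_id)
        span.set_tag('cart_size', len(cart.items))
        span.set_tag('payment_method', 'credit_card')
        result = process_payment(cart.total)  # illustrative payment call
        span.log_kv({
            'event': 'payment_processed',
            'amount': cart.total,
            'currency': 'USD',
        })
        return result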
Error Tracking
try:
    result = process_payment(amount)
except PaymentError as e:
    # `span` is the current active span (e.g. the one opened in get_user above)
    span.set_tag(tags.ERROR, True)
    span.log_kv({
        'event': 'error',
        'error.kind': type(e).__name__,
        'error.object': str(e),
        'message': 'Payment processing failed'
    })
    raise
Service Dependencies
Jaeger automatically builds dependency graph:
API Gateway
├── Auth Service
├── User Service
│ └── Database
├── Cart Service
│ ├── Redis
│ └── Inventory Service
│ └── Database
└── Payment Service
└── External Payment API
Performance Analysis
Find slow spans:
- Filter by duration (e.g., > 1s)
- Sort by duration
- Identify common patterns
- Optimize bottlenecks
Example findings:
- Database queries: 60% of request time
- External API calls: 25%
- Business logic: 15%
Integration with Prometheus
Export metrics:
from prometheus_client import Histogram
request_duration = Histogram(
    'request_duration_seconds',
    'Request duration',
    ['service', 'endpoint']
)
def get_user(user_id):
    with tracer.start_span('get_user') as span:
        with request_duration.labels('user-service', 'get_user').time():
            # ... implementation
            pass
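Prometheus still needs an endpoint to scrape those metrics from; prometheus_client's built-in HTTP server is the simplest option (a sketch; port 8000 is arbitrary):
from prometheus_client import start_http_server
start_http_server(8000)  # exposes /metrics for Prometheus to scrape; call once at service startup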
Real-World Debugging
Scenario: Checkout taking 5 seconds
Before Jaeger (3 hours):
- Check API Gateway logs
- Check Auth Service logs
- Check User Service logs
- … (repeat for 20 services)
- Finally find slow query in Inventory Service
With Jaeger (30 seconds):
- Search for slow traces (> 5s)
- Open trace
- See Inventory Service taking 4.8s
- See database query taking 4.7s
- Optimize query
Results
Before:
- Debugging: 2-3 hours per issue
- No visibility into service dependencies
- Blind to performance bottlenecks
After:
- Debugging: 5-10 minutes per issue
- Complete service dependency map
- Identify bottlenecks instantly
Lessons Learned
- Instrument early - Add tracing from day one
- Sample wisely - 100% in dev, 10% in prod
- Add context - Tags and logs are invaluable
- Propagate context - Essential for distributed tracing
- Use with metrics - Tracing + metrics = complete picture
Conclusion
Distributed tracing is essential for microservices. Jaeger makes it easy.
Key takeaways:
- Trace requests across all services
- Identify bottlenecks visually
- Add custom tags and logs
- Sample appropriately for production
- Combine with metrics and logs
Stop debugging blind. Implement distributed tracing today.