Circuit Breaker Pattern: Preventing Cascading Failures in Microservices
Payment service went down. Within 2 minutes, all our services were down. Cascading failure. Users couldn’t even browse products. Recovery took 30 minutes.
I implemented circuit breakers. Now when a service fails, others stay healthy. Last payment service outage? Only payment affected, everything else worked fine.
The Problem
Without circuit breakers:
- Payment service down
- Order service keeps calling it (timeouts)
- Order service threads exhausted
- Order service down
- Cart service calls order service (timeouts)
- Cart service down
- Entire system down in 2 minutes
Cascading failure is devastating.
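To see why, look at what one synchronous call to a dead dependency does to the caller's thread pool. A minimal sketch (service names and numbers are illustrative):

import requests

def create_order(order):
    # If payment-service is down, this call blocks a worker thread for
    # the full 5-second timeout before failing. With a 50-thread pool and
    # steady traffic, every worker is soon stuck waiting here, and
    # order-service stops answering all requests - even ones that never
    # touch payments.
    response = requests.post('http://payment-service/process',
                             json=order, timeout=5)
    response.raise_for_status()
    return response.json()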
Circuit Breaker States
Three states:
Closed (normal):
- Requests pass through
- Failures counted
- If failures exceed threshold → Open
Open (failing):
- Requests fail immediately
- No calls to downstream service
- After timeout → Half-Open
Half-Open (testing):
- Limited requests pass through
- If successful → Closed
- If failed → Open
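The whole machine fits in a few dozen lines. A toy Python version to make the transitions concrete (illustrative only; use a library like those below in production):

import time

class MinimalCircuitBreaker:
    """Toy implementation of the closed/open/half-open state machine."""

    def __init__(self, failure_threshold=5, open_timeout=30):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.failures = 0
        self.state = 'closed'
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == 'open':
            if time.monotonic() - self.opened_at >= self.open_timeout:
                self.state = 'half_open'  # let a trial request through
            else:
                raise RuntimeError('circuit open: failing fast')
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Success in half-open (or closed) resets the breaker
        self.failures = 0
        self.state = 'closed'

    def _on_failure(self):
        self.failures += 1
        if self.state == 'half_open' or self.failures >= self.failure_threshold:
            self.state = 'open'
            self.opened_at = time.monotonic()

Real implementations also cap how many trial requests half-open lets through; this sketch skips that for brevity.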
Implementing with resilience4j
Install (Java):
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-circuitbreaker</artifactId>
    <version>0.13.2</version>
</dependency>
Basic usage:
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // Open if 50% of calls fail
    .waitDurationInOpenState(Duration.ofSeconds(30)) // Stay open for 30 seconds
    .ringBufferSizeInClosedState(10)                 // Evaluate failure rate over the last 10 calls
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("paymentService", config);

// Wrap the function call
Supplier<Payment> supplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> paymentService.processPayment(order));

try {
    Payment payment = supplier.get();
} catch (Exception e) {
    // Circuit open or call failed
    // Use fallback
}
Python Implementation
Using pybreaker:
pip install pybreaker
import requests
from pybreaker import CircuitBreaker, CircuitBreakerError

# Configure circuit breaker
payment_breaker = CircuitBreaker(
    fail_max=5,        # Open after 5 consecutive failures
    reset_timeout=30,  # Stay open for 30 seconds
    name='payment_service'
)

@payment_breaker
def process_payment(order_id, amount):
    response = requests.post(
        'http://payment-service/process',
        json={'order_id': order_id, 'amount': amount},
        timeout=5
    )
    response.raise_for_status()
    return response.json()

# Use with fallback
try:
    payment = process_payment(order_id, amount)
except CircuitBreakerError:
    # Circuit is open
    payment = {'status': 'pending', 'message': 'Payment service unavailable'}
Go Implementation
package main

import (
	"time"

	"github.com/sony/gobreaker"
)

var paymentBreaker *gobreaker.CircuitBreaker

func init() {
	settings := gobreaker.Settings{
		Name:        "payment",
		MaxRequests: 3,                // Requests allowed through while half-open
		Interval:    time.Second * 10, // Clear failure counts every 10s while closed
		Timeout:     time.Second * 30, // Open -> half-open after 30s
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
			return counts.Requests >= 3 && failureRatio >= 0.6
		},
	}
	paymentBreaker = gobreaker.NewCircuitBreaker(settings)
}

func processPayment(orderID string, amount float64) (*Payment, error) {
	result, err := paymentBreaker.Execute(func() (interface{}, error) {
		return callPaymentService(orderID, amount)
	})
	if err != nil {
		// Circuit open or call failed
		return nil, err
	}
	return result.(*Payment), nil
}
Fallback Strategies
1. Return cached data:
@payment_breaker
def get_user_profile(user_id):
    return requests.get(f'http://user-service/users/{user_id}').json()

def get_user_profile_with_fallback(user_id):
    try:
        return get_user_profile(user_id)
    except CircuitBreakerError:
        # Return cached profile
        return cache.get(f'user:{user_id}')
2. Return default value:
try:
    recommendations = get_recommendations(user_id)
except CircuitBreakerError:
    # Return popular items instead
    recommendations = get_popular_items()
3. Degrade gracefully:
try:
    personalized_content = get_personalized_content(user_id)
except CircuitBreakerError:
    # Show generic content
    personalized_content = get_generic_content()
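These strategies compose. Here is a small helper (hypothetical, not from any library) that tries an ordered list of fallbacks whenever the circuit is open:

from pybreaker import CircuitBreakerError

def with_fallbacks(primary, *fallbacks):
    """Call primary; if the circuit is open, walk the fallbacks in order."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except CircuitBreakerError:
            for fallback in fallbacks:
                try:
                    return fallback(*args, **kwargs)
                except Exception:
                    continue  # try the next fallback
            raise
    return call

# cache is assumed to be an app-provided client (e.g. Redis)
get_profile = with_fallbacks(
    get_user_profile,
    lambda user_id: cache.get(f'user:{user_id}'),
)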
Monitoring Circuit Breakers
Expose metrics:
from prometheus_client import Counter, Gauge

circuit_state = Gauge('circuit_breaker_state', 'Circuit breaker state', ['service'])
circuit_failures = Counter('circuit_breaker_failures', 'Circuit breaker failures', ['service'])

def update_metrics():
    # pybreaker reports its state as 'closed', 'open', or 'half-open'
    state_map = {'closed': 0, 'open': 1, 'half-open': 0.5}
    circuit_state.labels('payment_service').set(state_map[payment_breaker.current_state])
    if payment_breaker.current_state == 'open':
        circuit_failures.labels('payment_service').inc()
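Polling can miss brief open/close flaps, so instead of (or in addition to) update_metrics, pybreaker's listener hooks can push every transition as it happens. A sketch reusing the gauges above:

import pybreaker

class PrometheusListener(pybreaker.CircuitBreakerListener):
    STATE_VALUES = {'closed': 0, 'half-open': 0.5, 'open': 1}

    def state_change(self, cb, old_state, new_state):
        # Invoked by pybreaker on every state transition
        circuit_state.labels(cb.name).set(self.STATE_VALUES[new_state.name])

    def failure(self, cb, exc):
        # Invoked whenever a wrapped call fails
        circuit_failures.labels(cb.name).inc()

payment_breaker.add_listener(PrometheusListener())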
Prometheus alert:
groups:
  - name: circuit_breaker
    rules:
      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state{service="payment_service"} == 1
        for: 1m
        annotations:
          summary: "Circuit breaker open for {{ $labels.service }}"
Combining with Retry
Retry before opening the circuit. With @retry outermost, every failed attempt counts toward the breaker's failure threshold; keep the attempt count low, because once the circuit opens the retry decorator will also retry the fast-failing CircuitBreakerError:
import requests
from retrying import retry

@retry(stop_max_attempt_number=3, wait_fixed=1000)  # 3 attempts, 1s apart
@payment_breaker
def process_payment(order_id, amount):
    return requests.post(
        'http://payment-service/process',
        json={'order_id': order_id, 'amount': amount},
        timeout=5
    ).json()
Timeout Configuration
Set appropriate timeouts, and pair them with transport-level retries for transient errors:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

retry_strategy = Retry(
    total=3,
    backoff_factor=1,  # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)

@payment_breaker
def process_payment(order_id, amount):
    return session.post(
        'http://payment-service/process',
        json={'order_id': order_id, 'amount': amount},
        timeout=5  # 5-second timeout; never wait on a dead service indefinitely
    ).json()
Bulkhead Pattern
Isolate thread pools:
import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;

import java.util.concurrent.CompletionStage;
import java.util.function.Supplier;

ThreadPoolBulkhead bulkhead = ThreadPoolBulkhead.of("paymentService",
    ThreadPoolBulkheadConfig.custom()
        .maxThreadPoolSize(10)
        .coreThreadPoolSize(5)
        .queueCapacity(20)
        .build());

// The decorated call runs on the bulkhead's own thread pool,
// so it yields a CompletionStage rather than the raw value
Supplier<CompletionStage<Payment>> supplier = ThreadPoolBulkhead
    .decorateSupplier(bulkhead, () -> paymentService.processPayment(order));
Prevents one slow service from exhausting all threads.
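The same isolation works in Python by giving each dependency its own small, bounded thread pool (a sketch; the pool size and timeout are illustrative):

from concurrent.futures import ThreadPoolExecutor

# Dedicated pool for payment calls: a slow payment-service can tie up
# at most these 5 threads, never the web workers' pool
payment_pool = ThreadPoolExecutor(max_workers=5, thread_name_prefix='payment')

def process_payment_isolated(order_id, amount):
    future = payment_pool.submit(process_payment, order_id, amount)
    # Bound the wait slightly above the HTTP timeout
    return future.result(timeout=6)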
Real-World Configuration
Our production setup:
# High-priority service (strict)
# (pybreaker's exclude= can keep business exceptions from tripping the breaker)
payment_breaker = CircuitBreaker(
    fail_max=3,        # Open after 3 failures
    reset_timeout=60,  # Stay open for 1 minute
    name='payment_service'
)

# Low-priority service (lenient)
recommendation_breaker = CircuitBreaker(
    fail_max=10,       # Open after 10 failures
    reset_timeout=30,  # Stay open for 30 seconds
    name='recommendation_service'
)

# External service (very lenient)
analytics_breaker = CircuitBreaker(
    fail_max=20,
    reset_timeout=10,  # Recover quickly
    name='analytics_service'
)
Testing Circuit Breakers
Simulate failures:
import time
from unittest.mock import Mock, patch

import pytest
from pybreaker import CircuitBreakerError
from requests.exceptions import RequestException

def test_circuit_breaker_opens():
    # Simulate 5 consecutive failures
    with patch('requests.post', side_effect=RequestException):
        for _ in range(5):
            try:
                process_payment('order-123', 99.99)
            except Exception:
                pass  # the call that trips the breaker raises CircuitBreakerError

    # Circuit should be open
    assert payment_breaker.current_state == 'open'

    # Next call should fail immediately, without touching the network
    with pytest.raises(CircuitBreakerError):
        process_payment('order-124', 99.99)

def test_circuit_breaker_recovers_via_half_open():
    # Force the circuit open
    payment_breaker.open()

    # Wait for the reset timeout to elapse
    time.sleep(31)

    # pybreaker moves to half-open on the next call, not on a timer;
    # a successful trial call closes the circuit again
    response = Mock(json=lambda: {'status': 'success'})
    with patch('requests.post', return_value=response):
        process_payment('order-125', 99.99)

    assert payment_breaker.current_state == 'closed'
Dashboard
Grafana dashboard:
{
  "panels": [
    {
      "title": "Circuit Breaker States",
      "targets": [
        { "expr": "circuit_breaker_state" }
      ]
    },
    {
      "title": "Failure Rate",
      "targets": [
        { "expr": "rate(circuit_breaker_failures_total[5m])" }
      ]
    }
  ]
}
Results
Before:
- Cascading failures
- Entire system down in 2 minutes
- 30-minute recovery time
- All users affected
After:
- Isolated failures
- Only affected service down
- 30-second recovery time
- Most users unaffected
Lessons Learned
- Set thresholds carefully - Too sensitive causes false positives; too lenient defeats fail-fast
- Use fallbacks - Always have a plan B
- Monitor state - Alert whenever a circuit opens
- Test regularly - Chaos engineering catches what unit tests miss
- Use different configs for different services - One size doesn't fit all
Conclusion
Circuit breakers prevent cascading failures and enable graceful degradation. Essential for resilient microservices.
Key takeaways:
- Fail fast when downstream service is down
- Prevent resource exhaustion
- Automatic recovery testing
- Use fallbacks for better UX
- Monitor circuit breaker states
Protect your services. Implement circuit breakers today.