Our payment service went down. Within 2 minutes, every one of our services was down: a classic cascading failure. Users couldn't even browse products, and recovery took 30 minutes.

So I implemented circuit breakers. Now when one service fails, the others stay healthy. The last payment service outage? Only payments were affected; everything else worked fine.

The Problem

Without circuit breakers:

  • Payment service down
  • Order service keeps calling it (timeouts)
  • Order service threads exhausted
  • Order service down
  • Cart service calls order service (timeouts)
  • Cart service down
  • Entire system down in 2 minutes

Cascading failure is devastating.

Circuit Breaker States

Three states (a code sketch follows the list):

Closed (normal):

  • Requests pass through
  • Failures counted
  • If failures exceed threshold → Open

Open (failing):

  • Requests fail immediately
  • No calls to downstream service
  • After timeout → Half-Open

Half-Open (testing):

  • Limited requests pass through
  • If successful → Closed
  • If failed → Open
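
To make the transitions concrete, here is a minimal, library-agnostic sketch of the state machine in Python. The class name and thresholds are illustrative only, not taken from any of the libraries used below:

import time

class SimpleCircuitBreaker:
    """Toy circuit breaker showing the three states; not production code."""

    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = 'closed'
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == 'open':
            if time.time() - self.opened_at >= self.recovery_timeout:
                self.state = 'half_open'   # let a trial request through
            else:
                raise RuntimeError('circuit open: failing fast')
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_success(self):
        # A success in half-open (or closed) resets the breaker
        self.failure_count = 0
        self.state = 'closed'

    def _record_failure(self):
        self.failure_count += 1
        if self.state == 'half_open' or self.failure_count >= self.failure_threshold:
            self.state = 'open'
            self.opened_at = time.time()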

Implementing with resilience4j

Install (Java):

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-circuitbreaker</artifactId>
    <version>0.13.2</version>
</dependency>

Basic usage:

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)  // Open if 50% fail
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .ringBufferSizeInClosedState(10)
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("paymentService", config);

// Wrap function call
Supplier<Payment> supplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> paymentService.processPayment(order));

try {
    Payment payment = supplier.get();
} catch (Exception e) {
    // Circuit open or call failed
    // Use fallback
}

Python Implementation

Using pybreaker:

pip install pybreaker

import requests
from pybreaker import CircuitBreaker, CircuitBreakerError

# Configure circuit breaker
payment_breaker = CircuitBreaker(
    fail_max=5,        # Open after 5 failures
    reset_timeout=30,  # Stay open for 30 seconds
    name='payment_service'
)

@payment_breaker
def process_payment(order_id, amount):
    response = requests.post(
        'http://payment-service/process',
        json={'order_id': order_id, 'amount': amount},
        timeout=5
    )
    response.raise_for_status()
    return response.json()

# Use with fallback
try:
    payment = process_payment(order_id, amount)
except CircuitBreakerError:
    # Circuit is open
    payment = {'status': 'pending', 'message': 'Payment service unavailable'}

Go Implementation

package main

import (
    "github.com/sony/gobreaker"
    "time"
)

var paymentBreaker *gobreaker.CircuitBreaker

func init() {
    settings := gobreaker.Settings{
        Name:        "payment",
        MaxRequests: 3,                // trial requests allowed while half-open
        Interval:    time.Second * 10, // how often counts reset while closed
        Timeout:     time.Second * 30, // how long to stay open before half-open
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
            return counts.Requests >= 3 && failureRatio >= 0.6
        },
    }
    
    paymentBreaker = gobreaker.NewCircuitBreaker(settings)
}

func processPayment(orderID string, amount float64) (*Payment, error) {
    result, err := paymentBreaker.Execute(func() (interface{}, error) {
        return callPaymentService(orderID, amount)
    })
    
    if err != nil {
        // Circuit open or call failed
        return nil, err
    }
    
    return result.(*Payment), nil
}

Fallback Strategies

1. Return cached data:

@payment_breaker
def get_user_profile(user_id):
    return requests.get(f'http://user-service/users/{user_id}').json()

def get_user_profile_with_fallback(user_id):
    try:
        return get_user_profile(user_id)
    except CircuitBreakerError:
        # Return cached profile
        return cache.get(f'user:{user_id}')

2. Return default value:

try:
    recommendations = get_recommendations(user_id)
except CircuitBreakerError:
    # Return popular items instead
    recommendations = get_popular_items()

3. Degrade gracefully:

try:
    personalized_content = get_personalized_content(user_id)
except CircuitBreakerError:
    # Show generic content
    personalized_content = get_generic_content()

Monitoring Circuit Breakers

Expose metrics:

from prometheus_client import Gauge, Counter

circuit_state = Gauge('circuit_breaker_state', 'Circuit breaker state', ['service'])
circuit_failures = Counter('circuit_breaker_failures', 'Circuit breaker failures', ['service'])

def update_metrics():
    # pybreaker reports 'closed', 'open', or 'half-open'
    state_map = {'closed': 0, 'open': 1, 'half-open': 0.5}
    circuit_state.labels('payment_service').set(state_map[payment_breaker.current_state])

    if payment_breaker.current_state == 'open':
        circuit_failures.labels('payment_service').inc()
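
Polling works, but pybreaker can also push events as they happen through listeners. A sketch, assuming the metric objects defined above (attach the listener with add_listener, or pass listeners=[...] when constructing the breaker):

from pybreaker import CircuitBreakerListener

class MetricsListener(CircuitBreakerListener):
    """Update Prometheus metrics on breaker events instead of polling."""

    def state_change(self, cb, old_state, new_state):
        state_map = {'closed': 0, 'open': 1, 'half-open': 0.5}
        circuit_state.labels(cb.name).set(state_map.get(new_state.name, -1))

    def failure(self, cb, exc):
        # Called for every failed call counted by the breaker
        circuit_failures.labels(cb.name).inc()

payment_breaker.add_listener(MetricsListener())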

Prometheus alert:

groups:
- name: circuit_breaker
  rules:
  - alert: CircuitBreakerOpen
    expr: circuit_breaker_state{service="payment_service"} == 1
    for: 1m
    annotations:
      summary: "Circuit breaker open for {{ $labels.service }}"

Combining with Retry

Retries and the breaker compose naturally. With @retry on the outside, each attempt still passes through the breaker, so repeated failures count toward opening it, and once the circuit is open the retries fail fast with CircuitBreakerError:

from retrying import retry

@retry(stop_max_attempt_number=3, wait_fixed=1000)
@payment_breaker
def process_payment(order_id, amount):
    return requests.post(
        'http://payment-service/process',
        json={'order_id': order_id, 'amount': amount}
    ).json()

Timeout Configuration

Set appropriate timeouts:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)

@payment_breaker
def process_payment(order_id, amount):
    return session.post(
        'http://payment-service/process',
        json={'order_id': order_id, 'amount': amount},
        timeout=5  # 5 second timeout
    ).json()

Bulkhead Pattern

Isolate thread pools:

// Requires the resilience4j-bulkhead module
import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;
import java.util.concurrent.CompletionStage;
import java.util.function.Supplier;

ThreadPoolBulkhead bulkhead = ThreadPoolBulkhead.of("paymentService",
    ThreadPoolBulkheadConfig.custom()
        .maxThreadPoolSize(10)
        .coreThreadPoolSize(5)
        .queueCapacity(20)
        .build());

// Calls run on the bulkhead's own thread pool and complete asynchronously
Supplier<CompletionStage<Payment>> supplier = ThreadPoolBulkhead
    .decorateSupplier(bulkhead, () -> paymentService.processPayment(order));

Prevents one slow service from exhausting all threads.
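
The same isolation can be approximated in Python with a small, dedicated worker pool per dependency, so a slow payment call can never occupy more than its own threads. An illustrative sketch, not a library API:

from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream dependency
payment_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix='payment')

def process_payment_bulkheaded(order_id, amount):
    # At most 10 payment calls run concurrently; other dependencies keep their own pools
    future = payment_pool.submit(process_payment, order_id, amount)
    return future.result(timeout=5)  # don't let callers block forever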

Real-World Configuration

Our production setup:

# High-priority service (strict)
payment_breaker = CircuitBreaker(
    fail_max=3,        # Open after 3 failures
    reset_timeout=60   # Stay open for 1 minute
)

# Low-priority service (lenient)
recommendation_breaker = CircuitBreaker(
    fail_max=10,       # Open after 10 failures
    reset_timeout=30   # Stay open for 30 seconds
)

# External service (very lenient)
analytics_breaker = CircuitBreaker(
    fail_max=20,
    reset_timeout=10   # Recover quickly
)

Testing Circuit Breakers

Simulate failures:

import time

import pytest
from unittest.mock import Mock, patch
from pybreaker import CircuitBreakerError
from requests.exceptions import RequestException

def test_circuit_breaker_opens():
    # Simulate 5 failures
    with patch('requests.post', side_effect=RequestException):
        for _ in range(5):
            try:
                process_payment('order-123', 99.99)
            except Exception:
                # The breaker re-raises the failure (or CircuitBreakerError once it trips)
                pass

    # Circuit should be open
    assert payment_breaker.current_state == 'open'

    # Next call should fail immediately, without touching the payment service
    with pytest.raises(CircuitBreakerError):
        process_payment('order-124', 99.99)

def test_circuit_breaker_half_open():
    # Open the circuit manually
    payment_breaker.open()

    # While open, calls fail fast
    with pytest.raises(CircuitBreakerError):
        process_payment('order-125', 99.99)

    # Wait out reset_timeout; pybreaker moves to half-open on the next call attempt
    time.sleep(31)

    # A successful trial call should close the circuit again
    with patch('requests.post', return_value=Mock(json=lambda: {'status': 'success'})):
        process_payment('order-125', 99.99)

    assert payment_breaker.current_state == 'closed'

Dashboard

Grafana dashboard:

{
  "panels": [
    {
      "title": "Circuit Breaker States",
      "targets": [
        {
          "expr": "circuit_breaker_state"
        }
      ]
    },
    {
      "title": "Failure Rate",
      "targets": [
        {
          "expr": "rate(circuit_breaker_failures[5m])"
        }
      ]
    }
  ]
}

Results

Before:

  • Cascading failures
  • Entire system down in 2 minutes
  • 30-minute recovery time
  • All users affected

After:

  • Isolated failures
  • Only affected service down
  • 30-second recovery time
  • Most users unaffected

Lessons Learned

  1. Set thresholds carefully - Too sensitive = false positives
  2. Use fallbacks - Always have a plan B
  3. Monitor state - Alert when circuits open
  4. Test regularly - Chaos engineering
  5. Different configs for different services - One size doesn’t fit all

Conclusion

Circuit breakers prevent cascading failures and enable graceful degradation. Essential for resilient microservices.

Key takeaways:

  1. Fail fast when downstream service is down
  2. Prevent resource exhaustion
  3. Automatic recovery testing
  4. Use fallbacks for better UX
  5. Monitor circuit breaker states

Protect your services. Implement circuit breakers today.