Our payment service went down. Within 2 minutes, every one of our services was down: a classic cascading failure. Users couldn't even browse products, and recovery took 30 minutes.

So I implemented circuit breakers. Now when one service fails, the others stay healthy. The last payment service outage? Only payments were affected; everything else worked fine.

The Problem

Without circuit breakers:

  • Payment service down
  • Order service keeps calling it (timeouts)
  • Order service threads exhausted
  • Order service down
  • Cart service calls order service (timeouts)
  • Cart service down
  • Entire system down in 2 minutes

Cascading failure is devastating.

Circuit Breaker States

Three states (a code sketch follows the list):

Closed (normal):

  • Requests pass through
  • Failures counted
  • If failures exceed threshold → Open

Open (failing):

  • Requests fail immediately
  • No calls to downstream service
  • After timeout → Half-Open

Half-Open (testing):

  • Limited requests pass through
  • If successful → Closed
  • If failed → Open
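
To make the transitions concrete, here is a minimal, library-agnostic sketch of the state machine in Python. The class name and thresholds are illustrative only, not taken from any of the libraries used below:

import time

class SimpleCircuitBreaker:
    """Toy circuit breaker showing the three states; not production code."""

    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = 'closed'
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == 'open':
            if time.time() - self.opened_at >= self.recovery_timeout:
                self.state = 'half_open'   # let a trial request through
            else:
                raise RuntimeError('circuit open: failing fast')
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_success(self):
        # A success in half-open (or closed) resets the breaker
        self.failure_count = 0
        self.state = 'closed'

    def _record_failure(self):
        self.failure_count += 1
        if self.state == 'half_open' or self.failure_count >= self.failure_threshold:
            self.state = 'open'
            self.opened_at = time.time()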

Implementing with resilience4j

Install (Java):

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-circuitbreaker</artifactId>
    <version>0.13.2</version>
</dependency>

Basic usage:

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)  // Open if 50% fail
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .ringBufferSizeInClosedState(10)
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("paymentService", config);

// Wrap function call
Supplier<Payment> supplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> paymentService.processPayment(order));

try {
    Payment payment = supplier.get();
} catch (Exception e) {
    // Circuit open or call failed
    // Use fallback
}

Python Implementation

Using pybreaker:

pip install pybreaker

import requests
from pybreaker import CircuitBreaker, CircuitBreakerError

# Configure circuit breaker
payment_breaker = CircuitBreaker(
    fail_max=5,        # Open after 5 failures
    reset_timeout=30,  # Stay open for 30 seconds
    name='payment_service'
)

@payment_breaker
def process_payment(order_id, amount):
    response = requests.post(
        'http://payment-service/process',
        json={'order_id': order_id, 'amount': amount},
        timeout=5
    )
    response.raise_for_status()
    return response.json()

# Use with fallback
try:
    payment = process_payment(order_id, amount)
except CircuitBreakerError:
    # Circuit is open
    payment = {'status': 'pending', 'message': 'Payment service unavailable'}

Go Implementation

package main

import (
    "github.com/sony/gobreaker"
    "time"
)

var paymentBreaker *gobreaker.CircuitBreaker

func init() {
    settings := gobreaker.Settings{
        Name:        "payment",
        MaxRequests: 3,                // trial requests allowed while half-open
        Interval:    time.Second * 10, // how often counts reset while closed
        Timeout:     time.Second * 30, // how long to stay open before half-open
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
            return counts.Requests >= 3 && failureRatio >= 0.6
        },
    }
    
    paymentBreaker = gobreaker.NewCircuitBreaker(settings)
}

func processPayment(orderID string, amount float64) (*Payment, error) {
    result, err := paymentBreaker.Execute(func() (interface{}, error) {
        return callPaymentService(orderID, amount)
    })
    
    if err != nil {
        // Circuit open or call failed
        return nil, err
    }
    
    return result.(*Payment), nil
}

Fallback Strategies

1. Return cached data:

@payment_breaker
def get_user_profile(user_id):
    return requests.get(f'http://user-service/users/{user_id}').json()

def get_user_profile_with_fallback(user_id):
    try:
        return get_user_profile(user_id)
    except CircuitBreakerError:
        # Return cached profile
        return cache.get(f'user:{user_id}')

2. Return default value:

try:
    recommendations = get_recommendations(user_id)
except CircuitBreakerError:
    # Return popular items instead
    recommendations = get_popular_items()

3. Degrade gracefully:

try:
    personalized_content = get_personalized_content(user_id)
except CircuitBreakerError:
    # Show generic content
    personalized_content = get_generic_content()

Monitoring Circuit Breakers

Expose metrics:

from prometheus_client import Gauge, Counter

circuit_state = Gauge('circuit_breaker_state', 'Circuit breaker state', ['service'])
circuit_failures = Counter('circuit_breaker_failures', 'Circuit breaker failures', ['service'])

def update_metrics():
    # pybreaker reports 'closed', 'open', or 'half-open'
    state_map = {'closed': 0, 'open': 1, 'half-open': 0.5}
    circuit_state.labels('payment_service').set(state_map[payment_breaker.current_state])

    if payment_breaker.current_state == 'open':
        circuit_failures.labels('payment_service').inc()
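
Polling works, but pybreaker can also push events as they happen through listeners. A sketch, assuming the metric objects defined above (attach the listener with add_listener, or pass listeners=[...] when constructing the breaker):

from pybreaker import CircuitBreakerListener

class MetricsListener(CircuitBreakerListener):
    """Update Prometheus metrics on breaker events instead of polling."""

    def state_change(self, cb, old_state, new_state):
        state_map = {'closed': 0, 'open': 1, 'half-open': 0.5}
        circuit_state.labels(cb.name).set(state_map.get(new_state.name, -1))

    def failure(self, cb, exc):
        # Called for every failed call counted by the breaker
        circuit_failures.labels(cb.name).inc()

payment_breaker.add_listener(MetricsListener())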

Prometheus alert:

groups:
- name: circuit_breaker
  rules:
  - alert: CircuitBreakerOpen
    expr: circuit_breaker_state{service="payment_service"} == 1
    for: 1m
    annotations:
      summary: "Circuit breaker open for {{ $labels.service }}"

Combining with Retry

Retries and the breaker compose naturally. With @retry on the outside, each attempt still passes through the breaker, so repeated failures count toward opening it, and once the circuit is open the retries fail fast with CircuitBreakerError:

from retrying import retry

@retry(stop_max_attempt_number=3, wait_fixed=1000)
@payment_breaker
def process_payment(order_id, amount):
    return requests.post(
        'http://payment-service/process',
        json={'order_id': order_id, 'amount': amount}
    ).json()

Timeout Configuration

Set appropriate timeouts:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)

@payment_breaker
def process_payment(order_id, amount):
    return session.post(
        'http://payment-service/process',
        json={'order_id': order_id, 'amount': amount},
        timeout=5  # 5 second timeout
    ).json()

Bulkhead Pattern

Isolate thread pools:

// Requires the resilience4j-bulkhead module
import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;
import java.util.concurrent.CompletionStage;
import java.util.function.Supplier;

ThreadPoolBulkhead bulkhead = ThreadPoolBulkhead.of("paymentService",
    ThreadPoolBulkheadConfig.custom()
        .maxThreadPoolSize(10)
        .coreThreadPoolSize(5)
        .queueCapacity(20)
        .build());

// Calls run on the bulkhead's own thread pool and complete asynchronously
Supplier<CompletionStage<Payment>> supplier = ThreadPoolBulkhead
    .decorateSupplier(bulkhead, () -> paymentService.processPayment(order));

Prevents one slow service from exhausting all threads.
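
The same isolation can be approximated in Python with a small, dedicated worker pool per dependency, so a slow payment call can never occupy more than its own threads. An illustrative sketch, not a library API:

from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream dependency
payment_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix='payment')

def process_payment_bulkheaded(order_id, amount):
    # At most 10 payment calls run concurrently; other dependencies keep their own pools
    future = payment_pool.submit(process_payment, order_id, amount)
    return future.result(timeout=5)  # don't let callers block forever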

Real-World Configuration

Our production setup:

# High-priority service (strict)
payment_breaker = CircuitBreaker(
    fail_max=3,        # Open after 3 failures
    reset_timeout=60   # Stay open for 1 minute
)

# Low-priority service (lenient)
recommendation_breaker = CircuitBreaker(
    fail_max=10,       # Open after 10 failures
    reset_timeout=30   # Stay open for 30 seconds
)

# External service (very lenient)
analytics_breaker = CircuitBreaker(
    fail_max=20,
    reset_timeout=10   # Recover quickly
)

Testing Circuit Breakers

Simulate failures:

import time

import pytest
from unittest.mock import Mock, patch
from pybreaker import CircuitBreakerError
from requests.exceptions import RequestException

def test_circuit_breaker_opens():
    # Simulate 5 failures
    with patch('requests.post', side_effect=RequestException):
        for _ in range(5):
            try:
                process_payment('order-123', 99.99)
            except Exception:
                # The breaker re-raises the failure (or CircuitBreakerError once it trips)
                pass

    # Circuit should be open
    assert payment_breaker.current_state == 'open'

    # Next call should fail immediately, without touching the payment service
    with pytest.raises(CircuitBreakerError):
        process_payment('order-124', 99.99)

def test_circuit_breaker_half_open():
    # Open the circuit manually
    payment_breaker.open()

    # While open, calls fail fast
    with pytest.raises(CircuitBreakerError):
        process_payment('order-125', 99.99)

    # Wait out reset_timeout; pybreaker moves to half-open on the next call attempt
    time.sleep(31)

    # A successful trial call should close the circuit again
    with patch('requests.post', return_value=Mock(json=lambda: {'status': 'success'})):
        process_payment('order-125', 99.99)

    assert payment_breaker.current_state == 'closed'

Dashboard

Grafana dashboard:

{
  "panels": [
    {
      "title": "Circuit Breaker States",
      "targets": [
        {
          "expr": "circuit_breaker_state"
        }
      ]
    },
    {
      "title": "Failure Rate",
      "targets": [
        {
          "expr": "rate(circuit_breaker_failures[5m])"
        }
      ]
    }
  ]
}

Results

Before:

  • Cascading failures
  • Entire system down in 2 minutes
  • 30-minute recovery time
  • All users affected

After:

  • Isolated failures
  • Only affected service down
  • 30-second recovery time
  • Most users unaffected

Lessons Learned

  1. Set thresholds carefully - Too sensitive = false positives
  2. Use fallbacks - Always have a plan B
  3. Monitor state - Alert when circuits open
  4. Test regularly - Chaos engineering
  5. Different configs for different services - One size doesn’t fit all

Conclusion

Circuit breakers prevent cascading failures and enable graceful degradation. Essential for resilient microservices.

Key takeaways:

  1. Fail fast when downstream service is down
  2. Prevent resource exhaustion
  3. Automatic recovery testing
  4. Use fallbacks for better UX
  5. Monitor circuit breaker states

Protect your services. Implement circuit breakers today.