
The Circuit Breaker Pattern: Protecting Agent Systems from Cascading Failures

Implement the Circuit Breaker pattern to protect AI agent systems from cascading failures with automatic failure detection, open/half-open/closed states, and graceful recovery.

Why Agent Systems Need Circuit Breakers

AI agents depend on external services — LLM APIs, databases, tool endpoints — that can fail or slow down. Without protection, a failing dependency causes the agent to hang or error repeatedly, consuming resources and potentially bringing down the entire system. The Circuit Breaker pattern detects sustained failures and stops making requests to the failing service, allowing it time to recover while the agent falls back to alternative behavior.

The name comes from electrical engineering: when a circuit experiences an overload, the breaker trips open to prevent damage. Once conditions stabilize, the breaker closes and normal operation resumes.

The Three States

  1. Closed: Normal operation. Requests flow through, and failures are counted against a threshold.
  2. Open: The breaker has tripped. Every request is rejected immediately, either by raising an error or by returning a fallback response. No calls are made to the downstream service.
  3. Half-Open: After a cooldown period, the breaker lets a limited number of test requests through. If enough of them succeed, the breaker closes. If any of them fails, it opens again.
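
The transitions between these three states can be sketched as a small lookup table. This is only an illustrative model: the `State` and `Event` names and the `next_state` helper are assumptions introduced here, separate from the implementation that follows.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Event(Enum):  # hypothetical event names, for illustration only
    FAILURE_THRESHOLD_REACHED = "failure_threshold_reached"
    COOLDOWN_ELAPSED = "cooldown_elapsed"
    ENOUGH_PROBES_SUCCEEDED = "enough_probes_succeeded"
    PROBE_FAILED = "probe_failed"


# Every legal transition; any other (state, event) pair is a no-op.
TRANSITIONS = {
    (State.CLOSED, Event.FAILURE_THRESHOLD_REACHED): State.OPEN,
    (State.OPEN, Event.COOLDOWN_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.ENOUGH_PROBES_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.PROBE_FAILED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the state after `event`; unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Note that there is no direct path from Open to Closed: the breaker always passes through Half-Open first, so a recovering service is probed with a few requests before full traffic returns.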

Implementation

from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Callable, Any
import threading


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitStats:
    total_calls: int = 0
    failures: int = 0
    successes: int = 0
    last_failure_time: datetime | None = None
    last_success_time: datetime | None = None


class CircuitBreaker:
    def __init__(
        self,
        name: str,
        failure_threshold: int = 5,
        recovery_timeout: int = 30,
        half_open_max_calls: int = 3,
        success_threshold: int = 2,
    ):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = timedelta(seconds=recovery_timeout)
        self.half_open_max_calls = half_open_max_calls
        self.success_threshold = success_threshold

        self._state = CircuitState.CLOSED
        self._stats = CircuitStats()
        self._half_open_calls = 0
        self._half_open_successes = 0
        self._lock = threading.Lock()
        self._opened_at: datetime | None = None

    @property
    def state(self) -> CircuitState:
        with self._lock:
            if (self._state == CircuitState.OPEN
                    and self._opened_at
                    and datetime.now() - self._opened_at
                    >= self.recovery_timeout):
                self._transition_to(CircuitState.HALF_OPEN)
            return self._state

    def _transition_to(self, new_state: CircuitState):
        old = self._state
        self._state = new_state
        print(f"[{self.name}] Circuit: {old.value} -> {new_state.value}")

        if new_state == CircuitState.OPEN:
            self._opened_at = datetime.now()
        elif new_state == CircuitState.HALF_OPEN:
            self._half_open_calls = 0
            self._half_open_successes = 0
        elif new_state == CircuitState.CLOSED:
            self._stats.failures = 0

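    def call(self, func: Callable, *args,
             fallback: Callable | None = None,
             **kwargs) -> Any:
        # Reading .state moves OPEN -> HALF_OPEN once the cooldown has elapsed.
        _ = self.state

        with self._lock:
            rejected = self._state == CircuitState.OPEN
            if self._state == CircuitState.HALF_OPEN:
                # Admit at most half_open_max_calls probe requests.
                if self._half_open_calls >= self.half_open_max_calls:
                    rejected = True
                else:
                    self._half_open_calls += 1

        if rejected:
            if fallback:
                return fallback(*args, **kwargs)
            raise CircuitOpenError(
                f"Circuit '{self.name}' is not accepting calls. "
                f"Retry after {self.recovery_timeout.total_seconds():.0f}s."
            )

        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the failure, then fall back or re-raise the original error.
            self._on_failure()
            if fallback:
                return fallback(*args, **kwargs)
            raise
        else:
            self._on_success()
            return result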

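    def _on_success(self):
        with self._lock:
            self._stats.successes += 1
            self._stats.total_calls += 1
            self._stats.last_success_time = datetime.now()

            if self._state == CircuitState.HALF_OPEN:
                self._half_open_successes += 1
                if (self._half_open_successes
                        >= self.success_threshold):
                    self._transition_to(CircuitState.CLOSED)
            elif self._state == CircuitState.CLOSED:
                # Count consecutive failures only, so isolated errors
                # spread over a long period do not trip the breaker.
                self._stats.failures = 0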

    def _on_failure(self):
        with self._lock:
            self._stats.failures += 1
            self._stats.total_calls += 1
            self._stats.last_failure_time = datetime.now()

            if self._state == CircuitState.HALF_OPEN:
                self._transition_to(CircuitState.OPEN)
            elif (self._state == CircuitState.CLOSED
                  and self._stats.failures >= self.failure_threshold):
                self._transition_to(CircuitState.OPEN)


class CircuitOpenError(Exception):
    pass

Using Circuit Breakers with AI Agents

import openai

client = openai.OpenAI()

# Create a breaker for the LLM API
llm_breaker = CircuitBreaker(
    name="openai-api",
    failure_threshold=3,
    recovery_timeout=60,
    success_threshold=2,
)


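def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        timeout=10,  # a hung request raises and counts as a failure
    )
    # content can be None (for example on a tool call), so normalize to str
    return response.choices[0].message.content or ""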


def cached_fallback(prompt: str) -> str:
    return "[Service temporarily unavailable. Using cached response.]"


# Protected call
result = llm_breaker.call(
    call_llm,
    "Explain quantum computing",
    fallback=cached_fallback,
)
print(result)
print(f"Circuit state: {llm_breaker.state.value}")

Decorator Variant for Cleaner Usage

from functools import wraps


def circuit_protected(breaker: CircuitBreaker,
                      fallback: Callable | None = None):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            return breaker.call(func, *args, fallback=fallback,
                                **kwargs)
        return wrapper
    return decorator


@circuit_protected(llm_breaker, fallback=cached_fallback)
def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize concisely."},
            {"role": "user", "content": text},
        ],
        timeout=10,  # without a timeout, a hung call never reaches the breaker
    )
    return response.choices[0].message.content or ""

Monitoring Circuit Health

Expose circuit breaker statistics for observability. Track how often each breaker opens, how long it stays open, and whether half-open test calls are succeeding. These metrics reveal which dependencies are unreliable and help you size failure_threshold and recovery_timeout appropriately.
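
One possible way to collect these numbers is a small recorder that the breaker notifies on every state change and every half-open test call. The sketch below is an assumption, not part of the implementation above: `BreakerMetrics`, `on_transition`, `on_probe`, and `snapshot` are hypothetical names, and you would still need to call them from `_transition_to`, `_on_success`, and `_on_failure`.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BreakerMetrics:
    """Hypothetical recorder fed from a breaker's state-change hook."""
    times_opened: int = 0
    seconds_unavailable: float = 0.0
    probe_successes: int = 0
    probe_failures: int = 0
    _outage_started: Optional[float] = field(default=None, repr=False)

    def on_transition(self, new_state: str, now: Optional[float] = None) -> None:
        # A monotonic clock is unaffected by wall-clock adjustments.
        now = time.monotonic() if now is None else now
        if new_state == "open":
            self.times_opened += 1
            # A failed probe re-opens the circuit; keep the original outage start.
            if self._outage_started is None:
                self._outage_started = now
        elif new_state == "closed" and self._outage_started is not None:
            self.seconds_unavailable += now - self._outage_started
            self._outage_started = None

    def on_probe(self, succeeded: bool) -> None:
        """Record the outcome of one half-open test call."""
        if succeeded:
            self.probe_successes += 1
        else:
            self.probe_failures += 1

    def snapshot(self) -> dict:
        """Return a plain dict suitable for logging or a metrics endpoint."""
        probes = self.probe_successes + self.probe_failures
        return {
            "times_opened": self.times_opened,
            "seconds_unavailable": self.seconds_unavailable,
            "probe_success_rate": self.probe_successes / probes if probes else None,
        }
```

Here `seconds_unavailable` covers the whole outage, from the first trip until the breaker closes again, including time spent half-open.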


FAQ

How do I choose the right failure threshold and recovery timeout?

Start with a failure threshold of 5 and a recovery timeout of 30-60 seconds. Monitor your system under real traffic and adjust. Services with high latency variance may need higher thresholds to avoid false trips. Services that recover slowly need longer timeouts. Measure the actual mean time to recovery (MTTR) for each dependency and set the timeout slightly above it.
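
As a rough sketch of the MTTR rule, the helper below derives a timeout from a list of observed outage durations. The function name `suggest_recovery_timeout` and the default 20% safety margin are assumptions chosen for illustration; pick a margin that matches how variable your dependency's recovery times are.

```python
import math
from statistics import mean


def suggest_recovery_timeout(outage_seconds: list[float],
                             margin: float = 1.2) -> int:
    """Suggest a recovery_timeout (whole seconds) slightly above the observed MTTR."""
    if not outage_seconds:
        raise ValueError("need at least one observed outage duration")
    if margin < 1.0:
        raise ValueError("margin must be >= 1.0 so the timeout is not below MTTR")
    mttr = mean(outage_seconds)
    # Round up so the timeout never falls below the scaled MTTR.
    return math.ceil(mttr * margin)
```

The result can be passed straight to the `recovery_timeout` argument of `CircuitBreaker`.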

Should each agent have its own circuit breaker or share one?

Use one circuit breaker per downstream dependency, not per agent. If three agents all call the same LLM API, they should share a single breaker for that API. This way, failures detected by one agent protect all agents from hammering a downed service. Store breakers in a shared registry that agents access by dependency name.
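
A shared registry could look like the sketch below. `BreakerRegistry` and its `factory` argument are hypothetical names introduced here; in practice the factory would be something like `lambda name: CircuitBreaker(name=name)`, while this standalone example uses a plain dict as a stand-in for a breaker.

```python
import threading
from typing import Any, Callable


class BreakerRegistry:
    """Hypothetical registry handing out one breaker per dependency name."""

    def __init__(self, factory: Callable[[str], Any]):
        self._factory = factory
        self._breakers: dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, dependency: str) -> Any:
        # The lock ensures two agents racing on first use share one breaker.
        with self._lock:
            if dependency not in self._breakers:
                self._breakers[dependency] = self._factory(dependency)
            return self._breakers[dependency]


# Stand-in factory: a dict plays the role of a breaker in this sketch.
registry = BreakerRegistry(factory=lambda name: {"name": name})
planner_breaker = registry.get("openai-api")
writer_breaker = registry.get("openai-api")  # same object as planner_breaker
```

Because both agents receive the same object, a failure recorded through one of them is immediately visible to the other.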

How does the circuit breaker interact with retry logic?

The circuit breaker should wrap the retry logic, not the other way around. Retries happen inside the func that the breaker calls. If all retries fail, that counts as one failure for the breaker. This prevents retries from inflating the failure count and tripping the breaker prematurely.
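
A minimal sketch of this layering, assuming a simple exponential backoff: `with_retries` is a hypothetical helper, not part of the breaker above, and its `attempts`, `base_delay`, and `sleep` parameters are illustrative defaults.

```python
import time
from functools import wraps
from typing import Any, Callable


def with_retries(func: Callable[..., Any], attempts: int = 3,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> Callable[..., Any]:
    """Retry `func` with exponential backoff; re-raise the last error if all attempts fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    @wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise  # the caller (the breaker) sees exactly one failure
                sleep(base_delay * 2 ** attempt)

    return wrapper
```

With the breaker from this article, the call would then look like `llm_breaker.call(with_retries(call_llm), prompt, fallback=cached_fallback)`: three failed attempts inside the wrapper surface as a single exception, so the breaker's failure count increases by one.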



CallSphere Team
