
Retry Strategies for LLM API Calls: Exponential Backoff with Jitter and Tenacity

Implement production-grade retry logic for LLM API calls using exponential backoff, jitter, and the Tenacity library. Learn when to retry, when to stop, and how to avoid the thundering herd problem.

The Problem with Naive Retries

LLM API calls fail regularly. Rate limits, server overload, network blips, and cold start latency all cause intermittent errors. The instinct is to wrap the call in a while loop with a sleep, but naive retries create serious problems: they hammer the already-stressed API, synchronize retry storms across clients, and can rack up costs by resending expensive prompts repeatedly.

Production agents need structured retry strategies that maximize success probability while minimizing waste.

Understanding Backoff Algorithms

Fixed Delay

The simplest approach — wait a constant duration between retries. This works for isolated scripts but fails in production because all clients retry at the same intervals, creating synchronized load spikes.
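
A minimal sketch of the pattern, with call_api standing in for any flaky request function (a hypothetical placeholder, not a real API):

import time

def retry_fixed(call_api, attempts: int = 3, delay: float = 2.0):
    """Retry with a constant delay between attempts."""
    for attempt in range(attempts):
        try:
            return call_api()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay)  # every client waits exactly the same interval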

Exponential Backoff

Each retry waits exponentially longer: 1s, 2s, 4s, 8s, 16s. This gives the overloaded service time to recover. However, if many clients start failing at the same time, they all retry at the same exponential intervals.
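
The schedule is simply base_delay * 2 ** attempt, capped at a maximum; it is the same calculation used in the full-jitter code below:

# Delay for attempt n is base_delay * 2**n, capped at max_delay.
delays = [min(1.0 * 2 ** attempt, 60.0) for attempt in range(5)]
# -> [1.0, 2.0, 4.0, 8.0, 16.0]: identical for every client, hence the synchronized spikes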

Exponential Backoff with Jitter

Adding randomness (jitter) to the backoff interval desynchronizes clients. This is the gold standard for distributed systems.

import random
import time
import httpx

def exponential_backoff_with_jitter(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> float:
    """Calculate delay with full jitter strategy."""
    exp_delay = base_delay * (2 ** attempt)
    capped = min(exp_delay, max_delay)
    return random.uniform(0, capped)

def call_llm_with_retry(
    prompt: str,
    max_attempts: int = 5,
    retryable_status_codes: set[int] | None = None,
) -> dict:
    if retryable_status_codes is None:
        retryable_status_codes = {429, 500, 502, 503, 504}

    last_exception: Exception | None = None
    for attempt in range(max_attempts):
        try:
            response = httpx.post(
                "https://api.openai.com/v1/chat/completions",
                json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
                headers={"Authorization": "Bearer ..."},
                timeout=30.0,
            )
            if response.status_code == 200:
                return response.json()
            if response.status_code not in retryable_status_codes:
                raise RuntimeError(f"Non-retryable status: {response.status_code}")
            # Remember the failure so the final error can point at a cause.
            last_exception = RuntimeError(f"Retryable status: {response.status_code}")

        except (httpx.ConnectTimeout, httpx.ReadTimeout) as exc:
            last_exception = exc

        # Don't sleep after the final attempt -- there is no retry left to wait for.
        if attempt < max_attempts - 1:
            delay = exponential_backoff_with_jitter(attempt)
            print(f"Attempt {attempt + 1} failed, retrying in {delay:.1f}s")
            time.sleep(delay)

    raise RuntimeError(f"All {max_attempts} attempts failed") from last_exception

Using Tenacity for Production Retries

The Tenacity library provides a declarative, composable retry framework that eliminates boilerplate.

from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential_jitter,
    retry_if_exception_type,
    before_sleep_log,
    after_log,
)
import logging

import httpx

logger = logging.getLogger("agent.llm")

class RateLimitError(Exception):
    pass

class ServerOverloadError(Exception):
    pass

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(
        initial=1,
        max=60,
        jitter=5,
    ),
    retry=retry_if_exception_type((RateLimitError, ServerOverloadError, TimeoutError)),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    after=after_log(logger, logging.INFO),
    reraise=True,
)
async def call_llm(messages: list[dict], model: str = "gpt-4o") -> str:
    """Call LLM with automatic retry on transient failures."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "https://api.openai.com/v1/chat/completions",
            json={"model": model, "messages": messages},
            headers={"Authorization": "Bearer ..."},
            timeout=30.0,
        )
        if resp.status_code == 429:
            raise RateLimitError("Rate limited")
        if resp.status_code >= 500:
            raise ServerOverloadError(f"Server error: {resp.status_code}")
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
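
Tenacity detects that call_llm is a coroutine and retries it asynchronously, so the call site stays clean. A minimal driver, again assuming the API-key placeholder is filled in:

import asyncio

async def main():
    answer = await call_llm([{"role": "user", "content": "Explain jitter in one sentence."}])
    print(answer)

asyncio.run(main())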

Circuit Breaking: Knowing When to Stop

Retries are only useful when the failure is transient. If the provider is down for an extended period, continuous retries waste resources and increase latency. A circuit breaker stops retries after a threshold of consecutive failures and only allows a test request after a cooldown period.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = "closed"  # closed = healthy, open = broken, half-open = probing

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        # A failed probe in half-open reopens the circuit immediately.
        if self.state == "half-open" or self.failure_count >= self.failure_threshold:
            self.state = "open"

    def record_success(self):
        self.failure_count = 0
        self.state = "closed"

    def can_proceed(self) -> bool:
        if self.state == "closed":
            return True
        elapsed = time.time() - self.last_failure_time
        if elapsed >= self.cooldown_seconds:
            # Cooldown expired: let a single test request through.
            self.state = "half-open"
            return True
        return False
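
Here is one way to wire the breaker around the retrying call from earlier. This is a sketch: returning None on an open circuit is a placeholder for whatever degraded behavior (cached answer, fallback model, user-facing error) fits your agent.

breaker = CircuitBreaker(failure_threshold=5, cooldown_seconds=30.0)

def guarded_call(prompt: str) -> dict | None:
    if not breaker.can_proceed():
        return None  # circuit open: fail fast instead of queueing on a dead provider
    try:
        result = call_llm_with_retry(prompt)
        breaker.record_success()
        return result
    except RuntimeError:
        breaker.record_failure()
        return None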

FAQ

What is jitter and why does it matter?

Jitter adds randomness to retry delays. Without it, hundreds of clients that fail simultaneously will retry at the exact same moments (1s, 2s, 4s), creating synchronized traffic spikes that overwhelm the recovering server. Full jitter picks a random delay between 0 and the calculated backoff, spreading retries evenly over time.

Should I use the Retry-After header from the API?

Absolutely. When an LLM provider returns a 429 with a Retry-After header, always respect that value as your minimum wait time. Combine it with your backoff strategy by using max(retry_after_value, calculated_backoff) to ensure you never retry sooner than the server requests.
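
A sketch of that combination, reusing exponential_backoff_with_jitter from earlier. It assumes Retry-After carries a delay in seconds; the header can also be an HTTP date, which a production parser should handle:

def next_delay(response: httpx.Response, attempt: int) -> float:
    backoff = exponential_backoff_with_jitter(attempt)
    retry_after = response.headers.get("Retry-After")
    if retry_after is not None:
        try:
            # Never retry sooner than the server asked for.
            return max(float(retry_after), backoff)
        except ValueError:
            pass  # Retry-After was an HTTP date, not seconds; fall back to backoff
    return backoff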

How many retries are appropriate for LLM calls?

For synchronous user-facing requests, 3 attempts with a maximum total timeout of 30 seconds is typical. For background processing, 5 to 7 attempts with a maximum backoff of 60 seconds works well. Always set an overall deadline so the total retry sequence cannot exceed your request budget.
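
With Tenacity, an overall deadline composes with the attempt limit via the | operator, stopping on whichever condition fires first:

from tenacity import retry, stop_after_attempt, stop_after_delay, wait_exponential_jitter

@retry(
    stop=stop_after_attempt(3) | stop_after_delay(30),  # 3 tries or 30s total, whichever first
    wait=wait_exponential_jitter(initial=1, max=10),
    reraise=True,
)
def user_facing_call():
    ...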


#RetryPatterns #ExponentialBackoff #Tenacity #LLMAPIs #Python #AgenticAI #LearnAI #AIEngineering
