
Retry Strategies for LLM API Calls: Exponential Backoff with Jitter and Tenacity

Implement production-grade retry logic for LLM API calls using exponential backoff, jitter, and the Tenacity library. Learn when to retry, when to stop, and how to avoid the thundering herd problem.

The Problem with Naive Retries

LLM API calls fail regularly. Rate limits, server overload, network blips, and cold start latency all cause intermittent errors. The instinct is to wrap the call in a while loop with a sleep, but naive retries create serious problems: they hammer the already-stressed API, synchronize retry storms across clients, and can rack up costs by resending expensive prompts repeatedly.

Production agents need structured retry strategies that maximize success probability while minimizing waste.

Understanding Backoff Algorithms

Fixed Delay

The simplest approach — wait a constant duration between retries. This works for isolated scripts but fails in production because all clients retry at the same intervals, creating synchronized load spikes.
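
A minimal sketch of the pattern, with call_api standing in for any flaky request function (a hypothetical placeholder, not a real API):

import time

def retry_fixed(call_api, attempts: int = 3, delay: float = 2.0):
    """Retry with a constant delay between attempts."""
    for attempt in range(attempts):
        try:
            return call_api()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay)  # every client waits exactly the same interval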

Exponential Backoff

Each retry waits exponentially longer: 1s, 2s, 4s, 8s, 16s. This gives the overloaded service time to recover. However, if many clients start failing at the same time, they all retry at the same exponential intervals.
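
The schedule is simply base_delay * 2 ** attempt, capped at a maximum; it is the same calculation used in the full-jitter code below:

# Delay for attempt n is base_delay * 2**n, capped at max_delay.
delays = [min(1.0 * 2 ** attempt, 60.0) for attempt in range(5)]
# -> [1.0, 2.0, 4.0, 8.0, 16.0]: identical for every client, hence the synchronized spikes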

Exponential Backoff with Jitter

Adding randomness (jitter) to the backoff interval desynchronizes clients. This is the gold standard for distributed systems.

import random
import time
import httpx

def exponential_backoff_with_jitter(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> float:
    """Calculate delay with full jitter strategy."""
    exp_delay = base_delay * (2 ** attempt)
    capped = min(exp_delay, max_delay)
    return random.uniform(0, capped)

def call_llm_with_retry(
    prompt: str,
    max_attempts: int = 5,
    retryable_status_codes: set[int] | None = None,
) -> dict:
    if retryable_status_codes is None:
        retryable_status_codes = {429, 500, 502, 503, 504}

    last_exception: Exception | None = None
    for attempt in range(max_attempts):
        try:
            response = httpx.post(
                "https://api.openai.com/v1/chat/completions",
                json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
                headers={"Authorization": "Bearer ..."},
                timeout=30.0,
            )
            if response.status_code == 200:
                return response.json()
            if response.status_code not in retryable_status_codes:
                raise RuntimeError(f"Non-retryable status: {response.status_code}")
            # Remember the failure so the final error can point at a cause.
            last_exception = RuntimeError(f"Retryable status: {response.status_code}")

        except (httpx.ConnectTimeout, httpx.ReadTimeout) as exc:
            last_exception = exc

        # Don't sleep after the final attempt -- there is no retry left to wait for.
        if attempt < max_attempts - 1:
            delay = exponential_backoff_with_jitter(attempt)
            print(f"Attempt {attempt + 1} failed, retrying in {delay:.1f}s")
            time.sleep(delay)

    raise RuntimeError(f"All {max_attempts} attempts failed") from last_exception

Using Tenacity for Production Retries

The Tenacity library provides a declarative, composable retry framework that eliminates boilerplate.

from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential_jitter,
    retry_if_exception_type,
    before_sleep_log,
    after_log,
)
import logging

import httpx

logger = logging.getLogger("agent.llm")

class RateLimitError(Exception):
    pass

class ServerOverloadError(Exception):
    pass

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(
        initial=1,
        max=60,
        jitter=5,
    ),
    retry=retry_if_exception_type((RateLimitError, ServerOverloadError, TimeoutError)),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    after=after_log(logger, logging.INFO),
    reraise=True,
)
async def call_llm(messages: list[dict], model: str = "gpt-4o") -> str:
    """Call LLM with automatic retry on transient failures."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "https://api.openai.com/v1/chat/completions",
            json={"model": model, "messages": messages},
            headers={"Authorization": "Bearer ..."},
            timeout=30.0,
        )
        if resp.status_code == 429:
            raise RateLimitError("Rate limited")
        if resp.status_code >= 500:
            raise ServerOverloadError(f"Server error: {resp.status_code}")
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
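
Tenacity detects that call_llm is a coroutine and retries it asynchronously, so the call site stays clean. A minimal driver, again assuming the API-key placeholder is filled in:

import asyncio

async def main():
    answer = await call_llm([{"role": "user", "content": "Explain jitter in one sentence."}])
    print(answer)

asyncio.run(main())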

Circuit Breaking: Knowing When to Stop

Retries are only useful when the failure is transient. If the provider is down for an extended period, continuous retries waste resources and increase latency. A circuit breaker stops retries after a threshold of consecutive failures and only allows a test request after a cooldown period.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = "closed"  # closed = healthy, open = broken, half-open = probing

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        # A failed probe in half-open reopens the circuit immediately.
        if self.state == "half-open" or self.failure_count >= self.failure_threshold:
            self.state = "open"

    def record_success(self):
        self.failure_count = 0
        self.state = "closed"

    def can_proceed(self) -> bool:
        if self.state == "closed":
            return True
        elapsed = time.time() - self.last_failure_time
        if elapsed >= self.cooldown_seconds:
            # Cooldown expired: let a single test request through.
            self.state = "half-open"
            return True
        return False
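
Here is one way to wire the breaker around the retrying call from earlier. This is a sketch: returning None on an open circuit is a placeholder for whatever degraded behavior (cached answer, fallback model, user-facing error) fits your agent.

breaker = CircuitBreaker(failure_threshold=5, cooldown_seconds=30.0)

def guarded_call(prompt: str) -> dict | None:
    if not breaker.can_proceed():
        return None  # circuit open: fail fast instead of queueing on a dead provider
    try:
        result = call_llm_with_retry(prompt)
        breaker.record_success()
        return result
    except RuntimeError:
        breaker.record_failure()
        return None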

FAQ

What is jitter and why does it matter?

Jitter adds randomness to retry delays. Without it, hundreds of clients that fail simultaneously will retry at the exact same moments (1s, 2s, 4s), creating synchronized traffic spikes that overwhelm the recovering server. Full jitter picks a random delay between 0 and the calculated backoff, spreading retries evenly over time.

Should I use the Retry-After header from the API?

Absolutely. When an LLM provider returns a 429 with a Retry-After header, always respect that value as your minimum wait time. Combine it with your backoff strategy by using max(retry_after_value, calculated_backoff) to ensure you never retry sooner than the server requests.
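
A sketch of that combination, reusing exponential_backoff_with_jitter from earlier. It assumes Retry-After carries a delay in seconds; the header can also be an HTTP date, which a production parser should handle:

def next_delay(response: httpx.Response, attempt: int) -> float:
    backoff = exponential_backoff_with_jitter(attempt)
    retry_after = response.headers.get("Retry-After")
    if retry_after is not None:
        try:
            # Never retry sooner than the server asked for.
            return max(float(retry_after), backoff)
        except ValueError:
            pass  # Retry-After was an HTTP date, not seconds; fall back to backoff
    return backoff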

How many retries are appropriate for LLM calls?

For synchronous user-facing requests, 3 attempts with a maximum total timeout of 30 seconds is typical. For background processing, 5 to 7 attempts with a maximum backoff of 60 seconds works well. Always set an overall deadline so the total retry sequence cannot exceed your request budget.
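
With Tenacity, an overall deadline composes with the attempt limit via the | operator, stopping on whichever condition fires first:

from tenacity import retry, stop_after_attempt, stop_after_delay, wait_exponential_jitter

@retry(
    stop=stop_after_attempt(3) | stop_after_delay(30),  # 3 tries or 30s total, whichever first
    wait=wait_exponential_jitter(initial=1, max=10),
    reraise=True,
)
def user_facing_call():
    ...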


#RetryPatterns #ExponentialBackoff #Tenacity #LLMAPIs #Python #AgenticAI #LearnAI #AIEngineering
