
SDK Retry and Error Handling: Building Resilient Client Libraries

Learn how to implement robust retry policies, error classification, timeout configuration, and structured logging in AI agent SDK client libraries for production reliability.

Why SDKs Must Handle Retries

Network requests fail. Servers return 500 errors during deployments. Rate limiters throttle bursts. DNS resolution hiccups. TCP connections reset. If your SDK surfaces every transient failure directly to the user, their application becomes fragile. A production-grade SDK retries transient errors automatically so that intermittent infrastructure issues do not cascade into application failures.

The goal is not to mask errors — it is to absorb noise so that when an error reaches the user, it represents a genuine problem that requires their attention.

Error Classification

The first step is classifying errors into retryable and non-retryable categories. This classification drives the retry engine:

from enum import Enum


class ErrorCategory(Enum):
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"
    RATE_LIMITED = "rate_limited"


def classify_error(status_code: int | None, exception: Exception | None) -> ErrorCategory:
    """Classify an error to determine retry behavior."""

    # Network-level failures are always retryable
    if exception is not None:
        if isinstance(exception, (ConnectionError, TimeoutError)):
            return ErrorCategory.RETRYABLE
        return ErrorCategory.NON_RETRYABLE

    # HTTP status code classification
    if status_code is not None:
        if status_code == 429:
            return ErrorCategory.RATE_LIMITED
        if status_code in (408, 500, 502, 503, 504):
            return ErrorCategory.RETRYABLE
        if status_code == 409:
            return ErrorCategory.RETRYABLE  # Conflict, often transient
        return ErrorCategory.NON_RETRYABLE

    return ErrorCategory.NON_RETRYABLE

The critical distinction: 400 (bad request), 401 (unauthorized), 403 (forbidden), and 404 (not found) are never retried; the user must fix their request or credentials. 500, 502, 503, and 504 are retried because they typically indicate transient server issues, and 409 (conflict) is treated as retryable on the assumption that conflicts are often transient. 429 (rate limited) is retried with special handling for the Retry-After header.
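As a quick sanity check, here is a condensed, self-contained version of the classifier with a few spot checks (same logic as above, trimmed for brevity):

```python
from enum import Enum


class ErrorCategory(Enum):
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"
    RATE_LIMITED = "rate_limited"


def classify_error(status_code, exception=None):
    # Network-level failures are retryable; other exceptions are not
    if exception is not None:
        if isinstance(exception, (ConnectionError, TimeoutError)):
            return ErrorCategory.RETRYABLE
        return ErrorCategory.NON_RETRYABLE
    if status_code == 429:
        return ErrorCategory.RATE_LIMITED
    if status_code in (408, 409, 500, 502, 503, 504):
        return ErrorCategory.RETRYABLE
    return ErrorCategory.NON_RETRYABLE


# Spot checks: client errors stay with the caller, server errors retry
assert classify_error(404) is ErrorCategory.NON_RETRYABLE
assert classify_error(503) is ErrorCategory.RETRYABLE
assert classify_error(429) is ErrorCategory.RATE_LIMITED
assert classify_error(None, TimeoutError()) is ErrorCategory.RETRYABLE
```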

Retry Policy Configuration

Users need control over retry behavior. Some applications prefer fast failure; others can tolerate longer wait times for higher reliability:

from dataclasses import dataclass


@dataclass
class RetryPolicy:
    """Configuration for retry behavior."""
    max_retries: int = 3
    initial_delay: float = 0.5       # seconds
    max_delay: float = 30.0          # seconds
    backoff_factor: float = 2.0      # exponential multiplier
    retry_on_status: set[int] | None = None
    retry_on_timeout: bool = True

    def __post_init__(self):
        if self.retry_on_status is None:
            self.retry_on_status = {408, 429, 500, 502, 503, 504}

    def calculate_delay(self, attempt: int, retry_after: float | None = None) -> float:
        """Calculate delay before next retry with exponential backoff."""
        if retry_after is not None:
            return min(retry_after, self.max_delay)

        delay = self.initial_delay * (self.backoff_factor ** attempt)
        return min(delay, self.max_delay)

The calculate_delay method implements exponential backoff: 0.5s, 1s, 2s, 4s, and so on up to the maximum. When the server sends a Retry-After header, the SDK honors it but caps at max_delay to prevent unbounded waits.
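Plugging the defaults into that formula makes the schedule concrete (a standalone sketch of the same arithmetic):

```python
# Defaults from RetryPolicy above: 0.5s initial delay, 2x backoff, 30s cap
initial, factor, cap = 0.5, 2.0, 30.0

delays = [min(initial * factor ** attempt, cap) for attempt in range(7)]
print(delays)  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Note that the seventh attempt would be 32 seconds without the cap; max_delay clamps it to 30.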

The Retry Engine

The retry engine wraps the HTTP request method and orchestrates classification, backoff, and logging:

import time
import logging

logger = logging.getLogger("myagent")


class APIError(Exception):
    """Raised for non-retryable errors or when retries are exhausted."""
    def __init__(self, status_code: int, message: str):
        super().__init__(f"HTTP {status_code}: {message}")
        self.status_code = status_code


class APIConnectionError(Exception):
    """Raised when connection-level retries are exhausted."""


class RetryableClient:
    def __init__(self, http_client, retry_policy: RetryPolicy | None = None):
        self._http = http_client
        self.retry_policy = retry_policy or RetryPolicy()

    def request_with_retry(self, method: str, url: str, **kwargs):
        # Returns the underlying HTTP client's Response object
        last_exception = None

        for attempt in range(self.retry_policy.max_retries + 1):
            try:
                response = self._http.request(method, url, **kwargs)

                if response.status_code < 400:
                    return response

                category = classify_error(response.status_code, None)

                if category == ErrorCategory.NON_RETRYABLE:
                    raise APIError(response.status_code, response.text)

                if attempt == self.retry_policy.max_retries:
                    raise APIError(response.status_code, response.text)

                retry_after = self._parse_retry_after(response)
                delay = self.retry_policy.calculate_delay(attempt, retry_after)

                logger.warning(
                    "Request failed with %d, retrying in %.1fs (attempt %d/%d)",
                    response.status_code, delay, attempt + 1,
                    self.retry_policy.max_retries,
                )
                time.sleep(delay)

            except (ConnectionError, TimeoutError) as exc:
                last_exception = exc
                if attempt == self.retry_policy.max_retries:
                    raise APIConnectionError(str(exc)) from exc

                delay = self.retry_policy.calculate_delay(attempt)
                logger.warning(
                    "Connection failed, retrying in %.1fs (attempt %d/%d)",
                    delay, attempt + 1, self.retry_policy.max_retries,
                )
                time.sleep(delay)

    def _parse_retry_after(self, response) -> float | None:
        header = response.headers.get("Retry-After")
        if header is None:
            return None
        try:
            return float(header)
        except ValueError:
            return None

TypeScript Retry Implementation

The same pattern in TypeScript using async/await:

class AgentAPIError extends Error {
  constructor(public status: number, body: string) {
    super(`HTTP ${status}: ${body}`);
  }
}

interface RetryConfig {
  maxRetries: number;
  initialDelay: number;
  maxDelay: number;
  backoffFactor: number;
}

const DEFAULT_RETRY: RetryConfig = {
  maxRetries: 3,
  initialDelay: 500,
  maxDelay: 30_000,
  backoffFactor: 2,
};

async function fetchWithRetry(
  url: string,
  init: RequestInit,
  config: RetryConfig = DEFAULT_RETRY,
): Promise<Response> {
  let lastError: Error | null = null;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      const response = await fetch(url, init);

      if (response.ok) return response;

      if (![408, 429, 500, 502, 503, 504].includes(response.status)) {
        throw new AgentAPIError(response.status, await response.text());
      }

      if (attempt === config.maxRetries) {
        throw new AgentAPIError(response.status, await response.text());
      }

      const retryAfterHeader = response.headers.get('Retry-After');
      const retryAfter = retryAfterHeader === null ? NaN : Number(retryAfterHeader);
      // Guard against non-numeric Retry-After (e.g. HTTP-date format)
      const delay = Number.isFinite(retryAfter) && retryAfter > 0
        ? Math.min(retryAfter * 1000, config.maxDelay)
        : Math.min(config.initialDelay * config.backoffFactor ** attempt, config.maxDelay);

      await new Promise(resolve => setTimeout(resolve, delay));
    } catch (error) {
      if (error instanceof AgentAPIError) throw error;
      lastError = error as Error;

      if (attempt === config.maxRetries) throw lastError;

      const delay = Math.min(
        config.initialDelay * config.backoffFactor ** attempt,
        config.maxDelay,
      );
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  throw lastError ?? new Error('Retry exhausted');
}

Timeout Configuration

Offer multiple timeout levels — connection timeout, read timeout, and total request timeout:

@dataclass
class TimeoutConfig:
    connect: float = 5.0    # seconds to establish connection
    read: float = 30.0      # seconds to read response
    total: float = 60.0     # total request deadline

AI agent runs can take 30+ seconds. The SDK should default to generous timeouts for run operations while keeping shorter timeouts for metadata queries.
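One way to express that split, reusing the TimeoutConfig from above (the RUN_TIMEOUTS and METADATA_TIMEOUTS names are illustrative, not an established API):

```python
from dataclasses import dataclass


@dataclass
class TimeoutConfig:
    connect: float = 5.0    # seconds to establish connection
    read: float = 30.0      # seconds to read response
    total: float = 60.0     # total request deadline


# Generous deadlines for long-running agent runs, tight ones for metadata
RUN_TIMEOUTS = TimeoutConfig(read=120.0, total=180.0)
METADATA_TIMEOUTS = TimeoutConfig(read=10.0, total=15.0)
```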

FAQ

Should I add jitter to the backoff delays?

Yes. Without jitter, retrying clients that failed at the same time will retry at the same time, creating a thundering herd. Add random jitter of up to 25% of the calculated delay: delay = delay * (0.75 + random.random() * 0.5). This spreads retry attempts across time and reduces the chance of synchronized retries overwhelming the server.
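That formula as a small standalone helper, verified over a batch of samples:

```python
import random


def jittered(delay: float) -> float:
    """Scale delay into [0.75 * delay, 1.25 * delay) to desynchronize retries."""
    return delay * (0.75 + random.random() * 0.5)


# A 4-second base delay always lands between 3 and 5 seconds
samples = [jittered(4.0) for _ in range(1000)]
assert all(3.0 <= s < 5.0 for s in samples)
```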

How do I prevent retries from masking genuine outages?

Log every retry at warning level with the attempt count, status code, and delay. If the SDK exhausts all retries, raise the final error with context about how many attempts were made. Users can monitor retry logs to detect degradation before it becomes a total outage.
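One way to carry that context is a dedicated terminal exception (RetriesExhausted is a hypothetical name, not part of the code above):

```python
class RetriesExhausted(Exception):
    """Final error raised after all retries fail, carrying attempt context."""

    def __init__(self, status_code: int, attempts: int):
        super().__init__(
            f"Gave up after {attempts} attempts (last status {status_code})"
        )
        self.status_code = status_code
        self.attempts = attempts


err = RetriesExhausted(503, 4)
print(err)  # Gave up after 4 attempts (last status 503)
```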

Should the SDK respect Retry-After headers with very large values?

Cap Retry-After at your max_delay configuration. A server sending a 300-second Retry-After header is likely indicating a prolonged outage. Rather than blocking the user's thread for five minutes, respect your timeout policy and fail with a clear error message suggesting the user retry later.
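A sketch of that policy (the 10x cutoff for failing fast is an illustrative choice, not a standard):

```python
def delay_for_retry_after(retry_after: float, max_delay: float = 30.0) -> float:
    """Honor Retry-After up to max_delay; fail fast on extreme values."""
    if retry_after > max_delay * 10:
        raise RuntimeError(
            f"Server requested a {retry_after:.0f}s wait; likely a prolonged "
            "outage. Retry later."
        )
    return min(retry_after, max_delay)


print(delay_for_retry_after(300.0))  # 30.0 (capped at max_delay)
```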



CallSphere Team

Expert insights on AI voice agents and customer communication automation.
