
SDK Retry and Error Handling: Building Resilient Client Libraries

Learn how to implement robust retry policies, error classification, timeout configuration, and structured logging in AI agent SDK client libraries for production reliability.

Why SDKs Must Handle Retries

Network requests fail. Servers return 500 errors during deployments. Rate limiters throttle bursts. DNS resolution hiccups. TCP connections reset. If your SDK surfaces every transient failure directly to the user, their application becomes fragile. A production-grade SDK retries transient errors automatically so that intermittent infrastructure issues do not cascade into application failures.

The goal is not to mask errors — it is to absorb noise so that when an error reaches the user, it represents a genuine problem that requires their attention.

Error Classification

The first step is classifying errors into retryable and non-retryable categories. This classification drives the retry engine:

from enum import Enum


class ErrorCategory(Enum):
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"
    RATE_LIMITED = "rate_limited"


def classify_error(status_code: int | None, exception: Exception | None) -> ErrorCategory:
    """Classify an error to determine retry behavior."""

    # Network-level failures are always retryable
    if exception is not None:
        if isinstance(exception, (ConnectionError, TimeoutError)):
            return ErrorCategory.RETRYABLE
        return ErrorCategory.NON_RETRYABLE

    # HTTP status code classification
    if status_code is not None:
        if status_code == 429:
            return ErrorCategory.RATE_LIMITED
        if status_code in (408, 500, 502, 503, 504):
            return ErrorCategory.RETRYABLE
        if status_code == 409:
            return ErrorCategory.RETRYABLE  # Conflict, often transient
        return ErrorCategory.NON_RETRYABLE

    return ErrorCategory.NON_RETRYABLE

The critical distinction: 400 (bad request), 401 (unauthorized), 403 (forbidden), and 404 (not found) are never retried; the user must fix their request or credentials. 500, 502, 503, and 504 are retried because they typically indicate transient server issues, and 409 (conflict) is treated as retryable on the assumption that conflicts are often transient. 429 (rate limited) is retried with special handling for the Retry-After header.
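As a quick sanity check, here is a condensed, self-contained version of the classifier with a few spot checks (same logic as above, trimmed for brevity):

```python
from enum import Enum


class ErrorCategory(Enum):
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"
    RATE_LIMITED = "rate_limited"


def classify_error(status_code, exception=None):
    # Network-level failures are retryable; other exceptions are not
    if exception is not None:
        if isinstance(exception, (ConnectionError, TimeoutError)):
            return ErrorCategory.RETRYABLE
        return ErrorCategory.NON_RETRYABLE
    if status_code == 429:
        return ErrorCategory.RATE_LIMITED
    if status_code in (408, 409, 500, 502, 503, 504):
        return ErrorCategory.RETRYABLE
    return ErrorCategory.NON_RETRYABLE


# Spot checks: client errors stay with the caller, server errors retry
assert classify_error(404) is ErrorCategory.NON_RETRYABLE
assert classify_error(503) is ErrorCategory.RETRYABLE
assert classify_error(429) is ErrorCategory.RATE_LIMITED
assert classify_error(None, TimeoutError()) is ErrorCategory.RETRYABLE
```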

Retry Policy Configuration

Users need control over retry behavior. Some applications prefer fast failure; others can tolerate longer wait times for higher reliability:

from dataclasses import dataclass


@dataclass
class RetryPolicy:
    """Configuration for retry behavior."""
    max_retries: int = 3
    initial_delay: float = 0.5       # seconds
    max_delay: float = 30.0          # seconds
    backoff_factor: float = 2.0      # exponential multiplier
    retry_on_status: set[int] | None = None
    retry_on_timeout: bool = True

    def __post_init__(self):
        if self.retry_on_status is None:
            self.retry_on_status = {408, 429, 500, 502, 503, 504}

    def calculate_delay(self, attempt: int, retry_after: float | None = None) -> float:
        """Calculate delay before next retry with exponential backoff."""
        if retry_after is not None:
            return min(retry_after, self.max_delay)

        delay = self.initial_delay * (self.backoff_factor ** attempt)
        return min(delay, self.max_delay)

The calculate_delay method implements exponential backoff: 0.5s, 1s, 2s, 4s, and so on up to the maximum. When the server sends a Retry-After header, the SDK honors it but caps at max_delay to prevent unbounded waits.
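Plugging the defaults into that formula makes the schedule concrete (a standalone sketch of the same arithmetic):

```python
# Defaults from RetryPolicy above: 0.5s initial delay, 2x backoff, 30s cap
initial, factor, cap = 0.5, 2.0, 30.0

delays = [min(initial * factor ** attempt, cap) for attempt in range(7)]
print(delays)  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Note that the seventh attempt would be 32 seconds without the cap; max_delay clamps it to 30.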

The Retry Engine

The retry engine wraps the HTTP request method and orchestrates classification, backoff, and logging:

import time
import logging

logger = logging.getLogger("myagent")


class APIError(Exception):
    """Raised for non-retryable errors or when retries are exhausted."""
    def __init__(self, status_code: int, message: str):
        super().__init__(f"HTTP {status_code}: {message}")
        self.status_code = status_code


class APIConnectionError(Exception):
    """Raised when connection-level retries are exhausted."""


class RetryableClient:
    def __init__(self, http_client, retry_policy: RetryPolicy | None = None):
        self._http = http_client
        self.retry_policy = retry_policy or RetryPolicy()

    def request_with_retry(self, method: str, url: str, **kwargs):
        # Returns the underlying HTTP client's Response object
        last_exception = None

        for attempt in range(self.retry_policy.max_retries + 1):
            try:
                response = self._http.request(method, url, **kwargs)

                if response.status_code < 400:
                    return response

                category = classify_error(response.status_code, None)

                if category == ErrorCategory.NON_RETRYABLE:
                    raise APIError(response.status_code, response.text)

                if attempt == self.retry_policy.max_retries:
                    raise APIError(response.status_code, response.text)

                retry_after = self._parse_retry_after(response)
                delay = self.retry_policy.calculate_delay(attempt, retry_after)

                logger.warning(
                    "Request failed with %d, retrying in %.1fs (attempt %d/%d)",
                    response.status_code, delay, attempt + 1,
                    self.retry_policy.max_retries,
                )
                time.sleep(delay)

            except (ConnectionError, TimeoutError) as exc:
                last_exception = exc
                if attempt == self.retry_policy.max_retries:
                    raise APIConnectionError(str(exc)) from exc

                delay = self.retry_policy.calculate_delay(attempt)
                logger.warning(
                    "Connection failed, retrying in %.1fs (attempt %d/%d)",
                    delay, attempt + 1, self.retry_policy.max_retries,
                )
                time.sleep(delay)

    def _parse_retry_after(self, response) -> float | None:
        header = response.headers.get("Retry-After")
        if header is None:
            return None
        try:
            return float(header)
        except ValueError:
            return None

TypeScript Retry Implementation

The same pattern in TypeScript using async/await:

class AgentAPIError extends Error {
  constructor(public status: number, body: string) {
    super(`HTTP ${status}: ${body}`);
  }
}

interface RetryConfig {
  maxRetries: number;
  initialDelay: number;
  maxDelay: number;
  backoffFactor: number;
}

const DEFAULT_RETRY: RetryConfig = {
  maxRetries: 3,
  initialDelay: 500,
  maxDelay: 30_000,
  backoffFactor: 2,
};

async function fetchWithRetry(
  url: string,
  init: RequestInit,
  config: RetryConfig = DEFAULT_RETRY,
): Promise<Response> {
  let lastError: Error | null = null;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      const response = await fetch(url, init);

      if (response.ok) return response;

      if (![408, 429, 500, 502, 503, 504].includes(response.status)) {
        throw new AgentAPIError(response.status, await response.text());
      }

      if (attempt === config.maxRetries) {
        throw new AgentAPIError(response.status, await response.text());
      }

      const retryAfterHeader = response.headers.get('Retry-After');
      const retryAfter = retryAfterHeader === null ? NaN : Number(retryAfterHeader);
      // Guard against non-numeric Retry-After (e.g. HTTP-date format)
      const delay = Number.isFinite(retryAfter) && retryAfter > 0
        ? Math.min(retryAfter * 1000, config.maxDelay)
        : Math.min(config.initialDelay * config.backoffFactor ** attempt, config.maxDelay);

      await new Promise(resolve => setTimeout(resolve, delay));
    } catch (error) {
      if (error instanceof AgentAPIError) throw error;
      lastError = error as Error;

      if (attempt === config.maxRetries) throw lastError;

      const delay = Math.min(
        config.initialDelay * config.backoffFactor ** attempt,
        config.maxDelay,
      );
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  throw lastError ?? new Error('Retry exhausted');
}

Timeout Configuration

Offer multiple timeout levels — connection timeout, read timeout, and total request timeout:

@dataclass
class TimeoutConfig:
    connect: float = 5.0    # seconds to establish connection
    read: float = 30.0      # seconds to read response
    total: float = 60.0     # total request deadline

AI agent runs can take 30+ seconds. The SDK should default to generous timeouts for run operations while keeping shorter timeouts for metadata queries.
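One way to express that split, reusing the TimeoutConfig from above (the RUN_TIMEOUTS and METADATA_TIMEOUTS names are illustrative, not an established API):

```python
from dataclasses import dataclass


@dataclass
class TimeoutConfig:
    connect: float = 5.0    # seconds to establish connection
    read: float = 30.0      # seconds to read response
    total: float = 60.0     # total request deadline


# Generous deadlines for long-running agent runs, tight ones for metadata
RUN_TIMEOUTS = TimeoutConfig(read=120.0, total=180.0)
METADATA_TIMEOUTS = TimeoutConfig(read=10.0, total=15.0)
```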

FAQ

Should I add jitter to the backoff delays?

Yes. Without jitter, retrying clients that failed at the same time will retry at the same time, creating a thundering herd. Add random jitter of up to 25% of the calculated delay: delay = delay * (0.75 + random.random() * 0.5). This spreads retry attempts across time and reduces the chance of synchronized retries overwhelming the server.
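That formula as a small standalone helper, verified over a batch of samples:

```python
import random


def jittered(delay: float) -> float:
    """Scale delay into [0.75 * delay, 1.25 * delay) to desynchronize retries."""
    return delay * (0.75 + random.random() * 0.5)


# A 4-second base delay always lands between 3 and 5 seconds
samples = [jittered(4.0) for _ in range(1000)]
assert all(3.0 <= s < 5.0 for s in samples)
```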

How do I prevent retries from masking genuine outages?

Log every retry at warning level with the attempt count, status code, and delay. If the SDK exhausts all retries, raise the final error with context about how many attempts were made. Users can monitor retry logs to detect degradation before it becomes a total outage.
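One way to carry that context is a dedicated terminal exception (RetriesExhausted is a hypothetical name, not part of the code above):

```python
class RetriesExhausted(Exception):
    """Final error raised after all retries fail, carrying attempt context."""

    def __init__(self, status_code: int, attempts: int):
        super().__init__(
            f"Gave up after {attempts} attempts (last status {status_code})"
        )
        self.status_code = status_code
        self.attempts = attempts


err = RetriesExhausted(503, 4)
print(err)  # Gave up after 4 attempts (last status 503)
```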

Should the SDK respect Retry-After headers with very large values?

Cap Retry-After at your max_delay configuration. A server sending a 300-second Retry-After header is likely indicating a prolonged outage. Rather than blocking the user's thread for five minutes, respect your timeout policy and fail with a clear error message suggesting the user retry later.
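A sketch of that policy (the 10x cutoff for failing fast is an illustrative choice, not a standard):

```python
def delay_for_retry_after(retry_after: float, max_delay: float = 30.0) -> float:
    """Honor Retry-After up to max_delay; fail fast on extreme values."""
    if retry_after > max_delay * 10:
        raise RuntimeError(
            f"Server requested a {retry_after:.0f}s wait; likely a prolonged "
            "outage. Retry later."
        )
    return min(retry_after, max_delay)


print(delay_for_retry_after(300.0))  # 30.0 (capped at max_delay)
```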



CallSphere Team

Expert insights on AI voice agents and customer communication automation.
