SDK Retry and Error Handling: Building Resilient Client Libraries
Learn how to implement robust retry policies, error classification, timeout configuration, and structured logging in AI agent SDK client libraries for production reliability.
Why SDKs Must Handle Retries
Network requests fail. Servers return 500 errors during deployments. Rate limiters throttle bursts. DNS resolution hiccups. TCP connections reset. If your SDK surfaces every transient failure directly to the user, their application becomes fragile. A production-grade SDK retries transient errors automatically so that intermittent infrastructure issues do not cascade into application failures.
The goal is not to mask errors — it is to absorb noise so that when an error reaches the user, it represents a genuine problem that requires their attention.
Error Classification
The first step is classifying errors into retryable and non-retryable categories. This classification drives the retry engine:
from enum import Enum

class ErrorCategory(Enum):
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"
    RATE_LIMITED = "rate_limited"

def classify_error(status_code: int | None, exception: Exception | None) -> ErrorCategory:
    """Classify an error to determine retry behavior."""
    # Network-level failures are always retryable
    if exception is not None:
        if isinstance(exception, (ConnectionError, TimeoutError)):
            return ErrorCategory.RETRYABLE
        return ErrorCategory.NON_RETRYABLE
    # HTTP status code classification
    if status_code is not None:
        if status_code == 429:
            return ErrorCategory.RATE_LIMITED
        if status_code in (408, 500, 502, 503, 504):
            return ErrorCategory.RETRYABLE
        if status_code == 409:
            return ErrorCategory.RETRYABLE  # Conflict, often transient
        return ErrorCategory.NON_RETRYABLE
    return ErrorCategory.NON_RETRYABLE
The critical distinction: 400 (bad request), 401 (unauthorized), 403 (forbidden), and 404 (not found) are never retried. The user must fix their request or credentials. 408, 500, 502, 503, and 504 are retried because they typically indicate transient server issues; the classifier above also treats 409 (conflict) as retryable, since conflicts are often transient, but adjust that to your API's semantics. 429 (rate limited) is retried with special handling for the Retry-After header.
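In practice, the classifier feeds a small exception hierarchy so callers can catch errors at the granularity they need. A minimal sketch, with class names that are illustrative rather than prescribed by the code above:

```python
class AgentSDKError(Exception):
    """Base class for all errors raised by the SDK."""

class APIError(AgentSDKError):
    """HTTP error surfaced to the caller (non-retryable, or retries exhausted)."""
    def __init__(self, status_code: int, body: str):
        super().__init__(f"API error {status_code}: {body}")
        self.status_code = status_code
        self.body = body

class RateLimitError(APIError):
    """429 responses that persisted after all retries."""

class APIConnectionError(AgentSDKError):
    """Network-level failure after all retries."""
```

Users can then write `except RateLimitError:` to back off at the application level, or `except AgentSDKError:` to handle any SDK failure in one place.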
Retry Policy Configuration
Users need control over retry behavior. Some applications prefer fast failure; others can tolerate longer wait times for higher reliability:
from dataclasses import dataclass

@dataclass
class RetryPolicy:
    """Configuration for retry behavior."""
    max_retries: int = 3
    initial_delay: float = 0.5    # seconds
    max_delay: float = 30.0       # seconds
    backoff_factor: float = 2.0   # exponential multiplier
    retry_on_status: set[int] | None = None
    retry_on_timeout: bool = True

    def __post_init__(self):
        if self.retry_on_status is None:
            self.retry_on_status = {408, 429, 500, 502, 503, 504}

    def calculate_delay(self, attempt: int, retry_after: float | None = None) -> float:
        """Calculate delay before next retry with exponential backoff."""
        if retry_after is not None:
            return min(retry_after, self.max_delay)
        delay = self.initial_delay * (self.backoff_factor ** attempt)
        return min(delay, self.max_delay)
The calculate_delay method implements exponential backoff: 0.5s, 1s, 2s, 4s, and so on up to the maximum. When the server sends a Retry-After header, the SDK honors it but caps at max_delay to prevent unbounded waits.
The Retry Engine
The retry engine wraps the HTTP request method and orchestrates classification, backoff, and logging:
import time
import logging

logger = logging.getLogger("myagent")

# APIError and APIConnectionError are the SDK's exception types,
# assumed to be defined elsewhere in the library.

class RetryableClient:
    def __init__(self, http_client, retry_policy: RetryPolicy | None = None):
        self._http = http_client
        self.retry_policy = retry_policy or RetryPolicy()

    def request_with_retry(self, method: str, url: str, **kwargs):
        """Issue a request, retrying transient failures with backoff."""
        for attempt in range(self.retry_policy.max_retries + 1):
            try:
                response = self._http.request(method, url, **kwargs)
                if response.status_code < 400:
                    return response
                category = classify_error(response.status_code, None)
                if category == ErrorCategory.NON_RETRYABLE:
                    raise APIError(response.status_code, response.text)
                if attempt == self.retry_policy.max_retries:
                    raise APIError(response.status_code, response.text)
                retry_after = self._parse_retry_after(response)
                delay = self.retry_policy.calculate_delay(attempt, retry_after)
                logger.warning(
                    "Request failed with %d, retrying in %.1fs (attempt %d/%d)",
                    response.status_code, delay, attempt + 1,
                    self.retry_policy.max_retries,
                )
                time.sleep(delay)
            except (ConnectionError, TimeoutError) as exc:
                if attempt == self.retry_policy.max_retries:
                    raise APIConnectionError(str(exc)) from exc
                delay = self.retry_policy.calculate_delay(attempt)
                logger.warning(
                    "Connection failed, retrying in %.1fs (attempt %d/%d)",
                    delay, attempt + 1, self.retry_policy.max_retries,
                )
                time.sleep(delay)

    def _parse_retry_after(self, response) -> float | None:
        # Retry-After may be delay-seconds or an HTTP date; only the
        # numeric form is handled here.
        header = response.headers.get("Retry-After")
        if header is None:
            return None
        try:
            return float(header)
        except ValueError:
            return None
TypeScript Retry Implementation
The same pattern in TypeScript using async/await:
class AgentAPIError extends Error {
  constructor(public status: number, body: string) {
    super(`API error ${status}: ${body}`);
  }
}

interface RetryConfig {
  maxRetries: number;
  initialDelay: number;   // milliseconds
  maxDelay: number;       // milliseconds
  backoffFactor: number;
}

const DEFAULT_RETRY: RetryConfig = {
  maxRetries: 3,
  initialDelay: 500,
  maxDelay: 30_000,
  backoffFactor: 2,
};

async function fetchWithRetry(
  url: string,
  init: RequestInit,
  config: RetryConfig = DEFAULT_RETRY,
): Promise<Response> {
  let lastError: Error | null = null;
  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      const response = await fetch(url, init);
      if (response.ok) return response;
      if (![408, 429, 500, 502, 503, 504].includes(response.status)) {
        throw new AgentAPIError(response.status, await response.text());
      }
      if (attempt === config.maxRetries) {
        throw new AgentAPIError(response.status, await response.text());
      }
      // Retry-After may be absent or an HTTP date; guard against NaN.
      const retryAfterMs = parseFloat(response.headers.get('Retry-After') ?? '') * 1000;
      const delay = Number.isFinite(retryAfterMs)
        ? Math.min(retryAfterMs, config.maxDelay)
        : Math.min(config.initialDelay * config.backoffFactor ** attempt, config.maxDelay);
      await new Promise(resolve => setTimeout(resolve, delay));
    } catch (error) {
      if (error instanceof AgentAPIError) throw error;
      lastError = error as Error;
      if (attempt === config.maxRetries) throw lastError;
      const delay = Math.min(
        config.initialDelay * config.backoffFactor ** attempt,
        config.maxDelay,
      );
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError ?? new Error('Retry exhausted');
}
Timeout Configuration
Offer multiple timeout levels — connection timeout, read timeout, and total request timeout:
@dataclass
class TimeoutConfig:
    connect: float = 5.0   # seconds to establish a connection
    read: float = 30.0     # seconds to read the response
    total: float = 60.0    # total request deadline
AI agent runs can take 30+ seconds. The SDK should default to generous timeouts for run operations while keeping shorter timeouts for metadata queries.
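The total deadline matters most once retries are in play: each attempt can stay under the read timeout while the overall call drags on. One way to enforce it is to track a monotonic start time and shrink each retry sleep to fit the remaining budget. A sketch of that accounting (the helper name is illustrative):

```python
import time

def remaining_budget(started_at: float, total: float) -> float:
    """Seconds left before the total deadline; <= 0 means give up."""
    return total - (time.monotonic() - started_at)

started = time.monotonic()
total = 60.0

# Before each retry, clamp the backoff sleep so it never overruns the deadline:
proposed_delay = 8.0
delay = max(0.0, min(proposed_delay, remaining_budget(started, total)))
print(delay)  # 8.0 — well within a fresh 60s budget
```

With this in place, a retry that would push the call past the total deadline either sleeps for less or is skipped entirely, and the SDK raises a timeout error instead.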
FAQ
Should I add jitter to the backoff delays?
Yes. Without jitter, retrying clients that failed at the same time will retry at the same time, creating a thundering herd. Add random jitter of up to 25% of the calculated delay: delay = delay * (0.75 + random.random() * 0.5). This spreads retry attempts across time and reduces the chance of synchronized retries overwhelming the server.
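The jitter formula from the answer above, as code:

```python
import random

def jittered(delay: float) -> float:
    """Scale a delay by a random factor in [0.75, 1.25)."""
    return delay * (0.75 + random.random() * 0.5)

# Every jittered 2.0s delay lands in [1.5, 2.5):
samples = [jittered(2.0) for _ in range(1000)]
print(min(samples) >= 1.5 and max(samples) < 2.5)  # True
```

Apply this as the last step in calculate_delay, after the max_delay cap, so the cap still bounds the worst case at 1.25x max_delay.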
How do I prevent retries from masking genuine outages?
Log every retry at warning level with the attempt count, status code, and delay. If the SDK exhausts all retries, raise the final error with context about how many attempts were made. Users can monitor retry logs to detect degradation before it becomes a total outage.
Should the SDK respect Retry-After headers with very large values?
Cap Retry-After at your max_delay configuration. A server sending a 300-second Retry-After header is likely indicating a prolonged outage. Rather than blocking the user's thread for five minutes, respect your timeout policy and fail with a clear error message suggesting the user retry later.
#RetryLogic #ErrorHandling #SDKDesign #Resilience #AgenticAI #Python #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.