Handling Structured Output Failures: Retries, Fallbacks, and Partial Parsing
Build resilient structured output systems that handle LLM failures gracefully. Learn retry strategies with exponential backoff, fallback schemas, partial result recovery, and graceful degradation patterns.
Why Structured Outputs Fail
Even with OpenAI's constrained decoding and Pydantic validation, structured output extraction fails in production. Common failure modes include:
- API errors: Rate limits (429), server errors (500), timeouts
- Validation errors: The model returns valid JSON that fails your business logic validators
- Content refusals: The model refuses to process the input due to safety filters
- Malformed output: Rare with strict mode, but possible with JSON mode or local models
- Hallucination: The JSON is valid and schema-conforming, but the extracted values are wrong
A production system must handle every one of these gracefully. Crashing on the first validation error is not acceptable.
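Each failure mode calls for a different response. As a rough sketch of that dispatch (the exception classes here are local stand-ins for the real `openai` and `pydantic` ones):

```python
# Stand-in exception classes -- in production you would match on
# openai.RateLimitError, pydantic.ValidationError, and so on.
class TransientAPIError(Exception): ...
class SchemaValidationError(Exception): ...
class ContentRefusal(Exception): ...

def handling_strategy(error: Exception) -> str:
    """Decide how to react to an extraction failure."""
    if isinstance(error, TransientAPIError):
        return "retry_with_backoff"    # transient: wait and retry
    if isinstance(error, SchemaValidationError):
        return "retry_with_feedback"   # feed the errors back to the model
    if isinstance(error, ContentRefusal):
        return "fallback_schema"       # retrying rarely helps; degrade instead
    return "fail_and_log"              # unknown: surface to operators
```

The sections below implement each of these strategies in turn.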
Retry with Exponential Backoff
The simplest resilience pattern is to retry transient API failures with increasing delays:
import time
import random
import functools
from typing import TypeVar, Callable
from openai import RateLimitError, APITimeoutError, APIConnectionError

T = TypeVar("T")

def retry_with_backoff(
    func: Callable[..., T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retryable_errors: tuple = (RateLimitError, APITimeoutError, APIConnectionError),
) -> Callable[..., T]:
    """Decorator that retries a function with exponential backoff and jitter."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> T:
        last_error = None
        for attempt in range(max_retries):
            try:
                return func(*args, **kwargs)
            except retryable_errors as e:
                last_error = e
                if attempt == max_retries - 1:
                    break  # No point sleeping before the final raise
                delay = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
                print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.1f}s")
                time.sleep(delay)
            # Non-retryable errors propagate immediately
        raise last_error
    return wrapper
The jitter (random.uniform(0, 1)) prevents thundering herd problems when multiple processes hit rate limits simultaneously.
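To see the pattern in action without waiting on real delays, here is a self-contained demo: a simplified variant of the decorator with an injectable `sleep`, applied to a function that fails twice and then succeeds (simulating transient API errors):

```python
import random

def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0,
                       retryable_errors=(Exception,), sleep=lambda s: None):
    """Same shape as the decorator above, with sleep injectable for testing."""
    def wrapper(*args, **kwargs):
        last_error = None
        for attempt in range(max_retries):
            try:
                return func(*args, **kwargs)
            except retryable_errors as e:
                last_error = e
                delay = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
                sleep(delay)
        raise last_error
    return wrapper

calls = {"n": 0}

def flaky_extract():
    """Fails twice, then succeeds -- simulates transient API errors."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return {"status": "ok"}

result = retry_with_backoff(flaky_extract)()
```

The injectable `sleep` is the key to keeping this testable; in production the default is `time.sleep`.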
Validation-Aware Retries with Instructor
Instructor's built-in retry mechanism feeds validation errors back to the model. Customize this behavior:
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, field_validator
from typing import List

client = instructor.from_openai(OpenAI())

class StrictProduct(BaseModel):
    name: str
    price: float = Field(gt=0)
    currency: str = Field(pattern=r"^[A-Z]{3}$")
    categories: List[str] = Field(min_length=1, max_length=5)

    @field_validator("name")
    @classmethod
    def name_not_generic(cls, v: str) -> str:
        generic_names = {"product", "item", "thing", "unknown"}
        if v.lower().strip() in generic_names:
            raise ValueError(f"Name '{v}' is too generic. Extract the actual product name.")
        return v

# Instructor automatically retries with validation errors in the prompt
product = client.chat.completions.create(
    model="gpt-4o",
    response_model=StrictProduct,
    max_retries=3,  # Will retry up to 3 times on validation failure
    messages=[
        {"role": "user", "content": "The new widget costs fifteen dollars."}
    ],
)
On each retry, the model sees its previous output and the exact validation error, allowing it to self-correct.
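The underlying mechanism is roughly this loop: validate the output, and on failure append both the output and the exact error to the conversation before asking again. A minimal stdlib sketch with a stubbed model (the stub returns a valid name only once it sees feedback):

```python
def retry_with_feedback(call_model, validate, max_retries=3):
    """call_model(messages) -> str; validate(s) raises ValueError on bad output."""
    messages = [{"role": "user", "content": "Extract the product name."}]
    last_error = None
    for _ in range(max_retries):
        output = call_model(messages)
        try:
            return validate(output)
        except ValueError as e:
            last_error = e
            # Feed the model its own output plus the exact validation error
            messages.append({"role": "assistant", "content": output})
            messages.append({"role": "user", "content": f"Validation failed: {e}. Fix it."})
    raise last_error

def stub_model(messages):
    # Returns a generic name on the first call, a real one after feedback
    return "widget" if len(messages) > 1 else "product"

def validate_name(name):
    if name in {"product", "item", "unknown"}:
        raise ValueError(f"Name '{name}' is too generic")
    return name

fixed = retry_with_feedback(stub_model, validate_name)
```

Instructor does this for you; the sketch is only to make the feedback loop concrete.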
Fallback Schemas
When a detailed extraction fails repeatedly, fall back to a simpler schema that captures partial data:
class DetailedExtraction(BaseModel):
    company_name: str
    founding_year: int
    revenue: float
    employee_count: int
    headquarters_city: str
    headquarters_country: str
    industry: str
    ceo_name: str

class FallbackExtraction(BaseModel):
    company_name: str
    raw_details: str = Field(description="Any other details as free text")
    extraction_complete: bool = False

def extract_with_fallback(text: str) -> DetailedExtraction | FallbackExtraction:
    """Try detailed extraction first, fall back to simple on failure."""
    try:
        return client.chat.completions.create(
            model="gpt-4o",
            response_model=DetailedExtraction,
            max_retries=2,
            messages=[
                {"role": "system", "content": "Extract company details precisely."},
                {"role": "user", "content": text}
            ],
        )
    except Exception as e:
        print(f"Detailed extraction failed: {e}. Trying fallback.")
        return client.chat.completions.create(
            model="gpt-4o",
            response_model=FallbackExtraction,
            max_retries=1,
            messages=[
                {"role": "system", "content": "Extract whatever company information you can."},
                {"role": "user", "content": text}
            ],
        )
The fallback captures the company name (almost always extractable) and dumps everything else into free text. This is better than returning nothing — downstream systems can still use the partial data.
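The same idea generalizes to a chain of progressively simpler extractors tried in order. A sketch (here `extractors` is any list of callables that raise on failure; the two stubs stand in for real LLM calls):

```python
def extract_with_chain(text, extractors):
    """Try each extractor in order; return the first success plus its tier index."""
    errors = []
    for tier, extract in enumerate(extractors):
        try:
            return extract(text), tier
        except Exception as e:
            errors.append(e)
    raise RuntimeError(f"all {len(extractors)} extraction tiers failed: {errors}")

# Stubs standing in for LLM-backed extractors
def detailed(text):
    raise ValueError("missing fields")   # simulate repeated validation failure

def fallback(text):
    return {"company_name": text.split()[0], "raw_details": text}

data, tier = extract_with_chain("Acme makes widgets.", [detailed, fallback])
```

Recording which tier succeeded (`tier`) lets you monitor how often the system degrades.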
Partial Result Recovery
When extracting a list of items, some may validate while others fail. Recover the valid ones:
from pydantic import ValidationError

class Transaction(BaseModel):
    date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")
    amount: float = Field(gt=0)
    merchant: str
    category: str

def extract_transactions_with_recovery(raw_items: list[dict]) -> tuple[list[Transaction], list[dict]]:
    """Parse a list of raw dicts, separating valid from invalid."""
    valid = []
    invalid = []
    for item in raw_items:
        try:
            valid.append(Transaction.model_validate(item))
        except ValidationError as e:
            invalid.append({"data": item, "errors": e.errors()})
    return valid, invalid

# Example usage after getting raw JSON from the LLM
import json

raw_response = '''[
    {"date": "2025-03-15", "amount": 42.50, "merchant": "Coffee Shop", "category": "food"},
    {"date": "March 15", "amount": -10, "merchant": "", "category": "other"},
    {"date": "2025-03-16", "amount": 120.00, "merchant": "Gas Station", "category": "transport"}
]'''

raw_items = json.loads(raw_response)
valid, invalid = extract_transactions_with_recovery(raw_items)
print(f"Recovered {len(valid)} of {len(raw_items)} transactions")
print(f"Failed items: {len(invalid)}")
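Before re-prompting the LLM for the failed items, a cheap deterministic repair pass often recovers some of them. A sketch (the two normalizations shown are examples, not an exhaustive list):

```python
from datetime import datetime

def repair_transaction(item: dict) -> dict:
    """Apply cheap deterministic fixes before re-validating or re-prompting."""
    fixed = dict(item)
    # Normalize common non-ISO date formats like "03/15/2025"
    for fmt in ("%B %d, %Y", "%m/%d/%Y", "%d %B %Y"):
        try:
            fixed["date"] = datetime.strptime(item.get("date", ""), fmt).strftime("%Y-%m-%d")
            break
        except ValueError:
            continue
    # Refunds often come back negative; store the magnitude and flag the sign
    if isinstance(fixed.get("amount"), (int, float)) and fixed["amount"] < 0:
        fixed["amount"] = abs(fixed["amount"])
        fixed["is_refund"] = True
    return fixed

repaired = repair_transaction(
    {"date": "03/15/2025", "amount": -10, "merchant": "Shop", "category": "other"}
)
```

Items that still fail after the repair pass are the ones worth spending an LLM retry on.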
Graceful Degradation Pipeline
Combine all patterns into a complete resilience pipeline:
from dataclasses import dataclass
from typing import Any

@dataclass
class ExtractionResult:
    data: Any
    quality: str  # "full", "partial", "fallback", "failed"
    errors: list[str]
    attempts: int

def resilient_extract(text: str) -> ExtractionResult:
    errors = []

    # Attempt 1: Full extraction with strict model
    try:
        result = client.chat.completions.create(
            model="gpt-4o",
            response_model=DetailedExtraction,
            max_retries=2,
            messages=[
                {"role": "system", "content": "Extract all company details."},
                {"role": "user", "content": text}
            ],
        )
        return ExtractionResult(data=result, quality="full", errors=[], attempts=1)
    except Exception as e:
        errors.append(f"Full extraction failed: {e}")

    # Attempt 2: Fallback schema with cheaper model
    try:
        result = client.chat.completions.create(
            model="gpt-4o-mini",
            response_model=FallbackExtraction,
            max_retries=1,
            messages=[
                {"role": "system", "content": "Extract basic company info."},
                {"role": "user", "content": text}
            ],
        )
        return ExtractionResult(data=result, quality="fallback", errors=errors, attempts=2)
    except Exception as e:
        errors.append(f"Fallback extraction failed: {e}")

    return ExtractionResult(data=None, quality="failed", errors=errors, attempts=2)
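Downstream code then branches on the quality tier instead of on exception types. A sketch of that routing (the action names are illustrative):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ExtractionResult:
    data: Any
    quality: str  # "full", "partial", "fallback", "failed"
    errors: list = field(default_factory=list)
    attempts: int = 1

def route_result(result: ExtractionResult) -> str:
    """Decide what the pipeline does with each quality tier."""
    if result.quality == "full":
        return "write_to_db"
    if result.quality in ("partial", "fallback"):
        return "write_to_db_flagged"   # usable, but mark for human review
    return "dead_letter_queue"         # failed: park for manual triage

action = route_result(ExtractionResult(data=None, quality="fallback"))
```

Because every path returns an `ExtractionResult`, the caller never needs a try/except of its own.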
FAQ
How many retries should I configure for production systems?
For API errors (rate limits, timeouts), use 3-5 retries with exponential backoff. For validation errors via Instructor, use 2-3 retries — if the model cannot produce valid output in 3 attempts, more retries rarely help and you should fall back to a simpler schema. Total retry budget should stay under 30 seconds for user-facing applications.
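The 30-second budget follows directly from the backoff schedule. With base_delay=1.0 and five retries, the cumulative worst-case wait (ignoring jitter) is already about 31 seconds:

```python
base_delay, max_delay, max_retries = 1.0, 60.0, 5

# Worst-case sleeps before each retry: 1 + 2 + 4 + 8 + 16 = 31 seconds
total_wait = sum(min(base_delay * 2 ** attempt, max_delay) for attempt in range(max_retries))
```

For user-facing paths, either lower `max_retries` or cap the schedule with a smaller `max_delay`.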
How do I log structured output failures for debugging?
Log the full context: input text, raw LLM response, validation errors, retry count, and which fallback stage succeeded. Use structured logging (JSON format) so you can query failures by error type, schema, and model. This data is invaluable for identifying which validators are too strict and which input patterns cause consistent failures.
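A structured failure record can be as simple as one JSON object per failure. A sketch (the field names are suggestions, not a standard):

```python
import json

def failure_record(input_text, raw_response, errors, retries, stage):
    """One JSON-serializable record per failure, queryable by field."""
    return {
        "event": "extraction_failure",
        "input_preview": input_text[:200],  # truncate; full text goes to blob storage
        "raw_response": raw_response,
        "validation_errors": errors,
        "retry_count": retries,
        "fallback_stage": stage,            # which tier eventually succeeded, if any
    }

line = json.dumps(failure_record(
    "The widget costs...", '{"name": "product"}',
    ["name too generic"], 3, "fallback",
))
```

Emit `line` through your logging pipeline and every field becomes queryable.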
Should I use circuit breakers for LLM extraction?
Yes, especially in high-throughput systems. If the LLM API returns errors on 50%+ of recent requests, stop sending new requests for a cooldown period (30-60 seconds). This prevents cascading failures and wasted API spend. Libraries like tenacity and pybreaker make this easy to implement.
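A minimal circuit breaker over a sliding window of recent outcomes looks roughly like this (a sketch; tenacity and pybreaker give you production-grade versions):

```python
import time
from collections import deque

class CircuitBreaker:
    """Open the circuit when the recent failure rate crosses a threshold."""
    def __init__(self, window=20, failure_threshold=0.5, cooldown=30.0,
                 clock=time.monotonic):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.opened_at = None
        self.clock = clock

    def record(self, success: bool) -> None:
        self.outcomes.append(success)
        failures = self.outcomes.count(False)
        # Require a minimum sample before tripping on a high failure rate
        if len(self.outcomes) >= 10 and failures / len(self.outcomes) >= self.failure_threshold:
            self.opened_at = self.clock()

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            self.opened_at = None  # cooldown elapsed: half-open, let one through
            return True
        return False

breaker = CircuitBreaker()
for ok in [False] * 6 + [True] * 4:  # 60% failure rate over 10 calls
    breaker.record(ok)
blocked = not breaker.allow_request()
```

Check `allow_request()` before each LLM call and `record()` the outcome after it.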
#ErrorHandling #Resilience #StructuredOutputs #Production #Python #AgenticAI #LearnAI #AIEngineering