OpenAI JSON Mode and Structured Outputs: Reliable Data Extraction
Master OpenAI's JSON mode and structured outputs to extract reliable, schema-validated data from LLMs with guaranteed format compliance and Pydantic integration.
The Problem with Unstructured LLM Output
By default, LLMs return free-form text. When you need structured data — a JSON object with specific fields, types, and constraints — you are relying on the model to follow your prompt instructions perfectly. It usually works, but sometimes the model wraps JSON in markdown code fences, adds extra commentary, omits fields, or returns invalid JSON.
OpenAI provides two mechanisms to solve this: JSON mode and structured outputs. Both constrain the model to emit valid JSON (provided the response is not cut off by a token limit), but structured outputs go further by enforcing a specific schema.
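Without either mechanism, the usual workaround for the failure modes above is defensive parsing. The following is a minimal sketch of that workaround, not part of the OpenAI SDK: `parse_loose_json` is a hypothetical helper that strips a surrounding markdown code fence before parsing, and it still fails on extra commentary or invalid JSON.

```python
import json
from typing import Any

FENCE = "`" * 3  # a markdown code fence marker


def parse_loose_json(text: str) -> Any:
    """Parse JSON that may be wrapped in a markdown code fence (hypothetical helper)."""
    stripped = text.strip()
    if stripped.startswith(FENCE) and stripped.endswith(FENCE):
        # Drop the opening fence line (it may carry a language tag) and the closing fence.
        first_newline = stripped.find("\n")
        stripped = stripped[first_newline + 1 : -len(FENCE)].strip()
    # Raises json.JSONDecodeError if what remains is not valid JSON.
    return json.loads(stripped)


wrapped = FENCE + "json\n" + '{"name": "John Smith", "age": 34}' + "\n" + FENCE
person = parse_loose_json(wrapped)
```

Heuristics like this only patch one failure mode at a time, which is the motivation for letting the API enforce the format instead.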
JSON Mode: Guaranteed Valid JSON
JSON mode ensures the model outputs valid JSON, but does not enforce a specific structure:
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract the person's details as JSON with name, age, and city fields."},
        {"role": "user", "content": "John Smith is 34 years old and lives in Chicago."},
    ],
    response_format={"type": "json_object"},
)

# The content is a JSON string; its keys are not guaranteed to match the prompt.
data = json.loads(response.choices[0].message.content)
print(data)
# Example output: {'name': 'John Smith', 'age': 34, 'city': 'Chicago'}
Important: JSON mode requires the word "JSON" to appear somewhere in your system or user message. The API returns an error if it does not.
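Because JSON mode does not enforce structure, the parsed object still needs checking before use. Here is a minimal sketch of such a check for the person example above; `EXPECTED_FIELDS` and `validate_person` are hypothetical names chosen for illustration, not part of any library.

```python
import json

# Hypothetical expectations matching the prompt in the example above.
EXPECTED_FIELDS = {"name": str, "age": int, "city": str}


def validate_person(raw: str) -> dict:
    """Parse a JSON-mode response and check field presence and basic types."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        # Note: bool is a subclass of int in Python; tighten this if that matters.
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for {field}: {type(data[field]).__name__}")
    return data


person = validate_person('{"name": "John Smith", "age": 34, "city": "Chicago"}')
```

A valid JSON response such as `{"name": "John Smith", "age": "34"}` would pass `json.loads` but fail this check twice over: `city` is missing and `age` is a string.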
Structured Outputs: Schema-Enforced JSON
Structured outputs go beyond JSON mode by enforcing a specific JSON schema:
import json

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract product information from the text."},
        {"role": "user", "content": "The MacBook Pro 16-inch costs $2499, weighs 4.8 lbs, and has an M3 Max chip."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "product_info",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "product_name": {"type": "string"},
                    "price_usd": {"type": "number"},
                    "weight_lbs": {"type": "number"},
                    "processor": {"type": "string"},
                },
                # Strict mode expects every property to be listed here.
                "required": ["product_name", "price_usd", "weight_lbs", "processor"],
                "additionalProperties": False,
            },
        },
    },
)

data = json.loads(response.choices[0].message.content)
print(data)
With strict: True, the model is constrained to output JSON that conforms exactly to your schema. Every required field will be present, types will match, and no extra fields will appear.
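Strict mode places requirements on the schema itself: every property must appear in `required`, and objects must set `additionalProperties` to `False`. The sketch below checks only those two rules locally before a request is sent. `strict_schema_problems` is a hypothetical helper, and the API's own validation covers more than this.

```python
def strict_schema_problems(schema: dict, path: str = "$") -> list[str]:
    """Return a list of strict-mode problems found in a JSON schema (partial check)."""
    problems = []
    if schema.get("type") == "object":
        properties = schema.get("properties", {})
        missing = sorted(set(properties) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not in required: {missing}")
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be False")
        # Recurse into nested object properties.
        for name, sub_schema in properties.items():
            problems.extend(strict_schema_problems(sub_schema, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(strict_schema_problems(schema["items"], f"{path}[]"))
    return problems


product_schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price_usd": {"type": "number"},
    },
    "required": ["product_name", "price_usd"],
    "additionalProperties": False,
}

loose_schema = {
    "type": "object",
    "properties": {"product_name": {"type": "string"}, "price_usd": {"type": "number"}},
    "required": ["product_name"],
}
```

Here `strict_schema_problems(product_schema)` returns an empty list, while `loose_schema` is flagged for the property missing from `required` and for the missing `additionalProperties` setting.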
Pydantic Integration
The SDK integrates with Pydantic models for a cleaner developer experience:
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()


class ContactInfo(BaseModel):
    name: str
    email: str
    phone: str
    company: str


response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract contact information from the text."},
        {"role": "user", "content": "Reach out to Sarah Connor at sarah@skynet.com or 555-0199. She works at Cyberdyne Systems."},
    ],
    response_format=ContactInfo,
)

# .parsed is a ContactInfo instance built from the model's JSON output.
contact = response.choices[0].message.parsed
print(f"Name: {contact.name}")
print(f"Email: {contact.email}")
print(f"Phone: {contact.phone}")
print(f"Company: {contact.company}")
The .parse() method automatically converts the Pydantic model into a JSON schema, sends it to the API, and parses the response back into a typed Pydantic instance.
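To make the first step of that conversion concrete, here is a rough sketch of turning a typed class into a strict JSON schema. It is not the SDK's or Pydantic's implementation: it uses a standard-library dataclass as a stand-in so it runs without Pydantic, handles only flat classes with primitive fields, and `ContactSketch`, `JSON_TYPES`, and `to_strict_schema` are hypothetical names.

```python
from dataclasses import dataclass
from typing import get_type_hints

# Mapping from Python primitives to JSON schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


@dataclass
class ContactSketch:
    """Stand-in for the ContactInfo model, using only the standard library."""
    name: str
    email: str
    phone: str
    company: str


def to_strict_schema(cls) -> dict:
    """Build a flat, strict-style JSON schema from a class's type hints."""
    hints = get_type_hints(cls)
    return {
        "type": "object",
        "properties": {field: {"type": JSON_TYPES[tp]} for field, tp in hints.items()},
        "required": list(hints),  # strict mode: every property is required
        "additionalProperties": False,
    }


contact_schema = to_strict_schema(ContactSketch)
```

With Pydantic v2 installed, `ContactInfo.model_json_schema()` shows the schema Pydantic itself generates for the real model.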
Nested and Complex Schemas
Structured outputs support nested objects, arrays, and enums:
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class Severity(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"


class Step(BaseModel):
    description: str
    estimated_hours: float


class BugReport(BaseModel):
    title: str
    severity: Severity
    affected_component: str
    steps_to_reproduce: list[Step]
    expected_behavior: str
    actual_behavior: str


response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Parse the bug report into structured format."},
        {"role": "user", "content": "Critical bug in the payment module. When a user clicks 'Pay Now' with an expired card (takes 2 seconds), the system shows a success message instead of an error. Expected: error message. Actual: success confirmation."},
    ],
    response_format=BugReport,
)

bug = response.choices[0].message.parsed
print(f"Title: {bug.title}")
# Use .value: how a str-based Enum formats in an f-string varies across Python versions.
print(f"Severity: {bug.severity.value}")
print(f"Steps: {len(bug.steps_to_reproduce)}")
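To see locally what the enum constraint means, the short sketch below repeats the `Severity` enum and shows that only its four values are accepted. It uses the standard library only; `coerce_severity` is a hypothetical helper for illustration.

```python
from enum import Enum
from typing import Optional


class Severity(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"


def coerce_severity(value: str) -> Optional[Severity]:
    """Return the matching Severity member, or None if the value is not allowed."""
    try:
        return Severity(value)
    except ValueError:
        return None


accepted = coerce_severity("critical")  # a Severity member
rejected = coerce_severity("urgent")    # not one of the four allowed values
```

With a schema-enforced response, the model is limited to the enum's values, so a string like "urgent" should not reach your code in the first place.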
Handling Refusals
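Sometimes the model refuses to fill the schema (e.g., for safety reasons). In that case message.refusal holds the refusal text and message.parsed is None, so check for a refusal before using the parsed result: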
# Reuses client and ContactInfo from the Pydantic example above.
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract the information."},
        {"role": "user", "content": "Some input text here."},
    ],
    response_format=ContactInfo,
)

message = response.choices[0].message
if message.refusal:
    print(f"Model refused: {message.refusal}")
else:
    contact = message.parsed
    print(contact)
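In a pipeline, it is often cleaner to turn a refusal into an exception than to branch at every call site. The sketch below shows one way to do that. `ModelRefusedError` and `parsed_or_raise` are hypothetical names, and the `SimpleNamespace` objects are stand-ins for the SDK's message object so the example runs without an API call.

```python
from types import SimpleNamespace


class ModelRefusedError(RuntimeError):
    """Hypothetical exception raised when the model declines to fill the schema."""


def parsed_or_raise(message):
    """Return message.parsed, raising if the model refused or nothing was parsed."""
    if getattr(message, "refusal", None):
        raise ModelRefusedError(message.refusal)
    if message.parsed is None:
        raise ValueError("no parsed content in the response")
    return message.parsed


# Stand-ins with the same two attributes used in the example above.
ok = SimpleNamespace(refusal=None, parsed={"name": "Sarah Connor"})
refused = SimpleNamespace(refusal="I can't help with that.", parsed=None)

contact = parsed_or_raise(ok)
```

Calling `parsed_or_raise(refused)` raises `ModelRefusedError` carrying the refusal text, which the caller can log or surface to the user.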
Practical Example: Invoice Parsing
Here is a realistic data extraction pipeline:
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float


class Invoice(BaseModel):
    invoice_number: str
    date: str
    vendor_name: str
    line_items: list[LineItem]
    subtotal: float
    tax: float
    total: float


def parse_invoice(raw_text: str) -> Invoice:
    """Extract a structured Invoice from raw invoice text."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Parse the invoice text into structured data. Calculate totals if not explicitly stated."},
            {"role": "user", "content": raw_text},
        ],
        response_format=Invoice,
    )
    message = response.choices[0].message
    # parsed is None when the model refuses, so fail loudly instead of returning None.
    if message.refusal:
        raise ValueError(f"Model refused to parse the invoice: {message.refusal}")
    return message.parsed
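A schema guarantees the shape of the result, not the arithmetic, and the prompt above asks the model to calculate totals. A cheap consistency check after parsing can catch bad sums. The sketch below uses standard-library dataclasses as stand-ins for the Pydantic models so it runs on its own. `invoice_problems` is a hypothetical helper, and the 0.01 tolerance is an assumption about acceptable rounding.

```python
import math
from dataclasses import dataclass


@dataclass
class LineItemSketch:
    description: str
    quantity: int
    unit_price: float
    total: float


@dataclass
class InvoiceSketch:
    line_items: list
    subtotal: float
    tax: float
    total: float


def invoice_problems(invoice, tolerance: float = 0.01) -> list[str]:
    """Return human-readable arithmetic inconsistencies found in an invoice."""
    problems = []
    for item in invoice.line_items:
        if not math.isclose(item.quantity * item.unit_price, item.total, abs_tol=tolerance):
            problems.append(f"line total mismatch: {item.description}")
    line_sum = sum(item.total for item in invoice.line_items)
    if not math.isclose(line_sum, invoice.subtotal, abs_tol=tolerance):
        problems.append("subtotal does not match line items")
    if not math.isclose(invoice.subtotal + invoice.tax, invoice.total, abs_tol=tolerance):
        problems.append("total does not equal subtotal + tax")
    return problems


items = [
    LineItemSketch("Widget", 2, 10.0, 20.0),
    LineItemSketch("Gasket", 1, 5.5, 5.5),
]
good = InvoiceSketch(items, subtotal=25.5, tax=2.04, total=27.54)
bad = InvoiceSketch(items, subtotal=25.5, tax=2.04, total=30.0)
```

Because the function only reads attributes, it should work unchanged on the `Invoice` object returned by `parse_invoice`.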
FAQ
What is the difference between JSON mode and structured outputs?
JSON mode guarantees the output is valid JSON but does not enforce a specific structure. Structured outputs enforce a specific JSON schema with exact field names, types, and constraints. Use JSON mode for flexibility, structured outputs for reliability.
Do structured outputs work with all OpenAI models?
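Structured outputs with json_schema were introduced with gpt-4o-2024-08-06 and gpt-4o-mini and are supported by later models; older models do not accept the json_schema response format. JSON mode (json_object) is more widely supported, including GPT-4o, GPT-4o-mini, and GPT-3.5-turbo. Check the API documentation for the latest model compatibility.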
Can I use optional fields in structured output schemas?
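With strict: True, all properties must be listed in required, so a field cannot simply be left out. To make a field optional in practice, allow null as a value with a union type: {"type": ["string", "null"]}. The model still emits the key but may set it to null. In Pydantic, use Optional[str] with a default of None.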
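As a small illustration of the nullable pattern, the sketch below builds a schema where `phone` is required but may be null, and shows what such a response looks like once parsed. `nullable` and `contact_schema` are hypothetical names, and the sample response string is made up for the example.

```python
import json


def nullable(json_type: str) -> dict:
    """Return a schema fragment that accepts the given type or null."""
    return {"type": [json_type, "null"]}


contact_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "phone": nullable("string"),  # optional in practice: may be null
    },
    "required": ["name", "phone"],  # still listed, as strict mode expects
    "additionalProperties": False,
}

# A made-up response in which no phone number was found in the text.
sample = json.loads('{"name": "Sarah Connor", "phone": null}')
```

After parsing, `sample["phone"]` is `None`, so downstream code can test for a missing value without checking whether the key exists.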
CallSphere Team