OpenAI Vision API: Building Applications That Understand Images
Learn how to use OpenAI's Vision API to analyze images, send base64-encoded and URL-based images, build multi-modal prompts, and create practical image understanding applications.
What Is the Vision API?
OpenAI's Vision API lets you send images alongside text to models like GPT-4o and receive intelligent analysis, descriptions, or data extraction based on the visual content. The model can read text in images, describe scenes, analyze charts, identify objects, compare images, and answer questions about visual content.
This capability unlocks applications that were previously impossible with text-only models: document processing, visual QA systems, accessibility tools, UI analysis, and more.
Sending an Image via URL
The simplest approach is to pass a publicly accessible image URL:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image? Describe it in detail."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg",
                    },
                },
            ],
        },
    ],
)

print(response.choices[0].message.content)
Notice that the content field is now an array of content parts, mixing text and image inputs. This is the multi-modal message format.
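If you build vision requests in several places, it can help to wrap this content-parts format in a small helper. Here is a minimal sketch; the `vision_message` name is our own, not part of the SDK:

```python
def vision_message(prompt: str, image_urls: list[str]) -> dict:
    """Build a multi-modal user message: one text part followed by image parts."""
    parts = [{"type": "text", "text": prompt}]
    parts += [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    return {"role": "user", "content": parts}
```

The returned dict can be passed directly as an element of the `messages` list.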
Sending Base64-Encoded Images
For local files or dynamically generated images, encode them as base64:
import base64

from openai import OpenAI

client = OpenAI()


def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")


image_data = encode_image("screenshot.png")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text visible in this screenshot."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_data}",
                    },
                },
            ],
        },
    ],
)

print(response.choices[0].message.content)
Supported formats include PNG, JPEG, GIF (first frame), and WebP. The data URL must include the correct MIME type.
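Getting the MIME type wrong is an easy mistake with hardcoded data URL prefixes. The standard-library mimetypes module can infer it from the file extension; here is a sketch (the `to_data_url` helper is our own naming, not part of the SDK):

```python
import base64
import mimetypes

SUPPORTED_TYPES = {"image/png", "image/jpeg", "image/gif", "image/webp"}


def to_data_url(image_path: str) -> str:
    """Encode a local image as a data URL with the MIME type inferred from its extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported image format: {image_path}")
    with open(image_path, "rb") as f:
        encoded = base64.standard_b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

This also rejects file types the API cannot process before you spend a network round trip on them.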
Controlling Image Detail Level
The detail parameter controls how the model processes the image:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Is this a cat or a dog?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/pet.jpg",
                        "detail": "low",  # or "high" or "auto"
                    },
                },
            ],
        },
    ],
)
- low — Uses a fixed 512x512 thumbnail. Fastest and cheapest. Good for simple classification tasks.
- high — Processes the full-resolution image with multiple crops. Best for reading small text, analyzing details, or complex visual tasks.
- auto (default) — The model decides based on the image size and content.
Multiple Images in One Request
Send several images for comparison or batch analysis:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two UI designs. Which one has better visual hierarchy and why?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/design_a.png"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/design_b.png"},
                },
            ],
        },
    ],
)

print(response.choices[0].message.content)
Practical Example: Document Data Extraction
Combine vision with structured outputs to extract data from images of forms, receipts, or documents:
import base64

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class ReceiptData(BaseModel):
    store_name: str
    date: str
    items: list[dict]
    subtotal: float
    tax: float
    total: float
    payment_method: str


def extract_receipt(image_path: str) -> ReceiptData:
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Extract all information from this receipt image into structured data.",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Parse this receipt."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_data}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        response_format=ReceiptData,
    )
    return response.choices[0].message.parsed


receipt = extract_receipt("receipt.jpg")
print(f"Store: {receipt.store_name}")
print(f"Total: ${receipt.total:.2f}")
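Vision models occasionally misread digits, so it is worth sanity-checking extracted numbers before trusting them. One cheap check is arithmetic consistency; this is our own addition, not part of the API:

```python
def totals_consistent(subtotal: float, tax: float, total: float,
                      tolerance: float = 0.02) -> bool:
    """Flag receipts where the extracted subtotal plus tax does not match the total."""
    return abs((subtotal + tax) - total) <= tolerance
```

If `totals_consistent(receipt.subtotal, receipt.tax, receipt.total)` returns False, you might retry with a higher detail setting or route the receipt for manual review.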
Building an Accessibility Description Generator
Use vision to create alt text for images automatically:
import base64

from openai import OpenAI

client = OpenAI()


def generate_alt_text(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Generate concise, descriptive alt text for web accessibility. "
                "Focus on the key visual content and context. Keep it under 125 characters.",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Generate alt text for this image."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_data}",
                            "detail": "low",
                        },
                    },
                ],
            },
        ],
        max_tokens=100,
    )
    return response.choices[0].message.content


alt = generate_alt_text("hero-banner.png")
print(f'<img src="hero-banner.png" alt="{alt}" />')
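The model usually respects the 125-character instruction, but prompt-level limits are not guaranteed, so enforcing the cap in code is a reasonable safeguard. A small sketch (our own helper, trimming at a word boundary):

```python
def truncate_alt(text: str, limit: int = 125) -> str:
    """Hard-enforce an alt-text length limit, cutting at the last word boundary."""
    if len(text) <= limit:
        return text
    return text[: limit - 1].rsplit(" ", 1)[0].rstrip(" ,;:") + "…"
```

Run the model's output through this before writing it into the alt attribute.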
FAQ
What is the maximum image size I can send?
OpenAI accepts images up to 20MB each. For base64-encoded images, the encoded string will be approximately 33% larger than the original file. If your image is too large, resize it before sending — the model works well with images in the 1024x1024 to 2048x2048 range.
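A quick way to pick target dimensions before resizing (for example with Pillow's `Image.thumbnail`) is to scale the longest side down while preserving the aspect ratio. A stdlib-only sketch; `fit_within` is our own helper name:

```python
def fit_within(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """Scale dimensions so the longest side fits within max_side, keeping aspect ratio."""
    scale = max_side / max(width, height)
    if scale >= 1.0:
        return width, height  # already small enough, no resize needed
    return round(width * scale), round(height * scale)
```

For example, a 4096x2048 photo comes back as 2048x1024, which keeps detail while cutting upload size.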
How are images counted toward the token limit?
Images consume tokens based on their resolution and detail setting. A low detail image costs a fixed 85 tokens. A high detail image is split into 512x512 tiles, each costing 170 tokens, plus a base 85 tokens. A 2048x2048 high-detail image costs around 765 tokens.
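Those rules can be turned into a back-of-the-envelope estimator. The sketch below follows the published sizing rules for GPT-4o-class models (fit within 2048x2048, scale the short side down to 768, then count 512x512 tiles); exact billing can vary by model, so treat it as an estimate:

```python
import math


def high_detail_tokens(width: int, height: int) -> int:
    """Estimate the token cost of a high-detail image.

    Steps: scale to fit within 2048x2048, scale the shorter side down to 768,
    then charge 170 tokens per 512x512 tile plus a fixed 85 base tokens.
    """
    # Scale down to fit within a 2048 x 2048 square (never upscale)
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Scale down so the shorter side is at most 768
    scale = 768 / min(w, h)
    if scale < 1.0:
        w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

For the 2048x2048 example above, this yields the 765 tokens quoted: the image is scaled to 768x768, which splits into four tiles (4 x 170 + 85).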
Can the model generate images or only analyze them?
The Chat Completions API with vision is analysis-only — it understands images but does not create them. For image generation, use the DALL-E API via client.images.generate().
CallSphere Team