OpenAI Vision API: Building Applications That Understand Images
Learn how to use OpenAI's Vision API to analyze images, send base64-encoded and URL-based images, build multi-modal prompts, and create practical image understanding applications.
What Is the Vision API?
OpenAI's Vision API lets you send images alongside text to models like GPT-4o and receive intelligent analysis, descriptions, or data extraction based on the visual content. The model can read text in images, describe scenes, analyze charts, identify objects, compare images, and answer questions about visual content.
This capability unlocks applications that were previously impossible with text-only models: document processing, visual QA systems, accessibility tools, UI analysis, and more.
Sending an Image via URL
The simplest approach is to pass a publicly accessible image URL:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image? Describe it in detail."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg",
                    },
                },
            ],
        },
    ],
)

print(response.choices[0].message.content)
Notice that the content field is now an array of content parts, mixing text and image inputs. This is the multi-modal message format.
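If you build vision requests in several places, it can help to wrap this content-parts format in a small helper. Here is a minimal sketch; the `vision_message` name is our own, not part of the SDK:

```python
def vision_message(prompt: str, image_urls: list[str]) -> dict:
    """Build a multi-modal user message: one text part followed by image parts."""
    parts = [{"type": "text", "text": prompt}]
    parts += [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    return {"role": "user", "content": parts}
```

The returned dict can be passed directly as an element of the `messages` list.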
Sending Base64-Encoded Images
For local files or dynamically generated images, encode them as base64:
import base64

from openai import OpenAI

client = OpenAI()


def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")


image_data = encode_image("screenshot.png")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text visible in this screenshot."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_data}",
                    },
                },
            ],
        },
    ],
)

print(response.choices[0].message.content)
Supported formats include PNG, JPEG, GIF (first frame), and WebP. The data URL must include the correct MIME type.
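Getting the MIME type wrong is an easy mistake with hardcoded data URL prefixes. The standard-library mimetypes module can infer it from the file extension; here is a sketch (the `to_data_url` helper is our own naming, not part of the SDK):

```python
import base64
import mimetypes

SUPPORTED_TYPES = {"image/png", "image/jpeg", "image/gif", "image/webp"}


def to_data_url(image_path: str) -> str:
    """Encode a local image as a data URL with the MIME type inferred from its extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported image format: {image_path}")
    with open(image_path, "rb") as f:
        encoded = base64.standard_b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

This also rejects file types the API cannot process before you spend a network round trip on them.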
Controlling Image Detail Level
The detail parameter controls how the model processes the image:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Is this a cat or a dog?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/pet.jpg",
                        "detail": "low",  # or "high" or "auto"
                    },
                },
            ],
        },
    ],
)
- low — Uses a fixed 512x512 thumbnail. Fastest and cheapest. Good for simple classification tasks.
- high — Processes the full-resolution image with multiple crops. Best for reading small text, analyzing details, or complex visual tasks.
- auto (default) — The model decides based on the image size and content.
Multiple Images in One Request
Send several images for comparison or batch analysis:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two UI designs. Which one has better visual hierarchy and why?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/design_a.png"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/design_b.png"},
                },
            ],
        },
    ],
)

print(response.choices[0].message.content)
Practical Example: Document Data Extraction
Combine vision with structured outputs to extract data from images of forms, receipts, or documents:
import base64

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class ReceiptData(BaseModel):
    store_name: str
    date: str
    items: list[dict]
    subtotal: float
    tax: float
    total: float
    payment_method: str


def extract_receipt(image_path: str) -> ReceiptData:
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Extract all information from this receipt image into structured data.",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Parse this receipt."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_data}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        response_format=ReceiptData,
    )
    return response.choices[0].message.parsed


receipt = extract_receipt("receipt.jpg")
print(f"Store: {receipt.store_name}")
print(f"Total: ${receipt.total:.2f}")
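Vision models occasionally misread digits, so it is worth sanity-checking extracted numbers before trusting them. One cheap check is arithmetic consistency; this is our own addition, not part of the API:

```python
def totals_consistent(subtotal: float, tax: float, total: float,
                      tolerance: float = 0.02) -> bool:
    """Flag receipts where the extracted subtotal plus tax does not match the total."""
    return abs((subtotal + tax) - total) <= tolerance
```

If `totals_consistent(receipt.subtotal, receipt.tax, receipt.total)` returns False, you might retry with a higher detail setting or route the receipt for manual review.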
Building an Accessibility Description Generator
Use vision to create alt text for images automatically:
import base64

from openai import OpenAI

client = OpenAI()


def generate_alt_text(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Generate concise, descriptive alt text for web accessibility. "
                "Focus on the key visual content and context. Keep it under 125 characters.",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Generate alt text for this image."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_data}",
                            "detail": "low",
                        },
                    },
                ],
            },
        ],
        max_tokens=100,
    )
    return response.choices[0].message.content


alt = generate_alt_text("hero-banner.png")
print(f'<img src="hero-banner.png" alt="{alt}" />')
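The model usually respects the 125-character instruction, but prompt-level limits are not guaranteed, so enforcing the cap in code is a reasonable safeguard. A small sketch (our own helper, trimming at a word boundary):

```python
def truncate_alt(text: str, limit: int = 125) -> str:
    """Hard-enforce an alt-text length limit, cutting at the last word boundary."""
    if len(text) <= limit:
        return text
    return text[: limit - 1].rsplit(" ", 1)[0].rstrip(" ,;:") + "…"
```

Run the model's output through this before writing it into the alt attribute.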
FAQ
What is the maximum image size I can send?
OpenAI accepts images up to 20MB each. For base64-encoded images, the encoded string will be approximately 33% larger than the original file. If your image is too large, resize it before sending — the model works well with images in the 1024x1024 to 2048x2048 range.
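A quick way to pick target dimensions before resizing (for example with Pillow's `Image.thumbnail`) is to scale the longest side down while preserving the aspect ratio. A stdlib-only sketch; `fit_within` is our own helper name:

```python
def fit_within(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """Scale dimensions so the longest side fits within max_side, keeping aspect ratio."""
    scale = max_side / max(width, height)
    if scale >= 1.0:
        return width, height  # already small enough, no resize needed
    return round(width * scale), round(height * scale)
```

For example, a 4096x2048 photo comes back as 2048x1024, which keeps detail while cutting upload size.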
How are images counted toward the token limit?
Images consume tokens based on their resolution and detail setting. A low detail image costs a fixed 85 tokens. A high detail image is split into 512x512 tiles, each costing 170 tokens, plus a base 85 tokens. A 2048x2048 high-detail image costs around 765 tokens.
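Those rules can be turned into a back-of-the-envelope estimator. The sketch below follows the published sizing rules for GPT-4o-class models (fit within 2048x2048, scale the short side down to 768, then count 512x512 tiles); exact billing can vary by model, so treat it as an estimate:

```python
import math


def high_detail_tokens(width: int, height: int) -> int:
    """Estimate the token cost of a high-detail image.

    Steps: scale to fit within 2048x2048, scale the shorter side down to 768,
    then charge 170 tokens per 512x512 tile plus a fixed 85 base tokens.
    """
    # Scale down to fit within a 2048 x 2048 square (never upscale)
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Scale down so the shorter side is at most 768
    scale = 768 / min(w, h)
    if scale < 1.0:
        w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

For the 2048x2048 example above, this yields the 765 tokens quoted: the image is scaled to 768x768, which splits into four tiles (4 x 170 + 85).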
Can the model generate images or only analyze them?
The Chat Completions API with vision is analysis-only — it understands images but does not create them. For image generation, use the DALL-E API via client.images.generate().
CallSphere Team