Claude Vision: Building Multi-Modal Agents That Understand Images and Documents
Build multi-modal agents that process images, PDFs, and diagrams using Claude's vision capabilities. Learn how to send image data via the API, analyze documents, and combine vision with tool use.
Claude's Vision Capabilities
Claude can process images as part of its input, enabling agents that understand screenshots, photographs, diagrams, charts, documents, and handwritten text. This is not a separate vision API — images are simply another content type within the standard messages API, meaning you can combine vision with tool use, system prompts, and multi-turn conversations seamlessly.
Claude's vision excels at understanding context within images: reading text, interpreting charts, describing scenes, analyzing UI layouts, and extracting structured data from documents. This makes it particularly powerful for document processing agents, QA testing agents, and data extraction workflows.
Sending Images via Base64
The most common approach is encoding images as base64 and including them in the message content:
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

# Read and encode the image
image_data = Path("screenshot.png").read_bytes()
base64_image = base64.standard_b64encode(image_data).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64_image,
                    },
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this screenshot. Identify any UI elements and their states.",
                },
            ],
        }
    ],
)

print(message.content[0].text)
The content field accepts a list of content blocks — you can mix text and image blocks freely within a single message. Supported image formats include PNG, JPEG, GIF, and WebP.
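Relying on file extensions to pick the media type can misfire on renamed files. A small sketch that sniffs the format from the file's magic bytes instead (the signature table and helper names are illustrative, not part of the SDK):

```python
import base64
from pathlib import Path

# Magic-byte signatures for the formats Claude accepts
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"RIFF", "image/webp"),  # WebP lives in a RIFF container; verified below
]

def sniff_media_type(data: bytes) -> str:
    """Return the media type for an image buffer based on its magic bytes."""
    for magic, media_type in _SIGNATURES:
        if data.startswith(magic):
            if media_type == "image/webp" and data[8:12] != b"WEBP":
                continue  # a RIFF container that is not WebP (e.g. WAV)
            return media_type
    raise ValueError("Unsupported image format")

def image_block(path: str) -> dict:
    """Build a base64 image content block for the Messages API."""
    data = Path(path).read_bytes()
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": sniff_media_type(data),
            "data": base64.standard_b64encode(data).decode("utf-8"),
        },
    }
```

A wrong media type is an easy way to get a confusing API error, so sniffing bytes is a cheap safeguard when images come from user uploads.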
Sending Images via URL
For publicly accessible images, you can provide a URL directly:
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/chart.png",
                    },
                },
                {
                    "type": "text",
                    "text": "Extract all data points from this bar chart and return them as a JSON array.",
                },
            ],
        }
    ],
)

print(message.content[0].text)
URL-based images avoid the overhead of base64 encoding and reduce request payload size, making them preferable when the image is already hosted.
Building a Document Analysis Agent
Combine vision with structured output for production document processing:
import anthropic
import base64
import json
from pathlib import Path

client = anthropic.Anthropic()

def analyze_invoice(image_path: str) -> dict:
    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    # Determine media type from the file extension
    suffix = Path(image_path).suffix.lower()
    media_types = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg"}
    media_type = media_types.get(suffix, "image/png")

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="""You are an invoice processing agent. Extract structured data from invoice images.
Always return valid JSON with these fields:
- vendor_name: string
- invoice_number: string
- date: string (YYYY-MM-DD)
- line_items: array of {description, quantity, unit_price, total}
- subtotal: number
- tax: number
- total: number""",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": base64_image,
                        },
                    },
                    {
                        "type": "text",
                        "text": "Extract all data from this invoice and return it as JSON.",
                    },
                ],
            }
        ],
    )

    # Strip markdown fences in case the model wraps its JSON in ```json ... ```
    text = message.content[0].text.strip()
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    return json.loads(text)

result = analyze_invoice("sample_invoice.png")
print(json.dumps(result, indent=2))
This pattern works for invoices, receipts, forms, business cards, and any structured document. The system prompt defines the exact output schema, and Claude extracts the relevant fields from the image.
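Before extracted data enters a downstream system, it is worth validating the model's output against the schema. A minimal consistency check as a sketch (`validate_invoice` and its tolerance are illustrative, not part of the API):

```python
def validate_invoice(data: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems found in extracted invoice data (empty = OK)."""
    problems = []
    for field in ("vendor_name", "invoice_number", "date", "line_items",
                  "subtotal", "tax", "total"):
        if field not in data:
            problems.append(f"missing field: {field}")
    # Cross-check the arithmetic only when all fields are present
    if not problems:
        line_sum = sum(item.get("total", 0) for item in data["line_items"])
        if abs(line_sum - data["subtotal"]) > tolerance:
            problems.append(f"line items sum to {line_sum}, subtotal is {data['subtotal']}")
        if abs(data["subtotal"] + data["tax"] - data["total"]) > tolerance:
            problems.append("subtotal + tax does not equal total")
    return problems
```

Arithmetic checks like these catch most extraction errors cheaply; invoices that fail can be routed to a human or retried with the problems appended to the prompt.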
Multi-Image Analysis
Claude can process multiple images in a single request, enabling comparison tasks:
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def encode_image(path: str) -> dict:
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    suffix = Path(path).suffix.lower()
    media_type = "image/jpeg" if suffix in [".jpg", ".jpeg"] else "image/png"
    return {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}}

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                encode_image("design_v1.png"),
                encode_image("design_v2.png"),
                {
                    "type": "text",
                    "text": "Compare these two UI designs. List specific differences in layout, color, typography, and component placement.",
                },
            ],
        }
    ],
)

print(message.content[0].text)
This is powerful for UI regression testing, before/after comparisons, and visual QA agents that need to spot differences between designs and implementations.
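When several images are in play, interleaving a short text label before each one makes it easy to refer to them unambiguously in the prompt. A sketch of a helper that builds such a content list (the helper itself is assumed, not part of the SDK):

```python
import base64
from pathlib import Path

def labeled_image_content(paths: list[str], question: str) -> list[dict]:
    """Interleave 'Image N:' labels with base64 image blocks, ending with the question."""
    content = []
    for i, path in enumerate(paths, start=1):
        suffix = Path(path).suffix.lower()
        media_type = "image/jpeg" if suffix in (".jpg", ".jpeg") else "image/png"
        content.append({"type": "text", "text": f"Image {i}:"})
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8"),
            },
        })
    content.append({"type": "text", "text": question})
    return content
```

With the labels in place, prompts like "What changed between Image 1 and Image 2?" stay unambiguous even as the number of images grows.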
Vision Combined with Tool Use
The most powerful pattern is combining vision with tools so the agent can see and act:
import anthropic
import base64
import json
from pathlib import Path

client = anthropic.Anthropic()

tools = [
    {
        "name": "create_jira_ticket",
        "description": "Create a Jira ticket for a UI bug.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "description": {"type": "string"},
                "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
            },
            "required": ["title", "description", "severity"],
        },
    }
]

image_data = base64.standard_b64encode(Path("bug_screenshot.png").read_bytes()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    tools=tools,
    system="You are a QA agent. Analyze screenshots for bugs and file tickets for any issues found.",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
                {"type": "text", "text": "Review this screenshot and file tickets for any visual bugs you find."},
            ],
        }
    ],
)

# Each ticket Claude wants to file arrives as a tool_use block in the response
for block in response.content:
    if block.type == "tool_use":
        print(f"{block.name}: {json.dumps(block.input, indent=2)}")
This creates a QA agent that can look at a screenshot, identify visual bugs, and automatically file tickets — a complete vision-to-action pipeline.
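To close the loop, each tool_use block's result must go back to Claude in a follow-up user message as a tool_result block, matched by tool_use_id. A sketch of building that message (the helper name is illustrative; the block shapes follow the tool use API):

```python
import json

def tool_results_message(tool_use_blocks: list, results: dict) -> dict:
    """Build the user turn that reports tool results back to Claude.

    tool_use_blocks: blocks with .id and .name, as found in response.content
    results: mapping of tool_use id -> result payload from your tool execution
    """
    return {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(results[block.id]),
            }
            for block in tool_use_blocks
        ],
    }
```

Appending the assistant turn and this message to the conversation, then calling `messages.create` again, lets Claude confirm the tickets or continue analyzing.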
FAQ
What is the maximum image size Claude can process?
Claude accepts images up to approximately 20 megapixels. For larger images, resize before sending. The API also has a payload size limit, so very large base64-encoded images may need compression. In practice, most screenshots and document scans work without any preprocessing.
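A quick way to decide how much to downscale before sending is to compute the factor that brings the pixel count under the limit (the 20-megapixel threshold follows the answer above; the helper itself is illustrative):

```python
import math

MAX_PIXELS = 20_000_000  # ~20 megapixels, per the limit described above

def downscale_factor(width: int, height: int, max_pixels: int = MAX_PIXELS) -> float:
    """Return the factor (<= 1.0) to multiply both dimensions by so the image fits."""
    pixels = width * height
    if pixels <= max_pixels:
        return 1.0
    # Scaling both dimensions by sqrt(r) scales the area by r
    return math.sqrt(max_pixels / pixels)

def target_size(width: int, height: int) -> tuple[int, int]:
    f = downscale_factor(width, height)
    return (int(width * f), int(height * f))
```

The resulting size preserves aspect ratio and can be fed to any image library's resize call before encoding.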
Can Claude read PDFs directly?
Claude supports PDF input via base64 encoding with media_type: "application/pdf". You can send multi-page PDFs and Claude will analyze all pages. For very long documents, consider splitting into page ranges and processing them separately to stay within token limits.
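A sketch of what the PDF content block looks like, assuming the document block type described above (the helper name is illustrative):

```python
import base64
from pathlib import Path

def pdf_block(path: str) -> dict:
    """Build a base64 PDF content block for the Messages API."""
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": data,
        },
    }

# Usage: pass it alongside a text block, exactly like an image
# messages=[{"role": "user", "content": [pdf_block("report.pdf"),
#            {"type": "text", "text": "Summarize each page."}]}]
```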
How accurate is Claude's OCR compared to dedicated OCR tools?
Claude's text extraction from images is remarkably accurate for printed text, typed documents, and clean handwriting. For degraded images, unusual fonts, or historical documents, a dedicated OCR tool like Tesseract or Google Vision may perform better. Many production systems use a hybrid approach: OCR for raw text extraction, then Claude for understanding and structuring the extracted content.
#Anthropic #Claude #Vision #MultiModal #DocumentAnalysis #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.