Claude Vision: Building Multi-Modal Agents That Understand Images and Documents
Build multi-modal agents that process images, PDFs, and diagrams using Claude's vision capabilities. Learn how to send image data via the API, analyze documents, and combine vision with tool use.
Claude's Vision Capabilities
Claude can process images as part of its input, enabling agents that understand screenshots, photographs, diagrams, charts, documents, and handwritten text. This is not a separate vision API — images are simply another content type within the standard messages API, meaning you can combine vision with tool use, system prompts, and multi-turn conversations seamlessly.
Claude's vision excels at understanding context within images: reading text, interpreting charts, describing scenes, analyzing UI layouts, and extracting structured data from documents. This makes it particularly powerful for document processing agents, QA testing agents, and data extraction workflows.
Sending Images via Base64
The most common approach is encoding images as base64 and including them in the message content:
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

# Read and encode the image
image_data = Path("screenshot.png").read_bytes()
base64_image = base64.standard_b64encode(image_data).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64_image,
                    },
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this screenshot. Identify any UI elements and their states.",
                },
            ],
        }
    ],
)

print(message.content[0].text)
The content field accepts a list of content blocks — you can mix text and image blocks freely within a single message. Supported image formats include PNG, JPEG, GIF, and WebP.
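Relying on file extensions to pick the media type can misfire on renamed files. A small sketch that sniffs the format from the file's magic bytes instead (the signature table and helper names are illustrative, not part of the SDK):

```python
import base64
from pathlib import Path

# Magic-byte signatures for the formats Claude accepts
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"RIFF", "image/webp"),  # WebP lives in a RIFF container; verified below
]

def sniff_media_type(data: bytes) -> str:
    """Return the media type for an image buffer based on its magic bytes."""
    for magic, media_type in _SIGNATURES:
        if data.startswith(magic):
            if media_type == "image/webp" and data[8:12] != b"WEBP":
                continue  # a RIFF container that is not WebP (e.g. WAV)
            return media_type
    raise ValueError("Unsupported image format")

def image_block(path: str) -> dict:
    """Build a base64 image content block for the Messages API."""
    data = Path(path).read_bytes()
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": sniff_media_type(data),
            "data": base64.standard_b64encode(data).decode("utf-8"),
        },
    }
```

A wrong media type is an easy way to get a confusing API error, so sniffing bytes is a cheap safeguard when images come from user uploads.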
Sending Images via URL
For publicly accessible images, you can provide a URL directly:
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/chart.png",
                    },
                },
                {
                    "type": "text",
                    "text": "Extract all data points from this bar chart and return them as a JSON array.",
                },
            ],
        }
    ],
)

print(message.content[0].text)
URL-based images avoid the overhead of base64 encoding and reduce request payload size, making them preferable when the image is already hosted.
Building a Document Analysis Agent
Combine vision with structured output for production document processing:
import anthropic
import base64
import json
from pathlib import Path

client = anthropic.Anthropic()

def analyze_invoice(image_path: str) -> dict:
    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    # Determine media type from the file extension
    suffix = Path(image_path).suffix.lower()
    media_types = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg"}
    media_type = media_types.get(suffix, "image/png")

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="""You are an invoice processing agent. Extract structured data from invoice images.
Always return valid JSON with these fields:
- vendor_name: string
- invoice_number: string
- date: string (YYYY-MM-DD)
- line_items: array of {description, quantity, unit_price, total}
- subtotal: number
- tax: number
- total: number""",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": base64_image,
                        },
                    },
                    {
                        "type": "text",
                        "text": "Extract all data from this invoice and return it as JSON.",
                    },
                ],
            }
        ],
    )

    # Strip markdown fences in case the model wraps its JSON in ```json ... ```
    text = message.content[0].text.strip()
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    return json.loads(text)

result = analyze_invoice("sample_invoice.png")
print(json.dumps(result, indent=2))
This pattern works for invoices, receipts, forms, business cards, and any structured document. The system prompt defines the exact output schema, and Claude extracts the relevant fields from the image.
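Before extracted data enters a downstream system, it is worth validating the model's output against the schema. A minimal consistency check as a sketch (`validate_invoice` and its tolerance are illustrative, not part of the API):

```python
def validate_invoice(data: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems found in extracted invoice data (empty = OK)."""
    problems = []
    for field in ("vendor_name", "invoice_number", "date", "line_items",
                  "subtotal", "tax", "total"):
        if field not in data:
            problems.append(f"missing field: {field}")
    # Cross-check the arithmetic only when all fields are present
    if not problems:
        line_sum = sum(item.get("total", 0) for item in data["line_items"])
        if abs(line_sum - data["subtotal"]) > tolerance:
            problems.append(f"line items sum to {line_sum}, subtotal is {data['subtotal']}")
        if abs(data["subtotal"] + data["tax"] - data["total"]) > tolerance:
            problems.append("subtotal + tax does not equal total")
    return problems
```

Arithmetic checks like these catch most extraction errors cheaply; invoices that fail can be routed to a human or retried with the problems appended to the prompt.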
Multi-Image Analysis
Claude can process multiple images in a single request, enabling comparison tasks:
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def encode_image(path: str) -> dict:
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    suffix = Path(path).suffix.lower()
    media_type = "image/jpeg" if suffix in [".jpg", ".jpeg"] else "image/png"
    return {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}}

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                encode_image("design_v1.png"),
                encode_image("design_v2.png"),
                {
                    "type": "text",
                    "text": "Compare these two UI designs. List specific differences in layout, color, typography, and component placement.",
                },
            ],
        }
    ],
)

print(message.content[0].text)
This is powerful for UI regression testing, before/after comparisons, and visual QA agents that need to spot differences between designs and implementations.
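When several images are in play, interleaving a short text label before each one makes it easy to refer to them unambiguously in the prompt. A sketch of a helper that builds such a content list (the helper itself is assumed, not part of the SDK):

```python
import base64
from pathlib import Path

def labeled_image_content(paths: list[str], question: str) -> list[dict]:
    """Interleave 'Image N:' labels with base64 image blocks, ending with the question."""
    content = []
    for i, path in enumerate(paths, start=1):
        suffix = Path(path).suffix.lower()
        media_type = "image/jpeg" if suffix in (".jpg", ".jpeg") else "image/png"
        content.append({"type": "text", "text": f"Image {i}:"})
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8"),
            },
        })
    content.append({"type": "text", "text": question})
    return content
```

With the labels in place, prompts like "What changed between Image 1 and Image 2?" stay unambiguous even as the number of images grows.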
Vision Combined with Tool Use
The most powerful pattern is combining vision with tools so the agent can see and act:
import anthropic
import base64
import json
from pathlib import Path

client = anthropic.Anthropic()

tools = [
    {
        "name": "create_jira_ticket",
        "description": "Create a Jira ticket for a UI bug.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "description": {"type": "string"},
                "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
            },
            "required": ["title", "description", "severity"],
        },
    }
]

image_data = base64.standard_b64encode(Path("bug_screenshot.png").read_bytes()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    tools=tools,
    system="You are a QA agent. Analyze screenshots for bugs and file tickets for any issues found.",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
                {"type": "text", "text": "Review this screenshot and file tickets for any visual bugs you find."},
            ],
        }
    ],
)

# Each ticket Claude wants to file arrives as a tool_use block in the response
for block in response.content:
    if block.type == "tool_use":
        print(f"{block.name}: {json.dumps(block.input, indent=2)}")
This creates a QA agent that can look at a screenshot, identify visual bugs, and automatically file tickets — a complete vision-to-action pipeline.
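To close the loop, each tool_use block's result must go back to Claude in a follow-up user message as a tool_result block, matched by tool_use_id. A sketch of building that message (the helper name is illustrative; the block shapes follow the tool use API):

```python
import json

def tool_results_message(tool_use_blocks: list, results: dict) -> dict:
    """Build the user turn that reports tool results back to Claude.

    tool_use_blocks: blocks with .id and .name, as found in response.content
    results: mapping of tool_use id -> result payload from your tool execution
    """
    return {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(results[block.id]),
            }
            for block in tool_use_blocks
        ],
    }
```

Appending the assistant turn and this message to the conversation, then calling `messages.create` again, lets Claude confirm the tickets or continue analyzing.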
FAQ
What is the maximum image size Claude can process?
Claude accepts images up to approximately 20 megapixels. For larger images, resize before sending. The API also has a payload size limit, so very large base64-encoded images may need compression. In practice, most screenshots and document scans work without any preprocessing.
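A quick way to decide how much to downscale before sending is to compute the factor that brings the pixel count under the limit (the 20-megapixel threshold follows the answer above; the helper itself is illustrative):

```python
import math

MAX_PIXELS = 20_000_000  # ~20 megapixels, per the limit described above

def downscale_factor(width: int, height: int, max_pixels: int = MAX_PIXELS) -> float:
    """Return the factor (<= 1.0) to multiply both dimensions by so the image fits."""
    pixels = width * height
    if pixels <= max_pixels:
        return 1.0
    # Scaling both dimensions by sqrt(r) scales the area by r
    return math.sqrt(max_pixels / pixels)

def target_size(width: int, height: int) -> tuple[int, int]:
    f = downscale_factor(width, height)
    return (int(width * f), int(height * f))
```

The resulting size preserves aspect ratio and can be fed to any image library's resize call before encoding.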
Can Claude read PDFs directly?
Claude supports PDF input via base64 encoding with media_type: "application/pdf". You can send multi-page PDFs and Claude will analyze all pages. For very long documents, consider splitting into page ranges and processing them separately to stay within token limits.
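A sketch of what the PDF content block looks like, assuming the document block type described above (the helper name is illustrative):

```python
import base64
from pathlib import Path

def pdf_block(path: str) -> dict:
    """Build a base64 PDF content block for the Messages API."""
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": data,
        },
    }

# Usage: pass it alongside a text block, exactly like an image
# messages=[{"role": "user", "content": [pdf_block("report.pdf"),
#            {"type": "text", "text": "Summarize each page."}]}]
```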
How accurate is Claude's OCR compared to dedicated OCR tools?
Claude's text extraction from images is remarkably accurate for printed text, typed documents, and clean handwriting. For degraded images, unusual fonts, or historical documents, a dedicated OCR tool like Tesseract or Google Vision may perform better. Many production systems use a hybrid approach: OCR for raw text extraction, then Claude for understanding and structuring the extracted content.
#Anthropic #Claude #Vision #MultiModal #DocumentAnalysis #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.