Claude PDF and Document Analysis Agent: Processing Complex Documents at Scale

Claude's Native PDF Understanding

Claude can process PDF documents directly through the Messages API. Converting a PDF to plain text first discards formatting, tables, and layout information; Claude instead analyzes each rendered page as an image alongside the page's embedded text. This combination of visual layout and textual content makes it well suited to extracting structured data from complex documents.

This capability is particularly valuable for contracts, financial reports, research papers, invoices, and any document where layout carries meaning.

Uploading PDFs to Claude

PDFs are sent as base64-encoded content in the message:

import anthropic
import base64

client = anthropic.Anthropic()

def analyze_pdf(file_path: str, question: str) -> str:
    with open(file_path, "rb") as f:
        pdf_data = base64.standard_b64encode(f.read()).decode()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data,
                    }
                },
                {
                    "type": "text",
                    "text": question,
                }
            ]
        }]
    )
    return response.content[0].text

Claude processes each page of the PDF, understanding both the text content and the visual layout. This means it can correctly interpret tables, charts, headers, footnotes, and multi-column layouts.
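Before sending a file, it can help to confirm that it looks like a PDF and fits the request budget. The following is a minimal sketch, not part of the original article: the helper name `build_document_block`, the `%PDF-` header check, and the 32 MB ceiling are assumptions, so check the current API limits before relying on the number.

```python
import base64

# Assumed request-size ceiling (hypothetical default); verify against current API limits.
MAX_REQUEST_BYTES = 32 * 1024 * 1024


def build_document_block(pdf_bytes: bytes, max_bytes: int = MAX_REQUEST_BYTES) -> dict:
    """Validate raw PDF bytes and wrap them in a base64 document content block."""
    # PDF files normally begin with the "%PDF-" marker.
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("file does not start with a PDF header")

    encoded = base64.standard_b64encode(pdf_bytes).decode("ascii")
    # Base64 inflates the payload by about a third, so measure the encoded size.
    if len(encoded) > max_bytes:
        raise ValueError(
            f"encoded PDF is {len(encoded)} bytes, over the {max_bytes}-byte budget"
        )

    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": encoded,
        },
    }
```

The returned dictionary has the same shape as the document block used in `analyze_pdf`, so it can be placed directly in the `content` list.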

Page-Level Analysis

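For large documents, you may want to focus on specific page ranges or sections rather than ask one broad question. One approach is to send several targeted questions against the same PDF. Note that the function below resends the full document with every question, so each call pays for the document's tokens again: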

def analyze_pages(file_path: str, analyses: list[dict]) -> list[dict]:
    """Run multiple analyses on a single PDF."""
    with open(file_path, "rb") as f:
        pdf_data = base64.standard_b64encode(f.read()).decode()

    results = []
    for analysis in analyses:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_data,
                        }
                    },
                    {
                        "type": "text",
                        "text": analysis["question"],
                    }
                ]
            }]
        )
        results.append({
            "analysis": analysis["name"],
            "result": response.content[0].text
        })
    return results

# Usage
results = analyze_pages("annual_report.pdf", [
    {"name": "financial_summary", "question": "Extract all revenue figures, costs, and profit margins from the financial statements."},
    {"name": "risk_factors", "question": "List all risk factors mentioned in the document with their severity."},
    {"name": "key_metrics", "question": "What are the key performance indicators and their year-over-year changes?"},
])
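Because `analyze_pages` sends the same PDF with every question, repeated calls may benefit from prompt caching. The sketch below builds the message content with a `cache_control` marker on the document block and optionally scopes the question to a page range. The helper name `build_cached_content` and the wording of the page-scope prefix are assumptions for illustration; whether a cache hit actually occurs depends on the model, the document size, and how soon the follow-up requests arrive.

```python
from typing import Optional


def build_cached_content(
    pdf_b64: str,
    question: str,
    pages: Optional[tuple[int, int]] = None,
) -> list[dict]:
    """Build message content with a cacheable document block and a scoped question."""
    if pages is not None:
        first, last = pages
        # Hypothetical phrasing; adjust to whatever scoping instruction works for you.
        question = f"Focus only on pages {first} to {last}. {question}"

    return [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": pdf_b64,
            },
            # Marks the prompt prefix ending at this block as cacheable.
            "cache_control": {"type": "ephemeral"},
        },
        {"type": "text", "text": question},
    ]
```

The result can be passed as the `content` value of the user message in place of the inline list inside the loop.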

Structured Data Extraction with Tools

Combine PDF analysis with tool use to extract structured data that can be programmatically processed:

extraction_tool = {
    "name": "extract_invoice_data",
    "description": "Extract structured data from an invoice document",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_name": {"type": "string"},
            "invoice_number": {"type": "string"},
            "invoice_date": {"type": "string", "description": "ISO format date"},
            "due_date": {"type": "string", "description": "ISO format date"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "number"},
                        "unit_price": {"type": "number"},
                        "total": {"type": "number"}
                    },
                    "required": ["description", "quantity", "unit_price", "total"]
                }
            },
            "subtotal": {"type": "number"},
            "tax": {"type": "number"},
            "total": {"type": "number"},
            "currency": {"type": "string"}
        },
        "required": ["vendor_name", "invoice_number", "invoice_date", "line_items", "total"]
    }
}

def extract_invoice(pdf_path: str) -> dict:
    with open(pdf_path, "rb") as f:
        pdf_data = base64.standard_b64encode(f.read()).decode()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        tools=[extraction_tool],
        tool_choice={"type": "tool", "name": "extract_invoice_data"},
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data,
                    }
                },
                {"type": "text", "text": "Extract all invoice data from this document."}
            ]
        }]
    )

    for block in response.content:
        if block.type == "tool_use":
            return block.input
    return {}

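Forcing tool use with tool_choice ensures that Claude responds with a call to extract_invoice_data, so the result arrives as a JSON object shaped by your schema rather than as free text. Treat that object as untrusted input: check required fields and numeric totals before inserting it into a database or feeding it to a downstream system, and make sure max_tokens is large enough that long invoices are not cut off.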
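A lightweight consistency check on the extracted dictionary can catch misread numbers before they reach a database. The sketch below is one possible approach rather than a prescribed one: `validate_invoice`, the 0.01 tolerance, and the wording of the problem messages are all assumptions, and the field names follow the `extract_invoice_data` schema above.

```python
import math

REQUIRED_FIELDS = ("vendor_name", "invoice_number", "invoice_date", "line_items", "total")


def validate_invoice(data: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]

    items = data.get("line_items", [])
    for idx, item in enumerate(items):
        missing = [k for k in ("quantity", "unit_price", "total") if k not in item]
        if missing:
            problems.append(f"line {idx}: missing {', '.join(missing)}")
            continue
        # Each line total should equal quantity * unit_price.
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["total"], abs_tol=tolerance):
            problems.append(f"line {idx}: expected {expected:.2f}, got {item['total']:.2f}")

    # The subtotal, when present, should equal the sum of the line totals.
    if "subtotal" in data:
        items_sum = sum(item.get("total", 0) for item in items)
        if not math.isclose(items_sum, data["subtotal"], abs_tol=tolerance):
            problems.append(f"subtotal {data['subtotal']:.2f} != line sum {items_sum:.2f}")

    # The grand total should equal subtotal + tax when all three are present.
    if all(k in data for k in ("subtotal", "tax", "total")):
        expected_total = data["subtotal"] + data["tax"]
        if not math.isclose(expected_total, data["total"], abs_tol=tolerance):
            problems.append(f"total {data['total']:.2f} != subtotal + tax {expected_total:.2f}")

    return problems
```

Calling `validate_invoice(extract_invoice("invoice.pdf"))` and routing any non-empty result to manual review keeps questionable extractions out of downstream systems.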

Multi-Document Comparison

A single message can carry several document blocks, so Claude can compare information across multiple documents in one request:

def compare_documents(pdf_paths: list[str], comparison_prompt: str) -> str:
    content = []

    for i, path in enumerate(pdf_paths):
        with open(path, "rb") as f:
            pdf_data = base64.standard_b64encode(f.read()).decode()

        content.append({
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": pdf_data,
            }
        })
        content.append({
            "type": "text",
            "text": f"The above is Document {i + 1}: {path}",
        })

    content.append({"type": "text", "text": comparison_prompt})

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

# Compare two contracts
result = compare_documents(
    ["contract_v1.pdf", "contract_v2.pdf"],
    "Compare these two contract versions. List every change including "
    "additions, deletions, and modifications to terms. Flag any changes "
    "that affect liability, payment terms, or termination clauses."
)

Scaling Document Processing

For batch document processing, combine PDF analysis with the Batches API:

def batch_analyze_pdfs(pdf_paths: list[str], question: str) -> str:
    requests = []
    for i, path in enumerate(pdf_paths):
        with open(path, "rb") as f:
            pdf_data = base64.standard_b64encode(f.read()).decode()

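        requests.append({
            # custom_id is restricted to a short string of letters, digits, "_" and "-",
            # so use the index alone; file paths contain "." and "/".
            "custom_id": f"pdf-{i}",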
            "params": {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 2048,
                "messages": [{
                    "role": "user",
                    "content": [
                        {
                            "type": "document",
                            "source": {
                                "type": "base64",
                                "media_type": "application/pdf",
                                "data": pdf_data,
                            }
                        },
                        {"type": "text", "text": question}
                    ]
                }]
            }
        })

    batch = client.messages.batches.create(requests=requests)
    return batch.id

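Batch requests are billed at 50% of the standard price and are processed asynchronously, so you can submit hundreds of PDFs in one call and collect the results once the batch finishes instead of pacing individual requests yourself. Because every request embeds its PDF as base64, a large document set may need to be split across several batches to stay within the API's batch size limits.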
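The function above returns only the batch ID, so a second step is needed to fetch the answers. The sketch below is one way to do that, assuming the Python SDK's `client.messages.batches.retrieve` and `client.messages.batches.results` methods. The helper name `collect_batch_results`, the polling interval, and the bracketed placeholder for failed requests are illustrative choices, not part of the original article.

```python
import time


def collect_batch_results(client, batch_id: str, poll_seconds: float = 30.0) -> dict[str, str]:
    """Wait for a batch to finish, then map each custom_id to its response text."""
    # Poll until the batch reports that processing has ended.
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        time.sleep(poll_seconds)

    answers: dict[str, str] = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            answers[entry.custom_id] = entry.result.message.content[0].text
        else:
            # Any non-success outcome (for example an error or expiry) is recorded by type.
            answers[entry.custom_id] = f"[{entry.result.type}]"
    return answers
```

Passing the client in as an argument keeps the helper easy to test with a stub. In production code you would likely add an overall timeout around the polling loop.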

FAQ

What is the maximum PDF size Claude can process?

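Each PDF page is converted to an image internally, so limits apply to both page count and request size. At the time of writing, Claude can handle roughly 100 pages per request, and the whole request, including the base64-encoded PDF, is capped at 32 MB; check the current API documentation for the model you use. Performance is best with shorter documents. For very large documents, split them into sections, process each section separately, and then run a final synthesis step over the per-section results.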
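The splitting step can be planned before touching any PDF library. The sketch below only computes 1-based, inclusive page ranges. The name `split_page_ranges` and the default of 100 pages per section are assumptions based on the approximate limit mentioned above, and writing each range out as its own PDF would need a separate PDF library that is not shown here.

```python
def split_page_ranges(total_pages: int, max_pages: int = 100) -> list[tuple[int, int]]:
    """Split a document into consecutive (first_page, last_page) ranges, 1-based inclusive."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")

    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


# A 250-page report becomes three sections.
print(split_page_ranges(250))  # [(1, 100), (101, 200), (201, 250)]
```

Each range can then be extracted into its own file and analyzed separately, and the per-section answers combined in a final synthesis request.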

Can Claude extract data from scanned PDFs without OCR?

Yes. Because Claude processes PDF pages as images, it can read text from scanned documents directly, with no OCR preprocessing required. This works for most print-quality scans. Very low-resolution scans or heavily distorted documents may first need preprocessing with image enhancement tools.

How accurate is table extraction from PDFs?

Claude's table extraction is generally reliable for standard table layouts: rows, columns, headers, and merged cells are handled well. Complex nested tables, or tables that span multiple pages, may need additional prompting, such as telling Claude that a table continues on the next page. Always validate extracted numerical data against known totals when accuracy is critical.

