Claude Vision for PDF Processing in the Browser: Reading Documents Without Download

The Browser PDF Problem

Many web applications display PDFs directly in the browser: embedded in iframes, rendered through custom viewers like PDF.js, or shown in the browser's native PDF viewer. Traditional scraping tools often cannot access this content because the pages are typically drawn to a canvas element or handled by the browser's built-in viewer rather than exposed as parseable HTML. Downloading the file is not always possible either: some applications disable downloads, use DRM, or generate PDFs on the fly.

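Claude's vision capabilities address this by reading the PDF the same way a human would: looking at the rendered pages and extracting information visually. Claude Computer Use adds the ability to operate the viewer's controls. Because the approach depends only on what appears on screen, it works regardless of how the viewer renders the content.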

Basic PDF Page Reading

The simplest approach captures a screenshot of the PDF viewer and asks Claude to extract the text content:

import anthropic
import json
import base64
from playwright.async_api import async_playwright

client = anthropic.Anthropic()

async def read_pdf_page(page, frame_selector: str | None = None) -> str:
    """Read text content from a PDF displayed in the browser."""
    if frame_selector:
        # PDF in an iframe: capture only the frame's document body
        frame = page.frame_locator(frame_selector)
        screenshot = await frame.locator("body").screenshot()
    else:
        # Captures the visible viewport only, so the PDF page must fit on screen
        screenshot = await page.screenshot()

    screenshot_b64 = base64.standard_b64encode(screenshot).decode()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64,
                }},
                {"type": "text", "text": """Extract ALL text content from this PDF page
exactly as it appears. Preserve:
- Paragraph structure
- Headings and subheadings
- Bullet points and numbered lists
- Table structure (format as markdown tables)
- Any footnotes or annotations

Do not summarize or paraphrase. Return the exact text content."""},
            ],
        }],
    )
    return response.content[0].text
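
Every function in this article sends the same image-plus-text message. As a minimal sketch, the repeated payload could be built by one helper. The name `build_image_message` is hypothetical and not part of the Anthropic SDK; it only assembles the dictionary shape already used above.

```python
import base64


def build_image_message(png_bytes: bytes, prompt: str) -> dict:
    """Build one user message holding a PNG screenshot followed by a text prompt."""
    encoded = base64.standard_b64encode(png_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": encoded,
            }},
            {"type": "text", "text": prompt},
        ],
    }
```

With this in place, a call would pass `messages=[build_image_message(screenshot, prompt)]` instead of spelling out the nested dictionaries each time.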

Multi-Page PDF Navigation

Reading a full PDF requires navigating through all pages. Because we are working through the browser's PDF viewer, we have to use the viewer's own navigation controls:

class BrowserPDFReader:
    def __init__(self):
        self.client = anthropic.Anthropic()
        self.pages_content = []

    async def read_full_pdf(self, page, total_pages: int | None = None) -> list[dict]:
        """Read all pages of a PDF in the browser viewer.

        Returns one dict per page: {"page_number": int, "content": str}.
        """
        if total_pages is None:
            total_pages = await self._detect_page_count(page)

        for page_num in range(total_pages):
            # Navigate to the page (the viewer is assumed to open on page 1)
            if page_num > 0:
                await self._go_to_page(page, page_num + 1)
                # Fixed delay to let the viewer render; tune for slow viewers
                await page.wait_for_timeout(1000)

            # Read current page content
            content = await self._read_current_page(page)
            self.pages_content.append({
                "page_number": page_num + 1,
                "content": content
            })

        return self.pages_content

    async def _detect_page_count(self, page) -> int:
        """Detect total page count from the PDF viewer UI."""
        screenshot = await page.screenshot()
        screenshot_b64 = base64.standard_b64encode(screenshot).decode()

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_b64,
                    }},
                    {"type": "text", "text": """Look at the PDF viewer controls.
Find the total page count indicator (usually shows "Page X of Y" or "X / Y").
Return ONLY a JSON object with the total number of pages, e.g.: {"total_pages": 15}"""},
                ],
            }],
        )
        result = json.loads(response.content[0].text)
        return result["total_pages"]

    async def _go_to_page(self, page, target_page: int):
        """Navigate to a specific page in the PDF viewer."""
        screenshot = await page.screenshot()
        screenshot_b64 = base64.standard_b64encode(screenshot).decode()

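        # Computer use is a beta feature: it needs the beta endpoint, a beta flag,
        # and a tool version that matches the model (computer_20250124 for Claude 4 models).
        response = self.client.beta.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=512,
            betas=["computer-use-2025-01-24"],
            tools=[{
                "type": "computer_20250124",
                "name": "computer",
                # Must match the browser viewport so returned coordinates line up
                "display_width_px": 1280,
                "display_height_px": 800,
                "display_number": 0,
            }],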
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_b64,
                    }},
                    {"type": "text", "text": f"Navigate to page {target_page} in this PDF viewer. "
                     f"Click on the page number input field, clear it, type {target_page}, and press Enter."},
                ],
            }],
        )

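        # Execute the returned actions. _execute_action is not defined in this
        # article; it has to translate each action (click, type, key press)
        # into Playwright mouse and keyboard calls.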
        for block in response.content:
            if block.type == "tool_use":
                await self._execute_action(page, block.input)

    async def _read_current_page(self, page) -> str:
        screenshot = await page.screenshot()
        screenshot_b64 = base64.standard_b64encode(screenshot).decode()

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_b64,
                    }},
                    {"type": "text", "text": "Extract all text content from this PDF page. Preserve structure exactly."},
                ],
            }],
        )
        return response.content[0].text
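
The class above calls `_execute_action` without defining it. The following is a minimal sketch of what such a dispatcher might look like, written as a standalone function. It assumes the action input uses the fields `action`, `coordinate`, and `text`, and that key names arrive in xdotool style (for example `Return` or `ctrl+a`) and need translating for Playwright. The alias table is an assumption and covers only a few keys. Note also that the model may return only the first step (for example a click) and expect a tool result before continuing, so a robust version would loop rather than rely on a single response.

```python
# Assumed mapping from xdotool-style key names to Playwright key names.
KEY_ALIASES = {
    "Return": "Enter",
    "BackSpace": "Backspace",
    "ctrl": "Control",
    "alt": "Alt",
    "shift": "Shift",
}


def to_playwright_key(key: str) -> str:
    """Translate a combination such as 'ctrl+a' into 'Control+a'."""
    return "+".join(KEY_ALIASES.get(part, part) for part in key.split("+"))


async def execute_action(page, action_input: dict) -> bool:
    """Apply one computer-use action to a Playwright page.

    Returns True if the action was handled and False if it is not
    supported by this sketch (for example a screenshot request).
    """
    action = action_input.get("action")
    if action == "left_click":
        x, y = action_input["coordinate"]
        await page.mouse.click(x, y)
    elif action == "triple_click":
        # Selects the existing value in an input so typing replaces it
        x, y = action_input["coordinate"]
        await page.mouse.click(x, y, click_count=3)
    elif action == "type":
        await page.keyboard.type(action_input["text"])
    elif action == "key":
        await page.keyboard.press(to_playwright_key(action_input["text"]))
    else:
        return False
    return True
```

Inside `BrowserPDFReader`, the method could simply delegate: `await execute_action(page, block.input)`.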

Table Extraction from PDFs

PDFs often contain tables that are notoriously difficult to extract with text-based tools. Vision-based extraction handles complex table layouts naturally:

async def extract_pdf_tables(page) -> dict:
    """Extract structured table data from the current PDF page.

    Returns a dict of the form {"tables": [...]}.
    """
    screenshot = await page.screenshot()
    screenshot_b64 = base64.standard_b64encode(screenshot).decode()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64,
                }},
                {"type": "text", "text": """Find all tables on this PDF page.
For each table, extract:
- table_title: any title or caption above the table
- headers: list of column headers
- rows: list of rows, each row is a list of cell values
- has_merged_cells: true if any cells span multiple rows/columns
- notes: any footnotes or annotations related to the table

Return as JSON: {"tables": [...]}"""},
            ],
        }],
    )
    return json.loads(response.content[0].text)
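
Calling `json.loads` directly on the reply assumes the model returns bare JSON. In practice a reply may be wrapped in a Markdown code fence or preceded by a short sentence, which would raise `json.JSONDecodeError`. A minimal sketch of a more tolerant parser follows; `parse_json_response` is a hypothetical helper name, and the fallback of taking the outermost braces is a heuristic, not a guarantee.

```python
import json

FENCE = "`" * 3  # three backticks, the Markdown code-fence marker


def parse_json_response(text: str):
    """Parse JSON from a model reply, tolerating a code fence or surrounding prose."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE) and cleaned.endswith(FENCE) and len(cleaned) >= 2 * len(FENCE):
        first_newline = cleaned.find("\n")
        if first_newline == -1:
            # Fence and JSON on one line: strip the markers only
            cleaned = cleaned[len(FENCE):-len(FENCE)]
        else:
            # Drop the opening fence line (it may carry a language tag) and the closing fence
            cleaned = cleaned[first_newline + 1:-len(FENCE)]
        cleaned = cleaned.strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Heuristic fallback: parse the span between the first "{" and the last "}"
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(cleaned[start:end + 1])
```

Each `json.loads(response.content[0].text)` call in this article could be swapped for `parse_json_response(response.content[0].text)`.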

Annotation Detection

PDFs with highlights, comments, and stamps contain important metadata. Claude can detect these visual annotations:

async def detect_annotations(page) -> dict:
    """Detect highlights, comments, and other annotations on the PDF.

    Returns a dict of the form {"annotations": [...]}.
    """
    screenshot = await page.screenshot()
    screenshot_b64 = base64.standard_b64encode(screenshot).decode()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64,
                }},
                {"type": "text", "text": """Identify all annotations visible on this PDF page:
- Highlighted text (text with colored background)
- Margin comments or sticky notes
- Stamps (Approved, Draft, Confidential, etc.)
- Underlined or struck-through text
- Hand-drawn marks or circles

For each annotation, return:
- type: highlight, comment, stamp, strikethrough, or markup
- content: the annotated or marked text
- color: the color of the annotation if applicable
- note_text: any comment text associated with the annotation

Return as JSON: {"annotations": [...]}"""},
            ],
        }],
    )
    return json.loads(response.content[0].text)

Practical Use Case: Invoice Processing

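Applying the same pattern, here is an invoice extraction function that works with PDFs displayed in a browser viewer. It reads only the page currently on screen, so a multi-page invoice needs the page navigation described earlier: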

async def extract_invoice(page) -> dict:
    """Extract structured invoice data from a PDF in the browser."""
    screenshot = await page.screenshot()
    screenshot_b64 = base64.standard_b64encode(screenshot).decode()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64,
                }},
                {"type": "text", "text": """Extract all invoice data from this PDF:
- invoice_number, invoice_date, due_date
- vendor: {name, address, phone, email, tax_id}
- bill_to: {name, address}
- line_items: [{description, quantity, unit_price, amount}]
- subtotal, tax_rate, tax_amount, total
- payment_terms, bank_details if shown

Return as JSON with exact values as printed."""},
            ],
        }],
    )
    return json.loads(response.content[0].text)
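
Because the values are read visually, a cheap arithmetic check can catch misread digits before the data reaches an accounting system. The sketch below assumes the field names requested in the prompt above (`line_items`, `amount`, `subtotal`, `tax_amount`, `total`) and assumes amounts are printed with a period as the decimal separator. The helper names and the one-cent tolerance are illustrative choices.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def to_decimal(value) -> Optional[Decimal]:
    """Convert a printed amount such as '$1,234.50' to Decimal, or None if unreadable."""
    if value is None:
        return None
    # Keep digits, the decimal point, and a minus sign; drop currency symbols and commas
    cleaned = "".join(ch for ch in str(value) if ch.isdigit() or ch in ".-")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def check_invoice_totals(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> list:
    """Return a list of inconsistencies; an empty list means the arithmetic adds up."""
    problems = []
    subtotal = to_decimal(invoice.get("subtotal"))
    tax = to_decimal(invoice.get("tax_amount")) or Decimal("0")
    total = to_decimal(invoice.get("total"))
    amounts = [to_decimal(item.get("amount")) for item in invoice.get("line_items", [])]

    if any(amount is None for amount in amounts):
        problems.append("at least one line item amount could not be parsed")
    elif subtotal is not None and abs(sum(amounts, Decimal("0")) - subtotal) > tolerance:
        problems.append("line item amounts do not sum to the subtotal")

    if subtotal is not None and total is not None and abs(subtotal + tax - total) > tolerance:
        problems.append("subtotal plus tax does not equal the total")
    return problems
```

An invoice that returns a non-empty list is a good candidate for a second extraction pass or manual review.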

FAQ

Does this work with scanned PDFs that contain handwritten text?

Claude has strong OCR capabilities and can read many types of handwritten text, especially neatly written content. For heavily degraded scans or cursive handwriting, accuracy may drop. Test with representative samples from your document set before deploying.
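
One way to run such a test is to transcribe a few pages by hand and score each extraction against its transcript. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a rough similarity measure. The function names and the 0.95 threshold are assumptions for illustration, not recommended values.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so layout differences are not penalized."""
    return " ".join(text.split())


def text_similarity(extracted: str, expected: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 after whitespace normalization."""
    return SequenceMatcher(None, normalize(extracted), normalize(expected)).ratio()


def score_samples(samples: list, threshold: float = 0.95) -> dict:
    """Score (extracted, expected) pairs and list the indexes that fall below threshold."""
    if not samples:
        raise ValueError("samples must not be empty")
    scores = [text_similarity(extracted, expected) for extracted, expected in samples]
    return {
        "mean": sum(scores) / len(scores),
        "below_threshold": [i for i, score in enumerate(scores) if score < threshold],
    }
```

Samples listed in `below_threshold` show which kinds of handwriting or scan quality need closer inspection.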

How accurate is table extraction compared to specialized PDF libraries?

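For well-structured tables with clear borders, Claude's output is typically very accurate and comparable to what libraries like Camelot or Tabula produce. For borderless tables or tables with merged cells, Claude often does better than these libraries because it understands visual grouping and alignment rather than relying only on ruling lines or whitespace heuristics. Keep in mind that Camelot and Tabula need the PDF file itself, so they are not an option when only the rendered viewer is available.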

What about PDF forms with fillable fields?

Claude can read the values in filled PDF form fields since they are rendered visually. It can also identify which fields are empty and need to be filled, making it useful for PDF form processing workflows.
