Claude Vision for PDF Processing in the Browser: Reading Documents Without Download
Use Claude Computer Use to read PDFs rendered in browser viewers — navigating pages, extracting text and tables, detecting annotations, and converting visual PDF content to structured data without file downloads.
The Browser PDF Problem
Many web applications display PDFs directly in the browser: embedded in iframes, rendered through custom viewers like PDF.js, or shown in the browser's native PDF viewer. Traditional scraping tools often cannot reach this content because the viewer typically paints each page onto a canvas element instead of emitting parseable HTML. Downloading the file is not always possible either, since some applications disable downloads, use DRM, or generate PDFs on the fly.
Claude Computer Use solves this by reading the PDF the same way a human would: looking at the rendered pages and extracting information visually. This works with any PDF viewer, regardless of how the content is rendered.
Basic PDF Page Reading
The simplest approach captures a screenshot of the PDF viewer and asks Claude to extract the text content:
import anthropic
import json
import base64
from playwright.async_api import async_playwright

# Synchronous client: each call blocks the event loop until the API responds
client = anthropic.Anthropic()

async def read_pdf_page(page, frame_selector: str | None = None) -> str:
    """Read text content from a PDF displayed in the browser."""
    if frame_selector:
        # PDF in an iframe: screenshot only the frame's body
        frame = page.frame_locator(frame_selector)
        screenshot = await frame.locator("body").screenshot()
    else:
        # Otherwise capture the visible viewport
        screenshot = await page.screenshot()
    screenshot_b64 = base64.standard_b64encode(screenshot).decode()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64,
                }},
                {"type": "text", "text": """Extract ALL text content from this PDF page
exactly as it appears. Preserve:
- Paragraph structure
- Headings and subheadings
- Bullet points and numbered lists
- Table structure (format as markdown tables)
- Any footnotes or annotations
Do not summarize or paraphrase. Return the exact text content."""},
            ],
        }],
    )
    return response.content[0].text
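The function above returns `response.content[0].text`, which assumes the first content block is a text block. A slightly more defensive sketch joins every text block instead. `collect_text` is a hypothetical helper name, not part of the Anthropic SDK, and it assumes only that each block exposes `type` and `text` attributes; the sample blocks are stand-ins for illustration.

```python
from types import SimpleNamespace

def collect_text(content_blocks) -> str:
    """Join the text of all blocks whose type is "text", skipping the rest."""
    parts = [
        block.text
        for block in content_blocks
        if getattr(block, "type", None) == "text"
    ]
    return "\n".join(parts)

# Stand-in objects shaped like response content blocks (illustrative only).
blocks = [
    SimpleNamespace(type="text", text="Page heading"),
    SimpleNamespace(type="tool_use", input={"action": "screenshot"}),
    SimpleNamespace(type="text", text="Body paragraph"),
]
print(collect_text(blocks))
```

With a helper like this, `read_pdf_page` could return `collect_text(response.content)` and would not fail if a non-text block happened to come first.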
Multi-Page PDF Navigation
Reading a full PDF requires navigating through all pages. Since we are working through the browser's PDF viewer, we need to use the viewer's navigation controls:
import asyncio

class BrowserPDFReader:
    def __init__(self):
        self.client = anthropic.Anthropic()
        self.pages_content = []

    async def read_full_pdf(self, page, total_pages: int | None = None) -> list[dict]:
        """Read all pages of a PDF in the browser viewer."""
        if total_pages is None:
            total_pages = await self._detect_page_count(page)
        for page_num in range(total_pages):
            # Navigate to the page (the viewer is assumed to open on page 1)
            if page_num > 0:
                await self._go_to_page(page, page_num + 1)
                await asyncio.sleep(1)  # Wait for render
            # Read current page content
            content = await self._read_current_page(page)
            self.pages_content.append({
                "page_number": page_num + 1,
                "content": content
            })
        return self.pages_content

    async def _detect_page_count(self, page) -> int:
        """Detect total page count from the PDF viewer UI."""
        screenshot = await page.screenshot()
        screenshot_b64 = base64.standard_b64encode(screenshot).decode()
        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_b64,
                    }},
                    {"type": "text", "text": """Look at the PDF viewer controls.
Find the total page count indicator (usually shows "Page X of Y" or "X / Y").
Return ONLY a JSON object with the total number of pages, e.g.: {"total_pages": 15}"""},
                ],
            }],
        )
        # Raises json.JSONDecodeError if the reply is not bare JSON
        result = json.loads(response.content[0].text)
        return result["total_pages"]

    async def _go_to_page(self, page, target_page: int):
        """Navigate to a specific page in the PDF viewer."""
        screenshot = await page.screenshot()
        screenshot_b64 = base64.standard_b64encode(screenshot).decode()
        # The computer tool is a beta feature, so it goes through client.beta
        response = self.client.beta.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=512,
            # Claude 4 models pair with the 20250124 tool version and beta flag
            betas=["computer-use-2025-01-24"],
            tools=[{
                "type": "computer_20250124",
                "name": "computer",
                # Should match the Playwright viewport so coordinates line up
                "display_width_px": 1280,
                "display_height_px": 800,
                "display_number": 0,
            }],
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_b64,
                    }},
                    {"type": "text", "text": f"Navigate to page {target_page} in this PDF viewer. "
                     f"Click on the page number input field, clear it, type {target_page}, and press Enter."},
                ],
            }],
        )
        # Execute the returned actions. A single response often contains only
        # the next action; a full agent loop would send back tool results and
        # request again until the navigation is complete.
        # _execute_action is expected to translate the tool input into
        # Playwright mouse and keyboard calls.
        for block in response.content:
            if block.type == "tool_use":
                await self._execute_action(page, block.input)

    async def _read_current_page(self, page) -> str:
        """Extract the text of the page currently visible in the viewer."""
        screenshot = await page.screenshot()
        screenshot_b64 = base64.standard_b64encode(screenshot).decode()
        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_b64,
                    }},
                    {"type": "text", "text": "Extract all text content from this PDF page. Preserve structure exactly."},
                ],
            }],
        )
        return response.content[0].text
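The class calls `self._execute_action` but never defines it. One possible shape for that method is sketched below as a standalone function. The action names (`left_click`, `double_click`, `triple_click`, `type`, `key`, `screenshot`) and the `coordinate` and `text` fields are assumptions based on the computer tool's action format, the `KEY_MAP` table is a hypothetical and incomplete translation from xdotool-style key names to Playwright key names, and coordinates are assumed to be in the same pixel space as the Playwright viewport.

```python
# Assumed mapping from xdotool-style key names to Playwright key names.
# Hypothetical and incomplete; extend it for the keys your viewer needs.
KEY_MAP = {
    "Return": "Enter",
    "ctrl": "Control",
    "alt": "Alt",
    "shift": "Shift",
    "BackSpace": "Backspace",
    "Page_Down": "PageDown",
    "Page_Up": "PageUp",
}

def to_playwright_key(combo: str) -> str:
    """Translate a combo such as 'ctrl+a' into 'Control+a'."""
    return "+".join(KEY_MAP.get(part, part) for part in combo.split("+"))

async def execute_action(page, action: dict) -> None:
    """Apply one computer-tool action to a Playwright page (sketch)."""
    kind = action.get("action")
    if kind == "left_click":
        x, y = action["coordinate"]
        await page.mouse.click(x, y)
    elif kind == "double_click":
        x, y = action["coordinate"]
        await page.mouse.dblclick(x, y)
    elif kind == "triple_click":
        # Three clicks select the whole content of most input fields
        x, y = action["coordinate"]
        await page.mouse.click(x, y, click_count=3)
    elif kind == "type":
        await page.keyboard.type(action["text"])
    elif kind == "key":
        await page.keyboard.press(to_playwright_key(action["text"]))
    elif kind == "screenshot":
        pass  # the calling code already captures its own screenshots
    else:
        raise ValueError(f"Unsupported action: {kind!r}")
```

Inside `BrowserPDFReader`, `_execute_action` could simply delegate to this function. Raising on unknown actions, instead of ignoring them, makes it obvious when the model requests something the sketch does not handle, such as scrolling.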
Table Extraction from PDFs
PDFs often contain tables that are notoriously difficult to extract with text-based tools. Vision-based extraction handles complex table layouts naturally:
async def extract_pdf_tables(page) -> dict:
    """Extract structured table data from the current PDF page.

    Returns the parsed reply, a dict of the form {"tables": [...]}.
    """
    screenshot = await page.screenshot()
    screenshot_b64 = base64.standard_b64encode(screenshot).decode()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64,
                }},
                {"type": "text", "text": """Find all tables on this PDF page.
For each table, extract:
- table_title: any title or caption above the table
- headers: list of column headers
- rows: list of rows, each row is a list of cell values
- has_merged_cells: true if any cells span multiple rows/columns
- notes: any footnotes or annotations related to the table
Return as JSON: {"tables": [...]}"""},
            ],
        }],
    )
    # Raises json.JSONDecodeError if the reply is not bare JSON
    return json.loads(response.content[0].text)
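Once a table comes back in the `headers` and `rows` shape requested by the prompt, it can be written out with the standard library. The sketch below is one way to do that. `table_to_csv` is a hypothetical helper, the sample table is made-up data, and the padding of short rows is an assumption about how ragged output should be handled.

```python
import csv
import io

def table_to_csv(table: dict) -> str:
    """Serialize one extracted table (headers plus rows) to CSV text."""
    headers = table.get("headers", [])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers)
    for row in table.get("rows", []):
        # Trim long rows and pad short ones so each line has len(headers) cells.
        cells = list(row)[:len(headers)] + [""] * (len(headers) - len(row))
        writer.writerow(cells)
    return buffer.getvalue()

# Made-up table in the shape the extraction prompt asks for.
sample = {
    "table_title": "Quarterly totals",
    "headers": ["Region", "Q1", "Q2"],
    "rows": [["North", "1,200", "1,350"], ["South", "980"]],
}
print(table_to_csv(sample))
```

Cell values are kept as strings exactly as extracted, so figures such as `1,200` are quoted in the CSV output and any numeric conversion is left to the consumer.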
Annotation Detection
PDFs with highlights, comments, and stamps contain important metadata. Claude can detect these visual annotations:
async def detect_annotations(page) -> dict:
    """Detect highlights, comments, and other annotations on the PDF.

    Returns the parsed reply, a dict of the form {"annotations": [...]}.
    """
    screenshot = await page.screenshot()
    screenshot_b64 = base64.standard_b64encode(screenshot).decode()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64,
                }},
                {"type": "text", "text": """Identify all annotations visible on this PDF page:
- Highlighted text (text with colored background)
- Margin comments or sticky notes
- Stamps (Approved, Draft, Confidential, etc.)
- Underlined or struck-through text
- Hand-drawn marks or circles
For each annotation, return:
- type: highlight, comment, stamp, strikethrough, or markup
- content: the annotated or marked text
- color: the color of the annotation if applicable
- note_text: any comment text associated with the annotation
Return as JSON: {"annotations": [...]}"""},
            ],
        }],
    )
    # Raises json.JSONDecodeError if the reply is not bare JSON
    return json.loads(response.content[0].text)
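Like the other extraction helpers, this function passes the reply straight to `json.loads`, which fails if the model wraps the JSON in a code fence or adds a sentence around it. A minimal sketch of a more tolerant parser follows. `parse_json_reply` is a hypothetical helper, and it assumes the reply contains exactly one top-level JSON object.

```python
import json

def parse_json_reply(text: str) -> dict:
    """Parse a model reply that is expected to contain one JSON object."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span, which skips code fences
    # and any prose before or after the object.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in reply")
    return json.loads(text[start:end + 1])
```

Swapping this in for the bare `json.loads` calls keeps the happy path unchanged and only does extra work when the first parse fails. A reply with two separate objects, or truncated JSON, still raises an error.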
Practical Use Case: Invoice Processing
Applying the same screenshot-and-prompt pattern, the following function extracts structured invoice data from a PDF displayed in any browser viewer:
async def extract_invoice(page) -> dict:
    """Extract structured invoice data from a PDF in the browser."""
    screenshot = await page.screenshot()
    screenshot_b64 = base64.standard_b64encode(screenshot).decode()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64,
                }},
                {"type": "text", "text": """Extract all invoice data from this PDF:
- invoice_number, invoice_date, due_date
- vendor: {name, address, phone, email, tax_id}
- bill_to: {name, address}
- line_items: [{description, quantity, unit_price, amount}]
- subtotal, tax_rate, tax_amount, total
- payment_terms, bank_details if shown
Return as JSON with exact values as printed."""},
            ],
        }],
    )
    return json.loads(response.content[0].text)
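Because the values are read visually, a cheap arithmetic check can catch misread digits before the data reaches an accounting system. The sketch below assumes the field names from the prompt above, amounts printed with `$` and comma thousands separators, and a one-cent tolerance. `to_decimal` and `check_invoice_totals` are hypothetical helpers.

```python
from decimal import Decimal, InvalidOperation

def to_decimal(value) -> Decimal:
    """Convert a printed amount such as '$1,234.50' to a Decimal."""
    cleaned = str(value).replace(",", "").replace("$", "").strip()
    return Decimal(cleaned)

def check_invoice_totals(invoice: dict, tolerance: str = "0.01") -> list[str]:
    """Return a list of arithmetic inconsistencies; an empty list means none."""
    tol = Decimal(tolerance)
    try:
        line_sum = sum(
            (to_decimal(item["amount"]) for item in invoice["line_items"]),
            Decimal("0"),
        )
        subtotal = to_decimal(invoice["subtotal"])
        tax = to_decimal(invoice["tax_amount"])
        total = to_decimal(invoice["total"])
    except (KeyError, TypeError, InvalidOperation) as exc:
        return [f"missing or unparseable field: {exc!r}"]

    problems = []
    if abs(line_sum - subtotal) > tol:
        problems.append(f"line items sum to {line_sum}, subtotal says {subtotal}")
    if abs(subtotal + tax - total) > tol:
        problems.append(f"subtotal + tax = {subtotal + tax}, total says {total}")
    return problems
```

An invoice that fails the check is a candidate for a second extraction pass or manual review. Invoices with discounts, shipping, or other currencies would need additional fields and parsing rules.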
FAQ
Does this work with scanned PDFs that contain handwritten text?
Claude has strong OCR capabilities and can read many types of handwritten text, especially neatly written content. For heavily degraded scans or cursive handwriting, accuracy may drop. Test with representative samples from your document set before deploying.
How accurate is table extraction compared to specialized PDF libraries?
For well-structured tables with clear borders, Claude can come close to the accuracy of libraries like Camelot or Tabula, although those libraries read the embedded text directly and so do not misread characters. For borderless tables or tables with merged cells, Claude may do better because it works from visual grouping and alignment. Compare both approaches on your own documents before choosing one.
What about PDF forms with fillable fields?
Claude can read the values in filled PDF form fields since they are rendered visually. It can also identify which fields are empty and need to be filled, making it useful for PDF form processing workflows.
CallSphere Team