PDF Processing Agent: Extracting Text, Tables, and Charts from Documents
Build a PDF processing agent that extracts text, tables, and charts from documents using Python. Covers page-level parsing, table detection with pdfplumber, chart analysis with vision models, and structured output generation.
The Challenge of PDF Processing
PDFs are the most common format for business documents, yet they are notoriously difficult to process programmatically. A single PDF might contain flowing paragraphs, multi-column layouts, embedded tables, charts rendered as vector graphics, and scanned images of handwritten notes. An effective PDF processing agent must detect and handle each of these content types with the right tool.
Architecture of a PDF Processing Agent
The agent follows a three-stage pipeline:
- Page extraction — convert each page to both text and image representations
- Content classification — determine what type of content each page region contains
- Specialized extraction — apply the right tool to each content type
Install the required dependencies:
```bash
pip install pdfplumber pymupdf pillow openai
```
Stage 1: Page Extraction
Start by extracting both text and rendered images from each page. Having both representations lets the agent fall back to vision-based analysis when text extraction fails:
```python
import io
from dataclasses import dataclass, field

import fitz  # PyMuPDF
import pdfplumber
from PIL import Image


@dataclass
class PageContent:
    page_number: int
    raw_text: str
    image: Image.Image
    tables: list[list[list[str]]] = field(default_factory=list)
    has_charts: bool = False


def extract_pages(pdf_path: str) -> list[PageContent]:
    """Extract text and images from every page of a PDF."""
    pages = []
    doc = fitz.open(pdf_path)  # PyMuPDF, used for page rendering
    # Keep the pdfplumber file open while iterating: its pages read
    # lazily from the underlying stream, so all extraction must
    # happen inside the with block.
    with pdfplumber.open(pdf_path) as pdf:
        for i, plumber_page in enumerate(pdf.pages):
            # Extract raw text
            raw_text = plumber_page.extract_text() or ""

            # Extract tables, normalizing None cells and dropping
            # fully empty rows
            tables = plumber_page.extract_tables() or []
            cleaned_tables = []
            for table in tables:
                cleaned = [
                    [cell or "" for cell in row]
                    for row in table
                    if any(cell for cell in row)
                ]
                if cleaned:
                    cleaned_tables.append(cleaned)

            # Render the page as an image
            mupdf_page = doc[i]
            mat = fitz.Matrix(2.0, 2.0)  # 2x zoom for clarity
            pix = mupdf_page.get_pixmap(matrix=mat)
            img = Image.open(io.BytesIO(pix.tobytes("png")))

            pages.append(PageContent(
                page_number=i + 1,
                raw_text=raw_text,
                image=img,
                tables=cleaned_tables,
            ))
    doc.close()
    return pages
```
Stage 2: Detecting Charts and Visual Elements
Tables are extracted directly by pdfplumber, but charts — bar graphs, pie charts, line plots — are rendered as graphics with no extractable text. Detect them by checking for visual elements without corresponding text:
```python
def detect_charts(page: PageContent) -> bool:
    """Heuristic: a page likely has charts if it has little text
    but significant visual content."""
    text_density = len(page.raw_text.strip())

    # Subtract text that already lives in extracted tables
    if page.tables:
        text_in_tables = sum(
            len(cell)
            for table in page.tables
            for row in table
            for cell in row
        )
        non_table_text = text_density - text_in_tables
    else:
        non_table_text = text_density

    # Very little non-table text suggests charts or figures
    return non_table_text < 200 and text_density < 500
```
For robust chart detection, send the page image to a vision model:
```python
import base64
import json

import openai


async def analyze_chart(
    img: Image.Image, client: openai.AsyncOpenAI
) -> dict:
    """Use GPT-4o to extract data from a chart image."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Analyze this chart. Return a JSON object with: "
                        "chart_type, title, x_axis_label, y_axis_label, "
                        "and data_points as a list of {label, value} objects."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64}"},
                },
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
Stage 3: The PDF Agent
Combine everything into an agent that answers questions about PDF content:
```python
class PDFProcessingAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()
        self.pages: list[PageContent] = []

    def load(self, pdf_path: str) -> int:
        """Load a PDF and return the page count."""
        self.pages = extract_pages(pdf_path)
        for page in self.pages:
            page.has_charts = detect_charts(page)
        return len(self.pages)

    def _format_tables(self, tables: list[list[list[str]]]) -> str:
        """Convert tables to Markdown format."""
        parts = []
        for table in tables:
            if not table:
                continue
            header = "| " + " | ".join(table[0]) + " |"
            sep = "| " + " | ".join("---" for _ in table[0]) + " |"
            rows = [
                "| " + " | ".join(row) + " |"
                for row in table[1:]
            ]
            parts.append("\n".join([header, sep] + rows))
        return "\n\n".join(parts)

    async def query(self, question: str) -> str:
        """Answer a question about the loaded PDF."""
        context_parts = []
        for page in self.pages:
            parts = [f"--- Page {page.page_number} ---"]
            if page.raw_text.strip():
                parts.append(page.raw_text.strip())
            if page.tables:
                parts.append(
                    "Tables:\n" + self._format_tables(page.tables)
                )
            if page.has_charts:
                chart_data = await analyze_chart(
                    page.image, self.client
                )
                parts.append(f"Chart data: {chart_data}")
            context_parts.append("\n".join(parts))
        full_context = "\n\n".join(context_parts)

        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a document analysis agent. Answer "
                        "questions based on the extracted PDF content."
                    ),
                },
                {
                    "role": "user",
                    "content": (
                        f"Document content:\n{full_context}\n\n"
                        f"Question: {question}"
                    ),
                },
            ],
        )
        return response.choices[0].message.content
```
Usage Example
```python
import asyncio


async def main():
    agent = PDFProcessingAgent()
    page_count = agent.load("quarterly_report.pdf")
    print(f"Loaded {page_count} pages")

    answer = await agent.query(
        "What was the revenue growth rate in Q3?"
    )
    print(answer)

asyncio.run(main())
```
FAQ
How do I handle scanned PDFs with no extractable text?
For scanned PDFs, pdfplumber returns empty text. In that case, fall back to OCR by running Tesseract on the rendered page image. Add a check in the extraction stage: if raw_text is empty or very short, apply pytesseract.image_to_string(page.image) and use that as the text content.
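A minimal sketch of that fallback, assuming pytesseract and the Tesseract binary are installed. The `min_chars` threshold is an illustrative cutoff, not part of the pipeline above:

```python
def page_text_or_ocr(raw_text: str, image, min_chars: int = 20) -> str:
    """Return the extracted text, falling back to OCR when the page
    yields little or none of it (typical of scanned PDFs)."""
    if len(raw_text.strip()) >= min_chars:
        return raw_text
    # Lazy import: only scanned documents need Tesseract installed
    import pytesseract
    return pytesseract.image_to_string(image)
```

Inside extract_pages, this slots in right after the page image is rendered: `raw_text = page_text_or_ocr(raw_text, img)`.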
What is the best approach for extracting complex nested tables?
pdfplumber handles simple tables well but struggles with merged cells, nested headers, and rows that span multiple lines. For complex tables, send the page image to GPT-4o with a prompt asking it to extract the table as a JSON array. The vision model understands visual table structure better than rule-based parsers for complex layouts.
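One way to sketch that request is to build the same image-message payload used for charts, with a table-specific prompt. The prompt wording and the "tables" JSON shape here are illustrative choices, not a fixed schema:

```python
def table_extraction_messages(b64_png: str) -> list[dict]:
    """Build a GPT-4o message payload asking for page tables as JSON."""
    prompt = (
        "Extract every table on this page as a JSON object with a "
        "'tables' key: a list of tables, each table a list of rows, "
        "each row a list of cell strings. Repeat merged-cell values "
        "in every cell they span."
    )
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64_png}"},
            },
        ],
    }]
```

Pass the result as `messages=` together with `response_format={"type": "json_object"}`, exactly as analyze_chart does.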
How do I process very large PDFs without running out of memory?
Process pages in batches rather than loading the entire document at once. Modify extract_pages to yield pages lazily using a generator. For the agent query step, first identify which pages are relevant to the question using a lightweight text search or embedding-based retrieval, then only process those pages in detail.
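As a sketch of the batching half, a generator version of extract_pages (a hypothetical `iter_pages` that yields one PageContent at a time) can be consumed in fixed-size batches with a small stand-alone helper:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield successive lists of up to `size` items from any iterable,
    so only one batch of rendered pages sits in memory at a time."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


# Hypothetical usage with a lazy page iterator:
# for batch in batched(iter_pages("big.pdf"), 10):
#     process(batch)
```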
CallSphere Team