
PDF Processing Agent: Extracting Text, Tables, and Charts from Documents

Build a PDF processing agent that extracts text, tables, and charts from documents using Python. Covers page-level parsing, table detection with pdfplumber, chart analysis with vision models, and structured output generation.

The Challenge of PDF Processing

PDFs are the most common format for business documents, yet they are notoriously difficult to process programmatically. A single PDF might contain flowing paragraphs, multi-column layouts, embedded tables, charts rendered as vector graphics, and scanned images of handwritten notes. An effective PDF processing agent must detect and handle each of these content types with the right tool.

Architecture of a PDF Processing Agent

The agent follows a three-stage pipeline:

  1. Page extraction — convert each page to both text and image representations
  2. Content classification — determine what type of content each page region contains
  3. Specialized extraction — apply the right tool to each content type

Install the required dependencies:

pip install pdfplumber pymupdf pillow openai

Stage 1: Page Extraction

Start by extracting both text and rendered images from each page. Having both representations lets the agent fall back to vision-based analysis when text extraction fails:

import pdfplumber
import fitz  # PyMuPDF
from dataclasses import dataclass, field
from PIL import Image
import io


@dataclass
class PageContent:
    page_number: int
    raw_text: str
    image: Image.Image
    tables: list[list[list[str]]] = field(default_factory=list)
    has_charts: bool = False


def extract_pages(pdf_path: str) -> list[PageContent]:
    """Extract text and images from every page of a PDF."""
    pages = []

    # Use PyMuPDF for page images
    doc = fitz.open(pdf_path)

    # Keep the pdfplumber handle open while reading pages: its page
    # objects read from the underlying file lazily, so they must not
    # outlive the with-block.
    with pdfplumber.open(pdf_path) as pdf:
        for i, plumber_page in enumerate(pdf.pages):
            # Extract raw text
            raw_text = plumber_page.extract_text() or ""

            # Extract tables, dropping empty rows and None cells
            tables = plumber_page.extract_tables() or []
            cleaned_tables = []
            for table in tables:
                cleaned = [
                    [cell or "" for cell in row]
                    for row in table
                    if any(cell for cell in row)
                ]
                if cleaned:
                    cleaned_tables.append(cleaned)

            # Render page as image
            mupdf_page = doc[i]
            mat = fitz.Matrix(2.0, 2.0)  # 2x zoom for clarity
            pix = mupdf_page.get_pixmap(matrix=mat)
            img = Image.open(io.BytesIO(pix.tobytes("png")))

            pages.append(PageContent(
                page_number=i + 1,
                raw_text=raw_text,
                image=img,
                tables=cleaned_tables,
            ))

    doc.close()
    return pages

Stage 2: Detecting Charts and Visual Elements

Tables are extracted directly by pdfplumber, but charts (bar graphs, pie charts, line plots) are rendered as vector graphics with no extractable text. A simple text-density heuristic flags pages that likely contain them: little prose outside of tables usually means the page is dominated by figures.


def detect_charts(page: PageContent) -> bool:
    """Heuristic: a page likely has charts if it has
    little text but significant visual content."""
    text_density = len(page.raw_text.strip())
    # Subtract text that belongs to extracted tables
    if page.tables:
        text_in_tables = sum(
            len(cell)
            for table in page.tables
            for row in table
            for cell in row
        )
        non_table_text = text_density - text_in_tables
    else:
        non_table_text = text_density

    # If page has very little non-table text, likely has
    # charts or figures
    return non_table_text < 200 and text_density < 500

For robust chart detection, send the page image to a vision model:

import base64
import json

import openai


async def analyze_chart(
    img: Image.Image, client: openai.AsyncOpenAI
) -> dict:
    """Use GPT-4o to extract data from a chart image."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Analyze this chart. Return a JSON object with: "
                        "chart_type, title, x_axis_label, y_axis_label, "
                        "and data_points as a list of {label, value} objects."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64}"},
                },
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

Stage 3: The PDF Agent

Combine everything into an agent that answers questions about PDF content:

class PDFProcessingAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()
        self.pages: list[PageContent] = []

    def load(self, pdf_path: str) -> int:
        """Load a PDF and return the page count."""
        self.pages = extract_pages(pdf_path)
        for page in self.pages:
            page.has_charts = detect_charts(page)
        return len(self.pages)

    @staticmethod
    def _format_row(row: list[str]) -> str:
        """Render one row as markdown, escaping characters that break cells."""
        cells = [c.replace("\n", " ").replace("|", "\\|") for c in row]
        return "| " + " | ".join(cells) + " |"

    def _format_tables(self, tables: list[list[list[str]]]) -> str:
        """Convert tables to markdown format."""
        parts = []
        for table in tables:
            if not table:
                continue
            header = self._format_row(table[0])
            sep = "| " + " | ".join("---" for _ in table[0]) + " |"
            rows = [self._format_row(row) for row in table[1:]]
            parts.append("\n".join([header, sep] + rows))
        return "\n\n".join(parts)

    async def query(self, question: str) -> str:
        """Answer a question about the loaded PDF."""
        context_parts = []
        for page in self.pages:
            parts = [f"--- Page {page.page_number} ---"]
            if page.raw_text.strip():
                parts.append(page.raw_text.strip())
            if page.tables:
                parts.append(
                    "Tables:\n" + self._format_tables(page.tables)
                )
            if page.has_charts:
                chart_data = await analyze_chart(
                    page.image, self.client
                )
                parts.append(f"Chart data: {chart_data}")
            context_parts.append("\n".join(parts))

        full_context = "\n\n".join(context_parts)

        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a document analysis agent. Answer "
                        "questions based on the extracted PDF content."
                    ),
                },
                {
                    "role": "user",
                    "content": (
                        f"Document content:\n{full_context}\n\n"
                        f"Question: {question}"
                    ),
                },
            ],
        )
        return response.choices[0].message.content

Usage Example

import asyncio


async def main():
    agent = PDFProcessingAgent()
    page_count = agent.load("quarterly_report.pdf")
    print(f"Loaded {page_count} pages")

    answer = await agent.query(
        "What was the revenue growth rate in Q3?"
    )
    print(answer)

asyncio.run(main())

FAQ

How do I handle scanned PDFs with no extractable text?

For scanned PDFs, pdfplumber returns empty text. In that case, fall back to OCR by running Tesseract on the rendered page image. Add a check in the extraction stage: if raw_text is empty or very short, apply pytesseract.image_to_string(page.image) and use that as the text content.
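That fallback can be sketched as follows. The names `needs_ocr`, `OCR_MIN_CHARS`, and `text_with_ocr_fallback` are illustrative, the threshold is an assumption to tune for your documents, and the OCR path assumes pytesseract plus the Tesseract binary are installed:

```python
OCR_MIN_CHARS = 20  # assumed threshold: shorter extractions are treated as failed


def needs_ocr(raw_text: str, min_chars: int = OCR_MIN_CHARS) -> bool:
    """True when the extracted text is too short to be trusted."""
    return len(raw_text.strip()) < min_chars


def text_with_ocr_fallback(raw_text: str, image) -> str:
    """Return extracted text, falling back to OCR on the rendered page image."""
    if not needs_ocr(raw_text):
        return raw_text
    import pytesseract  # requires the Tesseract binary on PATH
    return pytesseract.image_to_string(image)
```

Wiring `text_with_ocr_fallback` into the loop in extract_pages, right after the `extract_text()` call, keeps the rest of the pipeline unchanged.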

What is the best approach for extracting complex nested tables?

Pdfplumber handles simple tables well but struggles with merged cells, nested headers, and spanning rows. For complex tables, send the page image to GPT-4o with a prompt asking it to extract the table as a JSON array. The vision model understands visual table structure better than rule-based parsers for complex layouts.
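One way to target the vision model at just the table is to crop the rendered page image to the table's bounding box. pdfplumber's find_tables() returns table objects with a .bbox in PDF points, while the pages in Stage 1 were rendered at 2x zoom, so coordinates need scaling first; `bbox_to_pixels` is an illustrative helper for that conversion:

```python
def bbox_to_pixels(bbox: tuple, zoom: float = 2.0) -> tuple:
    """Scale a pdfplumber bbox (x0, top, x1, bottom, in PDF points)
    to pixel coordinates in a page image rendered at the given zoom."""
    x0, top, x1, bottom = bbox
    return (int(x0 * zoom), int(top * zoom), int(x1 * zoom), int(bottom * zoom))


# Usage sketch (assumes `plumber_page` and the rendered `img` from Stage 1):
# for table in plumber_page.find_tables():
#     region = img.crop(bbox_to_pixels(table.bbox))
#     # send `region` to the vision model with a table-extraction prompt
```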

How do I process very large PDFs without running out of memory?

Process pages in batches rather than loading the entire document at once. Modify extract_pages to yield pages lazily using a generator. For the agent query step, first identify which pages are relevant to the question using a lightweight text search or embedding-based retrieval, then only process those pages in detail.
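The lightweight text-search step can be sketched with a simple keyword-overlap score. `score_page` and `top_pages` are illustrative names, and a production system would likely use embedding-based retrieval instead:

```python
def score_page(text: str, question: str) -> int:
    """Count distinct question keywords (longer than 3 chars) found in the page."""
    keywords = {
        w.lower().strip("?.,!") for w in question.split() if len(w) > 3
    }
    lower = text.lower()
    return sum(1 for w in keywords if w in lower)


def top_pages(page_texts: list[str], question: str, k: int = 3) -> list[int]:
    """Return the indices of the k highest-scoring pages, in document order."""
    ranked = sorted(
        range(len(page_texts)),
        key=lambda i: score_page(page_texts[i], question),
        reverse=True,
    )
    return sorted(ranked[:k])
```

Only the pages returned by `top_pages` would then go through chart analysis and into the query context, keeping both memory use and vision-model calls bounded.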



CallSphere Team