
Building Document Processing Agents: PDF, Email, and Spreadsheet Automation

Technical guide to building AI agents that automate document processing — PDF parsing and extraction, email classification and routing, and spreadsheet analysis with reporting.

The Case for Document Processing Agents

Every enterprise runs on documents. Invoices arrive as PDFs. Contracts land in email attachments. Financial reports live in spreadsheets. Teams spend thousands of hours per year manually extracting data from these documents, classifying them, routing them to the right people, and entering the results into downstream systems.

Document processing agents automate this entire pipeline. Unlike simple OCR tools or rule-based extractors, agents understand context, handle edge cases, and adapt to format variations without reprogramming. An agent processing invoices does not just extract the total — it validates line items against purchase orders, flags discrepancies, and routes exceptions to the right approver.
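That validation step can be sketched in plain Python. The `LineItem` shape, field names, and price tolerance below are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class LineItem:
    description: str
    quantity: int
    unit_price: float

def validate_against_po(invoice_items: list[LineItem],
                        po_items: dict[str, LineItem],
                        tolerance: float = 0.01) -> list[str]:
    """Compare invoice line items against purchase-order items and
    return human-readable discrepancies for an approver to review."""
    issues = []
    for item in invoice_items:
        po_item = po_items.get(item.description)
        if po_item is None:
            issues.append(f"'{item.description}' is not on the PO")
        elif item.quantity > po_item.quantity:
            issues.append(
                f"'{item.description}': billed qty {item.quantity} "
                f"exceeds ordered qty {po_item.quantity}"
            )
        elif abs(item.unit_price - po_item.unit_price) > tolerance:
            issues.append(
                f"'{item.description}': price mismatch "
                f"({item.unit_price} vs {po_item.unit_price})"
            )
    return issues
```

An agent would run a check like this after extraction and route any non-empty issue list to the exception queue instead of posting the invoice.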

PDF Parsing and Extraction

PDFs are the most challenging document format because they encode visual layout rather than semantic structure. A table in a PDF is just a collection of text fragments positioned at specific coordinates — there is no table element. Modern PDF processing combines layout analysis with LLM-based extraction to handle this.

import fitz  # PyMuPDF
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from pathlib import Path

class InvoiceData(BaseModel):
    vendor_name: str
    invoice_number: str
    invoice_date: str
    due_date: str
    line_items: list[dict] = Field(
        description="List of {description, quantity, unit_price, total}"
    )
    subtotal: float
    tax: float
    total: float
    payment_terms: str | None = None

class PDFProcessor:
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)

    def extract_text_with_layout(self, pdf_path: str) -> str:
        doc = fitz.open(pdf_path)
        full_text = []
        for page_num, page in enumerate(doc):
            blocks = page.get_text("blocks")
            blocks.sort(key=lambda b: (b[1], b[0]))  # sort by y, then x
            page_text = []
            for block in blocks:
                text = block[4].strip()
                if text:
                    page_text.append(text)
            full_text.append(
                f"=== Page {page_num + 1} ===\n" + "\n".join(page_text)
            )
        doc.close()
        return "\n\n".join(full_text)

    def extract_tables(self, pdf_path: str) -> list[list[list[str]]]:
        doc = fitz.open(pdf_path)
        tables = []
        for page in doc:
            tabs = page.find_tables()
            for tab in tabs:
                table_data = tab.extract()
                if table_data:
                    tables.append(table_data)
        doc.close()
        return tables

    async def extract_invoice(self, pdf_path: str) -> InvoiceData:
        text = self.extract_text_with_layout(pdf_path)
        tables = self.extract_tables(pdf_path)

        prompt = f"""Extract invoice data from this PDF content.

Text content:
{text}

Tables found:
{tables}

Extract all fields precisely. For line items, include every row
from the invoice table. Calculate and verify the total matches
the sum of line items plus tax."""

        extractor = self.llm.with_structured_output(InvoiceData)
        return await extractor.ainvoke(prompt)
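The prompt asks the model to verify the totals, but arithmetic is cheap to re-check in code after extraction rather than trusting the model. A minimal post-check (the helper name and the 0.01 tolerance are assumptions):

```python
def totals_consistent(subtotal: float, tax: float, total: float,
                      line_item_totals: list[float],
                      tolerance: float = 0.01) -> bool:
    """Verify extracted invoice totals are internally consistent:
    line items sum to the subtotal, and subtotal + tax equals total."""
    if abs(sum(line_item_totals) - subtotal) > tolerance:
        return False
    return abs(subtotal + tax - total) <= tolerance
```

If this returns False, the extraction (or the invoice itself) is suspect and should be routed to human review rather than posted downstream.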

For handling scanned PDFs (image-based), add an OCR layer before extraction:

import pytesseract
from pdf2image import convert_from_path

class ScannedPDFProcessor(PDFProcessor):
    def extract_text_with_layout(self, pdf_path: str) -> str:
        # First try direct text extraction
        text = super().extract_text_with_layout(pdf_path)
        if len(text.strip()) > 100:
            return text

        # Fall back to OCR for scanned documents
        images = convert_from_path(pdf_path, dpi=300)
        ocr_texts = []
        for i, image in enumerate(images):
            ocr_text = pytesseract.image_to_string(image)
            ocr_texts.append(f"=== Page {i + 1} ===\n{ocr_text}")
        return "\n\n".join(ocr_texts)

Email Classification and Routing Agent

Email processing agents need to classify incoming messages, extract actionable information, and route them to the right team or workflow. The agent architecture uses a classifier stage followed by specialized extractors for each email type.


from enum import Enum
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
import imaplib
import email
from email.header import decode_header

class EmailCategory(str, Enum):
    INVOICE = "invoice"
    SUPPORT_REQUEST = "support_request"
    SALES_INQUIRY = "sales_inquiry"
    COMPLIANCE = "compliance"
    INTERNAL = "internal"
    SPAM = "spam"

class ClassifiedEmail(BaseModel):
    category: EmailCategory
    priority: str = Field(description="high, medium, or low")
    summary: str = Field(description="One-sentence summary")
    action_required: str = Field(description="What action is needed")
    route_to: str = Field(description="Team or person to route to")

class EmailAgent:
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.routing_rules = {
            EmailCategory.INVOICE: "finance@company.com",
            EmailCategory.SUPPORT_REQUEST: "support-queue",
            EmailCategory.SALES_INQUIRY: "sales-team",
            EmailCategory.COMPLIANCE: "legal@company.com",
            EmailCategory.INTERNAL: "auto-archive",
            EmailCategory.SPAM: "trash",
        }

    async def classify(
        self, subject: str, body: str, sender: str
    ) -> ClassifiedEmail:
        prompt = f"""Classify this email and determine routing.

From: {sender}
Subject: {subject}
Body: {body[:2000]}

Categories: invoice, support_request, sales_inquiry,
compliance, internal, spam

Priority rules:
- high: legal/compliance, payment issues, outages
- medium: support requests, sales with budget mentioned
- low: general inquiries, internal updates"""

        classifier = self.llm.with_structured_output(ClassifiedEmail)
        result = await classifier.ainvoke(prompt)

        # Deterministic routing rules override the model's suggestion
        result.route_to = self.routing_rules.get(
            result.category, "general-inbox"
        )
        return result

    async def process_inbox(self, imap_config: dict) -> list[ClassifiedEmail]:
        mail = imaplib.IMAP4_SSL(imap_config["host"])
        mail.login(imap_config["user"], imap_config["password"])
        mail.select("inbox")

        _, messages = mail.search(None, "UNSEEN")
        results = []

        for msg_id in messages[0].split():
            _, data = mail.fetch(msg_id, "(RFC822)")
            msg = email.message_from_bytes(data[0][1])

            subject = decode_header(msg["Subject"])[0][0]
            if isinstance(subject, bytes):
                subject = subject.decode()
            sender = msg["From"]
            body = self._get_body(msg)

            classified = await self.classify(subject, body, sender)
            results.append(classified)

        mail.logout()
        return results

    def _get_body(self, msg) -> str:
        if msg.is_multipart():
            for part in msg.walk():
                if part.get_content_type() == "text/plain":
                    return part.get_payload(decode=True).decode(
                        errors="replace"
                    )
            return ""  # multipart message with no text/plain part
        payload = msg.get_payload(decode=True)
        return payload.decode(errors="replace") if payload else ""

Spreadsheet Analysis Agent

Spreadsheet agents read, analyze, and generate reports from Excel and CSV files. The key challenge is understanding the structure of arbitrary spreadsheets — column meanings, data types, relationships between sheets, and implicit business rules.

import pandas as pd
from langchain_openai import ChatOpenAI

class SpreadsheetAgent:
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.loaded_data: dict[str, pd.DataFrame] = {}

    def load_file(self, path: str) -> dict[str, pd.DataFrame]:
        if path.endswith(".csv"):
            df = pd.read_csv(path)
            self.loaded_data["Sheet1"] = df
        else:
            xls = pd.ExcelFile(path)
            for sheet in xls.sheet_names:
                self.loaded_data[sheet] = pd.read_excel(xls, sheet)
        return self.loaded_data

    def get_schema(self) -> str:
        schema_parts = []
        for name, df in self.loaded_data.items():
            schema_parts.append(f"Sheet: {name}")
            schema_parts.append(f"  Rows: {len(df)}")
            schema_parts.append(f"  Columns:")
            for col in df.columns:
                dtype = str(df[col].dtype)
                sample = str(df[col].dropna().iloc[0]) if len(df[col].dropna()) > 0 else "N/A"
                nulls = df[col].isnull().sum()
                schema_parts.append(
                    f"    - {col} ({dtype}, nulls: {nulls}, sample: {sample})"
                )
        return "
".join(schema_parts)

    async def analyze(self, question: str) -> str:
        schema = self.get_schema()
        prompt = f"""You are a data analyst. Given this spreadsheet schema,
write Python pandas code to answer the question.

Schema:
{schema}

Question: {question}

Return ONLY executable Python code that uses the variable 'df'
(for single sheet) or 'sheets' dict (for multi-sheet).
Print the result."""

        response = await self.llm.ainvoke(prompt)
        code = self._extract_code(response.content)

        # Restricted exec: stripping __builtins__ limits accidents but is
        # not a true sandbox; run untrusted code in a subprocess or container
        local_vars = {"pd": pd}
        if len(self.loaded_data) == 1:
            local_vars["df"] = list(self.loaded_data.values())[0]
        else:
            local_vars["sheets"] = self.loaded_data

        import io, contextlib
        output = io.StringIO()
        with contextlib.redirect_stdout(output):
            # print must be passed back in because __builtins__ is emptied
            exec(code, {"__builtins__": {"print": print}}, local_vars)
        return output.getvalue()

    def _extract_code(self, text: str) -> str:
        # Model output usually wraps code in a ```python fence
        if "```" in text:
            blocks = text.split("```")
            if len(blocks) >= 3:
                code_block = blocks[1]
                if code_block.startswith("python"):
                    code_block = code_block[len("python"):]
                return code_block.strip()
        return text.strip()
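The execution step in `analyze` is worth isolating, because emptying `__builtins__` also removes `print` unless you pass it back in. A standalone sketch of the same pattern (the sample frame is illustrative), which you can test without any LLM call:

```python
import io
import contextlib
import pandas as pd

def run_generated_code(code: str, df: pd.DataFrame) -> str:
    """Execute model-generated pandas code against `df` and capture
    anything it prints. Stripping __builtins__ limits accidents but
    is NOT a security boundary; untrusted code belongs in a
    subprocess or container."""
    local_vars = {"pd": pd, "df": df}
    output = io.StringIO()
    with contextlib.redirect_stdout(output):
        # expose print explicitly since __builtins__ is emptied
        exec(code, {"__builtins__": {"print": print}}, local_vars)
    return output.getvalue()

df = pd.DataFrame({"region": ["NA", "EU", "NA"],
                   "revenue": [100, 80, 120]})
print(run_generated_code("print(df['revenue'].sum())", df))
```

Keeping this as a pure function also makes it easy to swap the in-process `exec` for a remote sandbox later without touching the agent logic.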

Orchestrating the Full Pipeline

In production, these processors work together. An email arrives with a PDF attachment. The email agent classifies it as an invoice, the PDF processor extracts structured data, the spreadsheet agent updates the accounts payable tracker, and the system sends a notification to the approver.

class DocumentPipelineAgent:
    def __init__(self):
        self.email_agent = EmailAgent()
        self.pdf_processor = PDFProcessor()
        self.spreadsheet_agent = SpreadsheetAgent()

    async def process_email_with_attachments(
        self, subject: str, body: str, sender: str,
        attachments: list[tuple[str, bytes]]
    ) -> dict:
        # Step 1: Classify the email
        classification = await self.email_agent.classify(
            subject, body, sender
        )

        results = {"classification": classification, "extractions": []}

        # Step 2: Process attachments based on classification
        for filename, content in attachments:
            if filename.endswith(".pdf"):
                if classification.category == EmailCategory.INVOICE:
                    invoice = await self.pdf_processor.extract_invoice(
                        self._save_temp(filename, content)
                    )
                    results["extractions"].append({
                        "type": "invoice",
                        "data": invoice.model_dump()
                    })

            elif filename.endswith((".xlsx", ".csv")):
                path = self._save_temp(filename, content)
                self.spreadsheet_agent.load_file(path)
                summary = await self.spreadsheet_agent.analyze(
                    "Provide a summary of key metrics"
                )
                results["extractions"].append({
                    "type": "spreadsheet_summary",
                    "data": summary
                })

        return results

    def _save_temp(self, filename: str, content: bytes) -> str:
        # Persist the attachment so the file-based processors can open it
        import os, tempfile
        path = os.path.join(tempfile.mkdtemp(), filename)
        with open(path, "wb") as f:
            f.write(content)
        return path

FAQ

How do I handle PDFs with complex layouts like multi-column text or nested tables?

For complex layouts, use a layout analysis model like LayoutLM or Docling before text extraction. These models detect regions (headers, paragraphs, tables, figures) and their reading order. PyMuPDF's block-level extraction preserves some layout, but for truly complex documents (academic papers, financial statements with nested tables), you need a dedicated layout parser. The LLM extraction step then works with properly ordered text rather than a jumbled mix of columns.

What is the accuracy of LLM-based document extraction compared to template-based approaches?

Template-based extraction (defining exact regions for each field) achieves 98-99% accuracy on documents that match the template. LLM-based extraction typically achieves 92-96% accuracy but works across format variations without template creation. The recommended production approach is hybrid: use templates for high-volume, standardized documents (like invoices from your top 10 vendors) and LLM extraction for everything else. Always include a confidence score and route low-confidence extractions to human review.
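The confidence gate mentioned above can be a simple threshold check; the field layout and the 0.85 cutoff below are assumptions you would calibrate against your own review data:

```python
def route_extraction(fields: dict[str, tuple[object, float]],
                     threshold: float = 0.85) -> str:
    """Route a document based on per-field extraction confidence.

    `fields` maps field name -> (value, confidence in [0, 1]).
    Returns "auto" when every field clears the threshold,
    otherwise "human_review"."""
    low = [name for name, (_, conf) in fields.items()
           if conf < threshold]
    return "human_review" if low else "auto"
```

In practice you would also log which fields fell below the threshold, so reviewers land directly on the uncertain values.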

How should I handle sensitive data in document processing pipelines?

Never send unredacted documents to external LLM APIs if they contain PII, PHI, or financial account numbers. Use on-premise models (Llama, Mistral) or Azure OpenAI with data processing agreements for sensitive documents. Implement a pre-processing step that detects and masks sensitive fields before LLM processing, then re-injects the original values into the structured output. Log extracted data to encrypted storage only and implement access controls on the extraction results.
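The mask-then-reinject step can be sketched with regex-based detection; production systems typically use a dedicated PII detector (e.g. Microsoft Presidio), and the patterns and placeholder format here are illustrative only:

```python
import re

# Illustrative detectors; real pipelines cover many more PII types
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def mask_pii(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected PII with numbered placeholders before the text
    leaves for an external LLM; return the masked text plus a mapping
    for re-injecting the original values afterwards."""
    mapping: dict[str, str] = {}
    counter = 0

    def make_sub(kind: str):
        def sub(match: re.Match) -> str:
            nonlocal counter
            token = f"<{kind}_{counter}>"
            counter += 1
            mapping[token] = match.group(0)
            return token
        return sub

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(make_sub(kind), text)
    return text, mapping

def unmask(text: str, mapping: dict[str, str]) -> str:
    """Re-inject original values into the LLM's structured output."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

The mapping never leaves your infrastructure; only the placeholder text is sent to the model, and `unmask` restores the real values in the extraction result.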


#DocumentProcessing #PDFExtraction #EmailAutomation #SpreadsheetAI #Automation #AIAgents #OCR #DataExtraction


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
