AI Agent for Tax Preparation: Document Collection, Categorization, and Form Filling

Learn to build an AI agent that collects tax documents, classifies them by type, extracts key financial data, and maps values to the correct tax form fields.

Why Tax Preparation Is Ripe for AI Agents

Tax preparation follows a predictable but tedious workflow: gather documents, classify them, extract data, apply tax rules, and fill in forms. Each step is largely rule-driven, which makes it a good fit for an AI agent. The hard parts are the variety of document formats (W-2s, 1099s, receipts, brokerage statements) and the complexity of the tax code. An agent can handle the mechanical work while flagging edge cases for human review.

Agent Architecture

The tax prep agent has four stages:

  1. Document Ingestion — accept files and extract text with OCR
  2. Document Classification — identify the type of each document
  3. Data Extraction — pull key financial figures from each document
  4. Form Mapping — apply tax rules and map values to form fields

Step 1: Document Ingestion and OCR

Many tax documents arrive as scanned PDFs or photos. We read the embedded text layer from PDFs where one exists and fall back to OCR for images and scanned pages.

import pytesseract
from PIL import Image
from pathlib import Path
import pdfplumber


def ingest_document(file_path: str) -> str:
    """Extract text from various document formats."""
    path = Path(file_path)
    suffix = path.suffix.lower()

    if suffix in (".png", ".jpg", ".jpeg", ".tiff"):
        image = Image.open(path)
        return pytesseract.image_to_string(image)

    elif suffix == ".pdf":
        with pdfplumber.open(path) as pdf:
            text = ""
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"
                else:
                    # Fallback to OCR for scanned pages
                    img = page.to_image(resolution=300)
                    text += pytesseract.image_to_string(
                        img.original
                    ) + "\n"
            return text

    elif suffix == ".txt":
        return path.read_text()

    raise ValueError(f"Unsupported format: {suffix}")
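
Before calling ingest_document on a whole folder of uploads, it can help to separate the files it supports from those it would reject. The following is a minimal sketch under stated assumptions: collect_documents and SUPPORTED_SUFFIXES are hypothetical names that are not part of the original pipeline, and the suffix set simply mirrors the branches in ingest_document above.

```python
from pathlib import Path

# Suffixes that ingest_document accepts (mirrors its if/elif branches).
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".tiff", ".pdf", ".txt"}


def collect_documents(folder: str) -> tuple[list[str], list[str]]:
    """Split the files in a folder into supported and skipped paths."""
    supported: list[str] = []
    skipped: list[str] = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # ignore subdirectories
        if path.suffix.lower() in SUPPORTED_SUFFIXES:
            supported.append(str(path))
        else:
            skipped.append(str(path))
    return supported, skipped
```

The skipped list can be shown to the user so that an unsupported upload does not abort the run with a ValueError.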

Step 2: Document Classification

The agent classifies each document into tax form categories.

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()


class DocumentClassification(BaseModel):
    document_type: str  # "W-2", "1099-INT", "1099-DIV", etc.
    tax_year: int
    issuer: str
    confidence: float
    recipient_name: str


DOCUMENT_TYPES = [
    "W-2 (Wage and Tax Statement)",
    "1099-INT (Interest Income)",
    "1099-DIV (Dividends and Distributions)",
    "1099-B (Broker Transactions)",
    "1099-MISC (Miscellaneous Income)",
    "1099-NEC (Nonemployee Compensation)",
    "1098 (Mortgage Interest)",
    "1098-T (Tuition Statement)",
    "Receipt (Deductible Expense)",
    "K-1 (Partner/Shareholder Income)",
    "Other / Unknown",
]


def classify_document(text: str) -> DocumentClassification:
    """Classify a tax document by type."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify this tax document. Identify the form type, "
                    "tax year, issuer, and recipient.\n\n"
                    f"Valid types: {', '.join(DOCUMENT_TYPES)}"
                ),
            },
            {"role": "user", "content": text[:3000]},
        ],
        response_format=DocumentClassification,
    )
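    result = response.choices[0].message.parsed
    # The prompt lists types as "W-2 (Wage and Tax Statement)". Keep only the
    # short code so it matches the keys used in the later steps ("W-2").
    result.document_type = result.document_type.split(" (")[0].strip()
    return result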
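
The classification includes a confidence field, which gives a natural hook for the human review mentioned earlier. Here is a minimal sketch of that routing, assuming a hypothetical cutoff of 0.8 and a small stand-in dataclass so the example runs on its own. The confidence value is self-reported by the model and is not calibrated, so treat it as a rough signal rather than a probability.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    """Stand-in for two fields of DocumentClassification."""
    document_type: str
    confidence: float


REVIEW_THRESHOLD = 0.8  # hypothetical cutoff; tune it on labeled documents


def needs_human_review(
    c: Classification, threshold: float = REVIEW_THRESHOLD
) -> bool:
    """Return True when a classification should not be trusted automatically."""
    if c.document_type.startswith("Other"):
        return True  # unknown document types always go to a person
    return c.confidence < threshold
```

Documents that fail this check can be queued for manual labeling instead of being passed on to extraction.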

Step 3: Data Extraction by Document Type

Each document type has specific fields to extract. We use type-specific schemas.

class W2Data(BaseModel):
    employer_name: str
    employer_ein: str
    wages: float  # Box 1
    federal_tax_withheld: float  # Box 2
    social_security_wages: float  # Box 3
    social_security_tax: float  # Box 4
    medicare_wages: float  # Box 5
    medicare_tax: float  # Box 6
    state: str
    state_wages: float  # Box 16
    state_tax_withheld: float  # Box 17


class Form1099INT(BaseModel):
    payer_name: str
    interest_income: float  # Box 1
    early_withdrawal_penalty: float  # Box 2
    us_savings_bond_interest: float  # Box 3
    federal_tax_withheld: float  # Box 4


EXTRACTION_SCHEMAS = {
    "W-2": W2Data,
    "1099-INT": Form1099INT,
    # Add more schemas for each document type
}


def extract_data(text: str, doc_type: str) -> BaseModel:
    """Extract structured data based on document type."""
    schema = EXTRACTION_SCHEMAS.get(doc_type)
    if not schema:
        raise ValueError(f"No extraction schema for: {doc_type}")

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    f"Extract all fields for a {doc_type} form. "
                    "Use 0.0 for any field not found in the document."
                ),
            },
            {"role": "user", "content": text},
        ],
        response_format=schema,
    )
    return response.choices[0].message.parsed
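
Because the extraction prompt substitutes 0.0 for anything the model cannot find, a missing value looks the same as a true zero. A few cheap sanity checks after extraction can catch the most obvious problems. The sketch below is an assumption-laden example, not part of the original pipeline: W2Fields is a hypothetical stand-in for a subset of W2Data, and validate_w2 is an illustrative helper.

```python
import re
from dataclasses import dataclass

# EINs are nine digits, conventionally written NN-NNNNNNN.
EIN_PATTERN = re.compile(r"^\d{2}-\d{7}$")


@dataclass
class W2Fields:
    """Hypothetical stand-in for a few W2Data fields, so this runs alone."""
    employer_ein: str
    wages: float
    federal_tax_withheld: float


def validate_w2(w2: W2Fields) -> list[str]:
    """Return human-readable problems; an empty list means no red flags."""
    problems = []
    if not EIN_PATTERN.match(w2.employer_ein):
        problems.append(f"EIN not in NN-NNNNNNN format: {w2.employer_ein!r}")
    if w2.wages < 0 or w2.federal_tax_withheld < 0:
        problems.append("Negative amount extracted")
    if w2.wages == 0.0:
        # The extraction prompt substitutes 0.0 for missing fields.
        problems.append("Wages are 0.0; Box 1 may not have been found")
    if w2.federal_tax_withheld > w2.wages:
        problems.append("Federal withholding exceeds wages")
    return problems
```

Any document that returns a non-empty list is a candidate for human review before its values reach the return.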

Step 4: Tax Rule Application and Form Mapping

After extraction, the agent applies tax rules to map values onto the correct lines of the tax return.

from dataclasses import dataclass, field


@dataclass
class TaxFormLine:
    form: str  # e.g., "1040"
    line: str  # e.g., "1a"
    description: str
    value: float = 0.0


@dataclass
class TaxReturn:
    tax_year: int
    filing_status: str
    lines: dict[str, TaxFormLine] = field(default_factory=dict)

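    def add_to_line(self, line_key: str, amount: float):
        if line_key not in self.lines:
            # Fail loudly: silently dropping an amount would misstate the return
            raise KeyError(f"Unknown form line: {line_key}")
        self.lines[line_key].value += amount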

    def get_line(self, line_key: str) -> float:
        return self.lines.get(line_key, TaxFormLine("", "", "")).value


def build_1040(extracted_docs: list[dict]) -> TaxReturn:
    """Map extracted document data to Form 1040 lines."""
    tax_return = TaxReturn(
        tax_year=2025,
        filing_status="single",
        lines={
            "1a": TaxFormLine("1040", "1a", "Wages", 0.0),
            "2b": TaxFormLine("1040", "2b", "Taxable Interest", 0.0),
            "3b": TaxFormLine("1040", "3b", "Ordinary Dividends", 0.0),
            "25a": TaxFormLine("1040", "25a", "W-2 Withholding", 0.0),
            "25b": TaxFormLine("1040", "25b", "1099 Withholding", 0.0),
        },
    )

    for doc in extracted_docs:
        doc_type = doc["type"]
        data = doc["data"]

        if doc_type == "W-2":
            tax_return.add_to_line("1a", data.wages)
            tax_return.add_to_line("25a", data.federal_tax_withheld)

        elif doc_type == "1099-INT":
            tax_return.add_to_line("2b", data.interest_income)
            # Withholding reported on 1099 forms goes on line 25b, not 25a
            tax_return.add_to_line("25b", data.federal_tax_withheld)

    return tax_return

Full Pipeline

def prepare_taxes(document_paths: list[str]) -> TaxReturn:
    """Run the full tax preparation pipeline."""
    extracted_docs = []

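    for path in document_paths:
        text = ingest_document(path)
        classification = classify_document(text)
        try:
            data = extract_data(text, classification.document_type)
        except ValueError as exc:
            # No extraction schema for this type yet: skip it and report it
            # so one unsupported document does not abort the whole run.
            print(f"Skipping {path}: {exc}")
            continue
        extracted_docs.append({
            "type": classification.document_type,
            "data": data,
            "source": path,
        })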

    return build_1040(extracted_docs)


tax_return = prepare_taxes(["w2_2025.pdf", "1099_int.pdf"])
for key, line in tax_return.lines.items():
    print(f"Line {line.line} ({line.description}): ${line.value:,.2f}")

FAQ

How does the agent handle discrepancies between documents?

The pipeline above does not check for discrepancies on its own; that logic has to be added. A production agent should flag inconsistencies, for example if total W-2 wages across multiple employers seem unreasonably high or if withholding amounts do not match expected rates. It should then generate a discrepancy report for human review rather than making assumptions.
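
One concrete check of this kind compares the payroll taxes on a W-2 with the statutory employee rates: Social Security tax (Box 4) is 6.2% of Social Security wages (Box 3), and Medicare tax (Box 6) is 1.45% of Medicare wages (Box 5). The sketch below assumes a hypothetical helper name and a one-dollar tolerance for rounding. It skips the Medicare check above $200,000 because employers withhold Additional Medicare Tax past that point.

```python
SS_RATE = 0.062                      # employee Social Security rate
MEDICARE_RATE = 0.0145               # employee Medicare rate
ADDL_MEDICARE_THRESHOLD = 200_000.0  # Additional Medicare Tax starts here


def payroll_discrepancies(
    ss_wages: float,
    ss_tax: float,
    medicare_wages: float,
    medicare_tax: float,
    tolerance: float = 1.00,  # hypothetical allowance for rounding
) -> list[str]:
    """Compare withheld payroll taxes with the amounts the rates imply."""
    issues = []
    expected_ss = round(ss_wages * SS_RATE, 2)
    if abs(ss_tax - expected_ss) > tolerance:
        issues.append(
            f"Social Security tax {ss_tax:.2f} differs from expected {expected_ss:.2f}"
        )
    if medicare_wages <= ADDL_MEDICARE_THRESHOLD:
        expected_medicare = round(medicare_wages * MEDICARE_RATE, 2)
        if abs(medicare_tax - expected_medicare) > tolerance:
            issues.append(
                f"Medicare tax {medicare_tax:.2f} differs from expected {expected_medicare:.2f}"
            )
    return issues
```

A mismatch usually points to an OCR or extraction error, such as a dropped digit, so the returned messages belong in the discrepancy report rather than in an automatic correction.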

Can this approach handle business tax returns (Schedule C, partnerships)?

Yes, but business returns are more complex. You would extend the extraction schemas for Schedule C, K-1 forms, and depreciation schedules. The tax rule engine needs additional logic for business deductions, self-employment tax, and estimated tax payments.

What about state tax returns?

State returns require state-specific rules. The agent can be extended with a state module that takes the federal return as input, applies state-specific adjustments (state-specific deductions, different tax brackets), and generates the appropriate state form. Each state would have its own rule configuration.


CallSphere Team
