AI Agent for Tax Preparation: Document Collection, Categorization, and Form Filling
Learn to build an AI agent that collects tax documents, classifies them by type, extracts key financial data, and maps values to the correct tax form fields.
Why Tax Preparation Is Ripe for AI Agents
Tax preparation follows a predictable but tedious workflow: gather documents, classify them, extract data, apply tax rules, and fill forms. Most steps are mechanical and rule-driven, which makes them well suited to an AI agent. The difficulty lies in the variety of document formats (W-2s, 1099s, receipts, brokerage statements) and the complexity of the tax code. An agent can handle the mechanical work while flagging edge cases for human review.
Agent Architecture
The tax prep agent has four stages:
- Document Ingestion — accept files and extract text with OCR
- Document Classification — identify the type of each document
- Data Extraction — pull key financial figures from each document
- Form Mapping — apply tax rules and map values to form fields
Step 1: Document Ingestion and OCR
Many tax documents arrive as scanned PDFs or photos. We read embedded text where a PDF page has it and fall back to OCR for images and scanned pages.
import pytesseract
from PIL import Image
from pathlib import Path
import pdfplumber


def ingest_document(file_path: str) -> str:
    """Extract text from various document formats."""
    path = Path(file_path)
    suffix = path.suffix.lower()

    if suffix in (".png", ".jpg", ".jpeg", ".tiff"):
        image = Image.open(path)
        return pytesseract.image_to_string(image)

    elif suffix == ".pdf":
        with pdfplumber.open(path) as pdf:
            text = ""
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"
                else:
                    # Fallback to OCR for scanned pages
                    img = page.to_image(resolution=300)
                    text += pytesseract.image_to_string(
                        img.original
                    ) + "\n"
            return text

    elif suffix == ".txt":
        return path.read_text()

    raise ValueError(f"Unsupported format: {suffix}")
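OCR on a poor scan can return text that is mostly noise, so it can help to sanity-check the result before passing it to the classifier. The sketch below is one possible heuristic, not part of the pipeline above: the helper names `text_quality` and `needs_manual_review` and both thresholds are assumptions chosen for illustration.

```python
def text_quality(text: str) -> float:
    """Return the fraction of non-whitespace characters that are letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)


def needs_manual_review(text: str, min_chars: int = 50, min_quality: float = 0.6) -> bool:
    """Flag text that is too short or too noisy to trust.

    Both thresholds are assumed starting points; tune them on real scans.
    """
    stripped = text.strip()
    return len(stripped) < min_chars or text_quality(stripped) < min_quality
```

A document that fails this check could be routed to a person instead of being classified from unreliable text.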
Step 2: Document Classification
The agent classifies each document into tax form categories.
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()


class DocumentClassification(BaseModel):
    document_type: str  # "W-2", "1099-INT", "1099-DIV", etc.
    tax_year: int
    issuer: str
    confidence: float
    recipient_name: str


DOCUMENT_TYPES = [
    "W-2 (Wage and Tax Statement)",
    "1099-INT (Interest Income)",
    "1099-DIV (Dividends and Distributions)",
    "1099-B (Broker Transactions)",
    "1099-MISC (Miscellaneous Income)",
    "1099-NEC (Nonemployee Compensation)",
    "1098 (Mortgage Interest)",
    "1098-T (Tuition Statement)",
    "Receipt (Deductible Expense)",
    "K-1 (Partner/Shareholder Income)",
    "Other / Unknown",
]


def classify_document(text: str) -> DocumentClassification:
    """Classify a tax document by type."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify this tax document. Identify the form type, "
                    "tax year, issuer, and recipient.\n\n"
                    f"Valid types: {', '.join(DOCUMENT_TYPES)}"
                ),
            },
            {"role": "user", "content": text[:3000]},
        ],
        response_format=DocumentClassification,
    )
    return response.choices[0].message.parsed
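The classification result carries a `confidence` field, which suggests a routing step: send uncertain or unknown documents to a person and let the rest continue to extraction. The following is a minimal sketch under stated assumptions. `Classification` is a stand-in dataclass so the example runs without the OpenAI client, and `REVIEW_THRESHOLD`, `normalize_type`, and `route` are hypothetical names. A model's self-reported confidence is not a calibrated probability, so the cutoff would need tuning against labeled documents.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.8  # assumed cutoff, not a recommended value


@dataclass
class Classification:
    """Stand-in for DocumentClassification with only the fields used here."""
    document_type: str
    confidence: float


def normalize_type(label: str) -> str:
    """Reduce 'W-2 (Wage and Tax Statement)' to the short code 'W-2'."""
    return label.split(" (")[0].strip()


def route(c: Classification) -> str:
    """Return 'review' for uncertain or unknown documents, else 'extract'."""
    doc_type = normalize_type(c.document_type)
    if c.confidence < REVIEW_THRESHOLD or doc_type.startswith("Other"):
        return "review"
    return "extract"
```

Normalizing the label also matters because the prompt lists types with their descriptions, while later steps key on the short code.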
Step 3: Data Extraction by Document Type
Each document type has specific fields to extract. We use type-specific schemas.
class W2Data(BaseModel):
    employer_name: str
    employer_ein: str
    wages: float  # Box 1
    federal_tax_withheld: float  # Box 2
    social_security_wages: float  # Box 3
    social_security_tax: float  # Box 4
    medicare_wages: float  # Box 5
    medicare_tax: float  # Box 6
    state: str
    state_wages: float  # Box 16
    state_tax_withheld: float  # Box 17


class Form1099INT(BaseModel):
    payer_name: str
    interest_income: float  # Box 1
    early_withdrawal_penalty: float  # Box 2
    us_savings_bond_interest: float  # Box 3
    federal_tax_withheld: float  # Box 4


EXTRACTION_SCHEMAS = {
    "W-2": W2Data,
    "1099-INT": Form1099INT,
    # Add more schemas for each document type
}


def extract_data(text: str, doc_type: str) -> BaseModel:
    """Extract structured data based on document type."""
    schema = EXTRACTION_SCHEMAS.get(doc_type)
    if not schema:
        raise ValueError(f"No extraction schema for: {doc_type}")

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    f"Extract all fields for a {doc_type} form. "
                    "Use 0.0 for any field not found in the document."
                ),
            },
            {"role": "user", "content": text},
        ],
        response_format=schema,
    )
    return response.choices[0].message.parsed
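Extracted values come from a language model reading OCR text, so a cheap validation pass before form mapping can catch obvious misreads. The sketch below works on a plain dict of W-2 fields (for example, the dict form of a `W2Data` instance). The function name `validate_w2` and the specific checks are assumptions for illustration, not a complete rule set.

```python
import re

EIN_PATTERN = re.compile(r"^\d{2}-\d{7}$")  # EINs are written as NN-NNNNNNN


def validate_w2(fields: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if not EIN_PATTERN.match(fields.get("employer_ein", "")):
        problems.append("employer_ein is not in NN-NNNNNNN format")
    for name in ("wages", "federal_tax_withheld", "state_wages", "state_tax_withheld"):
        if fields.get(name, 0.0) < 0:
            problems.append(f"{name} is negative")
    # Withholding larger than wages almost always means a misread box
    if fields.get("federal_tax_withheld", 0.0) > fields.get("wages", 0.0):
        problems.append("federal_tax_withheld exceeds wages")
    return problems
```

Because the extraction prompt substitutes 0.0 for fields it cannot find, a zero here may mean "missing" rather than "zero", which is one more reason to surface problems to a reviewer rather than correct them automatically.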
Step 4: Tax Rule Application and Form Mapping
After extraction, the agent applies tax rules to map values onto the correct lines of the tax return.
from dataclasses import dataclass, field


@dataclass
class TaxFormLine:
    form: str  # e.g., "1040"
    line: str  # e.g., "1a"
    description: str
    value: float = 0.0


@dataclass
class TaxReturn:
    tax_year: int
    filing_status: str
    lines: dict[str, TaxFormLine] = field(default_factory=dict)

    def add_to_line(self, line_key: str, amount: float):
        # Fail loudly so an unmapped amount is never silently dropped
        if line_key not in self.lines:
            raise KeyError(f"Unknown form line: {line_key}")
        self.lines[line_key].value += amount

    def get_line(self, line_key: str) -> float:
        return self.lines.get(line_key, TaxFormLine("", "", "")).value


def build_1040(extracted_docs: list[dict]) -> TaxReturn:
    """Map extracted document data to Form 1040 lines."""
    tax_return = TaxReturn(
        tax_year=2025,
        filing_status="single",
        lines={
            "1a": TaxFormLine("1040", "1a", "Wages", 0.0),
            "2b": TaxFormLine("1040", "2b", "Taxable Interest", 0.0),
            "3b": TaxFormLine("1040", "3b", "Ordinary Dividends", 0.0),
            "25a": TaxFormLine("1040", "25a", "W-2 Withholding", 0.0),
            "25b": TaxFormLine("1040", "25b", "1099 Withholding", 0.0),
        },
    )

    for doc in extracted_docs:
        doc_type = doc["type"]
        data = doc["data"]

        if doc_type == "W-2":
            tax_return.add_to_line("1a", data.wages)
            tax_return.add_to_line("25a", data.federal_tax_withheld)
        elif doc_type == "1099-INT":
            tax_return.add_to_line("2b", data.interest_income)
            # Withholding reported on Forms 1099 goes on line 25b, not 25a
            tax_return.add_to_line("25b", data.federal_tax_withheld)

    return tax_return
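As more document types are added, the chain of `if`/`elif` branches can be replaced by a lookup table that pairs each extracted field with its form line. The sketch below shows one way to do that with plain dicts. `LINE_RULES` and `map_to_lines` are hypothetical names, and the table covers only the two document types discussed here. It places withholding from Forms 1099 on line 25b, since Form 1040 reports it separately from W-2 withholding on line 25a.

```python
# (document type, extracted field) -> Form 1040 line key
LINE_RULES = {
    ("W-2", "wages"): "1a",
    ("W-2", "federal_tax_withheld"): "25a",
    ("1099-INT", "interest_income"): "2b",
    ("1099-INT", "federal_tax_withheld"): "25b",
}


def map_to_lines(docs: list[dict]) -> dict[str, float]:
    """Sum extracted values into Form 1040 line totals using LINE_RULES."""
    totals: dict[str, float] = {}
    for doc in docs:
        for field_name, value in doc["data"].items():
            line = LINE_RULES.get((doc["type"], field_name))
            if line is not None:  # fields without a rule are ignored
                totals[line] = totals.get(line, 0.0) + value
    return totals


docs = [
    {"type": "W-2", "data": {"wages": 50000.0, "federal_tax_withheld": 6000.0}},
    {"type": "1099-INT", "data": {"interest_income": 120.5, "federal_tax_withheld": 0.0}},
]
print(map_to_lines(docs))
```

Keeping the rules in data rather than control flow makes it easier to review the mapping against the form instructions and to version it per tax year.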
Full Pipeline
def prepare_taxes(document_paths: list[str]) -> TaxReturn:
    """Run the full tax preparation pipeline."""
    extracted_docs = []

    for path in document_paths:
        text = ingest_document(path)
        classification = classify_document(text)
        # The classifier may return the full label, e.g.
        # "W-2 (Wage and Tax Statement)"; keep only the short code ("W-2")
        # that EXTRACTION_SCHEMAS and build_1040 expect.
        doc_type = classification.document_type.split(" (")[0].strip()
        data = extract_data(text, doc_type)
        extracted_docs.append({
            "type": doc_type,
            "data": data,
            "source": path,
        })

    return build_1040(extracted_docs)


tax_return = prepare_taxes(["w2_2025.pdf", "1099_int.pdf"])
for key, line in tax_return.lines.items():
    print(f"Line {line.line} ({line.description}): ${line.value:,.2f}")
FAQ
How does the agent handle discrepancies between documents?
The pipeline above does not include this check, but a validation step after extraction can flag inconsistencies, for example when total W-2 wages across multiple employers look unreasonably high or when withholding amounts do not match expected rates. The agent then generates a discrepancy report for human review rather than making assumptions.
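One concrete check of "withholding amounts do not match expected rates" uses the payroll taxes on a W-2: the employee Social Security rate is 6.2% and the Medicare rate is 1.45%, with an additional 0.9% withheld on Medicare wages above $200,000. The sketch below compares the extracted boxes against those rates. The function name and the one-dollar tolerance are assumptions, and the check ignores Social Security tips (Box 7), which the `W2Data` schema does not capture.

```python
SS_RATE = 0.062                    # employee Social Security rate
MEDICARE_RATE = 0.0145             # employee Medicare rate
ADDL_MEDICARE_RATE = 0.009         # extra withholding above the threshold
ADDL_MEDICARE_THRESHOLD = 200_000.0


def w2_discrepancies(w2: dict, tolerance: float = 1.00) -> list[str]:
    """Compare withheld payroll taxes with what the statutory rates imply."""
    issues = []

    expected_ss = w2["social_security_wages"] * SS_RATE
    if abs(w2["social_security_tax"] - expected_ss) > tolerance:
        issues.append(
            f"Box 4 is {w2['social_security_tax']:.2f}, expected about {expected_ss:.2f}"
        )

    excess = max(0.0, w2["medicare_wages"] - ADDL_MEDICARE_THRESHOLD)
    expected_medicare = w2["medicare_wages"] * MEDICARE_RATE + excess * ADDL_MEDICARE_RATE
    if abs(w2["medicare_tax"] - expected_medicare) > tolerance:
        issues.append(
            f"Box 6 is {w2['medicare_tax']:.2f}, expected about {expected_medicare:.2f}"
        )

    return issues
```

The returned messages can be collected per document into the discrepancy report, leaving the decision about which value is right to a reviewer.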
Can this approach handle business tax returns (Schedule C, partnerships)?
Yes, but business returns are more complex. You would extend the extraction schemas for Schedule C, K-1 forms, and depreciation schedules. The tax rule engine needs additional logic for business deductions, self-employment tax, and estimated tax payments.
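To give a sense of the extra rule logic, here is a rough sketch of the self-employment tax calculation that Schedule SE performs on Schedule C net profit. It assumes the standard structure: 92.35% of net profit counts as net earnings, no tax is due when net earnings are under $400, 12.4% applies up to the Social Security wage base (reduced by any W-2 Social Security wages), and 2.9% applies to all net earnings. The wage base changes every year, so the hypothetical function below takes it as a required argument rather than hard-coding it, and it leaves out the Additional Medicare Tax.

```python
def self_employment_tax(
    net_profit: float,
    ss_wage_base: float,      # look up the value for the tax year being filed
    w2_ss_wages: float = 0.0, # Social Security wages already taxed via W-2s
) -> float:
    """Approximate self-employment tax for one taxpayer (simplified sketch)."""
    net_earnings = net_profit * 0.9235
    if net_earnings < 400:
        return 0.0

    # Social Security portion is capped by the remaining wage base
    ss_room = max(0.0, ss_wage_base - w2_ss_wages)
    ss_part = min(net_earnings, ss_room) * 0.124

    # Medicare portion has no cap
    medicare_part = net_earnings * 0.029

    return round(ss_part + medicare_part, 2)
```

A real implementation would also produce the deduction for half of this amount and feed both figures into the return, which is the kind of cross-form dependency that makes business returns harder to automate.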
What about state tax returns?
State returns require state-specific rules. The agent can be extended with a state module that takes the federal return as input, applies state-specific adjustments (state-specific deductions, different tax brackets), and generates the appropriate state form. Each state would have its own rule configuration.
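A per-state rule configuration could look something like the sketch below. `StateRules` and `state_tax` are hypothetical, and every number in the usage example is a placeholder rather than a real state's deduction or rate. The function starts from federal adjusted gross income, subtracts federal items the state excludes, applies a state standard deduction, and then runs progressive brackets.

```python
from dataclasses import dataclass, field


@dataclass
class StateRules:
    """Hypothetical per-state configuration."""
    name: str
    standard_deduction: float
    brackets: list[tuple[float, float]]  # (lower bound, marginal rate), ascending
    subtractions: list[str] = field(default_factory=list)  # federal items excluded


def state_tax(federal_agi: float, federal_items: dict[str, float], rules: StateRules) -> float:
    """Apply state adjustments to federal AGI, then progressive brackets."""
    adjusted = federal_agi - sum(federal_items.get(k, 0.0) for k in rules.subtractions)
    taxable = max(0.0, adjusted - rules.standard_deduction)

    tax = 0.0
    for i, (lower, rate) in enumerate(rules.brackets):
        # The next bracket's lower bound is this bracket's upper bound
        upper = rules.brackets[i + 1][0] if i + 1 < len(rules.brackets) else float("inf")
        if taxable > lower:
            tax += (min(taxable, upper) - lower) * rate
    return round(tax, 2)


# Placeholder values for illustration only
example = StateRules(
    name="Example State",
    standard_deduction=5000.0,
    brackets=[(0.0, 0.02), (10000.0, 0.04)],
    subtractions=["us_savings_bond_interest"],
)
print(state_tax(30500.0, {"us_savings_bond_interest": 500.0}, example))
```

The subtraction in the example reflects a real pattern: interest on U.S. savings bonds and Treasury obligations (1099-INT Box 3) is exempt from state and local income tax, so a state module needs access to individual federal items, not only the totals.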
CallSphere Team