AI Agent for Automated Data Entry: Reading Source Documents and Filling Web Forms

Build an AI agent that reads source documents using OCR and vision models, maps extracted data to web form fields, fills forms automatically, and validates entries with intelligent error correction.

The Data Entry Problem

Data entry remains one of the most labor-intensive tasks in business operations. A human reads a source document — an invoice, insurance claim, patient intake form, or purchase order — then manually types the extracted information into a web application. This process is slow, error-prone, and soul-crushing for the people doing it.

An AI-powered data entry agent automates the complete pipeline: reading the source document, extracting structured data, mapping fields to the target web form, filling in values, and validating the result. The key insight is that modern vision models can read documents as well as or better than traditional OCR, and LLMs can reason about how extracted data maps to form fields.

Document Reading with Vision Models

The first step is extracting structured data from source documents. Vision-capable LLMs like GPT-4o can read invoices, receipts, and forms directly from images, handling messy layouts, handwriting, and multi-column formats that trip up traditional OCR.

import base64
import json
from openai import AsyncOpenAI
from pathlib import Path

class DocumentReader:
    def __init__(self, client: AsyncOpenAI):
        self.client = client

    async def extract_fields(self, document_path: str,
                              field_schema: dict) -> dict:
        """Extract structured data from a document image."""
        image_b64 = self._encode_image(document_path)

        schema_description = json.dumps(field_schema, indent=2)

        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    "You are a document data extraction specialist. "
                    "Extract the requested fields from the document "
                    "image. Return a JSON object matching the schema. "
                    "If a field is not visible, set it to null."
                )},
                {"role": "user", "content": [
                    {"type": "text", "text": (
                        f"Extract these fields:\n{schema_description}"
                    )},
                    {"type": "image_url", "image_url": {
                        "url": f"data:image/png;base64,{image_b64}"
                    }},
                ]},
            ],
            response_format={"type": "json_object"},
            temperature=0,
        )

        return json.loads(response.choices[0].message.content)

    def _encode_image(self, path: str) -> str:
        return base64.b64encode(Path(path).read_bytes()).decode()


# Define the schema for an invoice
invoice_schema = {
    "vendor_name": "string",
    "invoice_number": "string",
    "invoice_date": "string (YYYY-MM-DD)",
    "due_date": "string (YYYY-MM-DD)",
    "total_amount": "number",
    "currency": "string",
    "line_items": [
        {
            "description": "string",
            "quantity": "number",
            "unit_price": "number",
            "total": "number",
        }
    ],
}
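
Extraction output is worth checking before it ever reaches a form. The sketch below is an addition to the article's pipeline, not part of it: a small validator (using a trimmed copy of the schema) that reports keys the model never returned and fields it explicitly marked null, so they can be routed to human review.

```python
def validate_extraction(extracted: dict, schema: dict) -> tuple[list, list]:
    """Return (missing_keys, null_fields) for an extraction result."""
    missing = [k for k in schema if k not in extracted]
    nulls = [k for k, v in extracted.items() if v is None]
    return missing, nulls

# Trimmed copy of the invoice schema, for illustration only.
schema = {
    "vendor_name": "string",
    "invoice_number": "string",
    "total_amount": "number",
}
result = {"vendor_name": "Acme Corp", "invoice_number": None}
missing, nulls = validate_extraction(result, schema)
print(missing)  # keys the model never returned
print(nulls)    # keys it saw but could not read
```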

Form Detection and Field Mapping

Once you have extracted data, the agent needs to understand the target web form. Rather than hard-coding selectors for each form, the agent inspects the form structure and uses an LLM to map extracted fields to form inputs.

from playwright.async_api import Page

class FormAnalyzer:
    def __init__(self, client: AsyncOpenAI):
        self.client = client

    async def detect_form_fields(self, page: Page) -> list[dict]:
        """Detect all fillable fields in a web form."""
        fields = await page.evaluate("""
            () => {
                const inputs = document.querySelectorAll(
                    'input, select, textarea'
                );
                return Array.from(inputs).map(el => ({
                    tag: el.tagName.toLowerCase(),
                    type: el.type || 'text',
                    name: el.name,
                    id: el.id,
                    label: (() => {
                        const label = document.querySelector(
                            'label[for="' + el.id + '"]'
                        );
                        return label ? label.textContent.trim() : '';
                    })(),
                    placeholder: el.placeholder || '',
                    required: el.required,
                    options: el.tagName === 'SELECT'
                        ? Array.from(el.options).map(
                            o => ({value: o.value, text: o.text})
                          )
                        : [],
                    selector: el.id
                        ? '#' + el.id
                        : '[name="' + el.name + '"]',
                }));
            }
        """)
        return fields

    async def map_data_to_fields(self, extracted_data: dict,
                                  form_fields: list[dict]) -> list[dict]:
        """Use LLM to map extracted data to form fields."""
        response = await self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": (
                    "Map extracted document data to web form fields. "
                    "Return a JSON object with a single key 'mappings' "
                    "whose value is an array of objects with keys: "
                    "selector, value, action (fill/select/check)."
                )},
                {"role": "user", "content": (
                    f"Extracted data:\n{json.dumps(extracted_data)}\n\n"
                    f"Form fields:\n{json.dumps(form_fields)}"
                )},
            ],
            response_format={"type": "json_object"},
            temperature=0,
        )
        result = json.loads(response.choices[0].message.content)
        return result.get("mappings", [])
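
For reference, a successful mapping call produces a list like the one below (selectors and values are illustrative, not taken from a real form). FormFiller dispatches on the `action` key, so every entry needs all three fields:

```python
mappings = [
    {"selector": "#vendor-name", "value": "Acme Corp", "action": "fill"},
    {"selector": "#currency", "value": "USD", "action": "select"},
    {"selector": "#terms-accepted", "value": "true", "action": "check"},
]
```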

Form Filling with Validation

The form filler executes the mappings, filling each field and then validating the result by comparing what was entered against what was expected.
class FormFiller:
    def __init__(self, page: Page):
        self.page = page
        self.fill_log: list[dict] = []

    async def fill_form(self, mappings: list[dict]) -> list[dict]:
        """Fill form fields based on LLM-generated mappings."""
        results = []

        for mapping in mappings:
            selector = mapping["selector"]
            value = str(mapping["value"])
            action = mapping.get("action", "fill")

            try:
                if action == "fill":
                    await self.page.fill(selector, value)
                elif action == "select":
                    await self.page.select_option(selector, value)
                elif action == "check":
                    if value.lower() in ("true", "yes", "1"):
                        await self.page.check(selector)

                # Verify the value was entered correctly
                actual = await self._get_field_value(selector)
                match = self._values_match(value, actual)

                results.append({
                    "selector": selector,
                    "expected": value,
                    "actual": actual,
                    "success": match,
                })

            except Exception as e:
                results.append({
                    "selector": selector,
                    "expected": value,
                    "actual": None,
                    "success": False,
                    "error": str(e),
                })

        self.fill_log = results
        return results

    async def _get_field_value(self, selector: str) -> str:
        return await self.page.input_value(selector)

    def _values_match(self, expected: str, actual: str) -> bool:
        """Flexible comparison that handles formatting differences."""
        def clean(s: str) -> str:
            return s.strip().lower().replace(",", "")
        return clean(expected) == clean(actual)
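
The normalization in `_values_match` is deliberately loose. A standalone replica of its logic shows what it does and does not tolerate:

```python
def values_match(expected: str, actual: str) -> bool:
    """Same normalization as FormFiller._values_match."""
    def clean(s: str) -> str:
        return s.strip().lower().replace(",", "")
    return clean(expected) == clean(actual)

print(values_match("1,250.00", "1250.00"))       # True: commas stripped
print(values_match("Acme Corp", " acme corp "))  # True: case and whitespace
print(values_match("1250.00", "1250"))           # False: digits differ
```

Formatting noise like thousands separators passes, but genuinely different digits still fail validation and flow into the error correction pipeline.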

Error Correction Pipeline

When validation detects a mismatch — for example, a date entered in the wrong format or a select field that does not have the expected option — the error correction pipeline re-analyzes the field and attempts an alternative approach.

class ErrorCorrector:
    def __init__(self, client: AsyncOpenAI, page: Page):
        self.client = client
        self.page = page

    async def fix_failed_fields(self, fill_results: list[dict],
                                 form_fields: list[dict]) -> int:
        """Attempt to fix fields that failed validation."""
        failed = [r for r in fill_results if not r["success"]]
        fixed_count = 0

        for failure in failed:
            field_info = next(
                (f for f in form_fields
                 if f["selector"] == failure["selector"]),
                None,
            )
            if not field_info:
                continue

            response = await self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": (
                        "A form field failed to accept a value. "
                        "Respond with only the corrected value to "
                        "enter, with no explanation or quotes."
                    )},
                    {"role": "user", "content": (
                        f"Field: {json.dumps(field_info)}\n"
                        f"Attempted value: {failure['expected']}\n"
                        f"Error: {failure.get('error', 'value mismatch')}\n"
                        f"Actual value in field: {failure['actual']}"
                    )},
                ],
                temperature=0,
            )

            new_value = response.choices[0].message.content.strip()
            try:
                await self.page.fill(failure["selector"], new_value)
                actual = await self.page.input_value(failure["selector"])
                if actual.strip() == new_value:
                    fixed_count += 1
            except Exception:
                continue

        return fixed_count

Full Pipeline Orchestration

The orchestrator ties document reading, form analysis, filling, and correction into a single workflow.

from playwright.async_api import async_playwright

async def process_document_to_form(document_path: str,
                                    form_url: str,
                                    field_schema: dict):
    """Complete pipeline: document to filled form."""
    client = AsyncOpenAI()
    reader = DocumentReader(client)

    # Step 1: Extract data from document
    extracted = await reader.extract_fields(document_path, field_schema)
    print(f"Extracted {len(extracted)} fields from document")

    # Step 2: Open form and analyze fields
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(form_url)

        analyzer = FormAnalyzer(client)
        form_fields = await analyzer.detect_form_fields(page)

        # Step 3: Map and fill
        mappings = await analyzer.map_data_to_fields(
            extracted, form_fields
        )
        filler = FormFiller(page)
        results = await filler.fill_form(mappings)

        # Step 4: Correct errors
        corrector = ErrorCorrector(client, page)
        fixes = await corrector.fix_failed_fields(results, form_fields)

        filled = sum(1 for r in results if r["success"])
        success_rate = filled / len(results) * 100 if results else 0.0
        print(f"Fill accuracy: {success_rate:.1f}%, fixes: {fixes}")

        await browser.close()
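
A single accuracy number is rarely enough in practice: a reviewer needs to know which selectors failed. A small report helper (an assumption layered on the pipeline above, not part of it) could summarize the fill log:

```python
def fill_report(results: list[dict]) -> dict:
    """Summarize fill results for human review."""
    failed = [r["selector"] for r in results if not r["success"]]
    total = len(results)
    succeeded = total - len(failed)
    return {
        "total": total,
        "succeeded": succeeded,
        "accuracy_pct": round(succeeded / total * 100, 1) if total else 0.0,
        "needs_review": failed,  # selectors a human should re-check
    }

results = [
    {"selector": "#vendor", "expected": "Acme", "actual": "Acme",
     "success": True},
    {"selector": "#due-date", "expected": "2024-07-01", "actual": "",
     "success": False},
]
print(fill_report(results))
```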

FAQ

How does vision-based extraction compare to traditional OCR like Tesseract?

Vision LLMs like GPT-4o significantly outperform Tesseract on complex documents with mixed layouts, tables, handwriting, and poor scan quality. Tesseract is faster and cheaper for simple, clean text extraction. For production systems, use Tesseract for bulk text extraction and fall back to vision models for complex or ambiguous documents.
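
A sketch of that routing decision, with assumed inputs: the confidence score would come from a cheap first pass (for example, word confidences from pytesseract's `image_to_data`), and the layout flags from a document classifier or simple heuristics. The threshold is illustrative, not a recommendation.

```python
def choose_extractor(mean_ocr_confidence: float,
                     has_handwriting: bool,
                     has_complex_layout: bool,
                     threshold: float = 85.0) -> str:
    """Route a page to cheap OCR or a vision model."""
    # Handwriting and complex layouts are where traditional OCR
    # breaks down, so send those straight to the vision model.
    if has_handwriting or has_complex_layout:
        return "vision"
    return "tesseract" if mean_ocr_confidence >= threshold else "vision"

print(choose_extractor(92.0, False, False))  # clean scan
print(choose_extractor(92.0, True, False))   # handwriting present
print(choose_extractor(60.0, False, False))  # low OCR confidence
```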

How do I handle multi-page documents like long invoices?

Split the document into individual page images and process each page through the vision model separately. Then use an LLM to merge the results, handling cases where tables span across pages or header information appears only on the first page.
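
The merge step can be sketched deterministically for the invoice schema above, assuming header fields appear only on page one and line items on every page (the page dictionaries here are illustrative):

```python
def merge_pages(pages: list[dict]) -> dict:
    """Merge per-page extraction results into one document record."""
    merged: dict = {}
    line_items: list = []
    for page in pages:
        # Line items accumulate across all pages.
        line_items.extend(page.get("line_items") or [])
        for key, value in page.items():
            if key == "line_items":
                continue
            # Keep the first non-null value seen; header data is
            # usually only on page one, totals often on the last.
            if merged.get(key) is None and value is not None:
                merged[key] = value
    merged["line_items"] = line_items
    return merged

page1 = {"vendor_name": "Acme Corp", "total_amount": None,
         "line_items": [{"description": "Widget", "total": 100}]}
page2 = {"vendor_name": None, "total_amount": 250.0,
         "line_items": [{"description": "Gadget", "total": 150}]}
print(merge_pages([page1, page2]))
```

An LLM merge pass is still useful for the messy cases this cannot handle, such as a table row split across a page break.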

What accuracy should I expect from automated data entry?

With GPT-4o vision and a well-designed validation pipeline, expect 90-95% field-level accuracy on clean documents. The error correction pipeline typically recovers another 2-3%. Always include a human review step for high-value transactions and flag any fields where the model reports low confidence.


#DataEntry #OCR #FormAutomation #VisionAI #DocumentProcessing #AIAgents #WebAutomation #IntelligentAutomation
