
Building a Form Filler Agent with GPT Vision: Understanding and Completing Web Forms

Build an AI agent that uses GPT Vision to detect form fields, understand their purpose, map values to the correct inputs, and verify successful submission — all without relying on CSS selectors.

Why Forms Are Hard for Traditional Automation

Web forms are the most common interaction point for browser automation and, paradoxically, the most fragile. Labels can be associated with their inputs through for attributes, visual proximity, placeholder text, or floating labels that animate on focus. Dropdowns might be native <select> elements, custom React components, or headless UI libraries. Date pickers vary wildly across sites.

GPT Vision cuts through this complexity by analyzing the form the way a human does: reading labels, understanding spatial relationships, and identifying what each field expects.

Detecting Form Structure

The first step is capturing the form and asking GPT-4V to map out its structure.

from pydantic import BaseModel
from openai import OpenAI

class FormField(BaseModel):
    label: str
    field_type: str  # text, email, phone, date, dropdown, checkbox, etc.
    is_required: bool
    x_center: int
    y_center: int
    placeholder: str
    options: list[str]  # for dropdowns/radio groups
    current_value: str

class FormStructure(BaseModel):
    form_title: str
    fields: list[FormField]
    submit_button_label: str
    submit_button_x: int
    submit_button_y: int

client = OpenAI()

def detect_form(screenshot_b64: str) -> FormStructure:
    """Detect form structure from a screenshot."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a form analysis expert. The viewport is "
                    "1280x720 pixels. Identify every form field, its "
                    "label, type, whether it appears required (asterisk "
                    "or 'required' text), its center coordinates, and "
                    "any visible placeholder text or dropdown options. "
                    "Also locate the submit button."
                ),
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Analyze this form."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{screenshot_b64}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        response_format=FormStructure,
    )
    return response.choices[0].message.parsed
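detect_form expects a base64-encoded PNG. A minimal encoding helper is shown below; the helper name encode_png and the Playwright capture call in the comment are illustrative, not part of the original code.

```python
import base64

def encode_png(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as base64 for the image_url payload."""
    return base64.b64encode(png_bytes).decode("ascii")

# With Playwright, the capture step might look like:
#   png = await page.screenshot(type="png")
#   form = detect_form(encode_png(png))
```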

Mapping Data to Fields

Once you know the form structure, you need to map your data to the detected fields. The same model can handle this mapping intelligently, and because it works from the detected field descriptions, this call needs no screenshot at all.

class FieldMapping(BaseModel):
    field_label: str
    value_to_enter: str
    interaction_type: str  # type, select, check, click

class FormFillingPlan(BaseModel):
    mappings: list[FieldMapping]
    unmapped_fields: list[str]  # fields with no matching data
    unused_data: list[str]  # data keys with no matching field

def plan_form_filling(
    form: FormStructure, data: dict[str, str]
) -> FormFillingPlan:
    """Map data values to form fields using GPT-4V."""
    fields_desc = "\n".join(
        f"- {f.label} ({f.field_type})" for f in form.fields
    )
    data_desc = "\n".join(f"- {k}: {v}" for k, v in data.items())

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a data-to-form mapping expert. Match "
                    "each data value to the correct form field based "
                    "on semantic understanding. For example, map "
                    "'email_address' to a field labeled 'Email' or "
                    "'E-mail Address'."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Form fields:\n{fields_desc}\n\n"
                    f"Data to enter:\n{data_desc}\n\n"
                    "Create the mapping."
                ),
            },
        ],
        response_format=FormFillingPlan,
    )
    return response.choices[0].message.parsed
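The executor in the next section matches the plan's field labels to the detected fields by lowercased label. Detected labels often carry required-asterisks or stray whitespace, so a small normalization helper (a hypothetical addition, not in the original code) makes that lookup more forgiving:

```python
import re

def normalize_label(label: str) -> str:
    """Lowercase, strip required-markers (*), and collapse whitespace."""
    cleaned = label.replace("*", " ").strip().lower()
    return re.sub(r"\s+", " ", cleaned).strip()

# Both sides of the lookup would use it:
#   field_lookup = {normalize_label(f.label): f for f in form.fields}
#   field = field_lookup.get(normalize_label(mapping.field_label))
```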

Executing the Form Fill

With the plan in hand, execute each field interaction sequentially.


from playwright.async_api import Page
import asyncio

async def fill_form(
    page: Page, form: FormStructure, plan: FormFillingPlan
) -> None:
    """Execute the form filling plan."""
    field_lookup = {f.label.lower(): f for f in form.fields}

    for mapping in plan.mappings:
        field = field_lookup.get(mapping.field_label.lower())
        if not field:
            print(f"Warning: field '{mapping.field_label}' not found")
            continue

        if mapping.interaction_type == "type":
            # Click the field to focus it
            await page.mouse.click(field.x_center, field.y_center)
            await asyncio.sleep(0.3)
            # Clear any existing value (use Meta+a on macOS)
            await page.keyboard.press("Control+a")
            await page.keyboard.press("Backspace")
            # Type the value
            await page.keyboard.type(mapping.value_to_enter, delay=30)

        elif mapping.interaction_type == "select":
            # Click the dropdown to open it
            await page.mouse.click(field.x_center, field.y_center)
            await asyncio.sleep(0.5)
            # Type to filter options, then press Enter
            await page.keyboard.type(mapping.value_to_enter, delay=50)
            await asyncio.sleep(0.3)
            await page.keyboard.press("Enter")

        elif mapping.interaction_type == "check":
            await page.mouse.click(field.x_center, field.y_center)

        await asyncio.sleep(0.2)

Verifying Submission

After filling and submitting, capture a new screenshot and verify the result.

class SubmissionResult(BaseModel):
    success: bool
    confirmation_message: str
    errors: list[str]

async def submit_and_verify(
    page: Page, form: FormStructure, screenshot_fn
) -> SubmissionResult:
    """Submit the form and verify the result."""
    # Click submit
    await page.mouse.click(
        form.submit_button_x, form.submit_button_y
    )
    await page.wait_for_load_state("networkidle")
    await asyncio.sleep(1)

    # Capture post-submission screenshot
    post_screenshot = await screenshot_fn(page)

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Analyze this screenshot taken after a form "
                    "submission. Determine if the submission was "
                    "successful, extract any confirmation message, "
                    "and list any validation errors shown."
                ),
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Was this form submission successful?",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{post_screenshot}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        response_format=SubmissionResult,
    )
    return response.choices[0].message.parsed
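The pieces compose into a simple loop: detect, plan, fill, submit, and retry while verification reports failure. The skeleton below is a sketch, not the article's implementation; it takes the four stages as injected async callables so the control flow is independent of the browser and API code, and it caps attempts to avoid infinite loops.

```python
from typing import Any, Awaitable, Callable

async def run_form_agent(
    detect: Callable[[], Awaitable[Any]],
    plan: Callable[[Any], Awaitable[Any]],
    fill: Callable[[Any, Any], Awaitable[None]],
    submit: Callable[[Any], Awaitable[Any]],
    max_attempts: int = 3,
) -> Any:
    """Detect -> plan -> fill -> submit, retrying while verification fails."""
    result = None
    for _ in range(max_attempts):
        form = await detect()            # fresh screenshot each attempt
        filling_plan = await plan(form)
        await fill(form, filling_plan)
        result = await submit(form)      # e.g. a SubmissionResult
        if result.success:
            return result
    return result  # last result, even if still failing
```

In practice detect would wrap detect_form, plan would wrap plan_form_filling, and submit would wrap submit_and_verify.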

Handling Edge Cases

Real-world forms present several challenges. Multi-step wizard forms require detecting "Next" buttons and tracking progress across pages. CAPTCHA fields need human escalation. Auto-complete dropdowns require waiting for suggestions to load before selecting. Date pickers often need a click-then-navigate approach through month/year selectors.

Build defensive logic: after each field interaction, optionally re-capture a screenshot and verify that the field now shows the expected value. This catch-and-retry pattern prevents silent failures that would otherwise surface only at submission time.
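The catch-and-retry pattern can be factored into a small wrapper that runs an interaction, then a verification check, and retries a bounded number of times. The callables here are injected stand-ins (hypothetical names), so the sketch works for any interaction, whether the check is a GPT-4V re-read or a DOM query:

```python
from typing import Awaitable, Callable

async def interact_with_retry(
    do: Callable[[], Awaitable[None]],
    verify: Callable[[], Awaitable[bool]],
    attempts: int = 2,
) -> bool:
    """Run `do`, then `verify`; retry until verified or out of attempts."""
    for _ in range(attempts):
        await do()             # e.g. click + type the field value
        if await verify():     # e.g. re-screenshot and check the value
            return True
    return False               # caller decides how to escalate
```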

FAQ

How does the agent handle multi-step forms with "Next" buttons?

Treat each step as a separate form detection cycle. After filling visible fields, detect and click the "Next" button, wait for the new step to load, then re-analyze the screenshot for new fields. Track completed steps to avoid repeating data entry if the page reloads.
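That per-step cycle can be sketched as a bounded loop. As an illustration under the same assumption of injected callables (fill_step standing in for one detect-plan-fill pass, has_next and click_next for the button detection and click), one possible shape is:

```python
from typing import Awaitable, Callable

async def run_wizard(
    fill_step: Callable[[int], Awaitable[None]],
    has_next: Callable[[int], Awaitable[bool]],
    click_next: Callable[[int], Awaitable[None]],
    max_steps: int = 10,
) -> int:
    """Fill each wizard step; return the number of steps completed."""
    for step in range(max_steps):
        await fill_step(step)          # detect + plan + fill this screen
        if not await has_next(step):   # no "Next" button: final step
            return step + 1
        await click_next(step)         # advance, then re-detect next loop
    return max_steps                   # safety cap against endless wizards
```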

What happens when the form has validation errors after submission?

The verification step detects error messages visually. When errors are found, the agent can re-analyze the form screenshot to identify which fields have errors, correct the values, and resubmit. Build a maximum retry count to prevent infinite loops.

Can GPT Vision handle custom-styled form components like date pickers or color selectors?

GPT-4V recognizes most custom components visually, but interacting with them requires multi-step sequences. For a date picker, the agent might need to click the field, detect the calendar popup in a new screenshot, navigate to the correct month, and click the date. Each sub-interaction needs its own screenshot-action cycle.


#FormAutomation #GPTVision #BrowserAgent #WebForms #AIFormFiller #VisualAI #AgenticAI #Python

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
