Building a Form Filler Agent with GPT Vision: Understanding and Completing Web Forms
Build an AI agent that uses GPT Vision to detect form fields, understand their purpose, map values to the correct inputs, and verify successful submission — all without relying on CSS selectors.
Why Forms Are Hard for Traditional Automation
Web forms are the most common interaction point for browser automation and, paradoxically, the most fragile. Labels can be associated with inputs through `for` attributes, visual proximity, placeholder text, or floating labels that animate on focus. Dropdowns might be native `<select>` elements, custom React components, or headless UI libraries. Date pickers vary wildly across sites.
GPT Vision cuts through this complexity by analyzing the form the way a human does: reading labels, understanding spatial relationships, and identifying what each field expects.
Detecting Form Structure
The first step is capturing the form and asking GPT-4V to map out its structure.
```python
from pydantic import BaseModel
from openai import OpenAI


class FormField(BaseModel):
    label: str
    field_type: str  # text, email, phone, date, dropdown, checkbox, etc.
    is_required: bool
    x_center: int
    y_center: int
    placeholder: str
    options: list[str]  # for dropdowns/radio groups
    current_value: str


class FormStructure(BaseModel):
    form_title: str
    fields: list[FormField]
    submit_button_label: str
    submit_button_x: int
    submit_button_y: int


client = OpenAI()


def detect_form(screenshot_b64: str) -> FormStructure:
    """Detect form structure from a screenshot."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a form analysis expert. The viewport is "
                    "1280x720 pixels. Identify every form field, its "
                    "label, type, whether it appears required (asterisk "
                    "or 'required' text), its center coordinates, and "
                    "any visible placeholder text or dropdown options. "
                    "Also locate the submit button."
                ),
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Analyze this form."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{screenshot_b64}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        response_format=FormStructure,
    )
    return response.choices[0].message.parsed
```
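`detect_form` expects a base64-encoded PNG. With Playwright, a minimal capture helper might look like this (the `png_to_b64` and `capture_screenshot` names are assumptions for this article, not library APIs; `Page.screenshot` returns PNG bytes by default):

```python
import base64


def png_to_b64(png_bytes: bytes) -> str:
    """Encode raw PNG bytes for use in a data: URL."""
    return base64.b64encode(png_bytes).decode("ascii")


async def capture_screenshot(page) -> str:
    """Screenshot the current viewport with Playwright, base64-encoded."""
    return png_to_b64(await page.screenshot())
```

A function like `capture_screenshot` can also serve as the `screenshot_fn` parameter used later in this article.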
Mapping Data to Fields
Once you know the form structure, you need to map your data to the detected fields. GPT-4V can also handle this mapping intelligently.
```python
class FieldMapping(BaseModel):
    field_label: str
    value_to_enter: str
    interaction_type: str  # type, select, check, click


class FormFillingPlan(BaseModel):
    mappings: list[FieldMapping]
    unmapped_fields: list[str]  # fields with no matching data
    unused_data: list[str]  # data keys with no matching field


def plan_form_filling(
    form: FormStructure, data: dict[str, str]
) -> FormFillingPlan:
    """Map data values to form fields using GPT-4V."""
    fields_desc = "\n".join(
        f"- {f.label} ({f.field_type})" for f in form.fields
    )
    data_desc = "\n".join(f"- {k}: {v}" for k, v in data.items())
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a data-to-form mapping expert. Match "
                    "each data value to the correct form field based "
                    "on semantic understanding. For example, map "
                    "'email_address' to a field labeled 'Email' or "
                    "'E-mail Address'."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Form fields:\n{fields_desc}\n\n"
                    f"Data to enter:\n{data_desc}\n\n"
                    "Create the mapping."
                ),
            },
        ],
        response_format=FormFillingPlan,
    )
    return response.choices[0].message.parsed
```
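Many mappings don't need a model call at all: 'email_address' versus 'E-mail Address' differs only in case and punctuation. A cheap deterministic pre-pass can resolve those matches first and leave only the ambiguous remainder for GPT-4V. A sketch (the `normalize_label` and `direct_matches` helpers are illustrative, not part of the pipeline above):

```python
import re


def normalize_label(label: str) -> str:
    """Lowercase and strip punctuation so 'E-mail Address:' matches 'email address'."""
    return re.sub(r"[^a-z0-9 ]", "", label.lower()).strip()


def direct_matches(
    field_labels: list[str], data: dict[str, str]
) -> dict[str, str]:
    """Map data keys to field labels whose normalized forms coincide."""
    norm_fields = {normalize_label(lbl): lbl for lbl in field_labels}
    mapping = {}
    for key, value in data.items():
        label = norm_fields.get(normalize_label(key.replace("_", " ")))
        if label:
            mapping[label] = value
    return mapping
```

Anything left in `data` after this pass is what you actually send to `plan_form_filling`, which cuts token cost on forms with conventional labels.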
Executing the Form Fill
With the plan in hand, execute each field interaction sequentially.
```python
from playwright.async_api import Page
import asyncio


async def fill_form(
    page: Page, form: FormStructure, plan: FormFillingPlan
) -> None:
    """Execute the form filling plan."""
    field_lookup = {f.label.lower(): f for f in form.fields}
    for mapping in plan.mappings:
        field = field_lookup.get(mapping.field_label.lower())
        if not field:
            print(f"Warning: field '{mapping.field_label}' not found")
            continue
        if mapping.interaction_type == "type":
            # Click the field to focus it
            await page.mouse.click(field.x_center, field.y_center)
            await asyncio.sleep(0.3)
            # Clear any existing value
            await page.keyboard.press("Control+a")
            await page.keyboard.press("Backspace")
            # Type the value
            await page.keyboard.type(mapping.value_to_enter, delay=30)
        elif mapping.interaction_type == "select":
            # Click the dropdown to open it
            await page.mouse.click(field.x_center, field.y_center)
            await asyncio.sleep(0.5)
            # Type to filter options, then press Enter
            await page.keyboard.type(mapping.value_to_enter, delay=50)
            await asyncio.sleep(0.3)
            await page.keyboard.press("Enter")
        elif mapping.interaction_type in ("check", "click"):
            # Checkboxes and plain clickable controls just need a click
            await page.mouse.click(field.x_center, field.y_center)
            await asyncio.sleep(0.2)
```
Verifying Submission
After filling and submitting, capture a new screenshot and verify the result.
```python
class SubmissionResult(BaseModel):
    success: bool
    confirmation_message: str
    errors: list[str]


async def submit_and_verify(
    page: Page, form: FormStructure, screenshot_fn
) -> SubmissionResult:
    """Submit the form and verify the result."""
    # Click submit
    await page.mouse.click(
        form.submit_button_x, form.submit_button_y
    )
    await page.wait_for_load_state("networkidle")
    await asyncio.sleep(1)
    # Capture post-submission screenshot
    post_screenshot = await screenshot_fn(page)
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Analyze this screenshot taken after a form "
                    "submission. Determine if the submission was "
                    "successful, extract any confirmation message, "
                    "and list any validation errors shown."
                ),
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Was this form submission successful?",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{post_screenshot}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        response_format=SubmissionResult,
    )
    return response.choices[0].message.parsed
```
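The four pieces compose into a single entry point. A sketch of that orchestration, reusing `detect_form`, `plan_form_filling`, `fill_form`, and `submit_and_verify` from above (`screenshot_fn` is any coroutine that returns a base64 PNG of the page):

```python
async def run_form_agent(page, data: dict[str, str], screenshot_fn) -> "SubmissionResult":
    """End to end: detect the form, plan the fill, execute it, then verify."""
    shot = await screenshot_fn(page)
    form = detect_form(shot)
    plan = plan_form_filling(form, data)
    if plan.unmapped_fields:
        # Surface gaps early: required fields with no data usually fail validation.
        print(f"Warning: no data for fields: {plan.unmapped_fields}")
    await fill_form(page, form, plan)
    return await submit_and_verify(page, form, screenshot_fn)
```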
Handling Edge Cases
Real-world forms present several challenges. Multi-step wizard forms require detecting "Next" buttons and tracking progress across pages. CAPTCHA fields need human escalation. Auto-complete dropdowns require waiting for suggestions to load before selecting. Date pickers often need a click-then-navigate approach through month/year selectors.
Build defensive logic: after each field interaction, optionally re-capture and verify the field now shows the expected value. This catch-and-retry pattern prevents silent failures that only surface at submission time.
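A sketch of that per-field catch-and-retry loop, assuming GPT-4V is also the verifier (`client` and `screenshot_fn` are passed in; the prompt wording and `FieldCheck` schema are illustrative):

```python
import asyncio
from pydantic import BaseModel


class FieldCheck(BaseModel):
    matches_expected: bool
    observed_value: str


async def type_and_verify(
    page, field, value: str, client, screenshot_fn, retries: int = 2
) -> bool:
    """Type a value, re-screenshot, and ask GPT-4V whether the field shows it."""
    for _ in range(retries + 1):
        await page.mouse.click(field.x_center, field.y_center)
        await page.keyboard.press("Control+a")
        await page.keyboard.press("Backspace")
        await page.keyboard.type(value, delay=30)
        await asyncio.sleep(0.3)
        shot = await screenshot_fn(page)
        response = client.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": (
                                f"Does the field labeled '{field.label}' "
                                f"currently show the value '{value}'?"
                            ),
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/png;base64,{shot}",
                                "detail": "high",
                            },
                        },
                    ],
                },
            ],
            response_format=FieldCheck,
        )
        if response.choices[0].message.parsed.matches_expected:
            return True
    return False
```

Verifying every field doubles the screenshot count, so in practice you might reserve this check for fields marked `is_required`.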
FAQ
How does the agent handle multi-step forms with "Next" buttons?
Treat each step as a separate form detection cycle. After filling visible fields, detect and click the "Next" button, wait for the new step to load, then re-analyze the screenshot for new fields. Track completed steps to avoid repeating data entry if the page reloads.
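One way to sketch that loop, reusing `detect_form`, `plan_form_filling`, and `fill_form` from the article (using the detected form title as a step identity and the button label to spot the final step are both heuristics, not guarantees):

```python
import asyncio


async def fill_multi_step(
    page, data: dict[str, str], screenshot_fn, max_steps: int = 10
) -> None:
    """Walk a wizard form one detected step at a time."""
    completed_steps: list[str] = []
    for _ in range(max_steps):
        shot = await screenshot_fn(page)
        form = detect_form(shot)
        # Skip data entry if a reload bounced us back to a finished step.
        if form.form_title not in completed_steps:
            plan = plan_form_filling(form, data)
            await fill_form(page, form, plan)
            completed_steps.append(form.form_title)
        # On intermediate steps, the detected "submit" button is the Next button.
        await page.mouse.click(form.submit_button_x, form.submit_button_y)
        await page.wait_for_load_state("networkidle")
        await asyncio.sleep(0.5)
        if form.submit_button_label.lower() in ("submit", "finish", "send"):
            break
```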
What happens when the form has validation errors after submission?
The verification step detects error messages visually. When errors are found, the agent can re-analyze the form screenshot to identify which fields have errors, correct the values, and resubmit. Build a maximum retry count to prevent infinite loops.
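A sketch of that bounded retry loop, reusing the article's helpers (re-running `detect_form` after a failure matters because error banners often shift the layout and stale coordinates would miss their targets):

```python
async def submit_with_retries(
    page, form, data: dict[str, str], screenshot_fn, max_retries: int = 3
):
    """Submit; on validation errors, re-detect, re-fill, and resubmit."""
    result = await submit_and_verify(page, form, screenshot_fn)
    for _ in range(max_retries):
        if result.success:
            return result
        # Re-analyze the errored form so coordinates and values are fresh.
        shot = await screenshot_fn(page)
        form = detect_form(shot)
        plan = plan_form_filling(form, data)
        await fill_form(page, form, plan)
        result = await submit_and_verify(page, form, screenshot_fn)
    if not result.success:
        raise RuntimeError(
            f"Form rejected after {max_retries} retries: {result.errors}"
        )
    return result
```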
Can GPT Vision handle custom-styled form components like date pickers or color selectors?
GPT-4V recognizes most custom components visually, but interacting with them requires multi-step sequences. For a date picker, the agent might need to click the field, detect the calendar popup in a new screenshot, navigate to the correct month, and click the date. Each sub-interaction needs its own screenshot-action cycle.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.