
Element Detection with GPT Vision: Finding Buttons, Forms, and Links Without Selectors

Discover how GPT Vision identifies interactive web elements visually, eliminating the need for CSS selectors or XPaths. Learn bounding box extraction, OCR-free text reading, and visual element classification.

The Selector Fragility Problem

Every web automation engineer has experienced it: your carefully crafted CSS selector button.btn-primary.submit-form stops working because the development team renamed the class to btn-action-submit. XPaths break when a new div wrapper is added. Data attributes get removed during refactors.

GPT Vision sidesteps this entire class of problems. Instead of relying on implementation details of the HTML structure, it identifies elements the way a human does — by how they look and what text they contain.

Visual Element Detection with Structured Output

The most reliable approach is to ask GPT-4V to return structured data about every interactive element it detects on the page.

from pydantic import BaseModel
from openai import OpenAI

class DetectedElement(BaseModel):
    element_type: str  # button, link, text_input, checkbox, etc.
    label: str  # visible text or aria description
    x_center: int  # estimated center x coordinate
    y_center: int  # estimated center y coordinate
    width: int  # estimated width in pixels
    height: int  # estimated height in pixels
    confidence: str  # high, medium, low
    is_enabled: bool
    context: str  # surrounding context or section

class ElementDetectionResult(BaseModel):
    page_description: str
    elements: list[DetectedElement]
    total_interactive_count: int

client = OpenAI()

def detect_elements(screenshot_b64: str) -> ElementDetectionResult:
    """Detect all interactive elements in a screenshot."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a UI element detector. The screenshot is "
                    "1280x720 pixels. Identify every interactive element: "
                    "buttons, links, input fields, checkboxes, dropdowns, "
                    "toggles, and tabs. For each element, estimate its "
                    "center coordinates and bounding box dimensions. "
                    "Report confidence as high/medium/low."
                ),
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Detect all interactive elements.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{screenshot_b64}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        response_format=ElementDetectionResult,
    )
    return response.choices[0].message.parsed
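The function above expects a base64-encoded PNG. A minimal sketch of producing that string from raw screenshot bytes — the Playwright call in the comment is one way to obtain them, and the helper name is our own:

```python
import base64

def png_to_b64(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as the base64 string detect_elements expects."""
    return base64.b64encode(png_bytes).decode("ascii")

# With Playwright (async), the raw bytes might come from:
#   png_bytes = await page.screenshot(type="png")
# and then:
#   result = detect_elements(png_to_b64(png_bytes))
```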

Filtering Elements by Type

Once you have structured detection results, filtering for specific element types becomes straightforward Python.

def find_buttons(result: ElementDetectionResult) -> list[DetectedElement]:
    """Find all detected buttons."""
    return [
        el for el in result.elements
        if el.element_type == "button" and el.is_enabled
    ]

def find_element_by_label(
    result: ElementDetectionResult, label: str
) -> DetectedElement | None:
    """Find an element by its visible label text."""
    label_lower = label.lower()
    for el in result.elements:
        if label_lower in el.label.lower():
            return el
    return None

def find_inputs_in_region(
    result: ElementDetectionResult,
    x_min: int, y_min: int, x_max: int, y_max: int
) -> list[DetectedElement]:
    """Find input fields within a specific page region."""
    return [
        el for el in result.elements
        if el.element_type in ("text_input", "textarea", "dropdown")
        and x_min <= el.x_center <= x_max
        and y_min <= el.y_center <= y_max
    ]

OCR-Free Text Extraction

GPT-4V reads text directly from screenshots without requiring a separate OCR pipeline. This is particularly useful for extracting text from elements that are difficult to access via the DOM, such as text rendered in canvas, SVG labels, or styled components where the text node is deeply nested.


class ExtractedText(BaseModel):
    text: str
    source_type: str  # heading, paragraph, label, button_text, etc.
    approximate_y: int  # vertical position for ordering

class PageTextExtraction(BaseModel):
    texts: list[ExtractedText]

def extract_visible_text(screenshot_b64: str) -> PageTextExtraction:
    """Extract all visible text from a screenshot."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract all visible text from this web page screenshot. "
                    "Include headings, paragraph text, button labels, link "
                    "text, form labels, and any other readable text. Order "
                    "by vertical position (top to bottom)."
                ),
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Extract all text from this page.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{screenshot_b64}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        response_format=PageTextExtraction,
    )
    return response.choices[0].message.parsed
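The approximate_y values make it possible to rebuild a rough reading order after extraction. A sketch over plain (text, y) tuples — the 10-pixel tolerance for treating fragments as one visual line is our assumption:

```python
def to_reading_order(
    fragments: list[tuple[str, int]], tolerance: int = 10
) -> list[str]:
    """Sort (text, y) fragments top-to-bottom, merging near-equal y into lines."""
    lines: list[str] = []
    last_y: int | None = None
    for text, y in sorted(fragments, key=lambda f: f[1]):
        if last_y is not None and abs(y - last_y) <= tolerance:
            lines[-1] += " " + text  # same visual line: join with a space
        else:
            lines.append(text)
        last_y = y
    return lines
```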

Building a Click Target Resolver

By combining element detection with Playwright, you can build a click resolver that finds elements by visual description rather than by CSS selector.

from playwright.async_api import Page

async def click_element_by_description(
    page: Page, description: str, screenshot_b64: str
) -> bool:
    """Click an element found by visual description."""
    # Note: detect_elements is a blocking API call inside an async function;
    # consider wrapping it in asyncio.to_thread in production code.
    result = detect_elements(screenshot_b64)
    target = find_element_by_label(result, description)

    if target is None:
        print(f"Element '{description}' not found")
        return False

    if target.confidence == "low":
        print(f"Warning: low confidence match for '{description}'")

    await page.mouse.click(target.x_center, target.y_center)
    return True
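The click above assumes screenshot pixels map 1:1 to page coordinates. If you capture at device_scale_factor=2, or downscale the image before sending it to the model, the estimated coordinates need rescaling first. A minimal sketch (pure arithmetic; the names are ours):

```python
def scale_point(
    x: int, y: int,
    img_size: tuple[int, int],
    viewport: tuple[int, int],
) -> tuple[int, int]:
    """Map coordinates from screenshot pixel space to page (CSS pixel) space."""
    sx = viewport[0] / img_size[0]
    sy = viewport[1] / img_size[1]
    return round(x * sx), round(y * sy)

# A 2x-DPI screenshot of a 1280x720 viewport is 2560x1440 pixels:
# scale_point(640, 360, (2560, 1440), (1280, 720)) -> (320, 180)
```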

When Visual Detection Falls Short

Visual detection struggles with certain scenarios. Overlapping elements, very small icons without text labels, and elements hidden behind hover states are all challenging. For these cases, combine vision with a quick DOM check: use GPT-4V for the initial scan, then fall back to page.query_selector() for edge cases where visual detection reports low confidence.
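One way to wire up that fallback is to triage detections first, so only low-confidence elements pay the cost of a DOM probe. A sketch operating on plain dicts — the shape mirrors DetectedElement, and the split criterion (confidence == "low") is our assumption:

```python
def partition_by_confidence(
    elements: list[dict],
) -> tuple[list[dict], list[dict]]:
    """Split detections into vision-trusted vs. needs-DOM-verification groups."""
    trusted: list[dict] = []
    needs_check: list[dict] = []
    for el in elements:
        (needs_check if el.get("confidence") == "low" else trusted).append(el)
    return trusted, needs_check

# Elements in needs_check would then be verified with page.query_selector()
# before clicking; trusted ones can be clicked at their estimated centers.
```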

FAQ

Can GPT-4V detect elements inside iframes?

GPT-4V sees whatever is rendered in the screenshot, including iframe content. However, it cannot distinguish iframe boundaries, so it might report elements as clickable even when they require switching to the iframe context in Playwright first. Capture separate screenshots of iframe contents when precision matters.

How does element detection accuracy compare to traditional computer vision models?

For standard web UI elements, GPT-4V performs comparably to specialized models like YOLO trained on UI datasets. Its advantage is zero-shot generalization — it handles unusual designs, custom components, and non-standard layouts without any training. Specialized models are faster and cheaper per inference but require training data for each UI pattern.

Does this work for mobile-responsive layouts?

Yes. Set the Playwright viewport to a mobile size (e.g., 375x812) and GPT-4V will detect elements in the mobile layout. Be aware that hamburger menus, bottom sheets, and slide-out panels may hide elements until user interaction reveals them.


#ElementDetection #GPTVision #SelectorFree #WebAutomation #VisualAI #BoundingBox #OCRFree #AgenticAI


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
