Screenshot Analysis Agent: Understanding UI Elements and Generating Descriptions
Build a screenshot analysis agent that detects UI elements, analyzes layouts, and generates accessibility descriptions. Learn techniques for button detection, form analysis, and hierarchical layout understanding.
Why Screenshot Analysis Matters for AI Agents
Screenshot analysis is the foundation of computer use agents, automated QA testing, and accessibility tooling. An agent that can look at a screenshot and understand what UI elements are present — buttons, text fields, navigation menus, data tables — can then interact with those elements, verify their correctness, or generate descriptions for users who rely on screen readers.
Setting Up the Agent
```bash
pip install openai pydantic pillow
```
The agent combines vision-model analysis with structured output parsing to deliver actionable UI understanding.
Detecting UI Elements with Vision Models
Rather than training custom object-detection models for every UI framework, modern vision-language models can identify UI elements directly from screenshots:
```python
import base64
from dataclasses import dataclass

import openai
from pydantic import BaseModel


class UIElement(BaseModel):
    element_type: str  # button, input, link, text, image, etc.
    label: str
    bounding_box: dict  # {x, y, width, height} as percentages
    state: str = "default"  # default, disabled, focused, error
    description: str = ""


class ScreenAnalysis(BaseModel):
    page_type: str  # login, dashboard, form, list, etc.
    elements: list[UIElement]
    layout_description: str
    accessibility_issues: list[str]


async def analyze_screenshot(
    image_bytes: bytes,
    client: openai.AsyncOpenAI,
) -> ScreenAnalysis:
    """Analyze a screenshot and identify all UI elements."""
    b64 = base64.b64encode(image_bytes).decode()
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a UI analysis expert. Analyze the "
                    "screenshot and identify all interactive and "
                    "informational UI elements. For each element, "
                    "provide its type, label, approximate bounding "
                    "box as percentage coordinates (x, y from "
                    "top-left, width, height), current state, and "
                    "a brief description. Also identify the page "
                    "type, overall layout, and any accessibility "
                    "issues."
                ),
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}"
                        },
                    },
                    {
                        "type": "text",
                        "text": "Analyze this UI screenshot.",
                    },
                ],
            },
        ],
        response_format=ScreenAnalysis,
    )
    return response.choices[0].message.parsed
```
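Vision models handle downscaled screenshots well, and very large images waste tokens. A minimal pre-processing sketch, assuming a hypothetical `fit_within` helper and a 1536-pixel cap (both illustrative choices, not part of the code above); you would apply the result with Pillow's `Image.resize` before base64-encoding:

```python
def fit_within(width: int, height: int, max_side: int = 1536) -> tuple[int, int]:
    """Scale (width, height) down proportionally so the longest side
    is at most max_side; images already within the cap are unchanged."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # round() keeps the aspect ratio as close as integer pixels allow
    return max(1, round(width * scale)), max(1, round(height * scale))


# A 4K screenshot scaled for upload:
print(fit_within(3840, 2160))  # (1536, 864)
```

Because the model reports bounding boxes as percentages, downscaling before analysis does not change the coordinates it returns.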
Layout Analysis: Understanding Spatial Relationships
Beyond identifying individual elements, the agent must understand how elements relate to each other spatially. This is critical for generating meaningful descriptions and for computer use agents that need to navigate layouts:
```python
@dataclass
class LayoutRegion:
    name: str  # header, sidebar, main_content, footer, modal
    elements: list[UIElement]
    bounds: dict  # {x, y, width, height}


def group_elements_by_region(
    elements: list[UIElement],
) -> list[LayoutRegion]:
    """Group UI elements into layout regions based on position."""
    regions = {
        "header": LayoutRegion("header", [], {
            "x": 0, "y": 0, "width": 100, "height": 15
        }),
        "sidebar": LayoutRegion("sidebar", [], {
            "x": 0, "y": 15, "width": 20, "height": 70
        }),
        "main_content": LayoutRegion("main_content", [], {
            "x": 20, "y": 15, "width": 80, "height": 70
        }),
        "footer": LayoutRegion("footer", [], {
            "x": 0, "y": 85, "width": 100, "height": 15
        }),
    }
    for element in elements:
        box = element.bounding_box
        center_x = box.get("x", 0) + box.get("width", 0) / 2
        center_y = box.get("y", 0) + box.get("height", 0) / 2
        assigned = False
        for region in regions.values():
            rb = region.bounds
            if (rb["x"] <= center_x <= rb["x"] + rb["width"]
                    and rb["y"] <= center_y <= rb["y"] + rb["height"]):
                region.elements.append(element)
                assigned = True
                break
        if not assigned:
            regions["main_content"].elements.append(element)
    return [r for r in regions.values() if r.elements]
```
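Grouping alone does not fix ordering: screen readers expect roughly top-to-bottom, left-to-right traversal. A sketch of sorting a region's elements into reading order, using plain dicts in place of `UIElement` and an assumed 5%-row tolerance (both illustrative):

```python
def reading_order(elements: list[dict]) -> list[dict]:
    """Sort elements top-to-bottom, then left-to-right, using the
    percentage bounding boxes; y values are bucketed into rows to
    tolerate small vertical misalignments."""
    return sorted(
        elements,
        key=lambda e: (round(e["bounding_box"]["y"] / 5), e["bounding_box"]["x"]),
    )


# Hypothetical elements: two buttons on one row, one field above them.
buttons = [
    {"label": "Submit", "bounding_box": {"x": 60, "y": 41}},
    {"label": "Email",  "bounding_box": {"x": 10, "y": 20}},
    {"label": "Cancel", "bounding_box": {"x": 40, "y": 39}},
]
print([e["label"] for e in reading_order(buttons)])  # ['Email', 'Cancel', 'Submit']
```

The row bucketing matters because vision models rarely return pixel-perfect boxes; a naive sort on raw y would shuffle elements that sit on the same visual row.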
Generating Accessibility Descriptions
A key application is generating descriptions for accessibility auditing or screen reader content:
```python
def generate_accessibility_description(
    analysis: ScreenAnalysis,
) -> str:
    """Generate an accessibility-oriented description of the UI."""
    regions = group_elements_by_region(analysis.elements)
    lines = [
        f"Page type: {analysis.page_type}",
        f"Layout: {analysis.layout_description}",
        "",
    ]
    for region in regions:
        lines.append(f"## {region.name.replace('_', ' ').title()}")
        for elem in region.elements:
            state_info = (
                f" ({elem.state})" if elem.state != "default" else ""
            )
            lines.append(
                f"- [{elem.element_type}] {elem.label}{state_info}"
            )
            if elem.description:
                lines.append(f"  {elem.description}")
        lines.append("")
    if analysis.accessibility_issues:
        lines.append("## Accessibility Issues")
        for issue in analysis.accessibility_issues:
            lines.append(f"- {issue}")
    return "\n".join(lines)
```
The Complete Screenshot Agent
```python
class ScreenshotAnalysisAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()
        self.last_analysis: ScreenAnalysis | None = None

    async def analyze(self, image_bytes: bytes) -> dict:
        self.last_analysis = await analyze_screenshot(
            image_bytes, self.client
        )
        description = generate_accessibility_description(
            self.last_analysis
        )
        return {
            "page_type": self.last_analysis.page_type,
            "element_count": len(self.last_analysis.elements),
            "description": description,
            "issues": self.last_analysis.accessibility_issues,
        }

    def find_element(self, label: str) -> UIElement | None:
        """Find a UI element by its label."""
        if not self.last_analysis:
            return None
        label_lower = label.lower()
        for elem in self.last_analysis.elements:
            if label_lower in elem.label.lower():
                return elem
        return None
```
FAQ
How accurate are vision models at detecting UI elements compared to DOM-based approaches?
Vision models like GPT-4o achieve approximately 85-90% accuracy for common UI element detection, which is sufficient for most use cases. DOM-based approaches are more precise when available, but they require browser access and do not work for native applications, images of UIs, or design mockups. The vision-based approach is universally applicable — it works on any screenshot regardless of the technology behind the UI.
Can this agent handle dynamic UI elements like dropdown menus or modals?
Yes. When a dropdown is open or a modal is visible, those elements appear in the screenshot and the vision model identifies them. For comprehensive analysis of a dynamic page, take multiple screenshots showing different states — the initial state, after clicking a dropdown, after opening a modal — and analyze each separately. The agent can compare analyses to build a complete picture of the UI's interactive behavior.
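Comparing analyses across states can be as simple as diffing element labels. A sketch, assuming you have already extracted the label set from each `ScreenAnalysis` (the menu labels here are hypothetical):

```python
def new_elements(before: set[str], after: set[str]) -> set[str]:
    """Labels present after an interaction but not before --
    e.g. menu items revealed by opening a dropdown."""
    return after - before


closed = {"File", "Edit", "View"}
opened = {"File", "Edit", "View", "Open...", "Save", "Save As"}
print(sorted(new_elements(closed, opened)))  # ['Open...', 'Save', 'Save As']
```

The same diff in the other direction (`before - after`) reveals elements a modal obscured.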
How do I use this for automated accessibility auditing?
Run the agent on every page of your application and collect the accessibility_issues array from each analysis. Common issues the model identifies include missing alt text on images, low contrast text, unlabeled form fields, and tiny click targets. While this does not replace a full WCAG compliance audit, it catches the most impactful issues quickly and can run as part of a CI pipeline on screenshot snapshots.
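For the CI case, a small aggregation step turns per-page analyses into a pass/fail signal. A sketch assuming a hypothetical mapping of page routes to their collected `accessibility_issues` lists:

```python
def audit_report(results: dict[str, list[str]]) -> tuple[int, str]:
    """Collapse per-page accessibility_issues into a failure count
    and a plain-text report suitable for a CI log."""
    lines: list[str] = []
    total = 0
    for page, issues in sorted(results.items()):
        for issue in issues:
            lines.append(f"{page}: {issue}")
            total += 1
    return total, "\n".join(lines)


total, report = audit_report({
    "/login": ["Unlabeled password field."],
    "/home": [],
    "/settings": ["Low-contrast caption text.", "Icon button missing alt text."],
})
print(total)  # 3
```

A CI job could then exit nonzero whenever the count is above zero (or above a tolerated baseline).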
CallSphere Team
Expert insights on AI voice agents and customer communication automation.