Screenshot Analysis Agent: Understanding UI Elements and Generating Descriptions
Build a screenshot analysis agent that detects UI elements, analyzes layouts, and generates accessibility descriptions. Learn techniques for button detection, form analysis, and hierarchical layout understanding.
Why Screenshot Analysis Matters for AI Agents
Screenshot analysis is the foundation of computer use agents, automated QA testing, and accessibility tooling. An agent that can look at a screenshot and understand what UI elements are present — buttons, text fields, navigation menus, data tables — can then interact with those elements, verify their correctness, or generate descriptions for users who rely on screen readers.
Setting Up the Agent
```bash
pip install openai pydantic pillow
```
The agent combines vision-model analysis with structured output parsing to deliver actionable UI understanding.
Detecting UI Elements with Vision Models
Rather than training custom object-detection models for every UI framework, modern vision-language models can identify UI elements directly from screenshots:
```python
import base64
from dataclasses import dataclass

import openai
from pydantic import BaseModel


class UIElement(BaseModel):
    element_type: str  # button, input, link, text, image, etc.
    label: str
    bounding_box: dict  # {x, y, width, height} as percentages
    state: str = "default"  # default, disabled, focused, error
    description: str = ""


class ScreenAnalysis(BaseModel):
    page_type: str  # login, dashboard, form, list, etc.
    elements: list[UIElement]
    layout_description: str
    accessibility_issues: list[str]


async def analyze_screenshot(
    image_bytes: bytes,
    client: openai.AsyncOpenAI,
) -> ScreenAnalysis:
    """Analyze a screenshot and identify all UI elements."""
    b64 = base64.b64encode(image_bytes).decode()
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a UI analysis expert. Analyze the "
                    "screenshot and identify all interactive and "
                    "informational UI elements. For each element, "
                    "provide its type, label, approximate bounding "
                    "box as percentage coordinates (x, y from "
                    "top-left, width, height), current state, and "
                    "a brief description. Also identify the page "
                    "type, overall layout, and any accessibility "
                    "issues."
                ),
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}"
                        },
                    },
                    {
                        "type": "text",
                        "text": "Analyze this UI screenshot.",
                    },
                ],
            },
        ],
        response_format=ScreenAnalysis,
    )
    return response.choices[0].message.parsed
```
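Vision models handle downscaled screenshots well, and very large images waste tokens. A minimal pre-processing sketch, assuming a hypothetical `fit_within` helper and a 1536-pixel cap (both illustrative choices, not part of the code above); you would apply the result with Pillow's `Image.resize` before base64-encoding:

```python
def fit_within(width: int, height: int, max_side: int = 1536) -> tuple[int, int]:
    """Scale (width, height) down proportionally so the longest side
    is at most max_side; images already within the cap are unchanged."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # round() keeps the aspect ratio as close as integer pixels allow
    return max(1, round(width * scale)), max(1, round(height * scale))


# A 4K screenshot scaled for upload:
print(fit_within(3840, 2160))  # (1536, 864)
```

Because the model reports bounding boxes as percentages, downscaling before analysis does not change the coordinates it returns.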
Layout Analysis: Understanding Spatial Relationships
Beyond identifying individual elements, the agent must understand how elements relate to each other spatially. This is critical for generating meaningful descriptions and for computer use agents that need to navigate layouts:
```python
@dataclass
class LayoutRegion:
    name: str  # header, sidebar, main_content, footer, modal
    elements: list[UIElement]
    bounds: dict  # {x, y, width, height}


def group_elements_by_region(
    elements: list[UIElement],
) -> list[LayoutRegion]:
    """Group UI elements into layout regions based on position."""
    regions = {
        "header": LayoutRegion("header", [], {
            "x": 0, "y": 0, "width": 100, "height": 15
        }),
        "sidebar": LayoutRegion("sidebar", [], {
            "x": 0, "y": 15, "width": 20, "height": 70
        }),
        "main_content": LayoutRegion("main_content", [], {
            "x": 20, "y": 15, "width": 80, "height": 70
        }),
        "footer": LayoutRegion("footer", [], {
            "x": 0, "y": 85, "width": 100, "height": 15
        }),
    }
    for element in elements:
        box = element.bounding_box
        center_x = box.get("x", 0) + box.get("width", 0) / 2
        center_y = box.get("y", 0) + box.get("height", 0) / 2
        assigned = False
        for region in regions.values():
            rb = region.bounds
            if (rb["x"] <= center_x <= rb["x"] + rb["width"]
                    and rb["y"] <= center_y <= rb["y"] + rb["height"]):
                region.elements.append(element)
                assigned = True
                break
        if not assigned:
            regions["main_content"].elements.append(element)
    return [r for r in regions.values() if r.elements]
```
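Grouping alone does not fix ordering: screen readers expect roughly top-to-bottom, left-to-right traversal. A sketch of sorting a region's elements into reading order, using plain dicts in place of `UIElement` and an assumed 5%-row tolerance (both illustrative):

```python
def reading_order(elements: list[dict]) -> list[dict]:
    """Sort elements top-to-bottom, then left-to-right, using the
    percentage bounding boxes; y values are bucketed into rows to
    tolerate small vertical misalignments."""
    return sorted(
        elements,
        key=lambda e: (round(e["bounding_box"]["y"] / 5), e["bounding_box"]["x"]),
    )


# Hypothetical elements: two buttons on one row, one field above them.
buttons = [
    {"label": "Submit", "bounding_box": {"x": 60, "y": 41}},
    {"label": "Email",  "bounding_box": {"x": 10, "y": 20}},
    {"label": "Cancel", "bounding_box": {"x": 40, "y": 39}},
]
print([e["label"] for e in reading_order(buttons)])  # ['Email', 'Cancel', 'Submit']
```

The row bucketing matters because vision models rarely return pixel-perfect boxes; a naive sort on raw y would shuffle elements that sit on the same visual row.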
Generating Accessibility Descriptions
A key application is generating descriptions for accessibility auditing or screen reader content:
```python
def generate_accessibility_description(
    analysis: ScreenAnalysis,
) -> str:
    """Generate an accessibility-oriented description of the UI."""
    regions = group_elements_by_region(analysis.elements)
    lines = [
        f"Page type: {analysis.page_type}",
        f"Layout: {analysis.layout_description}",
        "",
    ]
    for region in regions:
        lines.append(f"## {region.name.replace('_', ' ').title()}")
        for elem in region.elements:
            state_info = (
                f" ({elem.state})" if elem.state != "default" else ""
            )
            lines.append(
                f"- [{elem.element_type}] {elem.label}{state_info}"
            )
            if elem.description:
                lines.append(f"  {elem.description}")
        lines.append("")
    if analysis.accessibility_issues:
        lines.append("## Accessibility Issues")
        for issue in analysis.accessibility_issues:
            lines.append(f"- {issue}")
    return "\n".join(lines)
```
The Complete Screenshot Agent
```python
class ScreenshotAnalysisAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()
        self.last_analysis: ScreenAnalysis | None = None

    async def analyze(self, image_bytes: bytes) -> dict:
        self.last_analysis = await analyze_screenshot(
            image_bytes, self.client
        )
        description = generate_accessibility_description(
            self.last_analysis
        )
        return {
            "page_type": self.last_analysis.page_type,
            "element_count": len(self.last_analysis.elements),
            "description": description,
            "issues": self.last_analysis.accessibility_issues,
        }

    def find_element(self, label: str) -> UIElement | None:
        """Find a UI element by its label."""
        if not self.last_analysis:
            return None
        label_lower = label.lower()
        for elem in self.last_analysis.elements:
            if label_lower in elem.label.lower():
                return elem
        return None
```
FAQ
How accurate are vision models at detecting UI elements compared to DOM-based approaches?
Vision models like GPT-4o achieve approximately 85-90% accuracy for common UI element detection, which is sufficient for most use cases. DOM-based approaches are more precise when available, but they require browser access and do not work for native applications, images of UIs, or design mockups. The vision-based approach is universally applicable — it works on any screenshot regardless of the technology behind the UI.
Can this agent handle dynamic UI elements like dropdown menus or modals?
Yes. When a dropdown is open or a modal is visible, those elements appear in the screenshot and the vision model identifies them. For comprehensive analysis of a dynamic page, take multiple screenshots showing different states — the initial state, after clicking a dropdown, after opening a modal — and analyze each separately. The agent can compare analyses to build a complete picture of the UI's interactive behavior.
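Comparing analyses across states can be as simple as diffing element labels. A sketch, assuming you have already extracted the label set from each `ScreenAnalysis` (the menu labels here are hypothetical):

```python
def new_elements(before: set[str], after: set[str]) -> set[str]:
    """Labels present after an interaction but not before --
    e.g. menu items revealed by opening a dropdown."""
    return after - before


closed = {"File", "Edit", "View"}
opened = {"File", "Edit", "View", "Open...", "Save", "Save As"}
print(sorted(new_elements(closed, opened)))  # ['Open...', 'Save', 'Save As']
```

The same diff in the other direction (`before - after`) reveals elements a modal obscured.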
How do I use this for automated accessibility auditing?
Run the agent on every page of your application and collect the accessibility_issues array from each analysis. Common issues the model identifies include missing alt text on images, low contrast text, unlabeled form fields, and tiny click targets. While this does not replace a full WCAG compliance audit, it catches the most impactful issues quickly and can run as part of a CI pipeline on screenshot snapshots.
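For the CI case, a small aggregation step turns per-page analyses into a pass/fail signal. A sketch assuming a hypothetical mapping of page routes to their collected `accessibility_issues` lists:

```python
def audit_report(results: dict[str, list[str]]) -> tuple[int, str]:
    """Collapse per-page accessibility_issues into a failure count
    and a plain-text report suitable for a CI log."""
    lines: list[str] = []
    total = 0
    for page, issues in sorted(results.items()):
        for issue in issues:
            lines.append(f"{page}: {issue}")
            total += 1
    return total, "\n".join(lines)


total, report = audit_report({
    "/login": ["Unlabeled password field."],
    "/home": [],
    "/settings": ["Low-contrast caption text.", "Icon button missing alt text."],
})
print(total)  # 3
```

A CI job could then exit nonzero whenever the count is above zero (or above a tolerated baseline).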
CallSphere Team
Expert insights on AI voice agents and customer communication automation.