
UFO's Visual Understanding: How GPT-4V Interprets Windows Application Screenshots

Explore how UFO captures, annotates, and sends Windows application screenshots to GPT-4V for UI element detection, control identification, and intelligent action mapping at each automation step.

The Vision Pipeline

UFO's ability to interact with Windows applications rests entirely on its visual understanding pipeline. Unlike traditional automation that reads the accessibility tree or inspects element properties programmatically, UFO literally looks at the screen, understands what it sees, and decides what to do — much like a human operator.

The pipeline has four stages: capture, annotate, analyze, and map.
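These stages can be sketched as a simple control loop. The stage functions below are placeholders for the implementations covered in the rest of this article; the loop itself (and its step budget) is an illustrative sketch, not UFO's actual scheduler:

```python
from typing import Any, Callable, Optional

def run_pipeline(
    task: str,
    capture: Callable[[], Any],
    annotate: Callable[[Any], Any],
    analyze: Callable[[Any, str], dict],
    execute: Callable[[dict], Optional[str]],
    max_steps: int = 20,
) -> str:
    """Drive the capture -> annotate -> analyze -> map loop until the
    model signals FINISH or the step budget runs out."""
    for _ in range(max_steps):
        screenshot = capture()            # Stage 1: screenshot
        annotated = annotate(screenshot)  # Stage 2: numbered labels
        action = analyze(annotated, task) # Stage 3: vision model
        status = execute(action)          # Stage 4: UIA execution
        if status == "FINISH":
            return "FINISH"
    return "MAX_STEPS_REACHED"
```

Capping the number of steps matters in practice: a model that misreads the screen can otherwise loop forever on the same ineffective action.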

Stage 1: Screenshot Capture

UFO captures screenshots using the Windows UI Automation (UIA) backend or the Win32 API:

from PIL import Image
import pywinauto

def capture_application_screenshot(window_title: str) -> Image.Image:
    """Capture a screenshot of a specific application window."""
    app = pywinauto.Application(backend="uia").connect(
        title=window_title
    )
    window = app.top_window()

    # Bring window to foreground
    window.set_focus()

    # Capture using the UIA backend
    screenshot = window.capture_as_image()

    return screenshot

UFO uses the UIA backend by default because it captures windows even when they are partially obscured. A Win32 fallback is available for applications that do not support UIA capture, but it requires the window to be fully visible and unobscured.

Stage 2: UI Element Annotation

This is where UFO adds its distinctive numbered labels. It enumerates all interactive controls in the window and draws colored bounding boxes with numeric labels on the screenshot:

from PIL import ImageDraw, ImageFont

def annotate_screenshot(
    screenshot: Image.Image,
    controls: list[dict],
    colors: list[str] | None = None,
) -> Image.Image:
    """Draw numbered labels on interactive UI elements."""
    if colors is None:
        colors = ["#FF0000", "#00FF00", "#0000FF", "#FF00FF", "#FFFF00"]

    annotated = screenshot.copy()
    draw = ImageDraw.Draw(annotated)

    try:
        font = ImageFont.truetype("arial.ttf", 14)
    except OSError:
        font = ImageFont.load_default()

    for i, control in enumerate(controls):
        rect = control["rect"]  # (left, top, right, bottom)
        color = colors[i % len(colors)]

        # Draw bounding box
        draw.rectangle(
            [rect[0], rect[1], rect[2], rect[3]],
            outline=color,
            width=2
        )

        # Draw label background
        label = str(i + 1)
        text_bbox = draw.textbbox((0, 0), label, font=font)
        label_w = text_bbox[2] - text_bbox[0] + 6
        label_h = text_bbox[3] - text_bbox[1] + 4

        draw.rectangle(
            [rect[0], rect[1] - label_h, rect[0] + label_w, rect[1]],
            fill=color,
        )

        # Draw label text
        draw.text(
            (rect[0] + 3, rect[1] - label_h + 2),
            label,
            fill="white",
            font=font,
        )

    return annotated

The annotated screenshot is what GPT-4V actually sees. Each interactive element gets a unique number, allowing the model to reference elements precisely in its response: "click element 7" instead of trying to describe button positions.
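One edge case the annotation snippet glosses over: for a control flush against the top of the window, rect[1] - label_h goes negative and the label would be drawn off-image. A small helper (my own sketch, not part of UFO) can clamp the label inside the bounding box when there is no headroom:

```python
def label_position(
    rect: tuple[int, int, int, int],
    label_w: int,
    label_h: int,
) -> tuple[int, int, int, int]:
    """Return the (left, top, right, bottom) box for a numeric label:
    above the control's top-left corner when there is room, otherwise
    tucked just inside the control so it stays on the image."""
    left, top = rect[0], rect[1]
    if top - label_h < 0:
        # No headroom above the control: draw inside the bounding box
        return (left, top, left + label_w, top + label_h)
    return (left, top - label_h, left + label_w, top)
```

The same clamping logic applies on the left edge if labels are ever drawn to the side of a control.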


Stage 3: Vision Model Analysis

UFO sends the annotated screenshot along with structured context to the vision model:

import base64
import io
import json

from PIL import Image
from openai import OpenAI

def analyze_screenshot(
    annotated_image: Image.Image,
    task: str,
    history: list[dict],
    controls: list[dict],
) -> dict:
    """Send annotated screenshot to GPT-4V for action selection."""
    client = OpenAI()

    # Convert image to base64
    buffer = io.BytesIO()
    annotated_image.save(buffer, format="PNG")
    image_b64 = base64.b64encode(buffer.getvalue()).decode()

    # Build control descriptions
    control_text = "\n".join(
        f"[{i+1}] {c['type']}: '{c['name']}' (enabled={c['enabled']})"
        for i, c in enumerate(controls)
    )

    # Build history summary
    history_text = "\n".join(
        f"Step {h['step']}: {h['action']} on [{h['target']}] - {h['result']}"
        for h in history[-5:]  # Last 5 steps for context window efficiency
    )

    messages = [
        {
            "role": "system",
            "content": """You are a Windows UI automation agent.
Analyze the annotated screenshot and select the next action.
Each numbered label corresponds to an interactive UI element.

Respond with a JSON object containing:
- thought: Your reasoning about the current state
- action_type: click | set_text | keyboard | scroll | finish
- control_label: The number of the target element (if applicable)
- parameters: Action-specific parameters
- status: CONTINUE or FINISH"""
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Task: {task}\n\nPrevious steps:\n{history_text}\n\nAvailable controls:\n{control_text}"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_b64}",
                        "detail": "high"  # High resolution for UI details
                    }
                }
            ]
        }
    ]

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=1024,
        temperature=0.1,  # Low temperature for deterministic actions
    )

    return json.loads(response.choices[0].message.content)
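Note that calling json.loads directly on the completion is fragile: vision models often wrap the JSON in a markdown fence or add surrounding prose. A defensive parser (my addition, not UFO's code) strips the fence first; alternatively, passing response_format={"type": "json_object"} to the API sidesteps the problem entirely.

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Extract a JSON object from a model reply that may be wrapped
    in ```json ... ``` fences or surrounded by extra prose."""
    # Prefer the contents of a markdown code fence if one is present
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    # Otherwise fall back to the first {...} span in the text
    brace = re.search(r"\{.*\}", raw, re.DOTALL)
    if brace:
        return json.loads(brace.group(0))
    raise ValueError("No JSON object found in model response")
```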

Stage 4: Action Mapping

The model's response is mapped back to concrete UI Automation API calls:

def map_action_to_execution(action: dict, controls: list[dict]):
    """Convert model response to executable UIA operations."""
    action_type = action["action_type"]
    label = action.get("control_label")
    params = action.get("parameters", {})

    if action_type == "click":
        control = controls[label - 1]  # Labels are 1-indexed
        element = get_uia_element(control)
        element.click_input()

    elif action_type == "set_text":
        control = controls[label - 1]
        element = get_uia_element(control)
        if params.get("clear_first", True):
            element.set_edit_text("")
        element.type_keys(params["text"], with_spaces=True)

    elif action_type == "keyboard":
        from pywinauto.keyboard import send_keys
        send_keys(params["keys"])

    elif action_type == "scroll":
        control = controls[label - 1]
        element = get_uia_element(control)
        element.scroll(params["direction"], "page", params.get("amount", 3))

    elif action_type == "finish":
        return action.get("status", "FINISH")

Screenshot Context Window

UFO can include multiple previous screenshots in the prompt to give the model temporal context. This helps in cases where a single screenshot is ambiguous:

# config.yaml
INCLUDE_LAST_SCREENSHOTS: 3    # Include last 3 screenshots
CONCAT_SCREENSHOTS: true        # Tile them side by side

With CONCAT_SCREENSHOTS: true, UFO stitches the last N screenshots horizontally, letting the model see how the UI changed over recent steps. This is particularly useful for detecting whether an action was successful (e.g., did the dialog close after clicking OK?).
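The stitching itself is straightforward with PIL. The sketch below is my own illustration of horizontal tiling; the separator width and background color are arbitrary choices, not values taken from UFO:

```python
from PIL import Image

def concat_screenshots(shots: list[Image.Image], gap: int = 8) -> Image.Image:
    """Tile screenshots horizontally, oldest to newest, separated by
    a thin gap so step boundaries stay visible to the model."""
    height = max(s.height for s in shots)
    width = sum(s.width for s in shots) + gap * (len(shots) - 1)
    canvas = Image.new("RGB", (width, height), "white")
    x = 0
    for shot in shots:
        canvas.paste(shot, (x, 0))
        x += shot.width + gap
    return canvas
```

Keep in mind that tiling N screenshots multiplies the image area sent to the API, so a larger INCLUDE_LAST_SCREENSHOTS value trades context for token cost.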

FAQ

Why does UFO annotate screenshots instead of just sending raw images?

Without annotations, the vision model would need to describe element positions in natural language ("the button in the upper right corner"), which is imprecise and error-prone. Numbered labels create an unambiguous reference system — the model says "click element 7" and UFO knows exactly which control to interact with.

How does image resolution affect UFO's accuracy?

Higher resolution screenshots improve GPT-4V's ability to read small text and distinguish between closely spaced controls. UFO uses the detail: high parameter to request full-resolution image analysis. On high-DPI displays (4K monitors), screenshots may need to be scaled down to stay within token limits while preserving readability.
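A simple downscale helper for that case might look like the following. The 2048-pixel cap matches OpenAI's documented long-edge limit for high-detail image input; the function itself is my own sketch:

```python
from PIL import Image

def downscale_for_vision(img: Image.Image, max_edge: int = 2048) -> Image.Image:
    """Shrink a high-DPI screenshot so its longest edge fits within
    the vision API's high-detail limit, preserving aspect ratio."""
    scale = max_edge / max(img.size)
    if scale >= 1.0:
        return img  # Already within the limit
    new_size = (round(img.width * scale), round(img.height * scale))
    # LANCZOS resampling keeps small UI text as legible as possible
    return img.resize(new_size, Image.LANCZOS)
```

If you downscale after annotation, remember that the control rectangles used for action mapping still refer to the original coordinates, so scale only the image, not the stored rects.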

Can UFO work with dark mode applications?

Yes. GPT-4V handles both light and dark mode interfaces effectively. The annotation overlay colors are chosen to contrast with both light and dark backgrounds. If you notice annotation visibility issues, you can customize the annotation colors in the configuration file.


#VisualAI #GPT4Vision #ScreenshotAnalysis #UIDetection #ComputerVision #MicrosoftUFO #WindowsAutomation #MultimodalAI


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
