
Computer Use Agents: AI That Controls Browser and Desktop Applications

Learn how to build computer use agents that interact with browser and desktop applications by capturing screenshots, detecting UI elements, performing click and type actions, and verifying results through visual feedback loops.

What Are Computer Use Agents

Computer use agents are AI systems that interact with software the same way a human does — by looking at the screen, identifying UI elements, clicking buttons, typing text, and verifying that their actions produced the expected result. Unlike API-based automation, computer use agents work with any application that has a visual interface, including legacy systems with no API at all.

The Perception-Action-Verification Loop

Every computer use agent operates on a three-step loop:

  1. Perceive — capture a screenshot and understand what is on screen
  2. Act — perform a mouse click, keyboard input, or scroll action
  3. Verify — capture another screenshot and confirm the action succeeded

This loop repeats until the task is complete or the agent determines it cannot proceed.
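The three steps above can be sketched as a plain-Python skeleton. The perceive, decide, act, and verify functions here are illustrative stubs; the rest of this article builds the real versions (capture_screen, decide_action, execute_action).

```python
# Illustrative stubs so the skeleton runs standalone.
def perceive() -> str:
    return "screen with a visible Submit button"

def decide(task: str, observation: str) -> str:
    return "done" if "Submit" in observation else "click"

def act(action: str) -> None:
    pass  # would click / type / scroll here

def verify() -> bool:
    return True  # would re-screenshot and compare here

def run_loop(task: str, max_steps: int = 5) -> str:
    for step in range(max_steps):
        observation = perceive()            # 1. capture + understand the screen
        action = decide(task, observation)  # pick the next action
        if action == "done":
            return "completed"
        act(action)                         # 2. click / type / scroll
        if not verify():                    # 3. confirm via a fresh screenshot
            return "failed"
    return "max_steps_reached"
```

The real loop follows this exact shape, with an LLM playing the role of decide.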

Screen Capture and Element Detection

Start with the ability to capture the screen and identify interactive elements:

import pyautogui
import openai
import base64
import io
from PIL import Image
from pydantic import BaseModel


class ClickTarget(BaseModel):
    element_description: str
    x_percent: float  # 0-100 percentage of screen width
    y_percent: float  # 0-100 percentage of screen height
    action: str  # click, double_click, right_click


class TypeAction(BaseModel):
    text: str
    press_enter: bool = False


class AgentAction(BaseModel):
    action_type: str  # click, type, scroll, wait, done, fail
    click_target: ClickTarget | None = None
    type_action: TypeAction | None = None
    scroll_direction: str | None = None  # up, down
    reasoning: str


def capture_screen() -> bytes:
    """Capture the current screen as PNG bytes."""
    screenshot = pyautogui.screenshot()
    buf = io.BytesIO()
    screenshot.save(buf, format="PNG")
    return buf.getvalue()


async def decide_action(
    screenshot_bytes: bytes,
    task: str,
    history: list[str],
    client: openai.AsyncOpenAI,
) -> AgentAction:
    """Analyze the screen and decide the next action."""
    b64 = base64.b64encode(screenshot_bytes).decode()

    history_text = "\n".join(
        f"Step {i + 1}: {h}" for i, h in enumerate(history)
    )

    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a computer use agent. You can see the "
                    "screen and must decide the next action to "
                    "complete the given task. Provide click "
                    "coordinates as percentages of screen dimensions. "
                    "If the task is complete, set action_type to "
                    "'done'. If you cannot proceed, set it to 'fail'."
                ),
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            f"Task: {task}\n\n"
                            f"Actions taken so far:\n{history_text}"
                            f"\n\nWhat should I do next?"
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}"
                        },
                    },
                ],
            },
        ],
        response_format=AgentAction,
    )
    return response.choices[0].message.parsed
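Full-resolution screenshots can be expensive to send to a vision model. A small helper — a sketch, not part of the agent above — can downscale the capture before base64 encoding; the 1280-pixel threshold is an illustrative choice.

```python
import io
from PIL import Image

def downscale_screenshot(png_bytes: bytes, max_width: int = 1280) -> bytes:
    """Shrink a screenshot before sending it to the model to cut token cost."""
    img = Image.open(io.BytesIO(png_bytes))
    if img.width > max_width:
        ratio = max_width / img.width
        # Preserve the aspect ratio so percentage coordinates stay valid.
        img = img.resize((max_width, int(img.height * ratio)))
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()
```

Because the agent works in percentage coordinates rather than pixels, downscaling does not break the click targets the model returns.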

Executing Actions Safely

The action executor translates the agent's decisions into actual mouse and keyboard inputs. Safety mechanisms prevent destructive actions:


import time


SCREEN_WIDTH, SCREEN_HEIGHT = pyautogui.size()

# Safety: prevent closing system applications
FORBIDDEN_REGIONS = [
    # Top-right corner (close buttons) of full-screen windows
    {"x_min": 95, "y_min": 0, "x_max": 100, "y_max": 5},
]


def is_safe_click(x_pct: float, y_pct: float) -> bool:
    """Check if a click target is in a safe region."""
    for region in FORBIDDEN_REGIONS:
        if (region["x_min"] <= x_pct <= region["x_max"]
                and region["y_min"] <= y_pct <= region["y_max"]):
            return False
    return True


def execute_action(action: AgentAction) -> str:
    """Execute an action and return a description of what happened."""
    if action.action_type == "click" and action.click_target:
        target = action.click_target
        if not is_safe_click(target.x_percent, target.y_percent):
            return "BLOCKED: click target is in a forbidden region"

        x = int(target.x_percent / 100 * SCREEN_WIDTH)
        y = int(target.y_percent / 100 * SCREEN_HEIGHT)

        if target.action == "double_click":
            pyautogui.doubleClick(x, y)
        elif target.action == "right_click":
            pyautogui.rightClick(x, y)
        else:
            pyautogui.click(x, y)

        return f"Clicked at ({x}, {y}): {target.element_description}"

    elif action.action_type == "type" and action.type_action:
        # pyautogui.write is the current name; typewrite is a deprecated alias
        pyautogui.write(
            action.type_action.text, interval=0.02
        )
        if action.type_action.press_enter:
            pyautogui.press("enter")
        return f"Typed: {action.type_action.text}"

    elif action.action_type == "scroll":
        direction = action.scroll_direction or "down"
        clicks = -3 if direction == "down" else 3
        pyautogui.scroll(clicks)
        return f"Scrolled {direction}"

    elif action.action_type == "wait":
        # The schema allows "wait", so actually pause for the UI to settle
        time.sleep(2.0)
        return "Waited for the UI to settle"

    return f"Action: {action.action_type}"
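The percent-to-pixel conversion above trusts the model's coordinates. Clamping them first guards against out-of-range values that would raise or click the wrong monitor — a self-contained sketch, with screen dimensions passed in explicitly:

```python
def percent_to_pixels(
    x_pct: float, y_pct: float, width: int, height: int
) -> tuple[int, int]:
    """Convert percentage coordinates to clamped pixel coordinates."""
    x_pct = min(max(x_pct, 0.0), 100.0)  # clamp into the 0-100 range
    y_pct = min(max(y_pct, 0.0), 100.0)
    # Cap at width-1 / height-1 so 100% maps to the last addressable pixel.
    x = min(int(x_pct / 100 * width), width - 1)
    y = min(int(y_pct / 100 * height), height - 1)
    return x, y
```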

The Computer Use Agent Loop

Combine perception, action, and verification into the main agent loop:

class ComputerUseAgent:
    def __init__(self, max_steps: int = 20):
        self.client = openai.AsyncOpenAI()
        self.max_steps = max_steps
        self.history: list[str] = []

    async def execute_task(self, task: str) -> dict:
        """Execute a task by interacting with the screen."""
        self.history = []  # start each task with a fresh history
        for step in range(self.max_steps):
            # Perceive
            screenshot = capture_screen()

            # Decide
            action = await decide_action(
                screenshot, task, self.history, self.client
            )

            # Check termination
            if action.action_type == "done":
                return {
                    "status": "completed",
                    "steps": len(self.history),
                    "history": self.history,
                }

            if action.action_type == "fail":
                return {
                    "status": "failed",
                    "reason": action.reasoning,
                    "steps": len(self.history),
                    "history": self.history,
                }

            # Act
            result = execute_action(action)
            self.history.append(
                f"{action.reasoning} -> {result}"
            )

            # Wait for UI to update
            time.sleep(1.0)

        return {
            "status": "max_steps_reached",
            "steps": self.max_steps,
            "history": self.history,
        }
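The decide_action call in this loop hits a remote model API, which can fail transiently (rate limits, timeouts). A generic retry wrapper — a sketch, not part of the class above, with illustrative backoff parameters — could wrap that call:

```python
import asyncio

async def with_retries(coro_factory, attempts: int = 3, base_delay: float = 0.5):
    """Await coro_factory(), retrying with exponential backoff on errors.

    coro_factory is a zero-argument callable returning a fresh coroutine
    each attempt, e.g. lambda: decide_action(shot, task, history, client).
    """
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            await asyncio.sleep(base_delay * 2 ** attempt)
```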

Browser-Specific Automation with Playwright

For browser tasks specifically, Playwright provides more reliable element interaction than screen coordinates:

from playwright.async_api import async_playwright


class BrowserAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def execute_browser_task(
        self, task: str, start_url: str
    ) -> dict:
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            page = await browser.new_page()
            await page.goto(start_url)

            history = []
            for step in range(15):
                screenshot = await page.screenshot()
                action = await decide_action(
                    screenshot, task, history, self.client
                )

                if action.action_type in ("done", "fail"):
                    await browser.close()
                    return {
                        "status": action.action_type,
                        "history": history,
                    }

                if (action.action_type == "click"
                        and action.click_target):
                    t = action.click_target
                    vp = page.viewport_size
                    x = int(t.x_percent / 100 * vp["width"])
                    y = int(t.y_percent / 100 * vp["height"])
                    await page.mouse.click(x, y)
                    history.append(f"Clicked ({x},{y})")

                elif (action.action_type == "type"
                      and action.type_action):
                    await page.keyboard.type(
                        action.type_action.text
                    )
                    if action.type_action.press_enter:
                        await page.keyboard.press("Enter")
                    history.append(
                        f"Typed: {action.type_action.text}"
                    )

                await page.wait_for_timeout(1000)

            await browser.close()
            return {"status": "max_steps", "history": history}
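When a page's elements have stable roles or labels, Playwright's locator API is sturdier than coordinate clicks because locators auto-wait for elements to be visible and enabled. A hedged sketch of the alternative — the roles, names, and URL are placeholders, not from the agent above:

```python
async def click_by_selector(task_url: str) -> str:
    """Sketch: drive a form with locators instead of coordinates."""
    # Local import so the module loads even without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(task_url)
        # Locators wait for the element automatically; no pixel math needed.
        await page.get_by_role("textbox", name="Search").fill("agents")
        await page.get_by_role("button", name="Submit").click()
        title = await page.title()
        await browser.close()
        return title
```

A practical hybrid is to let the model propose a semantic target ("the Submit button") and resolve it to a locator, falling back to coordinates only when no selector matches.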

FAQ

How reliable are coordinate-based clicks compared to element selectors?

Coordinate-based clicks work universally but are brittle — screen resolution changes, window repositioning, or font scaling can break them. Element selectors (CSS or XPath) in Playwright are much more reliable for browser tasks. Use coordinate-based approaches only for desktop applications where selectors are not available. When using coordinates, always verify the click landed on the expected element by taking a follow-up screenshot.

What safety measures should computer use agents have?

Implement multiple safety layers: forbidden regions to prevent closing critical applications, action rate limiting (no more than one action per second), a kill switch that stops the agent immediately, confirmation prompts for destructive actions like deleting files or submitting forms, and a maximum step limit. Never let the agent access password managers, banking applications, or admin panels without explicit human approval at each step.
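Two of those layers — rate limiting and a kill switch — fit in a few lines. A sketch, assuming a file-based kill switch (the path is hypothetical; any out-of-band signal works):

```python
import time
from pathlib import Path

KILL_SWITCH = Path("/tmp/agent_stop")  # hypothetical: touch this file to halt
MIN_ACTION_INTERVAL = 1.0  # no more than one action per second

_last_action_time = 0.0

def guard_action(kill_switch: Path = KILL_SWITCH) -> bool:
    """Return False if the agent must stop; otherwise rate-limit and allow."""
    global _last_action_time
    if kill_switch.exists():
        return False  # a human asked the agent to halt
    elapsed = time.monotonic() - _last_action_time
    if elapsed < MIN_ACTION_INTERVAL:
        time.sleep(MIN_ACTION_INTERVAL - elapsed)  # enforce the rate limit
    _last_action_time = time.monotonic()
    return True
```

Calling guard_action() before every execute_action() call gives a human a one-touch way to stop a misbehaving agent.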

Can computer use agents handle multi-monitor setups?

Yes, but with additional configuration. PyAutoGUI supports multi-monitor coordinates — the second monitor's coordinates continue from where the first ends. Capture screenshots of all monitors or specify which monitor to use. For most tasks, constrain the agent to a single monitor to reduce complexity. Always verify the agent knows which monitor contains the target application.
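To illustrate how a second monitor's coordinates continue from the first, a sketch that maps monitor-relative percentages onto the virtual desktop (the monitor geometry below is hypothetical; query the real layout from the OS):

```python
# Each monitor is (origin_x, origin_y, width, height) in virtual-desktop pixels.
MONITORS = [
    (0, 0, 1920, 1080),      # primary
    (1920, 0, 1920, 1080),   # secondary, to the right of the primary
]

def monitor_percent_to_global(
    monitor_index: int, x_pct: float, y_pct: float
) -> tuple[int, int]:
    """Convert monitor-relative percentages to global desktop pixels."""
    ox, oy, w, h = MONITORS[monitor_index]
    return ox + int(x_pct / 100 * w), oy + int(y_pct / 100 * h)
```

With a mapping like this, the agent can keep reasoning in per-monitor percentages while the executor handles the global offset.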


#ComputerUse #BrowserAutomation #DesktopAutomation #UIInteraction #Python #AgenticAI #LearnAI #AIEngineering


CallSphere Team
