Claude Computer Use API: How Anthropic's Vision Model Controls Desktop and Browser

What Is Claude Computer Use?

Claude Computer Use is a capability that lets Claude observe a computer screen via screenshots and take actions by issuing mouse clicks, keyboard presses, and scroll commands. Unlike traditional browser automation frameworks that rely on HTML selectors and DOM traversal, Claude operates purely through vision. It looks at pixels, understands what is on screen, and decides what to click or type next.

This removes much of the maintenance burden of selector-based automation. You no longer need to reverse-engineer a website's DOM structure, maintain fragile CSS selectors, or work around shadow DOMs and iframes. Claude sees the screen the same way a human does and interacts with it accordingly.

The Screenshot-Action Feedback Loop

The core mechanism is a tight loop between observation and action:

1. Your application captures a screenshot of the current screen state.
2. The screenshot is sent to Claude along with the task instructions.
3. Claude analyzes the image and returns a tool call specifying an action (click, type, scroll, etc.).
4. Your application executes that action on the actual machine.
5. A new screenshot is captured and sent back to Claude.
6. The loop continues until Claude determines the task is complete.

Here is a minimal implementation of this loop using the Anthropic Python SDK. It assumes a Linux X11 session with scrot and xdotool installed:

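import anthropic
import base64
import os
import subprocess
import tempfile

client = anthropic.Anthropic()

def capture_screenshot() -> str:
    """Capture screen and return base64-encoded PNG."""
    # Write to a fresh path on every call: some scrot versions refuse to
    # overwrite an existing file, which would leave a stale screenshot behind.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "screen.png")
        subprocess.run(["scrot", path], check=True)
        with open(path, "rb") as f:
            return base64.standard_b64encode(f.read()).decode()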

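def execute_action(action: dict) -> None:
    """Execute a computer use action via xdotool.

    `action` is the input of a tool_use block; its "action" field names the operation.
    """
    kind = action["action"]
    # X11 button numbers: 1 = left, 2 = middle, 3 = right
    buttons = {"left_click": "1", "middle_click": "2", "right_click": "3"}
    if kind in buttons:
        x, y = action["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y)], check=True)
        subprocess.run(["xdotool", "click", buttons[kind]], check=True)
    elif kind == "type":
        # "--" stops option parsing so text starting with "-" is typed literally
        subprocess.run(["xdotool", "type", "--clearmodifiers", "--", action["text"]], check=True)
    elif kind == "key":
        subprocess.run(["xdotool", "key", "--", action["text"]], check=True)
    elif kind == "scroll":
        x, y = action["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y)], check=True)
        # X11 reports the wheel as buttons: 4 = up, 5 = down
        button = "5" if action["scroll_direction"] == "down" else "4"
        repeat = str(action.get("scroll_amount", 1))
        subprocess.run(["xdotool", "click", "--repeat", repeat, button], check=True)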

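def run_computer_use(task: str) -> None:
    def screenshot_block() -> dict:
        """Wrap a fresh screenshot as an image content block."""
        return {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": capture_screenshot(),
            },
        }

    # First turn: the task plus the initial screen state
    messages = [{
        "role": "user",
        "content": [{"type": "text", "text": task}, screenshot_block()],
    }]

    while True:
        # Claude 4 models pair with the computer_20250124 tool, which goes
        # through the beta endpoint and needs the matching beta flag.
        response = client.beta.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=[{
                "type": "computer_20250124",
                "name": "computer",
                "display_width_px": 1920,
                "display_height_px": 1080,
                "display_number": 0,
            }],
            betas=["computer-use-2025-01-24"],
            messages=messages,
        )
        # Record the assistant turn exactly once, before answering its tool calls
        messages.append({"role": "assistant", "content": response.content})

        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:
            # No tool call: Claude considers the task finished
            text = "".join(b.text for b in response.content if b.type == "text")
            print("Task complete:", text)
            break

        # Answer every tool call in one user message, each with a fresh screenshot
        results = []
        for block in tool_uses:
            if block.input["action"] != "screenshot":
                execute_action(block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [screenshot_block()],
            })
        messages.append({"role": "user", "content": results})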
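
The loop above runs until Claude stops calling tools, so a task that never converges keeps spending tokens. A common safeguard is a step cap. The helper below is a minimal sketch of that idea; `run_with_step_limit` and its `step` callback are hypothetical names, not part of the Anthropic SDK, and `step` is assumed to perform one screenshot-action round and return True when the task is finished.

```python
from typing import Callable

def run_with_step_limit(step: Callable[[], bool], max_steps: int = 25) -> int:
    """Call `step` until it reports completion or the cap is reached.

    Returns the number of steps taken; raises RuntimeError if the cap
    is reached before `step` returns True.
    """
    for taken in range(1, max_steps + 1):
        if step():
            return taken
    raise RuntimeError(f"no completion after {max_steps} steps")
```

To use it, move the body of the while loop into a function that returns True where the loop would break, and pass that function as `step`.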

The Coordinate System

Claude Computer Use operates on an absolute pixel coordinate system. When you configure the tool, you specify display_width_px and display_height_px. Claude returns coordinates within this space. The origin (0, 0) is the top-left corner of the screen.

This means that the screenshot resolution you send must match the declared display dimensions. If you declare a 1920x1080 display but send a 960x540 screenshot, Claude's coordinates will be off by a factor of two. Always ensure consistency between the tool configuration and the actual screenshot dimensions.
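
If you do downscale screenshots before sending them, one way to stay consistent is to declare the screenshot's size in the tool configuration and map Claude's coordinates back to the real screen before executing them. The following is a small sketch of that mapping; `scale_to_screen` is a hypothetical helper name, and it assumes the screenshot covers the whole screen.

```python
def scale_to_screen(x, y, shot_size, screen_size):
    """Map a point from screenshot pixels to real-screen pixels.

    shot_size and screen_size are (width, height) tuples.
    """
    shot_w, shot_h = shot_size
    screen_w, screen_h = screen_size
    # Scale each axis independently, then clamp to the last valid pixel.
    sx = min(screen_w - 1, max(0, round(x * screen_w / shot_w)))
    sy = min(screen_h - 1, max(0, round(y * screen_h / shot_h)))
    return sx, sy

# The centre of a 960x540 screenshot maps to the centre of a 1920x1080 screen.
print(scale_to_screen(480, 270, (960, 540), (1920, 1080)))  # (960, 540)
```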

Supported Actions

Claude can issue the following action types through the computer use tool:

click: Moves the mouse to a coordinate and clicks. The tool exposes separate actions for left, right, and middle clicks (left_click, right_click, middle_click) and for double_click; newer tool versions also add triple_click.

type: Types a string of text at the current cursor position. How reliably special characters and Unicode are entered depends on your execution layer.

key: Presses a keyboard key or combination. Supports modifiers like ctrl+c, alt+tab, and shift+enter.

scroll: Scrolls in a given direction at a specified coordinate. Useful for navigating long pages and dropdown menus.

screenshot: Requests a fresh screenshot without taking any other action. Claude uses this when it needs to re-examine the screen, for example after waiting for a page to load.
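
As an illustration, the actions above can be translated into xdotool invocations with a small table-driven function. This is a sketch under assumptions: the field names (`action`, `coordinate`, `text`, `scroll_direction`, `scroll_amount`) reflect the tool's input schema as I understand it and vary by tool version, so check them against the version you configure. `xdotool_commands` is a hypothetical helper that only builds argument lists, which keeps it easy to inspect before anything is executed.

```python
# xdotool arguments for each click-style action (X11 buttons: 1 left, 2 middle, 3 right)
CLICK_ARGS = {
    "left_click": ["click", "1"],
    "middle_click": ["click", "2"],
    "right_click": ["click", "3"],
    "double_click": ["click", "--repeat", "2", "1"],
    "triple_click": ["click", "--repeat", "3", "1"],
}
# X11 reports wheel movement as button presses
SCROLL_BUTTON = {"up": "4", "down": "5", "left": "6", "right": "7"}

def xdotool_commands(action: dict) -> list[list[str]]:
    """Return the xdotool command lines that carry out one action."""
    kind = action["action"]
    if kind == "screenshot":
        return []  # nothing to execute; the caller just captures the screen
    if kind in CLICK_ARGS:
        x, y = action["coordinate"]
        return [["xdotool", "mousemove", str(x), str(y)],
                ["xdotool", *CLICK_ARGS[kind]]]
    if kind == "type":
        return [["xdotool", "type", "--", action["text"]]]
    if kind == "key":
        return [["xdotool", "key", "--", action["text"]]]
    if kind == "scroll":
        x, y = action["coordinate"]
        repeat = str(action.get("scroll_amount", 1))
        button = SCROLL_BUTTON[action["scroll_direction"]]
        return [["xdotool", "mousemove", str(x), str(y)],
                ["xdotool", "click", "--repeat", repeat, button]]
    raise ValueError(f"unsupported action: {kind!r}")
```

Each returned list can be passed to subprocess.run. Raising on unknown action names makes a schema mismatch visible instead of silently skipping a step.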

When to Use Computer Use

Computer Use excels in scenarios where selector-based automation struggles: websites with heavy JavaScript rendering, applications behind authentication flows that change frequently, legacy desktop applications without APIs, and workflows that span multiple applications. It is particularly powerful for tasks that require visual understanding, such as reading charts, interpreting layouts, or interacting with canvas-based applications.

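However, every step requires a model call on a fresh screenshot, so it is slower and more expensive than DOM-based tools like Playwright for well-structured web applications. Where the target interface is stable and exposes a clean DOM, selectors are usually the better fit; where it is not, Computer Use earns its cost.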

FAQ

Does Claude Computer Use work with any operating system?

Claude itself is OS-agnostic — it processes screenshots and returns actions. Your execution layer needs to handle the platform-specific input simulation. Anthropic provides a reference Docker container based on Linux with xdotool, but you can build execution layers for macOS (using cliclick or AppleScript) or Windows (using pyautogui or AutoHotkey).

How does Claude know where to click if the layout changes?

Because Claude uses vision rather than fixed selectors, it adapts to layout changes naturally. If a button moves from the left sidebar to the top nav, Claude sees the button in its new location and clicks there. This is one of the key advantages over selector-based automation.

What resolution should I use for screenshots?

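Anthropic's documentation advises keeping screenshots around XGA (1024x768) or WXGA (1280x800). Larger images cost more tokens, and the API may downscale them, in which case the coordinates Claude returns no longer line up with the real screen. For most use cases, 1280x800 provides a good balance between visual clarity and cost efficiency. If your display is larger, such as 1920x1080, scale screenshots down before sending them and scale the returned coordinates back up.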
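
To get a rough feel for the cost difference, Anthropic's vision documentation gives a rule of thumb of about width times height divided by 750 tokens per image. The sketch below applies that approximation; `estimate_image_tokens` is a hypothetical helper, and the figure ignores any server-side downscaling of large images, so treat it as an estimate only.

```python
def estimate_image_tokens(width_px: int, height_px: int) -> int:
    """Approximate token cost of one image: (width * height) / 750."""
    return round(width_px * height_px / 750)

for width, height in [(1024, 768), (1280, 800), (1920, 1080)]:
    print(f"{width}x{height}: ~{estimate_image_tokens(width, height)} tokens per screenshot")
```

Because every step in the loop sends a new screenshot, the per-image difference is multiplied by the number of steps in a task.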
