Claude Computer Use API: How Anthropic's Vision Model Controls Desktop and Browser
Understand how Claude's computer use tool works under the hood — the screenshot-action feedback loop, the coordinate system, supported actions, and how to integrate it via the Anthropic API.
What Is Claude Computer Use?
Claude Computer Use is a capability that lets Claude observe a computer screen via screenshots and take actions by issuing mouse clicks, keyboard presses, and scroll commands. Unlike traditional browser automation frameworks that rely on HTML selectors and DOM traversal, Claude operates purely through vision. It looks at pixels, understands what is on screen, and decides what to click or type next.
This fundamentally changes the automation landscape. You no longer need to reverse-engineer a website's DOM structure, maintain fragile CSS selectors, or deal with shadow DOMs and iframes. Claude sees the screen the same way a human does and interacts with it accordingly.
The Screenshot-Action Feedback Loop
The core mechanism is a tight loop between observation and action:
- Your application captures a screenshot of the current screen state
- The screenshot is sent to Claude along with the task instructions
- Claude analyzes the image and returns a tool call specifying an action (click, type, scroll, etc.)
- Your application executes that action on the actual machine
- A new screenshot is captured and sent back to Claude
- The loop continues until Claude determines the task is complete
Here is a minimal implementation of this loop using the Anthropic Python SDK:
import anthropic
import base64
import subprocess
client = anthropic.Anthropic()
def capture_screenshot() -> str:
"""Capture screen and return base64-encoded PNG."""
subprocess.run(["scrot", "/tmp/screen.png"], check=True)
with open("/tmp/screen.png", "rb") as f:
return base64.standard_b64encode(f.read()).decode()
def execute_action(action: dict):
"""Execute a computer use action via xdotool."""
if action["type"] == "click":
x, y = action["coordinate"]
subprocess.run(["xdotool", "mousemove", str(x), str(y)])
subprocess.run(["xdotool", "click", "1"])
elif action["type"] == "type":
subprocess.run(["xdotool", "type", "--clearmodifiers", action["text"]])
elif action["type"] == "key":
subprocess.run(["xdotool", "key", action["text"]])
elif action["type"] == "scroll":
x, y = action["coordinate"]
subprocess.run(["xdotool", "mousemove", str(x), str(y)])
direction = "5" if action["direction"] == "down" else "-5"
subprocess.run(["xdotool", "click", direction])
def run_computer_use(task: str):
messages = [{"role": "user", "content": task}]
while True:
screenshot_b64 = capture_screenshot()
# Append screenshot to the conversation
messages.append({
"role": "user",
"content": [{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": screenshot_b64,
},
}],
})
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
tools=[{
"type": "computer_20241022",
"name": "computer",
"display_width_px": 1920,
"display_height_px": 1080,
"display_number": 0,
}],
messages=messages,
)
# Check if Claude wants to use a tool or is done
if response.stop_reason == "end_turn":
print("Task complete:", response.content[0].text)
break
for block in response.content:
if block.type == "tool_use":
execute_action(block.input)
messages.append({"role": "assistant", "content": response.content})
messages.append({
"role": "user",
"content": [{"type": "tool_result", "tool_use_id": block.id, "content": "Action executed"}],
})
The Coordinate System
Claude Computer Use operates on an absolute pixel coordinate system. When you configure the tool, you specify display_width_px and display_height_px. Claude returns coordinates within this space. The origin (0, 0) is the top-left corner of the screen.
This means that the screenshot resolution you send must match the declared display dimensions. If you declare a 1920x1080 display but send a 960x540 screenshot, Claude's coordinates will be off by a factor of two. Always ensure consistency between the tool configuration and the actual screenshot dimensions.
Supported Actions
Claude can issue the following action types through the computer use tool:
click — Moves the mouse to a coordinate and clicks. Supports left, right, and middle button clicks, as well as double-click and triple-click.
See AI Voice Agents Handle Real Calls
Book a free demo or calculate how much you can save with AI voice automation.
type — Types a string of text at the current cursor position. Handles special characters and Unicode correctly.
key — Presses a keyboard key or combination. Supports modifiers like ctrl+c, alt+tab, and shift+enter.
scroll — Scrolls up or down at a specified coordinate. Useful for navigating long pages and dropdown menus.
screenshot — Requests a fresh screenshot without taking any action. Claude uses this when it needs to re-examine the screen after waiting for a page to load.
When to Use Computer Use
Computer Use excels in scenarios where traditional automation fails: websites with heavy JavaScript rendering, applications behind authentication flows that change frequently, legacy desktop applications without APIs, and workflows that span multiple applications. It is particularly powerful for tasks that require visual understanding — reading charts, interpreting layouts, or interacting with canvas-based applications.
However, it is slower and more expensive than DOM-based tools like Playwright for well-structured web applications. The right approach depends on your specific use case and the stability of the target interface.
FAQ
Does Claude Computer Use work with any operating system?
Claude itself is OS-agnostic — it processes screenshots and returns actions. Your execution layer needs to handle the platform-specific input simulation. Anthropic provides a reference Docker container based on Linux with xdotool, but you can build execution layers for macOS (using cliclick or AppleScript) or Windows (using pyautogui or AutoHotkey).
How does Claude know where to click if the layout changes?
Because Claude uses vision rather than fixed selectors, it adapts to layout changes naturally. If a button moves from the left sidebar to the top nav, Claude sees the button in its new location and clicks there. This is one of the key advantages over selector-based automation.
What resolution should I use for screenshots?
Anthropic recommends keeping screenshots within the supported range — typically 1280x800 or 1920x1080. Higher resolutions increase token consumption since larger images cost more tokens. For most use cases, 1280x800 provides a good balance between visual clarity and cost efficiency.
#ClaudeComputerUse #AnthropicAPI #BrowserAutomation #VisionAI #DesktopAutomation #AgenticAI #AIAutomation
CallSphere Team
Expert insights on AI voice agents and customer communication automation.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.