Claude Computer Use for Agents: Automating Desktop and Browser Tasks
Learn how to build agents that use Claude's computer use capability to analyze screenshots, map coordinates, execute mouse and keyboard actions, and verify results on desktop and browser interfaces.
What Is Claude Computer Use?
Claude computer use allows an AI agent to interact with a computer the way a human does — by looking at screenshots and performing mouse clicks, keyboard input, and scrolling. Instead of calling APIs or parsing HTML, the agent sees the screen as an image and decides what actions to take based on visual understanding.
This capability is useful for automating legacy applications that lack APIs, testing web applications, filling out forms across multiple websites, and any workflow where a human would normally sit at a computer clicking through screens.
How Computer Use Works
The workflow follows a perception-action loop:
- Your code takes a screenshot of the screen
- The screenshot is sent to Claude as an image
- Claude analyzes the screenshot and decides what action to take
- Your code executes that action (click, type, scroll)
- A new screenshot is taken and the loop repeats
Claude uses a special computer_20250124 tool that defines the available actions. The tool specification tells Claude the screen dimensions so it can map visual elements to pixel coordinates.
Setting Up the Computer Use Tool
import anthropic
import base64
import subprocess

client = anthropic.Anthropic()

# Define the computer use tool with your screen dimensions
computer_tool = {
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
    "display_number": 0,
}

def take_screenshot() -> str:
    """Capture the screen and return base64-encoded PNG."""
    subprocess.run(["scrot", "-o", "/tmp/screenshot.png"], check=True)
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode()
The display_width_px and display_height_px must match your actual screen resolution. Claude uses these dimensions to calculate pixel coordinates for clicks.
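If you prefer not to hard-code the resolution, you can derive it from the display server. A minimal sketch, assuming an X11 session where xrandr is available; parse_xrandr_resolution is a hypothetical helper that reads the mode line marked with an asterisk:

```python
import re
import subprocess

def parse_xrandr_resolution(xrandr_output: str) -> tuple[int, int]:
    """Extract the active resolution (the mode line marked with '*')."""
    match = re.search(r"(\d+)x(\d+)\s+[\d.]+\*", xrandr_output)
    if match is None:
        raise ValueError("no active display mode found in xrandr output")
    return int(match.group(1)), int(match.group(2))

def current_resolution() -> tuple[int, int]:
    """Query xrandr for the dimensions the tool definition should declare."""
    result = subprocess.run(["xrandr"], capture_output=True, text=True, check=True)
    return parse_xrandr_resolution(result.stdout)
```

You would then plug the returned width and height into display_width_px and display_height_px instead of the literals above.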
Building the Computer Use Agent Loop
The agent loop sends screenshots to Claude and executes the returned actions:
def run_computer_agent(task: str, max_steps: int = 30):
    messages = [{"role": "user", "content": task}]
    for step in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=[computer_tool],
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})

        # Check if Claude is done
        if response.stop_reason == "end_turn":
            final_text = [b.text for b in response.content if b.type == "text"]
            print(f"Agent completed: {''.join(final_text)}")
            return

        # Execute each action, then answer the tool_use block with a
        # tool_result containing a fresh screenshot. The Messages API
        # requires every tool_use to be matched by a tool_result.
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                execute_computer_action(block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": take_screenshot(),
                        },
                    }],
                })
        if tool_results:
            messages.append({"role": "user", "content": tool_results})
        print(f"Step {step + 1} completed")
Executing Computer Actions
Claude returns structured action commands that map to system-level input:
import pyautogui
import time

def execute_computer_action(action: dict):
    """Execute a computer use action."""
    action_type = action.get("action")

    if action_type == "mouse_move":
        x, y = action["coordinate"]
        pyautogui.moveTo(x, y)
    elif action_type == "left_click":
        x, y = action["coordinate"]
        pyautogui.click(x, y)
    elif action_type == "left_click_drag":
        start = action["start_coordinate"]
        end = action["coordinate"]
        pyautogui.moveTo(start[0], start[1])
        pyautogui.drag(end[0] - start[0], end[1] - start[1])
    elif action_type == "double_click":
        x, y = action["coordinate"]
        pyautogui.doubleClick(x, y)
    elif action_type == "right_click":
        x, y = action["coordinate"]
        pyautogui.rightClick(x, y)
    elif action_type == "type":
        pyautogui.typewrite(action["text"], interval=0.02)
    elif action_type == "key":
        pyautogui.hotkey(*action["text"].split("+"))
    elif action_type == "screenshot":
        pass  # the loop captures a fresh screenshot after every action
    elif action_type == "scroll":
        x, y = action["coordinate"]
        pyautogui.moveTo(x, y)
        direction = action.get("direction", "down")
        amount = action.get("amount", 3)
        scroll_val = amount if direction == "up" else -amount
        pyautogui.scroll(scroll_val)

    # Brief pause to let the UI update
    time.sleep(0.5)
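The key branch above assumes the key names Claude emits match pyautogui's vocabulary, which is not always true (the model may produce names like Return or super). A small, hypothetical normalization helper; the alias table is an assumption you should extend for your platform:

```python
# Map key names Claude may emit onto pyautogui's names (assumed aliases).
KEY_ALIASES = {"return": "enter", "super": "win", "cmd": "command"}

def normalize_keys(combo: str) -> list[str]:
    """Split a combo like 'ctrl+shift+s' into lowercase pyautogui key names."""
    return [KEY_ALIASES.get(key.strip().lower(), key.strip().lower())
            for key in combo.split("+")]
```

With this in place, the key branch would call pyautogui.hotkey(*normalize_keys(action["text"])) instead of splitting the raw string.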
Verification Strategies
Reliable computer use agents verify that their actions worked. After each action, the next screenshot shows the result. Add verification prompts to your system message:
system_prompt = """You are a computer use agent. After every action:
1. Wait for the screen to update
2. Verify the action had the expected effect
3. If something unexpected happened, try an alternative approach
4. Never assume an action succeeded without visual confirmation
If you encounter an error dialog or unexpected state, describe what
you see and attempt to recover before continuing."""
You can also add automated verification by checking for specific visual elements:
def verify_element_present(screenshot_b64: str, description: str) -> bool:
    """Ask Claude to verify an element is visible on screen."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": screenshot_b64},
                },
                {"type": "text", "text": f"Is the following element visible? Answer YES or NO: {description}"},
            ],
        }],
    )
    return "YES" in response.content[0].text.upper()
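Verification pairs naturally with retries. A generic sketch, with names of our own choosing rather than anything from the Anthropic API: run an action, check a predicate, and repeat the pair a bounded number of times:

```python
from typing import Callable

def with_retries(action_fn: Callable[[], None],
                 check_fn: Callable[[], bool],
                 retries: int = 2) -> bool:
    """Run action_fn then check_fn; retry the pair up to `retries` extra times.

    Returns True as soon as check_fn passes, False if every attempt fails.
    """
    for _ in range(retries + 1):
        action_fn()
        if check_fn():
            return True
    return False
```

In the agent, action_fn might wrap execute_computer_action and check_fn might call verify_element_present on a fresh screenshot.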
Safety Considerations
Computer use agents can interact with real systems, so safety is critical. Run agents in sandboxed environments like Docker containers or virtual machines. Never give a computer use agent access to sensitive credentials or production systems without human oversight.
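As a starting point, one sandboxing setup is a throwaway Docker container with a virtual display. A hedged sketch only; the image name my-computer-use-sandbox is hypothetical, and you would build it yourself with Python, pyautogui, scrot, and Xvfb installed:

```shell
# Launch the agent in a disposable container with a virtual 1024x768 display.
# The container still needs network access to reach the Anthropic API; pass the
# key in from the host environment, never bake it into the image.
docker run --rm -it \
  -e ANTHROPIC_API_KEY \
  --security-opt no-new-privileges \
  my-computer-use-sandbox \
  xvfb-run -s "-screen 0 1024x768x24" python agent.py
```

Because the container is removed on exit (--rm), any damage a misbehaving agent does is discarded with it.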
FAQ
What screen resolution should I use for computer use?
Anthropic recommends 1024x768 for optimal performance. Lower resolutions mean smaller screenshots (fewer tokens and lower cost) while still being clear enough for Claude to identify UI elements. Higher resolutions work but increase token usage and cost.
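If you declare the recommended 1024x768 to the model but execute actions on a larger physical screen, the coordinates Claude returns must be rescaled before clicking. A minimal sketch (the helper name is ours):

```python
def scale_coordinate(x: int, y: int,
                     model_res: tuple[int, int] = (1024, 768),
                     screen_res: tuple[int, int] = (1920, 1080)) -> tuple[int, int]:
    """Map a coordinate Claude chose at model_res onto the real screen."""
    return (round(x * screen_res[0] / model_res[0]),
            round(y * screen_res[1] / model_res[1]))
```

The companion step is downscaling each screenshot to model_res before sending it, so the image Claude sees and the coordinate space it reasons in stay consistent.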
Can computer use work with web browsers specifically?
Yes, and browsers are one of the most common use cases. Claude can navigate websites, fill forms, click buttons, and read page content from screenshots. For browser-specific automation, consider running the agent inside a headless browser environment with virtual display (Xvfb) for consistent rendering.
How reliable is coordinate-based clicking?
Claude is surprisingly accurate at mapping visual elements to coordinates, but dynamic content, pop-ups, and animations can cause misclicks. Build retry logic into your agent — if a click does not produce the expected result, Claude can analyze the new screenshot and try again. Using lower resolutions and waiting for page loads both improve reliability.
#Claude #ComputerUse #BrowserAutomation #DesktopAutomation #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.