Building a Claude Desktop Automation Agent: Beyond the Browser to Native Applications
Extend Claude Computer Use from browser automation to full desktop control — switching between applications, navigating native menus, performing file operations, and orchestrating workflows across multiple desktop programs.
Beyond the Browser
Claude Computer Use is not limited to web browsers. Because it operates on screenshots and issues keyboard/mouse commands, it can control any application visible on screen — spreadsheets, email clients, file managers, terminal windows, design tools, and legacy enterprise software.
This opens up powerful automation scenarios that span multiple applications: extracting data from a web portal, pasting it into Excel, generating a chart, copying the chart into a Word document, and emailing the result, all driven by a single Claude agent that sees and interacts with each application in turn.
Desktop Screenshot and Input on Linux
For desktop automation, we use system-level tools instead of Playwright. On Linux, scrot captures screenshots and xdotool handles input:
import subprocess
import base64
import time


class DesktopController:
    def __init__(self, display: str = ":0"):
        self.display = display
        self.env = {"DISPLAY": display}

    def screenshot(self) -> str:
        """Capture the entire screen."""
        path = "/tmp/desktop_screenshot.png"
        subprocess.run(
            ["scrot", path],
            env=self.env,
            check=True,
        )
        with open(path, "rb") as f:
            return base64.standard_b64encode(f.read()).decode()

    def click(self, x: int, y: int, button: int = 1):
        """Click at coordinates. button: 1=left, 2=middle, 3=right."""
        subprocess.run(
            ["xdotool", "mousemove", str(x), str(y)],
            env=self.env,
        )
        subprocess.run(
            ["xdotool", "click", str(button)],
            env=self.env,
        )

    def double_click(self, x: int, y: int):
        subprocess.run(
            ["xdotool", "mousemove", str(x), str(y)],
            env=self.env,
        )
        subprocess.run(
            ["xdotool", "click", "--repeat", "2", "--delay", "100", "1"],
            env=self.env,
        )

    def type_text(self, text: str, delay_ms: int = 50):
        subprocess.run(
            ["xdotool", "type", "--delay", str(delay_ms), text],
            env=self.env,
        )

    def press_key(self, key: str):
        """Press a key combination, e.g., 'ctrl+s', 'alt+Tab', 'Return'."""
        subprocess.run(
            ["xdotool", "key", key],
            env=self.env,
        )

    def scroll(self, x: int, y: int, direction: str, clicks: int = 3):
        subprocess.run(
            ["xdotool", "mousemove", str(x), str(y)],
            env=self.env,
        )
        button = "5" if direction == "down" else "4"
        subprocess.run(
            ["xdotool", "click", "--repeat", str(clicks), button],
            env=self.env,
        )
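The xdotool invocations above can also be factored into a pure function that builds the argument lists, so the command construction can be unit-tested without a display. This is a sketch under our own conventions — `build_xdotool_args` is a hypothetical helper, not part of xdotool:

```python
from typing import List


def build_xdotool_args(action: str, **kwargs) -> List[str]:
    """Build an xdotool argv list for the actions DesktopController uses."""
    if action == "mousemove":
        return ["xdotool", "mousemove", str(kwargs["x"]), str(kwargs["y"])]
    if action == "click":
        args = ["xdotool", "click"]
        # Multi-clicks (e.g., double-click) use --repeat with a small delay
        if kwargs.get("repeat", 1) > 1:
            args += ["--repeat", str(kwargs["repeat"]), "--delay", "100"]
        return args + [str(kwargs.get("button", 1))]
    if action == "type":
        return ["xdotool", "type", "--delay", str(kwargs.get("delay_ms", 50)), kwargs["text"]]
    if action == "key":
        return ["xdotool", "key", kwargs["key"]]
    raise ValueError(f"unknown action: {action}")
```

Each method in the controller can then pass the returned list straight to `subprocess.run`, and a test suite can assert on the argv lists directly.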
Application Switching
A desktop automation agent needs to switch between applications reliably. The agent uses visual identification combined with keyboard shortcuts:
import json


class AppSwitcher:
    def __init__(self, desktop: DesktopController, claude_client):
        self.desktop = desktop
        self.client = claude_client

    def switch_to_app(self, app_name: str) -> bool:
        """Switch to a running application by name."""
        # Try wmctrl first for reliable window activation
        result = subprocess.run(
            ["wmctrl", "-a", app_name],
            capture_output=True,
        )
        if result.returncode == 0:
            time.sleep(0.5)
            return True
        # Fall back to Alt+Tab with visual verification
        self.desktop.press_key("alt+Tab")
        time.sleep(0.5)
        screenshot_b64 = self.desktop.screenshot()
        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_b64,
                    }},
                    {"type": "text", "text": (
                        f"Is the application '{app_name}' currently in the foreground? "
                        'Return JSON: {"is_active": bool, "current_app": str}'
                    )},
                ],
            }],
        )
        verdict = json.loads(response.content[0].text)
        return verdict["is_active"]

    def launch_app(self, command: str):
        """Launch an application."""
        subprocess.Popen(
            command.split(),
            env={**self.desktop.env, "PATH": "/usr/bin:/usr/local/bin"},
        )
        time.sleep(3)  # Wait for the application to start
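Calling `json.loads` on the raw response text is brittle: models sometimes wrap JSON in markdown fences or surround it with prose. A small defensive parser helps; this is our own helper sketch, not an Anthropic API:

```python
import json
import re


def parse_json_response(text: str) -> dict:
    """Extract the first JSON object from a model response.

    Handles bare JSON, ```json fenced blocks, and JSON embedded in prose.
    """
    # Strip a markdown code fence if present
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Fall back to the outermost {...} span in the text
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"no JSON object found in: {text!r}")
    return json.loads(text[start:end + 1])
```

Swapping `json.loads(response.content[0].text)` for `parse_json_response(response.content[0].text)` makes every vision call in this article more tolerant of formatting drift.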
Menu Navigation
Native application menus require careful visual navigation — open the menu bar, find the right item, navigate submenus:
import anthropic
import json


class MenuNavigator:
    def __init__(self, desktop: DesktopController, claude_client):
        self.desktop = desktop
        self.client = claude_client

    def navigate_menu(self, menu_path: list[str]):
        """Navigate a menu hierarchy, e.g., ['File', 'Export', 'PDF']."""
        for i, item in enumerate(menu_path):
            screenshot_b64 = self.desktop.screenshot()
            prompt = f"Click on the menu item labeled '{item}'."
            if i == 0:
                prompt = f"Click on the '{item}' menu in the menu bar at the top of the application."
            response = self.client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=512,
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "image", "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot_b64,
                        }},
                        {"type": "text", "text": (
                            prompt + ' Return coordinates: {"x": int, "y": int}'
                        )},
                    ],
                }],
            )
            coords = json.loads(response.content[0].text)
            self.desktop.click(coords["x"], coords["y"])
            time.sleep(0.5)
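Model-returned coordinates occasionally fall outside the visible screen, and xdotool will happily click the nearest edge pixel. Clamping before clicking is a cheap safeguard; this is a sketch where the screen geometry is an assumed parameter, not something read from the display:

```python
def clamp_coords(x: int, y: int, width: int = 1280, height: int = 800) -> tuple[int, int]:
    """Clamp model-suggested click coordinates to the visible screen.

    width/height default to a common virtual-display size; pass your
    actual resolution (e.g., from `xdotool getdisplaygeometry`).
    """
    return (max(0, min(x, width - 1)), max(0, min(y, height - 1)))
```

Wrapping the click call as `self.desktop.click(*clamp_coords(coords["x"], coords["y"]))` keeps a hallucinated coordinate from sending the mouse off-screen.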
Multi-Application Workflow Example
Here is a complete workflow that extracts data from a web page, creates a spreadsheet, and saves it:
class CrossAppWorkflow:
    def __init__(self):
        self.desktop = DesktopController()
        self.client = anthropic.Anthropic()
        self.switcher = AppSwitcher(self.desktop, self.client)
        self.menu = MenuNavigator(self.desktop, self.client)

    def web_to_spreadsheet(self, url: str, output_path: str):
        """Extract a table from a web page and create a spreadsheet."""
        # Step 1: Open the browser and navigate to the URL
        self.switcher.launch_app(f"firefox {url}")
        time.sleep(3)
        # Step 2: Extract table data using Claude vision.
        # _extract_table (not shown) sends the screenshot to Claude and
        # asks for the table as a JSON list of row objects.
        screenshot_b64 = self.desktop.screenshot()
        table_data = self._extract_table(screenshot_b64)
        # Step 3: Open LibreOffice Calc
        self.switcher.launch_app("libreoffice --calc")
        time.sleep(4)
        # Step 4: Enter data into the spreadsheet
        self._populate_spreadsheet(table_data)
        # Step 5: Save the file
        self.desktop.press_key("ctrl+s")
        time.sleep(1)
        self.desktop.type_text(output_path)
        self.desktop.press_key("Return")

    def _populate_spreadsheet(self, data: list[dict]):
        """Type data into the active spreadsheet cell by cell."""
        if not data:
            return
        headers = list(data[0].keys())
        # Type headers across the first row
        for h in headers:
            self.desktop.type_text(h)
            self.desktop.press_key("Tab")
        self.desktop.press_key("Return")
        # Type data rows
        for row in data:
            for h in headers:
                self.desktop.type_text(str(row.get(h, "")))
                self.desktop.press_key("Tab")
            self.desktop.press_key("Return")
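The Tab/Return pattern in `_populate_spreadsheet` is easy to get subtly wrong, and it cannot be tested without LibreOffice running. Separating planning from execution makes it checkable: generate the action list first, then replay it through the controller. This is a sketch using our own `("type"/"key", value)` tuple convention:

```python
def spreadsheet_actions(data: list[dict]) -> list[tuple[str, str]]:
    """Return the ("type", text) / ("key", name) actions for entering a table."""
    if not data:
        return []
    headers = list(data[0].keys())
    actions: list[tuple[str, str]] = []
    # Header row: type each header, Tab between cells, Return at end of row
    for h in headers:
        actions += [("type", h), ("key", "Tab")]
    actions.append(("key", "Return"))
    # Data rows follow the same pattern
    for row in data:
        for h in headers:
            actions += [("type", str(row.get(h, ""))), ("key", "Tab")]
        actions.append(("key", "Return"))
    return actions
```

The executor is then a two-line loop dispatching to `type_text` or `press_key`, and the planner can be asserted against fixtures in CI.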
Handling File Dialogs
File save/open dialogs are a common challenge in desktop automation. Claude can navigate them visually:
def save_file_dialog(self, file_path: str):
    """Handle a file save dialog by navigating to the path and saving."""
    time.sleep(1)
    screenshot_b64 = self.desktop.screenshot()
    response = self.client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64,
                }},
                {"type": "text", "text": f"""A file save dialog is open.
I need to save to: {file_path}
Find the filename input field coordinates.
Return JSON: {{"filename_field": {{"x": int, "y": int}}, "save_button": {{"x": int, "y": int}}}}"""},
            ],
        }],
    )
    result = json.loads(response.content[0].text)
    self.desktop.click(result["filename_field"]["x"], result["filename_field"]["y"])
    self.desktop.press_key("ctrl+a")
    self.desktop.type_text(file_path)
    self.desktop.click(result["save_button"]["x"], result["save_button"]["y"])
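Before clicking, it is worth verifying that the response actually contains both coordinate pairs: a malformed answer should fail loudly rather than click at a bogus position. A hedged validation helper of our own:

```python
def validate_dialog_coords(result: dict) -> dict:
    """Ensure the save-dialog response has both coordinate pairs as ints."""
    for field in ("filename_field", "save_button"):
        point = result.get(field)
        if not isinstance(point, dict):
            raise ValueError(f"missing '{field}' in dialog response")
        for axis in ("x", "y"):
            if not isinstance(point.get(axis), int):
                raise ValueError(f"'{field}.{axis}' is not an integer")
    return result
```

Calling `result = validate_dialog_coords(json.loads(...))` turns a silent mis-click into an exception your retry logic can catch.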
FAQ
Does desktop automation work in headless servers?
Yes, but Claude needs a display server to take screenshots of. On headless Linux servers, use Xvfb (X virtual framebuffer) to create a virtual display. Anthropic's reference Docker image ships with Xvfb configured and running, which is the recommended approach for server-side desktop automation.
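A minimal headless setup looks like this; the display number and screen geometry are arbitrary choices, adjust to taste:

```shell
# Start a virtual display (screen 0, 1280x800, 24-bit color)
Xvfb :1 -screen 0 1280x800x24 &

# Point the automation tools at it
export DISPLAY=:1

# Sanity check: scrot can now capture without a physical monitor
scrot /tmp/probe.png
```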
How do I handle applications with different themes or skins?
Claude adapts to visual changes since it understands UI semantics, not pixel patterns. Whether a button is blue or gray, rounded or square, Claude recognizes it as a button. However, highly customized or non-standard UIs may need more specific instructions in the prompt.
What about automating Windows applications?
The same architecture works on Windows. Replace xdotool with pyautogui or PowerShell commands for input simulation, and use Windows screenshot APIs. The Claude API interaction remains identical since it only deals with screenshots and action descriptions.
#DesktopAutomation #ClaudeComputerUse #NativeApps #RPA #CrossAppWorkflow #AIAutomation #PythonDesktopAgent
CallSphere Team
Expert insights on AI voice agents and customer communication automation.