Computer Use in GPT-5.4: Building AI Agents That Navigate Desktop Applications
Technical guide to GPT-5.4's computer use capabilities for building AI agents that interact with desktop UIs, browser automation, and real-world application workflows.
Why Computer Use Matters for AI Agents
APIs are the ideal way for software to communicate, but the reality of enterprise environments is that many critical systems have no API at all. Legacy ERP systems, government portals, internal tools built on decade-old frameworks, and desktop applications like Excel, SAP GUI, and proprietary industry software — these are the systems where most enterprise work actually happens.
Computer use gives AI agents the ability to interact with any software the same way a human does: by looking at the screen, understanding UI elements, clicking buttons, typing text, and navigating menus. GPT-5.4's computer use capability builds on earlier research (including Anthropic's computer use and OpenAI's Operator) to deliver reliable, production-grade desktop interaction.
How GPT-5.4 Computer Use Works
The computer use protocol follows a perception-action loop. The agent receives a screenshot, reasons about what it sees, and emits one or more actions (clicks, keystrokes, scrolls). The host system executes these actions and sends back a new screenshot. This loop continues until the task is complete.
import base64
import io
import time

import openai
import pyautogui
from PIL import ImageGrab

client = openai.OpenAI()

def capture_screenshot() -> str:
    """Capture the current screen and return as base64."""
    screenshot = ImageGrab.grab()
    screenshot = screenshot.resize((1920, 1080))
    buffer = io.BytesIO()
    screenshot.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

def execute_action(action: dict):
    """Execute a computer use action on the local machine."""
    action_type = action["type"]
    if action_type == "click":
        pyautogui.click(action["x"], action["y"])
    elif action_type == "double_click":
        pyautogui.doubleClick(action["x"], action["y"])
    elif action_type == "type":
        pyautogui.typewrite(action["text"], interval=0.02)
    elif action_type == "key":
        pyautogui.press(action["key"])
    elif action_type == "hotkey":
        pyautogui.hotkey(*action["keys"])
    elif action_type == "scroll":
        pyautogui.scroll(action["amount"], action["x"], action["y"])
    elif action_type == "move":
        pyautogui.moveTo(action["x"], action["y"])
    time.sleep(0.5)  # Wait for the UI to update
def computer_use_loop(task: str, max_steps: int = 20) -> str:
    """Run a computer use agent loop."""
    messages = [
        {
            "role": "system",
            "content": """You are an AI agent that controls a computer.
You receive screenshots and emit actions to accomplish tasks.

Available actions:
- click(x, y): Left click at coordinates
- double_click(x, y): Double click at coordinates
- type(text): Type text at current cursor position
- key(key): Press a key (enter, tab, escape, etc.)
- hotkey(keys): Press key combination (e.g., ctrl+c)
- scroll(amount, x, y): Scroll at position (positive=up)

Always describe what you see and your reasoning before acting.
When the task is complete, respond with DONE: followed by a
summary of what you accomplished."""
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"Task: {task}"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{capture_screenshot()}"
                    }
                }
            ]
        }
    ]
    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-5.4",
            messages=messages,
            tools=[{
                "type": "computer_use",
                "display_width": 1920,
                "display_height": 1080
            }],
            max_tokens=1024
        )
        choice = response.choices[0]
        messages.append(choice.message)

        # Check if the task is complete
        if choice.message.content and "DONE:" in choice.message.content:
            return choice.message.content

        # Execute computer actions
        if hasattr(choice.message, "computer_actions"):
            for action in choice.message.computer_actions:
                execute_action(action)

        # Capture a new screenshot after the actions
        new_screenshot = capture_screenshot()
        messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": "Screenshot after actions:"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{new_screenshot}"
                    }
                }
            ]
        })
    return "Task did not complete within the maximum number of steps."
Browser Automation with Computer Use
One of the most practical applications of computer use is browser automation. While tools like Playwright and Selenium work well for structured web pages, they can struggle with highly dynamic SPAs, pages with anti-bot measures, and applications behind authentication flows that resist programmatic access. Computer use sidesteps these issues because it interacts with the rendered page exactly as a human would.
import subprocess
import time

class BrowserAgent:
    def __init__(self):
        self.browser_process = None

    def launch_browser(self, url: str):
        """Launch Chrome and navigate to the URL."""
        self.browser_process = subprocess.Popen([
            "google-chrome",
            "--window-size=1920,1080",
            "--window-position=0,0",
            url
        ])
        time.sleep(3)  # Wait for the page to load

    def automate_task(self, task: str) -> str:
        """Use GPT-5.4 computer use to automate a browser task."""
        return computer_use_loop(task)

# Example: fill out a complex multi-step form
agent = BrowserAgent()
agent.launch_browser("https://internal-portal.company.com/onboarding")
result = agent.automate_task("""
Complete the new employee onboarding form:
1. Fill in Name: John Smith
2. Fill in Department: Engineering
3. Select Start Date: April 1, 2026
4. Upload the resume (file is on the Desktop named resume.pdf)
5. Check the "I agree to terms" checkbox
6. Click Submit
""")
print(result)
Handling Dynamic UIs and Wait States
Real-world UIs are not static. Pages load asynchronously, modals appear and disappear, and buttons may be disabled until certain conditions are met. A robust computer use agent needs to handle these states gracefully.
def wait_for_element(
    description: str,
    timeout: int = 10,
    check_interval: float = 1.0
) -> bool:
    """Wait for a UI element to appear on screen."""
    start_time = time.time()
    while time.time() - start_time < timeout:
        screenshot_b64 = capture_screenshot()
        response = client.chat.completions.create(
            model="gpt-5.4-mini",  # Use mini for fast checks
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": f"Is this element visible on screen: "
                                    f"'{description}'? Reply YES or NO only."
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/png;base64,{screenshot_b64}"
                            }
                        }
                    ]
                }
            ],
            max_tokens=5
        )
        if "yes" in response.choices[0].message.content.lower():
            return True
        time.sleep(check_interval)
    return False

# Usage in an agent workflow
def fill_form_with_waits(data: dict):
    """Fill a form that loads dynamically."""
    # Wait for the form to load
    if not wait_for_element("Name input field"):
        raise TimeoutError("Form did not load within timeout")

    # Fill each field
    for field_name, value in data.items():
        # Click the field
        computer_use_loop(f"Click on the '{field_name}' input field")
        # Type the value
        pyautogui.hotkey('ctrl', 'a')  # Select all existing text
        pyautogui.typewrite(value, interval=0.02)
        # Wait for any validation
        time.sleep(0.5)

    # Wait for the submit button to be enabled
    if wait_for_element("enabled Submit button"):
        computer_use_loop("Click the Submit button")
Desktop Application Automation
Beyond browsers, computer use enables automation of desktop applications. This is transformative for enterprises that rely on applications like SAP, Oracle, or industry-specific software that predates modern APIs.
import os

class DesktopAppAgent:
    """Agent that automates desktop application workflows."""

    def __init__(self, app_name: str):
        self.app_name = app_name
        self.context = []

    def launch_app(self):
        """Launch the target application."""
        import subprocess
        subprocess.Popen([self.app_name])
        time.sleep(5)  # Wait for the app to load

    def execute_workflow(self, steps: list[str]) -> list[str]:
        """Execute a multi-step workflow in the desktop app."""
        os.makedirs("audit", exist_ok=True)  # Ensure the audit-trail folder exists
        results = []
        for i, step in enumerate(steps):
            print(f"Step {i+1}/{len(steps)}: {step}")
            result = computer_use_loop(
                f"In the {self.app_name} application, {step}. "
                f"Previous steps completed: {results}"
            )
            results.append(result)
            # Screenshot for the audit trail
            screenshot = ImageGrab.grab()
            screenshot.save(f"audit/step_{i+1}.png")
        return results

# Example: automate a report generation workflow in Excel
excel_agent = DesktopAppAgent("excel")
excel_agent.launch_app()
results = excel_agent.execute_workflow([
    "Open the file Q1_Sales_Report.xlsx from the Documents folder",
    "Select the data range A1:F50 in the Sales sheet",
    "Create a pivot table summarizing total sales by region",
    "Generate a bar chart from the pivot table data",
    "Save the chart as a PNG image on the Desktop",
    "Save and close the workbook"
])
Building Reliable Computer Use Agents
Error Recovery
Computer use agents must handle UI errors gracefully — unexpected dialogs, permission prompts, and application crashes. Build error recovery into your agent loop:
def resilient_computer_use(task: str, max_retries: int = 3) -> str:
    """Computer use loop with error recovery."""
    for attempt in range(max_retries):
        try:
            result = computer_use_loop(task, max_steps=20)
            if "DONE:" in result:
                return result

            # Task did not complete — check for error states
            screenshot_b64 = capture_screenshot()
            error_check = client.chat.completions.create(
                model="gpt-5.4-mini",
                messages=[{
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Is there an error dialog, warning, or "
                                    "unexpected popup visible? If yes, describe "
                                    "it. If no, say CLEAR."
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/png;base64,{screenshot_b64}"
                            }
                        }
                    ]
                }],
                max_tokens=200
            )
            error_desc = error_check.choices[0].message.content
            if "CLEAR" not in error_desc:
                # Dismiss the error and retry
                computer_use_loop(
                    f"There is an error on screen: {error_desc}. "
                    f"Dismiss it and try again: {task}"
                )
        except Exception as e:
            print(f"Attempt {attempt+1} failed: {e}")
            time.sleep(2)
    return "Task failed after maximum retries."
Coordinate Calibration
A common pitfall with computer use is coordinate drift — the model's predicted click coordinates do not match the actual UI layout due to display scaling, window positioning, or resolution differences. Always ensure your screenshot resolution matches your action coordinate space.
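One way to guard against drift is to rescale the model's coordinates explicitly before dispatching them. The helper below is a minimal sketch (calibrate_coordinates is a hypothetical function, not part of any SDK), assuming the model emits coordinates in the space of the screenshot it was shown:

```python
def calibrate_coordinates(
    model_x: int,
    model_y: int,
    screenshot_size: tuple[int, int],
    screen_size: tuple[int, int],
) -> tuple[int, int]:
    """Map coordinates from screenshot space to physical screen space.

    If screenshots were resized before being sent to the model (or the OS
    applies display scaling), the model's coordinates must be rescaled
    before being passed to pyautogui.
    """
    scale_x = screen_size[0] / screenshot_size[0]
    scale_y = screen_size[1] / screenshot_size[1]
    return round(model_x * scale_x), round(model_y * scale_y)

# Example: screenshots downscaled to 1280x720, actual display is 2560x1440
x, y = calibrate_coordinates(640, 360, (1280, 720), (2560, 1440))
# (1280, 720) — the centre of the screen in both coordinate spaces
```

Applying this in execute_action before every click keeps the screenshot resolution and the action coordinate space decoupled, so you can change one without silently breaking the other.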
Safety Boundaries
Computer use agents have access to the entire desktop, which creates significant security risks. Implement these safeguards:
- Restrict to specific applications: Only allow the agent to interact with designated application windows
- Block sensitive areas: Define screen regions that are off-limits (e.g., the system tray, admin panels)
- Audit all actions: Log every click, keystroke, and screenshot for review
- Human confirmation for destructive actions: Require human approval before the agent clicks "Delete," "Submit Payment," or similar irreversible buttons
BLOCKED_REGIONS = [
    (0, 1050, 1920, 1080),  # Taskbar
    (1800, 0, 1920, 40),    # System tray
]

DESTRUCTIVE_KEYWORDS = [
    "delete", "remove", "submit payment",
    "confirm purchase", "send email"
]

def safe_execute_action(action: dict, context: str = ""):
    """Execute an action with safety checks."""
    # Check blocked regions
    if action["type"] in ("click", "double_click"):
        x, y = action["x"], action["y"]
        for rx1, ry1, rx2, ry2 in BLOCKED_REGIONS:
            if rx1 <= x <= rx2 and ry1 <= y <= ry2:
                raise PermissionError(
                    f"Action blocked: click at ({x},{y}) is in a restricted region"
                )

    # Check for destructive actions
    context_lower = context.lower()
    for keyword in DESTRUCTIVE_KEYWORDS:
        if keyword in context_lower:
            approval = input(
                f"Agent wants to perform: {context}. Approve? (y/n): "
            )
            if approval.lower() != 'y':
                raise PermissionError("Action rejected by human operator")

    execute_action(action)
Performance Optimization
Computer use is inherently slower than API calls because each step requires a screenshot capture, a vision model inference, and a UI interaction. Here are strategies to minimize latency:
Batch actions: When possible, emit multiple actions in a single model call. GPT-5.4 can plan a sequence like "click field, type text, press tab, type next field" in one turn.
Reduce screenshot resolution: Downscale screenshots to 1280x720 or even 960x540 for simpler UIs. This reduces token usage significantly while preserving enough detail for accurate interactions.
Use Mini for visual checks: Use GPT-5.4 mini for simple visual confirmations ("is the dialog gone?") and reserve GPT-5.4 for complex reasoning about what to do next.
Cache UI layouts: If the application's layout does not change between runs, cache the coordinates of common elements and skip the visual recognition step for known interactions.
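Two of these strategies can be sketched together: a screenshot helper with a configurable resolution, and a coordinate cache consulted before falling back to the full vision loop. This is an illustrative sketch, not a prescribed API — UI_CACHE, scaled_size, and click_cached are hypothetical names:

```python
import base64
import io

def scaled_size(original: tuple[int, int], target_width: int) -> tuple[int, int]:
    """Compute a downscaled resolution that preserves aspect ratio."""
    w, h = original
    return target_width, round(h * target_width / w)

def capture_screenshot_scaled(target_width: int = 1280) -> str:
    """Capture the screen downscaled to reduce image-token usage."""
    from PIL import ImageGrab  # imported lazily; requires a display
    screenshot = ImageGrab.grab()
    screenshot = screenshot.resize(scaled_size(screenshot.size, target_width))
    buffer = io.BytesIO()
    screenshot.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

# Hypothetical coordinate cache for stable layouts, keyed by
# (application, element) pairs verified on a previous successful run.
UI_CACHE: dict[tuple[str, str], tuple[int, int]] = {
    ("excel", "Save button"): (34, 12),
}

def click_cached(app: str, element: str) -> bool:
    """Click a known element from the cache if possible.

    Returns False on a cache miss so the caller can fall back to the
    full screenshot-plus-model vision loop.
    """
    coords = UI_CACHE.get((app, element))
    if coords is None:
        return False
    import pyautogui  # imported lazily so cache misses stay headless-safe
    pyautogui.click(*coords)  # skips screenshot capture and model inference
    return True
```

Note that a coordinate cache is only safe when the application's layout is genuinely stable; invalidate it whenever the app version, window size, or display scaling changes.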
FAQ
How accurate is GPT-5.4's click targeting?
In controlled benchmarks, GPT-5.4 achieves approximately 94% accuracy on click targeting for standard UI elements (buttons, text fields, checkboxes) at 1920x1080 resolution. Accuracy drops for very small elements (under 20px) and dense UIs with many overlapping interactive regions. Implementing a retry mechanism with slightly offset coordinates handles most misclicks.
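A retry of that shape can be sketched as follows. This is a hypothetical helper, not part of any SDK; the click and verify callables are injected so it can wrap pyautogui.click and a wait_for_element-style check:

```python
from typing import Callable

def click_with_retry(
    x: int,
    y: int,
    click: Callable[[int, int], None],
    verify: Callable[[], bool],
    offsets: tuple[tuple[int, int], ...] = ((0, 0), (-8, 0), (8, 0), (0, -8), (0, 8)),
) -> bool:
    """Click at (x, y), then at slightly offset points, until verify()
    confirms the click landed (e.g. a dialog opened or a field focused)."""
    for dx, dy in offsets:
        click(x + dx, y + dy)
        if verify():
            return True
    return False
```

The offset magnitude (8px here) is an assumption; tune it to roughly half the size of the smallest elements your agent targets.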
Can computer use work with remote desktop sessions like RDP or VNC?
Yes. Computer use works with any visual display, including remote desktop sessions. The agent receives screenshots from the remote session and emits actions that are translated into RDP/VNC input events. This is actually a common deployment pattern because it provides natural isolation — the agent operates in a remote VM that can be restricted and monitored.
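The translation layer can be sketched as a small dispatcher over whatever input interface your VNC/RDP client library exposes. The RemoteInput protocol below is hypothetical — adapt the method names to your actual client:

```python
from typing import Protocol

class RemoteInput(Protocol):
    """Minimal interface a VNC/RDP client wrapper might expose
    (hypothetical — map these onto your client library's calls)."""
    def move(self, x: int, y: int) -> None: ...
    def press(self, button: int) -> None: ...
    def key(self, name: str) -> None: ...

def dispatch_remote(action: dict, client: RemoteInput):
    """Translate a computer-use action dict into remote input events."""
    if action["type"] == "click":
        client.move(action["x"], action["y"])
        client.press(1)  # left mouse button
    elif action["type"] == "type":
        for ch in action["text"]:  # naive: one key event per character
            client.key(ch)
    elif action["type"] == "key":
        client.key(action["key"])
```

Because the agent only ever sees screenshots of the remote framebuffer and emits abstract actions, the same loop works unchanged whether the display is local or remote.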
How does GPT-5.4 computer use compare to Anthropic's Claude computer use?
Both achieve similar accuracy on standard benchmarks. GPT-5.4 has an edge in handling Windows desktop applications and Microsoft Office, likely due to training data composition. Claude's computer use tends to perform better on web-based applications and Linux environments. The choice often depends on which applications your agent needs to automate.
What is the token cost of a typical computer use session?
A typical 10-step computer use session consumes approximately 50K-80K tokens — primarily from the screenshot images, which are the most token-intensive part. At GPT-5.4 pricing, a 10-step session costs roughly $0.30-0.50. For high-volume automation, consider whether a traditional scripting approach (Selenium, AutoHotKey) can handle the specific workflow at lower cost, reserving computer use for the tasks that truly require visual understanding.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.