Claude's Computer Use Hits 72.5% on OSWorld — Approaching Human-Level Desktop Operation
Claude Sonnet 4.6 scores 72.5% on the OSWorld benchmark for desktop computer operation, up from under 15% in late 2024, nearly matching human performance.
From Under 15% to 72.5% in 15 Months
Claude's ability to operate a computer the way a person does has improved sharply: Sonnet 4.6 scores 72.5% on OSWorld, up from under 15% in late 2024. OSWorld measures whether an AI can complete real tasks in a desktop environment.
What OSWorld Tests
OSWorld evaluates whether an AI can:
- Navigate complex spreadsheets
- Complete web forms
- Switch between applications
- Follow multi-step instructions
- Handle unexpected dialog boxes and errors
A score of 72.5% means Claude completes nearly three-quarters of these desktop tasks, approaching the level of a competent human operator.
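To make the headline figure concrete, here is a minimal sketch of how a success rate of this kind can be computed. It assumes each task is scored as a simple pass or fail. The `TaskResult` type, the `success_rate` helper, and the sample counts are hypothetical illustrations, not OSWorld's actual harness or data.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one desktop task (hypothetical structure, not OSWorld's)."""
    task_id: str
    passed: bool


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks that passed, rounded to one decimal."""
    if not results:
        raise ValueError("no task results to score")
    passed = sum(1 for r in results if r.passed)
    return round(100 * passed / len(results), 1)


# Made-up example: 29 passes out of 40 tasks.
sample = [TaskResult(f"task-{i}", passed=i < 29) for i in range(40)]
print(success_rate(sample))  # 72.5
```

Under this pass/fail reading, a 72.5% score also means that a little over one task in four was not completed.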
How They Got Here
Two key factors drove the improvement:
- Model training improvements in the 4.6 generation focused on spatial understanding and interaction patterns
- The acquisition of Vercept, a desktop AI startup whose team and technology now contribute directly to Claude's computer use capabilities
Comparison Across Models
| Model | OSWorld Score |
|---|---|
| Claude Sonnet 4.6 | 72.5% |
| Claude Opus 4.6 | 72.7% |
| Previous generation | ~50% |
| Late 2024 | <15% |
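The size of the jump in the table can be expressed either in percentage points or as a multiple. The sketch below does that arithmetic. It treats the "<15%" and "~50%" entries as exactly 15.0 and 50.0, which is an assumption: the late-2024 comparison is therefore a lower bound and the previous-generation comparison is approximate. The helper names are illustrative.

```python
# Scores taken from the table above. "<15%" and "~50%" are approximated
# as 15.0 and 50.0 (an assumption made only for this arithmetic).
SONNET_46 = 72.5
OPUS_46 = 72.7
PREVIOUS_GEN = 50.0
LATE_2024 = 15.0


def point_gain(new: float, old: float) -> float:
    """Improvement in percentage points, rounded to one decimal."""
    return round(new - old, 1)


def fold_gain(new: float, old: float) -> float:
    """Improvement as a multiple of the old score, rounded to two decimals."""
    return round(new / old, 2)


print(point_gain(SONNET_46, LATE_2024))     # 57.5 (at least, since the old score was under 15%)
print(fold_gain(SONNET_46, LATE_2024))      # 4.83 (at least)
print(point_gain(SONNET_46, PREVIOUS_GEN))  # 22.5 (approximate)
print(point_gain(OPUS_46, SONNET_46))       # 0.2
```

Read this way, the gain since late 2024 is at least 57.5 points, while the two 4.6 models are separated by only 0.2 points.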
Practical Implications
At this performance level, Claude can realistically automate routine desktop work such as data entry, form filling, report generation, and application navigation. The gap between "demo impressive" and "production useful" has largely closed, though roughly one in four benchmark tasks still fails.
Source: Anthropic | NxCode | DataCamp | Natural 20