From 15% to 72.5% in 15 Months

Claude's ability to operate a computer the way a person does has improved sharply: Claude Sonnet 4.6 scores 72.5% on OSWorld, up from under 15% in late 2024. OSWorld is a benchmark that measures an AI's ability to complete real desktop tasks.

What OSWorld Tests

OSWorld evaluates whether an AI can:

Navigate complex spreadsheets
Complete web forms
Switch between applications
Follow multi-step instructions
Handle unexpected dialog boxes and errors

A score of 72.5% means Claude can successfully complete nearly three-quarters of these real-world desktop tasks — approaching the level of a competent human operator.
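To make the arithmetic behind such a score concrete, here is a minimal sketch that treats a benchmark score as the share of tasks completed successfully. The function name `success_rate` and the 40-task run are hypothetical illustrations, not OSWorld's actual evaluation harness or task count.

```python
def success_rate(results):
    """Return the percentage of tasks marked successful.

    `results` is a list of booleans, one per attempted task
    (a hypothetical format chosen for this illustration).
    """
    if not results:
        raise ValueError("no task results to score")
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return 100.0 * sum(results) / len(results)


# Hypothetical run: 29 successes out of 40 tasks.
results = [True] * 29 + [False] * 11
print(f"{success_rate(results):.1f}%")  # prints "72.5%"
```

Read this way, a 72.5% score still leaves roughly one task in four unfinished, which matters when judging how much supervision an automated workflow needs.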

How They Got Here

Two key factors drove the improvement:

Model training improvements: the 4.6 generation focused on spatial understanding and interaction patterns
Vercept acquisition: the desktop AI startup's team and technology now contribute directly to Claude's computer use capabilities

Comparison Across Models

Model or period	OSWorld score
Claude Sonnet 4.6	72.5%
Claude Opus 4.6	72.7%
Previous generation	~50%
Late 2024	<15%

Practical Implications

At this performance level, Claude can realistically automate routine desktop work: data entry, form filling, report generation, and application navigation. The gap between "demo impressive" and "production useful" has closed.

Source: Anthropic | NxCode | DataCamp | Natural 20

Claude's Computer Use Hits 72.5% on OSWorld — Approaching Human-Level Desktop Operation
