Computer Use Leaderboard: Desktop AI Agent Rankings
Rankings of the best AI models and agent frameworks on computer use benchmarks - OSWorld, OSWorld-Verified, and ScreenSpot-Pro - updated March 2026.

Autonomous computer use - where an AI agent sees your screen and controls mouse and keyboard to complete tasks on your behalf - has gone from a party trick to a competitive frontier in about 18 months. Every major lab now ships some version of it: Anthropic's Computer Use API, OpenAI's Operator and now native GPT-5.4 computer control, Google's Gemini Agent, plus a growing set of open-source frameworks built on Qwen and UI-TARS. The marketing claims are loud and often untestable. The benchmarks, imperfect as they are, give us something to argue about with actual numbers.
This leaderboard covers three benchmarks: OSWorld (the most widely adopted, with independent evaluation), OSWorld-Verified (a stricter variant where companies self-submit scores), and ScreenSpot-Pro (which tests whether a model can even find the right pixel to click on a complex professional screen). Together they reveal where the frontier actually sits.
The headline worth knowing: on March 5, 2026, OpenAI released GPT-5.4 with a self-reported 75.0% on OSWorld-Verified - the first time any model has claimed to beat the human baseline of 72.4% on that benchmark. Whether that number holds up under independent review is a different question, and one I'll get into.
TL;DR
- On independently assessed OSWorld, Claude Opus 4.6 leads at 72.7% with Claude Sonnet 4.6 just behind at 72.5%
- GPT-5.4 claims 75.0% on OSWorld-Verified (March 2026), surpassing the 72.4% human baseline - but that score is self-reported and awaiting independent verification
- Qwen3 VL 235B is the strongest open-source model at 66.7%, a meaningful gap below the Claude frontier
- GUI grounding (finding the right pixel to click) remains a bottleneck: even top models score under 70% on ScreenSpot-Pro in professional software environments
The Benchmarks
OSWorld
OSWorld launched in 2024 from researchers at the XLANG Lab and has become the standard for computer use evaluation. It runs 369 tasks across Ubuntu, Windows, and macOS in real operating system environments - not sandboxes or simulations. Agents receive natural language instructions and interact through screenshots plus keyboard/mouse commands, the same interface a human would use. Scores are assessed by the research team under controlled settings.
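To make that interface concrete, here's a minimal sketch of the loop an OSWorld-style agent runs: screenshot in, action out, repeat until the model declares the task done or the step budget runs out. The helper functions (capture_screen, query_model, execute_action) are hypothetical stand-ins, not part of any benchmark harness or vendor SDK.

```python
import time

def run_task(instruction: str, max_steps: int = 50) -> bool:
    """Minimal screenshot-in, action-out loop for an OSWorld-style agent (illustrative only)."""
    history = []
    for _ in range(max_steps):
        screenshot = capture_screen()   # hypothetical: grab the current screen as image bytes
        action = query_model(           # hypothetical: ask a vision-language model for the next action
            instruction=instruction,
            screenshot=screenshot,
            history=history,
        )
        if action["type"] == "done":
            return True                 # the model reports the task as complete
        execute_action(action)          # hypothetical: a click at (x, y), typed text, etc. via OS input APIs
        history.append(action)
        time.sleep(0.5)                 # give the UI time to settle before the next screenshot
    return False                        # step budget exhausted without completion
```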
When the benchmark launched, GPT-4 with computer use scored around 12%. The human baseline sits at 72.4%. The gap has closed fast.
Importantly, OSWorld results on the public leaderboard are assessed by the xlang.ai team, not by the model providers themselves. That distinction matters when comparing to OSWorld-Verified below.
OSWorld-Verified
OSWorld-Verified is an enhanced version of the original benchmark with improved task quality and additional evaluation dimensions. Companies and researchers self-submit scores - the xlang.ai team hasn't yet independently verified most of the newer entries. When you see a provider say their model scored X on OSWorld-Verified, that number comes from their own evaluation. Some entries are verified; most recent submissions are not.
ScreenSpot-Pro
ScreenSpot-Pro tests a specific sub-skill: GUI grounding, or knowing exactly where on the screen to click. It uses 1,581 expert-annotated screenshots from 23 professional applications - think MATLAB, Photoshop, Visual Studio Code - where target UI elements are often tiny and buried in complex layouts. A model that can answer questions correctly but clicks the wrong button is useless for real automation. ScreenSpot-Pro exists to expose that gap.
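A grounding benchmark like this reduces to one check per example: the model predicts a click point, and the prediction counts as a hit only if it falls inside the annotated bounding box of the target element. Here is a minimal sketch of that scoring logic; the example dict layout is illustrative, not the actual ScreenSpot-Pro data schema.

```python
from typing import Iterable

def point_in_box(x: float, y: float, box: tuple[float, float, float, float]) -> bool:
    """box is (left, top, right, bottom) in pixel coordinates."""
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(examples: Iterable[dict], predict) -> float:
    """examples: dicts with 'screenshot', 'instruction', 'bbox' (illustrative schema).
    predict: callable (screenshot, instruction) -> (x, y) predicted click point."""
    hits, total = 0, 0
    for ex in examples:
        x, y = predict(ex["screenshot"], ex["instruction"])
        hits += point_in_box(x, y, ex["bbox"])
        total += 1
    return hits / total if total else 0.0
```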
OSWorld Rankings
The table below covers the top models on the standard OSWorld leaderboard as evaluated by the xlang.ai team. Scores represent overall task success rate across all 369 tasks.
| Rank | Model | Provider | OSWorld Score |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 72.7% |
| 2 | Claude Sonnet 4.6 | Anthropic | 72.5% |
| 3 | Qwen3 VL 235B A22B Instruct | Alibaba / Qwen | 66.7% |
| 4 | Claude Opus 4.5 | Anthropic | 66.3% |
| 5 | Claude Sonnet 4.5 | Anthropic | 61.4% |
| 6 | Claude Haiku 4.5 | Anthropic | 50.7% |
| 7 | Qwen3 VL 32B Thinking | Alibaba / Qwen | 41.0% |
| 8 | Qwen3 VL 235B A22B Thinking | Alibaba / Qwen | 38.1% |
| 9 | Qwen3 VL 8B Instruct | Alibaba / Qwen | 33.9% |
| 10 | Qwen3 VL 32B Instruct | Alibaba / Qwen | 32.6% |
| 11 | Qwen2.5 VL 72B Instruct | Alibaba / Qwen | 8.8% |
Human baseline: 72.4%
Anthropic takes five of the top six positions, including the top two. At the frontier it isn't close: Qwen3 VL 235B at rank three is 6 points behind Claude Sonnet 4.6 and Claude Opus 4.6, and the gap between Anthropic's current 4.6 generation and everything else on this benchmark is real. (See our Claude Opus 4.6 review for hands-on testing beyond the benchmark numbers.)
One result worth flagging: Thinking mode is not a reliable win for the Qwen3 VL models on OSWorld. At 32B, Thinking (41.0%) beats Instruct (32.6%), but at 235B the pattern flips hard - Qwen3 VL 235B Thinking (38.1%) falls far behind Qwen3 VL 235B Instruct (66.7%). Extended reasoning helps some computer use tasks and hurts others, especially time-sensitive GUI interactions where overthinking slows execution.
OSWorld tests agents on real operating systems across Ubuntu, Windows, and macOS, requiring actual keyboard and mouse control rather than API calls.
Source: os-world.github.io
OSWorld-Verified: The Self-Reported Claims
OSWorld-Verified uses a stricter evaluation setup. The table below reflects self-reported numbers - providers run evaluations internally and submit results. The xlang.ai team has not yet independently verified all of the entries listed here.
| Rank | Model | Provider | OSWorld-Verified Score | Verified? |
|---|---|---|---|---|
| 1 | GPT-5.4 | OpenAI | 75.0% | Self-reported |
| 2 | GPT-5.3 Codex | OpenAI | 64.7% | Self-reported |
| 3 | Qwen3.5-122B-A10B | Alibaba / Qwen | 58.0% | Self-reported |
| 4 | Qwen3.5-27B | Alibaba / Qwen | 56.2% | Self-reported |
| 5 | Qwen3.5-35B-A3B | Alibaba / Qwen | 54.5% | Self-reported |
| 6 | UiPath Screen Agent (Claude Opus 4.5) | UiPath / Anthropic | 53.6% | Verified |
Human baseline: 72.4%
GPT-5.4's 75.0% claim is the number everyone is quoting. Take it seriously but not as settled fact. OpenAI's own release announcement stated the figure explicitly and noted it topped the 72.4% human baseline, which would make GPT-5.4 the first model to cross that threshold on OSWorld-Verified. But the xlang.ai team hasn't yet published their own independent evaluation of GPT-5.4.
For contrast: UiPath's Screen Agent (powered by Claude Opus 4.5) submitted to OSWorld-Verified in January 2026 and scored 53.6%, verified by the research team. That was the #1 position at the time.
Self-reported benchmarks aren't lies - they're just unaudited. The history of AI benchmarks is full of numbers that looked different once an independent team ran the eval.
The UiPath result shows something more useful than a raw foundation model score: what happens when a purpose-built enterprise automation framework sits on top of a strong base model. Their agentic scaffolding - task planning, error recovery, verification loops - adds meaningful points beyond what Claude Opus 4.5 achieves alone. For anyone building on top of these APIs, the infrastructure layer matters almost as much as the underlying model.
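UiPath's actual implementation isn't public, but the general shape of that scaffolding is easy to sketch: a planner produces sub-tasks, a base computer-use agent executes each one, and a checker verifies the post-condition before moving on, retrying on failure. The function names below are hypothetical placeholders, not UiPath or Anthropic APIs.

```python
def run_with_verification(subtasks, execute_subtask, verify, max_retries=2):
    """Wrap a base computer-use agent with plan-level verification and retry.
    execute_subtask(subtask) -> None : drives the base model through one sub-task (hypothetical)
    verify(subtask) -> bool          : checks the sub-task's post-condition, e.g. by
                                       re-screenshotting and inspecting application state
    """
    for subtask in subtasks:                      # task planning produced this list upstream
        for attempt in range(1 + max_retries):
            execute_subtask(subtask)
            if verify(subtask):
                break                             # post-condition met, move to the next sub-task
            # error recovery: the simplest form is retry; a real system might also
            # reset application state or re-plan the remaining sub-tasks
        else:
            raise RuntimeError(f"Sub-task failed after {max_retries + 1} attempts: {subtask}")
```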
GUI Grounding: ScreenSpot-Pro
Passing OSWorld requires completing full tasks. ScreenSpot-Pro tests a lower-level precondition: can the model correctly identify where on a complex professional screen the target element lives? If your agent can't reliably locate the right button in Photoshop or AutoCAD, the task completion score is irrelevant.
| Rank | Model | Size | ScreenSpot-Pro Score |
|---|---|---|---|
| 1 | MAI-UI | 32B | 67.9% |
| 2 | MAI-UI | 8B | 65.7% |
| 3 | MAI-UI | 2B | 57.4% |
| 4 | OS-Atlas + ScreenSeekeR | 7B | 48.1% |
| 5 | OS-Atlas | 7B | 18.9% |
| 6 | UGround | 7B | 16.5% |
| 7 | AriaUI | - | 11.3% |
| 8 | Qwen2-VL | 7B | <2% |
| 9 | GPT-4o | - | <2% |
MAI-UI, developed by Alibaba's Tongyi team, dominates all three size classes. Its 32B model at 67.9% is nearly 20 points ahead of the next best non-MAI entry (OS-Atlas boosted by ScreenSeekeR at 48.1%) and nearly 50 points ahead of the next best standalone model (OS-Atlas at 18.9%). The gap is stark.
The ScreenSeekeR method (proposed in the ScreenSpot-Pro paper) shows how much a smart visual search strategy can help: it boosts OS-Atlas-7B from 18.9% to 48.1% without any retraining. The approach uses a planner to guide a cascaded search, zooming into likely regions before making a final click prediction. It's a reminder that prompt engineering and scaffolding can rescue a model that would otherwise be useless for professional GUI work.
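The cascaded idea is simple to sketch: ground coarsely on the full screenshot, crop a window around the prediction, re-ground on the crop, and map the refined point back to full-screen coordinates. The code below is a generic coarse-to-fine loop under an assumed ground(image, instruction) model call - a sketch of the idea, not the ScreenSeekeR implementation itself.

```python
from PIL import Image

def coarse_to_fine_click(screenshot: Image.Image, instruction: str, ground, rounds: int = 2):
    """ground(image, instruction) -> (x, y) click point in that image's coordinates (hypothetical model call).
    Returns a click point in full-screenshot coordinates."""
    region = (0, 0, screenshot.width, screenshot.height)   # (left, top, right, bottom)
    for _ in range(rounds):
        crop = screenshot.crop(region)
        x, y = ground(crop, instruction)                    # prediction in crop coordinates
        fx, fy = region[0] + x, region[1] + y               # map back to full-image coordinates
        # zoom in: the next round only looks at a window around the current prediction,
        # so the target occupies a larger fraction of what the model sees
        half_w = max((region[2] - region[0]) // 4, 50)
        half_h = max((region[3] - region[1]) // 4, 50)
        region = (
            max(0, int(fx - half_w)), max(0, int(fy - half_h)),
            min(screenshot.width, int(fx + half_w)), min(screenshot.height, int(fy + half_h)),
        )
    return fx, fy
```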
GPT-4o and Qwen2-VL scoring below 2% on ScreenSpot-Pro is not a failure of general intelligence - those models handle complex reasoning tasks. It's a failure of specialized visual grounding on dense, cluttered UIs that look nothing like typical training distributions. Professional software interfaces are genuinely different from consumer apps.
ScreenSpot-Pro uses 23 professional applications like Photoshop and MATLAB. Small target elements in dense UIs defeat even models that perform well on consumer app tasks.
Source: arxiv.org
What the Numbers Actually Mean
Anthropic's Lead Is Real, With Caveats
On the independently evaluated OSWorld leaderboard, Anthropic holds five of the top six positions, including the top two. That's reproducible - the xlang.ai team runs their evaluations under controlled conditions, not lab-specific settings. For anyone using computer use APIs in production today, Claude Sonnet 4.6 at 72.5% is the practical reference point. Sonnet costs markedly less than Opus and lands virtually the same score on OSWorld; use Opus only if you have a specific reason to.
The Anthropic story extends beyond raw model performance. The Vercept acquisition - Anthropic's purchase of a computer vision startup focused on GUI understanding - signals they're building the training infrastructure to stay ahead on this benchmark specifically. A model that scores well on OSWorld because it was trained on computer use data is a different product than one that happens to be a good reasoner that also works on computers.
GPT-5.4 Could Change the Ranking
GPT-5.4 released on March 5, 2026 with native computer use built directly into the model rather than layered on top via scaffolding. Read our GPT-5.4 review for a hands-on look at how that plays out in practice.
If the 75.0% OSWorld-Verified claim holds under independent evaluation, it represents a real step - not just past the human baseline but past Claude Opus 4.6's independently-verified 72.7%. The score is plausible. GPT-5.4 jumped from GPT-5.2's 47.3% on OSWorld-Verified, a nearly 28-point improvement that tracks with what happens when a company redesigns a model specifically for computer use rather than treating it as an add-on. OpenAI's internal evaluation methodology is not public, so until xlang.ai runs their own evaluation of GPT-5.4, the 75.0% sits in a different category than the independently-verified Anthropic scores.
Open Source Is Not Yet Competitive at the Top
Qwen3 VL 235B at 66.7% is impressive for an open model. It's also 6 points behind Claude Sonnet 4.6 and essentially tied with the previous-generation Claude Opus 4.5 rather than the current 4.6 frontier. For teams that need on-premises deployment or can't route through Anthropic's API, Qwen3 VL 235B is the current answer, but you accept a meaningful accuracy penalty on difficult tasks.
UI-TARS-2, while not on the standard OSWorld leaderboard yet, scores 47.5% on OSWorld and shows strong performance on other agentic benchmarks - 73.3% on AndroidWorld and 50.6% on WindowsAgentArena. It's specialized for GUI interaction and may close the gap on future OSWorld evaluations.
Benchmark Saturation Isn't Here Yet
OSWorld isn't saturated. Humans at 72.4% still define the ceiling, and only one model claims to exceed it (GPT-5.4, unverified). The original benchmark launched with AI at 12%. The progress in 18 months is genuine and fast. But the remaining gap in real deployments is larger than these numbers suggest, because OSWorld tasks are curated and repeatable - production workflows aren't.
Practical Guidance
For production computer use agents: Claude Sonnet 4.6 is the default choice. It matches Opus 4.6 within 0.2 points on OSWorld at a fraction of the cost. If you're seeing task failures on complex multi-application workflows, upgrade to Opus 4.6 before switching providers. (A minimal API sketch for this setup appears at the end of this section.)
For enterprise RPA deployments: The UiPath Screen Agent pattern - Claude Opus 4.5 as the backbone, enterprise scaffolding on top - achieved a verified 53.6% on OSWorld-Verified and may be the right approach for organizations that need audit trails, compliance controls, and SLA guarantees around automation.
For open-source or on-prem requirements: Qwen3 VL 235B Instruct is the answer. Avoid the Thinking-mode variant for interactive GUI tasks; at the 235B scale, Instruct wins decisively (66.7% versus 38.1%).
For GUI grounding specifically: If your agent needs to work reliably in professional desktop software (CAD, creative tools, scientific software), invest in the ScreenSeekeR visual search approach or use MAI-UI if fine-tuning is an option. General-purpose frontier models will underperform on dense professional UIs regardless of their OSWorld scores.
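To connect the production recommendation above to code: Anthropic exposes computer use as a beta tool in its Messages API, and a request looks roughly like the sketch below. The model ID, tool version string, and beta flag shown are placeholders taken from an earlier published beta; check the current documentation before relying on them.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-7-sonnet-20250219",    # placeholder; substitute the current Sonnet model ID
    max_tokens=1024,
    tools=[{
        "type": "computer_20250124",       # tool version string from the 2025-01 beta; may differ for newer models
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    betas=["computer-use-2025-01-24"],     # beta flag from the same release; check current docs
    messages=[{"role": "user", "content": "Open the spreadsheet on the desktop and sum column B."}],
)

# The response contains tool_use blocks (screenshot requests, clicks, keystrokes).
# Your harness executes each action, returns a tool_result with a fresh screenshot,
# and loops until the model stops requesting actions.
for block in response.content:
    if block.type == "tool_use":
        print(block.input)                 # e.g. {"action": "screenshot"} or {"action": "left_click", "coordinate": [x, y]}
```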
FAQ
Which model is best for computer use tasks?
Claude Opus 4.6 leads independently-verified evaluations at 72.7% on OSWorld. GPT-5.4 claims 75.0% but that score is self-reported by OpenAI and awaits independent verification.
What is the human baseline on OSWorld?
Humans score about 72.4% on OSWorld tasks. GPT-5.4 self-reports 75.0% on OSWorld-Verified, which would make it the first model to surpass this baseline.
Is Qwen3 VL good enough for computer use?
Qwen3 VL 235B Instruct scores 66.7% on OSWorld, making it the strongest open-source option. That's 6 points below Claude Sonnet 4.6 - meaningful for complex tasks.
What is OSWorld-Verified vs standard OSWorld?
OSWorld-Verified is a stricter version of the benchmark. Crucially, most recent submissions to OSWorld-Verified are self-reported by providers, not independently assessed by the research team.
Why do Thinking-mode models underperform on OSWorld?
Extended reasoning helps with planning but can hurt GUI execution tasks that require fast, iterative action-observation loops. Thinking mode adds latency and sometimes overthinks straightforward click sequences.
What is ScreenSpot-Pro testing?
ScreenSpot-Pro tests GUI grounding: whether a model correctly identifies where to click on a complex professional software screen. It's a sub-skill that determines whether task completion scores translate into real automation.
Sources:
- OSWorld benchmark
- OSWorld leaderboard - llm-stats.com
- OSWorld-Verified leaderboard - llm-stats.com
- Introducing OSWorld-Verified - xlang.ai
- UiPath Screen Agent OSWorld ranking announcement
- ScreenSpot-Pro paper - arxiv.org
- ScreenSpot-Pro leaderboard
- OpenAI launches GPT-5.4 - TechCrunch
- UI-TARS paper - arxiv.org
- What does OSWorld tell us about AI's ability to use computers - Epoch AI
✓ Last verified March 15, 2026
