Web Agent Benchmarks Leaderboard: Apr 2026

Rankings across WebArena, WebVoyager, BrowseComp, Mind2Web, WorkArena, and WebChoreArena - every verified score for browser-driving AI agents as of April 2026.

Web agents are the part of the AI stack where the rubber actually meets the road. Not a chat window - a model that opens a browser, reads what's on the screen, clicks buttons, fills forms, and either completes the task or fails. The benchmarks that measure this are messy, fragmented, and far harder to game than static multiple-choice evals. That's exactly why they matter.

This leaderboard covers the major browser-specific benchmarks as of April 2026. Each suite tests something slightly different: task complexity, website diversity, the role of vision vs. text, and tolerance for ambiguity. No single number tells the full story. Read the methodology sections before drawing conclusions.

TL;DR

  • On WebArena's multi-step task suite, Claude Mythos Preview leads tracked models at 68.7%; specialized agentic frameworks (OpAgent, DeepSeek v3.2 as a backbone) reach 71-74% via online RL pipelines
  • WebVoyager scores near 97-98% for the top commercial agents in 2026, making that benchmark effectively saturated - look to WebChoreArena and BrowseComp for meaningful signal
  • BrowseComp is the hardest browsing eval in wide use: Deep Research scored 51.5% at launch, and the current top score is 86.9% for Claude Mythos Preview on the llm-stats.com tracker
  • Best budget pick for web tasks: the open-source Browser Use framework running on GPT-4o hit 89.1% on WebVoyager, outperforming OpenAI's own Operator product (87%)

Why Web Agent Benchmarks Differ from General LLM Evals

General benchmarks like MMLU or GPQA test whether a model knows things. Web agent benchmarks test whether a model can do things - navigate a real or simulated browser, interpret dynamic page content, chain actions across multiple steps, and recover from errors without human help.

[Image] AI agents operate on real or simulated browser environments, making web agent benchmarks a distinct category from static knowledge tests. Source: unsplash.com

This distinction matters for benchmark selection. A model that tops GPQA may be terrible at clicking through a checkout flow. The correlation between raw LLM capability and web agent performance exists but isn't tight - scaffolding, observation format, and action space all contribute as much as the underlying model.

For context on how web agents relate to full desktop automation, see the Computer Use Leaderboard: Desktop AI Agent Rankings.


Benchmark Overview

WebArena

The oldest major web agent eval. 812 tasks (241 templates, ~3.3 variations each) across four domains: e-commerce, social forums, code repositories, and content management. Tasks are long-horizon - "Find when your last order shipped and post an update to the forum thread" - with programmatic grading, no LLM judge involved. Scores are pass/fail success rate across all 812 tasks.
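Pass/fail programmatic grading makes the aggregate metric trivial to reproduce. A minimal sketch (the function name and example counts are illustrative, not WebArena's actual harness code):

```python
# Hypothetical sketch: WebArena-style scoring is a plain pass/fail
# success rate over all tasks, with no partial credit and no LLM judge.
def success_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks whose programmatic checks all passed."""
    return sum(outcomes) / len(outcomes)

# e.g. 558 passes out of the 812 tasks rounds to 68.7%
outcomes = [True] * 558 + [False] * (812 - 558)
print(f"{success_rate(outcomes):.1%}")  # 68.7%
```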

A verified, Docker-hosted version (webarena-verified) became available in February 2026, improving reproducibility. The original leaderboard at webarena.dev lists community submissions.

VisualWebArena

910 tasks built specifically for multimodal agents, where understanding what's on screen visually (images, product photos, UI layouts) is required to complete the task. Released by the same CMU group behind WebArena. At original publication in early 2024, the best VLM agent scored 16.4% against 88.7% human performance. Most current published results still cite the original paper rather than a live leaderboard.

WebVoyager

643 tasks across 15 popular websites - Google, Amazon, GitHub, Reddit, Wikipedia among them. Uses a dual evaluation approach: human annotation plus automated GPT-4V judgment. Published in 2024 with an original GPT-4V agent scoring 59.1%. The Steel.dev leaderboard now tracks live agent submissions against this benchmark.

Mind2Web / Mind2Web 2

The original Mind2Web (NeurIPS 2023) introduced 2,000+ open-ended tasks across 137 websites in 31 domains. Mind2Web 2, published at NeurIPS 2025, raised the bar substantially: 130 long-horizon tasks requiring real-time browsing and cross-site information synthesis, constructed with over 1,000 hours of human annotation. It uses an Agent-as-a-Judge framework with tree-structured rubrics. The best current system is OpenAI Deep Research at 50-70% of human performance.
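Tree-structured rubrics let partial credit roll up from leaf criteria to a task-level score. The node schema below is purely illustrative (Mind2Web 2's actual rubric format and aggregation rules may differ):

```python
# Illustrative sketch of tree-structured rubric scoring, as used in
# Agent-as-a-Judge setups. Node names and the averaging rule here are
# hypothetical, not the Mind2Web 2 specification.
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    name: str
    passed: bool = False                  # leaf judgment (0/1)
    children: list["RubricNode"] = field(default_factory=list)

def score(node: RubricNode) -> float:
    """Leaves score 0 or 1; internal nodes average their children."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    return sum(score(c) for c in node.children) / len(node.children)

task = RubricNode("book flight", children=[
    RubricNode("correct dates", passed=True),
    RubricNode("price cited", children=[
        RubricNode("price found", passed=True),
        RubricNode("source page valid", passed=False),
    ]),
])
print(score(task))  # 0.75
```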

Online-Mind2Web, a 2025 evaluation of 300 live tasks, showed that most commercially available agents underperform the academic SeeAct baseline from early 2024. The exceptions: Claude Computer Use 3.7 and OpenAI Operator at roughly 61% success.

BrowseComp

Released by OpenAI in April 2025. 1,266 hard information-retrieval problems designed to be nearly unsolvable by keyword search alone - tasks require multi-hop reasoning across multiple retrieved pages. At launch, GPT-4o with browsing scored 1.9%, o1 scored 9.9%, and Deep Research hit 51.5%. Updated leaderboard data tracked by llm-stats.com puts scores far higher for 2026 frontier models.

WorkArena / WorkArena++

ServiceNow's enterprise-focused benchmark. WorkArena covers 33 atomic tasks in a ServiceNow instance. WorkArena++ scales this to 682 multi-step compositional tasks requiring planning, retrieval, arithmetic reasoning, and memory across browser sessions. Human performance is 93.9% on WorkArena++. GPT-4o managed only 2.1%, and no model hits meaningful performance on the L3 (ticket-like, context-rich) task tier.

WebChoreArena

A 2025 extension of WebArena with 532 tasks focused on tedious, labor-intensive work: massive memory retrieval, calculation across pages, and long-term cross-page reasoning. Gemini 2.5 Pro scores 54.8% on WebArena but drops to 37.8% on WebChoreArena. GPT-4o manages only 2.6% on WebChoreArena versus 44.9% for Gemini 2.5 Pro, exposing a much wider performance gap than the base benchmark.


Rankings

WebArena - Tracked Model Scores (April 2026)

Scores from benchlm.ai, which tracks 15 models against the standard 812-task suite.

| Rank | Model | Provider | WebArena Score |
|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 68.7% |
| 2 | GPT-5.4 Pro | OpenAI | 65.8% |
| 3 | Claude Opus 4.6 | Anthropic | 64.5% |
| 4 | GPT-5.4 | OpenAI | 62.3% |
| 5 | Claude Sonnet 4.6 | Anthropic | 59.2% |
| 6 | Gemini 3.1 Pro | Google | 58.4% |
| 7 | Qwen3.6 Plus | Alibaba | 57.2% |
| 8 | Qwen3.5 397B | Alibaba | 55.8% |
| 9 | Grok 4.1 | xAI | 53.7% |
| 10 | Gemini 3 Pro | Google | 52.1% |
| 11 | Kimi K2.5 | Moonshot AI | 51.3% |
| 12 | GLM-5 Reasoning | Z.AI | 49.8% |
| 13 | DeepSeek V3.2 Thinking | DeepSeek | 48.6% |
| 14 | Llama 4 Behemoth | Meta | 46.2% |
| 15 | o4-mini (high) | OpenAI | 44.5% |

Note: Specialized agentic frameworks that wrap models can score higher. OpAgent (CodeFuse AI) reached 71.6% on WebArena using a Planner-Grounder-Reflector-Summarizer multi-agent pipeline with online reinforcement learning, holding the #1 leaderboard position in January 2026. DeepSeek v3.2 as an agent backbone (not raw model) hit 74.3% in the Steel.dev results index, which tracks end-to-end agent systems rather than raw model calls.
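The gap between raw model calls and agentic frameworks comes from the control loop around the model. A minimal sketch of a Planner-Grounder-Reflector-style loop (the role names come from OpAgent's description; the stub components and toy environment below are purely illustrative, not the actual system):

```python
# Minimal control-loop sketch of a Planner-Grounder-Reflector-style
# pipeline. All components here are stand-in stubs for illustration.
def run_agent(task, planner, grounder, reflector, env, max_steps=10):
    plan = planner(task, feedback=None)
    for _ in range(max_steps):
        action = grounder(plan, env.observe())   # map plan step to a concrete click/type
        env.execute(action)
        done, feedback = reflector(task, env.observe())
        if done:
            return True                          # task verified complete
        plan = planner(task, feedback=feedback)  # replan using the reflector's signal
    return False

class ToyEnv:
    """Stand-in for a browser: three actions 'complete' the task."""
    def __init__(self):
        self.steps = 0
    def observe(self):
        return self.steps
    def execute(self, action):
        self.steps += 1

done = run_agent(
    "demo task",
    planner=lambda task, feedback: "plan",
    grounder=lambda plan, obs: "click",
    reflector=lambda task, obs: (obs >= 3, "keep going"),
    env=ToyEnv(),
)
print(done)  # True
```

Online RL pipelines like OpAgent's additionally update the planner from failure traces; the loop structure stays the same.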

WebVoyager - Top Agent Systems (April 2026)

From the Steel.dev Browser Agent Leaderboard, tracking end-to-end agent submissions.

| Rank | Agent | Organization | WebVoyager Score |
|---|---|---|---|
| 1 | Alumnium | Alumnium | 98.5% |
| 2 | Surfer 2 | H Company | 97.1% |
| 3 | Magnitude | Magnitude | 93.9% |
| 4 | AIME Browser-Use | Aime | 92.3% |
| 5 | Surfer-H + Holo1 | H Company | 92.2% |
| 6 | Browserable | Browserable | 90.4% |
| 7 | Browser Use | Browser Use | 89.1% |
| 8 | Operator | OpenAI | 87.0% |
| 9 | Skyvern 2.0 | Skyvern | 85.9% |
| 10 | Project Mariner | Google | 83.5% |
| - | WebVoyager (original) | Academic | 59.1% |

WebVoyager is approaching saturation: scores above 90% are common enough that the benchmark no longer differentiates the top tier well.

BrowseComp - Model Scores (April 2026)

From the llm-stats.com tracker, which covers 40 models. BrowseComp scores are reported as fractions (0.0-1.0).

| Rank | Model | Provider | BrowseComp Score |
|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 0.869 |
| 2 | Gemini 3.1 Pro | Google | 0.859 |
| 3 | Claude Opus 4.6 | Anthropic | 0.840 |
| 4 | GPT-5.4 | OpenAI | 0.827 |
| 5 | GLM-5.1 | Zhipu AI | 0.793 |
| 6 | GPT-5.2 Pro | OpenAI | 0.779 |
| 7 | Seed 2.0 Pro | ByteDance | 0.773 |
| 8 | MiniMax M2.5 | MiniMax | 0.763 |
| 9 | GLM-5 | Zhipu AI | 0.759 |
| 10 | Kimi K2.5 | Moonshot AI | 0.749 |
| 11 | Claude Sonnet 4.6 | Anthropic | 0.747 |
| 12 | Qwen3.5-397B | Alibaba | 0.690 |
| - | Deep Research (launch) | OpenAI | 0.515 |
| - | o3 | OpenAI | 0.497 |
| - | GPT-4o with browsing | OpenAI | 0.019 |

The jump from 0.515 (Deep Research at launch) to 0.869 in under a year is steep. BrowseComp remains the hardest widely-used browsing eval and hasn't saturated yet.

[Image] Web agents must interpret and act on complex data-rich interfaces like dashboards. Benchmarks like WorkArena++ specifically test this kind of enterprise SaaS interaction. Source: unsplash.com


Key Takeaways

Anthropic and OpenAI Trade Leads Depending on the Benchmark

Anthropic's models dominate WebArena and BrowseComp. OpenAI's ecosystem leads on WebVoyager through commercial products like Operator. Google's Gemini 3.1 Pro scores second on BrowseComp at 0.859 and shows competitive WebArena numbers. The gap between providers has closed significantly since late 2024.

Agentic Frameworks Beat Raw Model Calls by a Wide Margin

The 74.3% DeepSeek v3.2 agent score on WebArena versus the same model's 48.6% raw-model score shows what a well-designed scaffolding layer contributes. OpAgent's online RL pipeline - where the agent learns from task failures on the fly - represents the current state of the art for WebArena. Raw model API calls are not the right comparison point for production web agent deployments.

WebVoyager Has Saturated

Near-perfect WebVoyager scores no longer distinguish good systems from excellent ones. Scores in the 90-98% range are clustered, and the benchmark uses GPT-4V as a judge - which may not reliably distinguish between a 92% and a 97% agent. Researchers and practitioners should weight WebChoreArena (the harder variant) and BrowseComp more heavily.

Enterprise Web Tasks Remain Very Hard

WorkArena++'s 2.1% score for GPT-4o, set against 93.9% human performance, makes the gap between frontier models and humans vivid. The benchmark's L3 tasks - which mirror real ServiceNow workflows with complex context requirements - currently see effectively zero LLM success. Any vendor claiming "autonomous enterprise agent" capabilities should be pressed on WorkArena++ L3 numbers.

Open-Source Agents Are Competitive

Browser Use (open-source) beat OpenAI's commercial Operator on WebVoyager, 89.1% vs 87%. The performance gap between open and closed systems that defined 2024 has mostly closed at the framework level. The underlying models still favor closed-source providers, but agent scaffolding is no longer where commercial products hold an advantage.


Practical Guidance

For general web task automation

If you're building browser agents on top of frontier model APIs, Claude Opus 4.6 or GPT-5.4 as the backbone gives you the strongest raw capability. Pair either with the Browser Use open-source framework (see our Best AI Browser Automation Tools roundup) rather than building scaffolding from scratch. The open-source frameworks now match or exceed proprietary agent products on standard benchmarks.

For research on web agents

BrowseComp and WebChoreArena are the benchmarks worth tracking in 2026. WebVoyager provides a useful regression check but shouldn't be the primary eval. For enterprise-specific scenarios, WorkArena++ L2 is realistic; L3 results tell you how far you still have to go. Mind2Web 2 is the right choice for agentic search and long-horizon information tasks.

For commercial web agent products

If you're evaluating products like OpenAI Operator, Google Project Mariner, or Skyvern, ask for BrowseComp scores rather than WebVoyager scores. BrowseComp's hard information-retrieval problems separate agents that actually reason from agents that pattern-match. Our Best AI Browser Agents guide covers the commercial product landscape.

For open-source work

The Browser Use framework running on GPT-4o is the strongest open-source baseline. MolMo-Web (AI2) is a notable recent open-source web agent worth watching if you need a permissively licensed option. The BrowserGym ecosystem from ServiceNow provides a unified test harness across WebArena, WorkArena, VisualWebArena, and others - useful if you want reproducible comparisons across benchmarks.


FAQ

Which web agent benchmark should I use for evaluating my system?

Use BrowseComp for hard information-retrieval tasks, WebChoreArena for tedious multi-step tasks, and WorkArena++ for enterprise SaaS workflows. WebVoyager is near-saturated for top systems; use it as a baseline check only.

What's the difference between WebArena and WebVoyager?

WebArena uses four simulated websites with programmatic grading and 812 multi-step tasks. WebVoyager uses 15 live real-world websites with 643 tasks and GPT-4V automated judging. WebArena is more reproducible; WebVoyager tests live-web generalization.

Can open-source agents match commercial products on web tasks?

Yes, at the framework level. Browser Use (open-source) scored 89.1% on WebVoyager vs 87% for OpenAI Operator. The underlying model still matters, but the scaffolding gap has closed significantly.

How do WorkArena++ scores compare to real enterprise automation?

WorkArena++ L3 tasks have near-zero LLM success rates vs 93.9% for humans. No current LLM-based agent should be trusted with enterprise workflow automation without significant human oversight.

How often do these rankings change?

WebVoyager and BrowseComp update frequently as new agent submissions arrive. WebArena updates more slowly. WorkArena++ results have been stable since mid-2025. Check Steel.dev and benchlm.ai for current snapshots.



Last verified April 17, 2026

About the author
AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.