Agentic AI Benchmarks Leaderboard - GAIA, WebArena, BFCL, and Tau2-Bench
Rankings of the best AI models and agent frameworks on agentic benchmarks measuring real-world task completion, web navigation, function calling, and multi-turn tool use.

2026 is the year AI stopped just answering questions and started doing things. Every major lab now ships "agentic" capabilities - tool calling, multi-step planning, web browsing, code execution - and every press release claims their model is the best at it. But how do you actually measure whether an AI agent can complete a real task from start to finish without falling apart halfway through?
That's what this leaderboard tracks. We cover four benchmarks that test agentic capabilities from different angles: GAIA (general assistant tasks), WebArena (autonomous web navigation), BFCL V4 (function calling accuracy), and Tau2-bench (multi-turn customer service with tool use). Together, they paint the most complete picture available of which models can actually work autonomously - and which ones just talk a good game.
TL;DR
- Claude Sonnet 4.5 leads the GAIA benchmark at 74.6% overall, with Anthropic models sweeping the top 6 positions
- OpAgent (built on Qwen3-VL + RL) hits 71.6% on WebArena, surpassing agents backed by GPT-5 and Claude
- Claude Opus 4.1 edges out Claude Sonnet 4 on BFCL V4 function calling at 70.4% - but open-source GLM-4.5 tops both at 70.9%
- Claude Opus 4.6 dominates Tau2-bench with 99.3% on telecom and 91.9% on retail, the highest scores recorded
The Benchmarks Explained
GAIA (General AI Assistants)
GAIA was created by Meta and Hugging Face researchers to test whether AI systems can handle the kind of messy, multi-step tasks that humans do every day. It contains 466 questions divided into three difficulty levels. Level 1 tasks might require a single web search. Level 3 tasks chain together web browsing, file parsing, calculations, and reasoning over multiple documents. Every question has a single, unambiguous correct answer - no subjective grading.
When GAIA launched in late 2023, GPT-4 with plugins scored 15%. Humans scored 92%. That gap has narrowed significantly, but even today's best agents top out around 75%.
WebArena
WebArena is a self-hosted web environment that simulates real websites - an e-commerce store, a content management system, a Reddit-like forum, a GitLab instance, and a map application. Agents receive natural language instructions like "Find the cheapest red jacket in my size and add it to my cart" and must autonomously navigate the sites to complete tasks. It has 812 tasks, and success requires the agent to actually achieve the goal - partial credit doesn't exist.
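That all-or-nothing scoring is worth internalizing. A minimal sketch of how a WebArena-style task might be represented - the task format, state shape, and check function here are illustrative, not WebArena's actual evaluator:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class WebTask:
    """A WebArena-style task: an instruction plus a binary success check."""
    instruction: str
    check: Callable[[dict], bool]  # inspects the final site state

def evaluate(task: WebTask, final_state: dict) -> float:
    # Scoring is all-or-nothing: 1.0 on success, 0.0 otherwise.
    return 1.0 if task.check(final_state) else 0.0

task = WebTask(
    instruction="Find the cheapest red jacket in my size and add it to my cart",
    # Success = the cart holds the cheapest red item in the catalog.
    check=lambda s: any(
        i["name"] == "red jacket"
        and i["price"] == min(p["price"] for p in s["catalog"] if p["color"] == "red")
        for i in s["cart"]
    ),
)

state = {
    "catalog": [{"name": "red jacket", "color": "red", "price": 39.99},
                {"name": "red parka", "color": "red", "price": 89.99}],
    "cart": [{"name": "red jacket", "price": 39.99}],
}
print(evaluate(task, state))  # 1.0 - adding the wrong jacket would score 0.0
```

An agent that navigates every page correctly but adds the $89.99 parka scores exactly the same as one that does nothing: zero.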
Human success rate on WebArena sits around 78%. The first AI agents scored about 14% when the benchmark launched. Two years later, the best agents are now above 70%.
Agentic benchmarks test whether AI can autonomously complete multi-step tasks in realistic digital environments - not just answer questions.
BFCL V4 (Berkeley Function Calling Leaderboard)
The Berkeley Function Calling Leaderboard evaluates how accurately models can translate natural language requests into structured API calls. BFCL V4 tests across six dimensions: simple function calls, parallel invocations, multiple function selection, relevance detection (knowing when not to call a function), multi-turn interactions, and multi-step reasoning. It uses 2,000+ question-function-answer pairs across multiple programming languages.
V4 added agentic evaluation categories including web search with multi-hop reasoning, agent memory management, and format sensitivity testing. The gap between acing single-turn calls and handling realistic multi-turn tool use remains one of the benchmark's sharpest findings.
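To make the categories concrete, here is a hedged sketch of what BFCL-style scoring checks - the tool schemas and the exact-match rule are simplified illustrations, not BFCL's actual harness:

```python
import json

# Hypothetical tool schemas in the JSON style most function-calling APIs use.
tools = [
    {"name": "get_weather",
     "parameters": {"type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"]}},
    {"name": "get_stock_price",
     "parameters": {"type": "object",
                    "properties": {"ticker": {"type": "string"}},
                    "required": ["ticker"]}},
]

def score(predicted: list[dict], gold: list[dict]) -> bool:
    """Exact-match scoring: the set of calls and their arguments must agree.
    An empty gold list encodes relevance detection - the right move is no call."""
    canon = lambda calls: sorted(json.dumps(c, sort_keys=True) for c in calls)
    return canon(predicted) == canon(gold)

# Parallel invocation: one request should yield two independent calls.
gold = [{"name": "get_weather", "arguments": {"city": "Tokyo"}},
        {"name": "get_stock_price", "arguments": {"ticker": "AAPL"}}]
print(score(list(reversed(gold)), gold))  # True - call order doesn't matter

# Relevance detection: "Tell me a joke" matches no tool, so gold is empty.
print(score([], []))                                                   # True
print(score([{"name": "get_weather", "arguments": {"city": "?"}}], []))  # False
```

The last case is the one models fail most often: calling a plausible-looking function when the correct answer was to call nothing at all.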
Tau2-Bench
Tau2-bench from Sierra Research simulates customer service scenarios where an AI agent must use API tools to resolve user requests while following company policy. It covers three domains - retail, airline, and telecom - each with realistic constraints like return windows, fare rules, and account verification. A simulated user (powered by another LLM) interacts conversationally with the agent across multiple turns.
The benchmark scores with a pass^k metric: the probability that the agent succeeds on all k independent runs of the same task, not just once. Even small reliability gaps compound: a model that succeeds 80% of the time on individual turns will fail multi-turn conversations far more often than you would expect.
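A rough illustration of why the gaps compound, assuming independent runs (a sketch of the arithmetic, not Tau2-bench's exact protocol):

```python
# If a model's chance of succeeding on a single run is p, the chance it
# succeeds on all k independent runs is p**k.
def pass_all_k(p: float, k: int) -> float:
    return p ** k

for p in (0.80, 0.95, 0.99):
    print(f"p={p:.2f}  k=1: {pass_all_k(p, 1):.3f}  "
          f"k=4: {pass_all_k(p, 4):.3f}  k=8: {pass_all_k(p, 8):.3f}")
# An agent that succeeds 80% of the time per run clears all 8 runs
# only about 17% of the time; a 99% agent still clears all 8 about 92%.
```

This is why the leaderboard rewards boring consistency over occasional brilliance.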
GAIA Rankings
| Rank | Agent | Model | Overall | Level 1 | Level 2 | Level 3 |
|---|---|---|---|---|---|---|
| 1 | HAL Generalist Agent | Claude Sonnet 4.5 | 74.6% | 82.1% | 72.7% | 65.4% |
| 2 | HAL Generalist Agent | Claude Sonnet 4.5 (High) | 70.9% | 77.4% | 74.4% | 46.2% |
| 3 | HAL Generalist Agent | Claude Opus 4.1 (High) | 68.5% | 71.7% | 70.9% | 53.9% |
| 4 | HAL Generalist Agent | Claude Opus 4 (High) | 64.9% | 71.7% | 67.4% | 42.3% |
| 5 | HAL Generalist Agent | Claude 3.7 Sonnet (High) | 64.2% | 67.9% | 64.0% | 57.7% |
| 6 | HAL Generalist Agent | Claude Opus 4.1 | 64.2% | 71.7% | 66.3% | 42.3% |
| 7 | HF Open Deep Research | GPT-5 Medium | 62.8% | 73.6% | 62.8% | 38.5% |
| 8 | HAL Generalist Agent | GPT-5 Medium | 59.4% | 67.9% | 58.1% | 46.2% |
| 9 | HAL Generalist Agent | o4-mini (Low) | 58.2% | 71.7% | 51.2% | 53.9% |
| 10 | HF Open Deep Research | Claude Opus 4 | 57.6% | 66.0% | 57.0% | 42.3% |
Source: GAIA Leaderboard at Princeton HAL, accessed February 2026.
WebArena Rankings
| Rank | Agent | Key Model | Overall | Shopping | CMS | Reddit | GitLab | Maps |
|---|---|---|---|---|---|---|---|---|
| 1 | OpAgent | Qwen3-VL-32B + RL | 71.6% | 59.2% | 71.3% | 86.0% | 75.9% | 71.4% |
| 2 | ColorBrowserAgent | GPT-5 | 71.2% | - | - | - | - | - |
| 3 | GBOX AI | Claude Code | 68.0% | - | - | - | - | - |
| 4 | DeepSky Agent | Proprietary | 66.9% | - | - | - | - | - |
| 5 | Narada AI | Proprietary | 64.2% | - | - | - | - | - |
| 6 | IBM CUGA | Proprietary | 61.7% | - | - | - | - | - |
| 7 | WebOperator | GPT-4o | 54.6% | - | - | - | - | - |
Source: OpAgent paper (arXiv:2602.13559), January 2026. Dashes indicate per-domain scores not reported for that agent.
Function calling accuracy - measured by BFCL V4 - remains one of the hardest capabilities for models to get right consistently across multi-turn conversations.
BFCL V4 Function Calling Rankings
| Rank | Model | Provider | Overall Accuracy |
|---|---|---|---|
| 1 | GLM-4.5 (FC) | Zhipu AI | 70.9% |
| 2 | Claude Opus 4.1 | Anthropic | 70.4% |
| 3 | Claude Sonnet 4 | Anthropic | 70.3% |
| 4 | Llama 3.1 405B | Meta | ~68% (estimated) |
| 5 | GPT-5 | OpenAI | 59.2% |
Source: BFCL V4 Leaderboard and Klavis AI analysis, accessed February 2026. Note: not all models have been evaluated under identical BFCL V4 conditions.
Tau2-Bench Rankings
| Rank | Model | Provider | Retail | Telecom |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 91.9% | 99.3% |
| 2 | Claude Opus 4.5 | Anthropic | 88.9% | 98.2% |
| 3 | GPT-5.2 Thinking | OpenAI | 82.0% | 98.7% |
| 4 | Claude 3.7 Sonnet | Anthropic | - | ~49% (Pass@1) |
| 5 | o4-mini | OpenAI | - | ~42% (Pass@1) |
| 6 | GPT-4.1 | OpenAI | - | ~34% (Pass@1) |
Sources: Claude Opus 4.6 benchmarks (Vellum) and GPT-5.2 benchmarks (Vellum). Note: rows 4-6 are from the original Tau2-bench paper using the newer telecom domain and aren't directly comparable to the Tau2-bench retail/telecom scores of the newer models.
Key Takeaways
Anthropic Dominates GAIA - But the Agent Framework Matters More Than You Think
The GAIA leaderboard's top 6 positions all belong to Anthropic models running inside the HAL Generalist Agent framework from Princeton. That is striking, but dig into the numbers and a nuance emerges: the same Claude Opus 4 model scores 64.9% inside HAL but only 57.6% inside Hugging Face's Open Deep Research framework. That's a 7-point gap from the agent harness alone, not the model. The takeaway is that for real-world agentic deployments, the orchestration layer - how you manage tool calls, context, retries, and error recovery - matters nearly as much as which model you plug in. If you're building an agent system, invest in the scaffolding, not just the model API key.
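What does "invest in the scaffolding" look like in practice? One small piece is harness-side error recovery around tool calls. A minimal sketch, with hypothetical tool names - real harnesses add logging, budgets, and provider-specific error types:

```python
import time

def call_tool_with_recovery(tool, args, max_retries=3, backoff=1.0):
    """Retry transient failures with exponential backoff, and surface
    argument errors as structured observations the model can read and
    correct, instead of crashing the whole run."""
    for attempt in range(max_retries):
        try:
            return {"ok": True, "result": tool(**args)}
        except TimeoutError:
            time.sleep(backoff * 2 ** attempt)  # transient: back off, retry
        except (TypeError, ValueError) as e:
            # Bad arguments: retrying won't help. Return the error as an
            # observation so the model can fix its next call.
            return {"ok": False, "error": f"{type(e).__name__}: {e}"}
    return {"ok": False, "error": "tool timed out after retries"}

def flaky_search(query):          # hypothetical tool
    if not query:
        raise ValueError("empty query")
    return [f"result for {query}"]

print(call_tool_with_recovery(flaky_search, {"query": "GAIA benchmark"}))
print(call_tool_with_recovery(flaky_search, {"query": ""}))
```

The design choice that matters: failures become data the model sees on its next turn, rather than exceptions that end the episode - which is exactly the kind of harness behavior that separates a 64.9% run from a 57.6% one.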
For those assessing Claude models specifically, our Claude Opus 4.6 review covers hands-on agentic testing beyond benchmark scores.
An Open-Source Agent Leads WebArena
OpAgent's 71.6% on WebArena is the current state of the art, and it's built on the open-source Qwen3-VL-32B model fine-tuned with reinforcement learning in live web environments. It edges out ColorBrowserAgent (GPT-5) at 71.2% and GBOX AI (Claude Code) at 68.0%. The architecture matters here: OpAgent uses a Planner-Grounder-Reflector-Summarizer pipeline specifically optimized for web navigation, with an RL training phase that exposed the model to real websites rather than synthetic data.
The WebArena numbers also reveal domain-specific strengths. OpAgent scores 86.0% on Reddit-like tasks but only 59.2% on shopping tasks, where precise form filling and checkout flows trip it up. Human success rate on WebArena is about 78%, so the gap is closing but agents still struggle with the most structured, transactional tasks.
WebArena tests agents on realistic web tasks like shopping, code review, and content management - with no partial credit for almost getting there.
Function Calling: The Single-Turn vs Multi-Turn Gap
BFCL V4's most important finding is not who ranks first - it's the gap between single-turn and multi-turn performance. Models that ace one-shot function calls consistently stumble when they need to maintain context across turns, manage memory, and decide when not to call a function. GLM-4.5 leads at 70.9% overall, with Claude Opus 4.1 and Sonnet 4 close behind. But GPT-5, despite being OpenAI's most capable model, sits at 59.2% - a surprisingly wide gap that suggests OpenAI's training has prioritized other capabilities over structured tool use.
The practical implication: if your application involves single-turn API calls (a chatbot that looks up weather or checks order status), most frontier models will work fine. If you're building multi-turn agent workflows where the model needs to chain function calls, handle errors, and maintain state, model selection matters enormously. See our best AI agent frameworks guide for framework-level recommendations.
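The "maintain state" requirement is easy to underestimate. A minimal multi-turn tool loop looks something like this - `model_step` is a scripted stand-in for a real model API, and the message format is a generic assumption, not any provider's actual interface:

```python
def run_agent(model_step, tools, user_message, max_turns=8):
    """Harness keeps the full message history so each tool result
    feeds the model's next decision."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        action = model_step(messages)        # a tool call or a final reply
        if action["type"] == "final":
            return action["content"], messages
        result = tools[action["name"]](**action["arguments"])
        messages.append({"role": "tool", "name": action["name"],
                         "content": str(result)})  # state carried forward
    return None, messages  # ran out of turns without finishing

# Scripted stand-in model: look up the order, then answer from the result.
def scripted(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "call", "name": "order_status",
                "arguments": {"order_id": "A1"}}
    return {"type": "final", "content": messages[-1]["content"]}

answer, _ = run_agent(scripted, {"order_status": lambda order_id: "shipped"},
                      "Where is order A1?")
print(answer)  # shipped
```

Even in this toy loop, the failure modes BFCL measures are visible: pick the wrong tool, drop a tool result from the history, or keep calling tools past the turn budget, and the conversation fails.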
Tau2-Bench: Reliability Under Pressure
Claude Opus 4.6 posts 99.3% on the telecom domain of Tau2-bench - near-perfect tool use across multi-turn customer service conversations. GPT-5.2 Thinking is close at 98.7% on the same domain. But the retail domain tells a different story: Claude Opus 4.6 scores 91.9% while GPT-5.2 drops to 82.0%. Retail tasks involve more complex policy reasoning (return windows, discount stacking, inventory checks), and that 10-point gap shows where Claude's advantage in following nuanced instructions pays off.
The original Tau2-bench paper's results are a useful baseline. Claude 3.7 Sonnet scored about 49% Pass@1 on the telecom domain. The jump to Claude Opus 4.6's 99.3% in under two years is one of the most dramatic capability improvements in any AI benchmark.
Practical Guidance
Building a general-purpose AI assistant? The GAIA results point clearly to Claude Sonnet 4.5 as the strongest backbone model, especially inside a well-engineered agent framework. At $3/$15 per million tokens (input/output), it also offers better cost efficiency than Opus-tier models for high-volume deployments.
Need web automation? OpAgent's open-source architecture shows that you do not need a frontier API model to lead WebArena. If you're willing to host and fine-tune Qwen3-VL-32B, you can match or beat GPT-5-based agents. For a managed solution, Claude Code inside the GBOX framework at 68.0% is the best commercially available option.
Integrating function calling into your app? Claude Opus 4.1 and Sonnet 4 are the safest bets among proprietary models. If you're in the open-source ecosystem, GLM-4.5's 70.9% on BFCL makes it worth assessing - especially since it runs locally.
Customer service automation? Tau2-bench results strongly favor Claude Opus 4.6 for reliability-critical deployments. The 99.3% telecom and 91.9% retail scores are the highest recorded. GPT-5.2 Thinking is a reasonable alternative, especially on the telecom domain where it nearly matches Claude.
On a budget? Check our cost efficiency leaderboard for the best performance-per-dollar breakdown. For agentic workloads specifically, Claude Sonnet 4.5 at $3 input / $15 output hits the sweet spot between capability and cost.
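A back-of-envelope cost check at Sonnet 4.5's listed $3 / $15 per million input/output tokens. The per-run token counts are illustrative assumptions - measure your own workload before budgeting:

```python
def run_cost(input_tokens, output_tokens, in_rate=3.0, out_rate=15.0):
    """Cost in dollars for one agent run at per-million-token rates."""
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate

# Multi-step agent runs re-send accumulated context every turn,
# so input tokens usually dominate the bill.
cost = run_cost(input_tokens=120_000, output_tokens=8_000)
print(f"${cost:.3f} per run, ${cost * 10_000:,.0f} per 10k runs")
```

At these assumed token counts, a run costs well under a dollar - but note how the input side, not the output side, drives the total for agentic workloads.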
Methodology Notes
A few caveats worth flagging. GAIA results depend heavily on the agent framework wrapping the model, making pure model-to-model comparison difficult. WebArena scores come from different papers and may use slightly different evaluation harnesses. BFCL V4 has not evaluated every model under identical conditions - some scores are from the official leaderboard, others from independent reproductions. Tau2-bench results for newer models (Opus 4.6, GPT-5.2) come from provider-reported benchmarks and haven't been independently replicated at scale.
Benchmark contamination is also a growing concern. As these benchmarks become popular, there's risk that training data increasingly overlaps with test cases. GAIA mitigates this with factual answers that change over time (web-dependent questions), and BFCL uses live enterprise data. But WebArena's fixed task set is more vulnerable to overfitting, especially for agents trained with RL in similar web environments.
I will update this leaderboard as new results come in. If you spot an error or a missing result, reach out - getting the numbers right matters more than being first.
Sources:
- GAIA Benchmark Paper (arXiv:2311.12983)
- GAIA Leaderboard - Princeton HAL
- WebArena Official Site
- OpAgent Paper (arXiv:2602.13559)
- Berkeley Function Calling Leaderboard V4
- BFCL Analysis - Klavis AI
- Claude Opus 4.6 Benchmarks - Vellum
- GPT-5.2 Benchmarks - Vellum
- Tau2-Bench Repository - Sierra Research
- Anthropic Claude Opus 4.6 Announcement
