Function Calling Benchmarks Leaderboard 2026

Rankings of top LLMs on function calling and tool use benchmarks including BFCL v3, tau-bench, ToolBench, and FinTrace as of April 2026.


Function calling is one of the capabilities that matter most for anyone building real AI applications - and it's one of the least straightforward to assess. A model that gets 90% on general reasoning can fall apart the moment you ask it to correctly format a nested JSON payload, choose between two semantically similar tools, or recover gracefully when an API returns an error mid-task.

This leaderboard tracks how the major frontier models actually perform on dedicated function calling and tool-use benchmarks. We cover five distinct evaluations: BFCL v3 (single and multi-turn structured calls), tau-bench (multi-turn tool use in realistic customer service scenarios), ToolBench (multi-step real-world API tasks), FinTrace (long-horizon financial tool calling), and MCP-Bench (agentic tool use via Model Context Protocol). Each captures something the others miss.

If you're coming from our agentic AI benchmarks leaderboard, the coverage here is more focused: we're looking specifically at the tool-calling layer, not the full agent trajectory.

TL;DR

  • GLM 4.5 leads BFCL v3 at 76.7%, ahead of Qwen3 32B (75.7%) - Claude Opus 4 scores a surprisingly low 25.3%
  • Claude Sonnet 4.5 controls tau-bench with 0.700 on airline tasks and 0.862 on retail - the strongest multi-turn tool use score published
  • GPT-5.2 Thinking hits 98.7% on tau2-bench telecom, though that benchmark's narrow scope inflates the headline number
  • Budget pick: Qwen3 235B A22B (74.9% BFCL v3) runs open-weight and trades near-frontier scores for zero API costs

The Benchmarks Explained

Before reading the tables, it helps to know what each benchmark actually tests - because the gap between a 76% BFCL score and a 70% tau-bench score isn't just a number difference. They measure different things entirely.

BFCL v3 (Berkeley Function Calling Leaderboard)

The Berkeley Function Calling Leaderboard from UC Berkeley's Sky Computing Lab is the most widely cited tool-use benchmark. Version 3 added multi-turn interactions on top of the v2 static evaluation. The dataset contains over 2,000 question-function-answer pairs spanning Python, Java, JavaScript, and REST APIs, testing six categories: simple single calls, parallel calls (multiple functions fired simultaneously), multiple function selection, relevance detection (knowing when to refuse a tool call), multi-turn interactions, and multi-step reasoning.

The evaluation method matters here. BFCL uses Abstract Syntax Tree (AST) comparison to check structural correctness of the produced call, not just string matching. That catches paraphrasing games where a model rewrites the function name slightly and a naive eval would still pass it. BFCL v4 has since added web search and memory tasks, but v3 remains the most-assessed version for cross-model comparison. Version 4 scores aren't broadly available yet across 2026 frontier models.
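To make the AST idea concrete, here is a minimal sketch of structural call comparison using Python's `ast` module. It is an illustration of the technique, not BFCL's actual harness, and the `calls_match` helper and the `get_weather` tool are hypothetical:

```python
import ast

def calls_match(candidate: str, reference: str) -> bool:
    """Compare two function-call strings structurally, the way an
    AST-based eval can, rather than by raw string match."""
    try:
        cand = ast.parse(candidate, mode="eval").body
        ref = ast.parse(reference, mode="eval").body
    except SyntaxError:
        return False
    if not (isinstance(cand, ast.Call) and isinstance(ref, ast.Call)):
        return False
    # The function name must match exactly.
    if ast.dump(cand.func) != ast.dump(ref.func):
        return False
    # Positional arguments must match in order.
    if [ast.dump(a) for a in cand.args] != [ast.dump(a) for a in ref.args]:
        return False
    # Keyword arguments must match regardless of order or spacing.
    cand_kwargs = {k.arg: ast.dump(k.value) for k in cand.keywords}
    ref_kwargs = {k.arg: ast.dump(k.value) for k in ref.keywords}
    return cand_kwargs == ref_kwargs

# Same call, different argument order and spacing: structurally equal.
print(calls_match(
    "get_weather(city='Paris', units='metric')",
    "get_weather(units = 'metric', city = 'Paris')",
))  # True
# A renamed function fails even though a fuzzy string match might pass it.
print(calls_match("get_weather(city='Paris')", "getWeather(city='Paris')"))  # False
```

Note what this buys you over string comparison: whitespace, quoting style, and keyword-argument order all stop mattering, while a subtly renamed function or dropped parameter still fails.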

tau-bench (Tool-Agent-User Interaction)

tau-bench from Sierra Research simulates live customer service scenarios. An AI agent is given a set of API tools and must complete multi-turn conversations with a simulated user - resolving requests like fare changes, return windows, and account modifications while following strict policy guidelines. A Pass@k metric measures how consistently the agent succeeds across k repeated runs. That penalizes brittleness: a model with 80% single-turn accuracy might only complete multi-turn scenarios reliably 50% of the time because errors compound.

There are two primary domains - airline and retail - with a newer telecom variant (tau2-bench) from the same team. Scores across domains aren't directly comparable because task complexity differs, so we report them separately.
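The consistency metric can be sketched numerically. Assuming the standard combinatorial estimator for "all k of k runs succeed" (the form the tau-bench paper uses for its pass^k-style reporting), a per-task score looks like this:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that k independent runs of
    the same task ALL succeed, given c successes observed in n trials."""
    if k > n:
        raise ValueError("k cannot exceed number of trials n")
    return comb(c, k) / comb(n, k)

# A task solved in 8 of 10 trials looks strong at k=1 but decays fast,
# which is exactly the brittleness penalty described above:
print(round(pass_hat_k(10, 8, 1), 3))  # 0.8
print(round(pass_hat_k(10, 8, 4), 3))  # 0.333
```

This is why a model with decent single-run accuracy can post a much lower multi-run consistency score: the metric compounds failure probability across repeated attempts.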

ToolBench and ToolLLM

ToolBench, published by Tsinghua's NLP lab and accepted at ICLR 2024, tests models against over 120,000 instruction-API pairs drawn from 16,000+ real-world APIs across 49 categories. It stresses tool generalization: models are assessed on unseen tools, unseen instructions, and unseen API categories to test whether they've truly learned to use APIs or just memorized training examples. The ToolLLM paper describes the fine-tuned model trained on ToolBench data; the benchmark itself runs any model through the same evaluation harness.

FinTrace

Published in April 2026, FinTrace from Carnegie Mellon and a consortium of financial AI researchers targets a specific failure mode: models that pick the right tool but then fail to use the output correctly. The benchmark contains 800 expert-annotated trajectories across 34 financial task categories, scored across nine metrics along four axes - action correctness, execution efficiency, process quality, and output quality. It's the only benchmark in this roundup that explicitly measures whether the model reasons well about tool outputs, not just whether it invokes the right function.
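A rough sketch of how multi-axis scoring like this composes - the metric names and the equal-weight averaging below are invented for illustration; the paper defines its own nine metrics and weighting:

```python
# Hypothetical axis/metric layout, loosely mirroring FinTrace's four axes.
AXES = {
    "action_correctness": ["tool_choice", "argument_accuracy"],
    "execution_efficiency": ["step_count", "redundant_calls"],
    "process_quality": ["error_recovery", "plan_coherence"],
    "output_quality": ["information_use", "answer_accuracy", "format"],
}

def trajectory_score(metrics: dict[str, float]) -> float:
    """Average per-metric scores within each axis, then average the
    axes, so no single axis dominates just by having more metrics."""
    axis_scores = [
        sum(metrics[m] for m in members) / len(members)
        for members in AXES.values()
    ]
    return sum(axis_scores) / len(axis_scores)

perfect = {m: 1.0 for members in AXES.values() for m in members}
print(trajectory_score(perfect))  # 1.0
```

The structural point survives the invented names: a model can ace `tool_choice` and still post a mediocre overall score if the output-quality axis drags, which is exactly the failure mode the benchmark was built to surface.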

MCP-Bench

MCP-Bench from Accenture Labs connects LLMs to 28 live MCP servers spanning 250 tools across domains including finance, scientific computing, and travel. Accepted to NeurIPS 2025, it tests real multi-hop tool coordination: "book a flight to the cheapest city with a conference this month" requires chaining calendar, flight search, and pricing tools in the correct order. This benchmark is newer and has fewer assessed models than BFCL, but it's the closest thing to a production agentic workload in the published benchmark suite.
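The multi-hop chaining pattern can be sketched with stub tools. The three functions below are hypothetical stand-ins (real MCP servers are called over the protocol), but the ordering constraint is the same one the benchmark stresses:

```python
# Hypothetical stand-ins for three MCP-exposed tools.
def list_conferences(month: str) -> list[str]:
    return ["Berlin", "Austin", "Singapore"]

def search_flights(city: str) -> float:
    return {"Berlin": 420.0, "Austin": 310.0, "Singapore": 980.0}[city]

def book_flight(city: str) -> str:
    return f"booked:{city}"

def cheapest_conference_trip(month: str) -> str:
    """Chain three tools in order: discover options, price each one,
    then act on the best result. An error at any hop invalidates every
    later hop, which is what makes multi-hop coordination hard."""
    cities = list_conferences(month)                 # hop 1: discovery
    prices = {c: search_flights(c) for c in cities}  # hop 2: pricing
    target = min(prices, key=prices.get)             # reason over outputs
    return book_flight(target)                       # hop 3: action

print(cheapest_conference_trip("2026-04"))  # booked:Austin
```

Notice that hop 3 depends on correctly aggregating hop 2's outputs, not just invoking hop 2 - the same tool-output reasoning gap FinTrace measures.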


BFCL v3 Rankings

Data from pricepertoken.com, accessed April 2026. 23 models assessed; average score 55.9.

| Rank | Model | Provider | BFCL v3 Score |
|------|-------|----------|---------------|
| 1 | GLM 4.5 Thinking | Z AI | 76.7% |
| 2 | Qwen3 32B Thinking | Alibaba | 75.7% |
| 2 | Qwen3 32B | Alibaba | 75.7% |
| 4 | Qwen3 Max | Alibaba | 74.9% |
| 5 | GLM-4.7-Flash Thinking | Z AI | 74.6% |
| 5 | GLM-4.7-Flash | Z AI | 74.6% |
| 7 | GLM 4.5 Air | Z AI | 69.1% |
| 8 | Nova Pro 1.0 | Amazon | 67.9% |
| 9 | Kimi K2.5 Thinking | Moonshot AI | 64.5% |
| 9 | Kimi K2.5 | Moonshot AI | 64.5% |
| 11 | INTELLECT-3 | Prime Intellect | 63.5% |
| 12 | Llama 4 Scout | Meta | 55.7% |
| 13 | Gemini 3 Flash Preview Thinking | Google | 53.5% |
| 14 | MiniMax M1 | MiniMax | 47.8% |
| 15 | Nemotron 3 Nano 30B A3B Thinking | NVIDIA | 41.6% |
| 15 | Nemotron 3 Nano 30B A3B | NVIDIA | 41.6% |
| 17 | Phi 4 | Microsoft Azure | 40.8% |
| 18 | Claude Opus 4 Thinking | Anthropic | 25.3% |
| 18 | Claude Opus 4 | Anthropic | 25.3% |
| 18 | Kimi K2 0711 | Moonshot AI | 25.3% |

The Claude Opus 4 result at 25.3% deserves attention. Anthropic's flagship model, which leads multi-turn agentic benchmarks and beats everyone on tau-bench, scores at the bottom of the BFCL v3 table. That isn't a fluke - it reflects a real tradeoff in how the model handles structured formatting under test conditions. BFCL rewards precise, rigid JSON outputs. Claude tends toward conversational wrapping that can trip up AST parsing even when the underlying tool selection is correct. The discrepancy is a known issue worth understanding before drawing conclusions about real-world usability.
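The failure mode is easy to reproduce in miniature. A strict evaluator that parses the whole response fails on conversationally wrapped output even when the embedded call is correct; a tolerant extractor (an illustration, not BFCL's harness - the regex approach here is an assumption) recovers it:

```python
import json
import re

def extract_tool_call(text: str):
    """Pull the first JSON object out of a conversationally wrapped
    reply. Returns the parsed call, or None if nothing parses."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

wrapped = ('Sure! I will look that up for you.\n'
           '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
           'Let me know if you need anything else.')

# Strict parsing of the full response fails; extraction succeeds.
print(extract_tool_call(wrapped)["name"])  # get_weather
```

Whether the evaluator or the model is "wrong" here is a framing question - but it explains how the same model can bottom out one table and top another.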

[Image: code on screen representing API function calls and tool schemas. Caption: BFCL uses Abstract Syntax Tree comparison to evaluate function call correctness - catching subtle structural errors that string matching misses. Source: pexels.com]


tau-bench Rankings

Airline Domain

Data from llm-stats.com, accessed April 2026. 23 models assessed; average score 0.495.

| Rank | Model | Provider | Airline Score |
|------|-------|----------|---------------|
| 1 | Claude Sonnet 4.5 | Anthropic | 0.700 |
| 2 | MiniMax M1 80K | MiniMax | 0.620 |
| 3 | GLM-4.5-Air | Zhipu AI | 0.608 |
| 4 | GLM-4.5 | Zhipu AI | 0.604 |
| 5 | MiniMax M1 40K | MiniMax | 0.600 |
| 5 | Claude Sonnet 4 | Anthropic | 0.600 |
| 5 | Qwen3-Coder 480B A35B | Alibaba | 0.600 |
| 8 | Claude Opus 4 | Anthropic | 0.596 |
| 9 | Claude 3.7 Sonnet | Anthropic | 0.584 |
| 10 | Claude Opus 4.1 | Anthropic | 0.560 |
| 11 | o1 | OpenAI | 0.500 |
| 11 | GPT-4.5 | OpenAI | 0.500 |
| 13 | GPT-4.1 | OpenAI | 0.494 |
| 14 | o4-mini | OpenAI | 0.492 |
| 15 | Qwen3-Next-80B-A3B-Thinking | Alibaba | 0.490 |
| 16 | Qwen3-235B-A22B-Thinking-2507 | Alibaba | 0.460 |
| 17 | GPT-4o | OpenAI | 0.428 |
| 18 | GPT-4.1 mini | OpenAI | 0.360 |
| 19 | GPT-4.1 nano | OpenAI | 0.140 |

Retail Domain

Data from llm-stats.com, accessed April 2026. 25 models assessed; top 15 shown.

| Rank | Model | Provider | Retail Score |
|------|-------|----------|--------------|
| 1 | Claude Sonnet 4.5 | Anthropic | 0.862 |
| 2 | Claude Opus 4.1 | Anthropic | 0.824 |
| 3 | Claude Opus 4 | Anthropic | 0.814 |
| 4 | Claude 3.7 Sonnet | Anthropic | 0.812 |
| 5 | Claude Sonnet 4 | Anthropic | 0.805 |
| 6 | GLM-4.5 | Zhipu AI | 0.797 |
| 7 | GLM-4.5-Air | Zhipu AI | 0.779 |
| 8 | Qwen3-Coder 480B A35B | Alibaba | 0.775 |
| 9 | o4-mini | OpenAI | 0.718 |
| 10 | o1 | OpenAI | 0.708 |
| 11 | Qwen3-Next-80B-A3B-Thinking | Alibaba | 0.696 |
| 12 | Claude 3.5 Sonnet | Anthropic | 0.692 |
| 13 | GPT-4.5 | OpenAI | 0.684 |
| 14 | GPT-4.1 | OpenAI | 0.680 |
| 15 | GPT-4o | OpenAI | 0.603 |

The pattern is consistent: Anthropic holds the top five retail positions, with Claude Sonnet 4.5 posting 0.862 - 0.038 ahead of second-place Claude Opus 4.1. GLM-4.5 at sixth (0.797) is the first non-Anthropic model, and it scores higher than GPT-4.5 (0.684), which is worth noting for cost-sensitive buyers.


tau2-bench Telecom (Extended Domain)

The tau2-bench telecom dataset, tracked at artificialanalysis.ai, extends the original benchmark into telecommunications customer service with more complex policy trees. GPT-5.2 Thinking set a high bar at 98.7%, and GLM models have since pushed the reported ceiling even higher.

| Rank | Model | Provider | Telecom Score |
|------|-------|----------|---------------|
| 1 | GLM-4.7-Flash (Reasoning) | Zhipu AI | 98.8% |
| 2 | GPT-5.2 Thinking | OpenAI | 98.7% |
| 3 | GLM-5V Turbo (Reasoning) | Zhipu AI | 98.5% |
| 3 | GLM-5-Turbo | Zhipu AI | 98.5% |

One important caveat: these telecom scores look spectacular, but the benchmark's specific policy constraints also make it more susceptible to overfitting. No model scored above 49% when the paper was first published. The rapid climb to 98%+ suggests some combination of genuine capability improvement and possible training data exposure. Treat telecom scores as directionally useful, not as definitive.

[Image: a team working on agentic AI workflows and tool orchestration strategy. Caption: multi-turn tool use benchmarks like tau-bench are harder to game than single-turn evaluations because errors compound across multiple conversation turns. Source: pexels.com]


FinTrace Rankings (Financial Tool Calling)

FinTrace, published April 2026 (arXiv:2604.10015), evaluated 13 LLMs across 800 annotated financial task trajectories. The rubric grades nine metrics along four axes - action correctness, execution efficiency, process quality, and output quality.

| Rank | Model | Provider | FinTrace Score |
|------|-------|----------|----------------|
| 1 | Claude Opus 4.6 | Anthropic | 0.788 |
| 2 | Claude Sonnet 4.6 | Anthropic | 0.750 |
| 3 | GPT-5.4 | OpenAI | 0.737 |
| Mid-tier | Gemini 3 Flash | Google | ~0.450 |

The FinTrace authors found that frontier models handle tool selection reliably but consistently struggle with information use and final answer quality. Picking the right function to call isn't the hard part anymore - doing something useful with the result is. That finding holds across all 13 models tested, including the top scorers.


Scale Labs Enterprise Tool Use

For a broader single-turn tool use perspective, Scale Labs runs an enterprise evaluation against 35 models. Top performers as of the most recent snapshot at labs.scale.com/leaderboard/tool_use_enterprise:

| Rank | Model | Provider | Score |
|------|-------|----------|-------|
| 1 | o1 (Dec 2024) | OpenAI | 70.1 ±5.3 |
| 2 | Gemini 2.5 Pro Experimental | Google | 68.8 ±5.4 |
| 3 | o1 Pro | OpenAI | 67.0 ±5.4 |
| 4 | o1-preview | OpenAI | 66.4 ±5.5 |
| 5 | DeepSeek-R1 | DeepSeek | 65.3 ±5.5 |
| 6 | Claude 3.7 Sonnet Thinking | Anthropic | 65.3 ±5.5 |
| 7 | GPT-4o (May 2024) | OpenAI | 64.6 ±5.5 |
| 8 | GPT-4.5 Preview | OpenAI | 63.8 ±5.6 |
| 9 | GPT-4o mini | OpenAI | 51.7 ±5.8 |
| 10 | Llama 3.1 405B Instruct | Meta | 50.4 ±5.8 |

This leaderboard uses older model checkpoints (the most recent version evaluated is from early 2025), so it doesn't reflect current frontier capability. It's still useful as a reference for enterprise buyers comparing older deployments and for seeing how reasoning models (o1, o1 Pro) compare to standard models on structured API tasks.


Key Takeaways

The BFCL-tau-bench split tells the real story

The biggest insight from these numbers is that BFCL and tau-bench don't agree on model ranking - and they shouldn't, because they're testing different things. BFCL rewards single-call precision; tau-bench rewards sustained reliability across a multi-turn conversation. Claude Opus 4 scores 25.3% on BFCL and 0.814 on tau-bench retail. That isn't a contradiction - it means Claude handles multi-turn tool orchestration well but formats outputs in ways that trip up BFCL's AST parser.

For most production use cases, tau-bench scores matter more. Real agents don't make a single tool call and stop.

Open-weight models are competitive on structured calling

GLM-4.5 tops BFCL v3 at 76.7%, and Qwen3 32B is right behind at 75.7%. Both are available open-weight. Anthropic's closed models dominate tau-bench, but for applications that need raw function-call accuracy without the conversational overhead, the open-weight options are genuinely strong. Our guide to running open-source LLMs locally covers setup if you want to benchmark these yourself.

The "thinking" premium is marginal for tool calling

Several model pairs in the BFCL v3 table show identical scores for the base and "Thinking" variant (GLM-4.7-Flash and GLM-4.7-Flash Thinking both score 74.6%; Kimi K2.5 and Kimi K2.5 Thinking both score 64.5%). Extended chain-of-thought doesn't help on well-formed single-call evaluations. It helps on complex multi-step planning - which is what tau-bench and FinTrace measure. So the decision of whether to use a reasoning model should depend on your task structure, not just the model tier.

FinTrace exposes the output problem

The FinTrace finding - that all models struggle with information utilization more than tool selection - points to the next frontier in function calling research. Models have gotten good at choosing the right tool. They haven't gotten comparably good at reasoning over the result before taking the next step. This matters enormously for financial, medical, and legal agent workflows where a tool returns a document and the agent needs to extract the right number from it before continuing.


Practical Guidance

Building a customer service or task automation agent: Use Claude Sonnet 4.5 or Claude Opus 4. Both consistently lead tau-bench across domains, and the multi-turn reliability gap between them and GPT-4.1 is large enough to matter in production. See our review of Claude Opus 4.6 for a deeper look at Anthropic's flagship line.

Structured data extraction or API integration with exact schema requirements: Check the BFCL v3 table and prioritize GLM-4.5 or Qwen3 32B if output format compliance is your bottleneck. If you're running closed APIs through OpenAI, Nova Pro 1.0 (67.9% on BFCL v3) sits well above GPT-4.1 on structured calling.
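Whichever model you pick, a cheap schema gate in front of execution catches most format drift before it reaches your API. A minimal sketch - the `create_invoice` tool and its required-parameter table are hypothetical, and a production system would likely use a full JSON Schema validator instead:

```python
import json

# Hypothetical tool definition: name plus required parameters and types.
TOOL_SCHEMA = {
    "name": "create_invoice",
    "required": {"customer_id": str, "amount": float},
}

def validate_call(raw: str) -> tuple[bool, str]:
    """Gate a model-produced call before executing it: parse the JSON,
    then check required parameters and their types against the schema."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if call.get("name") != TOOL_SCHEMA["name"]:
        return False, "wrong tool"
    args = call.get("arguments", {})
    for key, typ in TOOL_SCHEMA["required"].items():
        if key not in args:
            return False, f"missing {key}"
        if not isinstance(args[key], typ):
            return False, f"{key} has wrong type"
    return True, "ok"

ok, msg = validate_call('{"name": "create_invoice", '
                        '"arguments": {"customer_id": "C42", "amount": 99.5}}')
print(ok, msg)  # True ok
```

Rejecting a malformed call and re-prompting is usually cheaper than debugging a silently wrong downstream API write.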

Multi-hop tool chains (MCP, complex pipelines): MCP-Bench data is limited, but its findings indicate that cross-tool coordination and parameter precision are still hard problems regardless of model tier. Our guide on what MCP is and how to use it covers the protocol layer if you're designing the tool interface rather than just selecting a model.

Financial or high-stakes domain tool use: FinTrace scores put Claude Opus 4.6 (0.788) and Claude Sonnet 4.6 (0.750) ahead of GPT-5.4 (0.737). The gap isn't large, and the benchmark itself is new enough that you should treat these numbers as signals rather than verdicts. Fine-tune on domain-specific tool trajectories if stakes are high - FinTrace-Training (8,196 annotated trajectories) is public.


Methodology Notes and Caveats

A few things to keep in mind when interpreting these tables:

BFCL v3 evaluates models as of the snapshot date and doesn't account for system prompt changes, context windows, or whether the model is being called with tool-use instructions or not. Provider settings matter.

Tau-bench is stochastic - it uses an LLM to simulate the user, and scores vary across runs. The Pass@k metric helps, but single published numbers should carry error bars that most leaderboard aggregators don't show. The tau-bench airline and retail numbers above come from llm-stats.com's snapshot and may not match Sierra Research's official site for newer models.

FinTrace and MCP-Bench are 2025-2026 publications with limited model coverage. They're worth watching as evaluation harnesses, but neither yet has the breadth of BFCL.

The Scale Labs enterprise leaderboard has excellent methodology documentation but runs on older model versions. Don't use it to compare GPT-5 vs. Claude 4.x.


FAQ

Which model is best for function calling overall?

No single model leads every benchmark. For multi-turn tool use, Claude Sonnet 4.5. For single-call format precision, GLM 4.5. For financial workflows, Claude Opus 4.6.

Why does Claude score so low on BFCL but high on tau-bench?

BFCL uses AST parsing to check output format. Claude wraps responses conversationally, which fails the parser even when tool selection is correct. Tau-bench measures task completion, where Claude's reasoning over outputs is an advantage.

Are open-weight models competitive with closed APIs for tool use?

On BFCL v3, yes - GLM 4.5 and Qwen3 32B beat every closed API in the table. On tau-bench multi-turn tasks, Anthropic closed models hold a consistent lead, though GLM-4.5 runs competitively at rank 3-4.

How often do these rankings change?

BFCL updates with new model submissions irregularly - major frontier releases tend to appear within weeks of launch. Tau-bench and FinTrace are less frequently updated. Check the source leaderboards linked in the Sources section for the latest snapshots.

What's the difference between tau-bench and tau2-bench?

The original tau-bench covers airline and retail customer service. Tau2-bench adds a telecom domain with more complex policy constraints. Scores aren't comparable across domains.

Does function calling performance transfer to MCP tool use?

Partially. BFCL and tau-bench measure direct function calls in controlled setups. MCP-Bench adds tool discovery and cross-server coordination. Models that score well on BFCL don't automatically handle MCP workflows well, according to the Accenture paper's findings.


Sources:

✓ Last verified April 17, 2026

About the author
AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.