Function Calling Benchmarks Leaderboard 2026

Rankings of top LLMs on function calling and tool use benchmarks including BFCL v3, tau-bench, ToolBench, and FinTrace as of April 2026.


Function calling is one of the capabilities that matter most for anyone building real AI applications - and it's one of the least straightforward to assess. A model that gets 90% on general reasoning can fall apart the moment you ask it to correctly format a nested JSON payload, choose between two semantically similar tools, or recover gracefully when an API returns an error mid-task.

This leaderboard tracks how the major frontier models actually perform on dedicated function calling and tool-use benchmarks. We cover five distinct evaluations: BFCL v3 (single and multi-turn structured calls), tau-bench (multi-turn tool use in realistic customer service scenarios), ToolBench (multi-step real-world API tasks), FinTrace (long-horizon financial tool calling), and MCP-Bench (agentic tool use via Model Context Protocol). Each captures something the others miss.

If you're coming from our agentic AI benchmarks leaderboard, the coverage here is more focused: we're looking specifically at the tool-calling layer, not the full agent trajectory.

TL;DR

  • GLM 4.5 leads BFCL v3 at 76.7%, ahead of Qwen3 32B (75.7%) - Claude Opus 4 scores a surprisingly low 25.3%
  • Claude Sonnet 4.5 controls tau-bench with 0.700 on airline tasks and 0.862 on retail - the strongest multi-turn tool use score published
  • GPT-5.2 Thinking hits 98.7% on tau2-bench telecom, though that benchmark's narrow scope inflates the headline number
  • Budget pick: Qwen3 235B A22B (74.9% BFCL v3) runs open-weight and trades near-frontier scores for zero API costs

The Benchmarks Explained

Before reading the tables, it helps to know what each benchmark actually tests - because the gap between a 76% BFCL score and a 70% tau-bench score isn't just a number difference. They measure different things entirely.

BFCL v3 (Berkeley Function Calling Leaderboard)

The Berkeley Function Calling Leaderboard from UC Berkeley's Sky Computing Lab is the most widely cited tool-use benchmark. Version 3 added multi-turn interactions on top of the v2 static evaluation. The dataset contains over 2,000 question-function-answer pairs spanning Python, Java, JavaScript, and REST APIs, testing six categories: simple single calls, parallel calls (multiple functions fired simultaneously), multiple function selection, relevance detection (knowing when to refuse a tool call), multi-turn interactions, and multi-step reasoning.

The evaluation method matters here. BFCL uses Abstract Syntax Tree (AST) comparison to check structural correctness of the produced call, not just string matching. That catches paraphrasing games where a model rewrites the function name slightly and a naive eval would still pass it. BFCL v4 has since added web search and memory tasks, but v3 remains the most-assessed version for cross-model comparison. Version 4 scores aren't broadly available yet across 2026 frontier models.
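To make the AST idea concrete, here is a minimal sketch of structural call comparison using Python's `ast` module. It is an illustration of the technique, not BFCL's actual harness, and the `calls_match` helper and the `get_weather` tool are hypothetical:

```python
import ast

def calls_match(candidate: str, reference: str) -> bool:
    """Compare two function-call strings structurally, the way an
    AST-based eval can, rather than by raw string match."""
    try:
        cand = ast.parse(candidate, mode="eval").body
        ref = ast.parse(reference, mode="eval").body
    except SyntaxError:
        return False
    if not (isinstance(cand, ast.Call) and isinstance(ref, ast.Call)):
        return False
    # The function name must match exactly.
    if ast.dump(cand.func) != ast.dump(ref.func):
        return False
    # Positional arguments must match in order.
    if [ast.dump(a) for a in cand.args] != [ast.dump(a) for a in ref.args]:
        return False
    # Keyword arguments must match regardless of order or spacing.
    cand_kwargs = {k.arg: ast.dump(k.value) for k in cand.keywords}
    ref_kwargs = {k.arg: ast.dump(k.value) for k in ref.keywords}
    return cand_kwargs == ref_kwargs

# Same call, different argument order and spacing: structurally equal.
print(calls_match(
    "get_weather(city='Paris', units='metric')",
    "get_weather(units = 'metric', city = 'Paris')",
))  # True
# A renamed function fails even though a fuzzy string match might pass it.
print(calls_match("get_weather(city='Paris')", "getWeather(city='Paris')"))  # False
```

Note what this buys you over string comparison: whitespace, quoting style, and keyword-argument order all stop mattering, while a subtly renamed function or dropped parameter still fails.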

tau-bench (Tool-Agent-User Interaction)

tau-bench from Sierra Research simulates live customer service scenarios. An AI agent is given a set of API tools and must complete multi-turn conversations with a simulated user - resolving requests like fare changes, return windows, and account modifications while following strict policy guidelines. A Pass@k metric measures how consistently the agent succeeds across k repeated runs. That penalizes brittleness: a model with 80% single-turn accuracy might only complete multi-turn scenarios reliably 50% of the time because errors compound.

There are two primary domains - airline and retail - with a newer telecom variant (tau2-bench) from the same team. Scores across domains aren't directly comparable because task complexity differs, so we report them separately.
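The consistency metric can be sketched numerically. Assuming the standard combinatorial estimator for "all k of k runs succeed" (the form the tau-bench paper uses for its pass^k-style reporting), a per-task score looks like this:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that k independent runs of
    the same task ALL succeed, given c successes observed in n trials."""
    if k > n:
        raise ValueError("k cannot exceed number of trials n")
    return comb(c, k) / comb(n, k)

# A task solved in 8 of 10 trials looks strong at k=1 but decays fast,
# which is exactly the brittleness penalty described above:
print(round(pass_hat_k(10, 8, 1), 3))  # 0.8
print(round(pass_hat_k(10, 8, 4), 3))  # 0.333
```

This is why a model with decent single-run accuracy can post a much lower multi-run consistency score: the metric compounds failure probability across repeated attempts.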

ToolBench and ToolLLM

ToolBench, published by Tsinghua's NLP lab and accepted at ICLR 2024, tests models against over 120,000 instruction-API pairs drawn from 16,000+ real-world APIs across 49 categories. It stresses tool generalization: models are assessed on unseen tools, unseen instructions, and unseen API categories to test whether they've truly learned to use APIs or just memorized training examples. The ToolLLM paper describes the fine-tuned model trained on ToolBench data; the benchmark itself runs any model through the same evaluation harness.

FinTrace

Published in April 2026, FinTrace from Carnegie Mellon and a consortium of financial AI researchers targets a specific failure mode: models that pick the right tool but then fail to use the output correctly. The benchmark contains 800 expert-annotated trajectories across 34 financial task categories, scored across nine metrics along four axes - action correctness, execution efficiency, process quality, and output quality. It's the only benchmark in this roundup that explicitly measures whether the model reasons well about tool outputs, not just whether it invokes the right function.
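A rough sketch of how multi-axis scoring like this composes - the metric names and the equal-weight averaging below are invented for illustration; the paper defines its own nine metrics and weighting:

```python
# Hypothetical axis/metric layout, loosely mirroring FinTrace's four axes.
AXES = {
    "action_correctness": ["tool_choice", "argument_accuracy"],
    "execution_efficiency": ["step_count", "redundant_calls"],
    "process_quality": ["error_recovery", "plan_coherence"],
    "output_quality": ["information_use", "answer_accuracy", "format"],
}

def trajectory_score(metrics: dict[str, float]) -> float:
    """Average per-metric scores within each axis, then average the
    axes, so no single axis dominates just by having more metrics."""
    axis_scores = [
        sum(metrics[m] for m in members) / len(members)
        for members in AXES.values()
    ]
    return sum(axis_scores) / len(axis_scores)

perfect = {m: 1.0 for members in AXES.values() for m in members}
print(trajectory_score(perfect))  # 1.0
```

The structural point survives the invented names: a model can ace `tool_choice` and still post a mediocre overall score if the output-quality axis drags, which is exactly the failure mode the benchmark was built to surface.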

MCP-Bench

MCP-Bench from Accenture Labs connects LLMs to 28 live MCP servers spanning 250 tools across domains including finance, scientific computing, and travel. Accepted to NeurIPS 2025, it tests real multi-hop tool coordination: "book a flight to the cheapest city with a conference this month" requires chaining calendar, flight search, and pricing tools in the correct order. This benchmark is newer and has fewer assessed models than BFCL, but it's the closest thing to a production agentic workload in the published benchmark suite.
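The multi-hop chaining pattern can be sketched with stub tools. The three functions below are hypothetical stand-ins (real MCP servers are called over the protocol), but the ordering constraint is the same one the benchmark stresses:

```python
# Hypothetical stand-ins for three MCP-exposed tools.
def list_conferences(month: str) -> list[str]:
    return ["Berlin", "Austin", "Singapore"]

def search_flights(city: str) -> float:
    return {"Berlin": 420.0, "Austin": 310.0, "Singapore": 980.0}[city]

def book_flight(city: str) -> str:
    return f"booked:{city}"

def cheapest_conference_trip(month: str) -> str:
    """Chain three tools in order: discover options, price each one,
    then act on the best result. An error at any hop invalidates every
    later hop, which is what makes multi-hop coordination hard."""
    cities = list_conferences(month)                 # hop 1: discovery
    prices = {c: search_flights(c) for c in cities}  # hop 2: pricing
    target = min(prices, key=prices.get)             # reason over outputs
    return book_flight(target)                       # hop 3: action

print(cheapest_conference_trip("2026-04"))  # booked:Austin
```

Notice that hop 3 depends on correctly aggregating hop 2's outputs, not just invoking hop 2 - the same tool-output reasoning gap FinTrace measures.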


BFCL v3 Rankings

Data from pricepertoken.com, accessed April 2026. 23 models assessed; average score 55.9.

| Rank | Model | Provider | BFCL v3 Score |
|------|-------|----------|---------------|
| 1 | GLM 4.5 Thinking | Z AI | 76.7% |
| 2 | Qwen3 32B Thinking | Alibaba | 75.7% |
| 2 | Qwen3 32B | Alibaba | 75.7% |
| 4 | Qwen3 Max | Alibaba | 74.9% |
| 5 | GLM-4.7-Flash Thinking | Z AI | 74.6% |
| 5 | GLM-4.7-Flash | Z AI | 74.6% |
| 7 | GLM 4.5 Air | Z AI | 69.1% |
| 8 | Nova Pro 1.0 | Amazon | 67.9% |
| 9 | Kimi K2.5 Thinking | Moonshot AI | 64.5% |
| 9 | Kimi K2.5 | Moonshot AI | 64.5% |
| 11 | INTELLECT-3 | Prime Intellect | 63.5% |
| 12 | Llama 4 Scout | Meta | 55.7% |
| 13 | Gemini 3 Flash Preview Thinking | Google | 53.5% |
| 14 | MiniMax M1 | MiniMax | 47.8% |
| 15 | Nemotron 3 Nano 30B A3B Thinking | NVIDIA | 41.6% |
| 15 | Nemotron 3 Nano 30B A3B | NVIDIA | 41.6% |
| 17 | Phi 4 | Microsoft Azure | 40.8% |
| 18 | Claude Opus 4 Thinking | Anthropic | 25.3% |
| 18 | Claude Opus 4 | Anthropic | 25.3% |
| 18 | Kimi K2 0711 | Moonshot AI | 25.3% |

The Claude Opus 4 result at 25.3% deserves attention. Anthropic's flagship model, which leads multi-turn agentic benchmarks and beats everyone on tau-bench, scores at the bottom of the BFCL v3 table. That isn't a fluke - it reflects a real tradeoff in how the model handles structured formatting under test conditions. BFCL rewards precise, rigid JSON outputs. Claude tends toward conversational wrapping that can trip up AST parsing even when the underlying tool selection is correct. The discrepancy is a known issue worth understanding before drawing conclusions about real-world usability.
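The failure mode is easy to reproduce in miniature. A strict evaluator that parses the whole response fails on conversationally wrapped output even when the embedded call is correct; a tolerant extractor (an illustration, not BFCL's harness - the regex approach here is an assumption) recovers it:

```python
import json
import re

def extract_tool_call(text: str):
    """Pull the first JSON object out of a conversationally wrapped
    reply. Returns the parsed call, or None if nothing parses."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

wrapped = ('Sure! I will look that up for you.\n'
           '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
           'Let me know if you need anything else.')

# Strict parsing of the full response fails; extraction succeeds.
print(extract_tool_call(wrapped)["name"])  # get_weather
```

Whether the evaluator or the model is "wrong" here is a framing question - but it explains how the same model can bottom out one table and top another.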

[Image: code on screen representing API function calls and tool schemas. Caption: BFCL uses Abstract Syntax Tree comparison to evaluate function call correctness - catching subtle structural errors that string matching misses. Source: pexels.com]


tau-bench Rankings

Airline Domain

Data from llm-stats.com, accessed April 2026. 23 models assessed; average score 0.495.

| Rank | Model | Provider | Airline Score |
|------|-------|----------|---------------|
| 1 | Claude Sonnet 4.5 | Anthropic | 0.700 |
| 2 | MiniMax M1 80K | MiniMax | 0.620 |
| 3 | GLM-4.5-Air | Zhipu AI | 0.608 |
| 4 | GLM-4.5 | Zhipu AI | 0.604 |
| 5 | MiniMax M1 40K | MiniMax | 0.600 |
| 5 | Claude Sonnet 4 | Anthropic | 0.600 |
| 5 | Qwen3-Coder 480B A35B | Alibaba | 0.600 |
| 8 | Claude Opus 4 | Anthropic | 0.596 |
| 9 | Claude 3.7 Sonnet | Anthropic | 0.584 |
| 10 | Claude Opus 4.1 | Anthropic | 0.560 |
| 11 | o1 | OpenAI | 0.500 |
| 11 | GPT-4.5 | OpenAI | 0.500 |
| 13 | GPT-4.1 | OpenAI | 0.494 |
| 14 | o4-mini | OpenAI | 0.492 |
| 15 | Qwen3-Next-80B-A3B-Thinking | Alibaba | 0.490 |
| 16 | Qwen3-235B-A22B-Thinking-2507 | Alibaba | 0.460 |
| 17 | GPT-4o | OpenAI | 0.428 |
| 18 | GPT-4.1 mini | OpenAI | 0.360 |
| 19 | GPT-4.1 nano | OpenAI | 0.140 |

Retail Domain

Data from llm-stats.com, accessed April 2026. 25 models assessed; top 15 shown.

| Rank | Model | Provider | Retail Score |
|------|-------|----------|--------------|
| 1 | Claude Sonnet 4.5 | Anthropic | 0.862 |
| 2 | Claude Opus 4.1 | Anthropic | 0.824 |
| 3 | Claude Opus 4 | Anthropic | 0.814 |
| 4 | Claude 3.7 Sonnet | Anthropic | 0.812 |
| 5 | Claude Sonnet 4 | Anthropic | 0.805 |
| 6 | GLM-4.5 | Zhipu AI | 0.797 |
| 7 | GLM-4.5-Air | Zhipu AI | 0.779 |
| 8 | Qwen3-Coder 480B A35B | Alibaba | 0.775 |
| 9 | o4-mini | OpenAI | 0.718 |
| 10 | o1 | OpenAI | 0.708 |
| 11 | Qwen3-Next-80B-A3B-Thinking | Alibaba | 0.696 |
| 12 | Claude 3.5 Sonnet | Anthropic | 0.692 |
| 13 | GPT-4.5 | OpenAI | 0.684 |
| 14 | GPT-4.1 | OpenAI | 0.680 |
| 15 | GPT-4o | OpenAI | 0.603 |

The pattern is consistent: Anthropic holds the top five retail positions, with Claude Sonnet 4.5 posting 0.862 - 0.038 ahead of second-place Claude Opus 4.1. GLM-4.5 at sixth (0.797) is the first non-Anthropic model, and it scores higher than GPT-4.5 (0.684), which is worth noting for cost-sensitive buyers.


tau2-bench Telecom (Extended Domain)

The tau2-bench telecom dataset, tracked at artificialanalysis.ai, extends the original benchmark into telecommunications customer service with more complex policy trees. GPT-5.2 Thinking set a high bar at 98.7%, and GLM models have since pushed the reported ceiling even higher.

| Rank | Model | Provider | Telecom Score |
|------|-------|----------|---------------|
| 1 | GLM-4.7-Flash (Reasoning) | Zhipu AI | 98.8% |
| 2 | GPT-5.2 Thinking | OpenAI | 98.7% |
| 3 | GLM-5V Turbo (Reasoning) | Zhipu AI | 98.5% |
| 3 | GLM-5-Turbo | Zhipu AI | 98.5% |

One important caveat: these telecom scores look spectacular, but the benchmark's specific policy constraints also make it more susceptible to overfitting. No model scored above 49% when the paper was first published. The rapid climb to 98%+ suggests some combination of genuine capability improvement and possible training data exposure. Treat telecom scores as directionally useful, not as definitive.

[Image: a team working on agentic AI workflows and tool orchestration strategy. Caption: multi-turn tool use benchmarks like tau-bench are harder to game than single-turn evaluations because errors compound across multiple conversation turns. Source: pexels.com]


FinTrace Rankings (Financial Tool Calling)

FinTrace, published April 2026 (arXiv:2604.10015), evaluated 13 LLMs across 800 annotated financial task trajectories. The rubric grades nine metrics along four axes - action correctness, execution efficiency, process quality, and output quality.

| Rank | Model | Provider | FinTrace Score |
|------|-------|----------|----------------|
| 1 | Claude Opus 4.6 | Anthropic | 0.788 |
| 2 | Claude Sonnet 4.6 | Anthropic | 0.750 |
| 3 | GPT-5.4 | OpenAI | 0.737 |
| Mid-tier | Gemini 3 Flash | Google | ~0.450 |

The FinTrace authors found that frontier models handle tool selection reliably but consistently struggle with information use and final answer quality. Picking the right function to call isn't the hard part anymore - doing something useful with the result is. That finding holds across all 13 models tested, including the top scorers.


Scale Labs Enterprise Tool Use

For a broader single-turn tool use perspective, Scale Labs runs an enterprise evaluation against 35 models. Top performers as of the most recent snapshot at labs.scale.com/leaderboard/tool_use_enterprise:

| Rank | Model | Provider | Score |
|------|-------|----------|-------|
| 1 | o1 (Dec 2024) | OpenAI | 70.1 ±5.3 |
| 2 | Gemini 2.5 Pro Experimental | Google | 68.8 ±5.4 |
| 3 | o1 Pro | OpenAI | 67.0 ±5.4 |
| 4 | o1-preview | OpenAI | 66.4 ±5.5 |
| 5 | DeepSeek-R1 | DeepSeek | 65.3 ±5.5 |
| 6 | Claude 3.7 Sonnet Thinking | Anthropic | 65.3 ±5.5 |
| 7 | GPT-4o (May 2024) | OpenAI | 64.6 ±5.5 |
| 8 | GPT-4.5 Preview | OpenAI | 63.8 ±5.6 |
| 9 | GPT-4o mini | OpenAI | 51.7 ±5.8 |
| 10 | Llama 3.1 405B Instruct | Meta | 50.4 ±5.8 |

This leaderboard uses older model checkpoints (the most recent version evaluated is from early 2025), so it doesn't reflect current frontier capability. It's still useful as a reference for enterprise buyers comparing older deployments and for seeing how reasoning models (o1, o1 Pro) compare to standard models on structured API tasks.


Key Takeaways

The BFCL-tau-bench split tells the real story

The biggest insight from these numbers is that BFCL and tau-bench don't agree on model ranking - and they shouldn't, because they're testing different things. BFCL rewards single-call precision; tau-bench rewards sustained reliability across a multi-turn conversation. Claude Opus 4 scores 25.3% on BFCL and 0.814 on tau-bench retail. That isn't a contradiction - it means Claude handles multi-turn tool orchestration well but formats outputs in ways that trip up BFCL's AST parser.

For most production use cases, tau-bench scores matter more. Real agents don't make a single tool call and stop.

Open-weight models are competitive on structured calling

GLM-4.5 tops BFCL v3 at 76.7%, and Qwen3 32B is right behind at 75.7%. Both are available open-weight. Anthropic's closed models dominate tau-bench, but for applications that need raw function-call accuracy without the conversational overhead, the open-weight options are genuinely strong. Our guide to running open-source LLMs locally covers setup if you want to benchmark these yourself.

The "thinking" premium is marginal for tool calling

Several model pairs in the BFCL v3 table show identical scores for the base and "Thinking" variant (GLM-4.7-Flash and GLM-4.7-Flash Thinking both score 74.6%; Kimi K2.5 and Kimi K2.5 Thinking both score 64.5%). Extended chain-of-thought doesn't help on well-formed single-call evaluations. It helps on complex multi-step planning - which is what tau-bench and FinTrace measure. So the decision of whether to use a reasoning model should depend on your task structure, not just the model tier.

FinTrace exposes the output problem

The FinTrace finding - that all models struggle with information utilization more than tool selection - points to the next frontier in function calling research. Models have gotten good at choosing the right tool. They haven't gotten comparably good at reasoning over the result before taking the next step. This matters enormously for financial, medical, and legal agent workflows where a tool returns a document and the agent needs to extract the right number from it before continuing.


Practical Guidance

Building a customer service or task automation agent: Use Claude Sonnet 4.5 or Claude Opus 4. Both consistently lead tau-bench across domains, and the multi-turn reliability gap between them and GPT-4.1 is large enough to matter in production. See our review of Claude Opus 4.6 for a deeper look at Anthropic's flagship line.

Structured data extraction or API integration with exact schema requirements: Check the BFCL v3 table and prioritize GLM-4.5 or Qwen3 32B if output format compliance is your bottleneck. If you're running closed APIs through OpenAI, Nova Pro 1.0 (67.9% on BFCL v3) sits well above GPT-4.1 on structured calling.
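Whichever model you pick, a cheap schema gate in front of execution catches most format drift before it reaches your API. A minimal sketch - the `create_invoice` tool and its required-parameter table are hypothetical, and a production system would likely use a full JSON Schema validator instead:

```python
import json

# Hypothetical tool definition: name plus required parameters and types.
TOOL_SCHEMA = {
    "name": "create_invoice",
    "required": {"customer_id": str, "amount": float},
}

def validate_call(raw: str) -> tuple[bool, str]:
    """Gate a model-produced call before executing it: parse the JSON,
    then check required parameters and their types against the schema."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if call.get("name") != TOOL_SCHEMA["name"]:
        return False, "wrong tool"
    args = call.get("arguments", {})
    for key, typ in TOOL_SCHEMA["required"].items():
        if key not in args:
            return False, f"missing {key}"
        if not isinstance(args[key], typ):
            return False, f"{key} has wrong type"
    return True, "ok"

ok, msg = validate_call('{"name": "create_invoice", '
                        '"arguments": {"customer_id": "C42", "amount": 99.5}}')
print(ok, msg)  # True ok
```

Rejecting a malformed call and re-prompting is usually cheaper than debugging a silently wrong downstream API write.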

Multi-hop tool chains (MCP, complex pipelines): MCP-Bench data is limited, but its findings indicate that cross-tool coordination and parameter precision are still hard problems regardless of model tier. Our guide on what MCP is and how to use it covers the protocol layer if you're designing the tool interface rather than just selecting a model.

Financial or high-stakes domain tool use: FinTrace scores put Claude Opus 4.6 (0.788) and Claude Sonnet 4.6 (0.750) ahead of GPT-5.4 (0.737). The gap isn't large, and the benchmark itself is new enough that you should treat these numbers as signals rather than verdicts. Fine-tune on domain-specific tool trajectories if stakes are high - FinTrace-Training (8,196 annotated trajectories) is public.


Methodology Notes and Caveats

A few things to keep in mind when interpreting these tables:

BFCL v3 evaluates models as of the snapshot date and doesn't account for system prompt changes, context windows, or whether the model is being called with tool-use instructions or not. Provider settings matter.

Tau-bench is stochastic - it uses an LLM to simulate the user, and scores vary across runs. The Pass@k metric helps, but single published numbers should carry error bars that most leaderboard aggregators don't show. The tau-bench airline and retail numbers above come from llm-stats.com's snapshot and may not match Sierra Research's official site for newer models.

FinTrace and MCP-Bench are 2025-2026 publications with limited model coverage. They're worth watching as evaluation harnesses, but neither yet has the breadth of BFCL.

The Scale Labs enterprise leaderboard has excellent methodology documentation but runs on older model versions. Don't use it to compare GPT-5 vs. Claude 4.x.


FAQ

Which model is best for function calling overall?

No single model leads every benchmark. For multi-turn tool use, Claude Sonnet 4.5. For single-call format precision, GLM 4.5. For financial workflows, Claude Opus 4.6.

Why does Claude score so low on BFCL but high on tau-bench?

BFCL uses AST parsing to check output format. Claude wraps responses conversationally, which fails the parser even when tool selection is correct. Tau-bench measures task completion, where Claude's reasoning over outputs is an advantage.

Are open-weight models competitive with closed APIs for tool use?

On BFCL v3, yes - GLM 4.5 and Qwen3 32B beat every closed API in the table. On tau-bench multi-turn tasks, Anthropic closed models hold a consistent lead, though GLM-4.5 runs competitively at rank 3-4.

How often do these rankings change?

BFCL updates with new model submissions irregularly - major frontier releases tend to appear within weeks of launch. Tau-bench and FinTrace are less frequently updated. Check the source leaderboards linked in the Sources section for the latest snapshots.

What's the difference between tau-bench and tau2-bench?

The original tau-bench covers airline and retail customer service. Tau2-bench adds a telecom domain with more complex policy constraints. Scores aren't comparable across domains.

Does function calling performance transfer to MCP tool use?

Partially. BFCL and tau-bench measure direct function calls in controlled setups. MCP-Bench adds tool discovery and cross-server coordination. Models that score well on BFCL don't automatically handle MCP workflows well, according to the Accenture paper's findings.


Sources:

✓ Last verified April 17, 2026

About the author
AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.