Function Calling Benchmarks Leaderboard 2026
Rankings of top LLMs on function calling and tool use benchmarks including BFCL v3, tau-bench, ToolBench, and FinTrace as of April 2026.

Function calling is one of the capabilities that matters most for anyone building real AI applications - and it's one of the least straightforward to assess. A model that gets 90% on general reasoning can fall apart the moment you ask it to correctly format a nested JSON payload, choose between two semantically similar tools, or recover gracefully when an API returns an error mid-task.
This leaderboard tracks how the major frontier models actually perform on dedicated function calling and tool-use benchmarks. We cover five distinct evaluations: BFCL v3 (single and multi-turn structured calls), tau-bench (multi-turn tool use in realistic customer service scenarios), ToolBench (multi-step real-world API tasks), FinTrace (long-horizon financial tool calling), and MCP-Bench (agentic tool use via Model Context Protocol). Each captures something the others miss.
If you're coming from our agentic AI benchmarks leaderboard, the coverage here is more focused: we're looking specifically at the tool-calling layer, not the full agent trajectory.
TL;DR
- GLM 4.5 leads BFCL v3 at 76.7%, ahead of Qwen3 32B (75.7%) - Claude Opus 4 scores a surprisingly low 25.3%
- Claude Sonnet 4.5 controls tau-bench with 0.700 on airline tasks and 0.862 on retail - the strongest multi-turn tool use score published
- GPT-5.2 Thinking hits 98.7% on tau2-bench telecom, though that benchmark's narrow scope inflates the headline number
- Budget pick: Qwen3 32B (75.7% BFCL v3) runs open-weight and trades near-frontier scores for zero API costs
The Benchmarks Explained
Before reading the tables, it helps to know what each benchmark actually tests - because the gap between a 76% BFCL score and a 70% tau-bench score isn't just a number difference. They measure different things entirely.
BFCL v3 (Berkeley Function Calling Leaderboard)
The Berkeley Function Calling Leaderboard from UC Berkeley's Sky Computing Lab is the most widely cited tool-use benchmark. Version 3 added multi-turn interactions on top of the v2 static evaluation. The dataset contains over 2,000 question-function-answer pairs spanning Python, Java, JavaScript, and REST APIs, testing six categories: simple single calls, parallel calls (multiple functions fired simultaneously), multiple function selection, relevance detection (knowing when to refuse a tool call), multi-turn interactions, and multi-step reasoning.
The evaluation method matters here. BFCL uses Abstract Syntax Tree (AST) comparison to check structural correctness of the produced call, not just string matching. That catches paraphrasing games where a model rewrites the function name slightly and a naive eval would still pass it. BFCL v4 has since added web search and memory tasks, but v3 remains the most-assessed version for cross-model comparison. Version 4 scores aren't broadly available yet across 2026 frontier models.
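To make the AST idea concrete, here is a minimal sketch of structural call comparison in Python. This is an illustration of the technique, not BFCL's actual harness: it parses both calls with the standard `ast` module and compares the function name plus keyword arguments, so whitespace and argument order don't matter, but a renamed function fails.

```python
import ast

def calls_match(expected: str, produced: str) -> bool:
    """Compare two function calls structurally, not as strings.

    Toy illustration of AST-based checking: parse both calls and
    compare function name and keyword arguments, ignoring
    whitespace and argument order.
    """
    def parse_call(src: str):
        node = ast.parse(src, mode="eval").body
        if not isinstance(node, ast.Call):
            raise ValueError("not a function call")
        name = ast.unparse(node.func)  # requires Python 3.9+
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        return name, kwargs

    return parse_call(expected) == parse_call(produced)

# Same call, different formatting: AST comparison passes both.
print(calls_match(
    'get_weather(city="Paris", units="metric")',
    'get_weather( units = "metric", city = "Paris" )',
))  # True

# Renamed function: string similarity is high, but the AST check fails.
print(calls_match(
    'get_weather(city="Paris", units="metric")',
    'fetch_weather(city="Paris", units="metric")',
))  # False
```

This is the "paraphrasing game" defense in miniature: the check cares about what the call *is*, not what it looks like.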
tau-bench (Tool-Agent-User Interaction)
tau-bench from Sierra Research simulates live customer service scenarios. An AI agent is given a set of API tools and must complete multi-turn conversations with a simulated user - resolving requests like fare changes, return windows, and account modifications while following strict policy guidelines. A Pass@k metric measures how consistently the agent succeeds across k repeated runs. That penalizes brittleness: a model with 80% single-turn accuracy might only complete multi-turn scenarios reliably 50% of the time because errors compound.
There are two primary domains - airline and retail - with a newer telecom variant (tau2-bench) from the same team. Scores across domains aren't directly comparable because task complexity differs, so we report them separately.
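The brittleness penalty is easy to see with a back-of-envelope estimator. The sketch below (an illustration of the pass^k idea, not Sierra's evaluation code) computes the unbiased estimate of the chance that k fresh runs all succeed, given an observed success count:

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Unbiased estimate of the probability that k fresh runs all
    succeed, given `successes` out of `trials` observed runs.
    Illustrative sketch of the pass^k idea used by tau-bench."""
    if k > trials:
        raise ValueError("need at least k trials")
    return comb(successes, k) / comb(trials, k)

# A model that solves a task in 6 of 8 runs looks fine at k=1,
# but much worse when you demand success across 4 runs:
print(pass_hat_k(6, 8, 1))            # 0.75
print(round(pass_hat_k(6, 8, 4), 3))  # C(6,4)/C(8,4) = 15/70 ≈ 0.214
```

A 75% per-run model drops to roughly 21% when reliability across four runs is required, which is exactly the compounding-error effect described above.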
ToolBench and ToolLLM
ToolBench, published by Tsinghua's NLP lab and accepted at ICLR 2024, tests models against over 120,000 instruction-API pairs drawn from 16,000+ real-world APIs across 49 categories. It stresses tool generalization: models are assessed on unseen tools, unseen instructions, and unseen API categories to test whether they've truly learned to use APIs or just memorized training examples. The ToolLLM paper describes the fine-tuned model trained on ToolBench data; the benchmark itself runs any model through the same evaluation harness.
FinTrace
Published in April 2026, FinTrace from Carnegie Mellon and a consortium of financial AI researchers targets a specific failure mode: models that pick the right tool but then fail to use the output correctly. The benchmark contains 800 expert-annotated trajectories across 34 financial task categories, scored across nine metrics along four axes - action correctness, execution efficiency, process quality, and output quality. It's the only benchmark in this roundup that explicitly measures whether the model reasons well about tool outputs, not just whether it invokes the right function.
MCP-Bench
MCP-Bench from Accenture Labs connects LLMs to 28 live MCP servers spanning 250 tools across domains including finance, scientific computing, and travel. Accepted to NeurIPS 2025, it tests real multi-hop tool coordination: "book a flight to the cheapest city with a conference this month" requires chaining calendar, flight search, and pricing tools in the correct order. This benchmark is newer and has fewer assessed models than BFCL, but it's the closest thing to a production agentic workload in the published benchmark suite.
BFCL v3 Rankings
Data from pricepertoken.com, accessed April 2026. 23 models assessed; average score 55.9.
| Rank | Model | Provider | BFCL v3 Score |
|---|---|---|---|
| 1 | GLM 4.5 Thinking | Z AI | 76.7% |
| 2 | Qwen3 32B Thinking | Alibaba | 75.7% |
| 2 | Qwen3 32B | Alibaba | 75.7% |
| 4 | Qwen3 Max | Alibaba | 74.9% |
| 5 | GLM-4.7-Flash Thinking | Z AI | 74.6% |
| 5 | GLM-4.7-Flash | Z AI | 74.6% |
| 7 | GLM 4.5 Air | Z AI | 69.1% |
| 8 | Nova Pro 1.0 | Amazon | 67.9% |
| 9 | Kimi K2.5 Thinking | Moonshot AI | 64.5% |
| 9 | Kimi K2.5 | Moonshot AI | 64.5% |
| 11 | INTELLECT-3 | Prime Intellect | 63.5% |
| 12 | Llama 4 Scout | Meta | 55.7% |
| 13 | Gemini 3 Flash Preview Thinking | Google | 53.5% |
| 14 | MiniMax M1 | MiniMax | 47.8% |
| 15 | Nemotron 3 Nano 30B A3B Thinking | NVIDIA | 41.6% |
| 15 | Nemotron 3 Nano 30B A3B | NVIDIA | 41.6% |
| 17 | Phi 4 | Microsoft Azure | 40.8% |
| 18 | Claude Opus 4 Thinking | Anthropic | 25.3% |
| 18 | Claude Opus 4 | Anthropic | 25.3% |
| 18 | Kimi K2 0711 | Moonshot AI | 25.3% |
The Claude Opus 4 result at 25.3% deserves attention. Anthropic's flagship model, which leads multi-turn agentic benchmarks and beats everyone on tau-bench, scores at the bottom of the BFCL v3 table. That isn't a fluke - it reflects a real tradeoff in how the model handles structured formatting under test conditions. BFCL rewards precise, rigid JSON outputs. Claude tends toward conversational wrapping that can trip up AST parsing even when the underlying tool selection is correct. The discrepancy is a known issue worth understanding before drawing conclusions about real-world usability.
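The failure mode is easy to reproduce. The sketch below (an illustration, not any vendor's actual parsing logic) shows a strict checker rejecting a conversationally wrapped reply that a tolerant extractor recovers intact:

```python
import json
import re

def extract_tool_call(response: str):
    """Pull the first JSON object out of a conversationally wrapped reply.

    A strict JSON/AST checker would reject the wrapped reply outright
    even though the embedded call is correct; a tolerant extractor
    recovers it. Greedy regex is a simplification that assumes one
    JSON object per reply.
    """
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

wrapped = ('Sure! I will look that up:\n'
           '{"name": "get_order", "arguments": {"id": 42}}\n'
           'Let me know if you need anything else.')
print(extract_tool_call(wrapped))
# {'name': 'get_order', 'arguments': {'id': 42}}
```

Benchmarks that apply the strict path score the wrapped reply as a failure; production harnesses that apply the tolerant path would count it as a success, which is one concrete source of the BFCL/tau-bench divergence.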
BFCL uses Abstract Syntax Tree comparison to evaluate function call correctness - catching subtle structural errors that string matching misses.
tau-bench Rankings
Airline Domain
Data from llm-stats.com, accessed April 2026. 23 models assessed; average score 0.495.
| Rank | Model | Provider | Airline Score |
|---|---|---|---|
| 1 | Claude Sonnet 4.5 | Anthropic | 0.700 |
| 2 | MiniMax M1 80K | MiniMax | 0.620 |
| 3 | GLM-4.5-Air | Zhipu AI | 0.608 |
| 4 | GLM-4.5 | Zhipu AI | 0.604 |
| 5 | MiniMax M1 40K | MiniMax | 0.600 |
| 5 | Claude Sonnet 4 | Anthropic | 0.600 |
| 5 | Qwen3-Coder 480B A35B | Alibaba | 0.600 |
| 8 | Claude Opus 4 | Anthropic | 0.596 |
| 9 | Claude 3.7 Sonnet | Anthropic | 0.584 |
| 10 | Claude Opus 4.1 | Anthropic | 0.560 |
| 11 | o1 | OpenAI | 0.500 |
| 11 | GPT-4.5 | OpenAI | 0.500 |
| 13 | GPT-4.1 | OpenAI | 0.494 |
| 14 | o4-mini | OpenAI | 0.492 |
| 15 | Qwen3-Next-80B-A3B-Thinking | Alibaba | 0.490 |
| 16 | Qwen3-235B-A22B-Thinking-2507 | Alibaba | 0.460 |
| 17 | GPT-4o | OpenAI | 0.428 |
| 18 | GPT-4.1 mini | OpenAI | 0.360 |
| 19 | GPT-4.1 nano | OpenAI | 0.140 |
Retail Domain
Data from llm-stats.com, accessed April 2026. 25 models assessed.
| Rank | Model | Provider | Retail Score |
|---|---|---|---|
| 1 | Claude Sonnet 4.5 | Anthropic | 0.862 |
| 2 | Claude Opus 4.1 | Anthropic | 0.824 |
| 3 | Claude Opus 4 | Anthropic | 0.814 |
| 4 | Claude 3.7 Sonnet | Anthropic | 0.812 |
| 5 | Claude Sonnet 4 | Anthropic | 0.805 |
| 6 | GLM-4.5 | Zhipu AI | 0.797 |
| 7 | GLM-4.5-Air | Zhipu AI | 0.779 |
| 8 | Qwen3-Coder 480B A35B | Alibaba | 0.775 |
| 9 | o4-mini | OpenAI | 0.718 |
| 10 | o1 | OpenAI | 0.708 |
| 11 | Qwen3-Next-80B-A3B-Thinking | Alibaba | 0.696 |
| 12 | Claude 3.5 Sonnet | Anthropic | 0.692 |
| 13 | GPT-4.5 | OpenAI | 0.684 |
| 14 | GPT-4.1 | OpenAI | 0.680 |
| 15 | GPT-4o | OpenAI | 0.603 |
The pattern is consistent: Anthropic holds the top five retail positions, with Claude Sonnet 4.5 posting 0.862 - 3.8 points ahead of the next model. GLM-4.5 at sixth (0.797) is the first non-Anthropic model, and it scores higher than GPT-4.5 (0.684), which is worth noting for cost-sensitive buyers.
tau2-bench Telecom (Extended Domain)
The tau2-bench telecom dataset, tracked at artificialanalysis.ai, extends the original benchmark into telecommunications customer service with more complex policy trees. GPT-5.2 Thinking set a high bar at 98.7%, and GLM models have since pushed the reported ceiling even higher.
| Rank | Model | Provider | Telecom Score |
|---|---|---|---|
| 1 | GLM-4.7-Flash (Reasoning) | Zhipu AI | 98.8% |
| 2 | GPT-5.2 Thinking | OpenAI | 98.7% |
| 3 | GLM-5V Turbo (Reasoning) | Zhipu AI | 98.5% |
| 3 | GLM-5-Turbo | Zhipu AI | 98.5% |
One important caveat: these telecom scores look spectacular, but the benchmark's specific policy constraints also make it more susceptible to overfitting. No model scored above 49% when the paper was first published. The rapid climb to 98%+ suggests some combination of genuine capability improvement and possible training data exposure. Treat telecom scores as directionally useful, not as definitive.
Multi-turn tool use benchmarks like tau-bench are harder to game than single-turn evaluations because errors compound across multiple conversation turns.
FinTrace Rankings (Financial Tool Calling)
FinTrace, published April 2026 (arXiv:2604.10015), evaluated 13 LLMs across 800 annotated financial task trajectories. The rubric grades nine metrics along four axes - action correctness, execution efficiency, process quality, and output quality.
| Rank | Model | Provider | FinTrace Score |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 0.788 |
| 2 | Claude Sonnet 4.6 | Anthropic | 0.750 |
| 3 | GPT-5.4 | OpenAI | 0.737 |
| Mid-tier | Gemini 3 Flash | Google | ~0.450 |
The FinTrace authors found that frontier models handle tool selection reliably but consistently struggle with information use and final answer quality. Picking the right function to call isn't the hard part anymore - doing something useful with the result is. That finding holds across all 13 models tested, including the top scorers.
Scale Labs Enterprise Tool Use
For a broader single-turn tool use perspective, Scale Labs runs an enterprise evaluation against 35 models. Top performers as of the most recent snapshot at labs.scale.com/leaderboard/tool_use_enterprise:
| Rank | Model | Provider | Score |
|---|---|---|---|
| 1 | o1 (Dec 2024) | OpenAI | 70.1 ±5.3 |
| 2 | Gemini 2.5 Pro Experimental | Google | 68.8 ±5.4 |
| 3 | o1 Pro | OpenAI | 67.0 ±5.4 |
| 4 | o1-preview | OpenAI | 66.4 ±5.5 |
| 5 | DeepSeek-R1 | DeepSeek | 65.3 ±5.5 |
| 6 | Claude 3.7 Sonnet Thinking | Anthropic | 65.3 ±5.5 |
| 7 | GPT-4o (May 2024) | OpenAI | 64.6 ±5.5 |
| 8 | GPT-4.5 Preview | OpenAI | 63.8 ±5.6 |
| 9 | GPT-4o mini | OpenAI | 51.7 ±5.8 |
| 10 | Llama 3.1 405B Instruct | Meta | 50.4 ±5.8 |
This leaderboard uses older model checkpoints (the most recent version evaluated is from early 2025), so it doesn't reflect current frontier capability. It's still useful as a reference for enterprise buyers comparing older deployments and for seeing how reasoning models (o1, o1 Pro) compare to standard models on structured API tasks.
Key Takeaways
The BFCL-tau-bench split tells the real story
The biggest insight from these numbers is that BFCL and tau-bench don't agree on model ranking - and they shouldn't, because they're testing different things. BFCL rewards single-call precision; tau-bench rewards sustained reliability across a multi-turn conversation. Claude Opus 4 scores 25.3% on BFCL and 0.814 on tau-bench retail. That isn't a contradiction - it means Claude handles multi-turn tool orchestration well but formats outputs in ways that trip up BFCL's AST parser.
For most production use cases, tau-bench scores matter more. Real agents don't make a single tool call and stop.
Open-weight models are competitive on structured calling
GLM-4.5 tops BFCL v3 at 76.7%, and Qwen3 32B is right behind at 75.7%. Both are available open-weight. Anthropic's closed models dominate tau-bench, but for applications that need raw function-call accuracy without the conversational overhead, the open-weight options are genuinely strong. Our guide to running open-source LLMs locally covers setup if you want to benchmark these yourself.
The "thinking" premium is marginal for tool calling
Several model pairs in the BFCL v3 table show identical scores for the base and "Thinking" variant (GLM-4.7-Flash and GLM-4.7-Flash Thinking both score 74.6%; Kimi K2.5 and Kimi K2.5 Thinking both score 64.5%). Extended chain-of-thought doesn't help on well-formed single-call evaluations. It helps on complex multi-step planning - which is what tau-bench and FinTrace measure. So the decision of whether to use a reasoning model should depend on your task structure, not just the model tier.
FinTrace exposes the output problem
The FinTrace finding - that all models struggle with information utilization more than tool selection - points to the next frontier in function calling research. Models have gotten good at choosing the right tool. They haven't gotten comparably good at reasoning over the result before taking the next step. This matters enormously for financial, medical, and legal agent workflows where a tool returns a document and the agent needs to extract the right number from it before continuing.
Practical Guidance
Building a customer service or task automation agent: Use Claude Sonnet 4.5 or Claude Opus 4. Both consistently lead tau-bench across domains, and the multi-turn reliability gap between them and GPT-4.1 is large enough to matter in production. See our review of Claude Opus 4.6 for a deeper look at Anthropic's flagship line.
Structured data extraction or API integration with exact schema requirements: Check the BFCL v3 table and prioritize GLM 4.5 or Qwen3 32B if output format compliance is your bottleneck. If you need a closed API, Amazon's Nova Pro 1.0 (67.9% on BFCL v3) is the strongest proprietary option in the table.
Multi-hop tool chains (MCP, complex pipelines): MCP-Bench data is limited, but its findings indicate that cross-tool coordination and parameter precision are still hard problems regardless of model tier. Our guide on what MCP is and how to use it covers the protocol layer if you're designing the tool interface rather than just selecting a model.
Financial or high-stakes domain tool use: FinTrace scores put Claude Opus 4.6 (0.788) and Claude Sonnet 4.6 (0.750) ahead of GPT-5.4 (0.737). The gap isn't large, and the benchmark itself is new enough that you should treat these numbers as signals rather than verdicts. Fine-tune on domain-specific tool trajectories if stakes are high - FinTrace-Training (8,196 annotated trajectories) is public.
Methodology Notes and Caveats
A few things to keep in mind when interpreting these tables:
BFCL v3 evaluates models as of the snapshot date and doesn't account for system prompt changes, context windows, or whether the model is being called with tool-use instructions or not. Provider settings matter.
Tau-bench is stochastic - it uses an LLM to simulate the user, and scores vary across runs. The Pass@k metric helps, but single published numbers should carry error bars that most leaderboard aggregators don't show. The tau-bench airline and retail numbers above come from llm-stats.com's snapshot and may not match Sierra Research's official site for newer models.
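To get a sense of how wide those missing error bars would be, here is a back-of-envelope normal-approximation confidence interval for a pass rate. The task count is hypothetical; Wilson or bootstrap intervals are better choices for small samples:

```python
from math import sqrt

def pass_rate_ci(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Normal-approximation 95% confidence interval for a pass rate -
    the error bars most leaderboard aggregators omit. Back-of-envelope
    sketch; prefer Wilson or bootstrap intervals for small n."""
    p = successes / trials
    half = z * sqrt(p * (1 - p) / trials)
    return (max(0.0, p - half), min(1.0, p + half))

# Hypothetical: 35 successes over 50 multi-turn tasks. The interval
# is wide enough to overlap several neighbouring leaderboard entries.
lo, hi = pass_rate_ci(35, 50)
print(round(lo, 3), round(hi, 3))  # 0.573 0.827
```

A ±0.13 interval around a 0.70 score means that many adjacent rows in the tau-bench tables above are statistically indistinguishable from each other.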
FinTrace and MCP-Bench are 2025-2026 publications with limited model coverage. They're worth watching as evaluation harnesses, but neither yet has the breadth of BFCL.
The Scale Labs enterprise leaderboard has excellent methodology documentation but runs on older model versions. Don't use it to compare GPT-5 vs. Claude 4.x.
FAQ
Which model is best for function calling overall?
No single model leads every benchmark. For multi-turn tool use, Claude Sonnet 4.5. For single-call format precision, GLM 4.5. For financial workflows, Claude Opus 4.6.
Why does Claude score so low on BFCL but high on tau-bench?
BFCL uses AST parsing to check output format. Claude wraps responses conversationally, which fails the parser even when tool selection is correct. Tau-bench measures task completion, where Claude's reasoning over outputs is an advantage.
Are open-weight models competitive with closed APIs for tool use?
On BFCL v3, yes - GLM 4.5 and Qwen3 32B beat every closed API in the table. On tau-bench multi-turn tasks, Anthropic closed models hold a consistent lead, though GLM-4.5 runs competitively at rank 3-4.
How often do these rankings change?
BFCL updates with new model submissions irregularly - major frontier releases tend to appear within weeks of launch. Tau-bench and FinTrace are less frequently updated. Check the source leaderboards linked in the Sources section for the latest snapshots.
What's the difference between tau-bench and tau2-bench?
The original tau-bench covers airline and retail customer service. Tau2-bench adds a telecom domain with more complex policy constraints. Scores aren't comparable across domains.
Does function calling performance transfer to MCP tool use?
Partially. BFCL and tau-bench measure direct function calls in controlled setups. MCP-Bench adds tool discovery and cross-server coordination. Models that score well on BFCL don't automatically handle MCP workflows well, according to the Accenture paper's findings.
Sources:
- Berkeley Function Calling Leaderboard (BFCL) V4
- BFCL v3 model rankings - pricepertoken.com
- BFCL benchmark - llm-stats.com
- tau-bench GitHub repository
- tau-bench airline - llm-stats.com
- tau-bench retail - llm-stats.com
- BenchLM tau-bench snapshot
- tau2-bench telecom leaderboard - Artificial Analysis
- FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling (arXiv:2604.10015)
- MCP-Bench: Benchmarking Tool-Using LLM Agents (arXiv:2508.20453)
- ToolBench - OpenBMB GitHub
- Scale Labs Enterprise Tool Use Leaderboard
- Best AI for Tool Calling - llm-stats.com
- BenchLM LLM Agent and Tool-Use Benchmarks
- Nexus Function Calling Leaderboard - Hugging Face
- ToolSpec: Accelerating Tool Calling (arXiv:2604.13519)
✓ Last verified April 17, 2026
