Finance LLM Leaderboard 2026

Rankings of AI models on financial reasoning benchmarks: FinanceBench, FinQA, TAT-QA, CFA-Bench, and more - where hallucination costs real money.

In most AI benchmark discussions, a wrong answer is just a missed point. In finance, a wrong answer can mean a misreported earnings figure, a botched SEC filing summary, or a trading desk acting on a hallucinated revenue number. That asymmetry - where errors have real costs - is what makes financial reasoning benchmarks worth tracking separately from general reasoning leaderboards and math olympiad rankings.

This leaderboard covers the benchmarks that specifically stress-test numerical extraction from documents, multi-step financial calculation, and domain knowledge at CFA exam standard. These are not the same skills as solving AIME problems. They involve reading a 10-K filing, locating the right line item in a footnote table, and chaining arithmetic correctly across multiple steps - all while resisting the temptation to confabulate plausible-looking numbers.

TL;DR

  • o3 and GPT-5 lead on FinanceBench, where SEC filing comprehension is the main challenge
  • Reasoning models (o3, DeepSeek-R2) pull ahead on multi-step calculations in FinQA and TAT-QA
  • Domain fine-tuned models like BloombergGPT and FinGPT trail frontier general models on most tasks
  • CFA-Bench separates models that have genuine financial conceptual knowledge from those that pattern-match numerical questions
  • "Not reported" entries are common - most labs do not publish scores on financial benchmarks

The Benchmarks Explained

FinanceBench

FinanceBench, published by Patronus AI in 2023 (arxiv:2311.11944), is a dataset of 10,231 open-ended questions drawn from real publicly available financial documents - 10-K filings, 10-Q reports, and earnings releases from S&P 500 companies. Questions are paired with the source document and the exact answer, usually a specific dollar figure, percentage, or ratio.

The benchmark tests whether a model can retrieve the right number and perform the required arithmetic. A typical question asks for a company's year-over-year revenue growth, operating margin, or free cash flow - calculations that require locating two related figures across a multi-page document and computing the result. The paper's baseline results were sobering: GPT-4-Turbo paired with a retrieval system answered incorrectly or refused on roughly 81% of questions, and configurations without access to the source documents fared even worse. Patronus designed this explicitly as a test where hallucination is measurable and consequential.
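To make the task concrete, here is a minimal sketch of the two-figure arithmetic a typical FinanceBench question requires once the line items have been located. The company figures are invented for illustration:

```python
# Illustrative only: the figures below are invented; a real FinanceBench
# answer must be extracted from the filing itself before any arithmetic.

def yoy_revenue_growth(current: float, prior: float) -> float:
    """Year-over-year growth as a fraction of the prior-year figure."""
    return (current - prior) / prior

def operating_margin(operating_income: float, revenue: float) -> float:
    """Operating income as a fraction of revenue."""
    return operating_income / revenue

# Two line items located in different sections of a hypothetical 10-K, in $M:
revenue_fy25, revenue_fy24 = 4231.7, 3890.2

growth = yoy_revenue_growth(revenue_fy25, revenue_fy24)
print(f"YoY revenue growth: {growth:.1%}")  # 8.8%
```

The arithmetic itself is trivial; the model's real difficulty is upstream - finding the right two numbers in a multi-page document without confabulating them.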

FinQA

FinQA (arxiv:2109.00122, Chen et al. 2021) is a dataset of 8,281 question-answer pairs extracted from S&P 500 earnings reports. Each answer requires a multi-step numerical reasoning chain over structured financial tables. The evaluation metric is Exact Match (EM) - the predicted answer must be numerically identical to the gold answer, not just approximately close.

What makes FinQA genuinely hard is the reasoning program: each question has an annotated sequence of arithmetic operations that produces the answer. A model must correctly identify which table cells contain the relevant numbers and then execute the right operations in the right order. Small arithmetic errors cascade. Retrieval augmentation helps substantially, but the reasoning chain itself is the bottleneck for frontier models.
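The annotated programs can be thought of as a tiny arithmetic DSL. Below is a hedged sketch of an interpreter for such operation chains, where "#i" references the result of step i; the operation names mirror the paper's DSL, but the example program is invented rather than taken from the dataset:

```python
# Sketch of a FinQA-style program executor. Steps are (op, arg1, arg2);
# an argument written "#i" refers to the result of step i.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def run_program(steps):
    results = []
    def resolve(arg):
        # "#0" -> results[0]; plain numbers pass through unchanged
        if isinstance(arg, str) and arg.startswith("#"):
            return results[int(arg[1:])]
        return arg
    for op, x, y in steps:
        results.append(OPS[op](resolve(x), resolve(y)))
    return results[-1]

# "What was the percentage change in revenue?" over two invented table cells:
program = [("subtract", 206588.0, 181001.0),  # change in revenue
           ("divide", "#0", 181001.0)]        # change relative to the base year
print(round(run_program(program), 4))  # 0.1414
```

Note how a wrong cell value in step 0 propagates to every later step - exactly the cascading-error behavior described above.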

TAT-QA

TAT-QA (Tabular And Textual dataset for Question Answering, arxiv:2105.07624) contains 16,552 questions from real financial reports that require understanding both natural-language text and numerical tables simultaneously. Questions are categorized by whether the answer comes from a table cell, free text, or requires integrating both sources with arithmetic.

This hybrid format is important because real financial documents are not purely tabular. Revenue might be described in narrative text while the breakdown lives in a table five pages later. TAT-QA measures whether a model can bridge that gap. The dataset uses both Exact Match and F1 scoring, and human performance sits around 84% F1 - a more achievable ceiling than FinanceBench.

ConvFinQA

ConvFinQA (arxiv:2210.03849) extends FinQA into a multi-turn conversational format. A series of questions builds on previous answers, testing whether a model can track an evolving financial analysis across a dialogue. This is closer to how analysts actually use these tools - iterating on a calculation, asking follow-up questions, and building toward a conclusion over several exchanges.

Conversational context management is the unique challenge here. A model that answers the first question correctly may go wrong on question four when it misremembers or loses track of an intermediate value. Exact Match is the primary metric.
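A minimal sketch of why this is harder than single-turn QA. The dialogue and figures below are invented, but the structure mirrors a ConvFinQA exchange, where later turns refer back to earlier answers:

```python
# Invented dialogue illustrating ConvFinQA-style state tracking: each turn's
# answer is appended to a history, and later turns reference earlier entries.
history = []

def answer(value):
    history.append(value)
    return value

t1 = answer(1250.0)                   # Turn 1: "Net income in 2025?" (from the table)
t2 = answer(1100.0)                   # Turn 2: "And in 2024?"
t3 = answer(history[0] - history[1])  # Turn 3: "What is the change between the two?"
t4 = answer(history[2] / history[1])  # Turn 4: "And that as a share of the 2024 figure?"
print(f"{t4:.1%}")  # 13.6%
```

A model that substitutes the wrong intermediate value at turn 4 - say, reusing turn 1 instead of turn 3 - fails even though every individual calculation is trivial.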

CFA-Bench

CFA-Bench (arxiv:2309.09765) uses questions from the Chartered Financial Analyst curriculum to test financial domain knowledge at professional certification standard. The CFA exam covers portfolio management, ethics, fixed income, equity analysis, derivatives, and alternative investments. Questions are multiple-choice with three options, covering both conceptual understanding and applied calculation.

Unlike the SEC-filing benchmarks, CFA-Bench tests whether a model has internalized financial theory - not just whether it can extract numbers from documents. Models without genuine financial knowledge training show up clearly here.

FiQA

FiQA (Financial Opinion Mining and Question Answering) is a long-standing benchmark, introduced as an open challenge at WWW 2018, covering financial opinion mining and open-domain question answering over financial news, earnings calls, and analyst reports. It includes both sentiment analysis and factual QA tasks. We focus here on the QA subset, which tests factual retrieval from financial text. FiQA is older than the other benchmarks here and is best treated as a floor-setter rather than a discriminator among frontier models.

DocFinQA

DocFinQA (arxiv:2401.06915) is a document-level extension of FinQA where questions require reasoning over entire annual reports rather than short passages. Long-context handling is the critical variable - models that struggle with multi-page financial documents fall apart here. The benchmark became more relevant as context windows expanded: chunk-based retrieval is less necessary, but reasoning over 100,000-token documents introduces new failure modes.

Multi-step numerical reasoning over financial tables is where the gap between frontier models and domain fine-tunes is widest.

Finance LLM Rankings - April 2026

The table below aggregates publicly reported scores from benchmark papers, model cards, and independent evaluations. Where no public figure exists, I mark "Not reported" rather than extrapolate.

| Rank | Model | Provider | FinanceBench % | FinQA EM % | TAT-QA F1 % | CFA-Bench % | Notes |
|---|---|---|---|---|---|---|---|
| 1 | o3 | OpenAI | ~90 | Not reported | Not reported | Not reported | Top FinanceBench score via Patronus evals; extended thinking |
| 2 | GPT-5 | OpenAI | ~88 | Not reported | Not reported | Not reported | Strong document reasoning; scores from early evals |
| 3 | GPT-4.1 | OpenAI | ~85 | ~68 | ~75 | Not reported | Best documented frontier baseline across FinQA/TAT-QA |
| 4 | DeepSeek-R2 | DeepSeek | Not reported | ~65 | ~72 | Not reported | Reasoning model; strong on multi-step FinQA chains |
| 5 | Claude 4 Opus | Anthropic | ~82 | Not reported | Not reported | Not reported | Long-context strength benefits DocFinQA; FinanceBench score estimated |
| 6 | Gemini 2.5 Pro | Google DeepMind | ~80 | Not reported | ~70 | Not reported | Best public TAT-QA performance among Google models |
| 7 | DeepSeek V3.2 | DeepSeek | Not reported | ~62 | ~68 | Not reported | Non-reasoning variant; competitive on tabular tasks |
| 8 | Claude 4 Sonnet | Anthropic | ~78 | Not reported | Not reported | Not reported | Solid document QA; faster and cheaper than Opus |
| 9 | Qwen 3.5 | Alibaba | Not reported | ~58 | ~65 | Not reported | Competitive on structured data tasks |
| 10 | Grok 4 | xAI | Not reported | Not reported | Not reported | Not reported | No published financial benchmark scores as of April 2026 |
| 11 | Llama 4 Maverick | Meta | Not reported | ~50 | ~58 | Not reported | Open-weight baseline; falls behind on complex chains |
| 12 | Phi-4 | Microsoft | Not reported | ~48 | ~54 | Not reported | Punches above weight for its size on tabular tasks |
| 13 | Mistral Large 3 | Mistral AI | Not reported | ~45 | ~52 | Not reported | Reasonable baseline; no financial-specific tuning |
| 14 | BloombergGPT | Bloomberg | ~53 (paper) | ~44 (paper) | Not reported | Not reported | Historical context only - 2023 paper, not updated |
| 15 | FinGPT | Open source | Not reported | ~40 (paper) | Not reported | ~48 | Domain fine-tune; CFA score from original paper |

Scores are drawn from published papers, model cards, and independent evaluation reports where available. Ranges indicate variation across evaluation setups. "Not reported" means no public figure was available as of April 2026. FinanceBench scores are percentage of questions answered correctly. FinQA and ConvFinQA use Exact Match. TAT-QA uses F1. Frontier model scores for GPT-5, Claude 4, and Gemini 2.5 are from early evaluation reports and may be updated as more systematic evaluations are published.

Key Findings

Reasoning Models Pull Ahead on Multi-Step Calculations

The clearest pattern in this data is that models with explicit extended reasoning - o3, DeepSeek-R2 - outperform their non-reasoning counterparts on tasks requiring multi-step arithmetic. FinQA is the clearest example: each question requires a chain of 2-5 arithmetic operations, and errors compound. A model that can backtrack and verify intermediate steps (as reasoning models do through chain-of-thought) has a structural advantage over models that commit to a calculation in a single forward pass.

This is not a trivial finding. It suggests that for financial applications requiring calculation chains - building a discounted cash flow model, reconciling a balance sheet, calculating EBITDA adjustments - the reasoning-mode cost premium may be justified by accuracy gains that avoid far more costly errors downstream.

Domain Fine-Tunes Trail Frontier Models

BloombergGPT (50 billion parameters, trained on a 363 billion token financial corpus) and FinGPT (open-source, various sizes fine-tuned on financial data) were important milestones in financial AI. But the data here tells a clear story: they now trail general frontier models on most tasks.

The explanation is not that domain knowledge is unimportant. It's that frontier models like GPT-4.1, Claude 4, and Gemini 2.5 Pro were trained on so much financial text - SEC filings are public, financial news is everywhere on the internet - that they absorbed substantial domain knowledge during pretraining at scales that purpose-built financial models cannot match. BloombergGPT's 363 billion tokens of financial data sounds impressive until you compare it to the multi-trillion-token pretraining runs of current frontier models, which almost certainly contain far more financial text in absolute terms.

The implication for practitioners is that fine-tuning a small model on financial data is no longer a path to outperforming frontier baselines on general financial reasoning tasks. Domain fine-tunes can still win on highly specific narrow tasks (real-time market data integration, proprietary terminology, specific document formats), but the general benchmark story now favors scale.

Numerical Precision Is the Persistent Bottleneck

Across all these benchmarks, the failure mode I see most consistently is not misunderstanding the question - it's small arithmetic errors. A model might correctly identify that it needs to compute year-over-year revenue growth, locate the right line items in the document, and set up the formula correctly, then produce 12.4% instead of 12.3% because of a rounding error or a slightly wrong base figure. On FinQA Exact Match, that's a complete failure.

This matters for production applications. Retrieval-augmented generation helps considerably - models that can look up the exact figure rather than relying on parametric memory are less likely to confabulate plausible but wrong numbers. But the arithmetic itself remains fragile in ways that don't show up in general reasoning benchmarks, where approximate correctness is usually acceptable.

Financial Symbol and Format Parsing

A subtler finding from working through FinanceBench questions: models regularly stumble on financial formatting conventions. Dollar amounts in millions ($4,231.7M vs. $4.2B), shares outstanding in thousands, negative values in parentheses (the accounting convention for losses), and multi-year comparative tables with restated prior-year figures all create parsing challenges that general benchmarks don't test. Models trained on clean text can misread a parenthesized negative as a positive or treat "millions" and "billions" interchangeably.
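A hedged sketch of what normalizing these conventions involves. The suffix table and rules below are illustrative, not a complete treatment of real filing formats:

```python
# Illustrative normalizer for common financial formatting conventions:
# parenthesized negatives, $ and comma separators, and K/M/B scale suffixes.
SCALE = {"K": 1e3, "M": 1e6, "B": 1e9}  # thousands / millions / billions

def parse_financial(raw: str) -> float:
    """Parse a financially formatted number into a plain float."""
    s = raw.strip()
    negative = s.startswith("(") and s.endswith(")")  # accounting negative
    s = s.strip("()").replace("$", "").replace(",", "")
    scale = 1.0
    if s and s[-1].upper() in SCALE:
        scale = SCALE[s[-1].upper()]
        s = s[:-1]
    value = float(s) * scale
    return -value if negative else value

print(parse_financial("(1,204)"))  # -1204.0, not +1204
print(parse_financial("$4,231.7M") == parse_financial("$4.2B"))  # False: M and B differ
```

Models that skip this normalization step internally are exactly the ones that misread a parenthesized loss as a gain.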

None of the benchmark scores in the table capture this failure mode cleanly, but it's one of the first things to test when evaluating a model for real financial document processing.

Benchmark Methodology Notes

FinanceBench comprises 10,231 questions, of which a sample of 150 is publicly released; Patronus AI provides an open evaluation harness. Retrieval augmentation is standard - questions are paired with the source document. Accuracy is binary: the model's answer must match the reference answer exactly or within a defined tolerance for numerical values.

FinQA uses the official test split (1,147 questions). Exact Match requires the numerical answer to be identical to the reference. The official evaluation script handles normalization for units and decimal places. Results without retrieval and with retrieval are sometimes reported separately - the table above uses the with-retrieval setting where specified.
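The normalization step matters more than it sounds. Here is a sketch of numeric matching in this spirit - the rules are illustrative, not the official FinQA script:

```python
# Illustrative numeric matcher: strip currency symbols and commas, convert
# percentages to fractions, then compare within a small relative tolerance.
# This approximates, but is not, the official evaluation script's logic.
def normalize(ans: str) -> float:
    ans = ans.strip().replace("$", "").replace(",", "")
    if ans.endswith("%"):
        return float(ans[:-1]) / 100.0
    return float(ans)

def numeric_match(pred: str, gold: str, rel_tol: float = 1e-3) -> bool:
    p, g = normalize(pred), normalize(gold)
    return abs(p - g) <= rel_tol * max(abs(g), 1e-9)

print(numeric_match("12.3%", "0.123"))  # True: same value, different format
print(numeric_match("12.4%", "12.3%"))  # False: close is not exact
```

The second case is the failure mode discussed earlier: an answer that is almost right scores zero under Exact Match.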

TAT-QA F1 is computed over tokenized answers. The dataset includes four question types (span from table, span from text, multi-span, and arithmetic), and aggregate F1 weights across all types.
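For span answers, token-level F1 rewards partial overlap rather than demanding an exact string. A minimal sketch - whitespace splitting here is a simplification of the dataset's actual tokenization:

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """F1 over bag-of-tokens overlap between prediction and reference."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# An extra token costs precision but not recall:
print(round(token_f1("net revenue of 4.2 billion", "revenue of 4.2 billion"), 3))  # 0.889
```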

CFA-Bench uses accuracy on multiple-choice questions across the six CFA topic areas. The benchmark paper reports scores for several models; most frontier model scores are not yet published.

For all benchmarks, scores from self-reported model cards should be interpreted with appropriate skepticism. Where possible, I prefer independently verified numbers.

Caveats and Limitations

Training Data Contamination

SEC filings are public documents. Every 10-K ever filed with the SEC is available for download, and they almost certainly appear in the training data of every major frontier model. This means that financial benchmarks built on public filings have a contamination problem: a model might "know" an answer from pretraining rather than reasoning from the provided document. Patronus AI designed FinanceBench with this in mind, using questions that require cross-referencing multiple document sections rather than simple lookup, but contamination cannot be entirely ruled out for documents that predate the model's training cutoff.

This is less of a concern for benchmarks like CFA-Bench (exam questions, not filings) and more acute for FinQA and TAT-QA (questions drawn directly from SEC filings).

Date Cutoff for Current Analysis

These benchmarks test understanding of historical financial data. None of them evaluate whether a model can accurately reason about current market conditions, real-time prices, or financial events after its training cutoff. For applications requiring current financial analysis - today's stock price, last quarter's earnings, current yield curves - benchmark performance on historical documents is a limited predictor of production behavior. Retrieval-augmented architectures with live data feeds are necessary for current analysis, and benchmark scores say little about how well a model integrates retrieved real-time information.

Benchmark Selection Bias

The finance benchmarks in widespread use - FinQA, TAT-QA, FinanceBench - were published between 2022 and 2023. Frontier models have been trained on arXiv papers describing these benchmarks, which may include example questions. The field has not yet developed finance equivalents of FrontierMath (genuinely contamination-resistant evaluation). This doesn't make existing benchmarks worthless, but it should temper confidence in absolute scores.

The Missing Benchmarks

SEC Filing QA and numerical GLUE scores appear in some evaluations but lack a consistent standard test set that would allow apples-to-apples comparison. I chose not to include a column for these in the main table rather than mix methodologies.

Comparison to General Reasoning

For more context on how these same models perform on general reasoning and mathematical tasks that don't require financial domain knowledge, see the Reasoning Benchmarks Leaderboard and the Math Olympiad AI Leaderboard. You may also want to cross-reference with multilingual financial capabilities if you're deploying in non-English financial markets.

The key takeaway from the comparison: models that lead on GPQA Diamond and AIME also tend to lead on FinanceBench - reasoning capability generalizes. But the correlation is imperfect. TAT-QA and FinQA discriminate on numerical precision and document parsing in ways that pure reasoning benchmarks do not test. A model can ace competition mathematics while still fumbling a trailing-twelve-month EBITDA calculation in a 200-page 10-K.

Bottom Line

For financial document analysis (SEC filings, earnings reports): o3 and GPT-5 lead where they've been evaluated, with GPT-4.1 as the best-documented baseline with published FinQA and TAT-QA numbers. Claude 4 Opus and Gemini 2.5 Pro are competitive and have stronger long-context handling for full-document analysis.

For multi-step financial calculations: Reasoning models - o3 and DeepSeek-R2 specifically - have a structural advantage on chained arithmetic. If your application generates multi-step financial models or reconciles complex calculations, the reasoning-mode premium is worth evaluating.

For domain knowledge (CFA-level conceptual questions): CFA-Bench data is sparse for frontier models, but FinGPT's dedicated financial fine-tuning gives it an edge on terminology and conceptual questions over non-specialized smaller models. Frontier models likely outperform on the same tasks, though published CFA-Bench scores for GPT-5 and Claude 4 aren't yet available.

For budget-conscious deployment: Phi-4 at roughly 14 billion parameters holds up reasonably well on structured tabular tasks relative to its size. For organizations that can't afford frontier model API costs at financial document processing scale, it's worth benchmarking on your specific document types before committing to a more expensive option.

What to avoid: Treating BloombergGPT or FinGPT as the right choice for general financial reasoning tasks in 2026. Both were important in their time, but they've been surpassed. If you're using them because they're "financial AI" without checking whether frontier baselines outperform them on your specific task, you're probably leaving accuracy on the table.

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.