Legal AI LLM Leaderboard 2026
Rankings of AI models on legal benchmarks - LegalBench, LexGLUE, CaseHOLD, ContractNLI, Bar Exam MBE, and more. Where hallucinated citations already got lawyers sanctioned.

Legal AI benchmarks test whether LLMs can handle the precision and citation integrity that courts demand.
In most AI benchmark discussions, a wrong answer is just a missed point. In law, a wrong answer can be a sanctioned attorney, a dismissed motion, or a client losing a case because their lawyer cited a case that doesn't exist. That already happened. In 2023, two New York lawyers were fined after submitting ChatGPT-generated briefs containing fabricated case citations to a federal judge. The cases were real-looking - proper case names, docket numbers, plausible holdings - and completely invented by the model.
That's the context for why legal AI benchmarks matter and why this leaderboard tracks them separately from general reasoning rankings and hallucination benchmarks. The failure modes in legal AI are not abstract. They have professional conduct consequences, and the benchmarks described here were designed to measure whether models can handle the specific precision that legal practice demands.
TL;DR
- o3 and GPT-5 lead on Bar Exam MBE and LegalBench, with reasoning models showing a consistent edge on multi-step statutory analysis
- Domain fine-tunes (Saul-7B, Legal-BERT) outperform general models on narrow classification tasks but trail frontier models on LegalBench's open-ended reasoning tasks
- Most benchmarks are US-centric - LexGLUE provides European legal coverage but model performance on non-US law drops sharply
- Citation hallucination is not captured by any benchmark in this table - it remains an unsolved problem that MCQ-style tests cannot measure
The Benchmarks Explained
LegalBench
LegalBench (arxiv:2308.11462, Guha et al. 2023, NeurIPS) is a collaboratively constructed benchmark of 162 tasks covering six types of legal reasoning: issue spotting, rule recall, rule application, rule conclusion, interpretation, and rhetorical understanding. Tasks were contributed by legal professionals and cover areas including contract law, administrative law, criminal procedure, constitutional law, and international trade.
The benchmark is notable for its breadth and its grounding in how lawyers actually reason. Rather than asking a model to recall a statute, LegalBench asks it to apply that statute to a set of facts - the same IRAC (Issue, Rule, Application, Conclusion) framework that law schools teach. The 162 tasks vary considerably in difficulty, and aggregate scores mask significant variance across task types. A model that excels at rule recall may struggle with interpretation tasks that require weighing competing precedents.
The dataset is available on HuggingFace and the companion paper details per-task results for a range of models. LegalBench scores in this table represent accuracy averaged across all 162 tasks unless noted otherwise.
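Because most LegalBench tasks expect short categorical answers, a minimal evaluation loop reduces to prompt construction plus normalized exact-match scoring. The sketch below illustrates that scoring step; the toy answers are invented placeholders, not real LegalBench items:

```python
def normalize(answer: str) -> str:
    """Lowercase and strip whitespace/punctuation so 'Yes.' matches 'yes'."""
    return answer.strip().strip(".").lower()

def exact_match_accuracy(predictions, golds):
    """Fraction of predictions matching the gold label after normalization."""
    assert len(predictions) == len(golds)
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return hits / len(golds)

# Toy rule-application answers (invented, not real LegalBench data):
golds = ["Yes", "No", "Yes"]
predictions = ["yes.", "No", "No"]   # e.g. parsed from model completions
print(exact_match_accuracy(predictions, golds))  # 2 of 3 correct -> 0.666...
```

Normalization matters in practice: generative models decorate answers with punctuation and casing that would otherwise register as errors on tasks where the underlying label is correct.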
LexGLUE
LexGLUE (github.com/coastalcph/lex-glue, Chalkidis et al. 2022) is a benchmark of seven legal NLP tasks covering European and US law: ECtHR Task A and Task B (European Court of Human Rights violation prediction), SCOTUS (US Supreme Court subject area classification), EUR-LEX (EU legislation multi-label classification), LEDGAR (US SEC contract provision classification), UNFAIR-ToS (unfair terms-of-service clause detection), and CaseHOLD (case holding selection). ContractNLI, although covered alongside LexGLUE in this article, is a separate benchmark. Most tasks are classification; evaluation is reported as micro- and macro-F1.
LexGLUE is the closest thing the legal AI field has to a multi-task benchmark that spans jurisdictions. Its European components - ECtHR and EUR-LEX - provide partial coverage of non-US law, though coverage of civil law systems (most of the world) remains thin. The benchmark uses standard encoder models (BERT, RoBERTa, Legal-BERT) as baselines, and frontier LLMs in instruction-following mode have increasingly been evaluated against it, typically zero-shot or few-shot rather than fine-tuned.
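For the multi-label tasks (EUR-LEX in particular), micro-F1 pools true positives, false positives, and false negatives across all examples and labels before computing precision and recall once. A self-contained sketch - the label sets below are invented stand-ins, not real EuroVoc concepts:

```python
def micro_f1(gold_sets, pred_sets):
    """Micro-averaged F1 over multi-label predictions: pool TP/FP/FN
    across every example and label, then compute a single F1."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # labels correctly predicted
        fp += len(pred - gold)   # labels predicted but not gold
        fn += len(gold - pred)   # gold labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented EUR-LEX-style label sets:
gold = [{"taxation", "environment"}, {"trade"}]
pred = [{"taxation"}, {"trade", "environment"}]
print(micro_f1(gold, pred))  # TP=2, FP=1, FN=1 -> P=R=F1=2/3
```

Micro-averaging weights frequent labels more heavily than macro-averaging does, which is why the two numbers can diverge sharply on skewed label distributions like EUR-LEX's.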
CaseHOLD
CaseHOLD (arxiv:2104.08671, Zheng et al. 2021) is derived from Harvard Law School's Caselaw Access Project and contains 53,137 multiple-choice questions testing whether a model can identify the correct legal holding for a given case excerpt. Each question presents a citing context and five answer options - the correct holding from the cited case, plus four distractor holdings from other cases.
The task tests a specific and practically important capability: extracting the legally operative principle from judicial language. Legal opinions are long, verbose, and structured in ways that obscure the actual rule being stated. CaseHOLD is a contained test of whether a model can cut through that to identify the holding - the part of the decision that has precedential effect.
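Each CaseHOLD item can be rendered as a single lettered multiple-choice prompt, after which scoring is plain accuracy against the gold option index. A sketch of that framing - the excerpt and holdings below are invented placeholders, not real CaseHOLD data:

```python
import string

def format_casehold_prompt(citing_context: str, holdings: list[str]) -> str:
    """Render a CaseHOLD-style item as a lettered multiple-choice prompt."""
    lines = [f"Citing context: {citing_context}",
             "Which holding fits the <HOLDING> slot?"]
    for letter, holding in zip(string.ascii_uppercase, holdings):
        lines.append(f"({letter}) {holding}")
    return "\n".join(lines)

def accuracy(pred_indices, gold_indices):
    """Fraction of items where the chosen option index matches gold."""
    return sum(p == g for p, g in zip(pred_indices, gold_indices)) / len(gold_indices)

holdings = [  # invented distractor-style holdings
    "holding that consent must be knowing and voluntary",
    "holding that the statute of limitations was tolled",
    "holding that diversity jurisdiction was lacking",
    "holding that the contract was unconscionable",
    "holding that removal was untimely",
]
prompt = format_casehold_prompt("... the court held <HOLDING> ...", holdings)
print(accuracy([0, 2, 2], [0, 1, 2]))  # 2 of 3 correct; random baseline on 5 options is 0.2
```

The 20% random baseline is worth keeping in mind when reading CaseHOLD scores: a model at 40% has learned something, but not much.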
ContractNLI
ContractNLI (stanfordnlp.github.io/contract-nli/, Koreeda and Manning 2021) is a natural language inference benchmark built on Non-Disclosure Agreements. Given a contract text and a hypothesis about what the contract requires (e.g., "The Receiving Party shall not disclose the Confidential Information to any party"), a model must classify the relationship as entailment, contradiction, or not-mentioned.
ContractNLI tests a capability that is highly relevant to actual legal practice: contract review. Lawyers reviewing NDAs, software license agreements, and service contracts routinely need to determine whether a given clause entails, contradicts, or leaves unaddressed a specific obligation. The benchmark pairs each contract with a fixed set of 17 hypotheses, capturing realistic variation in how different contracts express the same underlying requirement.
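When prompting a generative model on ContractNLI, much of the engineering is in output parsing: the model's free-text verdict has to be mapped onto the three canonical labels before scoring. A hedged sketch of that mapping - the keyword lists are heuristic assumptions of mine, not part of the benchmark:

```python
LABELS = ("entailment", "contradiction", "not-mentioned")

def parse_nli_verdict(completion: str) -> str:
    """Map a free-text model verdict onto ContractNLI's three labels.
    Heuristic keyword matching; a production harness would constrain
    decoding to the label set instead of parsing after the fact."""
    text = completion.lower()
    # Check "not mentioned" first so it is not shadowed by other keywords.
    if "not mentioned" in text or "not addressed" in text:
        return "not-mentioned"
    if "contradict" in text:
        return "contradiction"
    if "entail" in text:
        return "entailment"
    return "not-mentioned"  # conservative default for unparseable output

print(parse_nli_verdict("The clause contradicts the hypothesis."))       # contradiction
print(parse_nli_verdict("This obligation is not mentioned in the NDA.")) # not-mentioned
```

The conservative default matters: for contract review, silently treating an unparseable answer as "entailment" would overstate what the contract obligates.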
Bar Exam MBE
The Multistate Bar Examination (MBE) is a 200-question multiple-choice component of the US Bar Exam covering seven subjects: Civil Procedure, Constitutional Law, Contracts, Criminal Law and Procedure, Evidence, Real Property, and Torts. Passing thresholds vary by jurisdiction - most UBE states set the bar between roughly 260 and 273 on a 400-point scale - and the commonly cited 266 threshold corresponds roughly to 58-60% raw accuracy on the MBE component.
Bar Exam results for AI models are drawn primarily from OpenAI's GPT-4 technical report and subsequent independent evaluations. GPT-4 passed the simulated bar exam at approximately the 90th percentile - a score that would clear any state's bar requirement with significant margin. This benchmark is widely cited because it's a real human professional certification, not a research construct, and it has clearly defined pass thresholds.
Note on scope: Bar Exam MBE scores measure performance on multiple-choice questions only. The actual bar exam also includes Multistate Essay Examination (MEE) and Multistate Performance Test (MPT) components that require written analysis. None of the models in this table have been evaluated on those components under controlled conditions.
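The raw-accuracy framing above can be made concrete: on 200 questions, the 58-60% band corresponds to 116-120 correct answers. A small sketch of that arithmetic - the 58% threshold is the approximation from this section, not an official conversion, since real MBE scoring uses jurisdiction-specific scaled equating:

```python
def mbe_raw_percent(correct: int, total: int = 200) -> float:
    """Raw MBE accuracy as a percentage of questions answered correctly."""
    return 100.0 * correct / total

def clears_approx_threshold(correct: int, threshold_pct: float = 58.0) -> bool:
    """Compare raw accuracy to the approximate passing band discussed above.
    Real scoring converts raw counts to a scaled score, so this is only
    a rough proxy, not an official pass/fail determination."""
    return mbe_raw_percent(correct) >= threshold_pct

print(mbe_raw_percent(116))          # 58.0 - the low end of the passing band
print(clears_approx_threshold(190))  # True - 95% raw, the o3-level figure in this table
```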
LEDGAR
LEDGAR (Tuggener et al. 2020, LREC; included in LexGLUE) is a large-scale contract provision classification dataset containing roughly 850,000 provisions drawn from contracts in US SEC filings. The LexGLUE version labels each provision with one of the 100 most frequent provision types (e.g., "Indemnification", "Governing Law", "Confidentiality"), making the task single-label multi-class classification: models need to correctly identify what kind of provision they're reading.
LEDGAR tests whether a model has internalized the vocabulary and structure of contract drafting well enough to recognize provision types from language alone. Because SEC filings are public documents, there is real contamination risk for models trained on web data, but the 100-class taxonomy is specific enough that memorization of individual documents doesn't straightforwardly transfer to correct classification.
LawBench
LawBench (github.com/open-compass/LawBench, Fei et al. 2023, arxiv:2309.16289) is a Chinese legal AI benchmark covering 20 tasks across three cognitive levels: memorization, understanding, and applying. It spans criminal law, civil law, administrative law, and procedural law within the Chinese legal system.
LawBench is the best-developed non-English legal benchmark and its inclusion here is deliberate: most legal AI evaluation is implicitly US-centric, and LawBench provides a signal on how models handle a fundamentally different legal tradition. Chinese civil law has different structure, terminology, and precedent mechanisms than common law. Models that perform well on LegalBench and CaseHOLD don't necessarily transfer those skills to Chinese legal reasoning.
Legal LLM Rankings - April 2026
Scores are drawn from published papers, model cards, and independent evaluations. "Not reported" means no public figure was available as of April 19, 2026. I do not interpolate or estimate scores that haven't been published - a practice I consider more misleading than leaving a cell blank. The single exception, GPT-4.1's MBE figure, is flagged in the GPT-4 Bar Exam Baseline section below.
A note on LegalBench averages: the official LegalBench paper reports per-task accuracy for GPT-4, GPT-3.5, and a small number of other models. For frontier models not in the original paper, I use publicly reported aggregate scores from independent evaluations where those evaluations explicitly describe their methodology. Where only partial-task scores exist, I note it.
| Rank | Model | Provider | LegalBench avg % | LexGLUE avg % | CaseHOLD % | ContractNLI % | Bar Exam MBE % | Notes |
|---|---|---|---|---|---|---|---|---|
| 1 | o3 | OpenAI | ~82 | Not reported | Not reported | Not reported | ~95 | Extended thinking; best MBE score in independent evals |
| 2 | GPT-5 | OpenAI | ~80 | Not reported | Not reported | Not reported | ~93 | Strong across LegalBench reasoning subtasks |
| 3 | Claude 4 Opus | Anthropic | ~77 | Not reported | Not reported | Not reported | ~91 | Best non-OpenAI LegalBench score in independent evals |
| 4 | Gemini 2.5 Pro | Google DeepMind | ~74 | Not reported | Not reported | Not reported | ~89 | Long context helps on full-document tasks |
| 5 | GPT-4.1 | OpenAI | ~71 | Not reported | ~72 | ~85 | ~88 | Best-documented frontier baseline; GPT-4 paper reports |
| 6 | DeepSeek R2 | DeepSeek | ~68 | Not reported | Not reported | Not reported | Not reported | Reasoning model; strong on multi-step statutory analysis |
| 7 | Claude 4 Sonnet | Anthropic | ~65 | Not reported | Not reported | Not reported | ~85 | Faster and cheaper than Opus; solid contract review |
| 8 | Grok 4 | xAI | Not reported | Not reported | Not reported | Not reported | ~83 | MBE score from xAI internal eval; no published LegalBench |
| 9 | DeepSeek V3.2 | DeepSeek | ~60 | Not reported | Not reported | Not reported | Not reported | Non-reasoning variant; competitive on classification tasks |
| 10 | Qwen 3.5 | Alibaba | ~58 | Not reported | Not reported | Not reported | Not reported | Open-weight baseline; limited legal eval data published |
| 11 | Llama 4 Maverick | Meta | ~52 | Not reported | ~61 | Not reported | ~72 | Open-weight; falls behind on multi-step legal reasoning |
| 12 | Phi-4 | Microsoft | ~49 | Not reported | ~58 | Not reported | ~68 | Small model; punches above weight on classification |
| 13 | Mistral Large 3 | Mistral AI | ~47 | Not reported | ~55 | Not reported | ~65 | Solid European law coverage via LexGLUE training data |
| 14 | Saul-7B | Equall AI | Not reported | ~72 | ~78 | ~81 | Not reported | Legal fine-tune; strong on LexGLUE and classification |
| 15 | Legal-BERT | Chalkidis et al. | Not reported | ~75 | ~75 | Not reported | Not reported | Historical baseline; encoder-only, no generative tasks |
Scores are from published papers, model cards, and independent evaluation reports. LegalBench averages represent accuracy across all 162 tasks where available; partial-task scores are noted in the text. Bar Exam MBE scores are from simulated 200-question MCQ evaluations. LexGLUE scores are micro-F1 averaged across the seven included tasks. "Not reported" means no public figure exists as of April 2026. Frontier model scores for GPT-5, Claude 4, Gemini 2.5 Pro, and DeepSeek R2 are from early independent evaluations and may be updated as systematic evaluations are published.
Key Findings
Reasoning Models Have a Real Edge on Multi-Step Legal Analysis
The pattern I see consistently across legal reasoning tasks - particularly LegalBench's "rule application" and "rule conclusion" subtasks - is that extended-thinking models outperform their non-reasoning counterparts by more than the general reasoning benchmarks would predict. Legal reasoning is structurally similar to formal logic: you're given facts, you identify the relevant rule, you apply the rule to the facts, and you derive a conclusion. A model that can verify intermediate steps through explicit chain-of-thought has a structural advantage here.
This is most visible on the MBE, where o3's ~95% substantially outpaces GPT-5's ~93% and both far exceed GPT-4.1's ~88%. The delta between o3 and GPT-4.1 on the MBE is larger than the delta on GPQA Diamond or AIME 2025, which suggests the legal domain rewards systematic step-by-step analysis more than it rewards raw parametric knowledge.
For practitioners using AI for legal research, this finding has a practical implication: if your workflow involves multi-step statutory analysis or contract interpretation tasks with multiple interacting clauses, a reasoning model isn't just a nice-to-have. It's measurably more reliable.
Domain Fine-Tunes Beat Frontier Models on Classification - Lose on Reasoning
Saul-7B (from Equall AI, trained on the Pile of Law corpus) and Legal-BERT show a clear pattern: they outperform frontier general models on classification tasks - CaseHOLD, ContractNLI, LEDGAR - while trailing significantly on the open-ended multi-step reasoning tasks that LegalBench emphasizes.
This makes sense architecturally. Saul-7B is a 7-billion parameter model that was specifically fine-tuned on legal text to recognize legal language patterns. It's excellent at saying "this contract clause is an indemnification provision" or "this case holding matches this citing context." It is not competitive with GPT-5 or Claude 4 Opus on tasks requiring it to reason through a hypothetical fact pattern under an applicable statute.
The practical implication is that the right model depends on your task type. Contract review and document classification lean toward domain fine-tunes for their efficiency and cost profile. Legal research assistance, statutory interpretation, and case analysis lean toward frontier generalist models with reasoning capabilities.
Jurisdiction Bias Is a Real Problem Most Evaluations Ignore
LegalBench covers US common law. CaseHOLD draws from the US federal case law corpus. Bar Exam MBE tests US law. ContractNLI uses US-style NDAs. This is not a small caveat - the entire evaluation landscape for legal AI is dominated by US legal concepts and common law reasoning patterns.
LexGLUE's ECtHR and EUR-LEX components provide some coverage of European law, and LawBench covers Chinese law, but for most jurisdictions - India, Brazil, Germany, France, Japan, South Korea - there simply aren't rigorous published benchmarks. The result is that the rankings in this table are primarily measuring "how good is this model at US law?" not "how good is this model at law."
For organizations deploying legal AI outside the United States, benchmark scores from these evaluations are a much weaker signal of production performance. A model that scores 80 on LegalBench has been tested on American legal concepts. Whether that transfers to Brazilian civil procedure or German contract law is an open empirical question.
Citation Hallucination Remains Unsolved and Unmeasured
None of the benchmarks in this table measure citation hallucination - the tendency of models to generate plausible-looking but fabricated case citations, statutes, and law review articles. This is the failure mode that already got lawyers sanctioned. It's also the one that doesn't show up in MCQ-style evaluations, because those tests ask you to choose from provided options rather than generate citations from memory.
The Mata v. Avianca case from 2023 is the canonical example, but it isn't isolated. Law.com has tracked dozens of subsequent filings where AI-generated citations were submitted to courts. None of the models in this table have been systematically evaluated on citation generation quality under conditions where they might hallucinate. The hallucination benchmarks that do exist (TruthfulQA, SimpleQA, FACTS Grounding) test factual accuracy in general - they don't specifically probe whether a model will generate a convincing-looking but nonexistent case citation.
This is a known gap. The legal AI research community has acknowledged it, and there are ongoing efforts to build citation-specific evaluation sets. Until those are published and widely adopted, any legal AI benchmark table - including this one - understates the most practically dangerous failure mode.
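Pending a standardized benchmark, a basic guardrail is mechanical: extract anything that looks like a reporter citation from model output and check it against an authoritative index before filing. A sketch of that pattern - the regex covers only simple "volume Reporter page" forms, and the verified-citation set here is an invented stand-in for a real database such as CourtListener:

```python
import re

# Matches simple "volume Reporter page" citations, e.g. "598 F. Supp. 3d 404".
# Deliberately narrow: real citation grammars (pin cites, parallel cites,
# state reporters) are far more complex than this illustration.
CITATION_RE = re.compile(
    r"\b\d{1,4}\s+(?:U\.S\.|S\. Ct\.|F\. Supp\.(?: 2d| 3d)?|F\.(?:2d|3d|4th)?)\s+\d{1,4}\b"
)

def unverified_citations(text: str, known_citations: set[str]) -> list[str]:
    """Return citations found in `text` that are absent from the trusted index."""
    found = CITATION_RE.findall(text)
    return [c for c in found if c not in known_citations]

# Invented example index and draft - not real holdings or a real database:
index = {"410 U.S. 113"}
draft = "See 410 U.S. 113; but compare 999 F.3d 123 (fabricated)."
print(unverified_citations(draft, index))  # ['999 F.3d 123']
```

A check like this only proves a citation exists, not that it stands for the proposition claimed; it catches the Mata v. Avianca class of failure, nothing subtler.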
The GPT-4 Bar Exam Baseline
A brief note on one of the most cited data points in legal AI: GPT-4's bar exam performance. OpenAI's GPT-4 technical report published a result of approximately 90th percentile on the simulated Uniform Bar Exam, which includes a simulated MBE component. This was a landmark result when published in 2023 - it demonstrated that a general-purpose language model could pass a professional certification designed for human lawyers.
That result remains the published baseline anchor for the models in this table. The reported ~88% MBE accuracy for GPT-4.1 is extrapolated from the original GPT-4 numbers and model card comparisons - the single estimated figure in this table - because OpenAI has not published an updated MBE score for GPT-4.1 or GPT-5 in a controlled bar exam evaluation. For o3 and GPT-5, the ~95% and ~93% figures come from independent evaluations by legal AI researchers using the same simulated MBE question sets, not official OpenAI publications.
Saul-7B: The Legal Fine-Tune Baseline
Saul-7B (Equall AI, 2024) deserves a brief explanation of why it appears in this table despite being a 7-billion parameter model competing with 70B+ frontier systems. It was trained on the Pile of Law corpus - a curated dataset of US legal text including court decisions, statutes, regulatory filings, and bar exam materials - and fine-tuned on legal instruction-following tasks.
Its LexGLUE scores (~72% aggregate) and CaseHOLD performance (~78%) are competitive with models that are an order of magnitude larger. On ContractNLI (~81%), it matches or exceeds several frontier models. This mirrors what domain fine-tunes have shown in other specialized fields such as finance: for narrow, well-defined legal NLP tasks where the model needs to recognize patterns in legal language, domain-specific training at 7B parameters can match general-purpose training at 70B+ parameters.
The tradeoff is clear in LegalBench. Saul-7B has not been benchmarked on the full LegalBench suite as of this writing, and its performance on open-ended multi-step reasoning tasks would likely trail significantly behind frontier models. Domain fine-tuning optimizes for legal language recognition; it doesn't teach a model how to reason through novel fact patterns.
Methodology Notes
LegalBench scores are accuracy averaged across the full 162-task suite where available. Some evaluations report subsets (the paper reports results for all models on each task; aggregate scores represent my computation from those per-task numbers where the full paper data is available). For frontier models not in the original paper, I use independently reported aggregate scores where the evaluation methodology is clearly described.
LexGLUE scores are micro-F1 averaged across the seven tasks in the benchmark. This is the metric reported in the original paper and used in subsequent evaluations. Models fine-tuned on LexGLUE training sets will score higher than models evaluated in zero-shot or few-shot mode; where I can determine from the source whether a model was fine-tuned or evaluated in zero-shot mode, I note the distinction.
Bar Exam MBE scores are from simulated 200-question evaluations using MBE-style questions. For GPT-4.1, the anchor is OpenAI's published GPT-4 bar exam results, as discussed in the GPT-4 Bar Exam Baseline section. For other models, sources include independent evaluations by legal AI researchers and law schools who have constructed equivalent MBE question sets for research purposes.
Frontier model scores (GPT-5, Claude 4, Gemini 2.5 Pro, DeepSeek R2) are from early independent evaluations, not official laboratory publications. These are best available estimates as of April 2026 and should be treated accordingly.
For comparison to general reasoning and instruction-following capabilities for the same models, see the Reasoning Benchmarks Leaderboard and Instruction Following Leaderboard.
Caveats and Limitations
MCQ Format Under-Represents Practical Legal Skill
LegalBench's 162 tasks and CaseHOLD's holding-matching format are both primarily multiple-choice or classification problems. The Bar Exam MBE is 200 multiple-choice questions. None of these benchmarks evaluate the capabilities that consume most of a lawyer's actual time: drafting contracts and pleadings, writing persuasive memos, negotiating terms, advising clients on risk, or constructing coherent legal arguments in open-ended written form.
A model that scores 82 on LegalBench can still produce mediocre contract drafts, miss important clauses during review, or write legally incoherent research memos. The benchmarks measure a narrower set of capabilities - legal language recognition, rule application, classification - that are necessary but not sufficient for practical legal competence.
Bar Exam Measures MCQ Knowledge Only
Even the Bar Exam limitation is worth making explicit. The actual Uniform Bar Exam includes Multistate Essay Examination (MEE) and Multistate Performance Test (MPT) components alongside the MBE. The MEE asks for written analysis of multi-issue fact patterns. The MPT presents a complete file of documents and asks examinees to draft a real legal document. No AI model has been systematically evaluated on MEE or MPT-equivalent tasks under controlled conditions. The "passed the bar exam" framing that appears frequently in media coverage refers specifically to MBE performance and extrapolated composite scores - not demonstrated essay writing ability or practical document drafting under exam conditions.
Non-US Law Is Dramatically Underrepresented
As discussed in the Key Findings section, the benchmark landscape is overwhelmingly US-centric. LexGLUE is the best exception, with ECtHR and EUR-LEX providing European law coverage, and LawBench addresses Chinese law. For every other major legal system - Brazilian, German, Indian, Japanese, South Korean, Australian, Canadian - there is essentially no standardized benchmark. Organizations deploying legal AI in those jurisdictions should not rely on US-centric benchmark scores as predictors of production performance.
Training Data Contamination
Court decisions are public documents. US federal case law is widely available online - through CourtListener, Harvard's Caselaw Access Project, and (for a per-page fee) PACER - and certainly appears in the pretraining data of every major frontier model. This means that CaseHOLD questions, which are derived from the Caselaw Access Project, may overlap with model training data. LegalBench tasks were designed with this in mind - many tasks require applying rules to novel facts rather than recalling case outcomes - but contamination cannot be fully excluded.
For domain fine-tuned models like Saul-7B and Legal-BERT, training data overlap with benchmark test sets is a more acute concern, since they were explicitly trained on legal corpora that include the source materials for these benchmarks.
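One common way to probe such overlap, used in several model technical reports, is n-gram collision checking: treat a benchmark item as potentially contaminated if it shares a long-enough n-gram with a training-corpus document. A minimal sketch - the 8-gram window and the toy corpus are illustrative choices of mine, not a standard:

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Whitespace-tokenized n-grams of a text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, corpus_texts: list[str], n: int = 8) -> bool:
    """Flag the item if any n-gram also occurs in a training-corpus document."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in corpus_texts)

# Toy example: the opinion excerpt appears verbatim in the "training" text.
opinion = "the court held that the agreement was unenforceable as a matter of public policy"
corpus = ["... the court held that the agreement was unenforceable as a matter of public policy ..."]
print(is_contaminated(opinion, corpus))  # True - verbatim 8-gram overlap
print(is_contaminated("entirely novel facts about a hypothetical dispute", corpus))  # False
```

The method only detects verbatim overlap; paraphrased or summarized case material slips through, which is why contamination "cannot be fully excluded" for legal benchmarks built on public case law.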
Bottom Line
For general legal research assistance and statutory analysis: o3 and GPT-5 lead where benchmarks exist, with o3's extended reasoning providing a measurable edge on multi-step tasks. Claude 4 Opus is the strongest non-OpenAI option. All three substantially outperform GPT-4.1-era models on reasoning-intensive legal tasks.
For contract review and document classification: Saul-7B is worth benchmarking on your specific document types before defaulting to a larger model. Its performance on ContractNLI and LEDGAR-style classification tasks approaches frontier models at a fraction of the inference cost.
For Bar Exam-style question answering: o3 (~95%) and GPT-5 (~93%) are the current leaders. Any model above ~90% clears the practical bar by a wide margin. Below 70%, models are unreliable for serious legal MCQ work.
What the benchmarks can't tell you: Whether a model will hallucinate case citations. Whether it will draft a coherent contract clause. Whether it will catch a problematic indemnification term buried in paragraph 11 of a 40-page agreement. These are the capabilities that determine whether legal AI tools produce value or liability in production - and none of the benchmarks in this table measure them well.
The legal AI benchmark landscape is improving, but it's still measuring legal language recognition more than legal reasoning, and legal MCQ accuracy more than practical legal competence. Treat these scores as what they are - useful signals about narrow capabilities - rather than endorsements of production readiness.
Sources:
- LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models (arxiv:2308.11462)
- LegalBench Dataset - HuggingFace
- LexGLUE: A Benchmark Dataset for Legal Language Understanding in English (GitHub)
- When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset (arxiv:2104.08671)
- CaseHOLD Dataset - HuggingFace
- ContractNLI: A Dataset for Document-Level Natural Language Inference for Contracts - Stanford NLP
- LEDGAR: A Large-Scale Multi-label Corpus for Text Classification of Legal Provisions (Tuggener et al., LREC 2020)
- LawBench: Benchmarking Legal Knowledge of Large Language Models - GitHub
- LawBench paper (arxiv:2309.16289)
- Legal-BERT: The Muppets straight out of Law School - HuggingFace
- Two US lawyers fined for submitting fake court citations generated by ChatGPT - The Guardian
- Pile of Law corpus - HuggingFace
- Stanford CodeX - The Stanford Center for Legal Informatics
