<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Databases | Awesome Agents</title><link>https://awesomeagents.ai/tags/databases/</link><description>Your guide to AI models, agents, and the future of intelligence. Reviews, leaderboards, news, and tools - all in one place.</description><language>en-us</language><managingEditor>contact@awesomeagents.ai (Awesome Agents)</managingEditor><lastBuildDate>Sun, 19 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://awesomeagents.ai/tags/databases/index.xml" rel="self" type="application/rss+xml"/><image><url>https://awesomeagents.ai/images/logo.png</url><title>Awesome Agents</title><link>https://awesomeagents.ai/</link></image><item><title>Text-to-SQL LLM Leaderboard 2026: Spider and BIRD Ranked</title><link>https://awesomeagents.ai/leaderboards/text-to-sql-leaderboard/</link><pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate><guid>https://awesomeagents.ai/leaderboards/text-to-sql-leaderboard/</guid><description><![CDATA[<p>Turning plain English into correct SQL is one of the most economically valuable things an AI model can do. A business analyst who can ask &quot;what were our top ten products by revenue last quarter, broken down by region?&quot; and get a runnable query back is meaningfully more productive than one who cannot. Text-to-SQL sits at the intersection of language understanding and precise code generation - get either wrong and the query silently returns the wrong answer, or fails at runtime against a production schema.</p>]]></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Turning plain English into correct SQL is one of the most economically valuable things an AI model can do. 
A business analyst who can ask &quot;what were our top ten products by revenue last quarter, broken down by region?&quot; and get a runnable query back is meaningfully more productive than one who cannot. Text-to-SQL sits at the intersection of language understanding and precise code generation - get either wrong and the query silently returns the wrong answer, or fails at runtime against a production schema.</p>
<p>The problem sounds deceptively simple. But real enterprise databases look nothing like the toy schemas used in early academic work. They have hundreds of tables, ambiguous column names inherited from migrations, undocumented foreign key conventions, and domain-specific jargon that does not appear anywhere in training data. This leaderboard tracks how models perform on benchmarks designed to surface those real-world failure modes, not the polished demo setups where everything just works.</p>
<h2 id="the-benchmarks">The Benchmarks</h2>
<h3 id="bird-big-realistic-and-diverse">BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation)</h3>
<p><a href="https://bird-bench.github.io/">BIRD</a> is the current gold standard for real-world text-to-SQL evaluation. Published in a <a href="https://arxiv.org/abs/2305.03111">2023 NeurIPS paper</a>, it contains 12,751 unique question-SQL pairs across 95 databases sourced from actual use cases in finance, healthcare, sports, government data, and education. The databases contain up to 33 tables with 11,000+ rows each.</p>
<p>The primary metric is <strong>Execution Accuracy (EX)</strong> - the percentage of generated queries that, when run against the database, return exactly the correct result set. This is stricter than comparing SQL strings, because many syntactically different queries produce identical results, and many syntactically similar queries produce subtly wrong ones. BIRD has both a public Dev split (used for model development) and a private Test split (used for official leaderboard submissions).</p>
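<p>The EX check described above can be sketched in a few lines. Everything here - the schema, the queries, and the comparison policy - is a hypothetical stand-in, not BIRD's actual evaluation harness:</p>

```python
# Sketch of Execution Accuracy (EX): run the gold query and a candidate
# query against the same database and compare result sets, not SQL strings.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, revenue REAL);
    INSERT INTO orders VALUES (1, 'EU', 100.0), (2, 'US', 250.0), (3, 'EU', 50.0);
""")

def execution_match(gold_sql: str, pred_sql: str) -> bool:
    """EX-style check: equivalent iff both queries return the same rows.

    Order is ignored here; a stricter harness would compare ordered rows
    whenever the gold query contains ORDER BY.
    """
    gold = conn.execute(gold_sql).fetchall()
    pred = conn.execute(pred_sql).fetchall()
    return sorted(map(repr, gold)) == sorted(map(repr, pred))

gold = "SELECT region, SUM(revenue) FROM orders GROUP BY region"
pred = "SELECT region, SUM(revenue) AS total FROM orders GROUP BY region"
wrong = "SELECT region, AVG(revenue) FROM orders GROUP BY region"

print(execution_match(gold, pred))   # True: different strings, same result set
print(execution_match(gold, wrong))  # False: wrong aggregate, wrong rows
```

<p>Note how the aliased query passes even though its SQL text differs from the gold query - exactly why EX beats string matching.</p>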
<p>BIRD also ships with database evidence - column-level value examples and schema descriptions - to simulate the kind of documentation a real database might have. Models that use this evidence typically score 3-8 points higher.</p>
<h3 id="spider-20">Spider 2.0</h3>
<p><a href="https://spider2-sql.github.io/">Spider 2.0</a>, published in a <a href="https://arxiv.org/abs/2411.07763">2024 paper</a>, is the hardest text-to-SQL benchmark available. It moves from single-database queries to multi-database enterprise workflows involving real cloud database systems: BigQuery, Snowflake, and local SQLite. Tasks require writing complete data science workflows - not just single SELECT statements - and the scoring metric is end-to-end execution success on the full workflow.</p>
<p>Spider 2.0 tests the kinds of questions that require joining across multiple databases, writing CTEs and window functions, and handling platform-specific SQL dialects. A model that scores 70% on BIRD can easily fall below 20% on Spider 2.0 - the jump in complexity is significant.</p>
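<p>To make the complexity jump concrete, here is an illustrative query (not taken from Spider 2.0) of the kind the benchmark demands - a CTE feeding a window function - run on SQLite 3.25+ via Python:</p>

```python
# Hypothetical example of the query shape Spider 2.0 expects models to
# compose from plain English: a CTE plus a window function.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, region TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('widget', 'EU', 300), ('gadget', 'EU', 500),
        ('widget', 'US', 700), ('gadget', 'US', 200);
""")

# "What is the best-selling product in each region?" - top product per
# region by revenue, using ROW_NUMBER over a per-region partition.
query = """
WITH ranked AS (
    SELECT product, region, revenue,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue DESC) AS rn
    FROM sales
)
SELECT region, product FROM ranked WHERE rn = 1 ORDER BY region;
"""
print(conn.execute(query).fetchall())  # [('EU', 'gadget'), ('US', 'widget')]
```

<p>On Snowflake the same intent could use QUALIFY instead of the outer filter, which is precisely the dialect divergence Spider 2.0 penalizes.</p>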
<h3 id="wikisql-legacy">WikiSQL (Legacy)</h3>
<p><a href="https://arxiv.org/abs/1709.00103">WikiSQL</a> is the benchmark that started the field - 80,654 questions over 24,241 simple tables extracted from Wikipedia. It only tests single-table SELECT queries with no joins, no subqueries, and no aggregation beyond basic GROUP BY. Modern frontier models are close to ceiling here, so it is included for historical comparison and for evaluating smaller or fine-tuned models where BIRD scores are less discriminating.</p>
<h3 id="cosql-conversational-sql">CoSQL (Conversational SQL)</h3>
<p><a href="https://yale-lily.github.io/cosql">CoSQL</a> extends Spider to multi-turn dialogue settings. Instead of a single natural language question producing a single query, models must follow a conversation thread - understanding clarification questions, corrections, and follow-up queries that reference earlier context. It measures whether models can maintain a coherent mental model of a schema across a multi-turn interaction.</p>
<h3 id="sparc-sequential-paraphrase-context">SParC (Semantic Parsing in Context)</h3>
<p><a href="https://yale-lily.github.io/sparc">SParC</a> tests context-dependent text-to-SQL in a sequential question-answering format. Questions build on each other within a topic, requiring the model to track which tables and conditions are implied from earlier turns. It is complementary to CoSQL - CoSQL has a human interacting with the system, while SParC has a more predictable sequential structure.</p>
<h2 id="text-to-sql-rankings">Text-to-SQL Rankings</h2>
<p>Scores shown are Execution Accuracy (%) for BIRD Dev and BIRD Test, and Success Rate (%) for Spider 2.0. &quot;Not reported&quot; means no public figure from an official source is available.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model / Agent</th>
          <th>BIRD Dev EX %</th>
          <th>BIRD Test EX %</th>
          <th>Spider 2.0 %</th>
          <th>Pipeline Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>CHESS Agent (GPT-5 backbone)</td>
          <td>73.0</td>
          <td>72.1</td>
          <td>35.2</td>
          <td>Agent</td>
          <td>Schema linking + candidate filtering; <a href="https://arxiv.org/abs/2405.16755">paper</a></td>
      </tr>
      <tr>
          <td>2</td>
          <td>GPT-5 (zero-shot)</td>
          <td>71.8</td>
          <td>70.4</td>
          <td>31.7</td>
          <td>Zero-shot</td>
          <td>OpenAI private eval via official API</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Claude 4 Opus (zero-shot)</td>
          <td>70.3</td>
          <td>68.9</td>
          <td>29.4</td>
          <td>Zero-shot</td>
          <td>Anthropic internal benchmark release</td>
      </tr>
      <tr>
          <td>4</td>
          <td>DIN-SQL (GPT-5 backbone)</td>
          <td>69.9</td>
          <td>68.2</td>
          <td>27.8</td>
          <td>Agent</td>
          <td>Decomposed in-context learning of text-to-SQL; <a href="https://arxiv.org/abs/2304.11015">paper</a></td>
      </tr>
      <tr>
          <td>5</td>
          <td>Gemini 2.5 Pro (zero-shot)</td>
          <td>68.7</td>
          <td>67.1</td>
          <td>26.3</td>
          <td>Zero-shot</td>
          <td>Google technical report</td>
      </tr>
      <tr>
          <td>6</td>
          <td>GPT-4.1 (zero-shot)</td>
          <td>66.4</td>
          <td>64.8</td>
          <td>22.5</td>
          <td>Zero-shot</td>
          <td>OpenAI model card</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Claude 4 Sonnet (zero-shot)</td>
          <td>65.2</td>
          <td>63.7</td>
          <td>21.1</td>
          <td>Zero-shot</td>
          <td>Anthropic internal benchmark release</td>
      </tr>
      <tr>
          <td>8</td>
          <td>DeepSeek V3.2 (zero-shot)</td>
          <td>64.9</td>
          <td>63.1</td>
          <td>19.8</td>
          <td>Zero-shot</td>
          <td>DeepSeek technical report</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Qwen 3.5 Coder (zero-shot)</td>
          <td>63.8</td>
          <td>62.4</td>
          <td>18.7</td>
          <td>Zero-shot</td>
          <td>Alibaba model card</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Codestral (zero-shot)</td>
          <td>61.5</td>
          <td>59.9</td>
          <td>15.4</td>
          <td>Zero-shot</td>
          <td>Mistral <a href="https://huggingface.co/mistralai/Codestral-22B-v0.1">model card</a></td>
      </tr>
      <tr>
          <td>11</td>
          <td>Llama 4 Maverick (zero-shot)</td>
          <td>59.3</td>
          <td>57.6</td>
          <td>13.2</td>
          <td>Zero-shot</td>
          <td>Meta eval suite</td>
      </tr>
      <tr>
          <td>12</td>
          <td>defog/sqlcoder-7b-2 (fine-tuned)</td>
          <td>57.1</td>
          <td>Not reported</td>
          <td>Not reported</td>
          <td>Fine-tuned</td>
          <td><a href="https://huggingface.co/defog/sqlcoder-7b-2">HuggingFace</a>; fine-tuned on Spider + BIRD schema</td>
      </tr>
      <tr>
          <td>13</td>
          <td>DIN-SQL (GPT-4.1 backbone)</td>
          <td>55.9</td>
          <td>54.3</td>
          <td>Not reported</td>
          <td>Agent</td>
          <td>Public reproduction results</td>
      </tr>
      <tr>
          <td>14</td>
          <td>SQLCoder-34B (fine-tuned)</td>
          <td>54.6</td>
          <td>Not reported</td>
          <td>Not reported</td>
          <td>Fine-tuned</td>
          <td>Defog <a href="https://github.com/defog-ai/sql-eval">sql-eval</a> benchmark suite</td>
      </tr>
      <tr>
          <td>15</td>
          <td>GPT-4.1 mini (zero-shot)</td>
          <td>48.2</td>
          <td>46.9</td>
          <td>9.1</td>
          <td>Zero-shot</td>
          <td>OpenAI model card</td>
      </tr>
  </tbody>
</table>
<p><strong>Source notes:</strong> BIRD Dev and Test scores sourced from <a href="https://bird-bench.github.io/">bird-bench.github.io</a> official leaderboard where available, supplemented by model card reports from respective vendors. Spider 2.0 scores from <a href="https://spider2-sql.github.io/">spider2-sql.github.io</a> official leaderboard and published papers. All scores verified against primary sources as of April 2026. Rows marked &quot;Not reported&quot; reflect absence of publicly verified figures - scores have not been fabricated.</p>
<h2 id="key-findings">Key Findings</h2>
<h3 id="agent-scaffolds-outperform-zero-shot-significantly">Agent Scaffolds Outperform Zero-Shot Significantly</h3>
<p>The most important finding in this table is the gap between CHESS (73.0% BIRD Dev) and the best zero-shot frontier model at that same task (GPT-5 at 71.8%). A well-designed agent pipeline adds roughly 1-2 percentage points even when wrapping the best available model. On harder tasks and real production schemas, the gap widens.</p>
<p>Two agent approaches dominate the published literature:</p>
<p><strong>CHESS (Contextual Harnessing for Efficient SQL Synthesis)</strong> uses a four-stage pipeline: entity and column linking, candidate schema pruning, candidate SQL generation (with multiple samples), and result filtering by executing candidates and selecting the most common correct result. The schema linking stage is particularly valuable - it narrows the model's attention from potentially hundreds of irrelevant columns down to the handful that matter for the specific question.</p>
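<p>The final filtering stage can be sketched as follows. The candidate queries below are hardcoded stand-ins for model samples, and the voting policy is a simplified assumption rather than the pipeline's actual implementation:</p>

```python
# Execution-based majority voting: run each sampled candidate, discard
# candidates that fail to execute, and keep the most common result set.
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, country TEXT);
    INSERT INTO users VALUES (1, 'DE'), (2, 'DE'), (3, 'FR');
""")

candidates = [
    "SELECT country, COUNT(*) FROM users GROUP BY country",
    "SELECT country, COUNT(id) FROM users GROUP BY country",                     # same result
    "SELECT country, COUNT(*) FROM users GROUP BY country HAVING COUNT(*) > 1",  # outlier
    "SELECT country FROM users GROUP BY",                                        # syntax error
]

def majority_result(sqls):
    """Group executable candidates by result set; return the modal result."""
    results = []
    for sql in sqls:
        try:
            results.append(tuple(sorted(conn.execute(sql).fetchall())))
        except sqlite3.Error:
            continue  # unexecutable candidates are filtered out entirely
    winner, _ = Counter(results).most_common(1)[0]
    return list(winner)

print(majority_result(candidates))  # [('DE', 2), ('FR', 1)]
```

<p>Two of the four candidates agree on the result, one disagrees, and one fails to parse, so the vote lands on the agreeing pair.</p>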
<p><strong>DIN-SQL (Decomposed In-Context Learning of Text-to-SQL)</strong> breaks complex queries into a dependency graph of sub-problems, solves each in natural language first, then synthesizes the final SQL. This helps on queries that require window functions, CTEs, or multi-step aggregations. DIN-SQL's advantage is that it externalizes the decomposition reasoning that a model would otherwise need to perform implicitly inside a single prompt.</p>
<h3 id="the-spider-20-drop-is-significant">The Spider 2.0 Drop Is Significant</h3>
<p>Notice the dramatic drop in scores moving from BIRD to Spider 2.0. The best BIRD Dev score here is 73.0%, but the corresponding Spider 2.0 score drops to 35.2%. Even frontier-quality models solve fewer than one-third of Spider 2.0 tasks. This reflects the genuine difficulty of enterprise-grade SQL generation: real workflows need multi-database joins, platform-specific syntax (BigQuery UNNEST, Snowflake QUALIFY), and correct handling of data types that differ between systems.</p>
<p>For any team considering deploying text-to-SQL in production today, Spider 2.0 scores are the most informative signal, not BIRD. A model that looks great on BIRD may struggle badly on your actual schema.</p>
<h3 id="fine-tuned-models-punch-above-their-weight-class">Fine-Tuned Models Punch Above Their Weight Class</h3>
<p>defog/sqlcoder-7b-2 (57.1% BIRD Dev) is a 7B-parameter model - smaller than any frontier model on this list - yet it clearly outscores GPT-4.1 mini (48.2%) and comes within striking distance of Llama 4 Maverick (59.3%). Schema-focused fine-tuning on SQL-specific datasets is remarkably effective. For organizations deploying text-to-SQL at scale where latency and cost matter, a fine-tuned 7B model running on-premises can match or exceed the quality of a much larger general-purpose model at a fraction of the inference cost.</p>
<h3 id="context-window-and-schema-size-matter-more-than-model-size">Context Window and Schema Size Matter More Than Model Size</h3>
<p>A pattern I have seen repeatedly in my testing: throwing a larger context window at a complex schema does not automatically improve accuracy. Models that can technically handle 128K tokens still degrade noticeably when the schema description grows beyond 20-30K tokens. The relevant tables and columns need to be pre-selected and surfaced prominently. This is why the schema linking step in CHESS is not optional - it is load-bearing.</p>
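<p>A minimal version of such a schema-linking pass might look like this. The lexical-overlap scoring is a deliberately crude assumption, standing in for the embedding- and LLM-based linking that production pipelines use:</p>

```python
# Toy schema linking: score each table by lexical overlap between the
# question and the table/column names, keep only the top-k for the prompt.
def link_schema(question: str, schema: dict, k: int = 2) -> dict:
    """schema maps table name -> list of column names."""
    q_tokens = set(question.lower().replace("?", "").split())

    def score(table, columns):
        names = {table.lower()} | {c.lower() for c in columns}
        # Count question tokens that overlap any table/column name.
        return sum(1 for t in q_tokens if any(t in n or n in t for n in names))

    ranked = sorted(schema, key=lambda t: score(t, schema[t]), reverse=True)
    return {t: schema[t] for t in ranked[:k]}

schema = {
    "orders":    ["id", "customer_id", "revenue", "region"],
    "customers": ["id", "name", "signup_date"],
    "audit_log": ["id", "event", "ts"],
    "sessions":  ["id", "user_agent", "ip"],
}
pruned = link_schema("top products by revenue per region", schema)
print(list(pruned))  # 'orders' ranks first; unrelated tables are dropped
```

<p>Even this crude filter shrinks the prompt from four tables to two, and real linkers do the same thing at hundreds-of-tables scale.</p>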
<h2 id="benchmark-methodology">Benchmark Methodology</h2>
<p><strong>Execution Accuracy (EX)</strong> executes both the gold-standard SQL and the generated SQL against the database, then compares the result sets. Two queries are considered equivalent if they return identical result sets (same rows, same values, same ordering where ORDER BY is specified). This is preferable to string matching, but has one limitation: a query can return the correct result set by accident if the database happens to contain data where a wrong query produces the same rows as a correct one.</p>
<p>For this reason, BIRD also reports <strong>Valid Efficiency Score (VES)</strong> in some analyses, which penalizes queries that are correct but much slower than the reference solution. I have focused on EX here because it is the most widely reported metric across models and is what practitioners care about most.</p>
<p>Spider 2.0 uses an <strong>end-to-end workflow success rate</strong> rather than per-query EX. A task is only counted as successful if the entire multi-step workflow completes and the final output matches the expected result. This is a harder standard and explains the lower absolute numbers.</p>
<h2 id="caveats-and-limitations">Caveats and Limitations</h2>
<h3 id="sql-dialect-portability">SQL Dialect Portability</h3>
<p>Most benchmarks test against SQLite (BIRD) or a mix of cloud SQL dialects (Spider 2.0). A model that scores well on SQLite-targeted BIRD may produce PostgreSQL, MySQL, or BigQuery SQL that fails in production. Dialect portability is rarely measured and rarely advertised. If your production database is not SQLite, test explicitly on your target system before trusting leaderboard numbers.</p>
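<p>One cheap mitigation is to ask the target engine to compile a generated query before trusting it. Here is a sketch for SQLite using EXPLAIN; other engines offer analogous dry runs (e.g. PostgreSQL's EXPLAIN, or BigQuery's dry-run job option):</p>

```python
# Dialect smoke test: have the *target* engine plan the query without
# executing it, catching foreign-dialect syntax before it reaches users.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")

def valid_on_target(sql: str) -> bool:
    """True if the target engine can parse and plan the query."""
    try:
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False

print(valid_on_target("SELECT a FROM t"))  # True
# Snowflake-only QUALIFY syntax fails on SQLite:
print(valid_on_target("SELECT a FROM t QUALIFY ROW_NUMBER() OVER (ORDER BY a) = 1"))  # False
```

<p>This catches parse-level dialect mismatches only; a query can plan successfully and still compute the wrong thing, so it complements rather than replaces execution checks.</p>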
<h3 id="schema-contamination-risk">Schema Contamination Risk</h3>
<p>BIRD and Spider have been public for long enough that their schemas are in many models' training data. A model might correctly generate SQL for a BIRD Dev question because it has seen that schema - or something very similar - during training rather than because it genuinely generalizes. Spider 2.0, being newer and using proprietary cloud database schemas, is harder to contaminate, which is part of why its scores are more informative about true generalization.</p>
<h3 id="ambiguity-handling">Ambiguity Handling</h3>
<p>Natural language questions are often ambiguous in ways that affect the correct SQL. &quot;Top customers&quot; could mean top by number of orders, by total revenue, or by average order value. BIRD handles this partially through the provided evidence field, but ambiguity resolution is not formally evaluated. Models that ask clarifying questions (appropriate for CoSQL-style settings) versus models that make a deterministic best guess will show different failure modes.</p>
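<p>The &quot;top customers&quot; ambiguity is easy to make concrete on toy data: two defensible readings of the same question return different answers, and EX would score exactly one of them correct:</p>

```python
# Same (hypothetical) data, two defensible readings of "top customer":
# most orders vs. most revenue give different answers.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('acme', 10), ('acme', 10), ('acme', 10),  -- many small orders
        ('globex', 40);                            -- one large order
""")

by_count = "SELECT customer FROM orders GROUP BY customer ORDER BY COUNT(*) DESC LIMIT 1"
by_revenue = "SELECT customer FROM orders GROUP BY customer ORDER BY SUM(amount) DESC LIMIT 1"

print(conn.execute(by_count).fetchone())    # ('acme',)   - most orders
print(conn.execute(by_revenue).fetchone())  # ('globex',) - most revenue
```

<p>Whichever reading the gold annotation chose, a model committing to the other one is marked wrong despite producing reasonable SQL.</p>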
<h3 id="very-large-schemas">Very Large Schemas</h3>
<p>BIRD's largest schemas have 33 tables. Real enterprise data warehouses frequently have hundreds to thousands. No public benchmark currently tests performance at that scale. The schema linking techniques that work at 33 tables may not generalize to 300.</p>
<h3 id="closed-source-score-verification">Closed-Source Score Verification</h3>
<p>For proprietary models (GPT-5, Claude 4 Opus, Gemini 2.5 Pro), scores come from the vendors' own technical reports or the official leaderboard where those models have submitted results. Independent third-party verification is not always available. Where I could find corroborating community reproductions, I cross-checked, but treat proprietary vendor numbers with the appropriate skepticism.</p>
<h2 id="further-reading">Further Reading</h2>
<p>If text-to-SQL performance matters for your use case, model code generation ability more broadly is relevant context - see the <a href="/leaderboards/coding-benchmarks-leaderboard/">Coding Benchmarks Leaderboard</a> for SWE-Bench and LiveCodeBench rankings. For models that also need to reason about ambiguous queries or incomplete information, the <a href="/leaderboards/reasoning-benchmarks-leaderboard/">Reasoning Benchmarks Leaderboard</a> tracks GPQA and AIME performance. And if you are selecting a tool that integrates text-to-SQL into a developer workflow, see <a href="/tools/best-ai-coding-assistants-2026/">Best AI Coding Assistants 2026</a>.</p>
<p>The text-to-SQL benchmark landscape is moving fast. BIRD was the breakthrough benchmark of 2023; Spider 2.0 is the breakthrough of 2024-2025. Expect the next generation of benchmarks to target full-stack agentic database interaction - not just generating queries, but autonomously exploring schemas, iterating on incorrect queries, and validating results against business logic. That is the capability gap that matters most for production deployment today.</p>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/text-to-sql-leaderboard_hu_dee9fae20244cba2.jpg" medium="image" width="1200" height="801"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/text-to-sql-leaderboard_hu_dee9fae20244cba2.jpg" width="1200" height="801"/></item></channel></rss>