Instruction Following Leaderboard: IFEval Rankings 2026
Rankings of AI models on IFEval and IFBench, the two main benchmarks for measuring how reliably LLMs follow precise formatting, length, and content constraints.

Most AI benchmarks measure what a model knows. Instruction following benchmarks measure whether a model will actually do what you tell it. The difference matters more than it sounds: a model that aces GPQA Diamond but refuses to stay under 200 words, or ignores your "no bullet points" requirement, is a liability in production.
IFEval - Instruction Following Evaluation - was introduced by Google in late 2023 to tackle this directly. It doesn't ask models to solve hard problems. It asks them to follow simple, checkable rules: write at least 400 words, use the word "innovation" exactly once, don't use commas, end with a specific phrase. These are tasks a competent assistant should handle without incident. Turns out, many models still trip over them.
This leaderboard tracks performance across IFEval and IFBench, a newer generalization test from the Allen Institute for AI, updated as of April 2026.
TL;DR
- Qwen3.5-27B leads the IFEval open-source rankings at 95.0%, with six Qwen3.5 variants in the top 10
- On IFBench - a harder test of instruction generalization - Hermes 3 70B surprises at 81.2%, above larger Qwen models
- Gemma 3 4B scores 90.2% on IFEval, making it the best small model for instruction-sensitive tasks under a 10B parameter budget
What IFEval Actually Tests
IFEval evaluates models on 541 prompts containing one or more "verifiable instructions" - constraints that can be automatically checked with a short Python function. The benchmark covers 25 distinct constraint types grouped into six categories:
- Format: JSON output, markdown sections, number of paragraphs, use of bullet points or numbered lists
- Length: word count minimums/maximums, sentence count, number of sentences per paragraph
- Keywords: use a specific word at least N times, avoid a specific word entirely
- Content: start with a specific sentence, end with a specific phrase, include a postscript
- Language: respond in a specific language
- Style: don't use commas, write in all caps, avoid certain punctuation
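Constraints like these are simple enough that each check fits in a few lines. A minimal sketch of what such verifiers look like (simplified illustrations of the idea, not the official IFEval checker code):

```python
import re

# Each IFEval-style constraint is a small, deterministic predicate
# over the model's response text. Simplified examples follow.

def check_no_commas(response: str) -> bool:
    """Style constraint: the response must not contain commas."""
    return "," not in response

def check_min_words(response: str, n: int = 400) -> bool:
    """Length constraint: at least n words."""
    return len(response.split()) >= n

def check_keyword_exact(response: str, word: str, count: int = 1) -> bool:
    """Keyword constraint: use `word` exactly `count` times (case-insensitive)."""
    matches = re.findall(rf"\b{re.escape(word)}\b", response, re.IGNORECASE)
    return len(matches) == count

def check_ends_with(response: str, phrase: str) -> bool:
    """Content constraint: end with a specific phrase."""
    return response.rstrip().endswith(phrase)
```

Because every check is a pure function of the output string, grading is fully automatic and there is no judge model in the loop.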
The evaluation uses two accuracy metrics: strict (exact compliance) and loose (allowing minor formatting variations that don't change semantic compliance). Both are measured at the prompt level (did every instruction in the prompt pass?) and the instruction level (what fraction of individual instructions passed?). Prompt-level strict accuracy is the hardest bar - and the one that matters most in practice.
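The two aggregation levels can be sketched as follows, assuming each prompt's results are a list of per-instruction pass/fail booleans (hypothetical data, not real benchmark output):

```python
def prompt_level_accuracy(results: list[list[bool]]) -> float:
    """Fraction of prompts where EVERY instruction passed (strict)."""
    return sum(all(prompt) for prompt in results) / len(results)

def instruction_level_accuracy(results: list[list[bool]]) -> float:
    """Fraction of individual instructions that passed, pooled across prompts."""
    flat = [ok for prompt in results for ok in prompt]
    return sum(flat) / len(flat)

# Three prompts; the second fails one of its two instructions.
results = [[True], [True, False], [True, True]]
# Prompt-level strict: 2 of 3 prompts pass fully.
# Instruction-level: 4 of 5 individual instructions pass.
```

Instruction-level accuracy is always at least as high as prompt-level accuracy, which is why the prompt-level strict number is the stricter headline figure.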
IFEval's advantage is reproducibility. Any team can run it locally. Its limitation is that models have trained extensively on or near its constraint vocabulary, which creates overfitting risk.
IFBench: Testing Generalization
IFBench, from the Allen Institute for AI, addresses this directly. It introduces 58 novel, verifiable constraints that models are unlikely to have seen during training - unusual output structures, nonstandard formatting patterns, and precise content requirements that don't appear in IFEval's vocabulary. If a model learned to follow "write at least 400 words" by pattern matching against IFEval training data, IFBench will expose that.
The gap between a model's IFEval and IFBench scores is a useful signal. A small gap means the model truly understands constraint-following. A large gap suggests it's memorized the IFEval constraint types without real generalization.
Frontier Models: Combined Rankings
This table uses data from BenchLM.ai's composite instruction following leaderboard (IFEval weighted 65%, IFBench 35%), verified as of April 9, 2026. The weighted score reflects that IFEval still has broader evaluation coverage while IFBench provides the generalization signal.
| Rank | Model | Provider | Weighted Score |
|---|---|---|---|
| 1 | Kimi K2.5 (Reasoning) | Moonshot AI | 100.0% |
| 2 | Grok 4.20 Multi-agent | xAI | 100.0% |
| 3 | Grok 4.20 | xAI | 97.7% |
| 4 | Claude Opus 4.6 | Anthropic | 95.1% |
| 5 | GPT-5.4 | OpenAI | 93.8% |
| 6 | GPT-5.4 Pro | OpenAI | 93.6% |
| 7 | o1 | OpenAI | 93.3% |
| 8 | GLM-5.1 | Z.AI | 93.0% |
| 9 | Gemini 3.1 Pro | Google | 92.8% |
| 10 | GPT-5.2-Codex | OpenAI | 92.6% |
The top two positions - Kimi K2.5 Reasoning and Grok 4.20 Multi-agent - are specialized variants that aren't everyday API picks. Kimi K2.5 Reasoning is a chain-of-thought model that may use its scratchpad to explicitly track constraint compliance before writing its final answer. Grok 4.20 Multi-agent is a scaffolded system, not a single call. At the frontier of practical single-call models, Claude Opus 4.6 at 95.1% and GPT-5.4 at 93.8% are the realistic top performers.
The o1 entry is particularly valuable because it's one of the few frontier models with individually reported IFEval (93.3%) and IFBench (92.2%) sub-scores - a gap of only 1.1 percentage points, suggesting genuine generalization rather than benchmark overfitting.
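The 65/35 weighting is straightforward to reproduce. A minimal sketch using o1's individually reported sub-scores (the blended value is illustrative - BenchLM.ai's exact rounding and normalization may differ from this simple linear blend):

```python
def composite_score(ifeval: float, ifbench: float,
                    w_ifeval: float = 0.65, w_ifbench: float = 0.35) -> float:
    """Linear blend in percentage points: IFEval weighted 65%, IFBench 35%."""
    return w_ifeval * ifeval + w_ifbench * ifbench

# o1's individually reported sub-scores:
gap = 93.3 - 92.2                      # 1.1 percentage points
blended = composite_score(93.3, 92.2)  # ~92.9 under this simple blend
```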
Open-Source IFEval Rankings
These scores are from llm-stats.com's IFEval leaderboard as of April 8, 2026. Sixty-two models are evaluated on the full IFEval prompt set (prompt-level strict accuracy). All results are self-reported by model developers or evaluators using the standard IFEval harness.
| Rank | Model | Provider | IFEval Score | Params |
|---|---|---|---|---|
| 1 | Qwen3.5-27B | Alibaba | 0.950 | 27B |
| 2 | Qwen3.6 Plus | Alibaba | 0.943 | ~100B |
| 3 | o3-mini | OpenAI | 0.939 | - |
| 4 | Qwen3.5-122B-A10B | Alibaba | 0.934 | 122B MoE |
| 5 | Claude 3.7 Sonnet | Anthropic | 0.932 | - |
| 6 | Qwen3.5-397B-A17B | Alibaba | 0.926 | 397B MoE |
| 7 | Nova Pro | Amazon | 0.921 | - |
| 7 | Llama 3.3 70B Instruct | Meta | 0.921 | 70B |
| 9 | Qwen3.5-35B-A3B | Alibaba | 0.919 | 35B MoE |
| 10 | Qwen3.5-9B | Alibaba | 0.915 | 9B |
| 11 | Gemma 3 27B | Google | 0.904 | 27B |
| 12 | Nemotron Nano 9B v2 | NVIDIA | 0.903 | 9B |
| 13 | Gemma 3 4B | Google | 0.902 | 4B |
The average across all 62 evaluated models is 0.844. Anything above 0.90 puts a model in the top tier for practical constraint-following.
Qwen3.5 variants dominate the IFEval leaderboard, occupying six of the top ten positions across different model sizes.
IFBench: Generalization Rankings
IFBench scores tell a different story. Only 16 models have been evaluated so far - the benchmark is newer and requires more setup than IFEval - but the rankings already hold some surprises.
| Rank | Model | Provider | IFBench Score | Params |
|---|---|---|---|---|
| 1 | Hermes 3 70B | Nous Research | 0.812 | 70B |
| 2 | Qwen3.5-397B-A17B | Alibaba | 0.765 | 397B MoE |
| 2 | Qwen3.5-27B | Alibaba | 0.765 | 27B |
| 4 | Qwen3.5-122B-A10B | Alibaba | 0.761 | 122B MoE |
| 5 | Qwen3.6 Plus | Alibaba | 0.742 | ~100B |
| 6 | Nemotron 3 Super 120B A12B | NVIDIA | 0.726 | 120B MoE |
| 7 | Mercury 2 | Inception | 0.710 | - |
| 8 | Qwen3.5-35B-A3B | Alibaba | 0.702 | 35B MoE |
| 9 | MiniMax M2.1 | MiniMax | 0.700 | - |
| 10 | GPT OSS 120B High | OpenAI | 0.695 | 120B |
| 11 | K-EXAONE-236B-A23B | LG AI Research | 0.673 | 236B MoE |
| 12 | Qwen3.5-9B | Alibaba | 0.645 | 9B |
| 13 | Qwen3.5-4B | Alibaba | 0.592 | 4B |
| 14 | Mistral Small 4 | Mistral | 0.480 | - |
Hermes 3 70B at 81.2% is a striking result. It's a fine-tuned instruction model from Nous Research built specifically for precise, structured output generation. Its strong IFBench performance relative to much larger Qwen models suggests Nous's training approach instills something closer to genuine constraint understanding than the Qwen post-training process does. The average IFBench score across assessed models is 0.649 - dramatically lower than IFEval's 0.844 average - which confirms that instruction generalization is truly harder than it looks on standard benchmarks.
Key Takeaways
Qwen Has Owned IFEval
Six of the top ten open-source IFEval rankings belong to Qwen3.5 variants. Qwen3.5-27B at 95.0% leads the field, followed closely by Qwen3.6 Plus at 94.3%. Even the 9B parameter Qwen3.5-9B scores 91.5% - above most 70B class models from other providers.
This isn't a coincidence. Alibaba's post-training pipeline for the Qwen3.5 family appears to have placed significant emphasis on constraint-following RLHF, and it shows. If your use case requires reliable formatting compliance - structured outputs, templated reports, constrained customer-facing text - Qwen3.5-27B is the open-source default.
Gemma 3 4B is the Efficiency Winner
Gemma 3 27B scores 90.4%, good for rank 11 overall. But the more interesting result is Gemma 3 4B at 90.2%. A 4B parameter model approaching 90% on IFEval places it ahead of models four to ten times its size. For teams running inference on CPU-only hardware or edge devices, Gemma 3 4B delivers serious instruction following performance at minimal compute cost.
IFEval Saturation is Real
The top of the IFEval leaderboard is compressing. When the 13th-ranked model (Gemma 3 4B) scores 90.2% and the leader (Qwen3.5-27B) scores 95.0%, that's only 4.8 percentage points separating much of the practical field. IFEval's 25 constraint types have been in the training data long enough that marginal improvements here don't reveal much about real-world instruction-following ability.
IFBench is the more meaningful signal now. The 18.5-point gap between Qwen3.5-27B's IFEval score (95.0%) and its IFBench score (76.5%) tells you something IFEval alone doesn't: the model's constraint-following is strong but not fully general. A gap that size points toward benchmark overfitting rather than genuine generalization.
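Computing that gap for the models with scores in both tables makes the overfitting signal concrete (a sketch using the figures reported above):

```python
# IFEval vs IFBench prompt-level scores, taken from the tables above.
scores = {
    "Qwen3.5-27B":       (0.950, 0.765),
    "Qwen3.5-397B-A17B": (0.926, 0.765),
    "Qwen3.5-9B":        (0.915, 0.645),
    "o1":                (0.933, 0.922),  # frontier sub-scores reported above
}

# Larger gap -> more likely the model overfits IFEval's known constraint types.
gaps = {m: round((ife - ifb) * 100, 1) for m, (ife, ifb) in scores.items()}
for model, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{model:20s} gap: {gap:5.1f} pp")
```

By this measure the small Qwen3.5-9B generalizes worst (a 27-point gap), while o1's 1.1-point gap is what genuine generalization looks like.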
Reasoning Models Have an Inherent Disadvantage
One counterintuitive finding: extended-thinking and chain-of-thought models sometimes struggle more with IFEval's strict output constraints. When a model reasons at length before producing its answer, it can lose track of formatting constraints specified in the original prompt. The ReasonIF paper on reasoning-model instruction following (arXiv) documents exactly this: reasoning models sometimes violate output constraints during their thinking steps even when they follow them correctly in the final answer.
This is why Kimi K2.5 Reasoning and Grok 4.20 Multi-agent are labeled separately in the frontier rankings - their instruction-following performance is measured on final output, but their usage patterns differ from standard models.
Hermes 3 Points to a Training Approach Worth Studying
The Hermes 3 70B IFBench result (81.2%) is worth a closer look. Nous Research's training methodology for Hermes 3 stresses function calling, structured output, and precise response formatting; the model was specifically optimized to follow instructions without hedging or creative reinterpretation. Its IFBench score is 4.7 points above the second-place Qwen3.5-397B-A17B, a model with several times its total parameter count, suggesting that the training objective matters more than model size for this specific capability.
Practical Guidance
The right choice depends on your constraints and deployment setup.
For production APIs with strict formatting requirements, Claude Opus 4.6 (95.1% composite) and GPT-5.4 (93.8%) are the most reliable frontier options. Both consistently handle format, length, and content constraints without manual verification loops.
For self-hosted deployments, Qwen3.5-27B (95.0% IFEval) is the top open-weights choice. It fits on a single A100 80GB GPU in fp16 and handles the full range of IFEval constraint types. For tighter hardware budgets, Qwen3.5-9B at 91.5% and Gemma 3 4B at 90.2% both deliver strong instruction following at small model sizes.
For structured output generation specifically - JSON schemas, templated reports, constrained text - Hermes 3 70B's IFBench leadership makes it worth testing. If your workflow involves novel output constraints that weren't in the standard training set, Hermes may generalize better than larger Qwen models.
For the broader open-source picture, Llama 3.3 70B Instruct at 92.1% is a solid baseline - it's widely supported across inference frameworks, making it a practical choice even if it trails Qwen on this specific benchmark.
Check our guide on how to understand AI benchmarks if you want to calibrate how to weight IFEval against other evaluation criteria for your specific use case.
IFEval's core idea: give a model clear, verifiable constraints - format, word count, keywords - and check whether it actually follows them. Simple in concept, harder in practice.
FAQ
What is IFEval and how does it work?
IFEval tests models on 541 prompts with verifiable constraints like word count, formatting rules, and keyword requirements. Compliance is checked automatically with Python functions, so results are objective and reproducible.
Which model follows instructions best overall?
For single-call production API use, Claude Opus 4.6 leads the frontier at 95.1% on the combined leaderboard. For open-source/self-hosted use, Qwen3.5-27B leads IFEval at 95.0%.
What's the difference between IFEval and IFBench?
IFEval tests 25 known constraint types; models may overfit to these from training data. IFBench introduces 58 novel constraints to test genuine generalization. The average IFBench score is 19 points lower than average IFEval, confirming the harder bar.
Why does Hermes 3 70B outperform much larger models on IFBench?
Nous Research trained Hermes 3 specifically for precise instruction following and structured output. Its training objective aligns more directly with IFBench's novel constraint types than general-purpose fine-tuning.
How often do these rankings change?
IFEval rankings are fairly stable at the top - the Qwen3.5 family has dominated for several months. IFBench rankings are more volatile because fewer models are assessed. Expect significant movement as more frontier models submit results.
Is a high IFEval score enough to guarantee production reliability?
No. IFEval tests 25 constraint types; real workflows include countless others. Use IFEval as a baseline signal, test your specific constraints in staging, and monitor IFBench scores as a generalization proxy.
Sources:
- IFEval Leaderboard - llm-stats.com
- IFBench Leaderboard - llm-stats.com
- IFBench Benchmark Leaderboard - Artificial Analysis
- Instruction Following Benchmarks 2026 - BenchLM.ai
- Instruction-Following Evaluation for Large Language Models - arXiv:2311.07911
- Generalizing Verifiable Instruction Following - arXiv
- Building AdvancedIF: Evolving Instruction Following - Surge HQ
✓ Last verified April 10, 2026
