Instruction Following Leaderboard: IFEval Rankings 2026
Rankings of AI models on IFEval and IFBench, the two main benchmarks for measuring how reliably LLMs follow precise formatting, length, and content constraints.

Most AI benchmarks measure what a model knows. Instruction following benchmarks measure whether a model will actually do what you tell it. The difference matters more than it sounds: a model that aces GPQA Diamond but refuses to stay under 200 words, or ignores your "no bullet points" requirement, is a liability in production.
IFEval - Instruction Following Evaluation - was introduced by Google in late 2023 to tackle this directly. It doesn't ask models to solve hard problems. It asks them to follow simple, checkable rules: write at least 400 words, use the word "innovation" exactly once, don't use commas, end with a specific phrase. These are tasks a competent assistant should handle without incident. Turns out, many models still trip over them.
This leaderboard tracks performance across IFEval and IFBench, a newer generalization test from the Allen Institute for AI, updated as of April 2026.
TL;DR
- Qwen3.5-27B leads the IFEval open-source rankings at 95.0%, with six Qwen3.5 variants in the top 10
- On IFBench - a harder test of instruction generalization - Hermes 3 70B surprises at 81.2%, above larger Qwen models
- Gemma 3 4B scores 90.2% on IFEval, making it the best small model for instruction-sensitive tasks under a 10B parameter budget
What IFEval Actually Tests
IFEval evaluates models on 541 prompts containing one or more "verifiable instructions" - constraints that can be automatically checked with a short Python function. The benchmark covers 25 distinct constraint types grouped into six categories:
- Format: JSON output, markdown sections, number of paragraphs, use of bullet points or numbered lists
- Length: word count minimums/maximums, sentence count, number of sentences per paragraph
- Keywords: use a specific word at least N times, avoid a specific word entirely
- Content: start with a specific sentence, end with a specific phrase, include a postscript
- Language: respond in a specific language
- Style: don't use commas, write in all caps, avoid certain punctuation
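Constraints like these are simple enough that each check fits in a few lines. A minimal sketch of what such verifiers look like (simplified illustrations of the idea, not the official IFEval checker code):

```python
import re

# Each IFEval-style constraint is a small, deterministic predicate
# over the model's response text. Simplified examples follow.

def check_no_commas(response: str) -> bool:
    """Style constraint: the response must not contain commas."""
    return "," not in response

def check_min_words(response: str, n: int = 400) -> bool:
    """Length constraint: at least n words."""
    return len(response.split()) >= n

def check_keyword_exact(response: str, word: str, count: int = 1) -> bool:
    """Keyword constraint: use `word` exactly `count` times (case-insensitive)."""
    matches = re.findall(rf"\b{re.escape(word)}\b", response, re.IGNORECASE)
    return len(matches) == count

def check_ends_with(response: str, phrase: str) -> bool:
    """Content constraint: end with a specific phrase."""
    return response.rstrip().endswith(phrase)
```

Because every check is a pure function of the output string, grading is fully automatic and there is no judge model in the loop.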
The evaluation uses two accuracy metrics: strict (exact compliance) and loose (allowing minor formatting variations that don't change semantic compliance). Both are measured at the prompt level (did every instruction in the prompt pass?) and the instruction level (what fraction of individual instructions passed?). Prompt-level strict accuracy is the hardest bar - and the one that matters most in practice.
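The two aggregation levels can be sketched as follows, assuming each prompt's results are a list of per-instruction pass/fail booleans (hypothetical data, not real benchmark output):

```python
def prompt_level_accuracy(results: list[list[bool]]) -> float:
    """Fraction of prompts where EVERY instruction passed (strict)."""
    return sum(all(prompt) for prompt in results) / len(results)

def instruction_level_accuracy(results: list[list[bool]]) -> float:
    """Fraction of individual instructions that passed, pooled across prompts."""
    flat = [ok for prompt in results for ok in prompt]
    return sum(flat) / len(flat)

# Three prompts; the second fails one of its two instructions.
results = [[True], [True, False], [True, True]]
# Prompt-level strict: 2 of 3 prompts pass fully.
# Instruction-level: 4 of 5 individual instructions pass.
```

Instruction-level accuracy is always at least as high as prompt-level accuracy, which is why the prompt-level strict number is the stricter headline figure.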
IFEval's advantage is reproducibility. Any team can run it locally. Its limitation is that models have trained extensively on or near its constraint vocabulary, which creates overfitting risk.
IFBench: Testing Generalization
IFBench, from the Allen Institute for AI, addresses this directly. It introduces 58 novel, verifiable constraints that models are unlikely to have seen during training - unusual output structures, nonstandard formatting patterns, and precise content requirements that don't appear in IFEval's vocabulary. If a model learned to follow "write at least 400 words" by pattern matching against IFEval training data, IFBench will expose that.
The gap between a model's IFEval and IFBench scores is a useful signal. A small gap means the model truly understands constraint-following. A large gap suggests it's memorized the IFEval constraint types without real generalization.
Frontier Models: Combined Rankings
This table uses data from BenchLM.ai's composite instruction following leaderboard (IFEval weighted 65%, IFBench 35%), verified as of April 9, 2026. The weighted score reflects that IFEval still has broader evaluation coverage while IFBench provides the generalization signal.
| Rank | Model | Provider | Weighted Score |
|---|---|---|---|
| 1 | Kimi K2.5 (Reasoning) | Moonshot AI | 100.0% |
| 2 | Grok 4.20 Multi-agent | xAI | 100.0% |
| 3 | Grok 4.20 | xAI | 97.7% |
| 4 | Claude Opus 4.6 | Anthropic | 95.1% |
| 5 | GPT-5.4 | OpenAI | 93.8% |
| 6 | GPT-5.4 Pro | OpenAI | 93.6% |
| 7 | o1 | OpenAI | 93.3% |
| 8 | GLM-5.1 | Z.AI | 93.0% |
| 9 | Gemini 3.1 Pro | Google | 92.8% |
| 10 | GPT-5.2-Codex | OpenAI | 92.6% |
The top two positions - Kimi K2.5 Reasoning and Grok 4.20 Multi-agent - are specialized variants that aren't everyday API picks. Kimi K2.5 Reasoning is a chain-of-thought model that may use its scratchpad to explicitly track constraint compliance before writing its final answer. Grok 4.20 Multi-agent is a scaffolded system, not a single call. At the frontier of practical single-call models, Claude Opus 4.6 at 95.1% and GPT-5.4 at 93.8% are the realistic top performers.
The o1 entry is particularly valuable because it's one of the few frontier models with individually reported IFEval (93.3%) and IFBench (92.2%) sub-scores - a gap of only 1.1 percentage points, suggesting genuine generalization rather than benchmark overfitting.
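The 65/35 weighting is straightforward to reproduce. A minimal sketch using o1's individually reported sub-scores (the blended value is illustrative - BenchLM.ai's exact rounding and normalization may differ from this simple linear blend):

```python
def composite_score(ifeval: float, ifbench: float,
                    w_ifeval: float = 0.65, w_ifbench: float = 0.35) -> float:
    """Linear blend in percentage points: IFEval weighted 65%, IFBench 35%."""
    return w_ifeval * ifeval + w_ifbench * ifbench

# o1's individually reported sub-scores:
gap = 93.3 - 92.2                      # 1.1 percentage points
blended = composite_score(93.3, 92.2)  # ~92.9 under this simple blend
```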
Open-Source IFEval Rankings
These scores are from llm-stats.com's IFEval leaderboard as of April 8, 2026. Sixty-two models are evaluated on the full IFEval prompt set (prompt-level strict accuracy). All results are self-reported by model developers or evaluators using the standard IFEval harness.
| Rank | Model | Provider | IFEval Score | Params |
|---|---|---|---|---|
| 1 | Qwen3.5-27B | Alibaba | 0.950 | 27B |
| 2 | Qwen3.6 Plus | Alibaba | 0.943 | ~100B |
| 3 | o3-mini | OpenAI | 0.939 | - |
| 4 | Qwen3.5-122B-A10B | Alibaba | 0.934 | 122B MoE |
| 5 | Claude 3.7 Sonnet | Anthropic | 0.932 | - |
| 6 | Qwen3.5-397B-A17B | Alibaba | 0.926 | 397B MoE |
| 7 | Nova Pro | Amazon | 0.921 | - |
| 7 | Llama 3.3 70B Instruct | Meta | 0.921 | 70B |
| 9 | Qwen3.5-35B-A3B | Alibaba | 0.919 | 35B MoE |
| 10 | Qwen3.5-9B | Alibaba | 0.915 | 9B |
| 11 | Gemma 3 27B | Google | 0.904 | 27B |
| 12 | Nemotron Nano 9B v2 | NVIDIA | 0.903 | 9B |
| 13 | Gemma 3 4B | Google | 0.902 | 4B |
The average across all 62 evaluated models is 0.844. Anything above 0.90 puts a model in the top tier for practical constraint-following.
Qwen3.5 variants dominate the IFEval leaderboard, occupying six of the top ten positions across different model sizes.
IFBench: Generalization Rankings
IFBench scores tell a different story. Only 16 models have been evaluated so far - the benchmark is newer and requires more setup than IFEval - but the rankings already hold some surprises.
| Rank | Model | Provider | IFBench Score | Params |
|---|---|---|---|---|
| 1 | Hermes 3 70B | Nous Research | 0.812 | 70B |
| 2 | Qwen3.5-397B-A17B | Alibaba | 0.765 | 397B MoE |
| 2 | Qwen3.5-27B | Alibaba | 0.765 | 27B |
| 4 | Qwen3.5-122B-A10B | Alibaba | 0.761 | 122B MoE |
| 5 | Qwen3.6 Plus | Alibaba | 0.742 | ~100B |
| 6 | Nemotron 3 Super 120B A12B | NVIDIA | 0.726 | 120B MoE |
| 7 | Mercury 2 | Inception | 0.710 | - |
| 8 | Qwen3.5-35B-A3B | Alibaba | 0.702 | 35B MoE |
| 9 | MiniMax M2.1 | MiniMax | 0.700 | - |
| 10 | GPT OSS 120B High | OpenAI | 0.695 | 120B |
| 11 | K-EXAONE-236B-A23B | LG AI Research | 0.673 | 236B MoE |
| 12 | Qwen3.5-9B | Alibaba | 0.645 | 9B |
| 13 | Qwen3.5-4B | Alibaba | 0.592 | 4B |
| 14 | Mistral Small 4 | Mistral | 0.480 | - |
Hermes 3 70B at 81.2% is a striking result. It's a fine-tuned instruction model from Nous Research built specifically for precise, structured output generation. Its strong IFBench performance relative to much larger Qwen models suggests Nous's training approach instills something closer to genuine constraint understanding than the Qwen post-training process does. The average IFBench score across assessed models is 0.649 - dramatically lower than IFEval's 0.844 average - which confirms that instruction generalization is truly harder than it looks on standard benchmarks.
Key Takeaways
Qwen Has Owned IFEval
Six of the top ten open-source IFEval rankings belong to Qwen3.5 variants. Qwen3.5-27B at 95.0% leads the field, followed closely by Qwen3.6 Plus at 94.3%. Even the 9B parameter Qwen3.5-9B scores 91.5% - above most 70B class models from other providers.
This isn't a coincidence. Alibaba's post-training pipeline for the Qwen3.5 family appears to have placed significant emphasis on constraint-following RLHF, and it shows. If your use case requires reliable formatting compliance - structured outputs, templated reports, constrained customer-facing text - Qwen3.5-27B is the open-source default.
Gemma 3 4B is the Efficiency Winner
Gemma 3 27B scores 90.4%, good for rank 11 overall. But the more interesting result is Gemma 3 4B at 90.2%. A 4B parameter model approaching 90% on IFEval places it ahead of models four to ten times its size. For teams running inference on CPU-only hardware or edge devices, Gemma 3 4B delivers serious instruction following performance at minimal compute cost.
IFEval Saturation is Real
The top of the IFEval leaderboard is compressing. When the 13th-ranked model (Gemma 3 4B) scores 90.2% and the leader (Qwen3.5-27B) scores 95.0%, that's only 4.8 percentage points separating much of the practical field. IFEval's 25 constraint types have been in the training data long enough that marginal improvements here don't reveal much about real-world instruction-following ability.
IFBench is the more meaningful signal now. The 18.5-point gap between Qwen3.5-27B's IFEval score (95.0%) and its IFBench score (76.5%) tells you something IFEval alone doesn't: the model's constraint-following is strong but not fully general. A gap that size points toward benchmark overfitting rather than genuine generalization.
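Computing that gap for the models with scores in both tables makes the overfitting signal concrete (a sketch using the figures reported above):

```python
# IFEval vs IFBench prompt-level scores, taken from the tables above.
scores = {
    "Qwen3.5-27B":       (0.950, 0.765),
    "Qwen3.5-397B-A17B": (0.926, 0.765),
    "Qwen3.5-9B":        (0.915, 0.645),
    "o1":                (0.933, 0.922),  # frontier sub-scores reported above
}

# Larger gap -> more likely the model overfits IFEval's known constraint types.
gaps = {m: round((ife - ifb) * 100, 1) for m, (ife, ifb) in scores.items()}
for model, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{model:20s} gap: {gap:5.1f} pp")
```

By this measure the small Qwen3.5-9B generalizes worst (a 27-point gap), while o1's 1.1-point gap is what genuine generalization looks like.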
Reasoning Models Have an Inherent Disadvantage
One counterintuitive finding: extended-thinking and chain-of-thought models sometimes struggle more with IFEval's strict output constraints. When a model reasons at length before producing its answer, it can lose track of formatting constraints specified in the original prompt. The ReasonIF paper on reasoning-model instruction following (arXiv) documents exactly this: reasoning models sometimes violate output constraints during their thinking steps even when they follow them correctly in the final answer.
This is why Kimi K2.5 Reasoning and Grok 4.20 Multi-agent are labeled separately in the frontier rankings - their instruction-following performance is measured on final output, but their usage patterns differ from standard models.
Hermes 3 Points to a Training Approach Worth Studying
The Hermes 3 70B IFBench result (81.2%) is worth a closer look. Nous Research's training methodology for Hermes 3 stresses function calling, structured output, and precise response formatting; the model was specifically optimized to follow instructions without hedging or creative reinterpretation. Its IFBench score is 4.7 points above the second-place Qwen3.5-397B-A17B, a model with several times its total parameter count, suggesting that the training objective matters more than model size for this specific capability.
Practical Guidance
The right choice depends on your constraints and deployment setup.
For production APIs with strict formatting requirements, Claude Opus 4.6 (95.1% composite) and GPT-5.4 (93.8%) are the most reliable frontier options. Both consistently handle format, length, and content constraints without manual verification loops.
For self-hosted deployments, Qwen3.5-27B (95.0% IFEval) is the top open-weights choice. It fits on a single A100 80GB GPU in fp16 and handles the full range of IFEval constraint types. For tighter hardware budgets, Qwen3.5-9B at 91.5% and Gemma 3 4B at 90.2% both deliver strong instruction following at small model sizes.
For structured output generation specifically - JSON schemas, templated reports, constrained text - Hermes 3 70B's IFBench leadership makes it worth testing. If your workflow involves novel output constraints that weren't in the standard training set, Hermes may generalize better than larger Qwen models.
For the broader open-source picture, Llama 3.3 70B Instruct at 92.1% is a solid baseline - it's widely supported across inference frameworks, making it a practical choice even if it trails Qwen on this specific benchmark.
Check our guide on how to understand AI benchmarks if you want to calibrate how to weight IFEval against other evaluation criteria for your specific use case.
IFEval's core idea: give a model clear, verifiable constraints - format, word count, keywords - and check whether it actually follows them. Simple in concept, harder in practice.
FAQ
What is IFEval and how does it work?
IFEval tests models on 541 prompts with verifiable constraints like word count, formatting rules, and keyword requirements. Compliance is checked automatically with Python functions, so results are objective and reproducible.
Which model follows instructions best overall?
For single-call production API use, Claude Opus 4.6 leads the frontier at 95.1% on the combined leaderboard. For open-source/self-hosted use, Qwen3.5-27B leads IFEval at 95.0%.
What's the difference between IFEval and IFBench?
IFEval tests 25 known constraint types; models may overfit to these from training data. IFBench introduces 58 novel constraints to test genuine generalization. The average IFBench score is 19 points lower than average IFEval, confirming the harder bar.
Why does Hermes 3 70B outperform much larger models on IFBench?
Nous Research trained Hermes 3 specifically for precise instruction following and structured output. Its training objective aligns more directly with IFBench's novel constraint types than general-purpose fine-tuning.
How often do these rankings change?
IFEval rankings are fairly stable at the top - the Qwen3.5 family has dominated for several months. IFBench rankings are more volatile because fewer models are assessed. Expect significant movement as more frontier models submit results.
Is a high IFEval score enough to guarantee production reliability?
No. IFEval tests 25 constraint types; real workflows include countless others. Use IFEval as a baseline signal, test your specific constraints in staging, and monitor IFBench scores as a generalization proxy.
Sources:
- IFEval Leaderboard - llm-stats.com
- IFBench Leaderboard - llm-stats.com
- IFBench Benchmark Leaderboard - Artificial Analysis
- Instruction Following Benchmarks 2026 - BenchLM.ai
- Instruction-Following Evaluation for Large Language Models - arXiv:2311.07911
- Generalizing Verifiable Instruction Following - arXiv
- Building AdvancedIF: Evolving Instruction Following - Surge HQ
✓ Last verified April 10, 2026
