Creative Writing LLM Leaderboard 2026
Rankings of AI models on creative writing quality benchmarks: EQ-Bench Creative Writing v3, Antislop evaluations, and human-preference judging. Which LLMs can actually write?

Measuring creative writing is the hardest thing you can ask a benchmark to do. Most evaluations give you a clean signal: the answer is right or wrong, the code compiles or it doesn't, the translation is accurate or it drifts. Creative writing doesn't work that way. A sentence can be technically correct and still be dead on the page. A metaphor can be surprising and still feel wrong in context. Voice, pacing, tension, specificity - these resist reduction to numbers.
And yet, the community has developed several evaluation frameworks that extract meaningful signal from the noise. EQ-Bench Creative Writing v3 uses multi-rubric LLM judges calibrated to literary criteria. The Antislop evaluation system measures cliche density and overused vocabulary patterns. Human-preference rankings via Chatbot Arena provide a ground-truth signal uncorrupted by the LLM-as-judge problem. None of these is perfect. Together, they're informative.
This leaderboard tracks 15 models across all three evaluation types as of April 2026.
TL;DR
- Claude 4 Opus leads EQ-Bench Creative Writing v3 at 73.8, the only model above 72 - its prose scores highest on voice consistency and emotional nuance
- GPT-5 and Gemini 2.5 Pro trade positions depending on the rubric - GPT-5 leads on pacing, Gemini on world-building
- Fine-tuned open-source writing models (Mistral Nemo Gutenberg, Llama 3.1 Storm) outperform their base models on Antislop by a large margin, but fall apart on coherence over long outputs
- Reasoning models (o1, o3) rank below their general capability tier - structured thinking loops produce over-schematized prose
Why Creative Writing Benchmarks Are Uniquely Hard
The fundamental problem is that any automated judge is itself an LLM, which means it has stylistic preferences baked in from training data. If the judge model was trained heavily on internet fiction, it will rate certain phrasings higher regardless of actual literary merit. If the judge and the contestant share training data, style similarity may inflate scores in ways that don't reflect quality.
The secondary problem is that "quality" in creative writing is not a single axis. World-building and plot structure are craft-learnable skills that good LLMs handle reasonably well. Voice - the specific rhythm and texture that makes a prose style recognizable and alive - is much harder, and most models flatten out toward a generic competent register that reads as accomplished but forgettable.
The tertiary problem is "slop" - the vocabulary of overused AI writing patterns. "Ethereal glow." "Palpable tension." "A symphony of." Models trained on large quantities of AI-generated text have absorbed these patterns at high frequency, and they surface automatically under generation pressure. The Antislop evaluation specifically targets this.
A practical benchmark approach uses multiple evaluation axes from different methodological families, then looks for models that place consistently across all of them rather than spiking on one.
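To make that concrete, here is a toy consistency check in Python. Rank positions are read off the tables below (1 = best on each axis); the spread statistic is an illustrative measure invented here, not part of any of the benchmarks themselves:

```python
from statistics import mean, pstdev

# Rank per evaluation axis (EQ-Bench, Antislop, Arena), read off the
# tables below; None marks a missing arena result.
ranks = {
    "Claude 4 Opus":          (1, 4, 1),
    "GPT-5":                  (2, 9, 2),
    "Claude 4 Sonnet":        (4, 3, 4),
    "Mistral Nemo Gutenberg": (14, 1, None),  # spikes on one axis only
}

for model, axis_ranks in ranks.items():
    observed = [r for r in axis_ranks if r is not None]
    # Low mean rank = strong overall; low spread = consistent across axes.
    print(f"{model:24s} mean rank {mean(observed):5.1f}  spread {pstdev(observed):5.2f}")
```

A model like Claude 4 Opus has both a low mean rank and a low spread; a fine-tune like Mistral Nemo Gutenberg has a huge spread, which is exactly the single-axis spike this approach filters out.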
The Benchmarks Explained
EQ-Bench Creative Writing v3
EQ-Bench Creative Writing is a framework developed by Sam Paech that evaluates creative writing output on four primary rubrics:
- World-building: Specificity and internal consistency of the setting. Does the world feel real and thought-through, or generic and placeholder?
- Voice: Distinctiveness and consistency of narrative voice and character perspective. Does the prose sound like someone specific is writing it?
- Pacing: Scene-level control of tension, rhythm, and information release. Does the writing breathe and accelerate in the right places?
- Emotional Nuance: Complexity of psychological characterization. Do characters feel emotionally plausible, or do they perform emotions described rather than shown?
Each rubric is scored 0-10 by a panel of LLM judges using structured scoring chains, then aggregated to a composite score. The v3 update introduced stricter debiasing protocols to reduce judge-model style correlation, a more diverse prompt set spanning genre fiction, literary fiction, and experimental prose, and a length-controlled evaluation to prevent models that produce more tokens from scoring higher on sheer coverage.
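A minimal sketch of one plausible panel-then-rubric aggregation, for intuition only; the actual v3 scoring chains and aggregation are defined in the published harness:

```python
from statistics import mean

RUBRICS = ("world_building", "voice", "pacing", "emotional_nuance")

def composite_score(judge_scores: dict[str, list[float]]) -> float:
    """Aggregate 0-10 judge-panel scores per rubric into a 0-100 composite.

    Averaging across the judge panel first, then across rubrics, is one
    plausible scheme -- the real EQ-Bench v3 formula lives in its harness.
    """
    per_rubric = {r: mean(judge_scores[r]) for r in RUBRICS}
    return mean(per_rubric.values()) * 10  # rescale 0-10 to 0-100

# Three hypothetical judges scoring one writing sample:
sample = {
    "world_building":   [7.0, 6.5, 7.5],
    "voice":            [6.0, 5.5, 6.0],
    "pacing":           [7.5, 7.0, 7.0],
    "emotional_nuance": [6.5, 6.0, 6.5],
}
print(f"composite: {composite_score(sample):.1f}")  # ~65.8
```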
EQ-Bench's advantage is methodological transparency - the rubrics and scoring chains are published on GitHub and can be reproduced. The limitation is that the judges are still LLMs, and LLM judges have documented preferences for confident, well-structured prose that may not align with the full range of literary taste.
Antislop Writing Evaluations
The Antislop project originally developed a token-biasing sampler to suppress overused AI vocabulary at inference time. The evaluation component measures something distinct: given raw model output without any vocabulary filtering, how densely does the text rely on a curated list of AI-slop phrases?
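As background on the sampler side, here is a minimal sketch of inference-time phrase banning with Hugging Face transformers. It shows the general token-suppression idea, not the Antislop project's actual implementation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Ban the exact token sequences of a few slop phrases at generation time.
# Leading spaces match GPT-2's BPE word boundaries; note this blocks only
# these exact tokenizations -- capitalization and whitespace variants of
# the same phrase can still slip through.
slop = [" ethereal glow", " palpable tension", " a symphony of"]
bad_words_ids = [tok(p, add_special_tokens=False).input_ids for p in slop]

inputs = tok("The abandoned lighthouse", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60, do_sample=True,
                     bad_words_ids=bad_words_ids,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```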
The Antislop evaluation scores models on slop-phrase density (lower is better), vocabulary diversity (measured by type-token ratio in open-ended generation), and a human-audited cliche register covering roughly 800 phrases across categories including purple prose markers, AI emotional vocabulary, and genre-fiction boilerplate.
The resulting Antislop Score runs from 0 to 100, where 100 is completely slop-free output. Typical frontier model outputs without sampling intervention score in the 40-65 range. Fine-tuned writing models with explicit cliche suppression in post-training score 70-85. A 10-point gap on this scale is practically meaningful - it corresponds to roughly one slop phrase per 200 tokens of output.
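To make the scale concrete, here is a toy density scorer in the shape of that metric. The phrase list is a tiny illustrative sample, and the linear slope simply reuses the rule of thumb above - the project's real formula and its ~800-phrase register live in its repository:

```python
import re

# Tiny illustrative excerpt of a slop-phrase register.
SLOP_PHRASES = ("palpable tension", "ethereal glow", "a symphony of")

def antislop_style_score(text: str) -> float:
    """Toy 0-100 score: 100 = slop-free, docked ~10 points per slop
    phrase per 200 tokens (the article's rule of thumb, used here as an
    invented linear slope -- not the project's actual formula)."""
    tokens = re.findall(r"\w+", text.lower())
    hits = sum(text.lower().count(p) for p in SLOP_PHRASES)
    per_200 = 200 * hits / max(len(tokens), 1)  # slop phrases per 200 tokens
    return max(0.0, 100.0 - 10.0 * per_200)

def type_token_ratio(text: str) -> float:
    """Vocabulary diversity: unique word types over total tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / max(len(tokens), 1)
```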
Human-Preference Win Rate vs GPT-5
The Chatbot Arena leaderboard at arena.ai runs blind human comparisons where evaluators choose between two model outputs for the same creative prompt. The Creative Writing category tracks win rates against GPT-5 as a reference model.
Win rate above 50% means the model produced outputs humans preferred over GPT-5 outputs more often than not. This is the most direct signal in this leaderboard because it bypasses LLM judge bias entirely. The limitation is small sample size for newer or less popular models - win rates for models with fewer than 200 creative comparisons carry wide confidence intervals.
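The sample-size caveat is easy to quantify. A standard Wilson score interval shows why a win rate estimated from under 200 comparisons is only a directional signal:

```python
from math import sqrt

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a win rate from n comparisons."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)

# The same observed 54% win rate, at very different sample sizes:
print(wilson_interval(108, 200))    # ~(0.47, 0.61) -- spans 50%, inconclusive
print(wilson_interval(1080, 2000))  # ~(0.52, 0.56) -- clearly above 50%
```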
Main Rankings Table
Data sources: EQ-Bench Creative Writing v3 leaderboard (eqbench.com, verified April 19, 2026), Antislop evaluation results (GitHub repository, April 2026 run), Chatbot Arena creative writing win rates (arena.ai, April 18, 2026). "Not reported" indicates no published evaluation result exists for this model on that benchmark as of publication date.
| Rank | Model | Provider | EQ-Bench Creative v3 | Antislop Score | Win Rate vs GPT-5 | Notes |
|---|---|---|---|---|---|---|
| 1 | Claude 4 Opus | Anthropic | 73.8 | 61 | 54.2% | Top voice and emotional nuance scores |
| 2 | GPT-5 | OpenAI | 71.4 | 55 | 50.0% | Reference model; leads on pacing |
| 3 | Gemini 2.5 Pro | Google | 70.9 | 58 | 48.7% | Highest world-building sub-score |
| 4 | Claude 4 Sonnet | Anthropic | 68.3 | 63 | 46.1% | Best Antislop Score in top 5 |
| 5 | DeepSeek V3.2 | DeepSeek | 66.1 | 59 | 44.8% | Surprisingly strong literary fiction |
| 6 | Grok 4 | xAI | 65.7 | 52 | 43.2% | High pacing, weaker voice consistency |
| 7 | Kimi K2.5 | Moonshot AI | 63.4 | 57 | Not reported | Below the arena creative subset sample threshold |
| 8 | Qwen 3.5 | Alibaba | 61.8 | 54 | 40.3% | Strong genre fiction, weaker literary |
| 9 | Llama 4 Maverick | Meta | 59.2 | 51 | 38.6% | Best open-weight base model |
| 10 | o3 | OpenAI | 57.6 | 48 | 37.1% | Over-structured prose, low voice score |
| 11 | o1 | OpenAI | 55.3 | 46 | 35.4% | Reasoning loop artifacts visible in output |
| 12 | Mistral Large 3 | Mistral | 54.1 | 56 | Not reported | Consistent mid-tier across all rubrics |
| 13 | Phi-4 | Microsoft | 48.7 | 49 | Not reported | Good for size; trails on nuance |
| 14 | Mistral Nemo Gutenberg* | Community fine-tune | 44.2 | 81 | Not reported | Exceptional slop suppression, weak coherence |
| 15 | Llama 3.1 Storm* | Community fine-tune | 41.6 | 78 | Not reported | High Antislop, low composite score |
*Community fine-tunes appear in the official Antislop run only; their EQ-Bench scores come from community-run evaluations using the standard v3 harness.
Key Takeaways
Closed Models Hold the Top Tier - For Now
The top four positions are all closed, commercially hosted models. This isn't surprising given the compute and post-training investment required to develop strong creative writing capabilities, but the gap is narrowing. DeepSeek V3.2 at rank 5 with a 66.1 EQ-Bench score is within 8 points of the top-ranked Claude 4 Opus - a gap that would have been 15+ points a year ago. The trajectory of strong open-weight models suggests the top-5 could be meaningfully contested within one generation.
Reasoning Models Are Over-Structured
The o1 and o3 results confirm what anecdotal observation suggested: models that reason explicitly before producing output tend to over-organize their prose. The extended thinking process maps out story beats, character motivations, and thematic elements before writing begins - and then the writing visibly executes that plan rather than discovering through the act of writing. The output is competent but mechanical. o3's voice sub-score comes in nearly 12 points below its composite EQ-Bench score. If you need creative writing from a reasoning-class model, consider disabling or limiting extended thinking, as in the sketch below.
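A minimal sketch of that adjustment, assuming the OpenAI Python SDK and its `reasoning_effort` parameter on reasoning-class models; the "o3" identifier and parameter support vary by model and SDK version, so check the current API reference:

```python
from openai import OpenAI

client = OpenAI()

# Lower the reasoning effort for creative prompts so the model spends less
# budget pre-planning the piece before writing it.
response = client.chat.completions.create(
    model="o3",              # assumed model identifier
    reasoning_effort="low",  # "low" | "medium" | "high" on reasoning models
    messages=[{
        "role": "user",
        "content": "Write a 300-word scene in a distinct first-person voice.",
    }],
)
print(response.choices[0].message.content)
```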
Fine-Tuned Writing Models Have a Tradeoff
Community fine-tuned writing models like Mistral Nemo Gutenberg dominate the Antislop category by a large margin (81 vs 55 for GPT-5) because their post-training explicitly suppresses the cliche vocabulary list. But EQ-Bench composite scores tell a different story: coherence and world-building degrade as the models struggle to maintain narrative consistency across longer outputs. The vocabulary suppression works, but it comes at the cost of the structured generation ability needed to hold a story together. These models are excellent for short-form output - a scene, a paragraph, a character sketch - but fall apart on anything requiring sustained structure.
Antislop Reveals Training Data Contamination
The correlation between a model's Antislop score and its training data composition is real. Models trained on large quantities of web-scraped fiction absorb AI-generated fiction patterns that have proliferated across the web, especially in fanfic and genre fiction communities. DeepSeek V3.2 and Claude 4 Sonnet both score notably higher on Antislop than their ranking peers, which likely reflects deliberate curation choices in their training data - fewer AI-generated fiction samples in the pre-training mix.
Open vs Closed: The Practical Gap
For teams deploying AI writing assistance commercially, the practical question is whether a 12-point EQ-Bench gap between Claude 4 Opus and Llama 4 Maverick justifies the API cost differential. On short creative tasks - product descriptions, marketing copy, short social content - the gap may not be perceptible to readers. On longer-form literary work where voice consistency and emotional nuance matter, it is. The honest answer is: run your own test on the output format you actually need before committing to one tier.
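A minimal harness for that kind of test - blind, position-shuffled pairs of outputs from two candidate tiers, with a key for unblinding afterwards. Everything here is illustrative scaffolding:

```python
import random

def make_blind_pairs(outputs: dict[str, list[str]], seed: int = 0):
    """Pair up outputs from exactly two models, shuffling which appears
    first so raters can't anchor on position. Returns the rating sheet
    and a key recording which model was shown first in each pair."""
    (name_a, outs_a), (name_b, outs_b) = outputs.items()
    rng = random.Random(seed)
    sheet, key = [], []
    for a, b in zip(outs_a, outs_b):
        items = [(name_a, a), (name_b, b)]
        rng.shuffle(items)
        sheet.append((items[0][1], items[1][1]))  # what raters see
        key.append(items[0][0])                   # which model came first
    return sheet, key
```

Tally the winners against the key and feed the counts into a Wilson interval like the one sketched earlier to check whether your preference is real or noise.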
Caveats and Known Limitations
LLM-as-judge style bias. EQ-Bench's debiasing protocols reduce but do not eliminate the tendency of judge models to prefer prose that resembles their own training distribution. Models that share architectural lineage with the judges may receive inflated scores. The v3 debiasing methodology uses judge ensemble diversity to mitigate this - check the benchmark documentation for details on judge model selection.
Subjective taste and genre variance. A model that excels at psychological literary fiction may produce weak thriller prose, and vice versa. EQ-Bench v3 includes a more diverse prompt set than earlier versions, but composite scores still average across genres that require different craft priorities. A score difference of 2-3 points on the composite is not practically meaningful for most use cases.
Style memorization from training data. Models may produce high-scoring outputs that are stylistically close to specific authors heavily represented in training data. The voice rubric tries to penalize this, but it's imperfect. "Strong voice" and "memorized voice" can look similar to a judge that hasn't read the source author.
The slop vocabulary problem. The Antislop phrase list is maintained by a community of contributors and is inevitably incomplete and culturally biased. Phrases that read as cliche in English literary circles may be neutral in genre fiction communities. A model that scores well on Antislop may still produce output that feels generic in contexts the phrase list doesn't cover.
Human-preference confidence intervals. Win rates for models with fewer than 200 creative writing comparisons in the Arena (marked "Not reported" where the sample is insufficient) carry confidence intervals wide enough to reverse the apparent ranking. Use these numbers as directional signals, not precise measurements.
Benchmark Methodology Notes
EQ-Bench Creative Writing v3 scores reported here are from the official leaderboard at eqbench.com as of April 19, 2026. Antislop scores are from the community evaluation run in April 2026 using the standard v3 prompt set (50 diverse creative prompts, 500-word target output). Human-preference win rates are from the Chatbot Arena creative writing category as of April 18, 2026, including only models with 200 or more evaluated comparisons.
Models marked "Not reported" had no published evaluation result meeting these criteria at time of publication. Fine-tuned community model scores are from contributor-reported runs using the standard harness and have not been independently verified by this site.
Related Leaderboards
- Instruction Following Leaderboard - how reliably models follow precise output constraints, including format and length requirements relevant to writing workflows
- Multilingual LLM Leaderboard - model performance across 16 languages, critical if creative writing tasks span non-English prose
- Best AI Writing Tools 2026 - practical guide to writing assistants built on these models, covering UI, pricing, and workflow integration
FAQ
What is EQ-Bench Creative Writing and how does it work?
EQ-Bench Creative Writing is an open evaluation framework that scores LLM prose output on four rubrics - world-building, voice, pacing, and emotional nuance - using a panel of LLM judges with structured scoring chains. The v3 version includes debiasing protocols and a diverse 50-prompt test set. All methodology is published and reproducible.
Which model writes the best creative prose in 2026?
Based on combined signals from EQ-Bench v3 and human-preference rankings, Claude 4 Opus is currently the strongest creative writing model, particularly on voice consistency and emotional nuance. GPT-5 leads on pacing-focused tasks. For open-weight models, Llama 4 Maverick is the best base model option.
Why do reasoning models like o1 and o3 rank lower for creative writing?
Reasoning models use extended thinking chains that plan and structure output before writing it. This works well for analytical tasks but produces prose that executes a predetermined plan rather than developing organically. The result reads as competent but schematized - good structure, weak voice. Voice scores for o1 and o3 are both among the lowest in the top-15.
What is the Antislop Score measuring?
Antislop measures how densely a model's output relies on a curated list of roughly 800 overused AI writing phrases - "palpable tension," "ethereal glow," and similar patterns that have become markers of AI-generated text. Higher scores mean fewer slop phrases per token of output. The score runs 0-100; typical frontier models without sampling intervention score in the 40-65 range.
Are community fine-tuned writing models worth using?
For short creative tasks - a scene, a paragraph, a character voice sample - yes. Fine-tuned writing models like Mistral Nemo Gutenberg dramatically reduce slop phrase density and produce notably cleaner prose on brief outputs. For long-form work requiring coherent narrative over thousands of words, their structural consistency degrades. Use base frontier models for sustained narrative work.
How often does this leaderboard update?
EQ-Bench Creative v3 scores are updated when the official leaderboard publishes new results - typically monthly. Antislop scores are updated with each community evaluation run. Human-preference win rates from Chatbot Arena update continuously. This article reflects a snapshot as of April 2026 and will be updated when significant new results are published.
Last verified: April 19, 2026
