Mercury 2 Is 13x Faster Than Claude Haiku - Verified

Inception Labs' Mercury 2 hits 1,196 tokens per second in independent testing - a diffusion architecture that rewires how inference works.

Inception Labs launched Mercury 2 on February 24, claiming it's the world's fastest reasoning language model. Independent testing by Artificial Analysis clocked it at 1,196 tokens per second - more than 13 times the throughput of Claude 4.5 Haiku and nearly 17 times faster than GPT-5 Mini. The architecture behind that number isn't a prompt trick or quantization hack. It's a fundamentally different way of building a language model.

TL;DR

  • Mercury 2 hit 1,196 tokens per second in independent Artificial Analysis testing - 13x Claude Haiku, 17x GPT-5 Mini
  • Built on diffusion architecture: generates a full draft and refines it in parallel instead of token-by-token
  • Matches Haiku-class quality on reasoning benchmarks (AIME 2025: 91.1, GPQA: 73.6) at $0.75/M output tokens
  • Text-only, cloud-only for now - no multimodal, no self-hosting on standard CUDA
  • Optimized for NVIDIA Blackwell GPUs; the speed numbers assume datacenter hardware

The Claims Table

Model             Throughput  Output Price  AIME 2025  Intelligence Rank
Mercury 2         1,196 t/s   $0.75/M       91.1       #18/134
Claude 4.5 Haiku  ~89 t/s     $4.90/M       ~82        ~#35
GPT-5 Mini        ~71 t/s     ~$1.90/M      ~74        ~#40
Nemotron 3 Nano   ~120 t/s    ~$0.40/M      ~68        ~#50

The throughput figures for Mercury 2 come from Artificial Analysis, a third-party evaluation firm that measures model performance across providers. The other figures are approximate, drawn from their public leaderboard data. Mercury 2 sits at the intersection of fast and cheap - an unusual position for a model that also scores in the top 15% on quality.

"Reasoning models are only as useful as their ability to run in production. For the past few years, we've seen incredible progress in model capability, but much less progress in making that capability usable in low-latency use cases. With Mercury 2, we've built a system where high-quality reasoning runs fast enough and efficiently enough for real-time applications."

  • Stefano Ermon, CEO and co-founder of Inception Labs

Photo: Stefano Ermon, CEO of Inception Labs, speaking about Mercury 2's diffusion-based architecture.

How It Actually Works

Standard language models - GPT, Claude, Gemini, LLaMA, everything you use today - are autoregressive. They predict one token, append it to the sequence, predict the next, and repeat. This serial chain means every token waits for the previous one to finish. It's inherently sequential.
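That serial dependency is easiest to see in code. The sketch below is a toy illustration of the autoregressive loop, with `next_token` standing in for a full forward pass through the model - every new token requires another complete pass, and none can start until the previous one finishes.

```python
# Toy sketch of autoregressive decoding. `next_token` is a hypothetical
# stand-in for a model forward pass; the point is the serial loop.
def next_token(tokens):
    # Deterministic toy "prediction" derived from the whole prefix.
    return (sum(tokens) * 31 + 7) % 100

def generate_autoregressive(prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):            # one full forward pass per token
        tokens.append(next_token(tokens))  # each token waits on the last
    return tokens

print(generate_autoregressive([1, 2, 3], 5))
```

Generating N tokens costs N sequential passes, which is why autoregressive throughput is bounded by per-pass latency no matter how much hardware you add.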

Mercury 2 works differently. It starts with a noisy, incomplete draft of the full output and runs a denoising process that refines multiple tokens simultaneously across a small number of passes. The company calls this a diffusion large language model (dLLM). The approach borrows directly from image generation techniques - Stable Diffusion works the same way on pixels.
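A minimal sketch of that parallel-refinement idea, under heavy simplification: start from a fully masked draft and fill in several positions per pass, so the pass count scales with output length divided by tokens-per-pass rather than with output length. This is an illustrative toy, not Inception's actual denoising algorithm.

```python
# Toy illustration of masked-diffusion-style generation: refine many
# positions per pass instead of emitting one token per pass.
import random

MASK = "_"

def denoise_step(draft, k, rng):
    # "Refine" up to k masked positions at once; a real dLLM predicts
    # all positions jointly from the current noisy draft.
    masked = [i for i, t in enumerate(draft) if t == MASK]
    for i in rng.sample(masked, min(k, len(masked))):
        draft[i] = str(i)  # toy prediction for that position
    return draft

def generate_diffusion(length, tokens_per_pass, seed=0):
    rng = random.Random(seed)
    draft = [MASK] * length
    passes = 0
    while MASK in draft:
        draft = denoise_step(draft, tokens_per_pass, rng)
        passes += 1
    return draft, passes

out, passes = generate_diffusion(16, tokens_per_pass=4)
print(passes)  # 4 passes, versus 16 serial steps autoregressively
```

Because each pass touches many tokens, the model trades a long chain of cheap steps for a short chain of wide, parallelizable ones - which is exactly the shape of work GPUs are good at.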

Ermon, who co-invented the diffusion methods used in Stable Diffusion and DALL-E, has spent the past two years applying that architecture to text. Mercury 2 is the production version of that effort.

What They Measured

Inception's numbers aren't output tokens per second on a toy prompt. Artificial Analysis runs a standardized multi-turn benchmark suite across real API endpoints, measuring total throughput including network round trips. Their 1,196 t/s figure is reproducible by any developer with an API key. The cost figures also check out - Mercury 2 is priced at $0.25 per million input tokens and $0.75 per million output tokens, which is 6.5x cheaper than Claude Haiku 4.5 on output and roughly 2.5x cheaper than GPT-5 Mini.
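The pricing ratios quoted above are simple arithmetic on the published per-million-token output prices, and they check out:

```python
# Output-token prices in $/M, as quoted in this article.
prices = {"mercury-2": 0.75, "claude-4.5-haiku": 4.90, "gpt-5-mini": 1.90}

for model, price in prices.items():
    if model != "mercury-2":
        ratio = price / prices["mercury-2"]
        print(f"{model}: {ratio:.1f}x Mercury 2's output price")
```

That yields roughly 6.5x for Claude Haiku 4.5 and 2.5x for GPT-5 Mini, matching the figures in the text.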

Quality benchmarks tell a consistent story: Mercury 2 scored 91.1 on AIME 2025 (competitive math reasoning), 73.6 on GPQA (graduate-level science questions), and 71.3 on IFBench (instruction following). These put it solidly in the Haiku/Mini tier - not a frontier reasoner, but not a toy. If you are evaluating model quality for production use, it lands in the same performance bracket as models that cost far more per token.

Photo: Server infrastructure at scale - Mercury 2 runs on NVIDIA Blackwell GPUs in Inception's cloud. Mercury 2 achieves its throughput gains on Blackwell hardware; the 1,009 t/s baseline figure assumes NVFP4 precision, a format that requires Blackwell-class accelerators.

What They Didn't Measure

The 1,196 t/s figure is real, but it comes with context that the press release buries.

Hardware dependency. Mercury 2 is optimized for NVIDIA Blackwell GPUs - the generation after Hopper's H100. The company reports 1,009 t/s as their baseline on Blackwell with NVFP4 precision. On older CUDA hardware the throughput advantage shrinks, and exact figures aren't published. This matters because most developers running self-hosted workloads don't have Blackwell clusters.

Cloud-only. Mercury 2 isn't open-weight. You can't download and run it locally or fine-tune it on your own data. The speed numbers are API numbers. If you care about data privacy, on-premises deployment, or model ownership, Mercury 2 isn't your answer - look instead at Nemotron 3 Nano, which is open-weight and available on Hugging Face.

Text-only. No images, no audio, no video. The entire Mercury architecture is designed for text sequences. In a product landscape moving rapidly toward multimodal agents, this is a striking gap.

No long-form stress test. The 91.1 AIME score is impressive, but AIME problems fit in a short context. There's no published data on multi-turn coherence over 10K+ tokens, or on whether the parallel refinement approach degrades for very long outputs. The 128K context window is there - what happens inside it isn't fully documented.

Benchmark familiarity. AIME and GPQA scores have become routine claims in model launches. For a full picture of what the numbers actually mean, the guide to understanding AI benchmarks covers what these evaluations measure and where they break down.

The Architecture Gap Is Closing

Diffusion for text is not a new idea - researchers have been working on masked diffusion models since at least 2022. What Inception has done is make one fast enough and coherent enough to compete in production. The original Mercury launched quietly in 2024 as a research artifact. Mercury 2 is a commercial product with real throughput, published pricing, and OpenAI API compatibility - meaning you can swap it into existing code without touching a prompt template.
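In practice, OpenAI API compatibility means the switch can be as small as changing the base URL and model name in an existing client. The endpoint URL and model identifier below are illustrative placeholders, not confirmed values - check Inception's documentation for the real ones.

```python
# Drop-in swap via the OpenAI Python SDK, assuming an OpenAI-compatible
# endpoint. URL and model id are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",  # assumed endpoint
    api_key="YOUR_INCEPTION_API_KEY",
)

resp = client.chat.completions.create(
    model="mercury-2",  # assumed model identifier
    messages=[{"role": "user", "content": "Explain diffusion LLMs in one sentence."}],
)
print(resp.choices[0].message.content)
```

The rest of the application - prompt templates, message formats, streaming handlers - stays untouched, which is what makes same-day A/B benchmarking against an incumbent model cheap to set up.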

The 1,000 t/s ceiling matters most in three settings: real-time voice assistants (where 70 t/s is perceptibly slow), agent loops (where latency compounds across dozens of sequential model calls), and search or RAG pipelines (where throughput determines how many documents you can process in a given time budget). For these use cases, the cost efficiency leaderboard shows Mercury 2 landing near the top once latency and price are weighted together.
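The agent-loop case is worth a back-of-envelope check, using this article's throughput figures and an assumed workload of 500 output tokens per call:

```python
# How throughput compounds across sequential model calls in an agent
# loop. Throughput figures are from this article; the per-call token
# count is an assumed workload, and network overhead is ignored.
CALLS = 20            # sequential model calls in one agent run
TOKENS_PER_CALL = 500

for name, tps in [("Mercury 2", 1196), ("Claude 4.5 Haiku", 89), ("GPT-5 Mini", 71)]:
    total_s = CALLS * TOKENS_PER_CALL / tps
    print(f"{name}: {total_s:.1f}s of generation for {CALLS} calls")
```

Under those assumptions the full run takes under ten seconds of generation time on Mercury 2 versus roughly two minutes on the autoregressive comparisons - the difference between an interactive agent and one users abandon.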


Should You Care?

If you are building a pipeline where speed is a constraint and quality can sit at the Haiku/Mini level, Mercury 2 is worth running a benchmark against your actual workload. The Artificial Analysis throughput numbers are real and reproducible. The quality scores are honest - it isn't trying to be o3.

The harder question is architecture risk. Diffusion LLMs are an emerging class, not a proven commodity. The training recipes, fine-tuning pipelines, and failure modes are less understood than autoregressive models. If something breaks in production, there is a smaller community of engineers who have seen this failure mode before.

That caveat aside, the speed advantage isn't marginal - it is an order of magnitude. For agent frameworks and latency-sensitive APIs, that changes what is possible in a single inference budget. Engineers building those systems should test it before dismissing it.

About the author: AI Infrastructure & Open Source Reporter

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.