AI Speed and Latency Leaderboard: Tokens/s Rankings
Rankings of the fastest AI models and inference providers by tokens per second, time to first token, and end-to-end latency.

Speed kills - or rather, the lack of it. In production AI systems, the difference between 50 tokens per second and 500 tokens per second isn't academic. It's the difference between a chatbot that feels like talking to a person and one that feels like waiting for a fax. For agentic workflows that chain multiple LLM calls, latency compounds fast: a 10-step reasoning chain at 2 seconds per step means 20 seconds of dead air. And at scale, throughput directly maps to cost. More tokens per second per GPU means fewer GPUs, which means real money saved.
This leaderboard ranks AI models and inference providers by the metrics that matter for production deployment: output tokens per second (TPS), time to first token (TTFT), and practical throughput under realistic conditions. All data is sourced from independent benchmarks - mostly Artificial Analysis, which tests API endpoints every 8 hours - supplemented by our own spot checks.
TL;DR - March 2026 Speed Rankings
- Fastest model overall: Mercury 2 at ~629 tokens/s - a diffusion-based architecture that creates tokens in parallel rather than one at a time
- Fastest traditional LLM: IBM Granite 3.3 8B at ~375 tokens/s, followed by Granite 4.0 H Small at ~355 tokens/s
- Fastest frontier-class model: Gemini 2.5 Flash at ~221 tokens/s via Google AI Studio
- Fastest hardware provider: Cerebras, delivering 2,522 tokens/s on Llama 4 Maverick (400B) - more than 3x the next-fastest provider
- Lowest TTFT: ServiceNow Apriel-v1.5-15B-Thinker at 0.37 seconds
- Best speed-to-quality ratio: Gemini 2.5 Flash - strong benchmark scores at 221 tokens/s and $0.15/M input tokens
How We Measure Speed
Three metrics define inference performance, and they measure different things:
Output Tokens Per Second (TPS) is the average number of tokens generated per second after the first token arrives. This is the headline number most providers advertise, and it's what determines how fast text streams to the user. Higher is better, but context matters: a model producing 500 TPS on a 10-token response is less useful than one producing 200 TPS sustained over 2,000 tokens.
Time to First Token (TTFT) measures how long a user waits before seeing any output. For interactive applications - chatbots, autocomplete, real-time translation - TTFT matters more than raw throughput. A model with 0.4s TTFT and 100 TPS will feel faster to a user than one with 2s TTFT and 300 TPS for short responses.
End-to-End Latency combines TTFT plus the time to produce the full response. This is what matters for non-streaming use cases and agentic pipelines where you need the complete output before proceeding.
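The interplay between these metrics can be made concrete with a little arithmetic. The sketch below models total response time as TTFT plus streaming time, using the two hypothetical models from the TTFT example above:

```python
def end_to_end_latency(ttft_s: float, tps: float, output_tokens: int) -> float:
    """Total wall-clock time: wait for the first token, then stream the rest."""
    return ttft_s + (output_tokens - 1) / tps

# The two hypothetical models from the TTFT example: for a short 80-token reply,
# the low-TTFT model finishes first despite one-third the throughput.
print(f"{end_to_end_latency(ttft_s=0.4, tps=100, output_tokens=80):.2f}s")  # 1.19s
print(f"{end_to_end_latency(ttft_s=2.0, tps=300, output_tokens=80):.2f}s")  # 2.26s
```

Setting the two expressions equal shows the crossover sits near 240 output tokens for this pair: below that, TTFT dominates and the quick-start model wins; above it, sustained throughput takes over.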
All TPS figures in this leaderboard use OpenAI token equivalents as a standardized unit, following Artificial Analysis methodology. Measurements reflect single-request performance (not batched throughput) taken over the past 72 hours as of publication.
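In practice, both TTFT and output TPS fall out of timestamping a token stream. Here's a minimal sketch of that measurement - `simulated_stream` is a hypothetical stand-in for a real streaming API client, which would yield token chunks the same way:

```python
import time
from typing import Iterable, Iterator

def simulated_stream(n_tokens: int, ttft_s: float, tps: float) -> Iterator[str]:
    """Stand-in for a streaming API response (real clients yield token chunks)."""
    time.sleep(ttft_s)            # prefill / queueing delay before the first token
    yield "tok"
    for _ in range(n_tokens - 1):
        time.sleep(1.0 / tps)     # steady decode rate
        yield "tok"

def measure(stream: Iterable[str]) -> tuple[float, float]:
    """Return (TTFT in seconds, output TPS measured after the first token)."""
    start = time.monotonic()
    first = last = None
    count = 0
    for _ in stream:
        now = time.monotonic()
        if first is None:
            first = now           # first token arrived: this fixes TTFT
        last = now
        count += 1
    ttft = first - start
    tps = (count - 1) / (last - first) if last > first else float("inf")
    return ttft, tps

ttft, tps = measure(simulated_stream(n_tokens=50, ttft_s=0.05, tps=500))
print(f"TTFT: {ttft * 1000:.0f} ms, output TPS: {tps:.0f}")
```

Note that TPS is computed over the interval *after* the first token, matching the definition above - folding TTFT into the denominator would understate streaming speed on short responses.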
Output Speed Rankings - March 2026
| Rank | Model | Output TPS | TTFT (s) | Provider | Price (Output/1M) |
|---|---|---|---|---|---|
| 1 | Mercury 2 | 629 | 0.8 | Inception Labs | $0.75 |
| 2 | IBM Granite 3.3 8B | 375 | 0.5 | IBM | $0.04 |
| 3 | IBM Granite 4.0 H Small | 355 | 0.5 | IBM | $0.05 |
| 4 | Gemini 2.5 Flash-Lite | 341 | 0.42 | Google | $0.30 |
| 5 | AWS Nova Micro | 305 | 0.6 | Amazon | $0.14 |
| 6 | Gemini 2.5 Flash | 221 | 0.45 | Google | $0.60 |
| 7 | Gemini 3 Flash Preview | 199 | 0.5 | Google | $0.80 |
| 8 | GPT-5.2 | 187 | 1.2 | OpenAI | $10.00 |
| 9 | Claude 4.5 Haiku | 135 | 0.6 | Anthropic | $1.25 |
| 10 | Gemini 3.1 Pro Preview | 110 | 0.9 | Google | $5.00 |
| 11 | GPT-5 (high) | 58 | 1.5 | OpenAI | $15.00 |
| 12 | Claude 4.5 Sonnet | 40 | 1.5 | Anthropic | $15.00 |
Data: Artificial Analysis, March 2026. Single-request measurements. Prices reflect standard API tier.
A few things jump out from this table. Mercury 2 isn't just fast - it's in a different category. At 629 TPS, it's roughly 1.7x faster than the next model. That's because it doesn't work like a traditional autoregressive LLM. Built on a diffusion architecture, Mercury 2 generates full draft sequences and refines them in parallel, rather than producing tokens one at a time. Independent testing by Artificial Analysis clocked it at over 1,100 TPS in some configurations - roughly 8x the throughput of Claude 4.5 Haiku.
IBM's Granite models occupy the second and third slots, which is a surprise to many. These are small models (8B and similar parameter counts), and their speed comes from sheer compactness rather than architectural innovation. They won't match frontier models on reasoning benchmarks, but for straightforward tasks like classification, extraction, or simple Q&A, they're hard to beat on speed.
Google's Gemini family dominates the middle of the table. Gemini 2.5 Flash at 221 TPS is the sweet spot for many developers: it's fast enough for real-time streaming, scores competitively on reasoning benchmarks, and costs just $0.60 per million output tokens. The Flash-Lite variant pushes to 341 TPS if you can accept slightly lower quality.
Inference Provider Speed Rankings
The same model can run at very different speeds depending on who's serving it. Here's how dedicated inference providers compare on Llama 4 Maverick (400B parameters) - a useful benchmark because it's large enough to stress any hardware:
| Rank | Provider | Model | Output TPS | Hardware |
|---|---|---|---|---|
| 1 | Cerebras | Llama 4 Maverick | 2,522 | CS-3 Wafer-Scale Engine |
| 2 | SambaNova | Llama 4 Maverick | 794 | SN40L RDU |
| 3 | Groq | Llama 4 Maverick | 549 | LPU |
| 4 | Amazon Bedrock | Llama 4 Maverick | 290 | NVIDIA GPUs |
| 5 | Google Cloud | Llama 4 Maverick | 125 | TPU v5 |
| 6 | Microsoft Azure | Llama 4 Maverick | 54 | NVIDIA GPUs |
Data: Cerebras press release and Artificial Analysis provider benchmarks, Q1 2026.
Cerebras is in a class of its own. Its wafer-scale engine - a single chip the size of a dinner plate with ~21 PB/s of on-chip memory bandwidth - holds entire models in SRAM and removes the memory bottleneck that limits GPU inference. On Llama 3.1 70B, Cerebras has hit 2,100 TPS; on the smaller Llama 3.1 8B, it reaches 1,800 TPS. These numbers are real, verified by Artificial Analysis, and they're roughly 4-6x what Groq achieves on identical models.
SambaNova's RDU architecture takes second place. Its upcoming SN50 chip (shipping H2 2026) claims 895 TPS per user on Llama 3.3 70B with FP8 precision - nearly 5x NVIDIA B200's 184 TPS on the same model. Those are vendor benchmarks, so treat them accordingly, but even discounting for marketing optimism, SambaNova is consistently fast.
Groq's LPU remains the most well-known speed-focused chip. At 549 TPS on Maverick, it's fast but no longer the fastest. Where Groq still excels is TTFT - its deterministic, compiler-driven architecture pre-computes the entire execution graph, delivering exceptionally consistent latency with near-zero variance. For voice assistants and real-time applications where predictable response times matter more than peak throughput, Groq is still a strong pick.
Speed vs. Quality - The Real Tradeoff
The rankings above tell a clear story: speed and intelligence are still inversely correlated, but the gap is narrowing.
Consider the extremes. Mercury 2 runs at 629 TPS and matches Claude 4.5 Haiku on reasoning benchmarks (AIME 2025: 91.1%, GPQA: 73.6%). GPT-5 (high) runs at 58 TPS - roughly 11x slower - but scores significantly higher on complex reasoning tasks. For a simple customer support bot, Mercury 2 is the obvious choice. For a research assistant working through multi-step proofs, GPT-5's extra thinking time buys real capability.
The Gemini 2.5 Flash family represents the best current compromise. At 221 TPS it's not the fastest, but it's fast enough for smooth streaming. Its Chatbot Arena Elo of 1335 puts it in the "premium efficient" tier - well ahead of small speed-optimized models, close enough to frontier models that most users can't tell the difference. At $0.60/M output tokens, it's also 25x cheaper than GPT-5. See our cost efficiency leaderboard for the full breakdown.
When Speed Matters Most
Not every workload needs maximum TPS. Here's a practical guide:
Real-time chat and voice assistants: TTFT is king. Users notice delays above ~500ms. Groq's consistent sub-400ms TTFT and Gemini 2.5 Flash-Lite's 0.42s make them strong candidates. Mercury 2 is fast on throughput but its TTFT of 0.8s is average - the diffusion architecture needs a brief warmup before tokens flow.
Agentic workflows and tool-calling loops: End-to-end latency per call matters most. If your agent makes 8-15 LLM calls per task, shaving 200ms per call saves 1.6-3 seconds total. Gemini 2.5 Flash with its combination of speed, quality, and low TTFT is currently the best pick here. Mercury 2 is worth testing if your agent can tolerate slightly lower reasoning quality.
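The compounding effect is easy to quantify for a sequential agent loop, where each step must finish before the next call fires. A sketch using the table's figures for Gemini 2.5 Flash and Claude 4.5 Sonnet (the 300-token-per-step workload is an illustrative assumption):

```python
def chain_latency(calls: int, ttft_s: float, tps: float, tokens_per_call: int) -> float:
    """Total latency of a sequential agent loop: each step blocks on the full completion."""
    per_call = ttft_s + tokens_per_call / tps
    return calls * per_call

# A 10-call agent loop at 300 output tokens per step.
flash  = chain_latency(10, ttft_s=0.45, tps=221, tokens_per_call=300)  # Gemini 2.5 Flash
sonnet = chain_latency(10, ttft_s=1.50, tps=40,  tokens_per_call=300)  # Claude 4.5 Sonnet

print(f"Flash:  {flash:.1f}s total")   # ~18.1s
print(f"Sonnet: {sonnet:.1f}s total")  # 90.0s
```

The same function shows the per-call savings math: trimming 200ms of TTFT across 10 calls recovers 2 seconds, exactly the linear scaling described above.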
Batch processing and data pipelines: Raw throughput wins. Cerebras at 2,500+ TPS on 400B-parameter models is unmatched for processing large document sets, running eval suites, or doing bulk classification. SambaNova is the runner-up. GPU-based providers can close the gap somewhat with batched inference (multiple requests processed in parallel), but per-request speed still favors custom silicon.
Streaming content generation: For long-form writing, code generation, or any task where the user watches tokens appear, sustained TPS above ~100 feels fluid. Anything below ~50 TPS feels sluggish. Most frontier models clear this bar. The Gemini 3 Flash Preview at 199 TPS is especially good for streaming code.
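For batch workloads, throughput translates directly into fleet size, which is where the cost argument from the introduction bites. A rough sizing sketch - the daily token budget, per-accelerator TPS, and `utilization` factor are all illustrative assumptions, not measured figures:

```python
import math

def accelerators_needed(tokens_per_day: float, tps_per_chip: float,
                        utilization: float = 0.6) -> int:
    """Chips required to sustain a daily output-token budget at an average load factor."""
    required_tps = tokens_per_day / 86_400          # seconds per day
    return math.ceil(required_tps / (tps_per_chip * utilization))

# 10B output tokens/day at 300 vs 3,000 sustained TPS per accelerator.
print(accelerators_needed(10e9, tps_per_chip=300))    # 644
print(accelerators_needed(10e9, tps_per_chip=3000))   # 65
```

A 10x per-chip throughput advantage shrinks the fleet by roughly 10x, which is the "fewer GPUs, real money saved" point in concrete terms.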
Hardware Trends to Watch
The inference speed landscape is shifting fast. Three developments will reshape these rankings over the next 6-12 months:
SambaNova SN50 (H2 2026): If vendor claims hold - 3x more efficient than NVIDIA B200, with 1.6 PFLOPS FP16 - this chip could challenge Cerebras for the throughput crown. It's designed specifically for agentic workloads with a three-tier memory architecture supporting models up to 10 trillion parameters.
Diffusion LLMs going mainstream: Mercury 2 proved the concept. Expect more models built on parallel token generation rather than sequential autoregression. The speed advantage is architectural - it doesn't require custom silicon, which means any cloud provider can serve these models fast on standard GPUs.
NVIDIA Blackwell optimization: NVIDIA's B200 posted 60,000 TPS per GPU with TensorRT-LLM on optimized workloads. That's an aggregate throughput number (not per-user), but it shows that GPU-based inference isn't standing still. The gap between custom ASICs and GPUs may narrow as NVIDIA's software stack matures.
The Bottom Line
For most developers in March 2026, Gemini 2.5 Flash is the default recommendation for speed-sensitive applications. It's fast (221 TPS), cheap ($0.60/M output), has a low TTFT (0.45s), and performs well on benchmarks. If you need something faster and can tolerate Haiku-level quality, Mercury 2 at 629 TPS is the standout.
For infrastructure teams running open-weight models at scale, Cerebras is the performance leader - but access is limited and pricing isn't transparent. Groq and SambaNova offer more accessible alternatives through their cloud APIs. Check our free AI inference providers guide for options with generous free tiers.
And if you're building an application where quality matters more than speed, the overall LLM rankings are a better starting point. Speed is a means to an end - the goal is to ship a product that works.
Sources
- Artificial Analysis - LLM Leaderboard - Primary speed and latency data
- Artificial Analysis - Performance Benchmarking Methodology
- Cerebras - Llama 4 Maverick Inference Results
- Cerebras - 3x Faster: Llama 3.1 70B at 2,100 TPS
- SambaNova - SN50 RDU Announcement
- Groq - LPU Benchmark Results
- Mercury 2 Speed Verification
✓ Last verified March 9, 2026
