AI Speed and Latency Leaderboard: Tokens/s Rankings

Rankings of the fastest AI models and inference providers by tokens per second, time to first token, and end-to-end latency.

Speed kills - or rather, the lack of it. In production AI systems, the difference between 50 tokens per second and 500 tokens per second isn't academic. It's the difference between a chatbot that feels like talking to a person and one that feels like waiting for a fax. For agentic workflows that chain multiple LLM calls, latency compounds fast: a 10-step reasoning chain at 2 seconds per step means 20 seconds of dead air. And at scale, throughput directly maps to cost. More tokens per second per GPU means fewer GPUs, which means real money saved.

This leaderboard ranks AI models and inference providers by the metrics that matter for production deployment: output tokens per second (TPS), time to first token (TTFT), and practical throughput under realistic conditions. All data is sourced from independent benchmarks - primarily Artificial Analysis, which tests API endpoints every 8 hours - supplemented by our own spot checks.

TL;DR - March 2026 Speed Rankings

  • Fastest model overall: Mercury 2 at ~629 tokens/s - a diffusion-based architecture that generates tokens in parallel rather than one at a time
  • Fastest traditional LLM: IBM Granite 3.3 8B at ~375 tokens/s, followed by Granite 4.0 H Small at ~355 tokens/s
  • Fastest frontier-class model: Gemini 2.5 Flash at ~221 tokens/s via Google AI Studio
  • Fastest hardware provider: Cerebras, delivering 2,522 tokens/s on Llama 4 Maverick (400B) - more than 2x NVIDIA Blackwell
  • Lowest TTFT: ServiceNow Apriel-v1.5-15B-Thinker at 0.37 seconds
  • Best speed-to-quality ratio: Gemini 2.5 Flash - strong benchmark scores at 221 tokens/s and $0.15/M input tokens

How We Measure Speed

Three metrics define inference performance, and they measure different things:

Output Tokens Per Second (TPS) is the average number of tokens generated per second after the first token arrives. This is the headline number most providers advertise, and it's what determines how fast text streams to the user. Higher is better, but context matters: a model producing 500 TPS on a 10-token response is less useful than one producing 200 TPS sustained over 2,000 tokens.

Time to First Token (TTFT) measures how long a user waits before seeing any output. For interactive applications - chatbots, autocomplete, real-time translation - TTFT matters more than raw throughput. A model with 0.4s TTFT and 100 TPS will feel faster to a user than one with 2s TTFT and 300 TPS for short responses.

End-to-End Latency combines TTFT plus the time to produce the full response. This is what matters for non-streaming use cases and agentic pipelines where you need the complete output before proceeding.
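As a rough sketch of how these three metrics relate, the function below computes all of them from a token stream. No particular vendor SDK is assumed; `token_iter` is a hypothetical stand-in for any streaming API response:

```python
import time

def measure_stream(token_iter):
    """Consume a token stream and compute TTFT, output TPS, and end-to-end latency.

    `token_iter` is any iterable yielding tokens as they arrive - a hypothetical
    stand-in for a provider's streaming response object.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        if first is None:
            first = time.perf_counter()  # first token has arrived
        count += 1
    end = time.perf_counter()
    ttft = first - start
    # Output TPS is conventionally measured after the first token arrives.
    tps = (count - 1) / (end - first) if count > 1 and end > first else 0.0
    return {"ttft_s": ttft, "output_tps": tps, "e2e_s": end - start}
```

Plugging in the example from above: a 100-token response at 0.4s TTFT and 100 TPS finishes in roughly 0.4 + 99/100 ≈ 1.4 seconds, while 2s TTFT at 300 TPS takes about 2 + 99/300 ≈ 2.3 seconds - which is why the slower-streaming model still feels faster.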

All TPS figures in this leaderboard use OpenAI token equivalents as a standardized unit, following Artificial Analysis methodology. Measurements reflect single-request performance (not batched throughput) taken over the past 72 hours as of publication.

Output Speed Rankings - March 2026

| Rank | Model | Output TPS | TTFT (s) | Provider | Price (Output/1M) |
|------|-------|------------|----------|----------|-------------------|
| 1 | Mercury 2 | 629 | 0.8 | Inception Labs | $0.75 |
| 2 | IBM Granite 3.3 8B | 375 | 0.5 | IBM | $0.04 |
| 3 | IBM Granite 4.0 H Small | 355 | 0.5 | IBM | $0.05 |
| 4 | Gemini 2.5 Flash-Lite | 341 | 0.42 | Google | $0.30 |
| 5 | AWS Nova Micro | 305 | 0.6 | Amazon | $0.14 |
| 6 | Gemini 2.5 Flash | 221 | 0.45 | Google | $0.60 |
| 7 | Gemini 3 Flash Preview | 199 | 0.5 | Google | $0.80 |
| 8 | GPT-5.2 | 187 | 1.2 | OpenAI | $10.00 |
| 9 | Claude 4.5 Haiku | 135 | 0.6 | Anthropic | $1.25 |
| 10 | Gemini 3.1 Pro Preview | 110 | 0.9 | Google | $5.00 |
| 11 | GPT-5 (high) | 58 | 1.5 | OpenAI | $15.00 |
| 12 | Claude 4.5 Sonnet | 40 | 1.5 | Anthropic | $15.00 |

Data: Artificial Analysis, March 2026. Single-request measurements. Prices reflect standard API tier.

A few things jump out from this table. Mercury 2 isn't just fast - it's in a different category. At 629 TPS, it's roughly 1.7x faster than the next model. That's because it doesn't work like a traditional autoregressive LLM. Built on a diffusion architecture, Mercury 2 generates full draft sequences and refines them in parallel, rather than producing tokens one at a time. Independent testing by Artificial Analysis clocked it at over 1,100 TPS in some configurations - roughly 8x the throughput of Claude 4.5 Haiku at its table rate of 135 TPS.

IBM's Granite models occupy the second and third slots, which is a surprise to many. These are small models (8B and similar parameter counts), and their speed comes from sheer compactness rather than architectural innovation. They won't match frontier models on reasoning benchmarks, but for straightforward tasks like classification, extraction, or simple Q&A, they're hard to beat on speed.

Google's Gemini family dominates the middle of the table. Gemini 2.5 Flash at 221 TPS is the sweet spot for many developers: it's fast enough for real-time streaming, scores competitively on reasoning benchmarks, and costs just $0.60 per million output tokens. The Flash-Lite variant pushes to 341 TPS if you can accept slightly lower quality.

Inference Provider Speed Rankings

The same model can run at very different speeds depending on who's serving it. Here's how dedicated inference providers compare on Llama 4 Maverick (400B parameters) - a useful benchmark because it's large enough to stress any hardware:

| Rank | Provider | Model | Output TPS | Hardware |
|------|----------|-------|------------|----------|
| 1 | Cerebras | Llama 4 Maverick | 2,522 | CS-3 Wafer-Scale Engine |
| 2 | SambaNova | Llama 4 Maverick | 794 | SN40L RDU |
| 3 | Groq | Llama 4 Maverick | 549 | LPU |
| 4 | Amazon Bedrock | Llama 4 Maverick | 290 | NVIDIA GPUs |
| 5 | Google Cloud | Llama 4 Maverick | 125 | TPU v5 |
| 6 | Microsoft Azure | Llama 4 Maverick | 54 | NVIDIA GPUs |

Data: Cerebras press release and Artificial Analysis provider benchmarks, Q1 2026.

Cerebras is in a class of its own. Its wafer-scale engine - a single chip the size of a dinner plate with ~21 PB/s of on-chip memory bandwidth - holds entire models in SRAM and removes the memory bottleneck that limits GPU inference. On Llama 3.3 70B, Cerebras has hit 2,100 TPS; on the smaller Llama 3.1 8B, it reaches 1,800 TPS. These numbers are real, verified by Artificial Analysis, and they're roughly 4-6x what Groq achieves on identical models.

SambaNova's RDU architecture takes second place. Its upcoming SN50 chip (shipping H2 2026) claims 895 TPS per user on Llama 3.3 70B with FP8 precision - nearly 5x NVIDIA B200's 184 TPS on the same model. Those are vendor benchmarks, so treat them accordingly, but even discounting for marketing optimism, SambaNova is consistently fast.

Groq's LPU remains the most well-known speed-focused chip. At 549 TPS on Maverick, it's fast but no longer the fastest. Where Groq still excels is TTFT - its deterministic, compiler-driven architecture pre-computes the entire execution graph, delivering exceptionally consistent latency with near-zero variance. For voice assistants and real-time applications where predictable response times matter more than peak throughput, Groq is still a strong pick.

Speed vs. Quality - The Real Tradeoff

The rankings above tell a clear story: speed and intelligence are still inversely correlated, but the gap is narrowing.

Consider the extremes. Mercury 2 runs at 629 TPS and matches Claude 4.5 Haiku on reasoning benchmarks (AIME 2025: 91.1%, GPQA: 73.6%). GPT-5 (high) runs at 58 TPS - roughly 11x slower - but scores significantly higher on complex reasoning tasks. For a simple customer support bot, Mercury 2 is the obvious choice. For a research assistant working through multi-step proofs, GPT-5's extra thinking time buys real capability.

The Gemini 2.5 Flash family represents the best current compromise. At 221 TPS it's not the fastest, but it's fast enough for smooth streaming. Its Chatbot Arena Elo of 1335 puts it in the "premium efficient" tier - well ahead of small speed-optimized models, close enough to frontier models that most users can't tell the difference. At $0.60/M output tokens, it's also 25x cheaper than GPT-5. See our cost efficiency leaderboard for the full breakdown.
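One way to make this tradeoff concrete is a simple filter: set a streaming-speed floor and a TTFT ceiling, then take the cheapest model that clears both. The figures below are copied from the rankings table; the thresholds themselves are illustrative assumptions, not recommendations:

```python
# (name, output_tps, ttft_s, output_price_per_1M) - from the rankings table above
MODELS = [
    ("Mercury 2", 629, 0.8, 0.75),
    ("Gemini 2.5 Flash-Lite", 341, 0.42, 0.30),
    ("Gemini 2.5 Flash", 221, 0.45, 0.60),
    ("Claude 4.5 Haiku", 135, 0.6, 1.25),
    ("GPT-5 (high)", 58, 1.5, 15.00),
]

def pick(models, min_tps=100, max_ttft=0.6):
    """Cheapest model meeting a streaming-speed floor and a TTFT ceiling."""
    eligible = [m for m in models if m[1] >= min_tps and m[2] <= max_ttft]
    return min(eligible, key=lambda m: m[3]) if eligible else None
```

With the defaults shown, the filter lands on Gemini 2.5 Flash-Lite; tighten `min_tps` past Mercury 2's TTFT tolerance and nothing qualifies - exactly the speed-versus-quality corner the rankings describe.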

When Speed Matters Most

Not every workload needs maximum TPS. Here's a practical guide:

Real-time chat and voice assistants: TTFT is king. Users notice delays above ~500ms. Groq's consistent sub-400ms TTFT and Gemini 2.5 Flash-Lite's 0.42s make them strong candidates. Mercury 2 is fast on throughput but its TTFT of 0.8s is average - the diffusion architecture needs a brief warmup before tokens flow.

Agentic workflows and tool-calling loops: End-to-end latency per call matters most. If your agent makes 8-15 LLM calls per task, shaving 200ms per call saves 1.6-3 seconds total. Gemini 2.5 Flash with its combination of speed, quality, and low TTFT is currently the best pick here. Mercury 2 is worth testing if your agent can tolerate slightly lower reasoning quality.

Batch processing and data pipelines: Raw throughput wins. Cerebras at 2,500+ TPS on 400B-parameter models is unmatched for processing large document sets, running eval suites, or doing bulk classification. SambaNova is the runner-up. GPU-based providers can close the gap somewhat with batched inference (multiple requests processed in parallel), but per-request speed still favors custom silicon.

Streaming content generation: For long-form writing, code generation, or any task where the user watches tokens appear, sustained TPS above ~100 feels fluid. Anything below ~50 TPS feels sluggish. Most frontier models clear this bar. The Gemini 3 Flash Preview at 199 TPS is especially good for streaming code.
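For the agentic case above, per-call latency compounds linearly, so a back-of-the-envelope model is enough to compare candidates. Speed figures come from the rankings table; the per-call output size of 300 tokens is an assumption for illustration:

```python
def pipeline_latency(n_calls: int, ttft_s: float, out_tokens: int, tps: float) -> float:
    """Rough end-to-end latency of a sequential agent pipeline:
    each call pays its TTFT plus generation time for its output tokens."""
    return n_calls * (ttft_s + out_tokens / tps)

# 10 sequential calls, 300 output tokens each (assumed), table speeds:
flash = pipeline_latency(10, 0.45, 300, 221)   # Gemini 2.5 Flash -> ~18.1 s
sonnet = pipeline_latency(10, 1.5, 300, 40)    # Claude 4.5 Sonnet -> 90.0 s
```

Under these assumptions the same 10-step task takes roughly 18 seconds on Gemini 2.5 Flash versus 90 seconds on Claude 4.5 Sonnet - the compounding effect described in the introduction.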

What's Coming Next

The inference speed landscape is shifting fast. Three developments will reshape these rankings over the next 6-12 months:

SambaNova SN50 (H2 2026): If vendor claims hold - 3x more efficient than NVIDIA B200, with 1.6 PFLOPS FP16 - this chip could challenge Cerebras for the throughput crown. It's designed specifically for agentic workloads with a three-tier memory architecture supporting models up to 10 trillion parameters.

Diffusion LLMs going mainstream: Mercury 2 proved the concept. Expect more models built on parallel token generation rather than sequential autoregression. The speed advantage is architectural - it doesn't require custom silicon, which means any cloud provider can serve these models fast on standard GPUs.

NVIDIA Blackwell optimization: NVIDIA's B200 posted 60,000 TPS per GPU with TensorRT-LLM on optimized workloads. That's an aggregate throughput number (not per-user), but it shows that GPU-based inference isn't standing still. The gap between custom ASICs and GPUs may narrow as NVIDIA's software stack matures.

The Bottom Line

For most developers in March 2026, Gemini 2.5 Flash is the default recommendation for speed-sensitive applications. It's fast (221 TPS), cheap ($0.60/M output), has a low TTFT (0.45s), and performs well on benchmarks. If you need something faster and can tolerate Haiku-level quality, Mercury 2 at 629 TPS is the standout.

For infrastructure teams running open-weight models at scale, Cerebras is the performance leader - but access is limited and pricing isn't transparent. Groq and SambaNova offer more accessible alternatives through their cloud APIs. Check our free AI inference providers guide for options with generous free tiers.

And if you're building an application where quality matters more than speed, the overall LLM rankings are a better starting point. Speed is a means to an end - the goal is to ship a product that works.

Last verified: March 9, 2026

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.