AI Voice and Speech Leaderboard: TTS and STT Rankings
Rankings of the best text-to-speech and speech-to-text AI models by naturalness, accuracy, latency, and pricing.

Voice AI is quietly becoming the infrastructure layer that most people interact with daily without thinking about it. Every time you talk to a customer service bot, listen to an AI-narrated podcast, or dictate a message on your phone, a TTS or STT model is doing the heavy lifting. The market is projected to hit $7.5 billion by 2028, and the technical quality gap between the best and worst options is enormous.
The problem with evaluating voice models is that TTS and STT require completely different metrics. A speech synthesis model that sounds amazing might cost 200x more per character than a passable alternative. A transcription engine with the lowest word error rate might add 500ms of latency that kills your real-time voice agent. I've pulled data from the Artificial Analysis Speech Arena, the TTS Arena on Hugging Face, and independent benchmarks to build rankings that account for quality, speed, and cost together.
TL;DR
- ElevenLabs Scribe v2 leads STT accuracy at 2.3% WER, but costs $6.67 per 1,000 minutes
- Deepgram Nova-3 offers 5.26% WER at $4.30 for budget-conscious teams
- Inworld TTS 1 Max tops the Artificial Analysis Speech Arena with ELO 1,162, while Kokoro 82M delivers surprisingly strong quality at $0.70 per million characters
- Cartesia Sonic 3 wins the latency race at 40ms time-to-first-audio, essential for conversational AI agents
- Open-source TTS has caught up fast - Kokoro and Fish Speech v1.5 are production-viable for many use cases
How TTS and STT Benchmarks Work
Measuring voice AI quality requires both automated metrics and human listening tests.
TTS and STT evaluation couldn't be more different, which is why lumping them into a single "voice AI" ranking without separating the metrics is misleading.
Text-to-Speech (TTS) quality is mostly measured through:
- Mean Opinion Score (MOS) - Human listeners rate synthesized speech on a 1-5 scale for naturalness. Scores above 4.0 are generally considered "near-human." The issue: MOS tests are expensive, slow, and vary across listener panels.
- ELO Rating - The TTS Arena on Hugging Face and the Artificial Analysis Speech Arena both use ELO systems where listeners compare two audio samples blind and pick the more natural one. This sidesteps some MOS inconsistencies but introduces its own biases toward English and short-form content.
- Latency - Time-to-first-audio (TTFA) matters for real-time applications. Anything above 300ms creates noticeable conversation lag in voice agents.
- Word Error Rate (WER) in synthesis - Measures mispronunciations, skipped words, and hallucinated sounds. Lower is better.
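Time-to-first-audio is easy to measure yourself against any streaming TTS endpoint: wall-clock the gap between starting to consume the response and receiving the first audio chunk. A minimal sketch, with a simulated chunk generator standing in for a real provider SDK (`fake_stream` is a hypothetical stand-in, not any vendor's API):

```python
import time
from typing import Iterable, Iterator

def time_to_first_audio(chunks: Iterable[bytes]) -> float:
    """Seconds between starting to consume a stream and its first audio chunk."""
    start = time.monotonic()
    it: Iterator[bytes] = iter(chunks)
    next(it)  # block until the first chunk arrives
    return time.monotonic() - start

def fake_stream(delay_s: float) -> Iterator[bytes]:
    """Stand-in for a provider's streaming response (hypothetical)."""
    time.sleep(delay_s)       # simulated network + synthesis delay
    yield b"\x00" * 320       # first audio frame
    yield b"\x00" * 320       # subsequent frames

ttfa = time_to_first_audio(fake_stream(0.05))  # at least 0.05 s by construction
```

The same helper works unchanged against any real SDK whose response is iterable chunk by chunk; only `fake_stream` is simulated here.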
Speech-to-Text (STT) benchmarks center on:
- Word Error Rate (WER) - The percentage of words incorrectly transcribed. Calculated as (substitutions + insertions + deletions) / total words. A 5% WER means roughly 1 in 20 words is wrong.
- Real-time factor - How fast the model transcribes relative to audio duration. A 100x factor means 1 hour of audio is transcribed in 36 seconds.
- Latency - For streaming applications, the delay between speech and transcription output.
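The WER formula above is word-level edit distance divided by reference length, which a short dynamic-programming sketch makes concrete (this is the standard Levenshtein recurrence, not any vendor's scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

wer("the cat sat on the mat", "the cat sat on a mat")  # 1 substitution / 6 words ≈ 0.167
```

Production evaluations also normalize text first (casing, punctuation, number formatting) before scoring; without that step, WER numbers from different sources aren't comparable.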
One caveat worth repeating: vendor-published benchmarks are self-reported. I weight third-party evaluations like Artificial Analysis and TTS Arena more heavily, and I'll note where data comes from self-reported sources.
TTS Rankings - March 2026
| Rank | Model | Provider | Arena ELO | Latency (TTFA) | Price (per 1M chars) | Voice Cloning |
|---|---|---|---|---|---|---|
| 1 | Inworld TTS 1 Max | Inworld | 1,162 | <250ms | $10 | No |
| 2 | Inworld TTS 1.5 Max | Inworld | 1,115 | <250ms | $10 | No |
| 3 | TTS-1 | OpenAI | 1,111 | ~500ms | $15 | No |
| 4 | Speech-02-Turbo | MiniMax | 1,107 | N/A | $100 | Yes |
| 5 | Multilingual v2 | ElevenLabs | 1,105 | 75ms | $206 | Yes |
| 6 | Speech 2.6 HD | MiniMax | 1,105 | N/A | $100 | Yes |
| 7 | Turbo v2.5 | ElevenLabs | 1,096 | 75ms | $206 | Yes |
| 8 | Fish Audio OpenAudio S1 | Fish Audio | 1,074 | N/A | $15 | Yes |
| 9 | Amazon Polly Generative | AWS | 1,060 | 100ms-1s | $30 | No |
| 10 | Kokoro 82M v1.0 | Open Source | 1,059 | <50ms | $0.70 | No |
| 11 | Cartesia Sonic 3 | Cartesia | 1,054 | 40ms | $46.70 | Yes |
| 12 | gpt-4o-mini-tts | OpenAI | ~1,040 | ~500ms | $15 | No |
ELO scores from Artificial Analysis Speech Arena (March 2026). Pricing reflects pay-as-you-go rates. ElevenLabs pricing varies by plan tier - $0.30/1K chars on Creator down to $0.12/1K chars on Business.
The spread tells an interesting story. Inworld has quietly taken both top slots with strong quality at $10 per million characters - that's 20x cheaper than ElevenLabs for higher-rated output. OpenAI's original TTS-1 still holds up remarkably well at rank 3, which is notable given that it launched in late 2023.
ElevenLabs remains the go-to name in voice AI, and their Multilingual v2 delivers arguably the most expressive and emotionally rich synthesis available. But at $206 per million characters on the default API tier, you're paying a steep premium. Their recently launched v3 Alpha (ELO 1,095) is close to Turbo v2.5 in quality and may signal a pricing reset once it exits preview.
MiniMax's Speech-02 models are strong performers that don't get enough attention in the English-speaking market. They offer voice cloning at competitive quality, though documentation and API stability have been inconsistent based on developer reports.
Kokoro 82M at rank 10 is the standout value play. It's an 82-million-parameter open-source model that runs at 96x real-time on a basic cloud GPU, with quality that beats Amazon Polly in blind listening tests. At $0.70 per million characters (self-hosted compute cost), it's roughly 294x cheaper than ElevenLabs.
STT Rankings - March 2026
Word error rate remains the primary metric, but latency and pricing matter just as much in production.
| Rank | Model | Provider | WER (English) | Speed (RT factor) | Real-time Streaming | Price (per 1K min) |
|---|---|---|---|---|---|---|
| 1 | Scribe v2 | ElevenLabs | 2.3% | 31.7x | Yes | $6.67 |
| 2 | Gemini 3 Pro | Google | 2.9% | 5.9x | Yes | $7.68 |
| 3 | Voxtral Small | Mistral | 3.0% | 67.7x | No | $4.00 |
| 4 | Gemini 2.5 Pro | Google | 3.1% | 13.3x | Yes | $4.80 |
| 5 | Gemini 3 Flash | Google | 3.1% | 11.9x | Yes | $1.92 |
| 6 | GPT-4o Transcribe | OpenAI | 3.5% | N/A | Yes | $6.00 |
| 7 | Nova-3 | Deepgram | 5.26% | 441.6x | Yes | $4.30 |
| 8 | Universal-2 | AssemblyAI | 5.5% | N/A | Yes | $2.50 |
| 9 | Whisper large-v3 | OpenAI | 6.5% | 1x (self-hosted) | No (batch) | $6.00 (API) / Free |
| 10 | Chirp 2 | Google Cloud | 7.4% | N/A | Yes | $16.00 |
| 11 | GPT-4o-mini Transcribe | OpenAI | 7.8% | N/A | Yes | $3.00 |
WER from Artificial Analysis benchmarks (February 2026) across three English test datasets. Self-hosted models are marked "Free" in the pricing column. Speed measured as real-time factor on provider infrastructure.
ElevenLabs Scribe v2 took the accuracy crown with a 2.3% WER, which is remarkably close to human transcription error rates (usually around 2-4% depending on audio quality). But Google's dominance is the real headline - three Gemini models in the top five, with Gemini 3 Flash offering 3.1% WER at just $1.92 per 1,000 minutes. That's the best accuracy-to-price ratio in the entire table.
Deepgram Nova-3 at rank 7 might look mid-tier on WER alone, but context matters. Its 441.6x real-time processing speed makes it the fastest commercial STT by a wide margin, and its sub-300ms streaming latency is purpose-built for voice agents that need to respond instantly. If you're building a ChatGPT-style voice assistant, Nova-3's latency advantage outweighs the roughly 3-percentage-point WER gap vs. Scribe v2.
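Real-time factor translates directly into batch turnaround time, which is worth computing before picking a model. A quick sketch using the speeds from the table above (the 100-hour workload is an assumed example):

```python
def transcription_seconds(audio_hours: float, rt_factor: float) -> float:
    """Wall-clock seconds to transcribe audio at a given real-time factor."""
    return audio_hours * 3600 / rt_factor

# Assumed batch job: 100 hours of recordings
nova3_secs  = transcription_seconds(100, 441.6)  # ≈ 815 s, about 14 minutes
scribe_secs = transcription_seconds(100, 31.7)   # ≈ 11,356 s, over 3 hours
```

The sanity check: at 100x real-time, one hour of audio takes 36 seconds, matching the definition given earlier.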
Whisper large-v3 remains the workhorse of the open-source STT ecosystem. At 6.5% WER it's not competitive with the top commercial APIs on accuracy, but it's free to self-host, supports 99 languages, and has spawned an entire ecosystem of optimized variants (Faster Whisper, WhisperX, Distil-Whisper). For batch transcription where latency doesn't matter, it's hard to beat free.
Google Cloud's Chirp 2 is notably expensive at $16 per 1,000 minutes with a 7.4% WER that doesn't justify the price. Their own Gemini models are cheaper and more accurate - Chirp 2 exists mainly for teams locked into the Google Cloud Speech-to-Text API.
Key Takeaways
The Price-Quality Curve Is Not Linear
ElevenLabs charges $206 per million characters for TTS that scores ELO 1,105. Inworld charges $10 for ELO 1,162. Kokoro costs $0.70 for ELO 1,059. The 103-point ELO gap between Inworld and Kokoro is noticeable in blind tests, but whether it's worth a 14x price increase depends completely on your use case. For internal tools and prototypes, Kokoro is more than enough. For customer-facing voice experiences where brand perception matters, Inworld or ElevenLabs make sense.
On the STT side, Gemini 3 Flash at $1.92/1K minutes with 3.1% WER vs. ElevenLabs Scribe v2 at $6.67/1K minutes with 2.3% WER means you're paying 3.5x more for a 0.8 percentage point improvement. At scale, that adds up.
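The cost arithmetic is easy to sanity-check yourself. A sketch with assumed volumes (100,000 transcribed minutes per month, a ~500k-character audiobook; prices come from the tables above):

```python
def stt_monthly_cost(minutes: float, price_per_1k_min: float) -> float:
    """Monthly STT spend given audio minutes and a per-1,000-minute rate."""
    return minutes / 1_000 * price_per_1k_min

def tts_cost(characters: float, price_per_million_chars: float) -> float:
    """One-off synthesis cost given character count and a per-1M-character rate."""
    return characters / 1_000_000 * price_per_million_chars

MINUTES = 100_000  # assumed monthly transcription volume
flash   = stt_monthly_cost(MINUTES, 1.92)  # Gemini 3 Flash, 3.1% WER: $192/month
scribe  = stt_monthly_cost(MINUTES, 6.67)  # Scribe v2, 2.3% WER: $667/month
premium = scribe - flash                   # $475/month buys the 0.8pp WER gain

# TTS side: a ~500,000-character audiobook (length assumed)
kokoro_cost     = tts_cost(500_000, 0.70)   # ≈ $0.35 self-hosted
elevenlabs_cost = tts_cost(500_000, 206.0)  # ≈ $103.00
```

Whether the premium is worth it depends on what an error costs you: for a legal transcription service it's cheap insurance, for internal meeting notes it's probably not.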
Latency Separates Real-Time From Batch
Cartesia Sonic 3's 40ms TTFA is in a different category from OpenAI's ~500ms. For voice agents that need to feel conversational, anything above 250ms creates awkward gaps. The current latency leaders for TTS are Cartesia (40ms), ElevenLabs (75ms), and Inworld (<250ms). For STT streaming, Deepgram Nova-3's sub-300ms pipeline latency is the benchmark to beat.
Voice Cloning Is Still a Premium Feature
Only ElevenLabs, MiniMax, Fish Audio, and Cartesia offer voice cloning in the top-ranked TTS models. If you need custom voices - for audiobook narration, brand voices, or content localization - your options narrow quickly. Fish Audio's OpenAudio S1 at $15/million characters with voice cloning and ELO 1,074 is the best value in this niche.
Open Source vs. Commercial
The open-source TTS gap has closed dramatically in the past year. Kokoro 82M (ELO 1,059) beats Amazon Polly Generative (ELO 1,060) and sits only 37 points below ElevenLabs Turbo v2.5. Fish Speech v1.5, with its dual autoregressive architecture and 3.5% WER in synthesis, supports English, Chinese, and Japanese with over 300,000 hours of training data.
For STT, Whisper large-v3 remains the standard open-source option but shows its age against commercial APIs. The 6.5% English WER is workable for many applications but noticeably worse than the sub-4% results from ElevenLabs, Google, and Mistral. The Faster Whisper project on NVIDIA GPUs can push processing speed to over 100x real-time, which helps for batch processing.
Where open source still falls short: multilingual TTS quality, emotional expressiveness, and voice cloning. If you need natural-sounding synthesis in languages beyond English, Chinese, and Japanese, commercial APIs are still the safer bet. ElevenLabs supports over 70 languages with consistent quality. Google's TTS covers 40+ languages with WaveNet and Neural2 voices.
Practical Guidance
Building a voice agent or IVR system? Pair Deepgram Nova-3 (STT) with Cartesia Sonic 3 or Inworld (TTS). Total round-trip latency under 500ms. Cost-effective at scale. Nova-3's streaming capability means transcription starts before the caller finishes speaking.
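The round-trip budget for that pairing can be sketched as a simple sum of the pipeline stages (the LLM first-token time is an assumed placeholder; the STT and TTFA figures come from the tables above):

```python
def round_trip_ms(stt_ms: float, llm_first_token_ms: float, tts_ttfa_ms: float) -> float:
    """Perceived response gap: streaming STT finalization + LLM first token + TTS first audio."""
    return stt_ms + llm_first_token_ms + tts_ttfa_ms

# Nova-3 streaming (~300 ms) + assumed LLM first token (150 ms) + Sonic 3 TTFA (40 ms)
budget = round_trip_ms(300, 150, 40)  # 490 ms, just under the 500 ms target
```

Swapping Sonic 3 for a ~500ms TTFA model blows the budget on TTS alone, which is why the latency column matters more than ELO for this use case.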
Producing podcasts or audiobooks? ElevenLabs Multilingual v2 or MiniMax Speech-02-HD for maximum naturalness and emotional range. The higher per-character cost is irrelevant when you're generating a fixed amount of content. Voice cloning lets you maintain a consistent narrator voice.
Accessibility and screen readers? OpenAI TTS-1 or gpt-4o-mini-tts. Reliable, clear output at $15 per million characters. Supports 50+ languages. The steerable delivery in gpt-4o-mini-tts lets you adjust speed and tone through prompting.
Developer prototyping on a budget? Kokoro 82M for TTS (self-hosted, practically free) and Whisper large-v3 for STT (also free). Both run on consumer GPUs. Swap to commercial APIs only when the quality gap matters for your specific application.
High-volume transcription? Gemini 3 Flash at $1.92/1K minutes with 3.1% WER is the best bang for your dollar. For even cheaper batch processing, GPT-4o-mini Transcribe at $3.00/1K minutes handles most English content acceptably.
These rankings will shift. The TTS Arena and Artificial Analysis leaderboards update weekly, and new models from Google, ElevenLabs, and the open-source community ship monthly. I'll update this leaderboard quarterly with fresh benchmark data. Check our overall AI model rankings for the broader picture, or browse voice generator tools for hands-on comparisons.
Sources
- Artificial Analysis Speech-to-Text Leaderboard - WER benchmarks, speed, and pricing data for STT models
- Artificial Analysis Text-to-Speech Leaderboard - ELO rankings from blind listening comparisons
- TTS Arena v2 on Hugging Face - Community-driven TTS quality rankings
- Deepgram Nova-3 Benchmarks - Provider-published STT accuracy and latency data
- Inworld TTS Benchmarks (2026) - Independent real-time voice agent TTS comparison
- ElevenLabs API Pricing - Current TTS and STT pricing tiers
- OpenAI Audio Models Update - Model specs and pricing for TTS-1 and gpt-4o-mini-tts
- Kokoro 82M on Hugging Face - Open-source model weights and benchmarks
✓ Last verified March 9, 2026
