Text-to-Speech API Pricing Compared - 2026
Normalized per-1M-character and per-hour TTS pricing across ElevenLabs, OpenAI, Google, Azure, Amazon Polly, Play.ht, Cartesia, Deepgram Aura, WellSaid, and more.

TL;DR
- Simplest low-cost API: OpenAI tts-1 at $15.00/1M characters - good quality, no voice cloning, 6 preset voices
- Azure Neural Standard voices come in at roughly $14.11/1M characters and include 500+ voices across 140+ languages
- Best quality-per-dollar: ElevenLabs Turbo v2.5 at $66/1M characters on the Scale plan, with voice cloning included
- Realtime/streaming winner: Cartesia Sonic at $50/1M characters with sub-100ms first-audio-byte latency
- Voice cloning costs vary wildly - from $0 (ElevenLabs Creator plan includes instant cloning) to $5,000+ for professional neural voice training (Azure, Microsoft Custom Neural Voice)
- Amazon Polly Standard remains the cheapest option for high-volume narration at $4.00/1M characters (tied with Google Standard), but it is a fixed voice catalog with no cloning
The Bottom Line
TTS pricing is messier than transcription pricing. Every vendor normalizes differently - some charge per character, some per 1,000 characters, some per minute of audio produced, some by monthly subscription with a soft character cap. I've converted everything to per-1M characters here, using provider-published character-to-audio ratios (typically 5 characters per word, 150 words per minute of output audio) where needed.
The real complexity is not the base rate - it's what falls outside it. Voice cloning, custom neural voice training, emotion markup (SSML prosody controls, Hume's expression tags), and streaming mode each carry separate pricing at most providers. The gap between a "simple TTS" quote and your actual bill can be significant.
One dimension this table can't fully capture: quality. $15/1M characters from OpenAI tts-1 sounds noticeably different from $66/1M characters from ElevenLabs Turbo v2.5. The use case determines whether that difference matters. For internal tooling and low-stakes narration, OpenAI and Azure are fine. For audiobooks, character-voiced games, or brand voice work, the quality gap justifies the premium. See the AI voice and speech leaderboard for quality benchmarks across models.
For transcription (speech-to-text) pricing, see Speech-to-Text API Pricing Compared. For LLM cost comparisons, see LLM API Pricing Comparison.
Main Pricing Table
Prices normalized to per-1M characters (input text). Conversion notes below the table. Verified April 19, 2026.
| Provider | Model / Tier | Per 1M Chars | Per Hour Audio | Streaming | Voice Cloning | Free Tier | Languages |
|---|---|---|---|---|---|---|---|
| Amazon Polly | Standard voices | $4.00 | ~$0.18 | Yes | No | 5M chars/mo (1yr) | 60+ |
| Amazon Polly | Neural voices | $16.00 | ~$0.72 | Yes | No | 1M chars/mo (1yr) | 30+ |
| Amazon Polly | Long-form voices | $100.00 | ~$4.50 | No | No | None | English |
| Amazon Polly | Generative voices | $30.00 | ~$1.35 | No | No | None | English |
| Azure Speech | Neural Standard | $14.11 | ~$0.63 | Yes | No | 500K chars/mo | 140+ |
| Azure Speech | Neural HD | $91.75 | ~$4.13 | Yes | No | None | English (select) |
| Azure Speech | Custom Neural Voice | $24.00 | ~$1.08 | Yes | Yes (trained) | None | 60+ |
| OpenAI | tts-1 | $15.00 | ~$0.68 | Yes | No | $5 trial | 57 |
| OpenAI | tts-1-hd | $30.00 | ~$1.35 | No | No | $5 trial | 57 |
| OpenAI | gpt-4o-mini-tts | $15.00 | ~$0.68 | Yes | No | $5 trial | 57 |
| OpenAI | gpt-4o-voice | Contact sales | - | Yes | No | No | 57 |
| Google Cloud TTS | Standard voices | $4.00 | ~$0.18 | No | No | 4M chars/mo | 50+ |
| Google Cloud TTS | WaveNet voices | $16.00 | ~$0.72 | No | No | 1M chars/mo | 30+ |
| Google Cloud TTS | Neural2 voices | $16.00 | ~$0.72 | No | No | 1M chars/mo | 30+ |
| Google Cloud TTS | Chirp 3 HD | $30.00 | ~$1.35 | No | No | 1M chars/mo | 35+ |
| ElevenLabs | Creator plan | $66/1M* | ~$2.97 | Yes | Instant (3 voices) | 10K chars/mo | 32 |
| ElevenLabs | Pro plan | $66/1M* | ~$2.97 | Yes | Instant (30 voices) | - | 32 |
| ElevenLabs | Scale plan | $66/1M* | ~$2.97 | Yes | Instant + Pro voices | - | 32 |
| ElevenLabs | Enterprise | Contact sales | - | Yes | Professional | - | 32 |
| Cartesia | Sonic (pay-as-you-go) | $50.00 | ~$2.25 | Yes | Yes | $5 credit | 15+ |
| Deepgram | Aura (pay-as-you-go) | $15.00 | ~$0.68 | Yes | No | $200 credit | English |
| Play.ht | Creator plan | $33/1M* | ~$1.49 | Yes | Instant cloning | 12,500 chars/mo | 142 |
| Play.ht | Pro plan | $22/1M* | ~$0.99 | Yes | Instant + ultra | - | 142 |
| Play.ht | Enterprise | Contact sales | - | Yes | Custom | - | 142 |
| Resemble.ai | Pay-as-you-go | $0.006/sec | $21.60 | Yes | Yes ($0.006/sec) | $0.05 credit | 10+ |
| WellSaid Labs | Maker plan | ~$66/1M* | ~$2.97 | No | Via Studio | $0/mo trial | English |
| WellSaid Labs | Teams / Enterprise | Contact sales | - | Yes (API) | Custom | None | English |
| Speechify | Consumer plans | Subscription | N/A | N/A | No API | Free (reading) | 30+ |
| Speechify | API | Contact sales | - | Yes | Contact sales | No | 30+ |
| Hume AI / Octave | API | $30.00 | ~$1.35 | Yes | No (EVI voice) | $5 credit | English |
| Microsoft CNV | Custom Neural Voice | Contact sales | - | Yes | Yes (trained model) | None | 60+ |
| xTTS v2 / Coqui | Self-hosted | ~$0 API cost | Self-host infra | Yes | Yes (3-6 sec clip) | N/A (open source) | 17 |
*ElevenLabs, Play.ht, and WellSaid subscription tiers include a character/minute quota - per-character rate is derived by dividing monthly cost by included quota. ElevenLabs charges $0.30/1K characters ($300/1M) for overage.
Character-to-audio conversion note: at an average English speech rate (150 wpm, ~5 chars/word) one minute of audio is roughly 750 characters, so 1M characters produces approximately 22.2 hours of audio. Per-hour audio cost is the per-1M-character rate multiplied by 0.045 (45,000 characters per hour). Resemble.ai is the exception: it bills per second of audio, so its per-hour figure is exact rather than derived. Amazon Polly and Google TTS count spaces as characters; SSML tags are not counted at most providers.
Per-Provider Breakdown
OpenAI - tts-1, tts-1-hd, gpt-4o-mini-tts
OpenAI's TTS API is the simplest on this list. You send text, you get audio. No voice cloning, no expression markup beyond SSML, six preset voices (Alloy, Echo, Fable, Onyx, Nova, Shimmer). That simplicity is a feature for pipelines that just need reliable, clean narration.
tts-1 and gpt-4o-mini-tts are priced identically at $15.00/1M characters, and both support streaming. The difference: tts-1 uses an older, slightly more robotic voice architecture; gpt-4o-mini-tts uses the newer neural voice stack from the GPT-4o family. For new integrations, gpt-4o-mini-tts is the better default at the same price.
tts-1-hd doubles the rate to $30.00/1M characters for improved naturalness - fewer artifacts on consonants and more consistent prosody. It does not support streaming. For batch production work where latency is irrelevant, tts-1-hd's quality jump over tts-1 is audible and worth the cost difference. For real-time chat or voice agents, stream with gpt-4o-mini-tts.
gpt-4o-voice is a separate product for live voice conversations - it is part of OpenAI's Realtime API, priced differently (per audio token), and outside the scope of a straight TTS comparison.
Source: OpenAI API Pricing
Google Cloud Text-to-Speech - Standard, WaveNet, Neural2, Chirp 3 HD
Google's TTS product has four distinct tiers at three price points. Standard voices at $4.00/1M characters are the cheapest "cloud TTS" option available - these are the older concatenative synthesis voices, and the quality gap versus WaveNet is substantial. Unless you're on a very tight budget, do not use Standard voices for consumer-facing output.
WaveNet and Neural2 are both $16.00/1M characters. WaveNet uses Google's original deep generative audio model; Neural2 uses a newer approach that produces more natural prosody. In practical testing, Neural2 is noticeably better on punctuation handling and sentence-level rhythm. Google gives 1M characters free per month for both WaveNet and Neural2 - a useful ongoing allowance for development and low-volume production.
Chirp 3 HD at $30.00/1M characters targets the highest quality output Google offers on their TTS product. It supports 35+ languages and is positioned for audiobook and premium narration use cases. Audio output quality is on par with ElevenLabs at roughly half the per-character cost, though Chirp 3 HD is more limited on voice variety and has no voice cloning feature.
Google charges per character of input text, including spaces. SSML tags are not charged. Requests are rounded up to the nearest character.
Source: Google Cloud Text-to-Speech Pricing
Azure AI Speech - Neural Standard, Neural HD, Custom Neural Voice
Azure's TTS pricing is published per 1M characters. Neural Standard voices at $14.11/1M characters cover 500+ voices across 140+ languages - the broadest voice catalog in this comparison. The free tier (500K characters/month, no expiration) is the most practical ongoing free TTS allowance available from any cloud provider.
Neural HD voices at $91.75/1M characters are Azure's premium-tier narration voices. They cover a subset of languages (primarily English) and produce noticeably higher quality output on long-form reading tasks - less monotony, better stress patterns on complex sentences. At roughly 6.5x the Standard rate, the use case is specifically audiobook production or branded IVR where quality standards are strict.
Custom Neural Voice is Azure's voice cloning product. The per-character rate for synthesis from a custom-trained voice is $24.00/1M characters - between Standard and HD pricing. The training process itself requires a separate setup: a human voice actor reading a curated script, data upload, and a training job that costs separately. Microsoft requires an application and signed consent documentation from the voice talent before a Custom Neural Voice project can begin - this is not a self-serve click-to-clone product.
Source: Azure AI Speech - Microsoft Learn
Amazon Polly - Standard, Neural, Long-form, Generative
Amazon Polly has four pricing tiers corresponding roughly to four generations of synthesis technology. Standard voices at $4.00/1M characters use the older concatenative synthesis approach and are only justified for cost-sensitive high-volume pipelines where quality is not a primary requirement.
Neural voices at $16.00/1M characters use Amazon's Neural TTS architecture. Quality is solid and comparable to Google's WaveNet tier. Polly Neural covers 30+ languages with 60+ voice options. Streaming is supported, and Amazon Polly Speech Marks (metadata for synchronizing audio with text highlights) are available alongside it.
Long-form voices at $100.00/1M characters are Polly's premium narration product. These voices are trained specifically for hour-plus reading tasks - they maintain consistent quality over long passages without the prosodic drift common in shorter-trained models. English only.
Generative voices at $30.00/1M characters are Amazon's newest tier (2024 launch). Built on a large language model - similar in concept to ElevenLabs' approach - generative voices respond dynamically to context and punctuation. English only, no streaming support. For English narration without voice cloning requirements, Polly Generative at $30/1M is a reasonable competitor to OpenAI tts-1-hd.
The free tier gives 5M characters/month of Standard and 1M characters/month of Neural for the first 12 months. After year one, there is no free tier.
Source: Amazon Polly Pricing
ElevenLabs - Creator, Pro, Scale, Enterprise
ElevenLabs is the reference point for quality in this comparison. Their pricing model is subscription-first: you buy a monthly plan that includes a character quota, and your effective per-character rate depends on how fully you use that quota. The Creator plan ($22/month) includes 100K characters/month - effective rate $220/1M characters if you use exactly the quota. The Scale plan ($330/month) includes 500K characters/month - $660/1M characters quota rate. But the more relevant number is the overage rate: $0.30/1K characters ($300/1M) for characters beyond your plan quota.
For production API usage at scale, the actual rate you should plan around is the overage rate of $300/1M characters - unless you're buying a quota-sized plan that matches your volume. At 5M+ characters/month, ElevenLabs' pricing becomes significantly more expensive than any cloud competitor. The Enterprise plan negotiates custom rates, but those are contact-sales.
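The quota-versus-overage arithmetic above is easier to reason about in code. The following is a minimal sketch, not ElevenLabs' billing logic: the function names are hypothetical, and the plan prices, quotas, and the $0.30/1K overage rate are simply the figures quoted in this section.

```python
def effective_rate_per_million(monthly_price: float, included_chars: int) -> float:
    """Dollars per 1M characters if the plan quota is used exactly."""
    return monthly_price * 1_000_000 / included_chars


def monthly_cost(volume_chars: int, monthly_price: float, included_chars: int,
                 overage_per_1k: float = 0.30) -> float:
    """Plan price plus overage for every character beyond the quota."""
    overage_chars = max(0, volume_chars - included_chars)
    return monthly_price + overage_chars / 1000 * overage_per_1k


# Figures quoted above: Creator = $22 for 100K chars, Scale = $330 for 500K chars.
creator_rate = effective_rate_per_million(22, 100_000)   # 220.0
scale_rate = effective_rate_per_million(330, 500_000)    # 660.0

# 5M characters in one month on the Scale plan: 4.5M characters fall into overage.
bill = monthly_cost(5_000_000, 330, 500_000)
blended_rate = bill * 1_000_000 / 5_000_000

print(f"Creator at quota: ${creator_rate:.0f}/1M")
print(f"Scale at quota:   ${scale_rate:.0f}/1M")
print(f"Scale at 5M chars: ${bill:,.2f} total, ${blended_rate:.0f}/1M blended")
```

Running the same two functions against your own expected monthly volume shows quickly whether the plan quota or the overage rate dominates the bill.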
The key differentiators: voice cloning (instant cloning from a 30-second audio sample, included on Creator and above), 32-language support with high multilingual naturalness, and the Turbo v2.5 model that produces notably more expressive output than any cloud TTS alternative. ElevenLabs also publishes a Streaming API that supports chunk-based audio delivery with sub-300ms first-byte latency on Turbo v2.5.
Source: ElevenLabs Pricing
Play.ht - 2.0 Turbo, Creator, Pro
Play.ht's pricing follows a similar subscription model to ElevenLabs but with more aggressive per-character rates at the mid tiers. The Creator plan ($39/month) includes 50,000 words/month (roughly 250K characters), effective rate ~$156/1M characters at quota. The Pro plan ($99/month) includes 250,000 words (1.25M characters), effective ~$79/1M characters at quota.
The distinguishing feature is voice breadth: 142 languages with instant voice cloning available on the Creator plan from a 3-10 second audio sample. Play.ht Ultra voice cloning (their highest-fidelity product) requires the Pro plan and produces a more stable, consistent voice match than the instant clone.
The API is a separate product from the web studio. API access at scale requires the Pro plan or higher. For voice agents and real-time applications, Play.ht publishes a Streaming API with WebSocket support.
Source: Play.ht API Docs
Cartesia Sonic - Realtime TTS
Cartesia Sonic is the latency-first TTS option on this list. The Sonic model was specifically designed and benchmarked for sub-100ms time-to-first-audio-byte in streaming mode - Cartesia has published latency comparisons showing Sonic at 80-90ms versus 200-400ms for ElevenLabs streaming and comparable services.
Pricing is $50.00/1M characters on pay-as-you-go. Voice cloning is included - you can create a voice from a short sample via the API. Language support is currently limited to 15+ languages, with English as the primary optimization target. The free tier provides $5 in credit on signup.
Cartesia does not offer the voice variety or quality ceiling of ElevenLabs, but for voice agent applications - customer service bots, real-time conversation interfaces - where latency is the primary constraint, Sonic is the relevant comparison. At $50/1M characters it is also meaningfully cheaper than ElevenLabs for applications that need streaming at scale.
Source: Cartesia Pricing
Deepgram Aura - Streaming TTS
Deepgram's Aura is a streaming-focused TTS model that pairs naturally with their Nova-3 transcription product. At $15.00/1M characters, it matches OpenAI tts-1 in pricing. English only. No voice cloning. The differentiator is low-latency streaming and tight integration with Deepgram's transcription pipeline - useful if you are building a full voice conversation stack on a single provider.
The $200 signup credit covers approximately 13M characters of Aura synthesis - a generous amount for evaluating the product. Quality is clean and natural but below ElevenLabs and Cartesia on expressiveness. The right use case is a voice bot where you already use Deepgram for transcription and want a single-vendor stack.
Source: Deepgram Pricing
Resemble.ai
Resemble.ai prices by the second of audio produced rather than by input characters. At $0.006/second, that converts to $21.60/hour of audio ($0.006 * 3600 = $21.60). At the standard 150 wpm speech rate (roughly 45,000 characters per hour of audio), the per-character equivalent is roughly $480/1M characters - far above Amazon Polly Neural's $16.00/1M, so the per-second model should be weighed on its cloning features rather than on raw rate.
Voice cloning via the API uses the same per-second pricing. Creating a custom voice requires uploading recorded samples via the Resemble API or dashboard; it is self-serve without the consent documentation requirements that Microsoft enforces for Custom Neural Voice. The quality of Resemble's cloned voices is solid for narration use cases, though ElevenLabs and Play.ht produce more natural-sounding clones on accent-heavy source material.
Source: Resemble.ai Pricing
Hume AI - Octave TTS
Hume's TTS product (branded Octave) is the only one on this list specifically designed to synthesize emotionally expressive speech from text. Rather than using SSML prosody tags to tweak pitch and rate, Hume's model is prompted with natural language descriptions of the emotional context - "the character is nervous but trying to sound confident" - and the model adjusts vocal characteristics accordingly.
Pricing is $30.00/1M characters on pay-as-you-go, matching OpenAI tts-1-hd. English only. Streaming is supported. There is no voice cloning - you use Hume's built-in voice palette. The use case is game dialogue, interactive narrative, or any context where emotional authenticity of the voice output matters more than matching a specific human's voice.
Source: Hume AI
WellSaid Labs
WellSaid Labs targets the enterprise narration market - primarily corporate learning and development (L&D), marketing video production, and e-learning. Their voice avatars are trained in-house with professional voice talent under signed agreements, which is a meaningful differentiator for corporate clients who need legally vetted content.
The Maker plan ($49/month) targets individual creators and includes access to the web studio and API. Per-character pricing is not published directly - WellSaid's model is quota-based subscription. Teams and Enterprise plans require contact. There is no self-serve voice cloning; custom voices are created through a partnership process with WellSaid, not by uploading a voice sample.
Source: WellSaid Labs Pricing
Microsoft Custom Neural Voice
Microsoft Custom Neural Voice (CNV) is not a self-serve product. It requires an application, approval, and signed documentation from the voice talent providing the recording. This is a feature, not a bug - it is designed to prevent voice fraud and non-consensual cloning.
Once approved, CNV voice training is a separate priced service; synthesis from the trained voice is billed at Azure's Custom Neural Voice rate ($24.00/1M characters). For enterprise deployments where a specific public-facing brand voice is required and the voice actor's consent is properly documented, CNV is the most legally defensible cloning option available from any major provider.
Source: Azure Custom Neural Voice - Microsoft Learn
Speechify
Speechify operates primarily as a consumer productivity app. Its developer API is available but not publicly priced - contact sales required. The consumer apps (iOS, Android, Chrome extension, web) offer a free tier for basic voices and a Pro subscription ($139/year) that unlocks their AI voices and higher reading speeds. Not a relevant option for API integrations.
Source: Speechify Pricing
Coqui TTS (discontinued) and xTTS v2
Coqui AI - the company - shut down in January 2024. Their open-source TTS library (coqui-ai/TTS) has been archived on GitHub. The library code is released under the Mozilla Public License 2.0; the xTTS v2 model weights carry a separate, more restrictive license.
xTTS v2 is the most capable model from the Coqui project and remains the most widely used open-source multilingual TTS model in 2026. It supports 17 languages and can clone a voice from a 3-6 second audio sample with no training. The model is available from Hugging Face (coqui/XTTS-v2).
API cost for self-hosted xTTS v2 is effectively $0 at inference - you pay only for compute infrastructure. Running it on an A10G GPU (24GB VRAM) produces audio at roughly 3-4x real-time. An A10G via cloud compute costs approximately $0.75-$1.50/hour depending on provider, which at 3x real-time maps to roughly $0.25-$0.50 per hour of generated audio. At roughly 750 characters per minute of audio (about 22.2 audio hours per 1M characters), that is approximately $5.56-$11.11/1M characters - in the range of the cheapest cloud tiers and below any commercial provider that offers voice cloning.
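The per-audio-hour estimate is just the GPU's hourly price divided by the real-time factor. A minimal sketch, assuming the GPU is fully utilized and ignoring idle time, storage, and egress (the function name is hypothetical):

```python
def cost_per_audio_hour(gpu_hourly_usd: float, realtime_factor: float) -> float:
    """Compute cost of one hour of generated audio.

    realtime_factor is how many seconds of audio the GPU produces per
    wall-clock second (3.0 means 3x real-time).
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return gpu_hourly_usd / realtime_factor


# The A10G price range and 3x real-time figure quoted above.
low = cost_per_audio_hour(0.75, 3.0)    # 0.25
high = cost_per_audio_hour(1.50, 3.0)   # 0.5
print(f"${low:.2f} - ${high:.2f} per hour of generated audio")
```

In practice a self-hosted GPU rarely runs at full utilization, so dividing the result by your expected utilization (for example 0.3 for 30%) gives a more realistic figure.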
The caveats are real: no SLA, self-maintained infrastructure, model quality below ElevenLabs and Cartesia on naturalness, and the licensing (the Coqui Public Model License for the xTTS v2 weights) restricts commercial use in ways that require careful legal review.
Source: coqui/XTTS-v2 on Hugging Face, Coqui TTS GitHub
Hidden Costs
Voice Cloning Upsells and Tiers
Instant voice cloning (from a short audio clip) and professional voice cloning (trained neural voice) are very different products at very different price points:
| Provider | Instant Clone | Professional Clone | Notes |
|---|---|---|---|
| ElevenLabs | Included Creator+ | Enterprise pricing | Instant = 30-sec clip; Pro clone = dedicated voice training |
| Play.ht | Included Creator+ | Pro plan required | Ultra clone requires Pro |
| Cartesia | Included (PAYG) | N/A | API-based cloning, no training job |
| Resemble.ai | Included (PAYG) | Custom model pricing | Same per-second rate for clone voices |
| Azure | Not available | $24/1M chars (synthesis) | Training is a separate fee; consent docs required |
| Microsoft CNV | Not available | Contact sales | Application approval required |
| Amazon Polly | Not available | Not available | Fixed voice catalog only |
| Google TTS | Not available | Not available | Fixed voice catalog only |
| OpenAI | Not available | Not available | Fixed preset voices only |
For applications where voice consistency matters - a brand's IVR system, a recurring narrator in an audio product - the inability to clone a voice is a functional limit that rules out Amazon, Google (standard), and OpenAI TTS entirely.
SSML and Expression Markup
Standard Speech Synthesis Markup Language (SSML) tags - controlling pitch, rate, volume, pauses - are supported by Azure, Google, Amazon, and OpenAI at no extra cost. ElevenLabs uses their own voice settings API rather than SSML.
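For readers who have not written SSML before, here is a minimal sketch of building a prosody-controlled snippet in Python. It uses only the standard `speak`, `prosody`, and `break` elements; the helper name is hypothetical, and individual providers may require extra wrapper attributes or a voice element, so check each provider's SSML reference before sending this as-is.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_prosody(text: str, rate: str = "medium", pitch: str = "medium",
                 pause_ms: int = 0) -> str:
    """Wrap plain text in <speak>/<prosody>, optionally followed by a pause."""
    # escape() protects characters such as & and < that would break the XML.
    body = (f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
            f"{escape(text)}</prosody>")
    if pause_ms > 0:
        body += f'<break time="{int(pause_ms)}ms"/>'
    return f"<speak>{body}</speak>"


print(ssml_prosody("Terms & conditions apply.", rate="slow", pause_ms=500))
```

Escaping matters for billing as well as correctness: a stray `<` in the input can cause the provider to reject the request or misread text as markup.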
The premium expression layer is Hume AI's Octave, which accepts natural language emotional direction rather than numeric SSML parameters. At $30/1M characters, it is priced at OpenAI tts-1-hd rates. If your application needs emotional expressiveness and you cannot achieve it through SSML tuning, Hume's approach is architecturally different from any other provider on this list.
Long-Form vs. Standard Pricing
Amazon Polly's Long-form voices at $100/1M characters cost 6.25x the Neural rate. These are specifically optimized for passages longer than a few sentences - audiobooks, lecture recordings, long-form podcast production. Using Neural voices for hour-long content degrades quality noticeably at the model level; Long-form voices maintain quality over extended content. If your use case is audiobooks, the 6.25x cost uplift versus Neural is worth evaluating against the quality difference.
Custom Voice Training Fees
Beyond per-character synthesis costs, several providers charge separately for training a custom voice model:
- Microsoft Custom Neural Voice: Pricing is not published - enterprise project scope required. Typically five figures for a full professional voice
- Azure Custom Neural Voice: Training cost is separate from the $24/1M synthesis rate. Contact sales
- WellSaid Labs: Custom voice avatars through a partnership process - not a self-serve fee structure
- ElevenLabs Professional Voice Clone: Professional grade clone (separate from instant clone) is available at Enterprise tier pricing
Self-hosting xTTS v2 has no training fee for instant voice cloning - the model can clone from a short sample at inference time with no fine-tuning required.
Monthly Plan Character Counting Quirks
- Google counts characters including spaces; SSML tags are excluded
- Amazon Polly counts billable characters as text characters, not SSML characters; requests under 100 characters are billed as 100 characters minimum
- Azure counts each Chinese, Japanese, or Korean character as two characters; a 500K-character limit effectively means 250K CJK characters
- ElevenLabs counts by character of input text; their quota meters in the dashboard update in real time
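The counting rules in the list above can be approximated before you send a request. This is a rough estimator under stated assumptions, not any provider's metering code: the provider keys are hypothetical labels, the SSML stripping is a simple regex, and the CJK ranges cover only the most common blocks.

```python
import re

SSML_TAG = re.compile(r"<[^>]+>")
# Hiragana/Katakana, CJK Unified Ideographs, Hangul syllables (approximation).
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")


def billable_chars(text: str, provider: str) -> int:
    """Estimate billable characters using the rules described above."""
    plain = SSML_TAG.sub("", text)   # SSML tags are not billed
    count = len(plain)               # spaces are counted
    if provider == "polly":
        return max(count, 100)       # 100-character minimum per request
    if provider == "azure":
        # Each CJK character counts as two characters.
        return count + len(CJK_CHAR.findall(plain))
    return count                     # google, elevenlabs: plain character count


sample = "<speak>Hello world</speak>"
for name in ("google", "polly", "azure"):
    print(name, billable_chars(sample, name))
```

Summing this estimate over a day's requests gives a quick sanity check against the provider's usage dashboard.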
Free Tier Comparison
| Provider | Free Offering | Amount | Expiration | Notes |
|---|---|---|---|---|
| Azure Speech | Neural Standard voices | 500K chars/mo | None (ongoing) | Best ongoing free tier |
| Google TTS | WaveNet / Neural2 | 1M chars/mo | None (ongoing) | Ongoing monthly |
| Google TTS | Chirp 3 HD | 1M chars/mo | None (ongoing) | Ongoing |
| Google TTS | Standard voices | 4M chars/mo | None (ongoing) | Ongoing - lower quality |
| Amazon Polly | Standard (1yr free) | 5M chars/mo | 12 months | AWS Free Tier, year one only |
| Amazon Polly | Neural (1yr free) | 1M chars/mo | 12 months | AWS Free Tier, year one only |
| ElevenLabs | Free plan | 10K chars/mo | None | Ongoing free plan |
| Play.ht | Free tier | 12,500 chars/mo | None | Ongoing |
| Deepgram | Signup credit | $200 (~13M chars) | Not stated | One-time |
| Cartesia | Signup credit | $5 (~100K chars) | Not stated | One-time |
| Hume AI | Signup credit | $5 (~167K chars) | Not stated | One-time |
| OpenAI | ~$5 trial credit | ~333K chars (tts-1) | 3 months | Shared across all APIs |
Azure's 500K characters/month ongoing free tier for Neural Standard voices is the best sustained free TTS allowance in the comparison. At average English speech rate (roughly 750 characters per minute), 500K characters produces roughly 11 hours of audio per month - enough for regular development testing and low-volume production use cases without any cost.
Google's ongoing free tiers for WaveNet/Neural2 (1M chars/month) and Chirp 3 HD (1M chars/month) are also strong. Combined, that's a meaningful monthly allowance for testing across quality tiers.
ElevenLabs' free plan at 10K characters/month is intentionally limited - roughly 13 minutes of audio. It is enough to test voice quality but not sufficient for meaningful production use.
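The character estimates shown next to the one-time signup credits in the table come from dividing the credit by the per-1M rate. A minimal sketch of that conversion, using the rates quoted in this article (the function name is hypothetical):

```python
def credit_to_chars(credit_usd: float, price_per_million: float) -> int:
    """Characters of synthesis a dollar credit buys at a given per-1M rate."""
    return int(credit_usd * 1_000_000 / price_per_million)


credits = {
    "Deepgram Aura ($200 at $15/1M)": credit_to_chars(200, 15),
    "Cartesia Sonic ($5 at $50/1M)": credit_to_chars(5, 50),
    "Hume Octave ($5 at $30/1M)": credit_to_chars(5, 30),
    "OpenAI tts-1 ($5 at $15/1M)": credit_to_chars(5, 15),
}
for label, chars in credits.items():
    print(f"{label}: ~{chars:,} characters")
```

Because these credits are shared with other products at some providers, treat the result as an upper bound on what is available for TTS.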
Latency Leaderboard - Time to First Audio Byte
For voice agent and real-time conversation use cases, end-to-end latency (text-in to audio-out) is often more important than price. Provider-reported or widely-cited benchmarks, English, streaming mode, cloud region US-East, April 2026:
| Provider | Model | Reported TTFAB | Streaming Support | Notes |
|---|---|---|---|---|
| Cartesia | Sonic | ~80-90ms | Yes | Purpose-built for realtime |
| Deepgram | Aura | ~120-150ms | Yes | Low-latency streaming |
| ElevenLabs | Turbo v2.5 | ~200-300ms | Yes | Quality + speed balance |
| OpenAI | gpt-4o-mini-tts | ~250-350ms | Yes | API streaming |
| Azure | Neural Standard | ~200-400ms | Yes | Varies by region |
| Play.ht | 2.0 Turbo | ~300-500ms | Yes | WebSocket streaming |
| OpenAI | tts-1 | ~300-500ms | Yes | Older architecture |
| Google TTS | Neural2 / Chirp 3 | ~400-600ms | No | Request-response only |
| Amazon Polly | Neural | ~300-500ms | Yes | Streaming; Speech Marks for text sync |
Cartesia's sub-100ms TTFAB is the clearest advantage for voice agent pipelines. The trade-off (narrower voice cloning breadth and language coverage) is acceptable for most English-language agent applications. For multilingual voice agents, ElevenLabs Turbo v2.5 at 200-300ms is the practical balance between latency and capability.
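Published TTFAB figures depend on region and network path, so it is worth measuring from your own environment. The sketch below times how long a streaming source takes to deliver its first non-empty chunk. It is provider-agnostic by design: `open_stream` stands in for whatever call returns an iterator of audio bytes in your SDK, and `fake_tts_stream` is a hypothetical stand-in used only to make the example runnable.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio_byte(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from issuing the request until the first non-empty chunk arrives."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


def fake_tts_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(delay_s)      # simulated model and network latency
    yield b"\x00" * 320      # first audio chunk
    yield b"\x00" * 320


samples = [time_to_first_audio_byte(fake_tts_stream) for _ in range(5)]
print(f"median TTFAB: {sorted(samples)[len(samples) // 2] * 1000:.0f} ms")
```

Taking the median over several runs, rather than a single measurement, smooths out connection setup and cold-start effects.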
Voice Cloning Ethics and Consent
Voice cloning has become a material legal and ethical issue in 2025-2026. Several state and national laws now regulate non-consensual voice synthesis.
Consent requirements by provider:
- Azure Custom Neural Voice / Microsoft CNV: Requires written consent from the voice talent and submission to Microsoft's application review. The most rigorous consent enforcement in the managed API space.
- ElevenLabs: Requires users to agree not to clone voices without consent. Their platform includes voice verification for uploaded profiles and has implemented actor voice rights protections. Instant clone does not require external verification - the responsibility sits with the user.
- Play.ht: Terms of service prohibit cloning without consent. No technical enforcement layer for self-serve instant clones.
- Cartesia, Resemble.ai, xTTS v2: Similar terms-only enforcement; technical controls are limited.
Most providers' enforcement is terms-based, not technical. The legal exposure from cloning a voice without consent is real and growing. The EU AI Act and several US state statutes (California, Texas, Tennessee's ELVIS Act) specifically address synthetic voice use for commercial purposes.
If you are building a product that uses voice cloning on user-submitted samples or public-figure voices, get legal review before going to production. This is not hypothetical risk.
At $300/1M characters for ElevenLabs overages versus $14/1M characters for Azure Neural Standard, the quality premium is real - and 21x is a number worth knowing before you build your pipeline.
Price History Timeline
Apr 2026 - OpenAI published gpt-4o-mini-tts as a standalone TTS endpoint at $15/1M characters. It uses the GPT-4o voice architecture and supports streaming from a dedicated API endpoint (distinct from the Realtime API).
Feb 2026 - Cartesia released Sonic-2, improving multilingual support from 8 to 15+ languages while holding pricing flat at $50/1M characters. TTFAB benchmarks showed consistent sub-100ms on US-East cloud regions.
Jan 2026 - Google Cloud TTS added Chirp 3 HD voices at $30/1M characters, nearly double the $16/1M price of the Neural2 tier. 35+ language support with notably more natural prosody on long-form content.
Nov 2025 - ElevenLabs launched Turbo v2.5. Pricing structure remained unchanged but quality benchmarks showed clear improvement on accent accuracy and emotional expressiveness. Streaming latency dropped to ~200-300ms from ~400ms on v2.
Sep 2025 - Amazon Polly launched Generative voices at $30/1M characters. LLM-based synthesis approach, English only, no streaming. Positioned against OpenAI tts-1-hd.
Jul 2025 - Hume AI launched Octave TTS API. First commercially available TTS model accepting natural language emotional prompts rather than SSML prosody parameters. Launched at $30/1M characters.
May 2025 - Deepgram launched Aura at $15/1M characters for their streaming TTS product. Priced identically to OpenAI tts-1, targeting developers already using Deepgram Nova for transcription in voice conversation pipelines.
Mar 2025 - Azure Neural HD voices launched at $91.75/1M characters. Long-form narration quality comparable to ElevenLabs but at 30% of the ElevenLabs overage rate.
Jan 2025 - OpenAI reduced tts-1 pricing from $30/1M to $15/1M characters, and tts-1-hd from $60/1M to $30/1M. Both cuts were 50%. The pricing revision put OpenAI's TTS squarely below ElevenLabs for cost without the voice cloning features.
Oct 2024 - Cartesia Sonic launched at $50/1M characters, positioning specifically on latency rather than voice quality. First provider to publish and market sub-100ms TTFAB as a primary metric.
Jul 2024 - ElevenLabs added their Turbo v2 model with the same pricing but improved streaming latency. Introduced the Scale plan at $330/month for higher-volume users.
The pricing trajectory mirrors what happened in transcription a year earlier: OpenAI's price cuts force the market down, and the differentiated providers (ElevenLabs, Cartesia) compete on quality and features rather than cost. Azure's ongoing free tier has remained unchanged while adding premium voice tiers above it - a textbook land-and-expand pricing strategy.
FAQ
What is the cheapest text-to-speech API?
For standard cloud TTS, Amazon Polly Standard and Google TTS Standard both run $4.00/1M characters - cheapest by raw price but lower quality. For a practical starting point with good quality, OpenAI tts-1 and Azure Neural Standard are both around $14-15/1M characters. Azure Neural Standard also has a free tier of 500K characters/month with no expiration.
Which TTS API has the best voice quality?
ElevenLabs (Turbo v2.5) is the quality reference for commercial managed TTS in 2026. Cartesia Sonic is close behind for streaming use cases. Azure Neural HD is a credible option for long-form narration at lower per-character cost than ElevenLabs. See the AI voice and speech leaderboard for detailed quality rankings.
How much does voice cloning cost?
Instant voice cloning (from a short audio sample) is included in ElevenLabs Creator plans and above, Play.ht Creator plans and above, and Cartesia at no extra cost. The synthesis rate after cloning is the same as standard voices at those providers. Professional voice training - where a model is trained specifically on a voice actor's data - costs significantly more and typically requires a custom enterprise agreement (Azure, Microsoft CNV, WellSaid Labs).
Which TTS API is best for real-time voice agents?
Cartesia Sonic at ~80-90ms time-to-first-audio-byte is the current latency leader for streaming TTS. Deepgram Aura at ~120-150ms is the second option, particularly if you are already using Deepgram Nova for transcription. ElevenLabs Turbo v2.5 at ~200-300ms is the choice if voice quality or multilingual support matters more than minimum latency.
What is the difference between tts-1 and tts-1-hd from OpenAI?
tts-1 costs $15/1M characters and supports streaming. tts-1-hd costs $30/1M characters and does not support streaming. The HD model produces cleaner consonants and more consistent prosody, particularly noticeable on longer passages. For batch narration, tts-1-hd is worth the 2x cost. For real-time applications, tts-1 or gpt-4o-mini-tts (same price, supports streaming, newer architecture) are the appropriate choices.
Can I use AI TTS commercially?
All major commercial managed API providers permit commercial use on paid plans. ElevenLabs, OpenAI, Azure, Google, Amazon, Cartesia, and Deepgram explicitly allow commercial use in their standard API terms. xTTS v2 (Coqui) is the exception: its weights are released under the Coqui Public Model License, which restricts commercial use, so get legal review before shipping it in a commercial product. Always verify the specific plan terms, as some entry-tier subscription plans may carry restrictions.
Is there a free text-to-speech API?
Azure Neural Standard voices offer 500K characters/month ongoing for free - the most practical ongoing free TTS allowance available. Google TTS gives 1M characters/month free for WaveNet, Neural2, and Chirp 3 HD voices on an ongoing basis. Amazon Polly's free tier (5M Standard / 1M Neural characters per month) applies only to the first 12 months. ElevenLabs' free plan provides 10K characters/month with no expiration.
How does TTS pricing compare to transcription pricing?
Per minute of audio, TTS API pricing is generally somewhat higher than transcription pricing. At 150 wpm speech rate (roughly 750 characters per minute), OpenAI tts-1 at $15/1M chars costs $0.01125 per minute of audio produced. OpenAI Whisper-1 for transcription charges $0.006/min for audio input. Audio generation thus costs roughly 1.9x as much per minute as transcription at these rates. For more detail, see Speech-to-Text API Pricing Compared.
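The per-minute comparison can be reproduced with a few lines of Python. This is a sketch of the conversion under the same assumptions used throughout this article (150 wpm, about 5 characters per word); the function names are hypothetical, and real speech rates vary by voice and language.

```python
def chars_per_minute(wpm: int = 150, chars_per_word: int = 5) -> int:
    """Approximate input characters consumed per minute of output audio."""
    return wpm * chars_per_word


def tts_cost_per_minute(price_per_million: float, wpm: int = 150,
                        chars_per_word: int = 5) -> float:
    """Dollars per minute of generated audio for a per-1M-character rate."""
    return price_per_million * chars_per_minute(wpm, chars_per_word) / 1_000_000


tts_per_min = tts_cost_per_minute(15.0)   # OpenAI tts-1 rate quoted above
stt_per_min = 0.006                       # Whisper-1 rate quoted above

print(f"TTS: ${tts_per_min:.5f}/min, ${tts_per_min * 60:.3f}/hour")
print(f"TTS costs {tts_per_min / stt_per_min:.2f}x transcription per minute")
```

Changing `wpm` or `chars_per_word` shows how sensitive the per-minute figure is to the speech-rate assumption, which is the main source of disagreement between published per-hour estimates.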
Sources:
- OpenAI API Pricing
- Google Cloud Text-to-Speech Pricing
- Amazon Polly Pricing
- Azure AI Speech - Text to Speech
- Azure Custom Neural Voice - Microsoft Learn
- ElevenLabs Pricing
- Cartesia Pricing
- Deepgram Pricing
- Play.ht API Docs
- Resemble.ai Pricing
- WellSaid Labs Pricing
- Speechify Pricing
- Hume AI
- coqui/XTTS-v2 on Hugging Face
- Coqui TTS GitHub (archived)
Also see: LLM API Pricing Comparison, Best AI Transcription Tools 2026, AI Voice and Speech Leaderboard.
✓ Last verified April 19, 2026
