Speech-to-Text API Pricing Compared - 2026

TL;DR

Cheapest per-minute: Groq Whisper-Large-V3-Turbo at ~$0.0006/min via inference pricing ($0.60/1,000 min)
Best value with good accuracy: Deepgram Nova-3 batch at $0.0036/min ($3.60/1,000 min)
OpenAI Whisper-1 at $0.006/min ($6.00/1,000 min) remains the default-choice entry point for most developers
ElevenLabs Scribe launched with speaker diarization included at $0.004/min - no upsell required
Human transcription (Rev.ai) costs $1.50/min - ~2,500x the cheapest API option; only justified for legal/medical/broadcast

The Bottom Line

Transcription APIs have quietly become one of the best value-for-performance categories in AI. A thousand minutes of audio that would have cost $25-$30 in 2022 now costs under $4 from Deepgram or AssemblyAI in batch mode, and under $1 from Groq if your pipeline can tolerate inference-style throughput constraints.

The catch: per-minute pricing hides additive costs that can triple your bill. Diarization (who spoke when), PII redaction, sentiment analysis, and custom vocabulary are sold as optional extras by most providers. ElevenLabs Scribe is a notable exception - speaker diarization is included at its flat rate. Factor those into your math before committing to a vendor.

For a look at what these models actually get right and wrong on accuracy, see the audio understanding benchmarks leaderboard. If you need a recommendation for a specific use case rather than pricing analysis, best AI transcription tools 2026 has the breakdown.

Main Pricing Table

All prices converted to per-minute rates in USD. Where vendors charge per second or per character, I've normalized to per-minute. Prices verified April 19, 2026.

Provider	Model	Per-Min	Per 1,000 Min	Streaming	Batch Discount	Languages	Free Tier
Groq	Whisper-Large-V3-Turbo	~$0.0006	~$0.60	No	No	100+	Yes
Together AI	Whisper-Large-V3	~$0.001	~$1.00	No	No	100+	Credits
Fireworks AI	Whisper-Large-V3	~$0.001	~$1.00	No	No	100+	Credits
Deepgram	Nova-3 (batch)	$0.0036	$3.60	No	Built-in	45+	$200 credit
AssemblyAI	Universal (batch)	$0.0037	$3.70	No	Built-in	100+	$50/mo
ElevenLabs	Scribe	$0.004	$4.00	No	No	99	Yes
Deepgram	Nova-3 (streaming)	$0.0056	$5.60	Yes	-	45+	$200 credit
OpenAI	Whisper-1	$0.006	$6.00	No	No	57	$5 trial
AssemblyAI	Universal-Streaming	$0.0074	$7.40	Yes	-	100+	$50/mo
AssemblyAI	Slam-1 (batch)	$0.01	$10.00	No	No	English	$50/mo
OpenAI	gpt-4o-mini-transcribe	$0.003	$3.00	Yes	No	57	$5 trial
OpenAI	gpt-4o-transcribe	$0.006	$6.00	Yes	No	57	$5 trial
Google	Chirp 2	$0.0048	$4.80	Yes	No	100+	$300 credit
Google	Chirp 3 (HD)	$0.016	$16.00	Yes	No	35+	$300 credit
AWS	Transcribe (standard)	$0.024	$24.00	Yes	Volume	100+	60 min/mo
AWS	Transcribe Medical	$0.075	$75.00	Yes	No	English	No
AWS	Call Analytics	$0.044	$44.00	Yes	No	English	No
Azure	Speech (Standard)	$0.0167	$16.70	Yes	No	100+	5 hrs/mo
Azure	Speech (Custom)	$0.0275	$27.50	Yes	No	60+	No
Speechmatics	Enhanced	$0.0082	$8.20	Yes	Volume	50+	Contact
Rev.ai	Machine	$0.02	$20.00	Yes	Volume	36	No
Rev.ai	Human	$1.50	$1,500.00	No	Volume	Select	No
Replicate	Whisper-Large-V3	~$0.0036	~$3.60	No	No	100+	No

Note on the ~ prices: Groq Whisper pricing is derived from their published inference rate ($0.10/hr GPU time equivalent). Actual cost may vary slightly based on audio duration vs. processing time. See the Groq section below.

Per-Provider Breakdown

OpenAI - Whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe

OpenAI publishes three transcription endpoints. Whisper-1 at $0.006/min ($6.00/1,000 min) is the legacy model - battle-tested, compatible with most SDKs, and the choice for teams that want boring reliability. The newer gpt-4o-transcribe family adds streaming output and better accuracy on accents and technical vocabulary.

gpt-4o-mini-transcribe at $0.003/min ($3.00/1,000 min) is the more interesting option. It undercuts Whisper-1 by 50% with comparable accuracy on clean audio. For noisy environments or heavy accents, gpt-4o-transcribe at $0.006/min holds up better. Both models support 57 languages.

No batch discount. No diarization add-on at the API level. If you need speaker labels, you'll post-process or switch providers. Requests accept audio files up to 25MB; for longer recordings, you split the audio into chunks on your side before uploading.
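To make the 25MB limit concrete, here is a minimal sketch that estimates how long a chunk can be and how many chunks a recording needs. It assumes constant-bitrate audio, ignores container overhead, and treats 25MB as 25 * 1024 * 1024 bytes; the function names are hypothetical and not part of any SDK.

```python
import math

# Assumption: the 25MB request limit is treated as 25 * 1024 * 1024 bytes.
MAX_REQUEST_BYTES = 25 * 1024 * 1024


def max_chunk_seconds(bitrate_kbps: float, limit_bytes: int = MAX_REQUEST_BYTES) -> float:
    """Longest chunk, in seconds, that fits under the size limit at a constant bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return limit_bytes / bytes_per_second


def chunks_needed(duration_seconds: float, bitrate_kbps: float) -> int:
    """Number of equal-length chunks needed so each one stays under the limit."""
    return max(1, math.ceil(duration_seconds / max_chunk_seconds(bitrate_kbps)))


# A 2-hour recording encoded at 128 kbps.
print(chunks_needed(2 * 60 * 60, 128))
```

At 128 kbps a chunk can run about 27 minutes, so a 2-hour recording needs 5 uploads. Lowering the bitrate before upload reduces the chunk count, at some risk to accuracy.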

Source: OpenAI API pricing

Deepgram - Nova-3, Nova-2

Deepgram has built its pricing model around a simple binary: pre-recorded (batch) at $0.0036/min, and streaming at $0.0056/min. Nova-3 is the current flagship. Nova-2 remains available at slightly lower cost, but most new projects should start with Nova-3, which cuts Word Error Rate by 25-30% on the benchmarks Deepgram publishes.

Nova-3 handles 45+ languages. The model is genuinely fast - often faster than realtime on batch inputs. Deepgram's Growth plan includes $200 in credits on signup, which buys roughly 55,000 minutes of batch transcription. Their API response includes confidence scores per word, which matters for downstream pipelines.

Diarization is an add-on ($0.0015/min extra). Redaction adds $0.002/min. Sentiment analysis stacks at $0.003/min. A fully loaded Deepgram request with all features enabled costs more than a plain Whisper-1 call.
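The stacking arithmetic is easy to check. The sketch below adds the add-on rates quoted above to the Nova-3 batch rate; the helper name `loaded_rate` is just for illustration.

```python
# Per-minute rates quoted above (USD).
NOVA3_BATCH = 0.0036
WHISPER_1 = 0.006
ADD_ONS = {"diarization": 0.0015, "redaction": 0.002, "sentiment": 0.003}


def loaded_rate(base: float, features: list[str]) -> float:
    """Base rate plus the per-minute charge for each enabled add-on."""
    return base + sum(ADD_ONS[name] for name in features)


full = loaded_rate(NOVA3_BATCH, ["diarization", "redaction", "sentiment"])
print(f"${full:.4f}/min, {full / NOVA3_BATCH:.1f}x the base rate")
```

With all three features enabled the rate comes to $0.0101/min, about 2.8x the base rate and well above Whisper-1's $0.006/min.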

Source: Deepgram pricing

AssemblyAI - Universal, Universal-Streaming, Slam-1

AssemblyAI runs three tiers. Universal (batch) at $0.0037/min is price-competitive with Deepgram. Universal-Streaming at $0.0074/min is exactly 2x the batch rate - a steeper real-time premium than Deepgram's roughly 1.6x.

Slam-1 is AssemblyAI's new speech-language model, released in early 2026. At $0.01/min, it costs 2.7x Universal. It's English-only and the accuracy improvement over Universal is meaningful on technical vocabulary and specialized domains - think medical, legal, or coding podcasts. For general purpose transcription, Universal is the right default.

The $50/month free tier is generous and doesn't expire on a 3-month clock like OpenAI's credits. It covers roughly 13,500 minutes of Universal batch transcription per month. Speaker diarization, PII redaction, and LeMUR (LLM over transcripts) each carry separate per-minute charges on top of the base rate.

Source: AssemblyAI pricing

Google Cloud Speech-to-Text - Chirp 2, Chirp 3

Google meters its transcription API by the second and rounds short requests up to a 15-second minimum. For practical purposes at any significant volume, the effective rate lands at $0.0048/min for Chirp 2 - their standard model - and $0.016/min for Chirp 3 HD.

Chirp 2 covers 100+ languages with solid multilingual performance. Chirp 3 HD targets higher-quality output with more natural prosody detection, though it's primarily pitched for voice interface use cases rather than bulk transcription workflows.

The $300 free credit on new Google Cloud accounts covers roughly 62,500 minutes of Chirp 2 - the most generous credit of any cloud provider by dollar amount, though it expires in 90 days. Beyond the credit, Google offers tiered discounts: 250,000+ minutes/month drops the Chirp 2 rate to $0.004/min; 1M+ minutes/month hits $0.003/min.

Source: Google Cloud Speech-to-Text pricing

Azure AI Speech - Standard, Custom, Fast

Microsoft publishes speech transcription pricing in per-hour rates. Standard recognition at $1.00/hour converts to $0.0167/min ($16.70/1,000 min). Custom speech - where you upload your own acoustic model data - runs $1.65/hour ($0.0275/min). Real-time recognition is priced the same as batch in their standard tier.
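The per-hour to per-minute normalization used throughout this comparison is a single division. A minimal sketch, with a hypothetical helper name:

```python
MINUTES_PER_HOUR = 60


def per_minute(rate_per_hour: float) -> float:
    """Convert a per-hour list price to a per-minute rate."""
    return rate_per_hour / MINUTES_PER_HOUR


# Hourly prices quoted above for Azure Standard and Custom speech.
for name, hourly in [("Standard", 1.00), ("Custom", 1.65)]:
    print(f"{name}: ${per_minute(hourly):.4f}/min")
```

Note that the unrounded Standard figure works out to $16.67 per 1,000 minutes; the $16.70 quoted above comes from rounding the per-minute rate first.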

Azure offers 5 free hours per month on the standard tier, and no credit card is required during the free trial period. That allowance does not expire, which makes it a useful ongoing option for low-volume use cases.

The Azure Speech SDK supports streaming, WebSocket connections, and phone-quality audio normalization. Custom acoustic models - relevant for enterprise deployments that need domain adaptation - require Custom Speech at minimum.

Source: Azure AI Speech - Microsoft Learn

Amazon Transcribe - Standard, Call Analytics, Medical

AWS Transcribe pricing is the most granular in this comparison. Standard transcription starts at $0.024/min for the first 250,000 minutes/month, dropping to $0.015/min at 1M+ minutes. That volume discount is aggressive - but you're starting from a significantly higher base than Deepgram or AssemblyAI.

Call Analytics at $0.044/min adds sentiment, interruption detection, call categorization, and compliance-focused outputs. It's the de facto choice for contact center analytics pipelines where you need structured output alongside the transcript. Transcribe Medical at $0.075/min is a HIPAA-eligible specialized model for clinical documentation - it understands medical terminology and produces structured outputs suitable for EHR integration.

The free tier is 60 minutes per month, which is enough for testing but irrelevant for production planning.

Source: Amazon Transcribe pricing

Groq Whisper-Large-V3-Turbo

Groq's transcription offering runs Whisper-Large-V3-Turbo on their LPU hardware. The published rate converts to roughly $0.0006/min when you work backwards from their inference pricing model. A 10-minute audio file typically processes in under 5 seconds, which is genuinely fast - faster than real time by a large margin.

The caveat: Groq's transcription API doesn't support true streaming output - you submit the full audio and get back the full transcript. Rate limits on the free tier are stricter than dedicated ASR providers. It's the right pick for batch workflows where cost matters and you can pre-chunk long audio. It's the wrong pick if you need live captions or SLA guarantees.

Source: Groq pricing

Together AI and Fireworks AI - Hosted Whisper

Both Together AI and Fireworks AI offer Whisper-Large-V3 as part of their inference-as-a-service platforms. Pricing is roughly $0.001/min, undercutting OpenAI's Whisper-1 by 6x. These are straightforward Whisper implementations with no proprietary accuracy enhancements.

Together AI includes these in their standard serverless pricing with credits available on signup. Fireworks AI's deployment is accessed via their OpenAI-compatible API, making migration from Whisper-1 a one-line change. Neither supports streaming or real-time use cases - batch/pre-recorded only.

Source: Together AI pricing, Fireworks AI pricing

ElevenLabs Scribe

ElevenLabs launched Scribe in April 2026, positioning it as a premium transcription product. At $0.004/min ($4.00/1,000 min), it's priced just above Deepgram Nova-3 batch and AssemblyAI Universal, and below Google Chirp 2. What distinguishes it: speaker diarization is included in the base rate. No add-on charge.

Scribe handles 99 languages with high accuracy, and ElevenLabs specifically calls out speaker identification quality as its differentiator. The free tier includes credit that covers basic experimentation. For podcasting, multi-speaker interviews, or any workflow where labeling "who said what" matters, Scribe's all-in pricing is worth running the numbers against Deepgram or AssemblyAI with diarization add-ons enabled.

Source: ElevenLabs pricing, ElevenLabs Scribe

Speechmatics

Speechmatics prices by submitted audio hour. Their Enhanced model works out to approximately $0.0082/min. Volume discounts kick in automatically above 50 hours/month, dropping toward $0.005/min at scale. They operate their own purpose-built ASR infrastructure rather than running general-purpose transformer architectures.

Speechmatics is notable for breadth: 50+ languages, with a claimed emphasis on regional accents and dialect variation. Pricing is all-inclusive rather than built on per-feature add-ons - diarization and punctuation are included. Enterprise pricing requires contact, but the self-serve tier has no commitment.

Source: Speechmatics pricing

Rev.ai

Rev.ai runs two products: machine transcription and human transcription. Machine transcription at $0.02/min ($20.00/1,000 min) is competitive with Azure but more expensive than Deepgram or AssemblyAI. It includes speaker diarization, punctuation, and real-time streaming via WebSocket.

Human transcription at $1.50/min is the reference rate for professional transcriptionists. That's $1,500 per 1,000 minutes - 2,500x the cheapest API option. The use cases that justify it are specific: legal depositions, court proceedings, broadcast captioning under compliance mandates, or anything where a verbatim record with verified speaker attribution is required.

Source: Rev.ai pricing

Replicate - Whisper Variants

Replicate hosts several Whisper variants billed by the second of compute used rather than by audio duration. Effective cost typically lands around $0.0036/min for standard Whisper-Large-V3, though it fluctuates based on audio complexity. The "incredibly-fast-whisper" variant from Vaibhav Srivastav runs on A100 hardware with batched inference and is typically 2-4x faster than vanilla Whisper-Large-V3.

Replicate's appeal is flexibility - you can also run proprietary models, fine-tunes, or specialized Whisper variants that don't have managed API counterparts. The trade-off is that billing by compute time makes cost estimation harder compared to Deepgram's clean per-minute model.

Source: Replicate Whisper, Incredibly Fast Whisper on Replicate

Hidden Costs

Diarization Upsells

Speaker diarization - identifying who spoke which words - is the most common add-on charge. Add it to your mental model before comparing headline rates:

Provider	Base Rate (per min)	+ Diarization	Effective Rate
Deepgram Nova-3 (batch)	$0.0036	$0.0015	$0.0051
AssemblyAI Universal	$0.0037	$0.0007	$0.0044
ElevenLabs Scribe	$0.004	Included	$0.004
Speechmatics Enhanced	$0.0082	Included	$0.0082
OpenAI Whisper-1	$0.006	Not available	-
Rev.ai Machine	$0.02	Included	$0.02

ElevenLabs Scribe, Speechmatics, and Rev.ai Machine all include diarization in the base price. OpenAI doesn't offer it at the API level at all.

PII Redaction

Personally Identifiable Information redaction - stripping phone numbers, SSNs, credit card numbers from transcripts - is available as an add-on from Deepgram ($0.002/min), AssemblyAI ($0.002/min), and AWS Transcribe (included in standard pricing). None of the inference-based providers (Groq, Together, Fireworks) offer this feature.

Call Analytics and Sentiment

AWS Call Analytics packages transcription with downstream analysis at $0.044/min. AssemblyAI's LeMUR feature - which runs an LLM over the completed transcript - charges separately based on LLM token consumption. Deepgram's sentiment add-on adds $0.003/min. Budget for these before comparing vendors on transcription-only rates.

Custom Vocabulary

Domain-specific terminology - product names, technical jargon, uncommon proper nouns - is handled through custom vocabulary or language model adaptation. Azure Custom Speech charges $0.0275/min (vs. $0.0167/min standard). AWS Transcribe charges a one-time custom vocabulary fee plus a per-minute uplift. Deepgram and AssemblyAI include keyword boosting at the API request level at no extra cost. For specialized domains, this can swing a comparison significantly.

Minimum Billing Durations

AWS Transcribe charges a 15-second minimum per request, and Google applies a 15-second minimum as well. Azure charges in 1-second increments. Deepgram and AssemblyAI charge for actual audio duration with no floor. For applications that transcribe short utterances - voice commands, meeting prompts, short clips - the 15-second minimums at AWS and Google inflate your effective per-minute rate substantially; Azure's 1-second rounding has a much smaller effect.
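To see how much a billing floor matters for short clips, the sketch below computes the effective per-minute rate when every request is billed for at least a minimum duration. It models only the floor, not increment rounding, and the function name is hypothetical.

```python
def effective_rate_per_minute(list_rate: float, clip_seconds: float,
                              minimum_seconds: float = 0.0) -> float:
    """Per-minute rate actually paid when each request is billed for at least minimum_seconds."""
    billed_seconds = max(clip_seconds, minimum_seconds)
    return list_rate * billed_seconds / clip_seconds


# A 3-second voice command at AWS's $0.024/min list rate with a 15-second minimum.
print(f"${effective_rate_per_minute(0.024, 3, minimum_seconds=15):.3f}/min")
```

The 3-second clip is billed as 15 seconds, so the effective rate is 5x the list rate: $0.12/min instead of $0.024/min. For clips longer than the minimum the floor has no effect.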

Volume Discount Cliffs

Several providers drop pricing at volume thresholds that look attractive but require large minimum commitments:

Google Chirp 2: $0.004/min above 250,000 min/month (vs. $0.0048 base)
AWS Transcribe: $0.015/min above 1,000,000 min/month (vs. $0.024 base)
Rev.ai: enterprise volume pricing requires contact sales
Speechmatics: automatic tiered discounts above 50 hours/month

At 10,000 minutes/month, only the Speechmatics threshold applies (50 hours is 3,000 minutes). At 100,000+ minutes/month, the math shifts materially.
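A rough way to check which tier a given volume lands in, using the Google Chirp 2 thresholds listed above. This sketch assumes the whole month's volume is billed at the rate of the tier it reaches; the source does not say whether tiers are flat or graduated, so treat that as an assumption. The names are illustrative.

```python
# (minimum monthly minutes, USD per minute) for Google Chirp 2, as quoted in this article.
CHIRP2_TIERS = [(0, 0.0048), (250_000, 0.004), (1_000_000, 0.003)]


def monthly_cost(minutes: int, tiers: list[tuple[int, float]]) -> float:
    """Monthly bill, assuming all minutes are charged at the highest tier reached."""
    rate = tiers[0][1]
    for threshold, tier_rate in sorted(tiers):
        if minutes >= threshold:
            rate = tier_rate
    return minutes * rate


for volume in (10_000, 100_000, 300_000, 1_000_000):
    print(f"{volume:>9,} min/mo -> ${monthly_cost(volume, CHIRP2_TIERS):,.2f}")
```

Under this assumption, 10,000 and 100,000 minutes both bill at the base rate; the first discount only appears at 250,000 minutes.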

Free Tier Comparison

Provider	Free Offering	Amount	Expiration	Notes
AssemblyAI	$50/month credit	~13,500 min/mo	None	Ongoing monthly credit
Google Cloud	$300 new account credit	~62,500 min	90 days	All GCP services
Azure	5 hrs/month	300 min/mo	None	Ongoing
AWS	60 min/month	60 min/mo	12 months	First year only
Deepgram	$200 signup credit	~55,000 min	Not stated	One-time
ElevenLabs	Credits on signup	~1,000 min est.	Not stated	Account credit
OpenAI	~$5 trial	~833 min (Whisper-1)	3 months	All OpenAI APIs
Groq	Free tier	Rate-limited	None	Shared queue
Together AI	Signup credits	Varies	Not stated	All models
Replicate	Billing on signup	-	None	No free tier
Rev.ai	None	0	-	Paid only

AssemblyAI's ongoing $50/month credit is the most practical for production experimentation. Google's $300 credit is the largest in dollar terms but runs against all GCP services and expires in 90 days. Azure's 5 free hours per month is the only ongoing free tier from a major cloud provider after the trial period ends.
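The minute estimates in the table come from dividing each credit by the provider's cheapest listed per-minute rate. A small sketch of that arithmetic, with an illustrative helper name:

```python
def minutes_covered(credit_usd: float, rate_per_minute: float) -> float:
    """Minutes of transcription a dollar credit buys at a given per-minute rate."""
    return credit_usd / rate_per_minute


# (credit in USD, per-minute rate) pairs taken from the tables in this article.
offers = {
    "AssemblyAI Universal (monthly)": (50, 0.0037),
    "Google Chirp 2": (300, 0.0048),
    "Deepgram Nova-3 batch": (200, 0.0036),
    "OpenAI Whisper-1": (5, 0.006),
}

for name, (credit, rate) in offers.items():
    print(f"{name}: ~{minutes_covered(credit, rate):,.0f} min")
```

Enabling add-ons such as diarization raises the effective rate and shrinks these figures accordingly.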

Volume Discounts Summary

At 1,000 minutes/month - a podcast or small meeting-heavy team - there are no meaningful volume discounts. At 100,000+ minutes/month, the tiers start to matter:

Provider	Base Rate	Volume Tier	Rate at Tier
Google Chirp 2	$0.0048/min	250K min/mo	$0.004/min
Google Chirp 2	$0.0048/min	1M min/mo	$0.003/min
AWS Transcribe	$0.024/min	250K min/mo	$0.022/min
AWS Transcribe	$0.024/min	1M min/mo	$0.015/min
Deepgram	Custom	100K min/mo+	Contact sales
Speechmatics	$0.0082/min	50 hrs/mo	Auto-discount
Rev.ai	$0.02/min	Enterprise	Contact sales

At very high volumes (millions of minutes), enterprise contracts dominate and published pricing becomes mostly irrelevant.

Human transcription at $1.50/min costs 2,500x the cheapest API. For legal depositions that gap is worth it. For internal meeting notes, it almost never is.

Accuracy vs. Price Trade-Off

Headline price comparisons ignore the fact that a transcript with 5% Word Error Rate requires significantly more downstream cleanup than one at 2% WER. The cheapest option is not always cheapest when you factor in correction time.
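One way to reason about this is to add an estimated correction cost to the API rate. The sketch below is illustrative only: the speaking rate (150 words per minute), the time to fix one error (5 seconds), and the editor's hourly cost ($30) are all assumed values, not measurements from any provider.

```python
def cost_per_audio_minute(api_rate: float, wer: float,
                          words_per_minute: float = 150,    # assumed speaking rate
                          seconds_per_fix: float = 5.0,     # assumed time to fix one error
                          editor_hourly_usd: float = 30.0   # assumed editor cost
                          ) -> float:
    """API rate plus the estimated cost of manually correcting transcription errors."""
    errors_per_minute = wer * words_per_minute
    correction_cost = errors_per_minute * seconds_per_fix / 3600 * editor_hourly_usd
    return api_rate + correction_cost


cheap_but_noisy = cost_per_audio_minute(api_rate=0.0006, wer=0.05)
pricier_but_clean = cost_per_audio_minute(api_rate=0.006, wer=0.02)
print(f"5% WER at $0.0006/min: ${cheap_but_noisy:.4f} per audio minute")
print(f"2% WER at $0.0060/min: ${pricier_but_clean:.4f} per audio minute")
```

Under these assumptions, correction labor dominates the API price by two orders of magnitude, so the lower-WER option wins whenever a human actually reviews the transcript. If nobody corrects the output, the correction term drops to zero and the headline rate is what matters.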

Based on publicly available benchmarks and independent evaluations:

Best WER on English (clean audio): gpt-4o-transcribe, AssemblyAI Slam-1, Deepgram Nova-3 - all roughly equivalent at top accuracy
Best WER on accented/noisy audio: Deepgram Nova-3 and AssemblyAI Universal hold up well; Whisper-1 degrades faster
Best multilingual: Google Chirp 2 (100+ languages), AssemblyAI Universal, OpenAI Whisper family
Specialized domains (medical, legal): AWS Transcribe Medical and AssemblyAI Slam-1 are purpose-built; generic models struggle on domain vocabulary

Whisper-1 at $0.006/min is not the best accuracy option - it's the most widely integrated option. Teams running Whisper-1 today who haven't re-evaluated in 12 months are likely leaving both accuracy and cost improvements on the table.

For detailed WER comparisons across these models, see the audio understanding benchmarks leaderboard.

Price History

Apr 2026 - ElevenLabs launched Scribe with diarization included at $0.004/min. It's the lowest-priced option in this comparison that bundles speaker identification without a per-minute upsell.
Mar 2026 - AssemblyAI released Slam-1 at $0.01/min, targeting specialized English-language domains. 2.7x the Universal batch rate.
Feb 2026 - OpenAI added gpt-4o-mini-transcribe at $0.003/min - 50% cheaper than Whisper-1 and the first OpenAI transcription model with streaming output support.
Jan 2026 - Deepgram launched Nova-3 with improved WER across all languages. Pricing held flat at $0.0036/min batch.
Nov 2025 - Google updated Chirp 2 to support 100+ languages with no pricing change. Chirp 3 HD launched at $0.016/min targeting voice interface use cases.
Sep 2025 - Groq added Whisper-Large-V3-Turbo to its inference platform, effectively making it the lowest-cost transcription option for batch workflows.
Jul 2025 - AssemblyAI reduced Universal batch pricing from $0.0065/min to $0.0037/min - a 43% cut. Competitive pressure from Deepgram drove the reduction.
Mar 2025 - AWS Transcribe added volume tier discounts for 1M+ minutes/month, reducing the rate to $0.015/min at that threshold.

The pattern: accuracy leaders (Deepgram, AssemblyAI) have been cutting prices 30-50% per year while the cloud giants (AWS, Azure) dropped less aggressively. The gap between commodity inference (Groq at $0.0006/min) and the managed ASR providers ($0.003-$0.006/min) reflects the value of features like diarization, custom vocabulary, streaming, and SLA guarantees - not just the underlying model quality.

FAQ

Which transcription API is cheapest per minute?

Groq Whisper-Large-V3-Turbo at approximately $0.0006/min is the cheapest available option. Together AI and Fireworks AI hosted Whisper-Large-V3 come in around $0.001/min. For managed ASR providers with real-time streaming and diarization support, Deepgram Nova-3 batch at $0.0036/min and AssemblyAI Universal at $0.0037/min lead the cost rankings.

What is the best speech-to-text API for production use?

It depends on your requirements. For cost-optimized batch transcription: Deepgram Nova-3. For a balance of accuracy, streaming, and features: AssemblyAI Universal. For multi-speaker workflows with diarization included: ElevenLabs Scribe. For medical or legal compliance: AWS Transcribe Medical. For multilingual at scale: Google Chirp 2.

Does OpenAI Whisper-1 have speaker diarization?

No. Whisper-1 via the OpenAI API returns transcribed text without speaker labels. The gpt-4o-transcribe family also does not offer speaker diarization. If you need diarization, use Deepgram, AssemblyAI, ElevenLabs Scribe, Speechmatics, or Rev.ai.

What is the difference between batch and streaming transcription pricing?

Where providers price the two modes separately, batch (pre-recorded audio submitted as a file) is typically 30-50% cheaper than streaming (real-time audio via WebSocket). Deepgram charges $0.0036/min batch vs. $0.0056/min streaming. AssemblyAI charges $0.0037/min batch vs. $0.0074/min streaming. If your use case doesn't require live captions or real-time output, batch saves money.
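The savings figures follow directly from the two rate pairs above. A short sketch, with an illustrative function name:

```python
def batch_savings(batch_rate: float, streaming_rate: float) -> float:
    """Fraction saved by using batch instead of streaming."""
    return 1 - batch_rate / streaming_rate


for provider, batch, streaming in [
    ("Deepgram Nova-3", 0.0036, 0.0056),
    ("AssemblyAI Universal", 0.0037, 0.0074),
]:
    print(f"{provider}: {batch_savings(batch, streaming):.1%} cheaper in batch")
```

That works out to roughly 36% for Deepgram and 50% for AssemblyAI.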

Can I get volume discounts on transcription APIs?

Yes, but meaningful discounts typically require 250,000+ minutes per month. Google Chirp 2 drops from $0.0048 to $0.004/min above 250K min/month. AWS Transcribe drops from $0.024 to $0.015/min above 1M min/month. Deepgram and Rev.ai offer custom enterprise pricing above certain volume thresholds.

How does human transcription cost compare to AI transcription?

Rev.ai's human transcription runs $1.50/min ($1,500/1,000 min). The cheapest AI option (Groq) is approximately $0.0006/min - a 2,500x difference. AI WER on clean English audio is now below 3%, which puts it in the range of non-expert human transcriptionists. Human transcription is typically justified only for legal proceedings, broadcast compliance captioning, or verbatim records where liability or chain of custody matters.

Are there free speech-to-text APIs?

AssemblyAI provides $50/month in ongoing free credit. Azure offers 5 free hours per month on the standard tier with no expiration. AWS provides 60 free minutes per month for the first 12 months. Google gives $300 in new account credit applicable to Cloud STT. Groq offers rate-limited free access to Whisper on their inference platform.

Also see: LLM API Pricing Comparison and Best AI Transcription Tools 2026.