MAI-Transcribe-1.5

Microsoft's second-generation speech-to-text model, with 2.4% WER, 43-language support, keyword biasing, and long-audio processing 5x faster than models of comparable accuracy.

MAI-Transcribe-1.5 is Microsoft's second speech-to-text model, announced at Build 2026 on June 2. It improves on MAI-Transcribe-1 - the April 2026 release that first put Microsoft's in-house transcription on the map - with a lower word error rate, nearly 4x faster long-audio processing, and a major language expansion from 25 to 43 languages. The model is available through Azure AI Foundry and Azure Speech as part of the LLM Speech API.

TL;DR

  • Ranked #3 on the Artificial Analysis AA-WER v2.0 leaderboard with 2.4% WER, down from 2.6% in MAI-Transcribe-1
  • Transcribes one hour of audio in under 15 seconds - 5x faster than models with comparable accuracy
  • Keyword biasing cuts WER by up to 30% on domain-specific vocabulary; 43 languages supported

The model sits third overall on the Artificial Analysis speech-to-text leaderboard, behind Alibaba's Fun-Realtime-ASR-preview (1.7% WER) and ElevenLabs Scribe v2 (2.2% WER). What sets MAI-Transcribe-1.5 apart from the models ahead of it is throughput: running at roughly 276x real-time, it processes audio at more than double the speed of the second-fastest top-10 accuracy model. For long-form batch workloads - call centers, meeting transcription, video captioning - that ratio matters as much as the WER headline.
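The 276x real-time figure and the "one hour in under 15 seconds" headline are the same measurement stated two ways. A minimal Python sketch of the conversion follows; the function name is illustrative, and the only inputs are the figures quoted above.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to transcribe audio at a given real-time factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


ONE_HOUR = 60 * 60  # seconds of audio

# 276x is the throughput quoted in this article.
print(f"{processing_seconds(ONE_HOUR, 276):.1f} s per hour of audio")
```

The calculation covers model processing only. Upload time and queueing are not quantified in this article, so end-to-end latency for a batch job will be higher.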

Microsoft is launching it across Copilot, Teams, GitHub, and Dynamics 365 Contact Center, which means it's not just an API product. It's the transcription engine that'll run underneath most of Microsoft's enterprise voice surface area.

Key Specifications

Specification | Details
Provider | Microsoft
Model Family | MAI
Parameters | Not disclosed
Context Window | Audio input (batch, up to 300 MB per file)
Supported Formats | WAV, MP3, FLAC
Input Price | $0.36/hr ($6.00/1,000 minutes)
Release Date | June 2, 2026
License | Proprietary (Azure)
Prior Version | MAI-Transcribe-1 (April 2, 2026)

The pricing is unchanged from MAI-Transcribe-1: $0.36/hr, billed in per-second increments. That works out to $6.00 per 1,000 minutes - the same as OpenAI Whisper-1. Budget-sensitive users who can accept a higher WER can compare it against Deepgram Nova-3 at $4.30/1,000 min (5.26% WER) on the transcription pricing comparison.
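The hourly and per-1,000-minute prices are the same rate in different units. The Python sketch below makes the conversion explicit. The rate comes from the article; rounding partial seconds up to the next whole second is an assumption about how per-second billing works, not something Microsoft's pricing states here.

```python
import math

RATE_PER_HOUR_USD = 0.36  # price quoted in this article


def transcription_cost_usd(audio_seconds: float,
                           rate_per_hour: float = RATE_PER_HOUR_USD) -> float:
    """Estimated cost of transcribing a clip billed per second."""
    # Assumption: a partial second is billed as a full second.
    billable_seconds = math.ceil(audio_seconds)
    return billable_seconds * rate_per_hour / 3600


print(f"${transcription_cost_usd(1000 * 60):.2f} per 1,000 minutes")
print(f"${transcription_cost_usd(90.4):.4f} for a 90.4-second clip")
```

Per-second billing matters most for short clips: a 90-second voicemail costs under a cent rather than being rounded up to a full minute or hour.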

[Image: a studio condenser microphone in a recording booth. Caption: Professional transcription use cases range from studio recording to call-center audio with variable background noise. Source: unsplash.com]

Benchmark Performance

The AA-WER v2.0 benchmark weights three datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), and Earnings22-Cleaned-AA (25%). The methodology favors real-world speech with varied accents and acoustic conditions over clean studio audio, which makes the results harder to game than older WER benchmarks. Details on the methodology are at Artificial Analysis.

Model | AA-WER | AA-AgentTalk | VoxPopuli | Earnings22
Fun-Realtime-ASR-preview | 1.7% | Not disclosed | Not disclosed | Not disclosed
ElevenLabs Scribe v2 | 2.2% | Not disclosed | Not disclosed | Not disclosed
MAI-Transcribe-1.5 | 2.4% | 2.0% | 1.6% | 4.0%
MAI-Transcribe-1 | 2.6% (prior) | - | - | -
Voxtral Small (Mistral) | 2.8% | Not disclosed | Not disclosed | Not disclosed

MAI-Transcribe-1.5 drops 0.2 percentage points from its predecessor on the AA-WER composite. The per-dataset breakdown shows strength in conversational agent-style audio (2.0% on AA-AgentTalk) and clean parliamentary speech (1.6% on VoxPopuli), with a harder time on spontaneous financial earnings calls (4.0% on Earnings22). That last number is typical across all STT models - spontaneous speech with domain jargon is where WER spikes.
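The per-dataset scores can be checked against the 2.4% headline. The Python sketch below assumes the composite is a plain weighted arithmetic mean of the three dataset WERs using the 50/25/25 weights. That is an assumption about the methodology, though the published numbers are consistent with it.

```python
# Weights as described for AA-WER v2.0 in this article.
WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}

# MAI-Transcribe-1.5 per-dataset WER (percent) from the table above.
SCORES = {
    "AA-AgentTalk": 2.0,
    "VoxPopuli-Cleaned-AA": 1.6,
    "Earnings22-Cleaned-AA": 4.0,
}


def composite_wer(scores: dict, weights: dict) -> float:
    """Weighted mean of per-dataset WER values (assumed aggregation rule)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * scores[name] for name in weights)


print(f"{composite_wer(SCORES, WEIGHTS):.1f}%")  # prints 2.4%
```

Because AA-AgentTalk carries half the weight, the 4.0% Earnings22 result contributes only one percentage point to the composite.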

On the FLEURS multilingual benchmark across 25 languages, Microsoft reports best-in-class accuracy. FLEURS WER improved from 3.9% (MAI-Transcribe-1) to 3.7% with 1.5. When keyword biasing is active, Microsoft reports that FLEURS WER drops by up to a further 30% in relative terms, which is the meaningful number for teams building specialized vertical applications.
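WER improvements are easy to misread because some are quoted in percentage points and others as a relative percentage. The Python sketch below separates the two, reading the 30% keyword-biasing figure as a relative reduction applied to the 3.7% baseline. That reading is an assumption; the resulting value is a best-case illustration, not a published score.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.10 means 10% lower)."""
    return (before - after) / before


def apply_relative_reduction(wer: float, reduction: float) -> float:
    """WER after cutting it by a relative fraction."""
    return wer * (1 - reduction)


# MAI-Transcribe-1 -> 1.5 on FLEURS: 0.2 points absolute.
print(f"{relative_reduction(3.9, 3.7):.1%} relative improvement")

# Best case with keyword biasing, if the 30% is relative to 3.7%.
print(f"{apply_relative_reduction(3.7, 0.30):.2f}% WER")
```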

Key Capabilities

Keyword Biasing

The phraseList API parameter lets you inject up to 200 domain-specific terms - brand names, product codes, people's names, technical vocabulary - before the model processes audio. MAI-Transcribe-1.5 doesn't hard-force these terms; it uses them as biasing context and applies matches when the phonetic probability is plausible. The result is a 30% WER reduction on the FLEURS benchmark for domain-heavy speech.
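Since the list is capped at 200 terms, it helps to clean and check it on the client before sending a request. The Python sketch below is one way to do that. The helper name is hypothetical, and the whitespace cleanup and case-insensitive de-duplication are assumptions about what is useful, not documented service behavior.

```python
MAX_PHRASES = 200  # limit stated in this article


def build_phrase_list(terms):
    """Return cleaned, de-duplicated terms, enforcing the 200-term cap."""
    seen = set()
    phrases = []
    for term in terms:
        cleaned = " ".join(term.split())  # trim and collapse whitespace
        key = cleaned.casefold()          # assumption: case-insensitive duplicates
        if cleaned and key not in seen:
            seen.add(key)
            phrases.append(cleaned)
    if len(phrases) > MAX_PHRASES:
        raise ValueError(
            f"{len(phrases)} phrases exceeds the limit of {MAX_PHRASES}"
        )
    return phrases


print(build_phrase_list(["Contoso ", "contoso", "  metoprolol", "", "SKU-4471"]))
```

Raising an error rather than silently truncating keeps the choice of which terms to drop with the caller, who knows which vocabulary matters most.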

This is the capability that matters most for enterprise deployments. Call-center transcription in financial services, legal dictation with case citations, medical charting with medication names - these are the workloads where generic WER doesn't tell the whole story. Keyword biasing is only available in 1.5, not the original MAI-Transcribe-1.

Transcription Style Control

The transcribeStyle parameter (also 1.5-only) lets you choose between two output modes: the default readability-optimized transcript, which cleans up disfluencies and fillers, or verbatim, which preserves every "uh", "um", and false start. Verbatim mode is useful for legal transcription, research interviews, and any use case where the exact spoken words - not a cleaned-up version - matter.

Language Coverage

The model supports 43 languages as of June 2026, up from 25 in MAI-Transcribe-1. The 18 additions are ten Indic languages (Assamese, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu), plus Bulgarian, Catalan, Estonian, Greek, Lithuanian, Slovak, Slovenian, and Ukrainian. The model defaults to multilingual mode; you can pin it to a specific locale by passing a language code in the API request.

[Image: Earth from space at night showing city lights across multiple continents. Caption: MAI-Transcribe-1.5 covers 43 languages, adding Indic and Eastern European languages absent from the original model. Source: unsplash.com]

API Integration

MAI-Transcribe-1.5 runs through Azure's LLM Speech API. The REST call is straightforward - multipart form upload with a JSON definition block specifying enhancedMode.model: "mai-transcribe-1.5". Phrase lists and transcript style go in the same definition object. The model name identifier is mai-transcribe-1.5 (lowercase, hyphenated).
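The Python sketch below shows what assembling that definition block might look like. Only the enhancedMode.model value and the parameter names phraseList and transcribeStyle come from this article. The exact shape of phraseList (a plain list here), the "verbatim" value string, and the helper name are assumptions to check against the Azure reference. The endpoint URL, auth headers, and multipart field names are not covered here, so the sketch stops at producing the JSON string.

```python
import json

MODEL_ID = "mai-transcribe-1.5"  # identifier given in this article


def build_definition(phrases=None, verbatim=False) -> str:
    """Build the JSON definition part of the multipart request (hypothetical helper)."""
    definition = {"enhancedMode": {"model": MODEL_ID}}
    if phrases:
        # Assumed shape: a flat list of strings in the definition object.
        definition["phraseList"] = list(phrases)
    if verbatim:
        # Assumed value; omitting the key keeps the readable default.
        definition["transcribeStyle"] = "verbatim"
    return json.dumps(definition)


print(build_definition(phrases=["Contoso", "SKU-4471"], verbatim=True))
```

Leaving optional keys out entirely, rather than sending empty values, keeps the request as close as possible to the service defaults.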

Current limits worth knowing: diarization isn't supported yet, and streaming isn't available in the launch version (Microsoft has flagged both as upcoming). For real-time transcription use cases, MAI-Transcribe-1.5 isn't the right choice today - it's batch-only.

Pricing and Availability

MAI-Transcribe-1.5 is available through:

  • Azure AI Foundry (Foundry resource for Speech)
  • Azure Speech Service (LLM Speech API, public preview)
  • MAI Playground (direct model testing)

Pricing: $0.36/hr ($6.00/1,000 minutes), billed in per-second increments. There's no separate fee for keyword biasing or verbatim style. A free tier exists through standard Azure Speech trial credits.

Supported regions are listed in the Azure Speech regions documentation. Files must be under 300 MB and in WAV, MP3, or FLAC format.
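A client-side check catches oversized or unsupported files before an upload is attempted. The Python sketch below applies the two constraints stated above. The function name is hypothetical, the check looks only at the file extension rather than the actual encoding, and treating "300 MB" as 300,000,000 bytes is a conservative assumption because the article doesn't say which definition of MB applies.

```python
from pathlib import Path

# Assumption: decimal megabytes, the stricter interpretation of the limit.
MAX_BYTES = 300 * 1000 * 1000
ALLOWED_SUFFIXES = {".wav", ".mp3", ".flac"}


def upload_problems(filename: str, size_bytes: int) -> list:
    """Return reasons the file would be rejected; empty means it looks OK."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes >= MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes; must be under {MAX_BYTES}")
    return problems


print(upload_problems("earnings_call.WAV", 120_000_000))
print(upload_problems("meeting.ogg", 450_000_000))
```

In practice the size would come from `Path(filename).stat().st_size`; it is passed in here so the check can be exercised without touching the filesystem.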

Among competitors at this accuracy tier, ElevenLabs Scribe v2 runs at roughly $6.67/1,000 minutes according to independent benchmarking, so MAI-Transcribe-1.5's value case rests mostly on near price parity combined with higher throughput.

Strengths and Weaknesses

Strengths

  • Speed advantage is real: 276x real-time processing at top-5 accuracy puts it at a different point on the efficiency frontier than any direct accuracy competitor
  • Keyword biasing works and is API-native - no model fine-tuning required for domain adaptation
  • Verbatim mode is useful for legal and research transcription workflows
  • Azure integration means zero additional authentication setup for teams already running on Azure
  • Language coverage now includes major Indic languages absent from most competing cloud providers

Weaknesses

  • No diarization (speaker identification), which is table stakes for meeting transcription in most enterprise tools
  • No streaming/real-time mode at launch - batch only, limiting voice-agent use cases
  • Earnings22 WER of 4.0% shows weakness on spontaneous financial speech; Scribe v2 and Fun-Realtime-ASR-preview post lower composite WER, though their per-dataset scores aren't disclosed
  • Closed-source with no option for on-premises or private deployment outside Azure

Release Timeline

  • April 2, 2026 - MAI-Transcribe-1 launches: 2.6% WER, 25 languages, $0.36/hr, 3.9% FLEURS WER.
  • June 2, 2026 - MAI-Transcribe-1.5 ships at Microsoft Build 2026: 2.4% WER, 43 languages, keyword biasing, verbatim mode, and long-audio processing time cut from 53 seconds to under 15 seconds.

FAQ

What is MAI-Transcribe-1.5's word error rate?

2.4% on the Artificial Analysis AA-WER v2.0 benchmark, ranking #3 globally as of June 2026. On the FLEURS multilingual benchmark it achieves best-in-class accuracy across 25 languages at 3.7% WER.

How fast is MAI-Transcribe-1.5?

It transcribes one hour of audio in under 15 seconds, running at approximately 276x real-time. That's 5x faster than accuracy-comparable models and more than double the speed of the second-fastest top-10 model on the Artificial Analysis leaderboard.

What does keyword biasing do?

It lets you inject up to 200 domain-specific terms via the phraseList API parameter. The model uses these as context to correctly transcribe specialized vocabulary - brand names, medical terms, legal citations - that a generic model would guess incorrectly. Microsoft reports 30% WER reduction on the FLEURS benchmark when keyword biasing is active.

Does MAI-Transcribe-1.5 support real-time transcription?

No. At launch it's batch-only. Streaming and diarization are listed as planned features. For real-time use cases, Deepgram Nova-3 or Azure's standard Speech-to-Text service are current alternatives.

How much does MAI-Transcribe-1.5 cost?

$0.36/hr ($6.00 per 1,000 minutes), billed in per-second increments through Azure AI Foundry. Pricing is identical to MAI-Transcribe-1.

Which languages does it support?

43 languages, including all 25 from MAI-Transcribe-1 plus 18 additions: Assamese, Bengali, Bulgarian, Catalan, Estonian, Greek, Gujarati, Kannada, Lithuanian, Malayalam, Marathi, Odia, Punjabi, Slovak, Slovenian, Tamil, Telugu, and Ukrainian.

✓ Last verified June 17, 2026

About the author

James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.