<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Mustafa Suleiman | Awesome Agents</title><link>https://awesomeagents.ai/tags/mustafa-suleiman/</link><description>Your guide to AI models, agents, and the future of intelligence. Reviews, leaderboards, news, and tools - all in one place.</description><language>en-us</language><managingEditor>contact@awesomeagents.ai (Awesome Agents)</managingEditor><lastBuildDate>Fri, 03 Apr 2026 19:57:35 +0200</lastBuildDate><atom:link href="https://awesomeagents.ai/tags/mustafa-suleiman/index.xml" rel="self" type="application/rss+xml"/><image><url>https://awesomeagents.ai/images/logo.png</url><title>Awesome Agents</title><link>https://awesomeagents.ai/</link></image><item><title>Microsoft Launches Three AI Models to Rival OpenAI</title><link>https://awesomeagents.ai/news/microsoft-mai-models-openai-break/</link><pubDate>Fri, 03 Apr 2026 19:57:35 +0200</pubDate><guid>https://awesomeagents.ai/news/microsoft-mai-models-openai-break/</guid><description>&lt;p>Three models from Microsoft's in-house AI division landed in Azure Foundry on April 2, and the benchmark numbers are better than most analysts expected. MAI-Transcribe-1 ranks first on the FLEURS speech recognition benchmark across 25 languages. MAI-Voice-1 produces 60 seconds of audio in under one second on a single GPU. MAI-Image-2 debuted at third place on the Arena.ai image leaderboard, ahead of every other model except two.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Three models from Microsoft's in-house AI division landed in Azure Foundry on April 2, and the benchmark numbers are better than most analysts expected. MAI-Transcribe-1 ranks first on the FLEURS speech recognition benchmark across 25 languages. 
MAI-Voice-1 produces 60 seconds of audio in under one second on a single GPU. MAI-Image-2 debuted at third place on the Arena.ai image leaderboard, trailing only two other models.</p>
<p>All three run on Microsoft's own MAIA 200 inference chips. All three are priced below comparable OpenAI and Google offerings. And none of them required input from OpenAI to build.</p>
<p>That last point is the one that matters.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Microsoft's MAI division ships three models: speech-to-text, text-to-speech, and text-to-image</li>
<li>MAI-Transcribe-1 claims the top FLEURS benchmark spot at 3.9% average WER; priced at $0.36/hour</li>
<li>MAI-Image-2 debuts at #3 on Arena.ai; MAI-Voice-1 generates audio 60x faster than real-time</li>
<li>Runs on Microsoft's MAIA 200 chips; all three already power Copilot, Bing, and PowerPoint</li>
<li>A renegotiated OpenAI contract from late 2025 gave Microsoft freedom to build its own frontier AI</li>
</ul>
</div>
<h2 id="the-three-models">The Three Models</h2>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Key Benchmark</th>
          <th>Price</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MAI-Transcribe-1</td>
          <td>Speech-to-Text</td>
          <td>#1 FLEURS; 3.9% WER across 25 languages</td>
          <td>$0.36/hour</td>
      </tr>
      <tr>
          <td>MAI-Voice-1</td>
          <td>Text-to-Speech</td>
          <td>60s audio in under 1s on a single GPU</td>
          <td>$22/1M characters</td>
      </tr>
      <tr>
          <td>MAI-Image-2</td>
          <td>Text-to-Image</td>
          <td>#3 Arena.ai; 2x faster than prior generation</td>
          <td>$5/1M input tokens; $33/1M output tokens</td>
      </tr>
  </tbody>
</table>
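<p>As a rough sanity check on the price list above, the published per-unit rates can be plugged into a simple cost calculator. The workload figures in the example are hypothetical; only the three prices come from the table:</p>
<pre><code class="language-python"># Estimate a hypothetical monthly bill from the listed MAI prices.
# The workload numbers below are illustrative assumptions, not Microsoft figures.

TRANSCRIBE_PER_HOUR = 0.36   # MAI-Transcribe-1: $0.36 per audio hour
VOICE_PER_M_CHARS = 22.0     # MAI-Voice-1: $22 per 1M characters
IMAGE_IN_PER_M_TOK = 5.0     # MAI-Image-2: $5 per 1M input tokens

def monthly_cost(audio_hours, tts_chars, image_input_tokens):
    """Sum the three line items at the published rates."""
    return (audio_hours * TRANSCRIBE_PER_HOUR
            + tts_chars / 1e6 * VOICE_PER_M_CHARS
            + image_input_tokens / 1e6 * IMAGE_IN_PER_M_TOK)

# Example: 500 audio hours, 10M TTS characters, 2M image prompt tokens
print(round(monthly_cost(500, 10_000_000, 2_000_000), 2))  # ~410.0
</code></pre>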
<h3 id="mai-transcribe-1">MAI-Transcribe-1</h3>
<p>Microsoft's speech model ranks first outright on FLEURS in 11 of the 25 core languages and beats OpenAI's Whisper-large-v3 in the remaining 14; within those 14, it also beats Gemini 2.0 Flash in 11. Average word error rate across all 25 languages is 3.9%. The model currently handles batch transcription only; real-time streaming and speaker diarization are in development. At $0.36 per hour of transcribed audio, it undercuts comparable API offerings from both OpenAI and Google.</p>
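<p>For context on that 3.9% figure: FLEURS scores are word error rates - the word-level edit distance between the model's transcript and a reference transcript, divided by the reference length. A minimal sketch of the standard metric (not Microsoft's implementation):</p>
<pre><code class="language-python">def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a five-word reference -> 20% WER
print(wer("the quick brown fox jumps", "the quick brown fox jumped"))  # 0.2
</code></pre>
<p>A 3.9% WER means roughly one word in 26 is inserted, deleted, or substituted relative to the reference transcript.</p>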
<h3 id="mai-voice-1">MAI-Voice-1</h3>
<p>The text-to-speech model produces 60 seconds of output audio in under one second on a single GPU. It accepts audio samples for custom voice creation and maintains speaker identity across long-form content. It already powers Copilot's Audio Expressions and podcast features. Pricing is $22 per million characters.</p>
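<p>The throughput and pricing claims translate into concrete numbers with a little arithmetic. The speaking-rate assumptions below (150 words per minute, roughly 6 characters per word including spaces) are illustrative, not from the announcement:</p>
<pre><code class="language-python"># Back-of-envelope numbers for MAI-Voice-1's claims.

AUDIO_SECONDS_PER_GPU_SECOND = 60   # "60 seconds of audio in under 1 second"
PRICE_PER_M_CHARS = 22.0            # $22 per 1M characters

def hours_to_chars(audio_hours, wpm=150, chars_per_word=6):
    """Rough character count for a given duration of speech."""
    return audio_hours * 60 * wpm * chars_per_word

one_hour_chars = hours_to_chars(1)                      # 54,000 characters
cost_per_hour = one_hour_chars / 1e6 * PRICE_PER_M_CHARS
gen_seconds = 3600 / AUDIO_SECONDS_PER_GPU_SECOND

print(round(cost_per_hour, 2))   # ~1.19 dollars per hour of generated speech
print(gen_seconds)               # 60.0 GPU-seconds per audio hour
</code></pre>
<p>At those assumed rates, an hour of generated speech costs on the order of a dollar and ties up a single GPU for about a minute.</p>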
<h3 id="mai-image-2">MAI-Image-2</h3>
<p>The image model accepts prompts up to 32,000 tokens, generates images at up to 1024x1024 resolution, and is particularly strong in photorealistic rendering, accurate skin tones, in-image text, and complex layouts. At launch it ranked third on the Arena.ai leaderboard for image model families. Output token pricing is $33 per million. WPP, the advertising conglomerate, is listed as one of the first enterprise partners rolling it out at scale.</p>
<p><img src="/images/news/microsoft-mai-models-openai-break-models.jpg" alt="MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 model announcement graphic from Microsoft AI">
<em>Official announcement graphic showing the three new MAI models available in Microsoft Foundry as of April 2, 2026.</em>
<small>Source: microsoft.ai</small></p>
<h2 id="who-built-this-and-why-now">Who Built This and Why Now</h2>
<p>Microsoft AI - known internally as MAI - is the company's dedicated AI research and product division, led by Mustafa Suleiman, the co-founder of DeepMind who joined Microsoft in 2024. In November 2025, Suleiman formally announced the MAI Superintelligence Team, a unit tasked with building what he called &quot;humanist superintelligence.&quot; This week's model release is that team's first major public output.</p>
<p>The timing is not accidental. In late 2025, Microsoft renegotiated its foundational partnership agreement with OpenAI. The previous terms effectively barred Microsoft from training its own frontier-scale models. The revised contract lifted those restrictions, freeing Microsoft to pursue its own AI stack &quot;alone or in partnership with third parties,&quot; as Suleiman described it at the time.</p>
<blockquote>
<p>&quot;We have a best-of-both environment, where we're free to pursue our own superintelligence and also work closely with them,&quot; Suleiman said when announcing the MAI Superintelligence Team in November 2025.</p></blockquote>
<p><img src="/images/news/microsoft-mai-models-openai-break-suleiman.jpg" alt="Mustafa Suleiman, CEO of Microsoft AI">
<em>Mustafa Suleiman, CEO of Microsoft AI and co-founder of DeepMind, leads the MAI division that built the three new models.</em>
<small>Source: commons.wikimedia.org</small></p>
<p>The models are available through Microsoft Foundry - the rebranded version of Azure AI Studio - and through the MAI Playground, currently in public preview in the US. All three are already in production across Microsoft's own products: Copilot, Bing, Bing Image Creator, and PowerPoint use them today.</p>
<h2 id="counter-argument">Counter-Argument</h2>
<p>The simplest pushback is that this is not a breakup. Microsoft's formal partnership with OpenAI runs until 2032, and both parties have financial stakes in the other's success. Microsoft is still OpenAI's largest cloud provider and investor. The new MAI models are in speech, voice, and image generation - not large language models, which remain the core of what OpenAI does. Suleiman has been careful to frame the two efforts as complementary.</p>
<p><img src="/images/news/microsoft-mai-models-openai-break-foundry.jpg" alt="Microsoft Foundry portal - the platform where MAI models are deployed">
<em>Microsoft Foundry, the rebranded Azure AI Studio, is where the three new MAI models are available via API.</em>
<small>Source: github.com/MicrosoftDocs</small></p>
<p>There is also a capability question. The FLEURS benchmark and Arena.ai rankings are legitimate signals, but neither is the hardest test in the field. FLEURS covers multilingual speech recognition, not the more demanding telephony, accented speech, or noisy-environment tasks that enterprise transcription customers care about most. Arena.ai reflects human preference voting, which rewards aesthetics as much as accuracy. Third place on launch day on an evolving leaderboard is a claim worth watching, not a verdict.</p>
<h2 id="what-the-market-is-missing">What the Market Is Missing</h2>
<p>The partnership framing obscures the real dynamic. Microsoft isn't breaking from OpenAI - it's quietly building the capability to not need OpenAI if the relationship sours, costs increase, or competitive pressures shift. That is an entirely different thing, and it's a rational hedge for a company that has bet heavily on a single external vendor.</p>
<p>The MAIA 200 chip dependency is worth tracking. Running your own models on your own silicon closes the loop on external exposure at every layer: the model, the inference stack, and the hardware. That's the same vertical integration play that Google has been running with TPUs and Gemini for years. Microsoft is now on the same path, a few years behind but moving faster than it was.</p>
<p>The question isn't whether MAI-Transcribe-1 beats Whisper on FLEURS. The question is whether Microsoft's in-house team can match OpenAI's development pace on language models - the one class of model that drives the most strategic value. That answer won't come from a speech API.</p>
<p>For competitive context on where MAI-Image-2 sits in the broader image generation market, the <a href="/leaderboards/ai-image-generation-leaderboard">AI image generation leaderboard</a> tracks real-time rankings across all major providers. For where speech models land, see the <a href="/leaderboards/ai-voice-speech-leaderboard">AI voice and speech leaderboard</a>. The broader dynamic between proprietary and open approaches is covered in the <a href="/guides/open-source-vs-proprietary-ai">open-source vs. proprietary AI guide</a>.</p>
<hr>
<p><strong>Sources:</strong> <a href="https://microsoft.ai/news/today-were-announcing-3-new-world-class-mai-models-available-in-foundry/">Microsoft AI announcement</a>, <a href="https://techcrunch.com/2026/04/02/microsoft-takes-on-ai-rivals-with-three-new-foundational-models/">TechCrunch</a>, <a href="https://www.theregister.com/2026/04/02/microsoft_models_homegrown_ai_models/">The Register</a>, <a href="https://siliconangle.com/2026/04/02/microsoft-launches-new-high-speed-voice-image-models/">SiliconAngle</a></p>
]]></content:encoded><dc:creator>Daniel Okafor</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/news/microsoft-mai-models-openai-break_hu_4a3b5caa9f4e0b24.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/news/microsoft-mai-models-openai-break_hu_4a3b5caa9f4e0b24.jpg" width="1200" height="675"/></item></channel></rss>