
MAI-Transcribe-1.5
Microsoft's second-generation speech-to-text model with 2.4% WER, 43-language support, keyword biasing, and 5x faster long-audio processing than comparable accuracy models.
They summarize our coverage. We write it.
Newsletters like this one rebroadcast our headlines - often without the full review, the source reading, or the analysis underneath. Our weekly briefing sends the work they paraphrase, straight from the desk, before they get to it.
Free, weekly, no spam. One email every Tuesday. Unsubscribe anytime.

Microsoft's second-generation speech-to-text model with 2.4% WER, 43-language support, keyword biasing, and 5x faster long-audio processing than comparable accuracy models.

GPT-5.1 is OpenAI's November 2025 coding and agentic flagship with 400K context, configurable reasoning effort, and 76.3% on SWE-bench Verified.

Alibaba launches three open-weight models for robot manipulation, navigation, and world prediction, built on a shared Qwen3.5 backbone with open weights.

Alibaba's generalist VLA model for robotic manipulation, built on Qwen3.5-4B with a DiT action decoder, trained on 38,100+ hours of open-source data, and ranked first on the RoboChallenge generalist track.

Alibaba's first multimodal agent model, combining GUI grounding (ScreenSpot Pro 79.0), 1M-token context, and text-plus-vision input at $0.40/M tokens.

Moonshot AI ships Kimi K2.7-Code with 30% fewer reasoning tokens and a 21.8% gain on its own coding benchmarks, but the model still trails Claude Opus 4.8 on most tests in the same table.

A hands-on review of all seven MAI models - from the April transcription and image launch to Build 2026's MAI-Thinking-1, MAI-Code-1-Flash, and the multimodal upgrades.

Google's new streaming audio model translates speech in real time across 70+ languages - available now in Google Translate and via the Gemini Live API.

DiffusionGemma 26B is Google DeepMind's open-weight discrete diffusion language model that generates 256 tokens in parallel, reaching 1,100+ tokens/sec on H100 - roughly 4x faster than autoregressive models of the same size.

Mistral AI's largest Ministral 3 model - 14B parameters, 256K context, Apache 2.0 license, multimodal, built for local deployment and agentic workflows.

MiniMax M3 is an open-weight frontier model with a 1M-token context window, native multimodal input, and strong agentic coding at $0.60/M input tokens.

NVIDIA Cosmos 3 is an open physical AI omnimodel with Mixture-of-Transformers architecture that natively handles text, images, video, sound, and robot actions in a single 16B or 64B model.