
Thinking Machines Builds AI That Listens While Talking
Mira Murati's startup unveils TML-Interaction-Small, a 276B MoE model that hits 0.40-second response latency by listening and generating speech at the same time.
They summarize our coverage. We write it.
Newsletters like this one rebroadcast our headlines - often without the full review, the source reading, or the analysis underneath. Our weekly briefing sends the work they paraphrase, straight from the desk, before they get to it.
Free, weekly, no spam. One email every Tuesday. Unsubscribe anytime.

Mira Murati's startup unveils TML-Interaction-Small, a 276B MoE model that hits 0.40-second response latency by listening and generating speech at the same time.

IBM's Granite Embedding Multilingual R2 ships with a 64x context window jump, ModernBERT architecture, and Apache 2.0 licensing that makes it enterprise-safe out of the box.

A physics formula predicts AI behavioral shifts before they happen, a benchmark shows LLMs fail at 90% of graduate math formalization, and a training-free method cuts synthetic data costs by up to 78%.

Subquadratic's SubQ claims the first linear-scaling LLM with a 12M-token window - but private beta access, self-reported benchmarks, and a 17-point MRCR gap make independent verification the only test that matters.

Rankings of the best AI models and agent frameworks on the GAIA benchmark, which tests real-world multi-step tasks requiring web browsing, tool use, and multi-hop reasoning.

SU-01 is a 30B-A3B MoE reasoning model from Shanghai AI Lab that achieves gold-medal performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 using a three-stage training recipe and test-time scaling.

A 30B model earns IMO gold, memory consolidation silently corrupts agents, and a new metric predicts when LLMs lose track of their instructions.

HiDream-O1-Image is an 8B open-source text-to-image model with a pixel-space diffusion architecture that outperforms 32B FLUX.2 [dev] across five major benchmarks.

SubQ is the first LLM built on a fully subquadratic attention architecture, achieving a 12M-token research context and 52x faster inference than FlashAttention at 1M tokens.

New research shows reasoning length amplifies position bias, behavior cues cut wasted tokens by 50% while boosting safety, and sparse autoencoders can predict tool failures from model internals.

Updated May 2026: DeepSeek V4-Flash reasoning now $0.28/MTok output (8x cheaper than R1), o3-pro launched at $20/$80, Grok 4 retires May 15 - verified pricing across 11 models.

NVIDIA Ising is the first open AI model family for quantum computing - a 35B VLM for processor calibration and CNN decoders for real-time error correction, already deployed at 20+ research institutions.