
Microsoft Foundry Bets on Open Models With Fireworks
Microsoft Azure's Foundry platform now runs Fireworks AI's inference engine, bringing DeepSeek V3.2, Kimi K2.5, and MiniMax M2.5 into enterprise AI under a unified control plane.
They summarize our coverage. We write it.
Newsletters like this one rebroadcast our headlines - often without the full review, the source reading, or the analysis underneath. Our weekly briefing sends the work they paraphrase, straight from the desk, before they get to it.
Free, weekly, no spam. One email every Tuesday. Unsubscribe anytime.

Microsoft Azure's Foundry platform now runs Fireworks AI's inference engine, bringing DeepSeek V3.2, Kimi K2.5, and MiniMax M2.5 into enterprise AI under a unified control plane.

NVIDIA opens GTC 2026 with the Vera Rubin platform - six co-designed chips delivering 50 PFLOPS of inference per GPU and 10x lower token cost than Blackwell.

Apple's flagship SoC with 40-core GPU, per-core Neural Accelerators, 614 GB/s bandwidth, and 4x AI performance over M4 Max.

Meta's first mass-deployed RISC-V AI accelerator - 1.2 PFLOPS FP8, 216 GB HBM, powering Facebook and Instagram at scale.

Rankings of the fastest AI models and inference providers by tokens per second, time to first token, and end-to-end latency.

vLLM v0.17.0 adds FlashAttention 4, elastic expert parallelism for live MoE rescaling, full Qwen3.5 support, and a performance-mode flag, all in 699 commits from 272 contributors.

Andrew Ng says AGI is decades away and the real AI bubble risk is in the training layer - not inference. We examine both claims against the data.

Mercury 2 by Inception Labs is the fastest reasoning LLM available, built on diffusion architecture. We tested the speed, quality, and real-world trade-offs.

Inception Labs' Mercury 2 hits 1,196 tokens per second in independent testing - a diffusion architecture that rewires how inference works.

NVIDIA will unveil a new inference processor built on Groq's LPU architecture at GTC 2026, with OpenAI as its first major customer allocating 3 GW of dedicated capacity.

A developer cracked Apple's undocumented ANE private APIs, measured its real throughput at 19 TFLOPS FP16 (not the marketed 38 TOPS), and trained a 109M-parameter transformer on hardware Apple designed exclusively for inference.

Full specs and critical analysis of the Etched Sohu - a transformer-specific ASIC claiming 500K+ tokens/sec on Llama 70B, built on TSMC 4nm with 144GB HBM3E. Bold claims, but no independent benchmarks yet.