
Nemotron-Cascade 2: 30B Open MoE, One GPU, Beats 120B
NVIDIA's new Nemotron-Cascade-2-30B-A3B activates just 3B parameters per token, runs on a single RTX 4090, and outscores NVIDIA's own 120B model on coding and math benchmarks.
They summarize our coverage. We write it.
Newsletters like this one rebroadcast our headlines - often without the full review, the source reading, or the analysis underneath. Our weekly briefing sends the work they paraphrase, straight from the desk, before they get to it.
Free, weekly, no spam. One email every Tuesday. Unsubscribe anytime.

NVIDIA's new Nemotron-Cascade-2-30B-A3B activates just 3B parameters per token, runs on a single RTX 4090, and outscores NVIDIA's own 120B model on coding and math benchmarks.

NVIDIA's Nemotron 3 Nano 4B packs a Mamba-dominant hybrid architecture, 262K token context, and 95.4% on MATH500 into a model that fits an 8GB Jetson Orin Nano.

Microsoft Azure's Foundry platform now runs Fireworks AI's inference engine, bringing DeepSeek V3.2, Kimi K2.5, and MiniMax M2.5 into enterprise AI under a unified control plane.

NVIDIA opens GTC 2026 with the Vera Rubin platform - six co-designed chips delivering 50 PFLOPS of inference per GPU and 10x lower token cost than Blackwell.

Apple's flagship SoC with 40-core GPU, per-core Neural Accelerators, 614 GB/s bandwidth, and 4x AI performance over M4 Max.

Meta's first mass-deployed RISC-V AI accelerator - 1.2 PFLOPS FP8, 216 GB HBM, powering Facebook and Instagram at scale.

Rankings of the fastest AI models and inference providers by tokens per second, time to first token, and end-to-end latency.

vLLM v0.17.0 adds FlashAttention 4, elastic expert parallelism for live MoE rescaling, full Qwen3.5 support, and a performance-mode flag, all in 699 commits from 272 contributors.

Andrew Ng says AGI is decades away and the real AI bubble risk is in the training layer - not inference. We examine both claims against the data.

Mercury 2 by Inception Labs is the fastest reasoning LLM available, built on diffusion architecture. We tested the speed, quality, and real-world trade-offs.

Inception Labs' Mercury 2 hits 1,196 tokens per second in independent testing - a diffusion architecture that rewires how inference works.

NVIDIA will unveil a new inference processor built on Groq's LPU architecture at GTC 2026, with OpenAI as its first major customer allocating 3 GW of dedicated capacity.