Articles Tagged "MoE"

DiffusionGemma 26B

DiffusionGemma 26B is Google DeepMind's open-weight discrete diffusion language model that generates 256 tokens in parallel, reaching 1,100+ tokens/sec on H100 - roughly 4x faster than autoregressive models of the same size.

MAI-Code-1-Flash

Microsoft's first in-house coding model, a 137B sparse MoE built natively for GitHub Copilot, beating Claude Haiku 4.5 on SWE-Bench Pro by 16 points.

NVIDIA Nemotron 3 Ultra 550B-A55B

NVIDIA's 550B open-weight MoE model with 55B active parameters, a hybrid Mamba-Transformer architecture, and a 1M-token context - the top-scoring US open model on the Artificial Analysis Intelligence Index.

NVIDIA Ships Nemotron 3 Ultra - 550B Open-Weight MoE

NVIDIA's 550B Nemotron 3 Ultra, released June 4, tops the US open-weight leaderboard with a hybrid Mamba-Transformer MoE architecture and 300-plus tokens per second throughput.

Cohere Command A+

Cohere Command A+ is a 218B sparse MoE model with an Apache 2.0 license, native citations, and a 128K context window that runs on just two H100 GPUs.

Qwen3-Coder-Next

Qwen3-Coder-Next is an Apache 2.0-licensed 80B MoE coding model from Alibaba that activates just 3B parameters per forward pass and scores over 70% on SWE-Bench Verified with agent scaffolding.

SU-01

SU-01 is a 30B-A3B MoE reasoning model from Shanghai AI Lab that achieves gold-medal performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 using a three-stage training recipe and test-time scaling.

ZAYA1-8B

Zyphra's ZAYA1-8B is an 8.4B-parameter MoE reasoning model with only 760M active parameters that matches DeepSeek-R1-0528 on math and coding benchmarks while running at a fraction of the compute cost.

Nemotron 3 Nano Omni Unifies Vision, Audio, Language

NVIDIA's new open omni model activates 3B of 30B parameters, processes video, audio, and documents in one pass, and delivers up to 9.2x higher throughput than other open omni models.

Nemotron 3 Nano Omni

NVIDIA's first open omni-modal model: 30B total / 3B active hybrid Mamba-MoE that processes text, images, audio, and video in a single inference loop, with 9x higher throughput than comparable open omni models.

DeepSeek V4

DeepSeek V4 ships in two open-weight MoE variants - V4-Pro at 1.6T total/49B active and V4-Flash at 284B total/13B active - both with a 1M-token context and an MIT license, released April 24, 2026.

DeepSeek V4-Pro Review: Frontier Power, Penny Prices

DeepSeek V4-Pro matches Claude Opus 4.6 on SWE-bench at a fraction of the cost - a thorough review of what it gets right, where it still trails, and whether the price gap justifies the switch.
