Articles Tagged "Inference"

MoE Myths, Context Compression, and Steering Proofs

Three papers this week challenge how we think about MoE expert routing, LLM context management, and the limits of activation steering.

OpenRouter Drops a Free 100B Stealth Model With 256K Context

Elephant Alpha is a free 100B parameter model on OpenRouter with 256K context, tool use, and structured output - but your prompts get logged, there are no benchmarks, and nobody knows who made it.

Intel Arc Pro B70 Brings 32GB VRAM to Local AI for $949

Intel's Arc Pro B70 launched on March 25 with 32GB GDDR6 and 367 TOPS for $949, undercutting NVIDIA's RTX Pro 4000 by $850. The hardware case is strong. The software story is not.

AMD Instinct MI325X - 256GB CDNA3 for Inference

AMD Instinct MI325X specs, benchmarks, and analysis. 256GB HBM3e at 6 TB/s, 2.6 PFLOPS FP8, CDNA3 architecture - the memory-capacity upgrade to the MI300X targeting large model inference.

Huawei Atlas 350 - China's FP4 Inference Accelerator

Huawei Atlas 350 specs, benchmarks, and analysis. Ascend 950PR chip, 112GB HiBL 1.0 HBM, 1.56 PFLOPS FP4, 600W - China's first domestically developed FP4-capable AI accelerator.

Microsoft Maia 200 - Azure's Inference Accelerator

Microsoft Maia 200 specs, benchmarks, and architecture analysis. TSMC 3nm, 216GB HBM3e, 10 PFLOPS FP4, 750W - Microsoft's first inference-only silicon deployed in Azure.

South Korea Bets $400M on Rebellions to Rival Nvidia

South Korean AI chip startup Rebellions has closed a $400M pre-IPO round at a $2.34B valuation, with the government's Korea National Growth Fund leading Seoul's first direct bet under its K-Nvidia initiative.

llm-d Joins CNCF - Kubernetes Gets a Native LLM Inference Stack

IBM Research, Red Hat, and Google Cloud donated llm-d to the CNCF at KubeCon EU, giving Kubernetes a production-grade distributed LLM inference framework built on vLLM.

Arm Launches AGI CPU, Its First Chip in 35 Years

At its Arm Everywhere event in San Francisco, Arm unveiled the AGI CPU - a 136-core data center processor co-developed with Meta and the company's first owned silicon product in its 35-year history.

Alibaba's C950 - First RISC-V CPU with Native LLM Inference

Alibaba's T-Head division launched the XuanTie C950, a 5nm 3.2GHz RISC-V server chip that sets a new world record for RISC-V single-core performance and natively runs billion-parameter models like DeepSeek V3 and Qwen3.

Nemotron-Cascade 2: 30B Open MoE, One GPU, Beats 120B

NVIDIA's new Nemotron-Cascade-2-30B-A3B activates just 3B parameters per token, runs on a single RTX 4090, and outscores NVIDIA's own 120B model on coding and math benchmarks.

Nemotron 3 Nano 4B: NVIDIA Edge Model Runs on 8GB

NVIDIA's Nemotron 3 Nano 4B packs a Mamba-dominant hybrid architecture, 262K token context, and 95.4% on MATH500 into a model that fits an 8GB Jetson Orin Nano.
