Articles Tagged "Inference"

Faster Agents, Skewed Evals, and Brand Bias in LLMs

Three new papers: agents that compile runs into 8-13x faster state machines, benchmark scores that shift with compute budget, and big brands monopolizing LLM recommendations.

AMD Instinct MI450 - 2nm, 432 GB HBM4, 40 PFLOPS

AMD's CDNA 5 accelerator on TSMC 2nm with 432 GB HBM4 memory - the GPU behind OpenAI's 1GW deployment and Oracle's 50,000-chip supercluster.

Google DiffusionGemma: Parallel LLM Hits 1,100 t/s

Google DeepMind open-sources DiffusionGemma, a 26B MoE model that generates 256 tokens per denoising pass instead of one at a time, reaching 1,100 tokens per second on a single H100.

DiffusionGemma 26B

DiffusionGemma 26B is Google DeepMind's open-weight discrete diffusion language model that generates 256 tokens in parallel, reaching 1,100+ tokens/sec on H100 - roughly 4x faster than autoregressive models of the same size.

Ministral 3 8B

Mistral AI's mid-tier open-weight edge model - 8B parameters, 256K context, Apache 2.0 license, built for agentic pipelines and cost-sensitive production workloads.

MiniMax M3 Makes 1M Context Viable With Sparse Attention

MiniMax M3 uses sparse attention to cut long-context inference cost 20x, topping GPT-5.5 on coding benchmarks at a fraction of the price.

NVIDIA Ships Nemotron 3 Ultra - 550B Open-Weight MoE

NVIDIA's 550B Nemotron 3 Ultra, released June 4, tops the US open-weight leaderboard with a hybrid Mamba-Transformer MoE architecture and throughput of 300-plus tokens per second.

NVIDIA Dynamo Snapshot Slashes Kubernetes AI Cold Starts

NVIDIA's Dynamo Snapshot uses CRIU and cuda-checkpoint to freeze and restore GPU inference containers in seconds, cutting Kubernetes cold-start times by up to 21x for large models.

Rapid-MLX Is 2.6x Faster Than Ollama on Apple Silicon

A new open-source inference engine for Apple Silicon runs up to 2.6x faster than Ollama in benchmarks, supports 66 model aliases, and drops in as an OpenAI-compatible server on any Mac.

Llama 3.3 70B Instruct

Meta's Llama 3.3 70B Instruct matches Llama 3.1 405B on instruction following and math while running at 4-5x lower cost, with the lowest hallucination rate of any open-weight model on the Vectara summarization leaderboard.

Intel Crescent Island GPU Skips HBM for 480GB LPDDR5X

Intel's Crescent Island inference GPU trades HBM bandwidth for 480GB of LPDDR5X, targeting customers locked out of NVIDIA's supply chain.

Open Source LLM Hosting Costs - June 2026

Verified June 2026: real cost per million tokens for self-hosting Llama 4 Scout, Maverick, Qwen3-235B, and DeepSeek V3.2 - GPU requirements, cost formulas, and when cheap APIs actually win.
