Articles Tagged "Inference"

Microsoft Foundry Bets on Open Models With Fireworks

Microsoft Azure's Foundry platform now runs Fireworks AI's inference engine, bringing DeepSeek V3.2, Kimi K2.5, and MiniMax M2.5 into enterprise AI under a unified control plane.

NVIDIA's Vera Rubin Arrives at GTC 2026 With 6 Chips

NVIDIA opens GTC 2026 with the Vera Rubin platform - six co-designed chips delivering 50 PFLOPS of inference per GPU and 10x lower token cost than Blackwell.

Apple M5 Max

Apple's flagship SoC with a 40-core GPU, per-core Neural Accelerators, 614 GB/s bandwidth, and 4x the AI performance of the M4 Max.

Meta MTIA 300

Meta's first mass-deployed RISC-V AI accelerator - 1.2 PFLOPS FP8, 216 GB HBM, powering Facebook and Instagram at scale.

AI Speed and Latency Leaderboard: Tokens/s Rankings

Rankings of the fastest AI models and inference providers by tokens per second, time to first token, and end-to-end latency.

vLLM 0.17 Ships FlashAttention 4 and Live MoE Scaling

vLLM v0.17.0 adds FlashAttention 4, elastic expert parallelism for live MoE rescaling, full Qwen3.5 support, and a performance-mode flag, all in 699 commits from 272 contributors.

AGI in 2026? Andrew Ng Has Some Bad News for Musk

Andrew Ng says AGI is decades away and the real AI bubble risk is in the training layer - not inference. We examine both claims against the data.

Mercury 2 Review: 1,000 Tokens per Second, Tested

Mercury 2 by Inception Labs is the fastest reasoning LLM available, built on a diffusion architecture. We tested the speed, quality, and real-world trade-offs.

Mercury 2 Is 13x Faster Than Claude Haiku - Verified

Inception Labs' Mercury 2 hits 1,196 tokens per second in independent testing - a diffusion architecture that rewires how inference works.

NVIDIA's Secret Chip Fuses GPU and Groq for OpenAI

NVIDIA will unveil a new inference processor built on Groq's LPU architecture at GTC 2026, with OpenAI as its first major customer allocating 3 GW of dedicated capacity.

Someone Reverse-Engineered Apple's Neural Engine and Trained a Model on It

A developer cracked Apple's undocumented ANE private APIs, measured its real throughput at 19 TFLOPS FP16 (not the marketed 38 TOPS), and trained a 109M-parameter transformer on hardware Apple designed exclusively for inference.

Etched Sohu - Transformer-Only Inference ASIC

Full specs and critical analysis of the Etched Sohu - a transformer-specific ASIC claiming 500K+ tokens/sec on Llama 70B, built on TSMC 4nm with 144GB HBM3E. Bold claims, but no independent benchmarks yet.
