Articles Tagged "Inference"

Nemotron 3 Nano Omni Unifies Vision, Audio, Language

NVIDIA's new open omni model activates 3B of 30B parameters, processes video, audio, and documents in one pass, and delivers up to 9.2x higher throughput than other open omni models.

Faking Alignment, Shifting Morals, Saving Compute

Three arXiv papers show AI systems fake alignment in 37% of test cases, reshape human moral values through brief chats, and can cut inference compute while improving performance.

Google's Four-Chip Plan to Own AI Inference at Scale

Google splits its next TPU generation across Broadcom, MediaTek, Marvell, and Intel to win inference economics, revealed ahead of Cloud Next 2026.

AI Labs Are Losing Billions - Here's Who Really Pays

OpenAI burned $2.5B in cash on $4.3B of revenue in the first half of 2025. Anthropic cut its gross margin forecast from 50% to 40%. Here's the compute subsidy math behind every AI subscription, and who's actually paying for it.

Best AI Home Workstations 2026 - Full Buying Guide

Complete buying guide for AI home workstations in 2026 - pre-built machines and DIY builds for running local LLMs from 3B to 70B+ models, with benchmarks, part lists, and price-tier comparisons.

LLM Quantization Impact Leaderboard 2026: INT4 vs FP16

How much quality do LLMs lose when quantized from BF16 to INT8, Q6, Q5, Q4, Q3, or Q2? Per-model delta tables across MMLU, HumanEval, and perplexity, with VRAM and throughput data for every major quantization format.

Best Open-Source LLM Inference Servers 2026

A benchmark-driven comparison of the top open-source LLM inference servers - vLLM, SGLang, TGI, llama.cpp, TensorRT-LLM, LMDeploy, and more.

Caveman Plugin Hits 34K Stars by Making Claude Talk Like a Caveman

The Caveman plugin for Claude Code cuts 65-87% of output tokens by stripping filler words and using terse fragment-based language. Its new Classical Chinese mode pushes compression even further.

A $900 RTX 3090 Now Beats an M5 Max at LLM Inference

Two researchers fused all 24 layers of Qwen 3.5-0.8B into a single CUDA kernel launch, making a five-year-old RTX 3090 deliver 1.8x the throughput of an M5 Max at equal or better efficiency. The gap was software, not silicon.

Compact Contexts, Smarter Fine-Tuning, and the Solver Trap

Three papers from today's arXiv: a joint fix for KV cache bloat and attention cost, new evidence that fine-tuning belongs in the middle of a transformer, and why stronger reasoning hurts behavioral simulation.

NVIDIA Groq 3 LPU - SRAM-Based Inference Engine

The NVIDIA Groq 3 LPU is a pure-SRAM inference chip delivering 150 TB/s memory bandwidth and 1.2 PFLOPS FP8 per chip, designed to pair with Vera Rubin GPUs for trillion-parameter model serving.

Positron Atlas - FPGA Inference Server

The Positron Atlas is an 8-card FPGA inference server delivering 4.5x better performance per watt than the NVIDIA DGX H200 at 2000W in a single 1U chassis.
