Articles Tagged "Inference"

Cut CoT Costs, Fix Agent Memory, Test Clinical AI

Three papers: smarter CoT trimming cuts reasoning length by 50%, a plug-in context manager rescues frozen agents on long tasks, and a 960K-item clinical benchmark exposes LLM gaps in hospitals.

AMD Instinct MI350P - CDNA 4 PCIe Inference Card

AMD Instinct MI350P brings CDNA 4 to a standard PCIe slot: 144GB HBM3E, 4 TB/s bandwidth, 2.3 PFLOPS MXFP8, and 600W passive cooling for air-cooled servers.

NVIDIA RTX Spark - ARM Blackwell Superchip for AI PCs

NVIDIA RTX Spark is a 20-core ARM + Blackwell GPU superchip delivering 1 petaFLOP FP4 and 128GB unified memory for AI-first Windows laptops and desktops.

Nvidia Cosmos 3 Is the First Open Physical AI Model

Nvidia's Cosmos 3 is the first fully open omnimodel for physical AI, trained on 20 trillion tokens to teach robots and autonomous vehicles how to reason and act in the real world.

XCENA Raises $135M Betting Memory Is AI's Real Bottleneck

XCENA raises $135M at a $570M valuation to build the MX1 - a CXL 3.2 chip with thousands of RISC-V cores that processes AI workloads where data lives, eliminating costly data transfers between GPU and RAM.

Groq Raises $650M to Pivot From Chip Maker to Cloud

After licensing its chip technology to Nvidia for $20 billion, Groq is raising $650M from existing investors to rebuild itself as an AI inference cloud provider.

Gemini 3.5 Flash: Real Speed, Selective Benchmarks

Google's Gemini 3.5 Flash is genuinely fast at 289 tok/s and competitive on agentic tasks - but the benchmark portfolio has gaps worth knowing before you build on it.

Chinese Models Claim 60% of OpenRouter Token Traffic

Chinese AI providers now handle over 60% of all tokens routed through OpenRouter, up from less than 2% just a year ago.

Gemini 3.5 Flash Review: When Flash Surpasses Pro

Gemini 3.5 Flash leads on agentic benchmarks, runs 4x faster than Claude and GPT-5.5, and undercuts both on price - but a hidden long-context weakness and a 3x price hike over its predecessor deserve scrutiny.

Gemini 3.5 Flash

Google DeepMind's fastest frontier model, hitting 76.2% on Terminal-Bench 2.1 and 289 tok/s, now powering AI Mode in Search for over 1 billion monthly users.

Thinking Machines Builds AI That Listens While Talking

Mira Murati's startup unveils TML-Interaction-Small, a 276B MoE model that hits 0.40-second response latency by listening and generating speech at the same time.

Lumai Iris Nova: Optical AI Inference Server

Lumai's optical AI inference server uses light-based computing to run billion-parameter LLMs with up to 90% less power than GPUs.
