
AutoKernel - AI Agents That Write Faster GPU Kernels
RightNow AI releases AutoKernel, an open-source MIT-licensed framework that runs an autonomous LLM agent loop overnight to produce optimized Triton kernels for any PyTorch model.
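The "agent loop" here is essentially a generate-benchmark-keep-best search. A minimal Python sketch of that loop, with a stubbed-out candidate generator and wall-clock timing standing in for the LLM and for real GPU profiling (all names here are hypothetical, not AutoKernel's actual API):

```python
import random
import time

def benchmark(kernel_fn, n_iters=10):
    # Wall-clock timing as a stand-in for real GPU profiling.
    start = time.perf_counter()
    for _ in range(n_iters):
        kernel_fn()
    return (time.perf_counter() - start) / n_iters

def propose_variant(rng):
    # Stand-in for an LLM proposing a new Triton kernel variant;
    # here each "kernel" is a dummy function with a random cost.
    cost = rng.uniform(0.5, 1.5)
    def kernel():
        _ = sum(range(int(cost * 1000)))
    return kernel

def agent_loop(n_rounds=20, seed=0):
    # Keep the fastest candidate seen so far, as an overnight
    # autonomous search would.
    rng = random.Random(seed)
    best_fn, best_time = None, float("inf")
    for _ in range(n_rounds):
        candidate = propose_variant(rng)
        t = benchmark(candidate)
        if t < best_time:
            best_fn, best_time = candidate, t
    return best_fn, best_time
```

In a real run, `propose_variant` would prompt the LLM with the PyTorch op and prior timings, and `benchmark` would compile and profile the Triton kernel on the target GPU.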

Meta's KernelEvolve AI agent autonomously generates and optimizes hardware kernels across NVIDIA, AMD, and MTIA chips, delivering over 60% inference gains in production.

vLLM v0.17.0 adds FlashAttention 4, elastic expert parallelism for live MoE rescaling, full Qwen3.5 support, and a performance-mode flag, all in 699 commits from 272 contributors.

A hands-on guide to Metal compute programming on Apple Silicon. Covers architecture, unified memory, real code examples in MSL and Swift, and how Metal compares to CUDA.

CUDA Agent uses reinforcement learning trained on real GPU profiling data to generate optimized CUDA kernels. It achieves a 2.11x average speedup over torch.compile and outperforms Claude Opus 4.5 and Gemini 3 Pro by 40 points on the hardest kernels.

A hands-on guide to CUDA programming for developers who know how to code but have never written a GPU kernel. Covers architecture, the memory hierarchy, real code examples, and how CUDA compares to Metal.