
AutoKernel - AI Agents That Write Faster GPU Kernels
RightNow AI releases AutoKernel, an open-source MIT-licensed framework that runs an autonomous LLM agent loop overnight to produce optimized Triton kernels for any PyTorch model.
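The "agent loop" here is essentially a generate-benchmark-keep-best search. A minimal Python sketch of that loop, with a stubbed-out candidate generator and wall-clock timing standing in for the LLM and for real GPU profiling (all names here are hypothetical, not AutoKernel's actual API):

```python
import random
import time

def benchmark(kernel_fn, n_iters=10):
    # Wall-clock timing as a stand-in for real GPU profiling.
    start = time.perf_counter()
    for _ in range(n_iters):
        kernel_fn()
    return (time.perf_counter() - start) / n_iters

def propose_variant(rng):
    # Stand-in for an LLM proposing a new Triton kernel variant;
    # here each "kernel" is a dummy function with a random cost.
    cost = rng.uniform(0.5, 1.5)
    def kernel():
        _ = sum(range(int(cost * 1000)))
    return kernel

def agent_loop(n_rounds=20, seed=0):
    # Keep the fastest candidate seen so far, as an overnight
    # autonomous search would.
    rng = random.Random(seed)
    best_fn, best_time = None, float("inf")
    for _ in range(n_rounds):
        candidate = propose_variant(rng)
        t = benchmark(candidate)
        if t < best_time:
            best_fn, best_time = candidate, t
    return best_fn, best_time
```

In a real run, `propose_variant` would prompt the LLM with the PyTorch op and prior timings, and `benchmark` would compile and profile the Triton kernel on the target GPU.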

Meta's KernelEvolve AI agent autonomously generates and optimizes hardware kernels across NVIDIA, AMD, and MTIA chips, delivering over 60% inference gains in production.

vLLM v0.17.0 adds FlashAttention 4, elastic expert parallelism for live MoE rescaling, full Qwen3.5 support, and a performance-mode flag, all in 699 commits from 272 contributors.

A hands-on guide to Metal compute programming on Apple Silicon. Covers architecture, unified memory, real code examples in MSL and Swift, and how Metal compares to CUDA.

CUDA Agent uses reinforcement learning trained on real GPU profiling data to generate optimized CUDA kernels. It achieves a 2.11x average speedup over torch.compile and outperforms Claude Opus 4.5 and Gemini 3 Pro by 40 points on the hardest kernels.

A hands-on guide to CUDA programming for developers who know how to code but have never written a GPU kernel. Covers architecture, the memory hierarchy, real code examples, and how CUDA compares to Metal.