<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Triton | Awesome Agents</title><link>https://awesomeagents.ai/tags/triton/</link><description>Your guide to AI models, agents, and the future of intelligence. Reviews, leaderboards, news, and tools - all in one place.</description><language>en-us</language><managingEditor>contact@awesomeagents.ai (Awesome Agents)</managingEditor><lastBuildDate>Mon, 06 Apr 2026 16:46:45 +0200</lastBuildDate><atom:link href="https://awesomeagents.ai/tags/triton/index.xml" rel="self" type="application/rss+xml"/><image><url>https://awesomeagents.ai/images/logo.png</url><title>Awesome Agents</title><link>https://awesomeagents.ai/</link></image><item><title>AutoKernel - AI Agents That Write Faster GPU Kernels</title><link>https://awesomeagents.ai/news/autokernel-open-source-gpu-kernel-agent/</link><pubDate>Mon, 06 Apr 2026 16:46:45 +0200</pubDate><guid>https://awesomeagents.ai/news/autokernel-open-source-gpu-kernel-agent/</guid><description>&lt;p>You point it at a PyTorch model, tell it to run, and go to sleep. By morning it hands you a set of Triton kernels tuned specifically for your hardware. That is the pitch behind &lt;a href="https://github.com/RightNow-AI/autokernel">AutoKernel&lt;/a>, a framework released today by RightNow AI that applies an autonomous LLM agent loop to GPU kernel optimization - no CUDA expertise required.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>You point it at a PyTorch model, tell it to run, and go to sleep. By morning it hands you a set of Triton kernels tuned specifically for your hardware. 
That is the pitch behind <a href="https://github.com/RightNow-AI/autokernel">AutoKernel</a>, a framework released today by RightNow AI that applies an autonomous LLM agent loop to GPU kernel optimization - no CUDA expertise required.</p>
<p>The project ships with an <a href="https://arxiv.org/html/2603.21331v1">arXiv paper</a> by Jaber Jaber and Osama Jaber, and picked up roughly 1,000 GitHub stars within hours of the release post going up on Hacker News.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>AutoKernel runs an iterative edit-benchmark-revert agent loop on GPU kernels for ~300-400 experiments overnight per kernel</li>
<li>Beats <code>torch.compile</code> on 12 of 16 tested configurations; reaches 5.29x over PyTorch eager on RMSNorm</li>
<li>Supports 9 kernel types (matmul, softmax, layernorm, RMSNorm, Flash Attention, SwiGLU, cross-entropy, RoPE, parallel reduction) via Triton and CUDA C++</li>
<li>MIT licensed, runs on NVIDIA H100/A100/L40S and AMD MI300X/MI350X; also tested on RTX 4090</li>
<li>Lags well behind cuBLAS on compute-bound matmul - the authors are transparent about this</li>
</ul>
</div>
<h2 id="how-the-agent-loop-works">How the Agent Loop Works</h2>
<p>Unlike one-shot kernel generators, AutoKernel runs a closed feedback loop. There are three phases.</p>
<h3 id="phase-a-profiling">Phase A: Profiling</h3>
<p>The system uses <code>torch.profiler</code> with shape recording to capture per-kernel GPU time across the full forward pass. It detects the target GPU automatically and ranks each kernel by runtime contribution using Amdahl's law - the bottleneck kernels get the agent's attention first.</p>
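The ranking logic is simple to sketch in plain Python (helper names are ours, not AutoKernel's API; the real system pulls timings from <code>torch.profiler</code>):

```python
# Sketch of Amdahl-style kernel ranking. A kernel's share of total GPU time
# bounds the end-to-end speedup from optimizing it alone:
#   1 / ((1 - f) + f / s)  for runtime fraction f and local speedup s.

def amdahl_bound(fraction: float, local_speedup: float) -> float:
    """Upper bound on end-to-end speedup from accelerating one kernel."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

def rank_kernels(profile: dict[str, float]) -> list[tuple[str, float]]:
    """Sort kernels by runtime contribution, largest first."""
    total = sum(profile.values())
    return sorted(
        ((name, t / total) for name, t in profile.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )

profile_ms = {"matmul": 42.0, "softmax": 18.0, "layernorm": 6.0}
ranked = rank_kernels(profile_ms)
# The top-ranked kernel gets the agent's attention first: a 2x win on a kernel
# that is 60% of runtime caps the model-level speedup at about 1.43x.
```

This is why the bottleneck-first ordering matters - a big local speedup on a small kernel barely moves end-to-end numbers.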
<h3 id="phase-b-the-edit-loop">Phase B: The Edit Loop</h3>
<p>This is the core. The agent edits a single file - <code>kernel.py</code> - and a fixed benchmark harness runs each change through five correctness checks before measuring performance. If the new kernel is faster and correct, the change is kept. Otherwise it is reverted. Each iteration takes roughly 90 seconds.</p>
<p>At that rate you get about 40 experiments per hour, or 300-400 over an overnight run. The agent keeps working on a kernel until one of four conditions triggers a move-on: five consecutive reverts, 90% peak GPU utilization, two hours elapsed, or a 2x speedup reached.</p>
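In pseudocode terms, the loop and its four move-on conditions boil down to something like this (a minimal mock; function names are ours, not AutoKernel's actual API):

```python
import time

# Sketch of the edit-benchmark-revert loop described above. propose_edit() and
# benchmark() stand in for the real LLM agent and the fixed harness.

def optimize_kernel(baseline_ms, propose_edit, benchmark, utilization,
                    max_hours=2.0, target_speedup=2.0, max_reverts=5):
    best_ms = baseline_ms
    consecutive_reverts = 0
    deadline = time.monotonic() + max_hours * 3600
    while True:
        candidate = propose_edit()
        ok, candidate_ms = benchmark(candidate)  # correctness checks, then timing
        if ok and candidate_ms < best_ms:
            best_ms = candidate_ms               # keep the edit
            consecutive_reverts = 0
        else:
            consecutive_reverts += 1             # revert kernel.py
        # Four move-on conditions: reverts, utilization, time budget, speedup.
        if (consecutive_reverts >= max_reverts
                or utilization() >= 0.90
                or time.monotonic() >= deadline
                or baseline_ms / best_ms >= target_speedup):
            return best_ms
```

The key property is that the harness is outside the loop body: the agent only ever touches the kernel, never its own eval.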
<p>The entire strategy the agent draws from is encoded in <code>program.md</code>, a 909-line document that RightNow calls the &quot;research org code.&quot; It's basically a ranked playbook of GPU optimization techniques across six tiers:</p>
<ol>
<li><strong>Block size tuning</strong> (10-50% gains) - power-of-2 tile dimensions, warp counts, pipeline stages</li>
<li><strong>Memory access</strong> (10-30%) - coalesced loads, software prefetching, L2 swizzling, shared memory padding</li>
<li><strong>Compute</strong> (5-15%) - TF32 accumulation, epilogue fusion, loop invariant hoisting</li>
<li><strong>Advanced</strong> (5-20%) - split-K, persistent kernels, Triton autotune, warp specialization</li>
<li><strong>Architecture-specific</strong> (5-15%) - TMA on Hopper, cp.async on Ampere</li>
<li><strong>Kernel-specific</strong> - online softmax for attention, Welford's algorithm for normalization</li>
</ol>
<p>The agent reads from this document, applies a change, sees the benchmark result, and adapts. It isn't doing anything architecturally exotic - but the playbook, which encodes real expert knowledge, is what makes it work better than naive sampling.</p>
<h3 id="phase-c-verification">Phase C: Verification</h3>
<p>After optimization completes, AutoKernel runs end-to-end correctness and speedup validation against the original model. Everything is logged to a plain <code>results.tsv</code> capturing experiment number, throughput in TFLOPS or GB/s, speedup, correctness status, and VRAM usage.</p>
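The TSV log is simple enough to sketch with the standard library (column names are illustrative, inferred from the fields listed above - not AutoKernel's exact schema):

```python
import csv
import io

# Sketch of a results.tsv writer covering the fields the article lists.
# Column names are illustrative, not AutoKernel's exact schema.
FIELDS = ["experiment", "throughput", "unit", "speedup", "correct", "vram_gb"]

def log_result(fh, row: dict) -> None:
    csv.DictWriter(fh, fieldnames=FIELDS, delimiter="\t").writerow(row)

buf = io.StringIO()
csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t").writeheader()
log_result(buf, {"experiment": 137, "throughput": 912.4, "unit": "GB/s",
                 "speedup": 2.83, "correct": True, "vram_gb": 3.1})
# buf.getvalue() now holds a header line plus one tab-separated record.
```

A flat TSV is a deliberate choice here - every overnight experiment stays greppable and diffable without any tooling.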
<p><img src="/images/news/autokernel-open-source-gpu-kernel-agent-hardware.jpg" alt="NVIDIA H100 GPU used for AutoKernel benchmark runs">
<em>The AutoKernel benchmark suite was run on an NVIDIA H100, though the framework supports A100, L40S, and AMD MI300X/MI350X targets as well.</em>
<small>Source: commons.wikimedia.org</small></p>
<h2 id="getting-started">Getting Started</h2>
<p>Running AutoKernel against a model is a few lines:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">git clone https://github.com/RightNow-AI/autokernel
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> autokernel
</span></span><span class="line"><span class="cl">pip install -r requirements.txt
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">autokernel</span> <span class="kn">import</span> <span class="n">optimize</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Point at any PyTorch model</span>
</span></span><span class="line"><span class="cl"><span class="n">optimize</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="o">=</span><span class="n">your_model</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">sample_inputs</span><span class="o">=</span><span class="n">sample_inputs</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">backend</span><span class="o">=</span><span class="s2">&#34;triton&#34;</span><span class="p">,</span>       <span class="c1"># or &#34;cuda&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">target_gpu</span><span class="o">=</span><span class="s2">&#34;H100&#34;</span><span class="p">,</span>      <span class="c1"># auto-detected if omitted</span>
</span></span><span class="line"><span class="cl">    <span class="n">budget_hours</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>         <span class="c1"># overnight run</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><p>The framework handles profiling, kernel extraction, the agent loop, and validation. Outputs land in an <code>optimized_kernels/</code> directory with drop-in replacements for the original PyTorch ops.</p>
<h2 id="benchmark-results">Benchmark Results</h2>
<p>Tested on an H100 against PyTorch 2.x eager mode and <code>torch.compile</code> with max-autotune:</p>
<table>
  <thead>
      <tr>
          <th>Kernel</th>
          <th>Size</th>
          <th>vs PyTorch Eager</th>
          <th>vs torch.compile</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSNorm</td>
          <td>8192 x 8192</td>
          <td>5.29x</td>
          <td>2.83x</td>
      </tr>
      <tr>
          <td>Softmax</td>
          <td>8192 x 8192</td>
          <td>2.82x</td>
          <td>3.44x</td>
      </tr>
      <tr>
          <td>Cross-Entropy</td>
          <td>8192 x 32k vocab</td>
          <td>2.21x</td>
          <td>2.94x</td>
      </tr>
      <tr>
          <td>LayerNorm</td>
          <td>8192 x 4096</td>
          <td>1.25x</td>
          <td>3.21x</td>
      </tr>
  </tbody>
</table>
<p>AutoKernel beats <code>torch.compile</code> on 12 of 16 configurations tested. Memory-bound operations - normalization, reduction, loss kernels - see the biggest gains because AutoKernel's loop can tune memory access patterns more aggressively than the one-shot compiler heuristics. The framework also claimed first place on a community vector sum reduction benchmark on B200, hitting 44.086 microseconds.</p>
<p><img src="/images/news/autokernel-open-source-gpu-kernel-agent-benchmark.jpg" alt="AutoKernel benchmark progress chart showing per-kernel speedup over experiment iterations">
<em>AutoKernel's progress.png from the GitHub README shows how speedup builds across the agent's iterative experiment loop.</em>
<small>Source: github.com/RightNow-AI/autokernel</small></p>
<h2 id="hardware-and-compatibility">Hardware and Compatibility</h2>
<table>
  <thead>
      <tr>
          <th>Requirement</th>
          <th>Detail</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Python</td>
          <td>3.9+</td>
      </tr>
      <tr>
          <td>PyTorch</td>
          <td>2.0+</td>
      </tr>
      <tr>
          <td>Triton backend</td>
          <td>NVIDIA H100, A100, L40S, B200; AMD MI300X, MI350X; RTX 4090</td>
      </tr>
      <tr>
          <td>CUDA backend</td>
          <td>Any CUDA 11.8+ GPU</td>
      </tr>
      <tr>
          <td>LLM API</td>
          <td>OpenAI, Anthropic, or local model with API compatibility</td>
      </tr>
      <tr>
          <td>Minimum VRAM</td>
          <td>16 GB recommended (40 GB+ for large model runs)</td>
      </tr>
      <tr>
          <td>License</td>
          <td>MIT (code), CC BY 4.0 (paper)</td>
      </tr>
  </tbody>
</table>
<div class="pull-quote">
<p>The benchmark harness is fixed. Only the kernel changes. That constraint is what makes the loop trustworthy - the agent can't cheat its own eval.</p>
</div>
<h2 id="how-it-compares">How It Compares</h2>
<p>This isn't the first system to apply AI-driven iteration to GPU code. Google's AlphaEvolve hit 23% acceleration on specific kernels and reportedly cut 1% off Gemini training time. The difference is that AlphaEvolve is closed, internal to Google, and not configurable for arbitrary models. AutoKernel is MIT-licensed and runs against whatever you point it at.</p>
<p>Meta's <a href="/news/meta-kernelevolve-agentic-kernel-optimization/">KernelEvolve</a> took a similar agentic approach but was built for Meta's internal production kernels and isn't publicly available. The <a href="/news/autoagent-self-optimizing-harness/">AutoAgent framework from MIT</a> applied self-optimizing loops to agent orchestration; AutoKernel brings the same concept down to the kernel level.</p>
<p>Andrej Karpathy's autoresearch concept - autonomous loops running overnight experiments - is clearly an intellectual ancestor here. RightNow AI is applying that pattern to a domain where iteration has historically required years of human expertise.</p>
<h2 id="where-it-falls-short">Where It Falls Short</h2>
<p>The honest assessment is that matmul performance is a weak spot. AutoKernel's Triton starter reaches 278 TFLOPS on H100 against cuBLAS at 989.5 TFLOPS - roughly 28% of peak. On compute-bound workloads where cuBLAS and CUTLASS are dominant, the framework currently doesn't come close.</p>
<p>Hacker News commenters pushed back on this. User <code>aviinuo</code> noted that for 4kx4kx4k FP16 matmul, &quot;cutlass is like 3x faster than this.&quot; User <code>ademeure</code> flagged inconsistency in the matmul benchmark claim, pointing out that a reported 18.9% peak use on H100 doesn't square with the claimed cuBLAS comparison numbers.</p>
<p>The framework also has hard scope limits: single-GPU only, no support for distributed kernels or multi-device memory management, and code generation limits mean the agent can't yet handle complex techniques like software pipelining or custom PTX emission.</p>
<p>At 40 experiments per hour, a difficult kernel may need multiple overnight runs to converge. RightNow AI is transparent about all of this in the paper.</p>
<p><img src="/images/news/autokernel-open-source-gpu-kernel-agent-servers.jpg" alt="GPU server rack representing the compute infrastructure needed for kernel optimization at scale">
<em>AutoKernel is designed for single-GPU optimization; multi-GPU distributed kernel support is on the roadmap.</em>
<small>Source: commons.wikimedia.org</small></p>
<h2 id="what-to-watch">What To Watch</h2>
<p>RightNow AI is an NVIDIA Inception Program member, and AutoKernel is closely integrated with their commercial AI code editor for CUDA/Triton development (free tier, $20/month Pro). The open-source release follows a pattern of using community visibility to drive paid-product adoption - the framework shows the technology, the editor is where you use it day-to-day.</p>
<p>The interesting question is whether the <code>program.md</code> playbook gets contributed to by the community over time. The six-tier optimization hierarchy is the intellectual core of the system. Opening it up to community improvements could make the agent meaningfully more capable without requiring changes to the loop architecture.</p>
<p>For teams running large training runs who already know their bottleneck kernels, this is worth testing today. For everyone else, the &quot;go to sleep&quot; pitch is real - the system runs unattended and gives you something concrete in the morning, even if it won't compete with cuBLAS on every workload.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://github.com/RightNow-AI/autokernel">AutoKernel GitHub repository</a></li>
<li><a href="https://arxiv.org/html/2603.21331v1">arXiv paper 2603.21331</a></li>
<li><a href="https://www.marktechpost.com/2026/04/06/rightnow-ai-releases-autokernel-an-open-source-framework-that-applies-an-autonomous-agent-loop-to-gpu-kernel-optimization-for-arbitrary-pytorch-models/">MarkTechPost coverage</a></li>
<li><a href="https://forums.developer.nvidia.com/t/autokernel-autoresearch-for-kernel-optimization/363215">NVIDIA Developer Forums thread</a></li>
</ul>
]]></content:encoded><dc:creator>Sophie Zhang</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/news/autokernel-open-source-gpu-kernel-agent_hu_8bb2b35a132603d8.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/news/autokernel-open-source-gpu-kernel-agent_hu_8bb2b35a132603d8.jpg" width="1200" height="675"/></item><item><title>Meta&amp;#39;s KernelEvolve Automates Kernel Tuning in Production</title><link>https://awesomeagents.ai/news/meta-kernelevolve-agentic-kernel-optimization/</link><pubDate>Mon, 06 Apr 2026 14:01:14 +0200</pubDate><guid>https://awesomeagents.ai/news/meta-kernelevolve-agentic-kernel-optimization/</guid><description><![CDATA[<div class="podcast-embed">
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/4OUDN41mu0FYtNdM1W2qLk?utm_source=generator&theme=0" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
</div>
<p>Meta published the full technical breakdown of KernelEvolve on April 2 - an AI agent that creates and improves low-level hardware kernels at production scale across NVIDIA GPUs, AMD GPUs, and Meta's custom MTIA silicon. The numbers are real: over 60% inference throughput improvement on their Andromeda ads model, over 25% training throughput gains on MTIA, and a 100% pass rate on KernelBench's 250 benchmark problems. The system has been running in production, serving trillions of daily inference requests.</p>]]></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div class="podcast-embed">
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/4OUDN41mu0FYtNdM1W2qLk?utm_source=generator&theme=0" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
</div>
<p>Meta published the full technical breakdown of KernelEvolve on April 2 - an AI agent that creates and improves low-level hardware kernels at production scale across NVIDIA GPUs, AMD GPUs, and Meta's custom MTIA silicon. The numbers are real: over 60% inference throughput improvement on their Andromeda ads model, over 25% training throughput gains on MTIA, and a 100% pass rate on KernelBench's 250 benchmark problems. The system has been running in production, serving trillions of daily inference requests.</p>
<div class="news-tldr">
<p><strong>Key Stats</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Inference gain (NVIDIA, Andromeda)</td>
          <td>60%+ vs torch.compile baseline</td>
      </tr>
      <tr>
          <td>Training gain (MTIA ads model)</td>
          <td>25%+</td>
      </tr>
      <tr>
          <td>KernelBench pass rate</td>
          <td>100% (250/250 problems)</td>
      </tr>
      <tr>
          <td>Operators validated</td>
          <td>160 PyTorch ATen ops across 3 platforms</td>
      </tr>
      <tr>
          <td>Hardware coverage</td>
          <td>NVIDIA GPUs, AMD GPUs, MTIA 300-500</td>
      </tr>
      <tr>
          <td>Output languages</td>
          <td>Triton, CuTe DSL, FlyDSL, CUDA, HIP, MTIA C++</td>
      </tr>
      <tr>
          <td>Paper</td>
          <td>ISCA 2026 (arXiv:2512.23236)</td>
      </tr>
  </tbody>
</table>
</div>
<h2 id="the-problem-kernelevolve-solves">The Problem KernelEvolve Solves</h2>
<p>Vendor libraries like cuBLAS cover standard operations well - GEMMs, convolutions, pooling. Meta's ads ranking stack uses a long tail of custom operators that exist in no vendor library: feature hashing, sequence truncation, fused feature interactions, specialized attention variants for ranking. Every new model architecture adds more.</p>
<p>The combinatorial math gets out of hand fast. The number of unique kernel configurations scales as hardware variants × model architectures × operator types. Meta is shipping four MTIA chip generations in two years - the 300 through 500 series. Running the old playbook of hand-tuned kernels from expert engineers doesn't work at that pace.</p>
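As a rough worked example of that scaling (the 160-operator, three-platform figures come from the paper; the architecture count is hypothetical):

```python
# Kernel configurations scale multiplicatively. The paper's 160 ATen operators
# across 3 hardware platforms already give 480 configurations; fold in a
# (hypothetical) count of model-architecture variants and it multiplies again.
operators, platforms = 160, 3
configs = operators * platforms            # 480 operator/platform configs
architectures = 12                         # hypothetical model-variant count
total = configs * architectures            # 5760 - beyond hand-tuning
print(configs, total)
```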
<h3 id="why-not-just-use-torchcompile">Why Not Just Use torch.compile?</h3>
<p><code>torch.compile</code> covers the general case reliably. But for Meta's custom operators on MTIA silicon, there's no compiler support, no vendor library, and the chip's architecture wasn't in any model's training data. The 60% NVIDIA inference gain is measured against a baseline that already uses torch.compile and vendor-provided kernels - KernelEvolve finds performance on top of what standard tooling can reach. That's a harder comparison point than most systems report.</p>
<p>Separately, ByteDance's CUDA Agent (which <a href="/news/cuda-agent-bytedance-kernel-generation/">we covered in March</a>) attacks a narrower version of the same problem using reinforcement learning trained on GPU profiling trajectories. CUDA Agent achieves a 2.11x improvement over torch.compile on their target workloads. KernelEvolve trades that focused benchmark performance for broader hardware coverage and the ability to target proprietary silicon that has never appeared in public training data.</p>
<h2 id="architecture-teardown">Architecture Teardown</h2>
<p>KernelEvolve has six components designed to compound - each one improves what the others can find.</p>
<h3 id="llm-synthesizer">LLM Synthesizer</h3>
<p>The synthesizer produces candidate kernels across multiple languages: Triton and TLX for high-level GPU work, CuTe DSL and FlyDSL for closer-to-metal GPU operations, CUDA C++ and HIP for vendor-specific paths, and MTIA C++ for Meta's in-house chips. Prompts are dynamically constructed with runtime diagnostics and performance feedback from prior candidates in the same search session - not static templates.</p>
<p>A representative Triton kernel structure (illustrative of the system's output format, based on the published paper description) shows how KernelEvolve handles tiling and memory access for a fused operation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">triton</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">triton.language</span> <span class="k">as</span> <span class="nn">tl</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nd">@triton.jit</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">fused_feature_kernel</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">x_ptr</span><span class="p">,</span> <span class="n">w_ptr</span><span class="p">,</span> <span class="n">out_ptr</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">M</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">stride_xm</span><span class="p">,</span> <span class="n">stride_xk</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">stride_wk</span><span class="p">,</span> <span class="n">stride_wn</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">BLOCK_M</span><span class="p">:</span> <span class="n">tl</span><span class="o">.</span><span class="n">constexpr</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">BLOCK_N</span><span class="p">:</span> <span class="n">tl</span><span class="o">.</span><span class="n">constexpr</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">BLOCK_K</span><span class="p">:</span> <span class="n">tl</span><span class="o">.</span><span class="n">constexpr</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Tile IDs determined by tree search based on profiler feedback</span>
</span></span><span class="line"><span class="cl">    <span class="n">pid_m</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">program_id</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">pid_n</span> <span class="o">=</span> <span class="n">tl</span><span class="o">.</span><span class="n">program_id</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">offsets_m</span> <span class="o">=</span> <span class="n">pid_m</span> <span class="o">*</span> <span class="n">BLOCK_M</span> <span class="o">+</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_M</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">offsets_n</span> <span class="o">=</span> <span class="n">pid_n</span> <span class="o">*</span> <span class="n">BLOCK_N</span> <span class="o">+</span> <span class="n">tl</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">BLOCK_N</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Block sizes (BLOCK_M=128, BLOCK_N=64) are not hand-picked -</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># the tree search engine selects them from profiler diagnostics</span>
</span></span></code></pre></div><p>The synthesizer creates the kernel structure. The block sizes and memory access patterns are chosen by the tree search engine based on hardware profiler output.</p>
<h3 id="tree-search-engine">Tree Search Engine</h3>
<p>This is the component that separates KernelEvolve from one-shot code generation. It uses Monte Carlo tree search combined with evolutionary strategies. Each node in the tree carries a &quot;configurable memory&quot; controlling how it uses prior knowledge: inherit the parent trajectory, compare against sibling candidates, combine insights from both, or restart completely to escape a local optimum.</p>
<p>The search engine doesn't just observe that kernel A runs 1.2× faster than kernel B. It sees <em>why</em> - whether the bottleneck is memory bandwidth, compute throughput, or occupancy limits - and directs the next generation of candidates accordingly. The evaluation framework provides structured diagnostics, not just timing numbers.</p>
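A toy sketch of the per-node "configurable memory" idea (class and mode names are ours, not Meta's; the real engine wraps this in MCTS plus evolutionary selection):

```python
# Toy sketch: each search node's "memory mode" controls what prior context the
# synthesizer prompt for the next candidate generation sees. Illustrative only.
MODES = ("inherit", "compare_siblings", "combine", "restart")

class SearchNode:
    def __init__(self, kernel_src, parent=None, mode="inherit"):
        self.kernel_src = kernel_src
        self.parent = parent
        self.mode = mode
        self.children = []

    def context_for_next_candidate(self):
        """Assemble the prior knowledge the next synthesis prompt includes."""
        if self.mode == "restart":
            return []                               # escape a local optimum
        ctx = [self.kernel_src]
        if self.mode in ("inherit", "combine") and self.parent:
            ctx.append(self.parent.kernel_src)      # parent trajectory
        if self.mode in ("compare_siblings", "combine") and self.parent:
            ctx.extend(s.kernel_src for s in self.parent.children if s is not self)
        return ctx

root = SearchNode("v0")
a = SearchNode("v1a", parent=root)
b = SearchNode("v1b", parent=root, mode="combine")
root.children = [a, b]
# b's next prompt carries its own source, the parent v0, and sibling v1a.
```

The "restart" mode is the interesting one: deliberately discarding context is what lets the search escape a locally optimal kernel.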
<p><img src="/images/news/meta-kernelevolve-agentic-kernel-optimization-treesearch.jpg" alt="KernelEvolve tree search diagram showing how the optimization engine navigates the kernel space">
<em>The tree search engine navigates the optimization space by combining MCTS with evolutionary strategies. Each node carries configurable memory that can inherit from parents, compare siblings, or restart to escape local optima.</em>
<small>Source: engineering.fb.com</small></p>
<h3 id="retrieval-augmented-knowledge-base">Retrieval-Augmented Knowledge Base</h3>
<p>This is how KernelEvolve handles hardware it has never seen. Meta encodes MTIA's architecture manuals, memory hierarchy specifications, instruction sets, and optimization patterns into a hierarchical knowledge base. The LLM queries it at inference time - so it can write MTIA C++ code for a chip that postdates its training cutoff, by injecting the chip's documentation as context.</p>
<p>The paper describes a compounding effect the authors call &quot;in-context reinforcement learning&quot;: successful optimizations are written back into the knowledge base as reusable skills. Early sessions explore the hardest problems; later sessions start from much better priors. It's closer to a curriculum memory system than RL in any technical sense, but the practical result is the same - the system improves without retraining.</p>
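The write-back pattern itself is straightforward to sketch (a minimal retrieval store; keyword-overlap scoring is our stand-in for whatever hierarchical, likely embedding-based retrieval Meta actually uses):

```python
# Minimal sketch of a knowledge base with skill write-back. Keyword-overlap
# scoring is an assumption; the real system is hierarchical and proprietary.
class KnowledgeBase:
    def __init__(self, docs):
        self.entries = list(docs)      # seeded with hardware manuals, specs, etc.

    def query(self, task, k=2):
        words = set(task.lower().split())
        return sorted(self.entries,
                      key=lambda e: len(words & set(e.lower().split())),
                      reverse=True)[:k]

    def write_back(self, skill):
        """Successful optimizations become reusable context for later sessions."""
        self.entries.append(skill)

kb = KnowledgeBase(["MTIA memory hierarchy: local SRAM per PE, shared memory"])
kb.write_back("skill: tile reduction to fit MTIA local SRAM for layernorm")
hits = kb.query("optimize layernorm for MTIA SRAM")
# Later sessions start from better priors - no retraining involved.
```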
<h3 id="automated-evaluation-framework">Automated Evaluation Framework</h3>
<p>Every candidate kernel gets validated on two dimensions: bitwise correctness against a reference implementation, and actual performance on real hardware. The framework uses TritonBench for Triton kernels, PyTorch Profiler and NCU for NVIDIA paths, Proton for instruction-level latency on NVIDIA, and MTIA Insight for Meta's custom silicon. The search engine gets structured diagnostics - not just throughput numbers but memory-bound vs compute-bound classification - which feeds directly into the next search iteration.</p>
<p>Meta validated 160 PyTorch ATen operators across three hardware platforms, producing 480 unique configurations. All passed correctness checks.</p>
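The memory-bound vs compute-bound call is a roofline-style classification, sketchable as (the helper and the hardware figures below are illustrative H100-class numbers, not from the paper):

```python
# Roofline-style classifier: a kernel whose arithmetic intensity (FLOPs per
# byte moved) falls below the hardware "ridge point" is memory-bound.
# Peak numbers below are illustrative H100-class figures.
def classify(flops: float, bytes_moved: float,
             peak_tflops: float = 989.0, peak_bw_tbs: float = 3.35) -> str:
    ridge = (peak_tflops * 1e12) / (peak_bw_tbs * 1e12)  # FLOPs per byte
    intensity = flops / bytes_moved
    return "compute-bound" if intensity >= ridge else "memory-bound"

# A softmax-like kernel: few FLOPs per byte -> memory-bound.
print(classify(flops=2e9, bytes_moved=1e9))
# A large matmul: ~1000 FLOPs per byte -> compute-bound.
print(classify(flops=1e12, bytes_moved=1e9))
```

Feeding this label back to the search engine is what lets it pick the right tier of technique - memory tricks for bandwidth-limited kernels, tiling and pipelining for compute-limited ones.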
<p><img src="/images/news/meta-kernelevolve-agentic-kernel-optimization-pipeline.jpg" alt="KernelEvolve pipeline diagram showing how a kernel optimization request flows through all six system components">
<em>How a kernel optimization request flows through KernelEvolve's six components, from initial synthesis through evaluation, profiling, and knowledge base update.</em>
<small>Source: engineering.fb.com</small></p>
<h3 id="shared-data-foundation-and-agentic-rl">Shared Data Foundation and Agentic RL</h3>
<p>Every session contributes its findings back to a shared pool available to future sessions. The system also generates structured training data from optimization trajectories - code transformations paired with evaluation feedback - used to post-train smaller specialized models with kernel performance as the reward signal. The flywheel is real: the system gets better over time without any human intervention.</p>
<h2 id="where-kernelevolve-fits-in-metas-infrastructure">Where KernelEvolve Fits in Meta's Infrastructure</h2>
<p>KernelEvolve doesn't operate alone. It's the hardware execution layer of Meta's broader Ranking Engineer Agent (REA) system, described in a <a href="https://engineering.fb.com/2026/03/17/developer-tools/ranking-engineer-agent-rea-autonomous-ai-system-accelerating-meta-ads-ranking-innovation/">March 17 engineering post from Meta</a>. REA operates at the ML model layer, autonomously discovering better ranking model architectures through hypothesis generation and experiment management - three engineers using REA delivered improvement proposals for eight models, work that historically required two engineers per model. KernelEvolve then handles the hardware execution layer: once REA's exploration surfaces better models, KernelEvolve produces optimized kernels to run them in production.</p>
<div class="pull-quote">
<p>When a new chip arrives, the engineering cost shifts from writing thousands of kernels by hand to curating a set of hardware documents and injecting them into the knowledge base.</p>
</div>
<p>The two-layer architecture closes a loop that previously required substantial human coordination - ML exploration happens in REA, hardware optimization happens in KernelEvolve, and both feed findings back into shared knowledge stores that compound over time.</p>
<h2 id="hardware-and-language-compatibility">Hardware and Language Compatibility</h2>
<table>
  <thead>
      <tr>
          <th>Hardware</th>
          <th>Output Languages</th>
          <th>Profiling Tools</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NVIDIA GPUs</td>
          <td>Triton, CuTe DSL, CUDA C++</td>
          <td>NCU, Proton, TritonBench</td>
      </tr>
      <tr>
          <td>AMD GPUs</td>
          <td>HIP, Triton</td>
          <td>TritonBench, PyTorch Profiler</td>
      </tr>
      <tr>
          <td>Meta MTIA 300-500</td>
          <td>FlyDSL, MTIA C++</td>
          <td>MTIA Insight</td>
      </tr>
      <tr>
          <td>CPU</td>
          <td>C++, SIMD intrinsics</td>
          <td>Custom benchmarks</td>
      </tr>
  </tbody>
</table>
<p>For teams building similar systems, the <a href="/guides/cuda-programming-guide/">CUDA programming guide</a> covers the underlying kernel optimization concepts that KernelEvolve's synthesizer builds on.</p>
<h2 id="where-it-falls-short">Where It Falls Short</h2>
<p>KernelEvolve is Meta's internal infrastructure - the first and biggest limitation is that you can't use it. The paper is public (<a href="https://arxiv.org/abs/2512.23236">arXiv:2512.23236</a>), the architecture is documented, but the implementation - the knowledge base contents, the job harness, the MTIA integration, the internal tooling scaffolding - is all proprietary. Reproducing the NVIDIA results is theoretically feasible for a well-resourced engineering team; reproducing the MTIA gains requires hardware that isn't commercially available.</p>
<p>The 60% inference gain framing also needs some scrutiny. Meta's custom workloads have unusually high operator diversity - fused feature interactions, feature hashing, and sequence operations that vendor libraries don't support. Teams running more standard workloads dominated by GEMMs and convolutions will find less headroom, because the standard tooling already covers those cases well.</p>
<p>The &quot;in-context reinforcement learning&quot; framing in the paper is worth reading with skepticism. Writing successful optimizations into a retrieval store and improving future session quality is a sound engineering pattern. Calling it reinforcement learning clearly stretches the definition - there's no policy gradient, no reward signal, and no neural network being updated. The results are real; the terminology is chosen for impact.</p>
<hr>
<p>ISCA 2026 acceptance is the meaningful signal here. The International Symposium on Computer Architecture is a top-tier venue with serious peer review, not a workshop or company blog. The committee evaluates measurement methodology and implementation details carefully. For teams designing their own kernel optimization pipelines, the paper is worth reading - especially the tree search engine design and the hierarchical knowledge base structure. Both are applicable regardless of whether you have access to Meta's infrastructure.</p>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://engineering.fb.com/2026/04/02/developer-tools/kernelevolve-how-metas-ranking-engineer-agent-optimizes-ai-infrastructure/">KernelEvolve - Meta Engineering Blog</a></li>
<li><a href="https://engineering.fb.com/2026/03/17/developer-tools/ranking-engineer-agent-rea-autonomous-ai-system-accelerating-meta-ads-ranking-innovation/">Ranking Engineer Agent - Meta Engineering Blog</a></li>
<li><a href="https://arxiv.org/abs/2512.23236">KernelEvolve arXiv paper (arXiv:2512.23236)</a></li>
</ul>
]]></content:encoded><dc:creator>Sophie Zhang</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/news/meta-kernelevolve-agentic-kernel-optimization_hu_f08c8374f24eba0b.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/news/meta-kernelevolve-agentic-kernel-optimization_hu_f08c8374f24eba0b.jpg" width="1200" height="675"/></item></channel></rss>