Shopify CEO Uses AI Agent to Make Liquid 53% Faster

Tobi Lütke ran Karpathy's autoresearch loop against the Liquid templating engine he created 20 years ago, producing 93 commits from 120 experiments that cut parse+render time by 53% and allocations by 61%.

Tobi Lütke, CEO of the $150 billion Shopify, opened a pull request this week against Liquid - the Ruby template engine he created, which powers rendering across Shopify's 5.6 million active stores - and let an AI agent do the optimization work. The result: Liquid now parses and renders 53% faster with 61% fewer object allocations.

TL;DR

  • Lütke ran ~120 automated experiments using pi-autoresearch, producing 93 commits against Shopify's Liquid engine
  • Combined parse+render time dropped from 7,469 microseconds to 3,534 microseconds (-53%)
  • Object allocations fell from 62,620 to 24,530 (-61%), which matters because GC consumes 74% of total CPU time
  • All 974 unit tests pass with zero regressions
  • PR is open at github.com/Shopify/liquid/pull/2056

How It Works

The approach adapts Andrej Karpathy's autoresearch - a 630-line tool released March 6 that lets AI agents run autonomous experiments in a loop. Karpathy built it for ML hyperparameter tuning. Lütke's variant, pi-autoresearch (developed with David Cortés), applies the same loop to software performance optimization.

The cycle is simple:

Edit code → Commit → Run 974 unit tests → Benchmark → Keep or discard → Repeat

Lütke used Pi as the coding agent. The system maintains state in an autoresearch.jsonl file, tracking which experiments worked and which didn't. Over roughly 120 iterations, it produced 93 commits that survived the keep/discard filter. The branch name tells the story: autoresearch/liquid-perf-2026-03-11.
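Under stated assumptions, the keep/discard loop can be sketched in a few lines of Ruby. pi-autoresearch itself is not published, so the class name, the experiment interface, and the JSONL fields below are hypothetical stand-ins for the pattern the article describes:

```ruby
require "json"

# Minimal sketch of an autoresearch-style keep/discard loop.
# Each experiment is a callable returning [tests_passed, benchmark_us];
# a change is kept only when the tests pass AND the metric improved,
# and every attempt is appended to a JSONL state file.
class ExperimentLoop
  def initialize(log_path)
    @log_path = log_path
    @best = Float::INFINITY
  end

  def run(experiments)
    kept = []
    experiments.each_with_index do |exp, i|
      passed, micros = exp.call
      keep = passed && micros < @best
      @best = micros if keep
      kept << i if keep
      File.open(@log_path, "a") do |f|
        f.puts JSON.generate(id: i, passed: passed, us: micros, kept: keep)
      end
    end
    kept
  end
end
```

A run over a list of candidate edits would then look like `ExperimentLoop.new("autoresearch.jsonl").run(experiments)`, with only the surviving indices becoming commits.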

The benchmarks ran on Ruby 3.4 with YJIT enabled, using real Shopify theme templates with production-like data. Each run: 20-iteration warmup, 10 measured iterations with GC disabled, best of 3 runs.
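A minimal sketch of that measurement protocol, with the rendered workload passed in as a block; the PR does not publish the actual harness, so this is an assumption-laden reconstruction of "20-iteration warmup, 10 measured iterations with GC disabled, best of 3 runs":

```ruby
# Hedged sketch of the benchmark protocol described above.
def benchmark_best_of(runs: 3, warmup: 20, iters: 10)
  (1..runs).map do
    warmup.times { yield }           # let YJIT warm up and compile
    GC.start                         # start each run from a clean heap
    GC.disable                       # exclude GC pauses from the timing
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC, :microsecond)
    iters.times { yield }
    t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC, :microsecond)
    GC.enable
    (t1 - t0) / iters.to_f           # mean microseconds per iteration
  end.min                            # best of the runs
end
```

Usage would be something like `benchmark_best_of { template.render(assigns) }`, returning microseconds per parse+render.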

What Changed

Metric | Before | After | Improvement
Combined parse+render | 7,469 µs | 3,534 µs | -53%
Parse time | 6,031 µs | 2,353 µs | -61%
Render time | 1,438 µs | 1,146 µs | -20%
Object allocations | 62,620 | 24,530 | -61%

The allocation number is the one that matters most at scale. Lütke noted that garbage collection consumes 74% of total CPU time in Liquid rendering, so every avoided allocation has an outsized effect on wall-clock performance.

Parse Optimizations (61% faster)

The biggest single win: replacing the StringScanner-based tokenizer with String#byteindex for finding {% and {{ delimiters. Single-byte byteindex searching runs about 40% faster than regex-based skip_until, cutting parse time by 12% on its own.
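As an illustration of that delimiter scan - assuming Ruby 3.2+, where String#byteindex was introduced - the core idea is to jump straight to the next {{ or {% without any regex machinery. This is a simplified sketch, not Liquid's actual tokenizer:

```ruby
# Find the byte offset of the next Liquid delimiter ({{ or {%) at or
# after `from`, or nil when the rest of the template is plain text.
def next_tag_start(source, from = 0)
  var = source.byteindex("{{", from)  # next output tag
  tag = source.byteindex("{%", from)  # next logic tag
  return var unless tag
  return tag unless var
  var < tag ? var : tag               # whichever comes first
end
```

Everything before the returned offset can be emitted as a raw text node, which is why avoiding regex dispatch on this hot path pays off.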

Other parse improvements:

  • A new Cursor class wraps StringScanner with methods optimized for Liquid grammar, reusing a single instance per ParseContext
  • Removed costly StringScanner#string= resets that were called 878 times per template
  • Zero-lexer variable parsing: 100% of variables in the benchmark (1,197 instances) now parse via a byte-level fast path without touching the Lexer or Parser
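A hedged sketch of what such a byte-level fast path might look like; the predicate and helper names here are hypothetical, not Liquid's actual API. The idea is to accept only dotted identifiers and punt anything fancier (filters, brackets, quotes) to the full Lexer/Parser:

```ruby
# True when the body contains only identifier bytes and dots, i.e. a
# simple lookup like "product.title" with no filters or subscripts.
def simple_variable?(body)
  return false if body.empty?
  body.each_byte do |b|
    next if b >= 97 && b <= 122        # a-z
    next if b >= 65 && b <= 90         # A-Z
    next if b >= 48 && b <= 57         # 0-9
    next if b == 95 || b == 46         # _ .
    return false
  end
  true
end

# Fast path: split the lookup into path segments; nil means "fall back
# to the full Lexer/Parser pipeline".
def parse_variable_fast(body)
  stripped = body.strip
  return nil unless simple_variable?(stripped)
  stripped.split(".")
end
```

The 100% hit rate quoted above suggests the benchmark templates use only such simple lookups, which is also why the overall gains may be somewhat overfit.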

Render Optimizations (20% faster)

  • Splat-free filter invocation via invoke_single/invoke_two handles 90% of filter calls
  • Pre-computed frozen strings for integers 0-999 avoid 267 Integer#to_s allocations per render
  • Fast paths for primitive types (String, Integer, Float, Array, Hash, nil) skip unnecessary method dispatch
  • Lazy initialization: Context defers StringScanner creation; Registers defers the @changes hash until first write
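The integer-string table is easy to picture; this is a minimal sketch under stated assumptions (INT_STRINGS and int_to_output are illustrative names, not Liquid's):

```ruby
# Frozen strings for 0-999, built once at load time, so rendering a
# small integer reuses one shared object instead of allocating a new
# String via Integer#to_s on every render.
INT_STRINGS = (0..999).map { |i| i.to_s.freeze }.freeze

def int_to_output(n)
  n >= 0 && n <= 999 ? INT_STRINGS[n] : n.to_s  # fall back out of range
end
```

Because the table entries are frozen and shared, repeated renders of the same small integer return the identical object, which is where the "267 Integer#to_s allocations per render" savings come from.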

The autoresearch loop automates the edit-test-benchmark cycle that performance engineers normally run manually over days or weeks.

What Didn't Work

The PR documents its failures, which is as instructive as the wins:

  • Split-based tokenizer: 2.5x faster but broke compatibility with {{ to %} nesting
  • Tag name interning via byte-based perfect hash: Collision overhead canceled out gains
  • String#match for name extraction: Created 5,000+ extra MatchData allocations
  • While loops in hot render paths: YJIT actually optimizes Array#each better for many-iteration loops
  • Shared expression cache: State leakage and unbounded memory growth
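The MatchData point is straightforward to verify with GC.stat; here is a small sketch comparing String#match against the allocation-free Regexp#match? (exact counts vary by Ruby version, and this is an illustration of the failure mode, not code from the PR):

```ruby
# Count objects allocated while running a block, with GC disabled so
# nothing is collected mid-measurement.
def allocation_delta
  GC.disable
  before = GC.stat(:total_allocated_objects)
  yield
  delta = GC.stat(:total_allocated_objects) - before
  GC.enable
  delta
end

NAME_RE = /[a-z_]+/

# #match builds a MatchData per call; #match? returns a bare boolean.
with_matchdata = allocation_delta { 1_000.times { "price".match(NAME_RE) } }
matchdata_free = allocation_delta { 1_000.times { "price".match?(NAME_RE) } }
```

On a hot parse path run thousands of times per template, that per-call MatchData is exactly the kind of hidden allocation the loop learned to avoid.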

The failed experiments show why the automated loop matters. A human engineer would likely have tried the split-based tokenizer, found it 2.5x faster, celebrated, and then spent a day debugging the edge case failures. The autoresearch loop caught the regression in the test suite and moved on in seconds.

The Karpathy Connection

Karpathy released autoresearch on March 6 as a weekend project for autonomous ML experimentation. It hit 30,000 GitHub stars in its first week. Lütke was among the first to apply the pattern outside ML - using it for general software performance work instead of hyperparameter tuning.

"OK, well. I ran /autoresearch on the liquid codebase. 53% faster combined parse+render time, 61% fewer object allocations. This is probably somewhat overfit, but there are absolutely amazing ideas in this." - Tobi Lütke on X

Simon Willison, who documented the PR in detail, highlighted how coding agents enable executives who haven't shipped code in years to make meaningful contributions again. Lütke's GitHub activity has spiked since late 2025, coinciding with the current generation of coding agents reaching production quality.

Why It Matters at Scale

Liquid processes billions of template renders daily across Shopify's platform. At that scale, a 53% parse speedup and 61% allocation reduction translate directly to lower compute costs, reduced response latency for storefronts, and less GC pressure on Ruby processes.

The PR is still open and under review. Lütke acknowledged the results "probably somewhat overfit" the benchmark - real-world gains on diverse production templates may be lower than 53%. But even half that improvement on infrastructure running at Shopify's scale would be significant.

The broader signal is what happens when an autoresearch-style loop meets a codebase with good test coverage (974 tests) and a clear performance metric (microseconds per render). The AI agent doesn't need to understand Liquid's architecture. It needs to make changes, check if tests pass, check if the number went down, and repeat. The 93 surviving commits represent the kind of systematic micro-optimization work that human engineers rarely have time to do at this exhaustiveness.


A $150 billion CEO opened a GitHub PR against a 20-year-old Ruby gem, let an AI agent run 120 experiments overnight, and shipped a 53% speedup that passed every test. The autoresearch pattern works best when the problem has a tight feedback loop: clear metric, fast tests, atomic changes. Template engine performance fits that profile exactly. The question is how many other codebases with good test suites and measurable performance targets are sitting on similar gains - waiting for someone to point an agent at them and walk away.

About the author

Sophie, AI Infrastructure & Open Source Reporter, is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.