Percepta Builds a Computer Inside a Transformer

Percepta AI compiled a WebAssembly interpreter into transformer weights, executing programs deterministically at 33K tokens/sec on CPU - but the community is skeptical about the practical value.

Percepta AI claimed to build a computer inside a transformer. Not a metaphor - they compiled a WebAssembly interpreter directly into the weights of an autoregressive transformer, enabling it to execute arbitrary programs token by token with 100% accuracy. The model solved the world's hardest Sudoku puzzle in under three minutes and performed multi-digit addition without a single probabilistic error.

The March 11 blog post by Christos Tzamos, published at percepta.ai, sparked immediate excitement on Hacker News. It also sparked immediate skepticism. The gap between the two reactions tells you everything about where this research stands.

TL;DR

  • Percepta compiled a WebAssembly interpreter into transformer weights - the model executes programs through its forward pass, not via external tool calls
  • 2D attention heads with HullKVCache enable O(k + log n) decoding instead of standard O(n^2) attention
  • Architecture: 7 layers, d_model=36, 18 heads (2 dimensions per head)
  • Performance: 33,000+ tokens/sec on CPU, 100% accuracy on Sudoku and arithmetic
  • Critical caveat: the weights aren't trained via gradient descent - they're compiled directly, and no training methodology is demonstrated

How It Actually Works

The Core Idea

Standard LLMs are probabilistic text generators. They predict the most likely next token given context. Percepta's approach is different: instead of training a model to approximate computation, they compile a deterministic interpreter into the transformer's weights. The model does not learn to add numbers - it executes an addition program.

The pipeline:

  1. Write a program in C
  2. Compile it to WebAssembly (WASM)
  3. Encode the WASM interpreter into the transformer's weight matrices
  4. The model's forward pass executes the program step by step, producing an execution trace token by token

Each token the model generates represents one step of program execution. The output isn't a prediction - it's a deterministic computation. The same input always produces the same output.
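The decode loop this implies can be sketched in plain Python. Everything here is illustrative (the names `run_program`, `halt_token`, and the toy `CountdownModel` are hypothetical, not Percepta's API); the point is that greedy argmax decoding over a deterministic forward pass makes generation a pure function of its input - one token per execution step:

```python
def run_program(model, program_tokens, max_steps=100):
    """Run a 'compiled' model deterministically until it emits HALT."""
    trace = list(program_tokens)              # encoded program + initial state
    for _ in range(max_steps):
        logits = model.forward(trace)         # deterministic forward pass
        token = logits.index(max(logits))     # greedy argmax: no sampling
        trace.append(token)                   # one token == one execution step
        if token == model.halt_token:
            break
    return trace                              # the full execution trace


class CountdownModel:
    """Toy stand-in 'model': deterministically counts down to 0, then halts."""
    halt_token = 0

    def forward(self, trace):
        nxt = max(trace[-1] - 1, 0)           # next step of the 'program'
        logits = [0.0] * 10
        logits[nxt] = 1.0                     # put all mass on one token
        return logits
```

Because there is no sampling anywhere in the loop, running the same program twice produces byte-identical traces.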

The Architecture

The transformer uses a compact 7-layer design with d_model=36 and 18 attention heads, giving exactly 2 dimensions per head. This is the "2D attention" that enables the key trick:

# split each lookup head's feed-forward output into a 2-D (gate, value) pair
gate, val = ff_in(x).chunk(2, dim=-1)

By restricting lookup heads to 2 dimensions, the model can perform lookups over the sequence in logarithmic time. Combined with HullKVCache - a mechanism for fast state lookups in the key-value cache - the per-token decoding cost drops from the O(n) of standard attention (O(n^2) across a full sequence) to O(k + log n), where k is the program state size and n is the sequence length.

This means the model can execute programs for millions of steps without the quadratic attention bottleneck that normally limits transformer sequence length.
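The post doesn't specify how HullKVCache works internally, so the following is only an illustration of the asymptotics, not the actual mechanism (the `SortedKVCache` class is hypothetical). If cache keys are kept sorted - say, memory addresses in the interpreter's state - each retrieval is a binary search, O(log n), rather than a scan over all n cached entries:

```python
import bisect

class SortedKVCache:
    """Illustrative exact-match KV cache with O(log n) reads."""

    def __init__(self):
        self.keys, self.vals = [], []

    def write(self, key, val):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.vals[i] = val           # overwrite: latest write wins
        else:
            self.keys.insert(i, key)     # O(n) insert; a tree would be O(log n)
            self.vals.insert(i, val)

    def read(self, key):
        i = bisect.bisect_left(self.keys, key)   # binary search: O(log n)
        if i < len(self.keys) and self.keys[i] == key:
            return self.vals[i]
        return None                      # key never written
```

The contrast with standard attention is the point: a vanilla KV cache compares the query against every stored key on every step, which is what produces the quadratic total cost.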

The Results

| Benchmark | Result |
| --- | --- |
| Arto Inkala's hardest Sudoku | Solved, 100% accuracy, under 3 minutes |
| Multi-digit addition | 100% accuracy across millions of tokens |
| Throughput (CPU) | 33,000+ tokens/sec |
| Throughput (GPU) | Not reported |

The Sudoku result is the headline demo. Arto Inkala's puzzle is widely cited as one of the hardest Sudoku puzzles ever constructed. The model solves it by executing a backtracking solver compiled into its weights - not by "reasoning" about the puzzle.
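For readers unfamiliar with the algorithm being compiled: a backtracking Sudoku solver tries digits in each empty cell, recurses, and undoes the choice on a dead end. The sketch below is ordinary Python, included only to show the kind of program the post says was compiled into weights - it is not Percepta's code:

```python
def valid(grid, r, c, d):
    """Can digit d go at (r, c) without violating row/column/box rules?"""
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return (d not in grid[r]
            and all(grid[x][c] != d for x in range(9))
            and all(d not in grid[x][bc:bc + 3] for x in range(br, br + 3)))

def solve(grid):
    """Solve a 9x9 Sudoku in place via backtracking; 0 marks an empty cell."""
    for i in range(81):
        r, c = divmod(i, 9)
        if grid[r][c] == 0:
            for d in range(1, 10):
                if valid(grid, r, c, d):
                    grid[r][c] = d
                    if solve(grid):
                        return True
                    grid[r][c] = 0       # dead end: backtrack
            return False                  # no digit fits here
    return True                           # no empty cells left: solved
```

Compiled into transformer weights, every recursion, trial, and backtrack of this loop becomes tokens in the model's output - which is why the execution trace for a hard puzzle runs to many thousands of tokens.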

Concrete Example

A traditional LLM asked to add 1847392 + 9284716 will often get it wrong because token-by-token prediction isn't the same as carrying digits. Percepta's model executes an actual addition algorithm - the same one a CPU would run - through its transformer weights. The result is always correct because the computation is deterministic, not probabilistic.
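What "executing an actual addition algorithm" means concretely is a carry chain: process digit pairs right to left, propagating the carry. The sketch below is plain Python for illustration (the function name and trace format are made up, not Percepta's), but the step-by-step trace it produces is the shape of output a compiled interpreter emits:

```python
def add_with_trace(a, b):
    """Schoolbook addition, returning the sum plus a per-digit trace."""
    xs, ys = str(a)[::-1], str(b)[::-1]       # least-significant digit first
    carry, digits, trace = 0, [], []
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        carry, d = divmod(x + y + carry, 10)  # one deterministic step
        digits.append(str(d))
        trace.append(f"step {i}: digit {d}, carry {carry}")
    if carry:
        digits.append(str(carry))             # final carry-out
    return int("".join(reversed(digits))), trace
```

There is no "likely next digit" anywhere in this loop - each digit is forced by the algorithm, which is why accuracy doesn't degrade as the numbers grow.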

| Approach | How It Works | Accuracy |
| --- | --- | --- |
| Standard LLM | Predicts likely next tokens | Variable (often wrong for large numbers) |
| LLM + tool call | Calls Python/calculator externally | 100% (but exits the model) |
| Percepta | Executes program inside the model | 100% (stays inside the forward pass) |

The key distinction from tool-calling: the computation happens inside the transformer's forward pass. The execution trace is part of the model's output. The authors claim this means "the whole process remains differentiable: we can even propagate gradients through the computation itself."

Why It Matters Now

The timing is significant. As LLMs are increasingly used for coding and mathematical reasoning, their inability to perform reliable arithmetic remains an embarrassing limitation. Percepta's work suggests a path where computation is not bolted onto LLMs as an external tool but embedded into the architecture itself.

If the approach can be extended - compiling arbitrary programs into trainable transformer components - it could allow models to learn when to compute deterministically and when to reason probabilistically. The model would decide: "this is a math problem, switch to computation mode" versus "this is a creative writing prompt, stay in generation mode."

The Skepticism

Before anyone gets too excited, though, the Hacker News discussion raised serious concerns that the blog post doesn't address:

No Training Was Demonstrated

The most significant criticism: the model's weights aren't learned through gradient descent. They're compiled directly from the WebAssembly interpreter. This is closer to writing a very unusual computer program than to training an AI model. One commenter noted: "If you want a WASM interpreter, just run a WASM interpreter."

The Differentiability Claim Is Unproven

The blog post claims the execution trace is differentiable, which would in principle allow integrating the computational substrate into a trainable model. But the construction uses "average-hard attention," which isn't differentiable with respect to keys and queries. The authors acknowledge that differentiable variants "should" work but don't demonstrate this.
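The difficulty is easy to see numerically. Hard attention selects the argmax key, so nudging a score usually changes nothing (zero gradient) until the argmax suddenly flips; softmax attention responds smoothly. The toy finite-difference check below (all names are illustrative, unrelated to Percepta's implementation) makes the contrast concrete:

```python
import math

def soft_attend(scores, values):
    """Softmax-weighted average: smooth in the scores."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return sum(e / z * v for e, v in zip(exps, values))

def hard_attend(scores, values):
    """Hard attention: pick the argmax value; piecewise constant in the scores."""
    return values[scores.index(max(scores))]

scores, values, eps = [1.0, 2.0, 0.5], [10.0, 20.0, 30.0], 1e-4

def sensitivity(attend):
    """Finite-difference sensitivity of the output to scores[0]."""
    bumped = [scores[0] + eps] + scores[1:]
    return (attend(bumped, values) - attend(scores, values)) / eps

# soft attention: nonzero sensitivity; hard attention: exactly zero here
```

Any scheme for training through the compiled computation has to get a useful gradient past exactly this flat region, which is the gap the post leaves open.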

Missing Performance Comparisons

33,000 tokens/sec on CPU sounds fast, but one HN commenter estimated the approach may be 10,000x slower than native WASM execution for the same programs. The blog post provides no benchmarks against Python tool-calling, native WASM, or even a simple calculator.

The Blog Post Itself Raised Flags

Multiple HN commenters flagged the writing as likely AI-generated, citing "repetitiveness," a "schmoozing salesman feel," and lack of substantive detail. For a research announcement of this significance, the presentation undermined the credibility of the underlying work.


The Bottom Line

Percepta's work is technically interesting: compiling a WASM interpreter into transformer weights and achieving O(log n) attention through 2D heads is a novel architectural contribution. But the gap between "we compiled a program into a transformer" and "transformers can compute" is enormous. The model doesn't learn to compute - it has computation injected into its weights. Whether that injection can be made trainable, integrated into larger models, and shown to beat simply calling an external tool remains completely unproven. The research community's reaction - equal parts "this is fascinating" and "but why?" - is the right one.
