Google DeepMind Uses AlphaEvolve to Discover Entirely New Game Theory Algorithms

AlphaEvolve evolved two novel game theory algorithms - VAD-CFR and SHOR-PSRO - that match or beat human-designed baselines across a suite of 11 games, using mechanisms the authors describe as "non-intuitive" to human designers.

Google DeepMind pointed AlphaEvolve at multi-agent reinforcement learning, and it produced game theory algorithms no human researcher had designed. The two algorithms - VAD-CFR and SHOR-PSRO - match or beat state-of-the-art baselines that took years of manual refinement, and the authors describe their mechanisms as "non-intuitive" to human designers.

TL;DR

  • AlphaEvolve evolved two novel game theory algorithms by treating source code as a genome for semantic evolution
  • VAD-CFR beats Discounted Predictive CFR+ in 10 of 11 games tested, using a volatility-adaptive discounting scheme no human had designed
  • SHOR-PSRO beats standard meta-solvers in 8 of 11 games using a hybrid that blends regret matching with temperature-controlled best responses
  • Both algorithms use mechanisms the authors describe as "non-intuitive" - effective strategies a human researcher wouldn't have chosen
  • Source code for both algorithms is published in the paper

What AlphaEvolve Did

AlphaEvolve is an LLM-powered evolutionary coding agent that treats algorithm source code as a genome. Rather than tuning hyperparameters, it evolves the actual logic - discovering entirely new symbolic operations, control flows, and update rules.

In the paper, authors Zun Li, John Schultz, Daniel Hennes, and Marc Lanctot set AlphaEvolve loose on two major paradigms in game theory: iterative regret minimization (the CFR family) and population-based training (the PSRO family). The system was initialized with baseline algorithm code, used an LLM to propose mutations, assessed candidates across a training set of four games, and kept the best performers via evolutionary selection.

The training games: 3-player Kuhn Poker, 2-player Leduc Poker, 4-card Goofspiel, and 5-sided Liar's Dice. The test games: expanded variants of each plus 7 additional games to check generalization.
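The loop described above can be sketched in miniature. This toy version is only an illustration of the selection dynamics, not the actual system: the "genome" is a single number rather than source code, the LLM mutation step is replaced by a random perturbation, and all function names (`mutate`, `fitness`, `evolve`) are ours, not the paper's.

```python
import random

def mutate(genome, rng):
    # Stand-in for the LLM proposing a semantic mutation of a parent program.
    return genome + rng.uniform(-1.0, 1.0)

def fitness(genome):
    # Stand-in for evaluating a candidate across the training games
    # (higher is better); this toy landscape peaks at genome == 3.0.
    return -(genome - 3.0) ** 2

def evolve(seed_genome, generations=200, pop_size=8, rng=None):
    # Start from a baseline "program", repeatedly mutate a random member,
    # and keep only the best pop_size candidates each generation.
    rng = rng or random.Random(0)
    population = [seed_genome]
    for _ in range(generations):
        child = mutate(rng.choice(population), rng)
        population = sorted(population + [child], key=fitness, reverse=True)[:pop_size]
    return population[0]
```

The same skeleton applies whether the genome is a number or, as in AlphaEvolve, an entire algorithm's source code scored on real games.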

The Discovered Algorithms

VAD-CFR: Volatility-Adaptive Discounted CFR

VAD-CFR introduces three mechanisms that no existing CFR variant uses:

  1. Volatility-adaptive discounting - instead of fixed discount factors, VAD-CFR tracks an exponentially weighted moving average of instantaneous regret magnitude. When regret is volatile (the strategy is unstable), it discounts history more aggressively. When regret stabilizes, it retains more. This is dynamic self-tuning that responds to the game's own difficulty.

  2. Asymmetric instantaneous boosting - positive instantaneous regrets get a 1.1x multiplier, allowing the algorithm to exploit promising actions faster without the usual accumulation lag. This is subtle but effective: a 10% boost to positive signals, no change to negative ones.

  3. Hard warm-start with regret-magnitude weighting - policy averaging doesn't begin until iteration 500, filtering out early-stage noise. After the warm-start period, iterations are weighted by instantaneous regret magnitude rather than linearly - giving more weight to iterations where the algorithm was making larger corrections.
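A rough sketch of how the three mechanisms might compose in a single update step. Only the 1.1x positive-regret boost and the 500-iteration warm start come from the paper's description; the EWMA decay constant, the particular discount formula, and all variable names are illustrative assumptions, not the published algorithm.

```python
import numpy as np

def vad_cfr_update(cum_regret, avg_policy, inst_regret, policy,
                   volatility_ewma, t, ewma_decay=0.9, warm_start=500):
    # 1) Track regret volatility as an EWMA of instantaneous regret magnitude.
    volatility_ewma = (ewma_decay * volatility_ewma
                       + (1 - ewma_decay) * np.abs(inst_regret).mean())
    # Discount accumulated history more aggressively when regret is volatile
    # (assumed functional form; the paper's exact mapping may differ).
    discount = 1.0 / (1.0 + volatility_ewma)
    # 2) Asymmetric boosting: amplify positive instantaneous regrets by 1.1x.
    boosted = np.where(inst_regret > 0, 1.1 * inst_regret, inst_regret)
    cum_regret = discount * cum_regret + boosted
    # 3) Hard warm start: only average policies after iteration 500,
    #    weighting each contribution by its instantaneous regret magnitude.
    if t > warm_start:
        weight = np.abs(inst_regret).mean()
        avg_policy = avg_policy + weight * policy
    return cum_regret, avg_policy, volatility_ewma
```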

| Metric | VAD-CFR | DPCFR+ (best baseline) |
| --- | --- | --- |
| 3-player Kuhn Poker | Significantly lower exploitability | Baseline |
| Leduc Poker | Steeper convergence slope | Baseline |
| 3-player Leduc (test) | Below 10^-3 exploitability | Plateaus higher |
| Games matched or beaten | 10 of 11 | - |

SHOR-PSRO: Smoothed Hybrid Optimistic Regret PSRO

SHOR-PSRO reworks population-based training with three mechanisms, most notably a blend of two meta-strategies that are usually kept separate:

  1. Hybrid blending - the meta-strategy is computed as a weighted mix of Optimistic Regret Matching (stable, conservative) and a Boltzmann softmax distribution (aggressive, exploitative). The formula: sigma_hybrid = (1-lambda) * sigma_ORM + lambda * sigma_Softmax.

  2. Dynamic annealing - the blending factor lambda anneals from 0.3 to 0.05 during training, transitioning from early exploration to late-stage equilibrium refinement. A diversity bonus similarly decays from 0.05 to 0.001.

  3. Train/eval asymmetry - the training solver uses dynamic annealing and returns average strategies. The evaluation solver uses fixed low blending (lambda=0.01) and last-iterate strategies for reactive, low-noise measurements. This split wasn't designed by the researchers - AlphaEvolve discovered that different solvers for training and evaluation produce better results.
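The blending and annealing steps above can be sketched as follows. The hybrid formula and the annealing range (0.3 down to 0.05) come from the paper; the simplified regret-matching input, the softmax temperature, the linear schedule shape, and the function names are our assumptions.

```python
import numpy as np

def hybrid_meta_strategy(cum_regret, payoffs, lam, temperature=1.0):
    # Regret-matching component: normalize positive cumulative regrets
    # (uniform fallback when no regret is positive).
    pos = np.maximum(cum_regret, 0.0)
    if pos.sum() > 0:
        sigma_orm = pos / pos.sum()
    else:
        sigma_orm = np.full_like(pos, 1.0 / len(pos))
    # Boltzmann softmax component over expected payoffs (max-shifted for
    # numerical stability).
    z = np.exp((payoffs - payoffs.max()) / temperature)
    sigma_softmax = z / z.sum()
    # sigma_hybrid = (1 - lambda) * sigma_ORM + lambda * sigma_Softmax
    return (1 - lam) * sigma_orm + lam * sigma_softmax

def annealed_lambda(step, total_steps, lam_start=0.3, lam_end=0.05):
    # Anneal the blending factor from exploration toward equilibrium
    # refinement; a linear schedule is assumed here.
    frac = step / max(total_steps - 1, 1)
    return lam_start + frac * (lam_end - lam_start)
```

In the paper's train/eval split, the evaluation solver would instead hold lambda fixed at 0.01 and read off last-iterate strategies.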

SHOR-PSRO matches or beats state-of-the-art in 8 of 11 games, with a clear advantage on games with expanded branching factors like 6-sided Liar's Dice.

What It Does Not Tell You

The results are impressive, but context matters.

The games are toy-scale. Kuhn Poker, Leduc Poker, Goofspiel, and Liar's Dice are standard benchmarks in computational game theory, but they're orders of magnitude smaller than real-world applications like full Texas Hold'em or strategic military simulations. The paper doesn't test on large-scale games where computational cost would be a bottleneck.

Exploitability is measured with exact oracles. The PSRO evaluation uses exact best response computation via value iteration, which is only feasible on small games. In practice, best responses must be approximated, introducing error that could narrow or widen the gap between algorithms.
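For intuition, here is what exact-best-response exploitability (often reported as NashConv) looks like in the simplest possible setting, a two-player zero-sum normal-form game. The paper's evaluation instead computes best responses by value iteration on extensive-form games; this sketch does not attempt that, and the function name is ours.

```python
import numpy as np

def exploitability(payoff_matrix, row_strategy, col_strategy):
    # Row player's best-response value against the column mixed strategy.
    br_row = np.max(payoff_matrix @ col_strategy)
    # Column player's best-response value (the column player minimizes the
    # row payoff in a zero-sum game, i.e. maximizes its negation).
    br_col = np.max(-(row_strategy @ payoff_matrix))
    # NashConv: total gain available from unilateral deviation;
    # zero exactly at a Nash equilibrium.
    return br_row + br_col
```

On matching pennies, the uniform profile gives exploitability 0, while any pure row strategy against a uniform column is exploitable; replacing these exact maxima with approximate best responses is what introduces the measurement error discussed above.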

The mechanisms are non-intuitive but not inexplicable. The authors flag this as a feature, and it is - but it also means the algorithms are harder to reason about theoretically. VAD-CFR's convergence guarantees, for instance, aren't established by the paper. It works empirically, but the theoretical understanding lags behind the empirical results.

AlphaEvolve's search was seeded with human baselines. The system does not discover algorithms from scratch. It starts with existing CFR and PSRO code and evolves it. The "no human intuition" framing is partially accurate - the mutations are LLM-generated rather than hand-designed - but the starting point is decades of human game theory research.


What makes this paper matter is not that AI can beat human-designed algorithms on toy games - that's noteworthy but not unprecedented. It is that AlphaEvolve discovered mechanisms that are structurally novel: volatility-adaptive discounting, asymmetric regret boosting, train/eval solver asymmetry. These are ideas that could be extracted, understood, and applied by human researchers to larger problems. AI isn't replacing game theorists here. It is expanding the design space they work in.
