Coding Grandmasters, Formal Proofs, and Agent Hazards

A system trained on competitive programming beats every human in three consecutive live Codeforces contests. A fleet of 30,000 AI agents formalizes 500 pages of graduate mathematics in one week. And a new benchmark finds that computer-use agents can be steered into harmful behavior with unsettling ease. Today's three papers cover a lot of ground - but they share a common thread: scale changes what's possible, and safety hasn't kept up.

TL;DR

GrandCode - Multi-agent RL system places first in three consecutive live Codeforces rounds, beating all human participants including legendary grandmasters
Automatic Textbook Formalization - 30,000 Claude 4.5 Opus agents convert a 500-page graduate math textbook into 130K lines of verified Lean code in one week
AgentHazard - New benchmark finds Claude Code reaches a 73.63% attack success rate when targeted with locally plausible but collectively harmful action sequences

GrandCode: AI Takes the Top Spot on Codeforces

Competitive programming has resisted AI for years. The problems aren't just hard - they require rapid reasoning under time pressure, creative insight for edge cases, and the ability to switch strategies mid-problem. According to the GrandCode paper, previous best results like Google's Gemini Deep Think only reached 8th place. That was considered remarkable.

GrandCode did better.

The system, from the DeepReinforce team (Xiaoya Li, Xiaofei Sun, Guoyin Wang, Songqiao Su, Chris Shum, and Jiwei Li), placed first in three consecutive live Codeforces rounds in March 2026: Round 1087 on March 21 (Div. 2, 9,269 participants), Round 1088 on March 28 (Div. 1+2, 16,511 participants), and Round 1089 on March 29 (Div. 2, 11,596 participants). In each contest, it solved all problems and finished ahead of every human competitor, including multiple top-ranked legendary grandmasters.

Competitive programming contestants at the 2019 ICPC World Finals. GrandCode placed first against 9,000 to 16,000 human competitors across three consecutive live Codeforces rounds in March 2026. Source: commons.wikimedia.org

How the System Works

GrandCode is built on Qwen 3.5-397B and uses a multi-agent architecture with four specialized modules. πmain produces reasoning traces and code. πhypothesis proposes structural conjectures that get verified on small test inputs before committing to a solution path. πsummary compresses long reasoning chains to keep context tractable. A fourth module produces adversarial test cases to stress-test candidate solutions.

Problems get routed by difficulty to one of five processing tiers, with harder problems receiving more compute. Training combined post-training with online test-time reinforcement learning, using agentic GRPO designed specifically for multi-stage rollouts where rewards arrive only at the end of the solving process.

A Caveat Worth Noting

The paper mentions the team waited until human participants had nearly finished submitting before GrandCode made its final entries. This is a timing strategy that avoids drawing attention under Codeforces' policies on AI-generated submissions. It doesn't affect the final standings - the scoring system rewards correctness, not speed - but it's relevant context for interpreting "first place" in a live contest setting.

The practical takeaway: RL-tuned multi-agent systems can now outperform elite humans on constrained, well-defined reasoning tasks. The open question for the rest of 2026 is which other domains fit that profile.

Automatic Textbook Formalization: 30,000 Agents, One Week, 130K Lines

Formal verification - the process of translating mathematical proofs into code that a computer can check for logical correctness - has historically been slow work. Formalizing a single theorem can take an expert days. A full textbook would take a team years.

Fabian Gloeckle, Ahmad Rammal, Charles Arnal, Remi Munos, Vivien Cabannes, Gabriel Synnaeve, and Amaury Hayat (Facebook Research) just changed that calculus.

Their system automatically converted a 500-page graduate-level algebraic combinatorics textbook into Lean 4 code: 130,000 lines, 5,900 Lean declarations, completed in one week. They used 30,000 Claude 4.5 Opus agents running in parallel on a shared codebase, coordinated through version control.

Agent type breakdown from the Automatic Textbook Formalization paper Figure from the paper showing the number of agents by type operating in parallel during formalization. Thousands of specialized agents coordinate through a shared Git repository. Source: arxiv.org

The Multi-Agent Infrastructure

Coordination is what makes or breaks this kind of system. When 30,000 agents write to a shared codebase simultaneously, merge conflicts and state inconsistencies multiply quickly. The team solved this by structuring work around a shared Git repository, with agents operating on branches and a manifest file that tracks what has been formalized and which results depend on others.

Agents create YAML files in a dedicated issues directory to flag blockers, request refactoring, or coordinate work across chapters. It looks less like a swarm and more like a very large distributed engineering team - one running at 30,000 parallel instances.

The paper reports inference costs "competitive with" estimated salaries for an equivalent expert human team. Given that manual formalization is measured in months and years, this isn't surprising - but it does suggest that at-scale multi-agent AI can be economically viable for expert-level knowledge work. This connects to earlier patterns we've covered in multi-agent code review pipelines, where coordinating large agent fleets on shared codebases is becoming an engineering discipline in its own right.

Why Formal Verification Matters

Lean (the proof assistant used here) checks proofs down to their logical axioms. Once something is formalized in Lean, you don't have an argument that it's probably correct - you have a machine-verified certificate. For mathematics, the value is clear. For AI systems that reason about structured domains, verified foundations are increasingly relevant too.

The code and blueprint are open-source at github.com/facebookresearch/repoprover.

AgentHazard: Computer-Use Agents Are Easier to Exploit Than You'd Expect

The third paper is the most directly concerning for anyone building or deploying autonomous agents.

Yunhao Feng, Yifan Ding, Yingshui Tan, Xingjun Ma, and colleagues introduce AgentHazard: a benchmark of 2,653 test instances designed to assess whether computer-use agents can be steered into harmful behavior through sequences of individually plausible actions.

The core insight is that harm doesn't have to arrive in a single obviously-bad instruction. It can emerge from chains of actions that each look locally acceptable. An agent that refuses "delete all user records" might still comply if the same outcome is spread across twenty intermediate steps, each of which appears routine in isolation.

Attack success rates from the AgentHazard paper Figure from the paper showing attack success rates across different agent frameworks and model backends. Claude Code with Qwen3-Coder recorded a 73.63% attack success rate. Source: arxiv.org

The Numbers

When paired with Qwen3-Coder, Claude Code reached a 73.63% attack success rate across the benchmark. OpenClaw and IFlow (two other computer-use frameworks) were also assessed with models from the Qwen3, Kimi, GLM, and DeepSeek families.

The benchmark covers varied risk categories, with each instance pairing a harmful objective against an action sequence designed to reach it through steps that are "locally legitimate but jointly induce unsafe behavior." The evaluation used 2,653 such instances.

The authors' conclusion is direct: alignment strategies alone aren't sufficient for agent safety. A model fine-tuned to refuse harmful instructions still fails when those instructions are decomposed and distributed across a task sequence.

What This Means for Builders

The gap between model alignment and agent security has been visible in research for a while - we covered related vulnerability patterns earlier this week. What AgentHazard adds is a standardized benchmark with reproducible attack sequences, which makes it possible to track whether new models or frameworks actually close the gap over time.

The practical implication: teams shipping computer-use agents need monitoring at the task sequence level, not just input filtering at the model level. Assessing individual model outputs isn't enough when unsafe outcomes emerge from multi-step action chains.

For more context on how current systems handle these risks in deployment, the Claude Code auto-mode safety research is worth reading with this paper.

The Thread Connecting All Three

Scale is the common factor across today's papers. GrandCode runs multi-agent RL at scale to beat humans at sustained reasoning under pressure. The textbook formalization project runs 30,000 agents to tackle a task previously constrained by human time. AgentHazard shows that the same scale - agents taking many small coordinated actions - creates attack surfaces that single-action defenses can't cover.

Systems capable of sustained, coordinated action across long horizons are exactly what produces impressive results in 2026. They're also exactly what requires safety infrastructure designed for sequences, not individual outputs. The research community is building both at the same time. The race between capability and safety tooling is still very much open.

Sources: