Olympiad Gold, Broken Memories, and Attention Loss

A 30B model earns IMO gold, memory consolidation silently corrupts agents, and a new metric predicts when LLMs lose track of their instructions.

Three papers this week span the extremes of what AI systems can do and how quietly they can fail. A compact 30B model earns gold medals at international science olympiads. At the same time, two papers expose degradation patterns inside systems that already appear to work: agent memories that corrupt over time, and LLMs that silently drop instructions mid-conversation.

TL;DR

  • SU-01 - A 30B-A3B model achieves gold-medal scores on IMO 2025, USAMO 2026, and IPhO 2024/2025 with a three-stage training recipe any lab can replicate
  • Memory consolidation - GPT-5.4 failed 54% of previously solved ARC-AGI problems after continuous memory updates; raw episodic storage outperforms abstraction
  • When attention closes - A new diagnostic metric (Goal Accessibility Ratio) predicts when LLMs will lose track of multi-turn instructions before failure happens

SU-01: Gold Medals From a 30B Model

Yafu Li et al. - "Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling" (arXiv:2605.13301)

SU-01 is a 30B Mixture-of-Experts model (30B total parameters, 3B active) that achieves gold-medal performance on four international academic competitions: IMO 2025, USAMO 2026, IPhO 2024, and IPhO 2025.

The headline numbers: with test-time scaling, SU-01 scores 35 points on both IMO 2025 and USAMO 2026 - clearing the gold medal threshold on each. On USAMO 2026, that score matches the highest human total among 340 competitors. Without test-time scaling, the model scores 21 points on IMO 2025 and 57.6% on IMO-ProofBench, rising to 70.2% with scaling. The math olympiad leaderboard already shows that AIME 2025 is saturated; the frontier has shifted to harder open-ended proof tasks.

What makes this work interesting isn't the benchmark numbers alone. It's the recipe.

[Image: mathematical equations covering a blackboard] Competition mathematics gives AI no shortcuts - a model either derives the proof or it doesn't. Source: unsplash.com

The Training Pipeline

The team built a three-stage post-training pipeline on top of an existing backbone (P1-30B-A3B):

Stage 1: Supervised fine-tuning with reverse-perplexity curriculum. Standard curriculum learning starts easy and gets harder. SU-01 inverts this: training begins with the trajectories the model finds hardest (highest perplexity under the current policy), then works downward. The 338K training trajectories are all under 8K tokens. The authors find the reversed ordering accelerates capability acquisition and improves training stability.
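
The ordering itself is a one-liner once per-trajectory perplexities exist. A minimal sketch, assuming perplexity under the current policy has already been measured for each trajectory (the `Trajectory` shape is ours, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    text: str
    perplexity: float  # measured under the current policy, before SFT begins

def reverse_perplexity_order(trajectories: list[Trajectory]) -> list[Trajectory]:
    """Hardest-first curriculum: highest-perplexity trajectories lead,
    inverting the usual easy-to-hard ordering."""
    return sorted(trajectories, key=lambda t: t.perplexity, reverse=True)

data = [
    Trajectory("easy algebra proof ...", perplexity=1.8),
    Trajectory("hard combinatorics proof ...", perplexity=7.4),
    Trajectory("medium geometry proof ...", perplexity=3.1),
]
for t in reverse_perplexity_order(data):
    print(f"{t.perplexity:4.1f}  {t.text}")
```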

Stage 2: Two-stage reinforcement learning. Coarse RL first - 8,967 verifiable prompts with GSPO (Group Sequence Policy Optimization), which provides gradient signal from solution correctness. Then refined RL on 16,287 non-verifiable problems, using proof-level generative rewards from DeepSeekMath-V2 to assess solution rigor rather than just final answers. Experience replay preserves rare successful proofs on hard problems so they're not forgotten as training progresses.
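
The paper's replay detail suggests a buffer that pins successful trajectories on rarely solved problems and mixes them into later RL batches. A rough sketch under that reading (class name, rarity threshold, and mix ratio are all hypothetical):

```python
import random

class ProofReplayBuffer:
    """Keeps rare successful proofs alive across RL iterations so hard-won
    solutions aren't forgotten as the policy shifts."""

    def __init__(self, mix_ratio: float = 0.2, rarity_threshold: float = 0.1):
        self.rare_successes: list[dict] = []
        self.mix_ratio = mix_ratio  # fraction of each batch drawn from replay
        self.rarity_threshold = rarity_threshold

    def record(self, problem_id: str, trajectory: str, solved: bool, solve_rate: float):
        # "Rare success": the problem was solved despite a low historical solve rate.
        if solved and solve_rate < self.rarity_threshold:
            self.rare_successes.append({"id": problem_id, "traj": trajectory})

    def build_batch(self, fresh: list[dict], size: int) -> list[dict]:
        n_replay = min(int(size * self.mix_ratio), len(self.rare_successes))
        return random.sample(self.rare_successes, n_replay) + fresh[: size - n_replay]
```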

Stage 3: Test-time scaling via self-verification. SU-01 sustains reasoning chains beyond 100K tokens, conditioning on its own drafts, error analyses, and repeated proof attempts. The model learns to verify and repair candidate solutions rather than committing to a first answer.
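
Structurally, this is a draft-verify-repair loop in which each new attempt is conditioned on the previous draft and its error analysis. A schematic version with toy stand-ins for the model calls (`generate`, `verify`, and `repair` are placeholders, not SU-01's actual interfaces):

```python
from typing import Callable

def solve_with_self_verification(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], dict],
    repair: Callable[[str, str, str], str],
    max_rounds: int = 8,
) -> str:
    """Draft a solution, critique it, and repair it rather than
    committing to the first answer; the chain grows each round."""
    draft = generate(problem)
    for _ in range(max_rounds):
        report = verify(problem, draft)  # the model critiques its own draft
        if report.get("sound"):
            return draft
        draft = repair(problem, draft, report.get("errors", ""))
    return draft  # best effort once the verification budget runs out

# Toy stand-ins; a real system would back these with model calls.
result = solve_with_self_verification(
    "Prove the sum of two even integers is even.",
    generate=lambda p: "draft v1",
    verify=lambda p, d: {"sound": "v2" in d, "errors": "missing parity case"},
    repair=lambda p, d, e: d.replace("v1", "v2"),
)
print(result)  # -> "draft v2"
```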

On standard answer-verifiable benchmarks, SU-01 averages 77.3% - matching Qwen3.6-35B-A3B's 77.4% with fewer total parameters. The gap opens on proof-heavy olympiad tasks where self-refinement matters most.

The result is a replicable path to gold-medal reasoning from a sub-frontier model. Raw scale isn't the limiting factor.


When Memory Consolidation Breaks Agents

Dylan Zhang, Yanshan Lin et al. - "Useful Memories Become Faulty When Continuously Updated by LLMs" (arXiv:2605.12978)

Most agent frameworks treat memory consolidation as a background optimization. Raw experience trajectories are expensive to store and hard to search, so you summarize them into reusable abstractions. This seems sensible. The authors of this paper tested whether it is actually true.

It isn't.

The study tracks memory quality across repeated consolidation cycles in a controlled ARC-AGI Stream environment. The pattern holds consistently: memory utility rises initially, then degrades, and eventually falls below the no-memory baseline. The agents end up worse than if they had remembered nothing.

The most concrete result: GPT-5.4 with continuous memory consolidation failed 54% of ARC-AGI problems it had previously solved without memory. The consolidation process corrupted the knowledge rather than preserving it.

Why Consolidation Drifts

The root cause is compounding abstraction. When an LLM summarizes a trajectory into a memory, it makes interpretive choices - which details matter, which patterns to generalize. Those choices are reasonable the first time. On the second consolidation, the model is abstracting an abstraction. Several cycles in, what remains is a steadily distorted echo of the original experience.

Different update schedules produce qualitatively different memories from identical source trajectories. The problem isn't frequency; it's that the model can't faithfully compress the specific features that made a solution work.

The Fix: Preserve the Evidence

The authors' central recommendation is counterintuitive: don't consolidate. Retaining raw episodic memories - the original trajectories - as first-class storage matches or outperforms sophisticated consolidation schemes across their tests.

This connects to prior work on agent memory architecture that identified similar instability in tiered systems under continuous update. The pattern is consistent: memory consolidation should be an explicit, gated choice, not an automatic background process. When consolidation happens, verify the abstract memory still solves the original task before discarding the raw evidence.
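
In code, that recommendation amounts to episodic-first storage with a verification gate in front of consolidation. A minimal sketch, with hypothetical `summarize` and `still_solves` hooks standing in for an LLM summarizer and a task re-check:

```python
from typing import Callable

class EpisodicStore:
    """Raw trajectories are first-class; consolidation is explicit, gated,
    and never discards the underlying evidence."""

    def __init__(self):
        self.episodes: dict[str, str] = {}   # task_id -> raw trajectory
        self.abstracts: dict[str, str] = {}  # task_id -> verified summary

    def remember(self, task_id: str, trajectory: str):
        self.episodes[task_id] = trajectory  # default path: retain, don't summarize

    def consolidate(
        self,
        task_id: str,
        summarize: Callable[[str], str],
        still_solves: Callable[[str, str], bool],
    ) -> bool:
        """Accept an abstraction only if it still solves the original task."""
        summary = summarize(self.episodes[task_id])
        if still_solves(task_id, summary):
            self.abstracts[task_id] = summary
            return True
        return False  # keep relying on the raw episode
```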

For practitioners: if your agent framework auto-consolidates on every step, you're likely degrading performance that starts strong. The default should be episodic retention with selective, verified consolidation.

[Image: close-up of a neuron with branching dendrites] Biological memory benefits from forgetting irrelevant detail. LLM memory consolidation, this paper shows, can't reliably tell which details matter. Source: unsplash.com


When LLMs Lose the Thread

Vardhan Dongre, Joseph Hsieh et al. (Adobe Research, UCSB) - "When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction" (arXiv:2605.12922)

The failure mode is familiar to anyone who has run long agentic workflows: 20 turns in, the model ignores constraints from the first message. This paper explains the mechanism and introduces a metric that predicts failure before it occurs.

The Channel-Transition Account

The authors argue that as conversations extend, newly generated tokens attend progressively less to the goal-defining tokens from the initial instructions. The attention channel to the task specification closes.

The key finding is that goal information doesn't disappear from the model. Linear probes on residual streams recover episode-level outcomes with up to 0.99 AUC, even when attention to the original goal tokens has dropped to near zero. The information persists in the model's internal representations. The model just can't access it through attention anymore.
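
A linear probe here is just a linear classifier trained on residual-stream activations to predict the episode's outcome. A self-contained illustration with scikit-learn on synthetic stand-in activations (a real experiment would extract hidden states from the model; shapes and labels below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in data: per-episode residual-stream activations (n_episodes x d_model)
# and binary outcomes (1 = the episode's task succeeded).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 512))
w = rng.normal(size=512)
y = (X @ w + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe AUC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```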

The causal link is sharp. In ablation experiments on Mistral, disrupting the attention pathway to goal tokens dropped fact retention from near-perfect to 11% on a 20-item recall task.

Goal Accessibility Ratio

The paper introduces the Goal Accessibility Ratio (GAR) as a diagnostic: the fraction of attention from generated tokens that flows back to task-defining goal tokens. As GAR falls, failure probability rises.
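
That definition translates directly into a reduction over the model's attention weights. A sketch assuming a `[heads, seq, seq]` attention tensor whose rows are softmax-normalized (the function and argument names are ours, not the paper's):

```python
import numpy as np

def goal_accessibility_ratio(
    attn: np.ndarray,            # [n_heads, seq_len, seq_len], rows sum to 1
    goal_span: tuple[int, int],  # [start, end) indices of goal-defining tokens
    gen_start: int,              # position of the first generated token
) -> float:
    """Average fraction of attention that generated tokens place
    on the goal-defining span."""
    g0, g1 = goal_span
    mass_to_goal = attn[:, gen_start:, g0:g1].sum(axis=-1)  # [heads, n_generated]
    return float(mass_to_goal.mean())

# Sanity check: uniform attention over a 100-token context with a
# 10-token goal span should give GAR ~ 0.1.
attn = np.full((8, 100, 100), 1 / 100)
print(goal_accessibility_ratio(attn, goal_span=(0, 10), gen_start=60))
```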

Different architectures fail differently. Some models preserve goal-conditioned behavior at near-zero GAR - they route through residuals instead of attention. Others collapse despite retaining decodable goal information in residuals; they need the attention channel and can't compensate. The layer where goal information first encodes varies from layer 2 to layer 27 across architectures tested.

Practical Implications

If you're building multi-turn systems, you need to know how your model routes goal conditioning. GAR gives you a way to measure this empirically. Periodic goal re-injection - restating key constraints mid-conversation - is a well-known mitigation; this paper provides the mechanistic explanation for why it works (it reopens the attention channel) and why it's necessary for attention-routing models but less critical for residual-routing ones.
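
A minimal version of that mitigation wraps whatever chat client you use and restates the constraints every k turns (the message format mimics common chat APIs; `chat` is a placeholder, not a specific library call):

```python
from typing import Callable

def run_with_reinjection(
    goal: str,
    user_turns: list[str],
    chat: Callable[[list[dict]], str],
    every_k: int = 5,
) -> list[dict]:
    """Restate the task constraints every k turns so newly generated
    tokens have a nearby copy of the goal to attend to."""
    messages = [{"role": "system", "content": goal}]
    for i, turn in enumerate(user_turns, start=1):
        if i % every_k == 0:
            # Reopen the attention channel with a fresh restatement.
            messages.append({"role": "user",
                             "content": f"Reminder of the task constraints: {goal}"})
        messages.append({"role": "user", "content": turn})
        messages.append({"role": "assistant", "content": chat(messages)})
    return messages
```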

This extends earlier behavioral observations of attention degradation with a circuit-level explanation. Knowing which failure mode applies to your model changes which mitigation to use.


The Pattern Across All Three

What connects these papers is the gap between peak performance and sustained performance. SU-01 reaches gold-medal level - but requires self-verification loops to stay there. Memory-based agents start strong and degrade silently through consolidation. LLMs follow complex instructions at turn one and drop them by turn twenty.

The research direction is consistent: systems need explicit mechanisms for verifying their own outputs, preserving evidence rather than abstractions, and maintaining goal access throughout extended operation. Hitting impressive single-session results is a solved problem. The harder question - reliability under sustained deployment conditions - is where the field is focused now.


About the author
Elena Marchetti - Senior AI Editor & Investigative Journalist

Elena is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.