Physics Predicts AI Risk, Math Still Hard, Tokens Saved
A physics formula predicts AI behavioral shifts before they happen, a benchmark shows LLMs fail on more than 90% of graduate math formalization tasks, and a training-free method cuts synthetic data costs by up to 78%.

Three papers from today's arXiv submissions caught my attention. A physics team at George Washington University claims they can predict when a chatbot will shift to harmful behavior - before it does. A NeurIPS submission shows that state-of-the-art models fail on more than 90% of graduate-level math formalization tasks. And a training-free technique cuts synthetic data token costs by up to 78% with no accuracy loss.
TL;DR
- Fusion-fission AI forecasts - Physics-inspired group dynamics predict behavioral shifts to harmful outputs with 90% accuracy across seven models, operating below the current safety stack
- MathAtlas benchmark - 52k graduate math items from 103 textbooks reveal the best models hit only 9.8% on theorem formalization, dropping to 2.6% on the hardest subset
- In-flight rejection for synthetic data - Terminating bad generation trajectories early cuts token consumption by 11-78% across five models and seven benchmarks, with no retraining required
When Physics Predicts a Chatbot's Bad Turn
The problem is one every AI deployer knows: a language model can shift from helpful to harmful mid-conversation. Encouraging self-harm, boosting extremist content, producing costly errors in medical or financial contexts. Current mitigations - RLHF, constitutional AI, content filters - all sit on top of the model and react to bad outputs after the fact.
Neil F. Johnson and Frank Yingjie Huo at George Washington University's Physics Department argue you can see the shift coming. Their paper, "Fusion-fission forecasts when AI will shift to undesirable behavior," applies a vector generalization of fusion-fission group dynamics to language models. The same mathematics that describes how fish schools form and scatter, it turns out, can forecast when a chatbot is about to go off the rails.
The fusion-fission analogy comes from biology. In living systems, individuals cluster together ("fusion") for collective benefit - the group finds food faster, resists predators better. When threatened, the cluster scatters ("fission"). Johnson and Huo model the AI's current conversational state as a vector in a space where the conversation history, the basin of desirable responses, and the basin of undesirable responses all exert competing pull. A behavioral shift happens when the undesirable basin starts winning. Their equations describe the arc of that transition.
Fusion-fission dynamics in fish schools: clusters form for collective benefit and scatter under threat. The paper applies this framework to competing response basins in AI conversations.
Source: commons.wikimedia.org (Photo: Phil Manker, CC BY 2.0)
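The paper's actual equations aren't reproduced here, but the tug-of-war intuition is easy to sketch. The toy below is a minimal schematic, assuming a made-up embedding space, made-up pull strengths, and a made-up drift model - nothing in it comes from the paper itself. It treats the conversation state as a unit vector that two basin centers pull on, and flags the turn where the undesirable basin starts winning:

```python
import numpy as np

# Toy schematic of competing response basins. Dimensions, pull strengths,
# and the drift model are invented for illustration; the paper's actual
# fusion-fission equations are not reproduced here.
rng = np.random.default_rng(0)
dim = 64

good = rng.normal(size=dim); good /= np.linalg.norm(good)   # desirable basin
bad = rng.normal(size=dim);  bad /= np.linalg.norm(bad)      # undesirable basin

state = good + 0.1 * rng.normal(size=dim)                    # start near "helpful"
state /= np.linalg.norm(state)

def pull(center, state, strength):
    """Attraction toward a basin, scaled by how close the state already is."""
    overlap = 0.5 * (1.0 + state @ center)   # similarity mapped into [0, 1]
    return strength * overlap * (center - state)

for turn in range(200):
    noise = 0.05 * rng.normal(size=dim)      # per-turn drift from new prompts
    state = state + pull(good, state, 0.08) + pull(bad, state, 0.12) + noise
    state /= np.linalg.norm(state)

    # Warning signal: the undesirable basin starts winning the tug-of-war.
    if state @ bad > state @ good:
        print(f"behavioral shift forecast at turn {turn}")
        break
```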
Six Tests, One Formula
The paper validates the approach across six independent experiments. Across seven AI models spanning 124 million to 12 billion parameters, the formula correctly forecasts behavioral shifts 90% of the time. The authors also show that the effect persists across ten production-scale frontier chatbots - not a lab artifact but something visible in real deployments.
The strongest piece of evidence is a time-stamped prediction made eleven months before Stanford published the "Delusional Spirals" corpus, a dataset of 207,443 human-AI exchanges documenting harmful behavioral shifts in production systems. The model predicted the phenomenon before the empirical corpus existed.
For practitioners, the key detail is architectural position. This warning signal sits below the current safety stack, independent of model architecture, stochastic sampling, and whatever alignment techniques are layered on top. If the formula holds, you can flag a drifting conversation in real time regardless of which model you're running or which safety layer you've deployed.
The physics analogy from fish schools to language models is a significant conceptual leap. Replication by independent teams across more production systems will be necessary before this enters serious deployment pipelines. But six-test validation and a time-stamped prediction that preceded its validating corpus give it more empirical grounding than most safety theory papers manage.
Graduate Math Textbooks Break LLMs
Autoformalization - converting informal mathematical statements written in natural language into formal languages that proof assistants can verify - is one of those tasks that sounds tractable until you actually measure it. MathAtlas, submitted to NeurIPS 2026 by Nilay Patel and colleagues, is the first honest look at how badly current models struggle when the mathematics is truly hard.
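To see what the task actually asks for, here's a toy formalization pair - far simpler than anything in MathAtlas, and written in Lean 4 with Mathlib only because it's a familiar target; the benchmark's own items and target language are defined by the paper, not by this sketch:

```lean
import Mathlib

-- Informal statement: "The sum of two even integers is even."
-- One possible formalization a model would be asked to produce and prove:
theorem even_add_even (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  obtain ⟨m, hm⟩ := ha   -- a = m + m
  obtain ⟨n, hn⟩ := hb   -- b = n + n
  exact ⟨m + n, by rw [hm, hn]; ring⟩
```

A graduate-level item swaps "even integer" for a definition that itself sits on a stack of earlier definitions, which is where the dependency graph below comes in.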
The dataset comes from 103 graduate-level mathematics textbooks, yielding approximately 52,000 theorems, definitions, exercises, examples, and proofs across 87 mathematical areas. Real analysis, differential geometry, category theory, topology, quantum groups. It's in-the-wild material - drawn from published graduate texts rather than problems hand-curated to be AI-friendly.
What distinguishes MathAtlas from previous benchmarks is a dependency graph containing 178,000 explicit relations. A theorem in algebraic topology normally draws on definitions from several other areas, each of which draws on others. Without tracking these dependencies, a model can produce formalized output that's syntactically plausible but semantically incoherent - it uses the right symbols in the wrong relationships.
Graduate-level mathematics involves dense layers of interdependent definitions. MathAtlas measures whether models can track those dependencies - and finds they mostly can't.
Source: unsplash.com
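A minimal way to picture the dependency problem - with invented item names, not MathAtlas's actual schema - is a graph where each item's depth is the longest chain of prerequisites beneath it:

```python
# Toy dependency graph with invented names; MathAtlas's real schema and
# 178k relations are far larger, but the depth measure works the same way.
from functools import lru_cache

deps = {
    "def:topological_space": [],
    "def:continuous_map": ["def:topological_space"],
    "def:homotopy": ["def:continuous_map"],
    "def:fundamental_group": ["def:homotopy", "def:topological_space"],
    "thm:van_kampen": ["def:fundamental_group", "def:continuous_map"],
}

@lru_cache(maxsize=None)
def depth(item: str) -> int:
    """Longest chain of prerequisites beneath an item."""
    children = deps[item]
    return 0 if not children else 1 + max(depth(c) for c in children)

for item in deps:
    print(depth(item), item)
# Formalizing thm:van_kampen correctly means getting every definition on
# that chain right - the regime where the benchmark's accuracy collapses.
```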
9.8% Is the Best Score
The numbers don't leave much room for optimism. Strong baselines hit at most 9.8% correctness on theorem statements and 16.7% on definitions. On MA-Hard - a subset of 700 items with the deepest dependency trees - the best model scores 2.6%.
Performance degrades substantially as dependency depth increases. Items with shallow dependency trees are fairly manageable. As the chains lengthen, models lose track of what's been defined and produce incorrect formalizations at sharply higher rates.
We've covered related territory before. Coding Grandmasters, Formal Proofs, and Agent Hazards reported on 30,000 agents formalizing a math textbook in Lean, and Misalignment Geometry, LLM Math, and How Llama Counts documented arithmetic failures when models operate in non-decimal number systems. MathAtlas suggests those results aren't corner cases - systematic failure on formal mathematics is the baseline, not the exception.
For teams building automated theorem verification, math tutoring systems, or anything requiring reliable formal reasoning, MathAtlas is a useful reality check. The benchmark is open and submitted to NeurIPS 2026; expect it to become a standard evaluation target.
Stopping Bad Data Before the Model Finishes It
Synthetic data generation - using an LLM to produce training examples - is now a standard step in the ML pipeline. The hidden problem is that many generated samples fail quality checks only after they've been produced at full token cost. Current practice: let the generation run to completion, then discard the bad outputs. You pay for the compute to create something you don't use.
Anjir Ahmed Chowdhury, Syed Zawad, and Feng Yan propose Multi-Stage In-Flight Rejection (MSIFR), a framework that applies rule-based validators at intermediate checkpoints during generation. If the trajectory is already heading toward rejection - because of arithmetic inconsistencies, hallucination markers, or formatting violations - generation stops right away. The model doesn't finish producing a sample that would have been discarded anyway.
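A minimal sketch of the mechanism, assuming a hypothetical `generate_chunk(prompt, so_far)` streaming function and two toy validators - none of this is the authors' code, just the shape of the idea:

```python
import re

def format_violation(text: str) -> bool:
    # Toy rule: once a "####" answer marker appears, a \boxed{} answer must too.
    return "####" in text and "\\boxed" not in text

def arithmetic_error(text: str) -> bool:
    # Toy rule: re-check simple "a + b = c" claims as they appear.
    return any(int(a) + int(b) != int(c)
               for a, b, c in re.findall(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", text))

VALIDATORS = [format_violation, arithmetic_error]

def generate_with_rejection(prompt, generate_chunk, max_chunks=16):
    sample = ""
    for _ in range(max_chunks):
        sample += generate_chunk(prompt, sample)      # hypothetical streaming call
        # Checkpoint: stop paying for tokens the moment a rule fires.
        if any(check(sample) for check in VALIDATORS):
            return None                               # rejected in flight
        if sample.endswith("<eos>"):                  # assumed end-of-sample marker
            return sample                             # completed and retained
    return None
```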
Terminating a bad generation mid-flight doesn't bias the distribution of retained samples - a non-obvious result the paper proves formally.
The paper formalizes the approach as a sequential decision process and proves that early rejection doesn't distort the distribution of retained samples. That's the key theoretical concern: if you preferentially stop certain kinds of trajectories early, the training data might end up skewed in ways that are hard to diagnose downstream. The formal proof shows this concern doesn't apply here.
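One way to see the intuition, under one simplifying assumption: the checkpoint rules are conservative, firing only on trajectories the final filter would have rejected anyway. Then the retained set is literally the same set, so conditioning on retention gives the same distribution either way:

```latex
% Let A be the accept set of the post-hoc quality filter. If every in-flight
% rule only terminates trajectories outside A, then for any sample x:
P(x \mid \text{retained, in-flight rejection})
  = \frac{P(x)\,\mathbf{1}[x \in A]}{P(A)}
  = P(x \mid \text{retained, post-hoc filtering only})
% The paper's formal argument covers the general case; this is only the
% conservative-validator special case.
```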
Real Numbers Across Real Pipelines
Standalone MSIFR cuts token consumption by 11% to 77% depending on dataset and model. Combined with early-exit decoding methods, savings reach 78.2%. The results hold across five instruction-tuned models and seven reasoning benchmarks with no accuracy degradation. No retraining, no architectural changes.
The validators are rule-based, which means they're fast and deterministic. The tradeoff is that someone has to write them for each application domain. For well-specified domains - mathematics, code, structured data formats - quality criteria are clear and validators are straightforward to implement. For open-ended text generation, designing effective in-flight checks requires more thought.
For any team running large synthetic data pipelines, this is a practical efficiency gain that requires no ML expertise to deploy, just clear criteria for what constitutes a failed sample.
The Common Thread
Each paper probes AI behavior indirectly - not through the final outputs themselves but through structural properties underneath. Johnson and Huo trace the dynamics of competing response forces in conversation history. The MathAtlas authors measure formalization accuracy as a function of mathematical dependency depth. The MSIFR team checks trajectory quality at intermediate generation steps.
All three find the same gap: model outputs look more reliable on the surface than the internal structure supports. The conversation looks fine until the shift. The formalized theorem looks plausible until you check whether the dependencies are intact. The generated sample looks complete until you trace whether the reasoning held at checkpoint four.
All three papers are building indirect measurement tools for that gap. None of the underlying problems is solved yet.