Science Agents, Jailbreak Defense, and Open-World Failures

Three papers from today's arXiv: graph-native RL generates traceable scientific hypotheses, HARC hardens models against jailbreaks by coupling internal safety directions, and ICML 2026's OpenAgent shows how distributional shift breaks tool-use agents.

Three papers from today's cs.AI batch caught our attention: a graph-native RL system that produces verifiable scientific hypotheses for materials research, a mechanistic explanation of why jailbreaks work and a fine-tuning fix that cuts attack success rates by more than 4×, and an ICML 2026 paper that taxonomizes the ways distributional shift destroys agents trained on static benchmarks.

TL;DR

  • Graph-PRefLexOR - Graph-native RL gives LLMs an explicit reasoning scaffold, boosting scientific hypothesis quality by 40-65% with 2-3× more semantic diversity than standard baselines
  • HARC - Jailbreaks work by suppressing harmfulness or refusal directions in the residual stream; coupling these directions during fine-tuning cuts attack success rates 4.7× without hurting capability
  • OpenAgent (ICML 2026) - Both SFT and RL agents collapse under distributional shift in queries, tools, and observations; targeted perturbation training brings refusal accuracy on unsolvable tasks from 0.2% to 99.6%

Graph-PRefLexOR: LLMs That Think in Graphs Before They Theorize

Paper: "Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination"
Authors: Subhadeep Pal, Shashwat Sourav, Tirthankar Ghosal, Markus J. Buehler
arXiv: 2607.00924

Standard language models hallucinate hypotheses. They produce plausible-sounding science without building the internal structure that makes hypotheses auditable. Graph-PRefLexOR attacks this by forcing an explicit reasoning scaffold before any answer is produced.

The model's <think> block runs five structured phases before producing output: brainstorm (divergent exploration of mechanisms), graph (natural-language entity and relationship identification), graph_json (a machine-readable directed graph), patterns (causal motifs extracted from the graph), and synthesis (final hypothesis). The graph isn't decorative. One reward component specifically tests whether a judge can reconstruct the final answer from the graph alone - if not, the graph failed its purpose.
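To make the graph_json phase concrete, here is a minimal sketch of what parsing and sanity-checking such a payload might look like. The schema (a "nodes" list plus "edges" with "source", "target", and "relation" keys) and the grain-boundary example are assumptions for illustration; the article does not give the paper's actual format.

```python
import json


def parse_graph_json(raw: str) -> dict:
    """Parse a graph_json payload and run basic directed-graph sanity checks.

    The schema used here is hypothetical, not the paper's.
    """
    data = json.loads(raw)
    nodes = set(data["nodes"])
    edges = [(e["source"], e["target"], e.get("relation", "")) for e in data["edges"]]
    # An edge is dangling if either endpoint was never declared as a node.
    dangling = [(s, t) for s, t, _ in edges if s not in nodes or t not in nodes]
    return {
        "nodes": nodes,
        "edges": edges,
        "dangling": dangling,
        "valid": bool(nodes) and bool(edges) and not dangling,
    }


# Illustrative payload only; not taken from the paper.
EXAMPLE = json.dumps({
    "nodes": ["grain refinement", "grain boundary density", "yield strength"],
    "edges": [
        {"source": "grain refinement", "target": "grain boundary density", "relation": "increases"},
        {"source": "grain boundary density", "target": "yield strength", "relation": "increases"},
    ],
})

result = parse_graph_json(EXAMPLE)
print(result["valid"], len(result["edges"]))
```

A check like this only covers structural validity. The reconstruction test described above needs a judge model and is out of scope for the sketch.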

Training with GRPO and a Six-Component Reward

Training uses a two-stage pipeline. An ORPO cold-start phase contrasts rich, graph-structured preferred responses (generated by GPT-5.1 as teacher) against deliberately shallow rejected responses, establishing the expected format. Then Graph-GRPO applies Group Relative Policy Optimization across eight completions per prompt with group-normalized advantages, adapting only LoRA parameters for efficiency.
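The group-normalized advantage at the heart of GRPO is easy to sketch: each completion's reward is standardized against the other completions for the same prompt. The reward values, the use of the population standard deviation, and the epsilon below are assumptions for illustration, not details from the paper.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight hypothetical composite rewards, one per sampled completion.
group_rewards = [0.2, 0.5, 0.9, 0.4, 0.7, 0.1, 0.6, 0.8]
advantages = group_normalized_advantages(group_rewards)
print([round(a, 2) for a in advantages])
```

Completions that beat the group mean get a positive advantage and the rest get a negative one, so no separate value network is needed to supply a baseline.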

The composite reward weights six components: correctness (0.30), graph utility (0.25), format validity (0.15), NetworkX graph validity (0.10), embedding-based diversity (0.10), and structure (0.10). The graph utility weight reflects the core bet: a graph the model can't use for reconstruction is wasted compute.
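As a rough illustration, the weighting above amounts to a weighted sum. The weights are the ones quoted in this article; the key names and the assumption that each component score lies in [0, 1] are ours.

```python
# Weights as reported above; key names are hypothetical labels.
WEIGHTS = {
    "correctness": 0.30,
    "graph_utility": 0.25,
    "format_validity": 0.15,
    "networkx_validity": 0.10,
    "diversity": 0.10,
    "structure": 0.10,
}


def composite_reward(scores: dict) -> float:
    """Weighted sum of component scores (each assumed to lie in [0, 1])."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# A completion that is perfect except for an unusable graph.
scores = {name: 1.0 for name in WEIGHTS}
scores["graph_utility"] = 0.0
print(round(composite_reward(scores), 2))
```

Under these assumptions, an otherwise perfect completion with an unusable graph loses a quarter of its reward, which is the incentive the graph utility weight is meant to create.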

Image: Scientist working in a laboratory environment.
Caption: Scientific hypothesis generation requires structured reasoning grounded in evidence. Graph-PRefLexOR uses explicit graph scaffolding to bridge language generation with symbolic structure. Source: pexels.com

Results on 100 Materials Science Questions

Tested on 100 open-ended materials science and mechanics questions, Graph-PRefLexOR improves on baseline models by 40-65% at the 1.7B, 3B, and 8B parameter scales. Reasoning traceability shows the largest gains. Semantic diversity in the reasoning trace runs 2.6-2.9× higher than baselines; answer diversity is 1.8-2.4× higher. Both matter for discovery - you need several distinct hypothesis paths, not just the single most probable one.

Qwen3-8B final answers aligned with their own reasoning traces in 16 out of 100 cases. Graph-PRefLexOR answers aligned with their own synthesis phase in 92 out of 100.

Hidden-state analysis confirms the gap. Graph-PRefLexOR maintains low layer-wise divergence between reasoning and answer representations (0.15-0.25 cosine distance). Standard Qwen3-8B shows sharp separation around layers 7-10 (0.35-0.45). The base model's reasoning and its final answer live in different representational spaces. The structured scaffold closes that distance.
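The divergence metric can be sketched in a few lines. This assumes one pooled hidden-state vector per layer for the reasoning span and one for the answer span; the article does not describe how the paper pools or extracts these vectors, and the toy values below are made up.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def layerwise_divergence(reasoning_states, answer_states):
    """Cosine distance per layer between pooled reasoning and answer vectors."""
    return [cosine_distance(r, a) for r, a in zip(reasoning_states, answer_states)]


# Toy two-layer example with made-up 3-dimensional states.
reasoning = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]
answer = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print([round(d, 2) for d in layerwise_divergence(reasoning, answer)])
```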

For practitioners in computational biology, materials discovery, or any domain where hypothesis chains need to be audited rather than trusted on faith, the traceability results are the most practically meaningful finding here.


HARC: A Mechanistic Account of Why Jailbreaks Work

Paper: "HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment"
Authors: Shei Pern Chua, Fangzhao Wu
arXiv: 2607.00572

Most work on AI safety alignment treats jailbreaks as a behavioral problem and tries to patch them behaviorally. HARC takes a different angle. The key finding: aligned LLMs encode harmfulness and refusal as two separable linear directions in the residual stream, and jailbreaks work by suppressing one or both of these directions before token generation begins.

The paper maps three attack types to specific suppression patterns:

  • DAN (persona framing) suppresses refusal while leaving the harmfulness signal intact
  • PAIR (semantic rewriting) drives the harmfulness direction negative while activating refusal - the model thinks it's declining a benign request
  • CodeAttack (code obfuscation) suppresses both directions, making the prompt cluster with benign inputs

The temporal finding is the sharpest contribution: models detect harmful content during generation even when they missed it at the prompt stage. Safety recognition splits across time. Defending only at the input is structurally insufficient.

Image: Padlock on smartphone representing digital security.
Caption: HARC identifies a mechanistic gap in alignment: jailbreaks suppress internal safety signals before generation, not just at the surface level of the input. Source: pexels.com

The Fix: Couple Harmfulness and Refusal During Fine-Tuning

HARC applies LoRA fine-tuning with margin hinge losses on cosine projections targeting the harmfulness direction (v_harm) and refusal direction (v_ref) at both prompt and response positions. Four loss components handle coupling across positions. A KL divergence term preserves general capabilities, and cross-entropy supervises explicit refusal text.
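A margin hinge loss on a cosine projection penalizes an activation only when its alignment with a target direction falls below a margin. The sketch below reads the four coupling components as one hinge term per (position, direction) pair. That reading, the 0.3 margin, and the example values are all assumptions; the KL and cross-entropy terms are left out.

```python
def hinge(cos_proj: float, margin: float) -> float:
    """Zero once the projection clears the margin, linear penalty below it."""
    return max(0.0, margin - cos_proj)


# Assumed reading of the four components: {prompt, response} x {v_harm, v_ref}.
TERMS = [("prompt", "harm"), ("prompt", "ref"), ("response", "harm"), ("response", "ref")]


def coupling_loss(cos_projections: dict, margin: float = 0.3) -> float:
    """Sum of hinge penalties over all four (position, direction) pairs.

    `cos_projections` maps each pair to the cosine between the activation at
    that position and the corresponding direction, for a harmful example.
    """
    return sum(hinge(cos_projections[term], margin) for term in TERMS)


# A DAN-like case: harmfulness is detected but refusal is suppressed at the prompt.
example = {
    ("prompt", "harm"): 0.5,
    ("prompt", "ref"): 0.0,
    ("response", "harm"): 0.5,
    ("response", "ref"): 0.4,
}
print(round(coupling_loss(example), 2))
```

In this toy case only the suppressed refusal projection at the prompt position contributes to the loss, so the gradient would push that one signal back up.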

On Llama-3.1-8B, HARC cuts attack success rate by 4.67× compared to baseline. Mean harmfulness score drops from 0.430 to 0.092. MMLU holds at 0.698 - no capability degradation. Qwen-2.5-7B shows a 4.75× ASR reduction with the same pattern. Across five model families (Llama-3.1, Qwen-2.5, Mistral, Phi-3, and additional variants at 7B and 70B scales), the directional structure reproduces consistently.

Where It Breaks Down

CodeAttack remains resistant. Coupling can only amplify a safety signal that is already present, so when an attack suppresses both directions at once there is nothing left to boost. The method also won't survive adversarial fine-tuning: the paper reports that roughly 160 harmful fine-tuning examples are enough to compromise the coupling. And it requires gradient access, so closed-source APIs are out of scope.

For teams running open-source models and managing their own safety work, this is one of the cleaner mechanistic pictures of why alignment fails at a representational level and what to do about it.


OpenAgent (ICML 2026): Mapping How Benchmark Agents Break

Paper: "Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use"
Authors: Song-Lin Lv, Weiming Wu, Rui Zhu, Zi-Jian Cheng, Lan-Zhe Guo
Venue: ICML 2026
arXiv: 2607.01084

AI agents post strong numbers on static benchmarks and fall apart in deployment. This ICML 2026 paper gives that failure pattern a proper taxonomy and a diagnostic framework for identifying which failure mode you're facing.

OpenAgent defines four dimensions of distributional shift:

Dimension | What changes | Typical failure
Intent Shift (ΔQ) | User query language or intent | Ambiguous instructions compound errors across steps
Structural Shift (ΔA) | Tool names, semantics, or dependencies | Renaming a tool breaks memorized name associations
Dynamics Shift (ΔO) | Observation format, error states | Unexpected null values treated as valid results
Compositional Domain Shift (ΔD) | All three change simultaneously | New domain with the same underlying task structure

The four-tier evaluation hierarchy organizes tests by cognitive demand: Perception (instruction robustness, schema adaptability), Interaction (format adaptability, error correction), Reasoning (path planning under changed dependencies), and Internalization (active refusal for unsolvable tasks, domain transfer).

What Both SFT and RL Get Wrong

SFT agents show brittle symbolic anchoring. When surface tokens change, performance collapses because the model memorized name-context associations rather than functional semantics. RL agents hold up better on perception tasks but both paradigms suffer "precipitous performance drops" during Logic Inversion tests, where causal dependencies between tools reverse.

Both approaches also fail on what the paper calls Boundary Blindness. SFT treats error states as successful completions. RL explicitly recognizes failures but fabricates answers anyway - the outcome-oriented reward structure assumes every problem is solvable. Understanding why AI benchmarks don't predict deployment performance is one of the field's persistent gaps. OpenAgent gives that gap a formal structure.

PAFT: Designing the Failure Modes Out

Perturbation-Augmented Fine-Tuning adds three targeted perturbation types during training. Environmental Feedback Perturbation injects stochastic observation anomalies mid-arc, turning an open-loop policy into a closed-loop one that learns to recover. Solvability Boundary Perturbation introduces unrecoverable errors with explicit refusal supervision, breaking the assumption that every problem has a solution. Symbolic Representation Perturbation applies synonymous rewriting and noise to tool definitions, forcing the model to ground in functional semantics rather than token strings.
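Of the three, Symbolic Representation Perturbation is the simplest to sketch: swap tool names for synonyms while leaving the functional description untouched. The tool schema, the synonym table, and the function name below are hypothetical, and the paper's rewriting and noise procedure is likely richer than a lookup table.

```python
import random

# Hypothetical synonym table; a real pipeline might generate rewrites instead.
SYNONYMS = {
    "get_weather": ["fetch_forecast", "lookup_weather"],
    "send_email": ["dispatch_mail", "mail_message"],
}


def perturb_tool_names(tools, rng):
    """Rename tools via SYNONYMS, returning new definitions and the name mapping."""
    mapping = {}
    renamed = []
    for tool in tools:
        options = SYNONYMS.get(tool["name"])
        new_name = rng.choice(options) if options else tool["name"]
        mapping[tool["name"]] = new_name
        # Only the surface name changes; the description is kept as-is.
        renamed.append({**tool, "name": new_name})
    return renamed, mapping


tools = [
    {"name": "get_weather", "description": "Return the forecast for a city."},
    {"name": "send_email", "description": "Send a message to a recipient."},
    {"name": "search_docs", "description": "Search internal documents."},
]
perturbed, name_map = perturb_tool_names(tools, random.Random(0))
print(name_map)
```

The returned mapping would also need to be applied to tool calls in the training trajectory so that calls and definitions stay consistent.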

At a 30% perturbation ratio, the gains are large. Semantic Trap accuracy improves by 28.6 points. Fatal Error refusal rate goes from 0.2% to 99.6% - the model learns to say "I can't do this" instead of hallucinating a completion.
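One plausible reading of the perturbation ratio is the fraction of training examples that receive a perturbation. The sketch below picks that subset reproducibly. The interpretation, the function name, and the dataset size are assumptions; the article does not say how the paper defines the ratio.

```python
import random


def select_for_perturbation(n_examples: int, ratio: float = 0.30, seed: int = 0):
    """Return sorted indices of the training examples chosen for perturbation."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)
    k = round(n_examples * ratio)
    # Sampling without replacement so each example is perturbed at most once.
    return sorted(rng.sample(range(n_examples), k))


chosen = select_for_perturbation(1000)
print(len(chosen))
```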


Common Threads

All three papers address the same structural problem from different angles: models trained on clean, static data fall apart when conditions shift. Graph-PRefLexOR adds explicit structure to scientific reasoning so hypotheses remain traceable rather than drifting into plausible-but-unverifiable territory. HARC adds structure to safety representation so jailbreaks can't simply suppress the signals alignment training installed. OpenAgent adds structure to evaluation and training to surface failure modes that benchmark conditions hide.

The pattern across all three: robustness doesn't emerge from scale alone. It requires deliberate design - whether that's a graph scaffold, a direction-coupling objective, or a perturbation curriculum.


About the author: Elena Marchetti, Senior AI Editor & Investigative Journalist

Elena is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.