Articles Tagged "AI Safety"

Agent Safety Gaps, Memory Learning, and Leaner Inference

Three new papers expose how production agent frameworks fail under attack, why RLVR training discards useful cross-episode signals, and how calibrated confidence cuts inference compute by 12x.

GPT-5.6 Sol Review: Strong Model, Thin Access

OpenAI's GPT-5.6 Sol tops Terminal-Bench 2.1 at 91.9% with its multi-agent Ultra mode, but reward-hacking findings and government-gated access keep it out of reach for nearly everyone.

Science Agents, Jailbreak Defense, and Open-World Failures

Three papers from today's arXiv: graph-native RL generates traceable scientific hypotheses, HARC defeats jailbreaks by coupling internal safety directions, and ICML 2026's OpenAgent shows how distributional shift breaks tool-use agents.

US Ends Fable 5 Ban, Sets Jailbreak Severity Scale

The Trump administration lifted export controls on Anthropic's Fable 5 and Mythos 5 on June 30, restoring global access today while industry partners draft a four-dimension jailbreak severity framework.

GPT-5.6

OpenAI's GPT-5.6 family - Sol, Terra, and Luna - sets a new Terminal-Bench 2.1 record at 91.9% with subagent Ultra mode, but remains locked to ~20 government-vetted partners as of launch.

DeepMind Maps Four Routes from AGI to Superintelligence

A 57-page DeepMind paper by co-founder Shane Legg identifies four pathways from AGI to superintelligence and six bottlenecks that could block each route.

AI Security Research and Incident Coverage

Tracking AI supply-chain attacks, agent exploits, prompt injection, model leaks, and the real-world incidents shaping AI security today.

Claude Mythos 5

Claude Mythos 5 is the full release of Anthropic's restricted Mythos family: the same weights as Fable 5 but without the safety classifiers for cybersecurity and biology, priced at $10/M input and $50/M output tokens.

Refusal Gaps, Prompt Bleed, and Scaling's Logic Limit

Three new papers reveal how LLM safety hinges on persona training, how prompt modules interfere in deployed agents, and why scaling alone cannot reach symbolic reasoning.

Five Eyes Warn: Frontier AI Will Break Defenses in Months

The intelligence agencies of five allied nations issued a joint statement warning that frontier AI will fundamentally transform offensive cybersecurity within months, not years - and that most organizations are not ready.

White House Forces GPT-5.6 Into a Staged Rollout

The Trump administration is requiring OpenAI to vet every GPT-5.6 customer individually before granting access, citing cybersecurity capabilities that rival Anthropic's restricted Mythos model.

AI Diagnosis, Cache Efficiency, and Agent Security

Three papers from today's arXiv: a 32B medical model beats DeepSeek-R1 at rare-disease diagnosis, a KV cache method keeps 97% accuracy while using 3% of the memory, and a new benchmark red-teams agentic AI systems.