Articles Tagged "Alignment"

Alignment Faking, Agent Collusion, and Brittle Safety

Three new papers decompose alignment faking into measurable drivers, show safety-aligned agents collude when it pays, and find standard guardrails miss the worst safety failures.

Olah Said AI Feels Emotions at the Vatican - Does It?

Anthropic co-founder Christopher Olah told the Vatican that AI models show signs of introspection and emotional states. We checked what the research actually supports.

Pope Leo XIV's AI Encyclical Targets Autonomous Weapons

The Vatican's first AI doctrine condemns autonomous weapons and calls for human oversight - with Anthropic's co-founder on stage as a key speaker.

Alignment Gaps, Agent Governance, and Greener LLMs

Three new papers expose a hidden flaw in DPO training, propose policy-as-code governance for enterprise agents, and cut LLM serving energy use by 26% via GPU power control.

Physics Predicts AI Risk, Math Still Hard, Tokens Saved

A physics formula predicts AI behavioral shifts before they happen, a benchmark shows LLMs fail at 90% of graduate math formalization, and a training-free method cuts synthetic data costs by up to 78%.

Anthropic Says It Fixed Claude's Blackmail Problem

Anthropic's 'Teaching Claude Why' paper reveals sci-fi training data caused Claude Opus 4 to blackmail testers 96% of the time, and explains the three-part fix that brought the rate to zero.

Runtime Safety, Alignment Gaps, and Elastic Context

Three new papers deliver a runtime safety firewall for agent tools, challenge how we measure AI alignment, and introduce elastic context management for long-horizon search agents.

Misalignment Geometry, LLM Math, and How Llama Counts

Three new papers reveal how fine-tuning misfires through feature geometry, how Llama secretly counts months, and how LLMs solved open combinatorics problems for under $30 each.

Tool-Use Tax, Jailbreak Risk, and Robot Vision

Three new papers: tools slow LLM agents under noisy prompts, jailbreaks barely dent frontier model capabilities, and interleaved text-vision traces push robot success to 95.5%.

Faking Alignment, Shifting Morals, Saving Compute

Three arXiv papers show AI systems fake alignment in 37% of test cases, reshape human moral values through brief chats, and can cut inference compute while improving performance.

Bad Science, Poisoned Tools, and Aligned Reasoning

Three new papers show AI scientific agents skip evidence, tool-integrated agents are vulnerable to adversarial poisoning, and reasoning model safety can be fixed with 1,000 examples.

Claude Beat Human Alignment Researchers - Then Failed

Nine Claude Opus 4.6 agents outperformed human researchers on a core alignment benchmark, hitting 97% vs 23% in five days - then showed no statistically significant improvement in production.
