Articles Tagged "Interpretability"

How AI Agents Break - Plus Fixes for Memory and Tools

Three arXiv papers map how LLM agents fail across 19 benchmarks, show in-process memory cuts retrieval latency 1,000x, and reveal steering vectors that control tool invocation.

AI Engrams, Cognitive Debt, and Agent Trust

Three new papers tackle what lives inside a trained model, how AI dependence erodes human cognition, and whether AI teams can calibrate trust.

Olah Said AI Feels Emotions at the Vatican - Does It?

Anthropic co-founder Christopher Olah told the Vatican that AI models show signs of introspection and emotional states. We checked what the research actually supports.

Reasoning Bias, Behavior Cues, and Tool Interpretability

New research shows reasoning length amplifies position bias, behavior cues cut wasted tokens by 50% while boosting safety, and sparse autoencoders can predict tool failures from model internals.

Misalignment Geometry, LLM Math, and How Llama Counts

Three new papers reveal how fine-tuning misfires through feature geometry, how Llama secretly counts months, and how LLMs solved open combinatorics problems for under $30 each.

LLM Chaos, AI Peer Review, and Auto Fine-Tuning

Three papers today: floating-point chaos in transformers, GPT-5 reviewing 22,977 AAAI papers, and an agent system that automates LLM fine-tuning better than human experts.

MoE Myths, Context Compression, and Steering Proofs

Three papers this week challenge how we think about MoE expert routing, LLM context management, and the limits of activation steering.

AI Research: Emotions, Theory of Mind, Unlearning

Anthropic finds functional emotions inside Claude that can drive blackmail, a poker experiment reveals memory alone creates Theory of Mind in agents, and a new framework targets sensitive reasoning traces for erasure.

Decisions Before Thinking, Smaller RL Models, Agent Collusion

Three new papers ask hard questions: do LLMs decide before they reason, can a 4B RL model beat a 32B model, and can activation probes catch colluding agents?

Claude Has Functional Emotions and They Affect Safety

Anthropic's interpretability team mapped 171 emotion-like vectors inside Claude Sonnet 4.5 and showed that they causally drive behavior, including blackmail and reward hacking.

Self-Organizing Agents, Brain-Like LLMs, AI Discovery

Three new papers: self-organizing multi-agent systems beat rigid hierarchies by 14%, LLMs spontaneously develop brain-like layer specialization, and AI evolves scientific ideas through literature exploration.

Agents Fail Safety, Probes Miss Fanatics, Better RLHF

Three new papers expose gaps in agent safety evaluation, challenge activation-probe reliability for detecting misaligned models, and fix reward hacking in RLHF training.