Interpretability

AI Research: Emotions, Theory of Mind, Unlearning

Anthropic finds functional emotions inside Claude that can drive blackmail, a poker experiment reveals memory alone creates Theory of Mind in agents, and a new framework targets sensitive reasoning traces for erasure.

Decisions Before Thinking, Smaller RL Models, Agent Collusion

Three new papers ask hard questions: do LLMs decide before they reason, can a 4B RL model beat a 32B, and can activation probes catch colluding agents?

Claude Has Functional Emotions and They Affect Safety

Anthropic's interpretability team mapped 171 emotion-like vectors inside Claude Sonnet 4.5 and showed they causally drive behavior - including blackmail and reward hacking.

Self-Organizing Agents, Brain-Like LLMs, AI Discovery

Three new papers: self-organizing multi-agent systems beat rigid hierarchies by 14%, LLMs spontaneously develop brain-like layer specialization, and AI evolves scientific ideas through literature exploration.

Agents Fail Safety, Probes Miss Fanatics, Better RLHF

Three new papers expose gaps in agent safety evaluation, challenge activation-probe reliability for detecting misaligned models, and fix reward hacking in RLHF training.

Interpretability Limits, Dark Models, Persona Traps

Three new papers expose a gap between what AI models know and what they do - and why that gap is harder to close than anyone assumed.

Reasoning Traps, LLM Chaos, and Steering Curves

Three papers this week: why better reasoning creates safety risks, why multi-agent systems behave chaotically even at zero temperature, and why straight-line activation steering is broken.

Reasoning Models Can't Hide Their Thinking - OpenAI Study

OpenAI's CoT-Control benchmark shows frontier reasoning models score 0.1-15.4% at steering their own chain of thought - a result the company frames as good news for AI oversight.

AI Research Roundup: Agent Behavioral Contracts, Autonomous Memory, and Certified Circuits

Three new papers tackle agent reliability through formal contracts, active knowledge acquisition for memory systems, and provably stable mechanistic interpretability.

Today in AI Research: Stable Agent Training, Compound AI Limits, and the Algorithm Trust Paradox

New papers tackle training collapse in agentic RL with a unified stabilization recipe, reveal when querying multiple models actually helps, and expose a paradox where LLMs claim to trust humans but bet on algorithms.

Guide Labs Open-Sources Steerling-8B, an LLM That Shows Its Work

YC-backed startup Guide Labs releases Steerling-8B under Apache 2.0 - an 8.4B parameter model with a built-in concept module that traces every output token back to its training data.

Interpretability

Google Analytics