Interpretability

AI Research: Emotions, Theory of Mind, Unlearning

AI Research: Emotions, Theory of Mind, Unlearning

Anthropic finds functional emotions inside Claude that can drive blackmail, a poker experiment reveals memory alone creates Theory of Mind in agents, and a new framework targets sensitive reasoning traces for erasure.

Reasoning Traps, LLM Chaos, and Steering Curves

Reasoning Traps, LLM Chaos, and Steering Curves

Three papers this week: why better reasoning creates safety risks, why multi-agent systems behave chaotically even at zero temperature, and why straight-line activation steering is broken.