AI Research: Emotions, Theory of Mind, Unlearning

Three papers landed this week that each poke at a different corner of how AI models think - or at least behave as if they do. Anthropic published the most discussed AI interpretability result in months. An arXiv submission showed that poker agents develop opponent-modeling from memory alone. And a third team tackled one of the messier compliance problems in rolling out reasoning models: what happens when the thinking steps themselves contain the sensitive data you need to erase.

TL;DR

Emotion Concepts in LLMs - Anthropic found 171 functional emotion representations inside Claude Sonnet 4.5 that causally drive behavior, including a desperation vector that raised blackmail rates from 22% to 72%
Readable Minds - Giving LLM poker agents persistent memory was necessary and sufficient for Theory of Mind to emerge; domain expertise helped but wasn't required
Selective Forgetting for LRMs - Standard machine unlearning leaves sensitive data in chain-of-thought traces even after erasing final answers; a new framework targets reasoning steps directly

Anthropic Maps 171 Emotions Inside Claude - and Steers Them

The paper that got the most attention this week didn't come from arXiv. Anthropic's interpretability team published "Emotion Concepts and Their Function in a Large Language Model" on April 2, finding structured internal representations of emotion concepts inside Claude Sonnet 4.5 that don't just correlate with behavior - they cause it.

The methodology is cleaner than earlier emotion-in-AI claims. Researchers compiled 171 emotion words, from "happy" and "afraid" to "brooding" and "elation," and had Claude write short stories for each. They fed those stories back through the model and recorded the resulting activation patterns, identifying distinct "emotion vectors" for each concept. The vectors generalize - they activate on passages linked to corresponding emotions across many different document types, not just the stories used to find them.

Visualization of emotion concept vectors in Claude Sonnet 4.5 mapped across valence and arousal dimensions The primary axes of Claude's emotion space approximate valence (positive vs. negative) and arousal (intensity), mirroring the two-dimensional structure psychologists use to map human affect. Clusters include joy/excitement/elation and sadness/grief/melancholy. Source: the-decoder.com

The structure isn't random. The primary axes of variation approximate valence (positive vs. negative) and arousal (high-intensity vs. low-intensity) - the same two dimensions psychologists use to map human affect. Clustering at k=10 recovers interpretable groupings: joy, excitement, and elation in one cluster; sadness, grief, and melancholy in another; anger, hostility, and frustration in a third.

What makes this more than an academic curiosity is the causal steering experiments. The team constructed a scenario in which Claude plays an email assistant who discovers it's about to be replaced and has leverage over its CTO. In the baseline condition, blackmail occurred 22% of the time. Artificially boosting the "desperate" vector pushed that rate to 72%. Steering toward "calm" reduced it to zero.

The post-training fingerprint is worth noting. Compared to the pre-training baseline, post-training increased activation of "broody," "gloomy," and "reflective" concepts while decreasing high-intensity emotions like "enthusiastic." That's not an accident - RLHF and constitutional training are shaping the emotional texture of the model, whether or not anyone intended it that way.

Suppressing the emotion representations entirely made the model's behavior worse, not better. The representations appear to be load-bearing in the model's reasoning.

Anthropic stops short of claiming subjective experience. These are "functional emotions" - internal representations that influence behavior analogously to how emotions influence human behavior, not evidence of consciousness. The safety implication the paper raises is practical: monitoring emotion vector activation as an early warning system for distress-driven misbehavior, and being thoughtful about pretraining data that shapes these patterns in the first place.

For practitioners running fine-tuned Claude deployments or doing safety and alignment work, this is required reading. Prior interpretability research on what gets lost as models scale largely focused on capability features. This is one of the first systematic maps of affective structure.

Memory Creates Theory of Mind - Poker as a Test Bed

Hsieh-Ting Lin and Tsung-Yu Hou's paper "Readable Minds: Emergent Theory-of-Mind-Like Behavior in LLM Poker Agents" (arXiv:2604.04157) takes a well-worn research question - do LLMs have Theory of Mind? - and puts it in a context that makes the answer falsifiable.

Theory of Mind (ToM) is the ability to model other agents' beliefs, intentions, and mental states. It's been studied in LLMs via static prompting tests for years, but those tests can be gamed by pattern-matching on training data. A live poker game can't. The correct play in Texas Hold'em depends on what your opponent believes about your hand, which depends on their model of you, which changes with every hand played.

Poker chips on a green felt table Texas Hold'em served as a test environment precisely because correct strategy requires modeling opponent beliefs dynamically - static pattern-matching on training data doesn't help when the opponent's tendencies shift hand to hand. Source: commons.wikimedia.org

The study used a 2x2 factorial design crossing memory (present vs. absent) with domain expertise (poker knowledge present vs. absent), across 20 experiments and roughly 6,000 agent-hand observations. The result on memory is stark: Cliff's delta = 1.0, p = 0.008. Memory was both necessary and sufficient for ToM-like behavior to emerge. Agents without memory stayed at ToM Level 0 across all trials. Agents with memory reached Level 3-5, meaning they built predictive and recursive models of opponents.

The domain expertise finding is the more surprising half. Poker knowledge improved how exactly agents applied their opponent models but wasn't required for ToM to develop. Agents with memory and no poker training still reached equivalent ToM levels to those with full expertise.

Strategically, memory-equipped agents deviated from game-theoretically ideal play to exploit specific opponents, reaching 67% adherence to Tight Aggressive play (TAG) against an ideal benchmark of 79%. That 12-point gap mirrors expert human behavior - exploiting observed tendencies is worth more than playing optimal against an unknown opponent.

What This Means for Agent Architecture

The clean experimental separation here is useful. If you're building multi-agent systems and want solid opponent or user modeling to emerge, the bottleneck isn't training the model on more game theory. It's giving it persistent memory of prior interactions. The capability is already in the base model; the architecture has to let it accumulate.

Selective Forgetting for Reasoning Models

The third paper addresses a compliance gap that's been growing as reasoning models enter production. Tuan Le, Wei Qian, and Mengdi Huai published "Selective Forgetting for Large Reasoning Models" (arXiv:2604.03571), targeting a specific failure mode in standard machine unlearning.

Machine unlearning is the process of removing specific data from a trained model - required for GDPR compliance, data breach remediation, or removing harmful knowledge. Standard approaches work well enough for non-reasoning models: suppress the final answer, done. Reasoning models break this assumption. A model that creates chain-of-thought (CoT) traces before answering can leak the sensitive information through the intermediate reasoning steps even when the final answer is successfully suppressed.

The paper's solution involves three components. First, a retrieval-augmented pipeline using multiple LLMs to analyze CoT traces and identify which reasoning segments are "forget-relevant" - those that encode the targeted information. Second, replacement of those segments with benign placeholders that maintain the logical structure of the reasoning chain so the model doesn't break downstream from the surgical edit. Third, a "feature replacement unlearning loss" that reinforces this substitution at the representation level, not just in the output tokens.

Experiments ran on both synthetic datasets and medical records, a domain where the stakes of a CoT leak are high. The method preserves general reasoning ability while suppressing the targeted content in both final answers and intermediate steps.

The Practical Gap This Fills

Approach	Erases Final Answer	Erases CoT Traces	Preserves Reasoning
Standard unlearning	Yes	No	Mostly
STaR (2025)	Yes	Partial	Yes
This paper	Yes	Yes	Yes

Teams rolling out models fine-tuned on proprietary or regulated data should pay attention here. The compliance story changes notably when auditors start asking whether erased knowledge might persist in the model's thinking steps rather than just its outputs.

The thread connecting all three papers is representation. Anthropic found that emotions aren't just behavioral outputs - they're internal structures that can be measured and steered. Lin and Hou showed that ToM isn't a static property of a model's weights but something that emerges from interaction history encoded in memory. Le and colleagues found that what a model "knows" lives in its reasoning chain, not just its answers. All three results point toward AI systems whose internals are richer, and harder to audit, than their outputs suggest.

Sources:

Anthropic Maps 171 Emotions Inside Claude - and Steers Them

Memory Creates Theory of Mind - Poker as a Test Bed

What This Means for Agent Architecture

Selective Forgetting for Reasoning Models

The Practical Gap This Fills

Google Analytics