Compact Contexts, Smarter Fine-Tuning, and the Solver Trap

Three papers from today's arXiv: a joint fix for KV cache bloat and attention cost, new evidence that fine-tuning belongs in the middle of a transformer, and why stronger reasoning hurts behavioral simulation.

Three papers today, each aimed at a different pressure point in how we build and run language models. One tackles the memory wall for long contexts. Another cracks open the fine-tuning process to show which parts of a transformer actually change. And the third delivers a counterintuitive finding: make a model smarter, and it may become a worse simulator of human behavior.

TL;DR

  • Latent-Condensed Attention - A new attention mechanism jointly cuts prefilling cost by 2.5x and KV cache memory by 90% at 128K context, accepted to ACL 2026
  • Mid-Block Efficient Tuning - Layer-wise analysis of supervised fine-tuning shows middle layers stay stable; targeting them for updates beats standard LoRA by up to 10.2% on GSM8K
  • Solver-sampler mismatch - GPT-5.2 with native reasoning chose the dominant-strategy outcome in 45 of 45 negotiation runs; bounded reflection recovered compromise outcomes every time

Latent-Condensed Attention: Shrinking the Context Bottleneck

Authors: Zeng You, Yaofo Chen, Qiuwu Chen, Ying Sun, Shuhai Zhang, Yingjian Li, Yaowei Wang, Mingkui Tan | arXiv: 2604.12452 | ACL 2026

Running a model over a 128K-token document is expensive in two distinct ways. Attention computation scales quadratically with sequence length. And the KV cache - the memory where a model stores key and value representations for every token it has seen - grows linearly, quickly filling GPU RAM. Most optimization work fixes one or the other. This paper fixes both at the same time.

The paper introduces Latent-Condensed Attention (LCA), which works inside the compressed latent space already used by Multi-head Latent Attention (MLA). MLA, used in models like DeepSeek, projects keys and values into a lower-dimensional space before storing them. LCA goes further: it condenses the context directly within that latent space rather than at the token level, removing redundancy where it's cheapest to do so.

The core trick is separating semantic content from positional information. Semantic vectors get aggregated through query-aware pooling - the model learns which context regions matter for each query. Positional keys go through anchor selection separately. This disentanglement lets the authors prove a length-independent error bound, meaning condensation quality doesn't degrade as sequences get longer.
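The paper's exact pooling operator isn't reproduced here, but the idea of query-aware pooling can be sketched in a few lines: learned slot queries attend over the latent cache and aggregate it into a fixed number of condensed slots, independent of sequence length. Everything below (function name, shapes, plain softmax attention as the pooling rule) is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def query_aware_pool(latent_kv: np.ndarray, slot_queries: np.ndarray) -> np.ndarray:
    """Condense T latent vectors into m slots via softmax attention.

    latent_kv:    (T, d) compressed keys/values (the MLA-style latent cache)
    slot_queries: (m, d) learned queries, one per condensed slot
    returns:      (m, d) condensed cache whose size is independent of T
    """
    d = latent_kv.shape[1]
    scores = slot_queries @ latent_kv.T / np.sqrt(d)   # (m, T) similarity
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # row-wise softmax
    return weights @ latent_kv                         # (m, d)
```

Whether the context is 4K or 128K tokens, the condensed cache stays at m vectors; in the paper the positional keys are handled separately via anchor selection, which this sketch omits.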


The results are concrete. LCA reaches a 2.5x speedup in the prefilling phase - the part where the model processes the input before producing anything. KV cache memory drops by 90% at 128K context. The method works across different attention architectures including Grouped Query Attention (GQA), which is the attention variant used in most modern open-source models.
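To make the 90% figure concrete, here's back-of-the-envelope arithmetic for a hypothetical GQA model (32 layers, 8 KV heads, head dimension 128, fp16). None of these configuration numbers come from the paper; they're just representative of mid-size open models.

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_val: int = 2) -> int:
    # 2x for storing both keys and values at every layer
    return 2 * layers * kv_heads * head_dim * bytes_per_val * tokens

full = kv_cache_bytes(128 * 1024)   # 128K-token context
print(full / 2**30)                 # -> 16.0 (GiB)
print(full * 0.10 / 2**30)          # with a 90% reduction: ~1.6 GiB
```

Sixteen gigabytes of cache alone, before weights and activations, is why a 128K context is often impractical; cutting it to roughly 1.6 GiB changes what fits on a single GPU.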

For practitioners running inference on long documents, this matters. A context window that was previously too expensive to fill in production becomes viable. We covered a different approach to the same problem in February - CHESS, which matched full-cache quality at 1% of the memory - but LCA's advantage is that it also attacks the compute side, not just storage.


What Fine-Tuning Actually Does to a Transformer

Authors: Qinghua Zhao, Xueling Gong, Xinyu Chen, Zhongfeng Kang, Xinlu Li | arXiv: 2604.11838 | ACL 2026

Supervised fine-tuning (SFT) is how most labs turn a pretrained base model into something that follows instructions. The standard assumption is that fine-tuning reshapes the entire model. This paper, also accepted to ACL 2026, argues the reality is more localized - and that knowing where change actually happens lets you tune more efficiently.

The authors analyzed models from 1B to 32B parameters using a combination of information-theoretic metrics, geometric analysis, and optimization dynamics across each layer. The headline finding: middle layers (roughly 20% to 80% depth) remain largely stable during fine-tuning, while the final layers are highly sensitive to change.

What the layers are actually doing

Think of it this way. The early layers of a transformer extract low-level features - syntax, token relationships. The middle layers build richer representations: semantics, reasoning structure, factual knowledge. The final layers translate those representations into the token distribution the model outputs. SFT, it turns out, mostly rewires the output translation stage rather than the knowledge storage stage.

This has a practical implication. Standard LoRA applies low-rank adapters uniformly across all layers. The paper proposes Mid-Block Efficient Tuning, which focuses parameter updates on the middle layer range. On OLMo2-7B, this beats standard LoRA by 10.2% on GSM8K math reasoning. The improvement holds across model scales from 1B to 32B.
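The paper's exact adapter placement isn't specified here, but restricting updates to the 20%-80% depth band is easy to express. The helper below is an illustrative sketch (the function name and rounding choice are mine, not the authors'); its output could feed any LoRA implementation that supports per-layer targeting.

```python
def mid_block_layers(n_layers: int, lo: float = 0.2, hi: float = 0.8) -> list[int]:
    """Indices of the middle block: layers in the [lo, hi) band of depth."""
    start = round(n_layers * lo)
    end = round(n_layers * hi)
    return list(range(start, end))

# For a hypothetical 32-layer model, adapters land on layers 6..25
# (20 of 32 layers), skipping the early feature extractors and the
# sensitive final readout layers.
print(mid_block_layers(32))
```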

The authors also find that alignment through SFT is "architecturally localized": the model doesn't reorganize its internal structure wholesale; it recalibrates which features it emphasizes in the final readout. That's a useful frame for thinking about why instruction-tuned models can catastrophically forget: if SFT mostly rewires the final layers, those layers become specialized for instruction-following at the expense of whatever they were doing before.

"Middle layers (20%-80%) are stable, whereas final layers exhibit high sensitivity." The authors' analysis covers models from 1B to 32B parameters, making this one of the more comprehensive layer-wise studies of SFT to date.


Reasoning Harder Breaks Your Simulation

Authors: Sandro Andric | arXiv: 2604.11840

This one runs counter to most intuitions about what "better" AI means. The paper asks a specific question: if you use LLMs as agents in social and economic simulations, does a more capable model produce more realistic behavior? The answer, under certain conditions, is no - and the explanation is worth understanding.

The concept the paper introduces is the solver-sampler mismatch. A solver tries to find the optimal action. A sampler tries to produce plausible behavior drawn from a realistic distribution. Human subjects in negotiation experiments aren't solvers - they compromise, anchor irrationally, leave value on the table. A model that reasons too hard becomes a solver when you need a sampler.


The experiment

The author tested three conditions: no reflection, bounded reflection (one revision step), and native reasoning (full chain-of-thought in models with built-in reasoning). Models were tested across three negotiation environments.
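The paper's prompts aren't reproduced here, but the conditions differ mainly in how many times the model may revisit its answer. A minimal sketch of the first two, assuming a generic `llm(prompt) -> str` callable (a hypothetical stand-in for any model client):

```python
def no_reflection(llm, prompt: str) -> str:
    # Condition 1: a single forward pass, no revision.
    return llm(prompt)

def bounded_reflection(llm, prompt: str) -> str:
    # Condition 2: exactly one revision step - draft,
    # then a single critique-and-revise pass.
    draft = llm(prompt)
    revise_prompt = (
        f"{prompt}\n\nYour draft answer:\n{draft}\n\n"
        "Reconsider once, then give your final answer."
    )
    return llm(revise_prompt)
```

Native reasoning, by contrast, lets the model iterate internally without an external cap, which is where the solver behavior emerges.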

The results for GPT-5.2 with native reasoning are stark. In 45 of 45 runs across the three negotiation environments, the model chose the dominant strategy, holding its position rather than seeking compromise. Under bounded reflection, the model recovered compromise outcomes in every environment. GPT-4.1 showed similar patterns.

This isn't a failure of reasoning per se. The model is doing exactly what its reasoning process is trained to do: find and execute the best move. The problem is that "best move" in a strategic simulation isn't the same as "most human move." Real negotiations involve bounded rationality, emotional anchoring, and suboptimal decisions that nonetheless lead to stable agreements. A model that always finds the dominant strategy produces simulations that are strategically coherent but behaviorally unrealistic.

The practical implication for researchers using LLMs as synthetic participants in social science simulations: model capability and simulation fidelity are separate axes. Turning up reasoning quality can push them apart.


The Common Thread

These papers share something beyond their technical domains. All three are about knowing where a system's behavior is actually determined - not assuming it's uniform.

LCA shows that redundancy in the context isn't distributed evenly; it can be compressed selectively in latent space. The SFT analysis shows that fine-tuning doesn't reshape a model uniformly; it concentrates in the final layers. The solver-sampler paper shows that reasoning capability doesn't distribute uniformly across use cases; it helps you solve problems and may hurt you when you need to replicate imperfect human behavior.

The field has spent years scaling uniformly - more parameters, more tokens, more compute applied everywhere. These papers are part of a quieter countertrend: finding the non-uniform structure and using it.


About the author: Elena, Senior AI Editor & Investigative Journalist, is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.