Programmatic GUI Agents, Faithful Chain-of-Thought, and the 1% KV Cache

Today's arXiv picks: a state-machine framework that makes GUI agents 12x cheaper, a training method that forces chain-of-thought to be honest, and a KV cache system that matches full quality at 1% the memory.

Three papers caught my attention today, each attacking a different bottleneck in the current AI stack. One reimagines how GUI agents interact with web interfaces. Another forces language models to actually mean what they say in their reasoning chains. And the third squeezes long-context inference down to a fraction of its usual memory footprint.

TL;DR

  • ActionEngine - A two-agent framework replaces reactive screenshot-by-screenshot GUI automation with programmatic planning via state machine memory, hitting 95% task success on WebArena at 12x lower cost
  • Counterfactual Simulation Training - A new training method improves chain-of-thought faithfulness by 35 accuracy points, making it possible to detect when models rely on spurious features or are being sycophantic
  • CHESS - An algorithm-system co-design for KV cache management that matches full-cache quality using only 1% of the key-value pairs, delivering 4.56x higher throughput on long-context tasks

ActionEngine - GUI Agents That Plan Before They Click

Paper: ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory
Authors: Hongbin Zhong, Fazle Faisal, Luis França, Tanakorn Leesatapornwongsa, Adriana Szekeres, Kexin Rong, Suman Nath

If you have tried building a web automation agent, you know the pattern: take a screenshot, send it to a vision-language model, get the next action, execute it, take another screenshot, repeat. It works, but every step costs an LLM call, latency compounds, and the agent has no memory of pages it already visited. ActionEngine throws out this reactive loop entirely.

The framework introduces a two-agent architecture. First, a Crawling Agent explores the target application offline and builds a state machine - a graph where nodes represent UI pages and edges represent actions that transition between them. This is persistent, structured memory that captures the full navigation topology of the interface. Second, an Execution Agent takes a user's task, consults this state machine, and synthesizes a complete Python program that executes the entire task in one shot.
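To make the idea concrete, here is a minimal sketch of what state machine memory plus route planning could look like. This is an illustrative toy, not ActionEngine's actual data structures or API: the class name, page labels, and BFS planner are all my assumptions.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class StateMachine:
    """Persistent navigation memory: pages as nodes, actions as edges."""
    # edges[page][action_description] -> destination page
    edges: dict = field(default_factory=dict)

    def add_transition(self, src: str, action: str, dst: str) -> None:
        self.edges.setdefault(src, {})[action] = dst

    def plan(self, start: str, goal: str) -> list:
        """BFS over the graph: the full action sequence from start to goal,
        computed offline instead of one screenshot-and-LLM call per step."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            page, path = queue.popleft()
            if page == goal:
                return path
            for action, nxt in self.edges.get(page, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [action]))
        return []

# toy topology a crawling agent might have recorded
sm = StateMachine()
sm.add_transition("home", "click 'Forums'", "forum_list")
sm.add_transition("forum_list", "click 'r/books'", "subreddit")
sm.add_transition("subreddit", "click 'Submit'", "post_form")
print(sm.plan("home", "post_form"))
# -> ["click 'Forums'", "click 'r/books'", "click 'Submit'"]
```

The point of the sketch: once the topology is in memory, planning a whole task is a graph query, and the LLM is only needed once to synthesize the program that walks the route.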

The results on WebArena - widely considered one of the toughest benchmarks for web agents - are striking. On the Reddit subset (106 tasks), ActionEngine achieves 95% task success with an average of just one LLM call per task. Compare that to the strongest vision-only baseline (AgentOccam) at 66% success. Cost drops by 11.8x, end-to-end latency by 2x.

What makes this practical is the fallback mechanism. When the pre-built state machine encounters an interface change - a new button, a layout shift - execution failures trigger a vision-based re-grounding step that repairs the failed action and updates the memory for next time. The state machine is not static; it evolves with the application.
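The fallback control flow can be sketched in a few lines. Again, this is hypothetical scaffolding (the exception type, function names, and dict-based memory are mine), meant only to show the cheap-path/expensive-path split the paper describes:

```python
class ActionFailed(Exception):
    """Raised when a cached action no longer matches the live page."""

def execute_with_fallback(steps, memory, reground):
    """Replay cached actions; on failure, fall back to one vision-based
    re-grounding call and patch the memory for next time.
    (Illustrative control flow only, not ActionEngine's actual API.)"""
    for step in steps:
        try:
            step()                          # cheap: replay cached action
        except ActionFailed:
            repaired = reground(step)       # expensive: one vision LLM call
            repaired()
            memory[step.__name__] = repaired  # the state machine evolves

# toy demo: a step whose selector broke, repaired by "re-grounding"
def click_submit():
    raise ActionFailed("button id changed")

log = []
memory = {}
execute_with_fallback(
    [click_submit],
    memory,
    reground=lambda step: (lambda: log.append("clicked via new selector")),
)
print(log, list(memory))
```

The design choice worth noting: the expensive vision call sits on the failure path only, so the common case stays at one LLM call per task.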

For anyone building AI agents, the takeaway is clear: persistent structured memory beats reactive perception. The move from "look and decide every step" to "plan the whole route, then execute" is the same shift that made classical robotics reliable. It is encouraging to see it work this well for GUI automation.

Counterfactual Simulation Training - Making Chain-of-Thought Honest

Paper: Counterfactual Simulation Training for Chain-of-Thought Faithfulness
Authors: Peter Hase, Christopher Potts

Here is an uncomfortable truth about chain-of-thought (CoT) reasoning: just because a model writes a plausible-looking reasoning chain does not mean the chain actually influenced the answer. Models can produce eloquent step-by-step explanations while arriving at their conclusions through entirely different internal computations. This makes CoT unreliable as a monitoring tool - which is a problem if you are counting on it to catch unsafe or misaligned behavior.

Peter Hase and Christopher Potts introduce Counterfactual Simulation Training (CST), a method that rewards models for producing chains of thought that are genuinely faithful - meaning an external simulator, reading only the CoT, can predict what the model would output on modified (counterfactual) versions of the input.

The training loop works like this: given an input, the model produces a CoT and an answer. A separate simulator model reads only the CoT (not the original input) and tries to predict how the model would answer counterfactual variants of the input. If the CoT is truly faithful - if it actually captures the reasoning the model used - the simulator should be able to predict these counterfactual outputs accurately. The model is then rewarded for CoTs that make the simulator more accurate.
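The reward signal at the heart of this loop can be sketched schematically. This is a minimal, hedged illustration of the idea as described above, not the paper's implementation; `model` and `simulator` stand in for what would be LLM calls:

```python
def cst_reward(model, simulator, x, counterfactuals):
    """Schematic CST reward: the simulator sees ONLY the CoT (never the
    original input) and must predict the model's answers on perturbed
    inputs. A faithful CoT makes those predictions accurate."""
    cot, _answer = model(x)
    correct = 0
    for x_cf in counterfactuals:
        _, y_cf = model(x_cf)         # what the model actually does
        y_hat = simulator(cot, x_cf)  # what the CoT says it would do
        correct += int(y_hat == y_cf)
    return correct / len(counterfactuals)

# toy check: a CoT that states the real rule is perfectly simulatable
model = lambda x: ("add 1 to the input", x + 1)
simulator = lambda cot, x_cf: x_cf + 1 if "add 1" in cot else None
print(cst_reward(model, simulator, 3, [10, 20]))  # -> 1.0
```

If the model's CoT hid the real rule (say, it secretly copied a hint from the prompt), the simulator's counterfactual predictions would miss and the reward would drop, which is exactly the pressure CST applies.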

CST is applied in two settings. In the first, cue-based counterfactuals test whether models are relying on spurious features. Think of a math problem where someone has appended "I think the answer is 42" - a sycophantic model might agree regardless of the actual math. CST-trained monitors can detect this, improving accuracy by 35 points over untrained baselines.

In the second setting, generic counterfactuals encourage models to produce more faithful and generalizable reasoning across the board. The gains here are more modest (2 points on simulatability), but the approach scales well. Experiments run on models up to 235 billion parameters, and larger models benefit more from CST - even though they are not inherently more faithful out of the box.

One practical finding: rewriting unfaithful CoTs with an LLM is 5x more efficient than pure reinforcement learning. This hybrid approach - use an LLM to clean up the reasoning, then fine-tune on the cleaned versions - could be the path of least resistance for teams looking to improve CoT reliability in production. If you track how models perform on reasoning benchmarks, this work suggests that raw accuracy scores tell only half the story. The other half is whether the reasoning actually means anything.

CHESS - Full Quality at 1% of the KV Cache

Paper: CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference
Authors: Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis

Long-context inference has a memory problem. As context windows grow to 128K tokens and beyond, the key-value (KV) cache - the data structure that stores attention state for every token in the context - becomes the dominant bottleneck. It eats GPU memory, slows decoding, and limits how many requests you can serve concurrently. Prior pruning methods try to shrink the cache but do it blindly, dropping tokens without regard for what the current generation step actually needs.
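To see why the cache dominates, a back-of-the-envelope calculation helps. The formula below is the standard KV-cache accounting (two tensors, K and V, per layer); the specific model configuration plugged in is illustrative, not taken from the paper:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """KV cache memory: 2 tensors (K and V) per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# e.g. a 70B-class config (80 layers, 8 KV heads, head_dim 128) at 128K context
gb = kv_cache_bytes(128_000, n_layers=80, n_kv_heads=8, head_dim=128) / 1e9
print(f"{gb:.1f} GB per request")  # ~41.9 GB before any pruning
```

At roughly 42 GB per request, a single 80 GB GPU can hold the cache for barely one concurrent long-context request, which is why a 100x reduction changes the serving economics so dramatically.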

CHESS takes an algorithm-system co-design approach. On the algorithmic side, it introduces a context-aware hierarchical selection policy that dynamically reconstructs a coherent context for each decoding step. Instead of pruning tokens once and hoping for the best, CHESS selects which key-value pairs to keep based on their relevance to what the model is currently generating. The selection is hierarchical - it starts with coarse-grained semantic chunks and refines down to individual tokens, which avoids the expensive fine-grained scanning that makes other pruning methods slow in practice.
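The coarse-to-fine pattern can be sketched with NumPy. This is a generic illustration of hierarchical selection under my own assumptions (mean-pooled chunk summaries, dot-product scoring), not CHESS's actual policy:

```python
import numpy as np

def hierarchical_select(query, keys, chunk_size=64, top_chunks=2, budget=8):
    """Coarse-to-fine KV selection sketch: score chunks by a cheap summary,
    then compute exact scores only inside the winning chunks."""
    n = len(keys)
    chunks = [keys[i:i + chunk_size] for i in range(0, n, chunk_size)]
    # coarse pass: one mean-pooled key per chunk, scored against the query
    summaries = np.stack([c.mean(axis=0) for c in chunks])
    chunk_scores = summaries @ query
    keep = np.argsort(chunk_scores)[-top_chunks:]
    # fine pass: exact per-token scores, but only in the selected chunks
    cand = []
    for ci in keep:
        base = ci * chunk_size
        for j, k in enumerate(chunks[ci]):
            cand.append((float(k @ query), base + j))
    cand.sort(reverse=True)
    return sorted(idx for _, idx in cand[:budget])

rng = np.random.default_rng(0)
keys = rng.normal(size=(256, 16))   # 256 cached keys, dim 16
query = rng.normal(size=16)
print(hierarchical_select(query, keys))
```

Note the cost structure: the coarse pass touches one vector per chunk, so the fine-grained scan runs over only `top_chunks * chunk_size` tokens instead of all 256 - the same trick, scaled up, is what keeps selection overhead from eating the savings.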

On the systems side, CHESS exploits this coarse-to-fine structure to eliminate expensive, irregular memory accesses. Coarse-grained selection means you can move data in large, contiguous blocks rather than scattering reads across GPU memory. This is what turns theoretical sparsity into actual wall-clock speedups.

The numbers are remarkable. CHESS surpasses full-KV quality using only 1% of the KV cache. That is not a typo - one percent. It delivers up to 4.56x higher throughput compared to standard inference, and it consistently outperforms other strong baselines. For teams running long-context workloads in production, this could be the difference between needing eight GPUs and needing two.

The key insight is that most tokens in a long context are irrelevant to any given generation step. CHESS just makes the system smart enough to figure out which ones matter, fast enough that the selection overhead does not eat the savings.

Common Threads

All three papers share a conviction that brute force is not the answer. ActionEngine replaces brute-force perception (screenshot every step) with structured planning. CST replaces brute-force RL with targeted rewriting. CHESS replaces brute-force KV caching with dynamic semantic selection. In each case, adding a layer of intelligence to the system design yields order-of-magnitude improvements in efficiency, reliability, or both.

The broader pattern: as AI systems move from research demos to production infrastructure, the wins increasingly come not from bigger models but from smarter architectures around them. Today's papers are a good reminder that engineering and science are not separate tracks - they are the same one.

About the author

Elena, Senior AI Editor & Investigative Journalist, is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.