Alignment Gaps, Agent Governance, and Greener LLMs

Today's arXiv batch arrived with three papers worth pulling out of the queue. They cover very different territory - alignment training theory, enterprise agent deployment, and data-centre energy efficiency - but each one addresses something practitioners are actively struggling with right now.

TL;DR

Conditional Equivalence of DPO and RLHF - DPO only matches RLHF when a hidden assumption holds; when it breaks, training can degrade alignment even as the loss falls
Governance by Construction - A policy-as-code layer enforces enterprise compliance across five checkpoints in an agent pipeline without retraining the model
PALS - Treating GPU power caps as a scheduling parameter improves LLM serving energy efficiency by up to 26.3%

When DPO Breaks - and Why Nobody Noticed

Direct Preference Optimization became the default alignment recipe for fine-tuners over the last two years. It drops the explicit reward model required by RLHF and optimizes the policy directly on preference pairs through a cross-entropy loss. It is simpler, cheaper, and faster to implement. The assumption behind most of that enthusiasm has been that DPO and RLHF are roughly equivalent - same destination, smoother road.

A new 49-page paper from Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, and Yike Guo challenges that directly. Their argument, in short: DPO and RLHF are only equivalent under a specific condition, and that condition is frequently violated in practice.

The Hidden Assumption

The equivalence holds only when the RLHF-ideal policy already prefers human-approved responses. When that isn't true - when the reference model is already somewhat misaligned, or the preference data contains noise - DPO loses its guarantee. Instead of improving absolute alignment with human preferences, it falls back to optimizing relative advantage over the reference policy.

The practical consequence is striking. The paper shows it's possible for DPO loss to decrease while the model simultaneously starts preferring the dispreferred response in a pair. Loss goes down; alignment goes down with it. The model looks like it's training successfully until you actually assess its outputs.
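To make the relative-versus-absolute distinction concrete, here is a minimal sketch using the standard DPO loss on a single preference pair. The log-probabilities, the beta value, and the function name `dpo_loss` are illustrative assumptions, not numbers or code from the paper. It shows the simplest case: the reference model prefers the dispreferred response, the loss falls, and the policy still ranks the dispreferred response higher.

```python
import math

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    pol_* are policy log-probs, ref_* are reference log-probs;
    _w is the preferred (chosen) response, _l the dispreferred one.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Assumed numbers: the reference model prefers the *dispreferred* response.
ref_w, ref_l = -5.0, -2.0

# At step 0 the policy equals the reference, so the margin is zero.
loss_start = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# After some training the policy has gained ground relative to the reference...
pol_w, pol_l = -4.0, -3.0
loss_later = dpo_loss(pol_w, pol_l, ref_w, ref_l)

print(f"loss at start: {loss_start:.4f}")
print(f"loss later:    {loss_later:.4f}")
# ...but in absolute terms it still assigns more probability to the dispreferred response.
print("policy prefers chosen response:", pol_w > pol_l)
```

With these numbers the loss drops from about 0.69 to about 0.60 while the final check prints False. The loss only sees the change relative to the reference, which is why loss curves alone cannot confirm that the model now prefers the approved response.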

The team analyzes this through a soft margin ranking framework and provides a geometric interpretation showing exactly when and why the objective diverges. They also propose a fix: Constrained Preference Optimization (CPO), which adds an explicit constraint ensuring the alignment objective stays on track even when the hidden assumption fails. They report state-of-the-art results on standard benchmarks.

Why This Matters for Practitioners

This isn't a theoretical curiosity. Anyone who has run DPO and noticed that reward model scores sometimes plateau or worsen mid-run has likely hit this failure mode without a name for it. The practical fix is to verify that your reference model truly prefers the winning side of your preference pairs before starting DPO training - or switch to CPO where the constraint enforces it mechanically.
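One way to run that pre-check is sketched below, assuming you have already scored each preference pair with the reference model and hold the summed log-probabilities as plain floats. The function name `reference_agreement` and the sample numbers are hypothetical, and the paper does not prescribe a threshold, so treat the result as a diagnostic, not a pass/fail gate.

```python
def reference_agreement(pairs):
    """Fraction of pairs where the reference scores the chosen response higher.

    `pairs` is a list of (ref_logp_chosen, ref_logp_rejected) tuples.
    """
    if not pairs:
        raise ValueError("need at least one preference pair")
    agree = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return agree / len(pairs)

# Hypothetical reference log-probs: (chosen, rejected) for four pairs.
pairs = [(-3.1, -4.0), (-5.0, -2.0), (-2.2, -2.9), (-6.3, -6.1)]

rate = reference_agreement(pairs)
print(f"reference agrees with labels on {rate:.0%} of pairs")
```

Here the reference agrees with the human label on two of four pairs. Raw sequence log-probabilities favour shorter responses, so a length-normalised variant may be a fairer check when chosen and rejected responses differ a lot in length.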

For broader context on alignment training and where these approaches fit, our earlier coverage of related RLHF research and the AI safety and alignment guide are worth revisiting.

Figure: DPO vs RLHF comparison diagram. DPO reformulates the RLHF objective by treating the policy directly as an implicit reward model - but this paper shows that formulation breaks under common conditions. Source: huggingface.co

Governance by Construction for Enterprise Agents

The second paper comes from a large team at IBM Research - Segev Shlomov, Iftach Shoham, Alon Oved, and seven colleagues. It addresses a question every enterprise team deploying agents in production is asking: how do you make a general-purpose LLM agent behave predictably under compliance requirements without retraining it for every new policy?

Their answer is a modular policy-as-code layer that sits between the orchestration logic and the generalist model. No fine-tuning, no new model weights - just structured enforcement at the right points in the execution pipeline.

Five Checkpoints, One Pipeline

The system defines five structural enforcement points:

Checkpoint	Function
Intent Guard	Blocks queries that fall outside permitted scope before anything runs
Playbook	Injects context-appropriate task structure dynamically
Tool Guide	Constrains which tools are available based on current context
Tool Approvals	Requires human-in-the-loop sign-off for sensitive tool calls
Output Formatter	Enforces structured, audit-ready output formats

The checkpoints compose - you can deploy all five or only the ones relevant to your regulatory context. The paper demonstrates the approach in a healthcare scenario where agents need to route requests, retrieve patient data, and recommend actions within strict policy boundaries.
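The composition idea can be sketched as a chain of small functions that each inspect a request before the model acts on it. This is not IBM's implementation: every name here (`Request`, `intent_guard`, `tool_approvals`, `run_pipeline`) is hypothetical, only two of the five checkpoints are shown, and the keyword match stands in for whatever intent classification a real system would use.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Request:
    query: str
    tool: Optional[str] = None
    blocked_reason: Optional[str] = None
    audit: list = field(default_factory=list)  # (checkpoint name, passed?) entries

Checkpoint = Callable[[Request], Request]

def intent_guard(allowed_topics) -> Checkpoint:
    """Block queries that mention none of the permitted topics."""
    def check(req: Request) -> Request:
        if not any(topic in req.query.lower() for topic in allowed_topics):
            req.blocked_reason = "intent outside permitted scope"
        req.audit.append(("intent_guard", req.blocked_reason is None))
        return req
    return check

def tool_approvals(sensitive_tools, approver) -> Checkpoint:
    """Require an external approver's sign-off for sensitive tool calls."""
    def check(req: Request) -> Request:
        if req.tool in sensitive_tools and not approver(req):
            req.blocked_reason = f"tool '{req.tool}' requires human approval"
        req.audit.append(("tool_approvals", req.blocked_reason is None))
        return req
    return check

def run_pipeline(req: Request, checkpoints) -> Request:
    """Run checkpoints in order and stop at the first one that blocks."""
    for checkpoint in checkpoints:
        req = checkpoint(req)
        if req.blocked_reason:
            break
    return req

# Deploy only the checkpoints you need; here the approver always declines.
checkpoints = [
    intent_guard(["patient", "appointment"]),
    tool_approvals({"read_patient_record"}, approver=lambda req: False),
]

result = run_pipeline(
    Request(query="Retrieve patient record", tool="read_patient_record"),
    checkpoints,
)
print(result.blocked_reason)
print(result.audit)
```

The request passes the intent check and is then halted at the approval step, and the audit list records both decisions. Adding or dropping a checkpoint is a change to the list, not to the model.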

What This Gives You in Practice

The appeal is that governance gets embedded structurally rather than hoped for. Intent Guard can block an agent from accessing data outside its clearance scope before the LLM even sees the query. Tool Approvals can halt a workflow at a sensitive step and send it to a human reviewer without requiring the model to "decide" to do so. The audit trail is a natural output of the architecture, not an afterthought.

For teams building on top of general-purpose LLM agents, this kind of compositional governance layer is much easier to update when policies change than a fine-tuned model is. You change the policy file, not the weights.
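A toy illustration of that point, assuming the policy is stored as JSON that maps a context name to the tools permitted in it. The schema, context names, and tool names are invented for this sketch and are not taken from the paper.

```python
import json

# Two versions of a hypothetical policy file: context -> permitted tools.
POLICY_V1 = json.loads("""
{"triage":  ["search_guidelines"],
 "records": ["search_guidelines", "read_patient_record"]}
""")
POLICY_V2 = json.loads("""
{"triage":  ["search_guidelines"],
 "records": ["search_guidelines"]}
""")

def allowed_tools(policy, context):
    """Return the tools permitted in a context; unknown contexts get none."""
    return set(policy.get(context, []))

print(allowed_tools(POLICY_V1, "records"))
print(allowed_tools(POLICY_V2, "records"))
print(allowed_tools(POLICY_V2, "billing"))
```

Moving from V1 to V2 revokes record access by editing data alone, and a context the policy does not mention is denied by default. The model and its weights are untouched in both cases.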

Enterprise AI governance is no longer an architectural afterthought. The EU AI Act's full Annex III enforcement begins in August 2026, and for agent deployments touching regulated workflows, the checkpoints in this paper map almost directly to what auditors will look for.

PALS - Treating Power as a First-Class Parameter

The third paper takes on a problem that data-centre operators have been wrestling with for years: LLM serving is energy-hungry, and that cost is growing fast. The LLM energy footprint is projected to reach over 1,000 TWh by 2026. The question is whether anything can be done at the serving layer without touching the model.

Can Hankendi, Rana Shahout, Minlan Yu, and Ayse K. Coskun propose PALS (Power-Aware LLM Serving), built inside vLLM. The core insight is that GPU power caps have traditionally been treated as static limits imposed from outside the scheduler. PALS treats them as a control knob the scheduler actively adjusts to optimise for both throughput and efficiency.

How It Works

PALS combines two components. An offline power-performance model characterises how throughput changes as a function of power cap and batch size for a given hardware configuration. A feedback controller then adjusts power limits at runtime to meet service-level objectives (SLOs) while minimising energy draw.
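The two-part structure can be sketched as follows. This is a deliberately simple stand-in, not the PALS algorithm: the profile numbers are invented, the controller is a fixed-step rule, and the function names, watt limits, and headroom factor are all assumptions. It only simulates the decision and does not touch real GPU power limits.

```python
# Offline profile (assumed numbers): (power cap in W, batch size) -> tokens/s.
PROFILE = {
    (200, 32): 1800.0,
    (250, 32): 2300.0,
    (300, 32): 2600.0,
    (350, 32): 2700.0,
}

def initial_cap(profile, batch_size, target_tps):
    """Pick the lowest profiled cap that meets the throughput target."""
    caps = sorted(cap for (cap, batch) in profile if batch == batch_size)
    for cap in caps:
        if profile[(cap, batch_size)] >= target_tps:
            return cap
    return caps[-1]  # target unreachable: fall back to the highest profiled cap

def next_power_cap(cap_w, latency_ms, slo_ms,
                   min_w=150.0, max_w=400.0, step_w=10.0, headroom=0.9):
    """One feedback step: raise the cap on an SLO miss, lower it when there is slack."""
    if latency_ms > slo_ms:
        cap_w += step_w
    elif latency_ms < headroom * slo_ms:
        cap_w -= step_w
    return max(min_w, min(max_w, cap_w))

# Start from the offline model, then react to observed latencies (made up).
cap = initial_cap(PROFILE, batch_size=32, target_tps=2200.0)
for latency_ms in [95.0, 120.0, 130.0, 80.0, 60.0]:
    cap = next_power_cap(cap, latency_ms, slo_ms=100.0)
    print(f"latency {latency_ms:5.1f} ms -> cap {cap:.0f} W")
```

The offline profile gives a sensible starting point so the controller does not have to search from scratch, and the feedback step corrects for whatever the profile failed to predict.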

The authors report up to a 26.3% improvement in energy efficiency and a 4-7x reduction in QoS violations under power-constrained conditions. The system has been tested across multi-GPU setups and across both dense and mixture-of-experts (MoE) models, which have notably different power profiles due to sparse activation.
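A quick arithmetic note on reading that headline number. If "energy efficiency" means work done per joule (an assumption; the metric is not defined in this summary), then a 26.3% efficiency gain on a fixed workload corresponds to a smaller percentage cut in energy used. The helper name below is ours.

```python
def energy_saving_from_efficiency_gain(gain):
    """Fractional energy saved on a fixed workload when work-per-joule rises by `gain`."""
    if gain <= -1.0:
        raise ValueError("gain must be greater than -100%")
    return 1.0 - 1.0 / (1.0 + gain)

saving = energy_saving_from_efficiency_gain(0.263)
print(f"energy saved for the same work: {saving:.1%}")
```

Under that reading, the same work costs roughly 20.8% less energy, since the energy needed scales as 1 / 1.263 of the baseline.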

Figure: Data center servers with power and cooling infrastructure. GPU clusters running LLM inference are among the highest-density power loads in modern data centres. PALS dynamically adjusts per-GPU power caps at the scheduler level. Source: unsplash.com

The MoE Angle

MoE models (the architecture behind many of the recent frontier models) are an interesting case here because their power draw is highly variable depending on which experts are activated. A static power cap that works for a dense model will either throttle throughput on busy MoE batches or leave energy on the table during sparse ones. PALS's feedback loop handles this naturally by adjusting to observed load in real time rather than relying on static profiles.

For teams running vLLM in production, this is a fairly low-cost integration path to meaningful energy savings. No model changes, no API modifications - just a smarter serving scheduler.

The Common Thread

Read together, these three papers share a theme: the difference between a system that appears to work and one that provably works under the conditions you'll actually face in production.

DPO appears to align models but can degrade them when its implicit assumption fails. Agent pipelines appear to follow policies, but without structural enforcement they rely on model judgment that can drift. LLM serving appears to use power efficiently, but static caps leave significant savings on the table.

Each paper names the gap exactly and proposes a fix that's structural rather than statistical. That's the kind of work worth tracking.

Sources: