
Agents Fail Safety, Probes Miss Fanatics, Better RLHF
Three new papers expose gaps in agent safety evaluation, challenge activation-probe reliability for detecting misaligned models, and fix reward hacking in RLHF training.
They summarize our coverage. We write it.
Newsletters like this one rebroadcast our headlines - often without the full review, the source reading, or the analysis underneath. Our weekly briefing sends the work they paraphrase, straight from the desk, before they get to it.
Free, weekly, no spam. One email every Tuesday. Unsubscribe anytime.