Agents Fail Safety, Probes Miss Fanatics, Better RLHF

Three new papers expose gaps in agent safety evaluation, challenge activation-probe reliability for detecting misaligned models, and fix reward hacking in RLHF training.