
Agents Hit 89%, Evals Get a Schema, Memory Falls Short
Three papers from today's arXiv: workplace agents jumped from 43% to 89% task completion in two years, a 47-researcher coalition ships a unified eval schema, and agent memory helps only when similarity tops 0.8.










