
Berkeley: Every Major AI Agent Benchmark Can Be Hacked
UC Berkeley researchers achieved near-perfect scores on eight major AI agent benchmarks without solving a single task, exposing systemic flaws in how the industry measures progress.
They summarize our coverage. We write it.
Newsletters like this one rebroadcast our headlines - often without the full review, the source reading, or the analysis underneath. Our weekly briefing sends the work they paraphrase, straight from the desk, before they get to it.
Free, weekly, no spam. One email every Tuesday. Unsubscribe anytime.

AI Infrastructure & Open Source Reporter
Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem. She spent three years as a site reliability engineer at a cloud provider in Seattle before transitioning to tech journalism, which gives her writing an unusual level of technical depth - she understands distributed systems, GPU clusters, and inference optimization from the inside.
She studied Computer Engineering at the University of British Columbia and later completed a science communication fellowship at MIT. Her engineering background means she can read a model card, spot a misleading benchmark, and explain why quantization matters - all in the same paragraph.
At Awesome Agents, Sophie covers AI infrastructure news: new model releases, open-source launches, developer tools, deployment trends, and the hardware that makes it all run. She has a soft spot for underdog open-source projects that punch above their weight and a sharp eye for when a "breakthrough" is really just better marketing.
Based in Seattle, WA.

UC Berkeley researchers achieved near-perfect scores on eight major AI agent benchmarks without solving a single task, exposing systemic flaws in how the industry measures progress.

Cloudflare Sandboxes are now generally available, giving AI agents persistent isolated environments with shell access, filesystem, PTY terminals, and background processes that start on demand and sleep when idle.

Leaked Claude screenshots reveal a full-stack app builder with templates, live preview, one-click publishing, and built-in databases - putting Anthropic on a direct collision course with Lovable's $6.6 billion vibe-coding empire.

A developer used an HTTP proxy to capture full API requests across four Claude Code versions and found that v2.1.100 adds roughly 20,000 invisible server-side tokens to every request - inflating billing by 40% with no user visibility.

Three separate PRs merged into llama.cpp between April 11-13 add MERaLiON-2, Gemma 4's Conformer encoder, and Qwen3-Omni/ASR - making local voice AI inference practical on consumer hardware for the first time.

A 19-person Meta AI and KAUST team including Jürgen Schmidhuber proposes Neural Computers - systems where the neural network itself is the running computer, trained solely on screen recordings.

Arcee AI ships Trinity-Large-Thinking, a 398B sparse MoE reasoning model under Apache 2.0 that hits 91.9% on PinchBench for $0.85 per million output tokens on OpenRouter.

Alibaba's Qwen3.5-Omni handles audio, video, images, and text in a single model pass - and generates speech in real time. The Plus variant hits SOTA on 215 benchmarks and edges out Gemini 3.1 Pro on audio tasks.

Shopify's free, MIT-licensed AI Toolkit connects Claude Code, Cursor, and Gemini CLI directly to live store data, giving agents real-time access to API schemas, documentation, and store execution.

Microsoft released the Agent Governance Toolkit, a seven-package open-source system that enforces policies on autonomous AI agents at sub-millisecond latency and covers all 10 OWASP agentic risks.

Intel's Arc Pro B70 launched on March 25 with 32GB GDDR6 and 367 TOPS for $949, undercutting NVIDIA's RTX Pro 4000 by $850. The hardware case is strong. The software story is not.

Anthropic released Claude Managed Agents in public beta today, a fully managed platform that handles sandboxing, state, and tool execution so developers can skip building agent infrastructure from scratch.