Articles Tagged "AI Agents"

Tandem Training, World Models, and Efficient Agents

Three new arXiv papers on making RL reasoning legible across models, fixing broken world model latent states, and training small agents to beat their teachers.

General Intuition Raises $320M to Train AI on Gameplay

General Intuition raised $320 million at a $2.3 billion valuation on the bet that billions of hours of action-labeled video game footage can train AI agents that operate in the real world.

Quantization's Hidden Tax, Cliff Tokens, Smarter Memory

Three new arXiv papers reveal hidden costs in quantized reasoning models, single-token failure triggers, and a new framework that cuts agent memory errors by up to 79%.

Anthropic Ships Claude Tag - Slack Gets a Teammate

Anthropic's Claude Tag replaces the old Slack app with a shared @Claude identity that builds channel memory, works asynchronously, and runs on Claude Opus 4.8.

AI Diagnosis, Cache Efficiency, and Agent Security

Three papers from today's arXiv: a 32B medical model beats DeepSeek-R1 in rare disease diagnosis, a KV cache method keeps 97% accuracy with 3% of the memory, and a new benchmark red-teams agentic AI systems.

Sakana Fugu

Sakana AI's orchestrator model that dynamically coordinates Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro to beat each of them individually on SWE-Bench Pro, GPQA-Diamond, and eight other benchmarks.

AI Research: Orchestration Beats Scale, Small Models Win

Sakana Fugu tops SWE-Bench Pro by routing tasks across rival LLMs, Microsoft's 9B browser agent beats OpenAI Operator, and a 3B model from Weibo matches DeepSeek V3.2 on math.

Fara-1.5

Microsoft Research's family of open-weight browser computer use agents (4B, 9B, 27B) that beat OpenAI Operator and Gemini 2.5 Computer Use on Online-Mind2Web.

Voxtral TTS

Mistral's first open-weight text-to-speech model: 4B parameters, 70ms latency, voice cloning from 3 seconds of audio, and a 68.4% win rate over ElevenLabs Flash v2.5 in blind tests.

Faster Agents, Skewed Evals, and Brand Bias in LLMs

Three new papers: agents that compile runs into 8-13x faster state machines, benchmark scores that shift with compute budget, and big brands monopolizing LLM recommendations.

AI Engrams, Cognitive Debt, and Agent Trust

Three new papers tackle what lives inside a trained model, how AI dependence erodes human cognition, and whether AI teams can calibrate trust.

Alibaba's Qwen-Robot Suite Targets Physical AI Work

Alibaba launches three open-weight models for robot manipulation, navigation, and world prediction, built on a shared Qwen3.5 backbone.
