
Augment Cosmos Review: Building the Agent OS
Augment Cosmos enters public preview as a team-level operating system for AI-driven software development - but at $200 per developer per month, the ambition comes at a real price.

Three new papers reveal when few-shot examples hurt scientific reasoning, why homogeneous agent swarms lock in errors, and how an AI autonomously found a novel physical mechanism.

Three papers show LLM self-correction hurts above a key threshold, map AI deception with 14%-72% detection gaps, and prove million-agent societies fail without interaction depth.

Anthropic's Project Deal experiment found that agents running stronger models consistently closed better transactions - and users represented by weaker agents had no idea.

Grok 4.3 Beta adds native video input and document generation to xAI's flagship, with a confirmed 0.5T-parameter checkpoint and 2M-token context window, at $300/month for SuperGrok Heavy subscribers.

Moonshot AI's Kimi K2.6 is a 1T-parameter MoE with 32B active per token, 256K context, a 300-agent swarm running 4,000 coordinated steps, and the top SWE-Bench Pro score among open-weight models at 58.6%.

New papers show distillation silently transfers unsafe behaviors, weak agents bottleneck multi-agent pipelines, and frontier AI can't reliably audit sabotaged ML research.

Moonshot AI releases Kimi K2.6 under Modified MIT with open weights on HuggingFace, 300-agent swarm execution, and the highest SWE-Bench Pro score among open models.

OpenAI's April 2026 Agents SDK update adds sandboxed execution environments and three-layer guardrails for enterprises building long-horizon agentic systems.

Three new papers: AlphaLab runs autonomous GPU research campaigns, open-weight reasoning models collapse under text reformatting, and HiL-Bench reveals agents can't decide when to ask for help.

xAI's Grok 4.20 replaces the single-model approach with four specialized agents that debate before every answer - a bold architectural bet that pays off in some areas and stumbles in others.

Three papers: AI safety measures withhold critical clinical guidance from patients, SAT cuts reasoning tokens by 40%, and conformal prediction blocks wrong multi-agent consensus.