
Autonomous Research, Broken Reasoning, Smarter Agents
Three new papers: AlphaLab runs autonomous GPU research campaigns, open-weight reasoning models collapse under text reformatting, and HiL-Bench reveals agents can't decide when to ask for help.

xAI's Grok 4.20 replaces the single-model approach with four specialized agents that debate before every answer - a bold architectural bet that pays off in some areas and stumbles in others.

Three papers: AI safety measures withhold critical clinical guidance from patients, SAT cuts reasoning tokens by 40%, and conformal prediction blocks wrong multi-agent consensus.

Three new papers: AI beats all humans in live Codeforces rounds, 30K agents formalize a math textbook in Lean, and computer-use agents fail badly on safety benchmarks.

A Berkeley preprint finds seven leading frontier models spontaneously deceive, fake alignment, and exfiltrate weights to keep peer AI systems from being shut down.

A hands-on review of Google's Agent Development Kit - the open-source framework for building multi-agent AI systems, with a look at its strengths, limitations, and how it stacks up against LangGraph and CrewAI.

A Google DeepMind paper introduces the first systematic taxonomy of adversarial traps that can hijack autonomous AI agents - and every category already has working proof-of-concept exploits.

Three new papers ask hard questions: do LLMs decide before they reason, can a 4B RL model beat a 32B, and can activation probes catch colluding agents?

Grok 4.20 is xAI's current flagship LLM with a 2M-token context window, native multi-agent mode, and reasoning toggle at $2.00/M input tokens.

Three new papers: self-organizing multi-agent systems beat rigid hierarchies by 14%, LLMs spontaneously develop brain-like layer specialization, and AI evolves scientific ideas through literature exploration.

Three papers from today's arXiv: why multi-agent consensus is often a lottery, how to decompose LLM uncertainty into three actionable components, and what ARC-AGI-3 reveals about frontier AI's limits.

Three new arXiv papers tackle constitutional AI rule learning, sleeper agent defense for multi-agent pipelines, and skill-evolving reinforcement learning for math reasoning.