
Cursor Ships Composer 2 - Its First In-House Coding Model
Cursor launches Composer 2, its first in-house coding model trained via RL on long-horizon tasks, scoring 73.7 on SWE-bench Multilingual at $0.50/M input tokens.


Anthropic's mid-tier model matches Opus 4.6 on computer use, leads all models on office productivity tasks, and costs one-fifth as much as the flagship at $3/$15 per million tokens.

MiniMax M2.7 is a 230B MoE coding agent that handles 30-50% of MiniMax's own RL research workflow. It scores 56.22% on SWE-Pro and 78% on SWE-bench Verified and costs $0.30/M input tokens.

A head-to-head comparison of OpenAI Codex and Anthropic Claude Code covering benchmarks, pricing, features, and real-world performance for agentic coding workflows.

METR found maintainers would reject roughly half of AI PRs that pass SWE-bench automated grading, with a 24-point gap that suggests benchmark scores substantially overstate production readiness.

Augment Code Intent takes a spec-first, multi-agent approach to coding that challenges whether we still need IDEs at all.

After a six-hour shopping outage and multiple AI-linked incidents, Amazon now requires junior and mid-level engineers to get senior sign-off before deploying AI-assisted code changes.

Anthropic's new Code Review dispatches parallel AI agents on every pull request to find bugs, rank them by severity, and filter false positives - at $15-25 per review.

MiniMax M2.5 matches Claude Opus 4.6 on SWE-bench at 1/20th the price - but a spike in hallucinations and a distillation controversy complicate the story.

Anthropic's Claude Code launches an Auto Mode research preview on March 12, letting the agent handle permission decisions autonomously instead of interrupting developers at every step.

Alibaba's SWE-CI benchmark tested 18 AI models on 100 real codebases across 233 days of maintenance. Most agents accumulate technical debt and break previously working code. Only Claude Opus keeps a zero-regression rate above 50%.

Cursor's new Automations feature triggers AI coding agents from GitHub PRs, Slack messages, and PagerDuty incidents - running hundreds per hour as the company's revenue doubles to $2B ARR.