
GLM-5.1 Tops SWE-Bench Pro With Zero NVIDIA Hardware
Z.ai's GLM-5.1 scores 58.4 on SWE-bench Pro, edging out GPT-5.4 and Claude Opus 4.6, after being trained on 100,000 Huawei Ascend chips with no US silicon.

Anthropic's restricted Claude Mythos Preview model autonomously discovered thousands of high-severity vulnerabilities across every major OS and browser, including bugs hiding in plain sight for 27 years.

Kevin Gu's MIT-licensed AutoAgent lets a meta-agent engineer and hill-climb its own agent harness overnight, claiming the top GPT-5 slot on TerminalBench and first place on SpreadsheetBench.

Claude Sonnet 4.6 and GPT-5.4 cost nearly the same per token but win on opposite benchmarks. Here is where each model leads and which to pick for your workload.

Claude Opus 4.6 and GPT-5.4 lead different code benchmarks in April 2026 - pick based on your workflow, not one score.

Microsoft's Harrier-OSS-v1 family delivers three MIT-licensed multilingual embedding models, with the 27B variant claiming top spot on Multilingual MTEB v2 at 74.3.

Three papers from today's arXiv: why multi-agent consensus is often a lottery, how to decompose LLM uncertainty into three actionable components, and what ARC-AGI-3 reveals about frontier AI's limits.

Three new arXiv papers show how to build more reliable planning agents, cut benchmark costs by 70%, and why LLMs fail at long-horizon financial decision-making.

Claude Opus 4.6 leads LiveSQLBench at 36.4% while ChatGPT's Code Interpreter dominates spreadsheet workflows - picking the right model depends on whether you need SQL, CSV analysis, or visualization.

GPT-5.4 leads OSWorld-Verified at 75.0% for desktop computer use while Claude Sonnet 4.6 matches human performance at 72.5% for half the price.

Claude Opus 4.6 leads the Mazur Writing Benchmark at 8.56 while Claude Sonnet 4.6 tops EQ-Bench Creative Writing with 1936 Elo, making Anthropic the clear winner for fiction.

Seedance 2.0 leads the Artificial Analysis Elo rankings at 1,269, but Kling 3.0 is the most practical choice for global API access with native 4K at 75 cents per minute.