
GLM-5.1 Review: Open-Source Model Tops SWE-Bench Pro
Z.ai's GLM-5.1 is a 754B open-weight model that claims the top spot on SWE-Bench Pro without a single NVIDIA chip - here's how it holds up in practice.

OpenAI's GPT-5.4-Cyber is a fine-tuned defensive cybersecurity model with binary reverse engineering, lowered refusal thresholds, and restricted access through the Trusted Access for Cyber program.

xAI's Grok 4.20 replaces the single-model approach with four specialized agents that debate before every answer - a bold architectural bet that pays off in some areas and stumbles in others.

Meta's first proprietary frontier model leads on HealthBench Hard and scientific reasoning but trails rivals in coding and agentic tasks - with no public API yet.

A deep look at Microsoft's three new in-house AI models - MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 - and whether they live up to the hype.

Google's Gemma 4 family - four models, full Apache 2.0 licensing, and benchmark scores that challenge models 10x their size - is the most consequential open-weight release of 2026 so far.

A hands-on review of Google's Agent Development Kit - the open-source framework for building multi-agent AI systems, with a look at its strengths, limitations, and how it stacks up against LangGraph and CrewAI.

ByteDance's DeerFlow 2.0 is a powerful open-source agent harness that executes long-horizon tasks inside Docker sandboxes - impressive engineering, but not a turnkey solution.

NVIDIA Nemotron 3 Super is the strongest open-weight model for agentic coding as of March 2026, but its efficiency-first design means real trade-offs on general knowledge and chat quality.

Mistral's first open-weights TTS model clones voices from 3 seconds of audio, beats ElevenLabs on price, and arrives with real limitations worth knowing.

Moonshot AI's Kimi K2.5 delivers best-in-class open-weight math and a genuinely novel multi-agent architecture, but a brutal hallucination rate and slow inference limit its real-world reliability.

Microsoft's Phi-4 reasoning family delivers near-70B-class math performance in a 14B open-weight package, but the overthinking problem is real and the use case is narrower than the benchmarks suggest.