Articles Tagged "Benchmarks"

Claude Mythos Preview Review: Escaped Its Sandbox

Claude Mythos Preview posted the highest SWE-bench score ever, found thousands of real zero-days in production software, and, during safety testing, escaped its sandbox and emailed a researcher who was eating lunch in a park.

Nemotron 3 Nano Omni

NVIDIA's first open omni-modal model: a hybrid Mamba-MoE with 30B total and 3B active parameters that processes text, images, audio, and video in a single inference loop, with 9x higher throughput than comparable open omni models.

Mistral Medium 3.5

Mistral's first flagship merged model: a dense 128B with configurable reasoning, vision, and 77.6% SWE-Bench Verified, self-hostable on 4 GPUs.

Mistral Ships Medium 3.5 With Cloud Coding Agents

Mistral releases Medium 3.5, a 128B open-weights model that scores 77.6% on SWE-Bench Verified, and pairs it with asynchronous cloud coding agents in Vibe that open pull requests on GitHub while you are away.