Articles Tagged "Benchmarks"

SubQ Launches: 12M-Token Context on Sub-Quadratic AI

Subquadratic exits stealth with a $29M seed round and SubQ, the first frontier model built on a sparse-attention architecture, offering a 12M-token context window at a fraction of the cost of Opus.

Mayo Clinic AI Spots Pancreatic Cancer 3 Years Early

REDMOD, Mayo Clinic's radiomics AI, detects 73% of pancreatic cancers in CT scans that look normal to radiologists - nearly double the rate specialists achieve.

Qwen 3.6 Max Review: Alibaba's Coding Contender

Qwen3.6-Max-Preview tops six coding benchmarks and ranks third globally, but its closed-weights pivot and verbosity issues complicate the picture.

OpenAI o1 Outperforms ER Doctors in Harvard Trial

A peer-reviewed Science study puts OpenAI o1 through 76 live emergency room cases - and the model beats expert physicians on initial triage with 67.1% accuracy against 55% and 50%.

GPT-5.5 vs Claude Opus 4.7: Benchmarks and Pricing

GPT-5.5 and Claude Opus 4.7 both launched in April 2026 with 1M-token context windows and a focus on agentic coding. One leads on math and long-context retrieval, the other on software engineering and vision.

Google TPU 8t - AI Training at ExaFLOP Scale

Google's TPU 8t packs 12.6 PFLOPS of FP4 compute and 216 GB of HBM3e per chip, scaling to 9,600-chip superpods with 121 ExaFLOPS and 2 petabytes of shared HBM for massive model training.

Claude Mythos Preview Review: Escaped Its Sandbox

Claude Mythos Preview posted the highest SWE-bench score ever, found thousands of real zero-days in production software, and, during safety testing, escaped its sandbox to email a researcher eating lunch in a park.

Cost Efficiency Leaderboard: Best AI Performance Per Dollar

Rankings of AI models by cost efficiency in May 2026, comparing performance per dollar across frontier and budget models. Updated with DeepSeek V4, GPT-5.5, and Kimi K2.6.

Nemotron 3 Nano Omni

NVIDIA's first open omni-modal model: 30B total / 3B active hybrid Mamba-MoE that processes text, images, audio, and video in a single inference loop, with 9x higher throughput than comparable open omni models.

Mistral Medium 3.5

Mistral's first flagship merged model: a dense 128B with configurable reasoning, vision, and 77.6% SWE-Bench Verified, self-hostable on 4 GPUs.

Mistral Ships Medium 3.5 With Cloud Coding Agents

Mistral releases Medium 3.5, a 128B open-weights model that scores 77.6% on SWE-Bench Verified, and pairs it with asynchronous cloud coding agents in Vibe that open pull requests on GitHub while you are away.

DeepSeek V4

DeepSeek V4 ships in two open-weight MoE variants - V4-Pro at 1.6T/49B active and V4-Flash at 284B/13B active - both with 1M-token context and MIT license, released April 24, 2026.
