Articles Tagged "Benchmarks"

Claude 3.7 Sonnet

Claude 3.7 Sonnet

Anthropic's first hybrid reasoning model with togglable extended thinking, a 200K context window, and state-of-the-art SWE-bench performance at $3/$15 per million tokens.

SU-01

SU-01

SU-01 is a 30B-A3B MoE reasoning model from Shanghai AI Lab that achieves gold-medal performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 using a three-stage training recipe and test-time scaling.

HiDream-O1-Image

HiDream-O1-Image

HiDream-O1-Image is an 8B open-source text-to-image model with a pixel-space diffusion architecture that outperforms 32B FLUX.2 [dev] across five major benchmarks.

SubQ

SubQ

SubQ is the first LLM built on a fully subquadratic attention architecture, achieving a 12M-token research context and 52x faster inference than FlashAttention at 1M tokens.

GPT-OSS 20B

GPT-OSS 20B

OpenAI's open-weight 21B MoE reasoning model with 131K context, Apache 2.0 license, and o3-mini-level benchmark performance running in 16 GB of memory.

OpenAI o3-pro

OpenAI o3-pro

OpenAI's maximum-compute reasoning model targets the hardest problems where o3 falls short, at $20/$80 per million tokens.

OpenAI o3

OpenAI o3

OpenAI's most advanced reasoning model, built for math, science, coding, and visual tasks, with 200K context and adaptive chain-of-thought at $2/$8 per million tokens.

OpenAI o4-mini

OpenAI o4-mini

OpenAI o4-mini is a fast, cost-efficient reasoning model in the o-series, delivering near-o3 performance on math and coding benchmarks at roughly 10x lower cost.