Articles Tagged "Reasoning"

Reasoning Benchmarks: GPQA, AIME, and Humanity's Last Exam

Rankings of AI models on the hardest reasoning benchmarks available: GPQA Diamond, AIME competition math, and the notoriously difficult Humanity's Last Exam.

Anthropic Releases Claude Opus 4.6 With Agent Teams and 1M Context Window

Anthropic launches Claude Opus 4.6, featuring agent teams, adaptive thinking, a 1M-token context window, and state-of-the-art performance on Terminal-Bench 2.0 and Humanity's Last Exam.

Grok 4 Review: xAI's Bold Bid for the Reasoning Crown

A comprehensive review of xAI's Grok 4, the first model to score 50% on Humanity's Last Exam, featuring Heavy and Coding variants with built-in tool use.

Claude Opus 4.5

Anthropic's November 2025 flagship model delivers top SWE-bench scores, a new effort parameter for reasoning control, and a 66% price cut from its predecessor.

GPT-OSS 20B

OpenAI's open-weight 21B MoE reasoning model with 131K context, an Apache 2.0 license, and o3-mini-level benchmark performance, able to run in 16 GB of memory.

Gemini 2.5 Pro

Google DeepMind's flagship thinking model with 1M-token context, 84% GPQA Diamond, and native multimodal understanding of text, images, audio, and video.

OpenAI o3-pro

OpenAI's maximum-compute reasoning model targets the hardest problems where o3 falls short, at $20/$80 per million tokens.

OpenAI o3

OpenAI's most advanced reasoning model, built for math, science, coding, and visual tasks, with 200K context and adaptive chain-of-thought at $2/$8 per million tokens.

OpenAI o4-mini

OpenAI o4-mini is a fast, cost-efficient reasoning model in the o-series, delivering near-o3 performance on math and coding benchmarks at roughly 10x lower cost.

Claude 3.7 Sonnet

Anthropic's first hybrid reasoning model with togglable extended thinking, a 200K context window, and state-of-the-art SWE-bench performance at $3/$15 per million tokens.
