Articles Tagged "Reasoning"

Gemini 2.5 Pro

Google DeepMind's flagship thinking model with a 1M-token context window, 84% on GPQA Diamond, and native multimodal understanding of text, images, audio, and video.

Claude Opus 4.5

Anthropic's November 2025 flagship model delivers top SWE-bench scores, a new effort parameter for reasoning control, and a 66% price cut from its predecessor.

Claude 3.7 Sonnet

Anthropic's first hybrid reasoning model with togglable extended thinking, a 200K context window, and state-of-the-art SWE-bench performance at $3/$15 per million tokens.

SU-01

SU-01 is a 30B-A3B MoE reasoning model from Shanghai AI Lab that achieves gold-medal performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 using a three-stage training recipe and test-time scaling.

Olympiad Gold, Broken Memories, and Attention Loss

A 30B model earns IMO gold, memory consolidation silently corrupts agents, and a new metric predicts when LLMs lose track of their instructions.

GPT-OSS 20B

OpenAI's open-weight 21B MoE reasoning model with 131K context, an Apache 2.0 license, and o3-mini-level benchmark performance, able to run in 16 GB of memory.

OpenAI o3-pro

OpenAI's maximum-compute reasoning model targets the hardest problems where o3 falls short, at $20/$80 per million tokens.

OpenAI o3

OpenAI's most advanced reasoning model, built for math, science, coding, and visual tasks, with 200K context and adaptive chain-of-thought at $2/$8 per million tokens.

Reasoning Bias, Behavior Cues, and Tool Interpretability

New research shows reasoning length amplifies position bias, behavior cues cut wasted tokens by 50% while boosting safety, and sparse autoencoders can predict tool failures from model internals.

OpenAI o4-mini

OpenAI o4-mini is a fast, cost-efficient reasoning model in the o-series, delivering near-o3 performance on math and coding benchmarks at roughly 10x lower cost.

Reasoning Model API Pricing Compared - May 2026

Updated May 2026: DeepSeek V4-Flash reasoning now costs $0.28/MTok output (8x cheaper than R1), o3-pro launched at $20/$80, and Grok 4 retires May 15. Verified pricing across 11 models.

ZAYA1-8B: Open Reasoning Model Rivals Claude on AMD GPUs

Zyphra's ZAYA1-8B matches Claude Sonnet 4.5 on HMMT 2025 math benchmarks with just 760M active parameters, trained entirely on AMD Instinct MI300X GPUs and released under Apache 2.0.