
Gemini 3.1 Pro
Google DeepMind's Gemini 3.1 Pro leads on 13 of 16 benchmarks, posting 77.1% on ARC-AGI-2 and 94.3% on GPQA Diamond, with a 1M-token context window priced at $2 per million input tokens.
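
As a rough illustration of the quoted pricing, the sketch below computes the input cost of a request at the stated $2 per million input tokens. It assumes a flat per-token rate and ignores output-token and caching costs, which the blurb does not quote; the constant and function names are ours, not Google's.

```python
# Back-of-the-envelope input cost at the quoted "$2/M input" rate.
# Assumes flat pricing; output-token and caching costs are excluded.

INPUT_PRICE_PER_MILLION = 2.00  # USD per 1M input tokens (from the article)

def input_cost(prompt_tokens: int) -> float:
    """Return the input cost in USD for a prompt of the given size."""
    return prompt_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION

if __name__ == "__main__":
    # A request that fills the full 1M-token context costs about $2 in input.
    for tokens in (10_000, 100_000, 1_000_000):
        print(f"{tokens:>9,} input tokens -> ${input_cost(tokens):.2f}")
```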

Google's Gemini 3 Deep Think trades speed for depth, delivering record-setting scores on reasoning benchmarks, but at a steep price.

Google releases Gemini 3.1 Pro with 77.1% on ARC-AGI-2, more than doubling its predecessor's score and beating Claude Opus 4.6 and GPT-5.2 on most industry benchmarks.

Rankings of AI models on the hardest reasoning benchmarks available: GPQA Diamond, AIME competition math, and the notoriously difficult Humanity's Last Exam.

Anthropic launches Claude Opus 4.6, featuring agent teams, adaptive thinking, a 1M-token context window, and state-of-the-art performance on Terminal-Bench 2.0 and Humanity's Last Exam.