GPT-5.2 - OpenAI's Flagship Reasoning Model
GPT-5.2 is OpenAI's most capable model with three modes, 400K context, and record-setting professional benchmarks - but speed and pricing raise questions.

GPT-5.2 is OpenAI's flagship large language model, launched on December 11, 2025, as a direct response to Google's Gemini 3 release. It ships in three variants - Instant, Thinking, and Pro - each targeting a different point on the speed-versus-depth spectrum. OpenAI positions it as the first model to match or beat human experts on professional knowledge work, a claim backed by strong benchmark numbers but met with mixed real-world reception.
TL;DR
- OpenAI's most capable model, scoring 70.9% win/tie rate vs. human experts on professional tasks (GDPval)
- 400K context window, $1.75/M input tokens (Thinking), $21/M input (Pro)
- Leads math and coding benchmarks but trails Gemini 3.1 Pro on abstract reasoning (ARC-AGI-2) and Claude Opus 4.6 on creative writing quality
The model arrived during what OpenAI internally called a "Code Red" period - a scramble to regain benchmark leadership after Google's Gemini 3 took the top spot on several evaluations in late 2025. GPT-5.2 succeeded on that front, reclaiming pole position on professional benchmarks, but user feedback suggests the gains come with tradeoffs in output style and speed that matter in practice.
For a deeper look at how GPT-5.2 handles real tasks across coding, research, and creative workflows, see our full GPT-5.2 review.
Key Specifications
| Specification | Details |
|---|---|
| Provider | OpenAI |
| Model Family | GPT-5 |
| Parameters | Not disclosed |
| Context Window | 400,000 tokens (max prompt: 272K, max output: 128K) |
| Input Price (Thinking) | $1.75/M tokens ($0.175/M cached) |
| Output Price (Thinking) | $14.00/M tokens |
| Input Price (Pro) | $21.00/M tokens |
| Output Price (Pro) | $168.00/M tokens |
| Release Date | December 11, 2025 |
| Knowledge Cutoff | August 2025 |
| License | Proprietary |
| Modalities | Text + Image input, Text output |
OpenAI hasn't disclosed the parameter count for GPT-5.2, continuing the trend of keeping architecture details under wraps since GPT-4. The 400K context window is more than a 3x increase over GPT-5.1's 128K and supports near-perfect recall out to roughly 256K tokens, though accuracy degrades on longer sequences.
Whatever the underlying architecture, the 400K context window and the three-tier compute system represent a significant engineering leap over GPT-5.1.
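The 272K/128K split in the spec table matters in practice: a request can be rejected by the input cap even when its total is under 400K. A minimal budget check using those published limits (the token counts in the usage example are illustrative):

```python
# GPT-5.2's published context limits: 400K total, split into a
# 272K max prompt and a 128K max output (per the spec table above).
MAX_PROMPT_TOKENS = 272_000
MAX_OUTPUT_TOKENS = 128_000

def fits_context(prompt_tokens: int, max_output_tokens: int) -> bool:
    """True if the request respects both per-side limits."""
    return (prompt_tokens <= MAX_PROMPT_TOKENS
            and max_output_tokens <= MAX_OUTPUT_TOKENS)

# A 250K-token codebase plus a 64K output budget fits...
print(fits_context(250_000, 64_000))   # True
# ...but a 300K-token prompt exceeds the 272K input cap,
# even though 300K + 64K is still below the 400K total.
print(fits_context(300_000, 64_000))   # False
```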
The Three Variants
GPT-5.2 ships as three distinct modes, all sharing the same base architecture but with different compute budgets:
GPT-5.2 Instant is the speed-optimized variant designed for everyday tasks - information lookup, quick drafts, translation, and how-to walkthroughs. It trades reasoning depth for lower latency and cost. In practice, some users report Instant produces blander output than GPT-5.1, with more refusals and hedging.
GPT-5.2 Thinking is the default reasoning mode and the variant most benchmarks reference. It introduces a "thinking time" toggle with Light, Medium, and Heavy settings that let developers trade latency for depth on a per-request basis. The hidden reasoning tokens are billed at the output rate ($14/M), which can significantly inflate costs for complex queries.
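Because hidden reasoning tokens are billed at the $14/M output rate, the thinking-time setting can dominate the bill long before visible output does. A back-of-the-envelope estimator using the published Thinking rates; the per-setting reasoning-token counts below are illustrative guesses, not figures OpenAI has published:

```python
# Per-request cost for GPT-5.2 Thinking at the published rates:
# $1.75/M input tokens, $14.00/M output tokens. Hidden chain-of-thought
# tokens are billed at the output rate alongside visible output.
RATE_IN, RATE_OUT = 1.75, 14.00  # USD per million tokens

def thinking_cost(input_tok: int, visible_out_tok: int, reasoning_tok: int) -> float:
    billed_out = visible_out_tok + reasoning_tok  # reasoning billed as output
    return (input_tok * RATE_IN + billed_out * RATE_OUT) / 1_000_000

# The same 5K-in / 1K-out query at three hypothetical reasoning budgets
# (the token counts per setting are guesses, not published numbers):
for label, r_tok in [("Light", 2_000), ("Medium", 10_000), ("Heavy", 40_000)]:
    print(f"{label:<6} ${thinking_cost(5_000, 1_000, r_tok):.4f}")
```

Even with these made-up budgets, the pattern holds: the visible 1K tokens of output are a rounding error next to the reasoning tokens at the Heavy setting.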
GPT-5.2 Pro is the research-grade tier, available only on Pro ($200/month), Business, Enterprise, and Edu plans. It supports "xhigh" reasoning for deep multi-step logic, scientific evaluation, and complex business modeling. At $21/M input and $168/M output, it's one of the most expensive API models available - roughly 12x the cost of Thinking mode per output token.
Benchmark Performance
| Benchmark | GPT-5.2 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond | 92.4% | 91.3% | 94.3% |
| MMLU-Pro | 94.2% | - | ~91.4% |
| SWE-Bench Verified | 80.0% | 72.6% | 80.6% |
| SWE-Bench Pro | 55.6% | - | 54.2% |
| HumanEval | 95.0% | - | - |
| AIME 2025 | 100% | - | - |
| MATH | 98.0% | - | - |
| ARC-AGI-2 | 52.9% | 37.6% | 77.1% |
| HLE (no tools) | 34.5% | 41.2% | 44.4% |
| LiveCodeBench | 80.0% | - | - |
| IFEval | 95.0% | - | - |
Scores represent the best publicly reported result for each model variant. GPT-5.2 scores are from the Thinking variant unless noted. Pro variant reaches 93.2% on GPQA Diamond.
The numbers tell a nuanced story. GPT-5.2 dominates the math benchmarks (100% on AIME 2025, 98% on MATH) and holds its own on coding evaluations with 80% on SWE-Bench Verified - essentially tied with Gemini 3.1 Pro's 80.6%. On MMLU-Pro, it leads the field at 94.2%.
But the gaps are real. Gemini 3.1 Pro crushes GPT-5.2 on ARC-AGI-2 (77.1% vs 52.9%), the abstract reasoning benchmark that tests genuine novel problem-solving. And on Humanity's Last Exam - the hardest knowledge benchmark available - GPT-5.2 trails both competitors without tool access. Check our coding benchmarks leaderboard for the latest SWE-Bench and LiveCodeBench rankings, and the Chatbot Arena Elo rankings for human preference scores.
The Thinking mode's Light/Medium/Heavy toggle routes each request to a different compute tier - a departure from the one-size-fits-all approach of earlier models.
Key Capabilities
GPT-5.2's strongest differentiator is professional knowledge work. On GDPval - an evaluation spanning 44 occupations from law to engineering - GPT-5.2 Thinking achieves a 70.9% win-or-tie rate against human industry experts. That's a meaningful lead over Claude Opus 4.5's 59.6% and Gemini 3 Pro's 53.5% on the same benchmark.
The 400K context window opens up use cases that were impractical with GPT-5.1's 128K limit. Enterprise users can process entire codebases, stacks of legal documents, or multi-chapter reports in a single session. Recall accuracy stays near-perfect out to 256K tokens - a significant improvement that matters for retrieval-augmented workflows.
On tool use and agentic tasks, GPT-5.2 shows clear progress. Reliable function calling, structured output generation, and the variable reasoning dial make it well-suited for production pipelines where different queries demand different compute budgets. Augment Code chose GPT-5.2 as their primary model for automated code review, citing its instruction-following and consistency.
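For readers wiring this into a pipeline, here is what such a request might look like. This is a sketch in the Chat Completions format the article says GPT-5.2 is served through: the model name and the `reasoning_effort` value are assumptions, and `list_pull_requests` is a hypothetical tool, not a real API.

```python
import json

# Hypothetical function-calling request for GPT-5.2. The model name and
# reasoning_effort mapping are assumptions; the tools schema follows the
# existing Chat Completions convention.
request = {
    "model": "gpt-5.2",            # assumed identifier for the Thinking variant
    "reasoning_effort": "medium",  # assumed mapping of the Light/Medium/Heavy toggle
    "messages": [
        {"role": "user", "content": "Which open PRs in acme/widgets touch the parser?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "list_pull_requests",  # hypothetical tool
            "description": "List pull requests for a repository.",
            "parameters": {
                "type": "object",
                "properties": {
                    "repo": {"type": "string", "description": "owner/name slug"},
                    "state": {"type": "string", "enum": ["open", "closed", "all"]},
                },
                "required": ["repo"],
            },
        },
    }],
}

print(json.dumps(request, indent=2))
```

The per-request `reasoning_effort` field is what makes the variable compute budget usable in production: cheap lookups go out at a low setting while multi-step tasks get the deeper tier, all against the same model.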
Pricing and Availability
GPT-5.2 is available through the OpenAI API (Chat Completions endpoint) and through ChatGPT for paid subscribers. Here is how pricing compares to competitors:
| Model | Input (per M tokens) | Output (per M tokens) |
|---|---|---|
| GPT-5.2 Thinking | $1.75 | $14.00 |
| GPT-5.2 Pro | $21.00 | $168.00 |
| Claude Opus 4.6 | $15.00 | $75.00 |
| Gemini 3.1 Pro | $2.00 | $10.00 |
GPT-5.2 Thinking is competitively priced against Gemini 3.1 Pro and notably cheaper than Claude Opus 4.6 on a per-token basis. The 90% cached input discount ($0.175/M) makes it especially attractive for applications that reuse system prompts or reference documents. But the hidden reasoning tokens in Thinking mode are billed at the $14/M output rate, so actual costs can run 3-5x higher than the raw token prices suggest for complex queries.
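To see what the cached discount is worth, compare input-side costs on a request that reuses a large system prompt; the token counts are illustrative, the rates are from the pricing table:

```python
# Input-side cost with and without the 90% cached-input discount:
# $1.75/M fresh input vs $0.175/M cached input (Thinking-mode rates).
FRESH, CACHED = 1.75, 0.175  # USD per million input tokens

def input_cost(fresh_tok: int, cached_tok: int) -> float:
    return (fresh_tok * FRESH + cached_tok * CACHED) / 1_000_000

# Illustrative request: a reused 50K-token system prompt + 2K-token query.
no_cache = input_cost(52_000, 0)        # everything billed fresh
with_cache = input_cost(2_000, 50_000)  # system prompt served from cache
print(f"no cache ${no_cache:.4f}  with cache ${with_cache:.4f}")
```

For a prompt that is mostly reused context, the input bill drops by roughly 7x in this sketch, which is why the caching discount matters most for RAG and agent loops that resend the same scaffolding on every call.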
Pro mode at $168/M output tokens is in a different league entirely - reserved for high-stakes tasks where accuracy matters more than cost. Our cost efficiency leaderboard has the full price-performance breakdown across all major models.
ChatGPT access tiers: Free users get GPT-5.2 Instant with limits. Plus ($20/month) and Pro ($200/month) users can select between Instant and Thinking. Pro subscribers also get access to GPT-5.2 Pro mode.
Strengths
- Record professional benchmarks: 70.9% expert win/tie rate on GDPval, 100% on AIME 2025
- Massive context window: 400K tokens with near-perfect recall to 256K
- Variable reasoning depth: Light/Medium/Heavy toggle lets developers tune cost vs. accuracy per request
- Strong coding performance: 80% SWE-Bench Verified, 95% HumanEval
- Competitive Thinking-mode pricing: $1.75/M input is roughly 8.5x cheaper than Claude Opus 4.6's $15.00/M
- Mature ecosystem: Largest third-party tool and integration support of any model family
Weaknesses
- Slow in deep reasoning modes: Thinking (Heavy) and Pro can take 30+ seconds for complex queries
- Bland output style: Users widely report over-formatting, excessive bullet points, and corporate-sounding prose - especially in Instant mode
- Hidden reasoning token costs: Thinking mode bills invisible chain-of-thought tokens at output rates, making costs unpredictable
- Vision limitations: Still struggles with precise image interpretation tasks like counting objects
- Lags on abstract reasoning: 52.9% on ARC-AGI-2 is well behind Gemini 3.1 Pro's 77.1%
- Creative writing gap: Ranked behind Claude Opus 4.6 in human writing quality evaluations
Related Coverage
- Our Review: GPT-5.2 Review - OpenAI's Most Capable Model Tested
- Comparison: ChatGPT vs Claude vs Gemini - Which AI Assistant Should You Use?
- Successor: GPT-5.3 Codex - OpenAI's Agentic Coding Model
- Leaderboards: Coding Benchmarks Leaderboard | Chatbot Arena Rankings | Cost Efficiency Leaderboard
- Enterprise: OpenAI Frontier Review
Sources:
- Introducing GPT-5.2 - OpenAI
- GPT-5.2: Benchmarks, Model Breakdown, and Real-World Performance - DataCamp
- GPT-5.2: Pricing, Context Window, Benchmarks - LLM Stats
- Update to GPT-5 System Card: GPT-5.2 - OpenAI
- GPT-5.2 - OpenRouter
- OpenAI's GPT-5.2 Is Here: What Enterprises Need to Know - VentureBeat
- GPT-5.2 Review: Incredibly Impressive, But Too Slow - Shumer.dev
- Why GPT-5.2 Is Our Model of Choice for Augment Code Review - Augment Code
- Gemini 3.1 Pro vs Claude Opus 4.6 vs GPT-5.2 Comparison - NxCode
