GPT-5.2 - OpenAI's Flagship Reasoning Model

GPT-5.2 is OpenAI's most capable model with three modes, 400K context, and record-setting professional benchmarks - but speed and pricing raise questions.

GPT-5.2 is OpenAI's flagship large language model, launched on December 11, 2025, as a direct response to Google's Gemini 3 release. It ships in three variants - Instant, Thinking, and Pro - each targeting a different point on the speed-versus-depth spectrum. OpenAI positions it as the first model to match or beat human experts on professional knowledge work, a claim backed by strong benchmark numbers but met with mixed real-world reception.

TL;DR

  • OpenAI's most capable model, scoring 70.9% win/tie rate vs. human experts on professional tasks (GDPval)
  • 400K context window, $1.75/M input tokens (Thinking), $21/M input (Pro)
  • Leads math and coding benchmarks but trails Gemini 3.1 Pro on abstract reasoning (ARC-AGI-2) and Claude Opus 4.6 on creative writing quality

The model arrived during what OpenAI internally called a "Code Red" period - a scramble to regain benchmark leadership after Google's Gemini 3 took the top spot on several evaluations in late 2025. GPT-5.2 succeeded on that front, reclaiming pole position on professional benchmarks, but user feedback suggests the gains come with tradeoffs in output style and speed that matter in practice.

For a deeper look at how GPT-5.2 handles real tasks across coding, research, and creative workflows, see our full GPT-5.2 review.

Key Specifications

Specification            Details
Provider                 OpenAI
Model Family             GPT-5
Parameters               Not disclosed
Context Window           400,000 tokens (max prompt: 272K, max output: 128K)
Input Price (Thinking)   $1.75/M tokens ($0.175/M cached)
Output Price (Thinking)  $14.00/M tokens
Input Price (Pro)        $21.00/M tokens
Output Price (Pro)       $168.00/M tokens
Release Date             December 11, 2025
Knowledge Cutoff         August 2025
License                  Proprietary
Modalities               Text + Image input, Text output

OpenAI hasn't disclosed the parameter count for GPT-5.2, continuing the trend of keeping architecture details under wraps since GPT-4. The 400K context window is a 3x increase over GPT-5.1's 128K and supports near-perfect recall out to roughly 256K tokens, though accuracy degrades on longer sequences.

[Image caption: OpenAI keeps GPT-5.2's architecture details under wraps, but the 400K context window and three-tier compute system represent a significant engineering leap over GPT-5.1.]

The Three Variants

GPT-5.2 ships as three distinct modes, all sharing the same base architecture but with different compute budgets:

GPT-5.2 Instant is the speed-optimized variant designed for everyday tasks - information lookup, quick drafts, translation, and how-to walkthroughs. It trades reasoning depth for lower latency and cost. In practice, some users report Instant produces blander output than GPT-5.1, with more refusals and hedging.

GPT-5.2 Thinking is the default reasoning mode and the variant most benchmarks reference. It introduces a "thinking time" toggle with Light, Medium, and Heavy settings that let developers trade latency for depth on a per-request basis. The hidden reasoning tokens are billed at the output rate ($14/M), which can significantly inflate costs for complex queries.

GPT-5.2 Pro is the research-grade tier, available only on Pro ($200/month), Business, Enterprise, and Edu plans. It supports "xhigh" reasoning for deep multi-step logic, scientific evaluation, and complex business modeling. At $21/M input and $168/M output, it's one of the most expensive API models available - roughly 12x the cost of Thinking mode per output token.
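The three-tier structure lends itself to per-request routing: cheap tasks go to Instant, most work goes to Thinking at an appropriate depth, and only the hardest queries escalate to Pro. A minimal sketch of such a router follows; the model identifiers ("gpt-5.2-instant", etc.) and the "reasoning_effort" field are assumptions inferred from the tier names above, not confirmed API parameters.

```python
# Sketch: route each request to a compute tier based on a caller-supplied
# complexity score (1-10). Model ids and the "reasoning_effort" values are
# illustrative assumptions, not confirmed OpenAI API identifiers.

def pick_tier(complexity: int) -> dict:
    """Map a 1-10 complexity score to a model/effort combination."""
    if complexity <= 3:
        # Lookups, quick drafts, translation: speed over depth.
        return {"model": "gpt-5.2-instant"}
    if complexity <= 7:
        # Default reasoning tier; dial Light vs. Medium per request.
        effort = "light" if complexity <= 5 else "medium"
        return {"model": "gpt-5.2-thinking", "reasoning_effort": effort}
    if complexity <= 9:
        return {"model": "gpt-5.2-thinking", "reasoning_effort": "heavy"}
    # Research-grade tier for multi-step logic and scientific evaluation.
    return {"model": "gpt-5.2-pro", "reasoning_effort": "xhigh"}

# Example: a mid-complexity refactoring task lands on Thinking (medium).
print(pick_tier(6))
```

A routing layer like this matters mostly for cost control: since hidden reasoning tokens are billed at output rates, sending every request to Heavy or Pro can multiply spend with no visible quality gain on simple tasks.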

Benchmark Performance

Benchmark            GPT-5.2   Claude Opus 4.6   Gemini 3.1 Pro
GPQA Diamond         92.4%     91.3%             94.3%
MMLU-Pro             94.2%     -                 ~91.4%
SWE-Bench Verified   80.0%     72.6%             80.6%
SWE-Bench Pro        55.6%     -                 54.2%
HumanEval            95.0%     -                 -
AIME 2025            100%      -                 -
MATH                 98.0%     -                 -
ARC-AGI-2            52.9%     37.6%             77.1%
HLE (no tools)       34.5%     41.2%             44.4%
LiveCodeBench        80.0%     -                 -
IFEval               95.0%     -                 -

Scores represent the best publicly reported result for each model variant. GPT-5.2 scores are from the Thinking variant unless noted. Pro variant reaches 93.2% on GPQA Diamond.

The numbers tell a nuanced story. GPT-5.2 dominates math benchmarks (100% AIME 2025, 98% MATH) and holds its own on coding evaluations with 80% on SWE-Bench Verified - effectively tied with Gemini 3.1 Pro's 80.6%. On MMLU-Pro, it leads the field at 94.2%.

But the gaps are real. Gemini 3.1 Pro crushes GPT-5.2 on ARC-AGI-2 (77.1% vs 52.9%), the abstract reasoning benchmark that tests genuine novel problem-solving. And on Humanity's Last Exam - the hardest knowledge benchmark available - GPT-5.2 trails both competitors without tool access. Check our coding benchmarks leaderboard for the latest SWE-Bench and LiveCodeBench rankings, and the Chatbot Arena Elo rankings for human preference scores.

[Image caption: The Thinking mode's Light/Medium/Heavy toggle routes each request to a different compute tier - a departure from the one-size-fits-all approach of earlier models.]

Key Capabilities

GPT-5.2's strongest differentiator is professional knowledge work. On GDPval - an evaluation spanning 44 occupations from law to engineering - GPT-5.2 Thinking achieves a 70.9% win-or-tie rate against human industry experts. That's a meaningful gap over Claude Opus 4.5's 59.6% and Gemini 3 Pro's 53.5% on the same benchmark.

The 400K context window opens up use cases that were impractical with GPT-5.1's 128K limit. Enterprise users can process entire codebases, stacks of legal documents, or multi-chapter reports in a single session. Recall accuracy stays near-perfect out to 256K tokens - a significant improvement that matters for retrieval-augmented workflows.
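The published limits above (272K max prompt, 128K max output, 400K total, reliable recall to roughly 256K) suggest a simple pre-flight check before submitting large documents. The sketch below assumes the caller already has token counts from a tokenizer; the constants come straight from the specifications quoted in this article.

```python
# Sketch: validate a request against GPT-5.2's published limits.
# Token counts are supplied by the caller; a real pipeline would
# compute them with a tokenizer before building the request.

MAX_PROMPT = 272_000       # max prompt tokens
MAX_OUTPUT = 128_000       # max output tokens
MAX_CONTEXT = 400_000      # total context window
RELIABLE_RECALL = 256_000  # near-perfect recall holds out to ~256K

def check_budget(prompt_tokens: int, max_output_tokens: int) -> list:
    """Return a list of warnings; empty means the request fits cleanly."""
    warnings = []
    if prompt_tokens > MAX_PROMPT:
        warnings.append("prompt exceeds 272K limit")
    if max_output_tokens > MAX_OUTPUT:
        warnings.append("output cap exceeds 128K limit")
    if prompt_tokens + max_output_tokens > MAX_CONTEXT:
        warnings.append("prompt + output exceeds 400K context")
    if prompt_tokens > RELIABLE_RECALL:
        warnings.append("prompt beyond ~256K: expect recall degradation")
    return warnings

# A 300K-token prompt trips both the hard limit and the recall threshold.
print(check_budget(300_000, 64_000))
```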

On tool use and agentic tasks, GPT-5.2 shows clear progress. Reliable function calling, structured output generation, and the variable reasoning dial make it well-suited for production pipelines where different queries demand different compute budgets. Augment Code chose GPT-5.2 as their primary model for automated code review, citing its instruction-following and consistency.
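For a sense of what an agentic code-review pipeline built on these capabilities might look like, here is a hedged sketch of a function-calling request payload. The tools schema follows the standard JSON-Schema style used by chat-completion APIs; the model id, the "reasoning_effort" field, and the `flag_issue` tool are all illustrative assumptions, not part of any documented integration.

```python
# Sketch: a function-calling request payload for automated code review.
# The model id "gpt-5.2-thinking", the "reasoning_effort" field, and the
# "flag_issue" tool are hypothetical, chosen to illustrate the pattern.

def build_review_request(diff: str) -> dict:
    return {
        "model": "gpt-5.2-thinking",
        "reasoning_effort": "medium",
        "messages": [
            {"role": "system", "content": "You are an automated code reviewer."},
            {"role": "user", "content": diff},
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "flag_issue",
                "description": "Record a single code-review finding.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "line": {"type": "integer"},
                        "severity": {"type": "string",
                                     "enum": ["info", "warning", "error"]},
                        "message": {"type": "string"},
                    },
                    "required": ["line", "severity", "message"],
                },
            },
        }],
    }

req = build_review_request("diff --git a/app.py b/app.py ...")
print(req["tools"][0]["function"]["name"])
```

Constraining the model to emit structured tool calls, rather than free prose, is what makes findings machine-checkable downstream - the property production pipelines care about most.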

Pricing and Availability

GPT-5.2 is available through the OpenAI API (Chat Completions endpoint) and through ChatGPT for paid subscribers. Here is how pricing compares to competitors:

Model              Input (per M tokens)   Output (per M tokens)
GPT-5.2 Thinking   $1.75                  $14.00
GPT-5.2 Pro        $21.00                 $168.00
Claude Opus 4.6    $15.00                 $75.00
Gemini 3.1 Pro     $2.00                  $10.00

GPT-5.2 Thinking is competitively priced against Gemini 3.1 Pro and notably cheaper than Claude Opus 4.6 on a per-token basis. The 90% cached input discount ($0.175/M) makes it especially attractive for applications that reuse system prompts or reference documents. But the hidden reasoning tokens in Thinking mode are billed at the $14/M output rate, so actual costs can run 3-5x higher than the raw token prices suggest for complex queries.
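The interaction between cached-input discounts and hidden reasoning tokens can be made concrete with a small cost estimator. The per-token rates below are the Thinking-mode prices from this article; the reasoning-token volume in the example is an illustrative assumption, since OpenAI does not publish how many hidden tokens a given query generates.

```python
# Sketch: effective cost of one Thinking-mode call, including hidden
# reasoning tokens billed at the $14/M output rate. Rates match the
# article; the reasoning-token count is an illustrative assumption.

INPUT_RATE = 1.75 / 1_000_000    # $ per fresh input token
CACHED_RATE = 0.175 / 1_000_000  # 90% discount on cached input
OUTPUT_RATE = 14.00 / 1_000_000  # visible output AND reasoning tokens

def thinking_cost(input_tok, cached_tok, output_tok, reasoning_tok):
    """Dollar cost of a single call with partial prompt caching."""
    fresh = input_tok - cached_tok
    return (fresh * INPUT_RATE
            + cached_tok * CACHED_RATE
            + (output_tok + reasoning_tok) * OUTPUT_RATE)

# A complex query: 10K input (8K cached), 2K visible output, plus an
# assumed 6K hidden reasoning tokens - reasoning dominates the bill.
cost = thinking_cost(10_000, 8_000, 2_000, 6_000)
print(round(cost, 4))  # → 0.1169
```

In this example the hidden reasoning tokens account for roughly 70% of the total, which is why headline per-token prices understate real spend on complex queries.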

Pro mode at $168/M output tokens is in a different league entirely - reserved for high-stakes tasks where accuracy matters more than cost. Our cost efficiency leaderboard has the full price-performance breakdown across all major models.

ChatGPT access tiers: Free users get GPT-5.2 Instant with limits. Plus ($20/month) and Pro ($200/month) users can select between Instant and Thinking. Pro subscribers also get access to GPT-5.2 Pro mode.

Strengths

  • Record professional benchmarks: 70.9% expert win/tie rate on GDPval, 100% on AIME 2025
  • Massive context window: 400K tokens with near-perfect recall to 256K
  • Variable reasoning depth: Light/Medium/Heavy toggle lets developers tune cost vs. accuracy per request
  • Strong coding performance: 80% SWE-Bench Verified, 95% HumanEval
  • Competitive Thinking-mode pricing: $1.75/M input undercuts Claude Opus 4.6 by 8.5x
  • Mature ecosystem: Largest third-party tool and integration support of any model family

Weaknesses

  • Slow in deep reasoning modes: Thinking (Heavy) and Pro can take 30+ seconds for complex queries
  • Bland output style: Users widely report over-formatting, excessive bullet points, and corporate-sounding prose - especially in Instant mode
  • Hidden reasoning token costs: Thinking mode bills invisible chain-of-thought tokens at output rates, making costs unpredictable
  • Vision limitations: Still struggles with precise image interpretation tasks like counting objects
  • Lags on abstract reasoning: 52.9% on ARC-AGI-2 is well behind Gemini 3.1 Pro's 77.1%
  • Creative writing gap: Ranked behind Claude Opus 4.6 in human writing quality evaluations

About the author

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.