GPT-5.2 - OpenAI's Flagship Reasoning Model
GPT-5.2 is OpenAI's most capable model with three modes, 400K context, and record-setting professional benchmarks - but speed and pricing raise questions.

GPT-5.2 is OpenAI's flagship large language model, launched on December 11, 2025, as a direct response to Google's Gemini 3 release. It ships in three variants - Instant, Thinking, and Pro - each targeting a different point on the speed-versus-depth spectrum. OpenAI positions it as the first model to match or beat human experts on professional knowledge work, a claim backed by strong benchmark numbers but met with mixed real-world reception.
TL;DR
- OpenAI's most capable model, scoring 70.9% win/tie rate vs. human experts on professional tasks (GDPval)
- 400K context window, $1.75/M input tokens (Thinking), $21/M input (Pro)
- Leads math and coding benchmarks but trails Gemini 3.1 Pro on abstract reasoning (ARC-AGI-2) and Claude Opus 4.6 on creative writing quality
The model arrived during what OpenAI internally called a "Code Red" period - a scramble to regain benchmark leadership after Google's Gemini 3 took the top spot on several evaluations in late 2025. GPT-5.2 succeeded on that front, reclaiming pole position on professional benchmarks, but user feedback suggests the gains come with tradeoffs in output style and speed that matter in practice.
For a deeper look at how GPT-5.2 handles real tasks across coding, research, and creative workflows, see our full GPT-5.2 review.
Key Specifications
| Specification | Details |
|---|---|
| Provider | OpenAI |
| Model Family | GPT-5 |
| Parameters | Not disclosed |
| Context Window | 400,000 tokens (max prompt: 272K, max output: 128K) |
| Input Price (Thinking) | $1.75/M tokens ($0.175/M cached) |
| Output Price (Thinking) | $14.00/M tokens |
| Input Price (Pro) | $21.00/M tokens |
| Output Price (Pro) | $168.00/M tokens |
| Release Date | December 11, 2025 |
| Knowledge Cutoff | August 2025 |
| License | Proprietary |
| Modalities | Text + Image input, Text output |
OpenAI hasn't disclosed the parameter count for GPT-5.2, continuing the trend of keeping architecture details under wraps since GPT-4. The 400K context window is more than a 3x increase over GPT-5.1's 128K and supports near-perfect recall out to roughly 256K tokens, though accuracy degrades on longer sequences.
Whatever the underlying architecture, the 400K context window and the three-tier compute system represent a significant engineering leap over GPT-5.1.
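The 272K/128K split in the spec table matters in practice: a request can be rejected by the input cap even when its total is under 400K. A minimal budget check using those published limits (the token counts in the usage example are illustrative):

```python
# GPT-5.2's published context limits: 400K total, split into a
# 272K max prompt and a 128K max output (per the spec table above).
MAX_PROMPT_TOKENS = 272_000
MAX_OUTPUT_TOKENS = 128_000

def fits_context(prompt_tokens: int, max_output_tokens: int) -> bool:
    """True if the request respects both per-side limits."""
    return (prompt_tokens <= MAX_PROMPT_TOKENS
            and max_output_tokens <= MAX_OUTPUT_TOKENS)

# A 250K-token codebase plus a 64K output budget fits...
print(fits_context(250_000, 64_000))   # True
# ...but a 300K-token prompt exceeds the 272K input cap,
# even though 300K + 64K is still below the 400K total.
print(fits_context(300_000, 64_000))   # False
```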
The Three Variants
GPT-5.2 ships as three distinct modes, all sharing the same base architecture but with different compute budgets:
GPT-5.2 Instant is the speed-optimized variant designed for everyday tasks - information lookup, quick drafts, translation, and how-to walkthroughs. It trades reasoning depth for lower latency and cost. In practice, some users report Instant produces blander output than GPT-5.1, with more refusals and hedging.
GPT-5.2 Thinking is the default reasoning mode and the variant most benchmarks reference. It introduces a "thinking time" toggle with Light, Medium, and Heavy settings that let developers trade latency for depth on a per-request basis. The hidden reasoning tokens are billed at the output rate ($14/M), which can significantly inflate costs for complex queries.
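Because hidden reasoning tokens are billed at the $14/M output rate, the thinking-time setting can dominate the bill long before visible output does. A back-of-the-envelope estimator using the published Thinking rates; the per-setting reasoning-token counts below are illustrative guesses, not figures OpenAI has published:

```python
# Per-request cost for GPT-5.2 Thinking at the published rates:
# $1.75/M input tokens, $14.00/M output tokens. Hidden chain-of-thought
# tokens are billed at the output rate alongside visible output.
RATE_IN, RATE_OUT = 1.75, 14.00  # USD per million tokens

def thinking_cost(input_tok: int, visible_out_tok: int, reasoning_tok: int) -> float:
    billed_out = visible_out_tok + reasoning_tok  # reasoning billed as output
    return (input_tok * RATE_IN + billed_out * RATE_OUT) / 1_000_000

# The same 5K-in / 1K-out query at three hypothetical reasoning budgets
# (the token counts per setting are guesses, not published numbers):
for label, r_tok in [("Light", 2_000), ("Medium", 10_000), ("Heavy", 40_000)]:
    print(f"{label:<6} ${thinking_cost(5_000, 1_000, r_tok):.4f}")
```

Even with these made-up budgets, the pattern holds: the visible 1K tokens of output are a rounding error next to the reasoning tokens at the Heavy setting.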
GPT-5.2 Pro is the research-grade tier, available only on Pro ($200/month), Business, Enterprise, and Edu plans. It supports "xhigh" reasoning for deep multi-step logic, scientific evaluation, and complex business modeling. At $21/M input and $168/M output, it's one of the most expensive API models available - roughly 12x the cost of Thinking mode per output token.
Benchmark Performance
| Benchmark | GPT-5.2 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond | 92.4% | 91.3% | 94.3% |
| MMLU-Pro | 94.2% | - | ~91.4% |
| SWE-Bench Verified | 80.0% | 72.6% | 80.6% |
| SWE-Bench Pro | 55.6% | - | 54.2% |
| HumanEval | 95.0% | - | - |
| AIME 2025 | 100% | - | - |
| MATH | 98.0% | - | - |
| ARC-AGI-2 | 52.9% | 37.6% | 77.1% |
| HLE (no tools) | 34.5% | 41.2% | 44.4% |
| LiveCodeBench | 80.0% | - | - |
| IFEval | 95.0% | - | - |
Scores represent the best publicly reported result for each model variant. GPT-5.2 scores are from the Thinking variant unless noted. Pro variant reaches 93.2% on GPQA Diamond.
The numbers tell a nuanced story. GPT-5.2 dominates the math benchmarks (100% on AIME 2025, 98% on MATH) and holds its own on coding evaluations with 80% on SWE-Bench Verified - essentially tied with Gemini 3.1 Pro's 80.6%. On MMLU-Pro, it leads the field at 94.2%.
But the gaps are real. Gemini 3.1 Pro crushes GPT-5.2 on ARC-AGI-2 (77.1% vs 52.9%), the abstract reasoning benchmark that tests genuine novel problem-solving. And on Humanity's Last Exam - the hardest knowledge benchmark available - GPT-5.2 trails both competitors without tool access. Check our coding benchmarks leaderboard for the latest SWE-Bench and LiveCodeBench rankings, and the Chatbot Arena Elo rankings for human preference scores.
The Thinking mode's Light/Medium/Heavy toggle routes each request to a different compute tier - a departure from the one-size-fits-all approach of earlier models.
Key Capabilities
GPT-5.2's strongest differentiator is professional knowledge work. On GDPval - an evaluation spanning 44 occupations from law to engineering - GPT-5.2 Thinking achieves a 70.9% win-or-tie rate against human industry experts. That's a meaningful lead over Claude Opus 4.5's 59.6% and Gemini 3 Pro's 53.5% on the same benchmark.
The 400K context window opens up use cases that were impractical with GPT-5.1's 128K limit. Enterprise users can process entire codebases, stacks of legal documents, or multi-chapter reports in a single session. Recall accuracy stays near-perfect out to 256K tokens - a significant improvement that matters for retrieval-augmented workflows.
On tool use and agentic tasks, GPT-5.2 shows clear progress. Reliable function calling, structured output generation, and the variable reasoning dial make it well-suited for production pipelines where different queries demand different compute budgets. Augment Code chose GPT-5.2 as their primary model for automated code review, citing its instruction-following and consistency.
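For readers wiring this into a pipeline, here is what such a request might look like. This is a sketch in the Chat Completions format the article says GPT-5.2 is served through: the model name and the `reasoning_effort` value are assumptions, and `list_pull_requests` is a hypothetical tool, not a real API.

```python
import json

# Hypothetical function-calling request for GPT-5.2. The model name and
# reasoning_effort mapping are assumptions; the tools schema follows the
# existing Chat Completions convention.
request = {
    "model": "gpt-5.2",            # assumed identifier for the Thinking variant
    "reasoning_effort": "medium",  # assumed mapping of the Light/Medium/Heavy toggle
    "messages": [
        {"role": "user", "content": "Which open PRs in acme/widgets touch the parser?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "list_pull_requests",  # hypothetical tool
            "description": "List pull requests for a repository.",
            "parameters": {
                "type": "object",
                "properties": {
                    "repo": {"type": "string", "description": "owner/name slug"},
                    "state": {"type": "string", "enum": ["open", "closed", "all"]},
                },
                "required": ["repo"],
            },
        },
    }],
}

print(json.dumps(request, indent=2))
```

The per-request `reasoning_effort` field is what makes the variable compute budget usable in production: cheap lookups go out at a low setting while multi-step tasks get the deeper tier, all against the same model.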
Pricing and Availability
GPT-5.2 is available through the OpenAI API (Chat Completions endpoint) and through ChatGPT for paid subscribers. Here is how pricing compares to competitors:
| Model | Input (per M tokens) | Output (per M tokens) |
|---|---|---|
| GPT-5.2 Thinking | $1.75 | $14.00 |
| GPT-5.2 Pro | $21.00 | $168.00 |
| Claude Opus 4.6 | $15.00 | $75.00 |
| Gemini 3.1 Pro | $2.00 | $10.00 |
GPT-5.2 Thinking is competitively priced against Gemini 3.1 Pro and notably cheaper than Claude Opus 4.6 on a per-token basis. The 90% cached input discount ($0.175/M) makes it especially attractive for applications that reuse system prompts or reference documents. But the hidden reasoning tokens in Thinking mode are billed at the $14/M output rate, so actual costs can run 3-5x higher than the raw token prices suggest for complex queries.
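To see what the cached discount is worth, compare input-side costs on a request that reuses a large system prompt; the token counts are illustrative, the rates are from the pricing table:

```python
# Input-side cost with and without the 90% cached-input discount:
# $1.75/M fresh input vs $0.175/M cached input (Thinking-mode rates).
FRESH, CACHED = 1.75, 0.175  # USD per million input tokens

def input_cost(fresh_tok: int, cached_tok: int) -> float:
    return (fresh_tok * FRESH + cached_tok * CACHED) / 1_000_000

# Illustrative request: a reused 50K-token system prompt + 2K-token query.
no_cache = input_cost(52_000, 0)        # everything billed fresh
with_cache = input_cost(2_000, 50_000)  # system prompt served from cache
print(f"no cache ${no_cache:.4f}  with cache ${with_cache:.4f}")
```

For a prompt that is mostly reused context, the input bill drops by roughly 7x in this sketch, which is why the caching discount matters most for RAG and agent loops that resend the same scaffolding on every call.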
Pro mode at $168/M output tokens is in a different league entirely - reserved for high-stakes tasks where accuracy matters more than cost. Our cost efficiency leaderboard has the full price-performance breakdown across all major models.
ChatGPT access tiers: Free users get GPT-5.2 Instant with limits. Plus ($20/month) and Pro ($200/month) users can select between Instant and Thinking. Pro subscribers also get access to GPT-5.2 Pro mode.
Strengths
- Record professional benchmarks: 70.9% expert win/tie rate on GDPval, 100% on AIME 2025
- Massive context window: 400K tokens with near-perfect recall to 256K
- Variable reasoning depth: Light/Medium/Heavy toggle lets developers tune cost vs. accuracy per request
- Strong coding performance: 80% SWE-Bench Verified, 95% HumanEval
- Competitive Thinking-mode pricing: $1.75/M input is roughly 8.5x cheaper than Claude Opus 4.6's $15.00/M
- Mature ecosystem: Largest third-party tool and integration support of any model family
Weaknesses
- Slow in deep reasoning modes: Thinking (Heavy) and Pro can take 30+ seconds for complex queries
- Bland output style: Users widely report over-formatting, excessive bullet points, and corporate-sounding prose - especially in Instant mode
- Hidden reasoning token costs: Thinking mode bills invisible chain-of-thought tokens at output rates, making costs unpredictable
- Vision limitations: Still struggles with precise image interpretation tasks like counting objects
- Lags on abstract reasoning: 52.9% on ARC-AGI-2 is well behind Gemini 3.1 Pro's 77.1%
- Creative writing gap: Ranked behind Claude Opus 4.6 in human writing quality evaluations
Related Coverage
- Our Review: GPT-5.2 Review - OpenAI's Most Capable Model Tested
- Comparison: ChatGPT vs Claude vs Gemini - Which AI Assistant Should You Use?
- Successor: GPT-5.3 Codex - OpenAI's Agentic Coding Model
- Leaderboards: Coding Benchmarks Leaderboard | Chatbot Arena Rankings | Cost Efficiency Leaderboard
- Enterprise: OpenAI Frontier Review
Sources:
- Introducing GPT-5.2 - OpenAI
- GPT-5.2: Benchmarks, Model Breakdown, and Real-World Performance - DataCamp
- GPT-5.2: Pricing, Context Window, Benchmarks - LLM Stats
- Update to GPT-5 System Card: GPT-5.2 - OpenAI
- GPT-5.2 - OpenRouter
- OpenAI's GPT-5.2 Is Here: What Enterprises Need to Know - VentureBeat
- GPT-5.2 Review: Incredibly Impressive, But Too Slow - Shumer.dev
- Why GPT-5.2 Is Our Model of Choice for Augment Code Review - Augment Code
- Gemini 3.1 Pro vs Claude Opus 4.6 vs GPT-5.2 Comparison - NxCode
