GPT-5.1
GPT-5.1 is OpenAI's November 2025 coding and agentic flagship with 400K context, configurable reasoning effort, and 76.3% on SWE-bench Verified.

GPT-5.1 is the second major release in OpenAI's GPT-5 family, landing on November 12, 2025, roughly three months after the base GPT-5 launch. OpenAI positioned it as the best model for coding and agentic workflows, pairing a 400K context window with configurable reasoning effort that lets developers dial between speed and depth per request. It isn't a dramatic leap over GPT-5; it's a measured tuning pass that tightened instruction following, added personality customization, and moved the coding benchmark needle up several points before GPT-5.2 arrived in December.
TL;DR
- OpenAI's first GPT-5 update, released November 12, 2025, built for coding and multi-step agentic tasks
- 400K context, $1.25/M input tokens, 76.3% SWE-bench Verified - second-ranked coding model at launch
- Stronger math and instruction following than GPT-5; trails Claude 4.5 marginally on SWE-bench, lags GPT-5.2 on ARC-AGI-2 (17.6% vs 52.9%)
The release came a week before the November 19 debut of two additional variants - GPT-5.1-Codex-Max and GPT-5.1 Pro - rounding out a family of five models. All were built on the same architecture, differentiated by compute budget and use case. The Codex variants got dedicated tooling support: an apply_patch tool for code edits without JSON escaping, plus shell command execution. Both changes were aimed directly at agentic coding pipelines where the model writes and runs code in loops.
A multi-agent trust study published on arXiv (paper 2606.14923) later included GPT-5.1 in its test cohort alongside Claude Opus 4.6, Claude Sonnet 4.6, and Gemini 3.1 Pro, examining how these frontier models calibrate verification behavior when working with other agents.
Key Specifications
| Specification | Details |
|---|---|
| Provider | OpenAI |
| Model Family | GPT-5 |
| Parameters | Not disclosed |
| Context Window | 400,000 tokens (max output: 128K) |
| Input Price | $1.25/M tokens ($0.125/M cached) |
| Output Price | $10.00/M tokens |
| Release Date | November 12, 2025 |
| Knowledge Cutoff | September 30, 2024 |
| License | Proprietary |
| Modalities | Text + image input, text output |
| API Endpoint | gpt-5.1, gpt-5.1-2025-11-13 |
The 400K context window matches what GPT-5.2 shipped with a month later, which OpenAI described as a 3x increase over the original GPT-5's context ceiling. Prompt caching cuts the $1.25/M input rate by 90% for repeated system prompts, making GPT-5.1 cost-effective for long-running sessions. OpenAI extended prompt cache retention to 24 hours, useful for overnight agentic runs.
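To make the caching discount concrete, the sketch below estimates the input-side cost of a session from the rates in the table above. The helper name, the token counts, and the assumption that every reused prompt token bills at the cached rate are illustrative; real billing depends on how much of each prompt actually hits the cache.

```python
from decimal import Decimal

# Rates quoted in this article, in USD per million input tokens.
INPUT_RATE = Decimal("1.25")
CACHED_INPUT_RATE = Decimal("0.125")


def input_cost(uncached_tokens: int, cached_tokens: int = 0) -> Decimal:
    """Estimate the input cost in USD for one request (hypothetical helper)."""
    million = Decimal(1_000_000)
    return (
        Decimal(uncached_tokens) * INPUT_RATE
        + Decimal(cached_tokens) * CACHED_INPUT_RATE
    ) / million


# Illustrative session: a 50,000-token system prompt reused across 200 calls,
# each adding 2,000 fresh tokens. The first call pays full price for the prompt.
calls = 200
prompt_tokens, fresh_tokens = 50_000, 2_000

without_cache = calls * input_cost(prompt_tokens + fresh_tokens)
with_cache = input_cost(prompt_tokens + fresh_tokens) + (calls - 1) * input_cost(
    fresh_tokens, cached_tokens=prompt_tokens
)

print(f"no caching:   ${without_cache:.2f}")
print(f"with caching: ${with_cache:.2f}")
```

Under these assumptions the cached session costs roughly one-seventh of the uncached one on the input side; output tokens are billed separately and are unaffected by caching.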
The Model Variants
GPT-5.1 shipped as five distinct variants, all sharing the same base architecture.
GPT-5.1 Instant handles general conversational tasks. OpenAI called it "more conversational than earlier chat models" with an adaptive reasoning mode that decides when to think before answering, rather than always engaging full chain-of-thought. For routine queries it's noticeably faster than GPT-5.
GPT-5.1 Thinking is the reasoning variant where you set reasoning_effort to low, medium, or high - or disable it completely with none for latency-critical applications. This is the model most benchmarks reference when reporting GPT-5.1 scores.
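A per-request effort setting might look like the sketch below. It only builds and validates the request arguments, so it runs without network access. The `reasoning={"effort": ...}` shape is assumed to follow the OpenAI Responses API, and `build_request` is a hypothetical helper; check the current API reference before relying on the exact field names.

```python
# Effort levels named in this article; "none" disables reasoning.
ALLOWED_EFFORTS = ("none", "low", "medium", "high")


def build_request(prompt: str, effort: str = "medium") -> dict:
    """Build keyword arguments for a Responses API call (hypothetical helper)."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"effort must be one of {ALLOWED_EFFORTS}, got {effort!r}")
    return {
        "model": "gpt-5.1",
        "input": prompt,
        "reasoning": {"effort": effort},
    }


# Latency-critical path: skip reasoning entirely.
fast = build_request("Classify this ticket as bug or feature.", effort="none")
# Harder task: spend more compute on the answer.
deep = build_request("Find the race condition in this scheduler.", effort="high")

# With the official SDK installed and an API key configured, the call would be
# roughly: OpenAI().responses.create(**fast)
```

Keeping the effort level as an explicit argument makes it easy to route cheap, latency-sensitive calls and expensive, deliberate calls through the same code path.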
GPT-5.1-Codex-Mini is the cost-effective coding tier. It accepts MCP server configurations as native input with tool definitions, making it a first-class citizen in automated development pipelines.
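As a rough illustration of passing an MCP server alongside a request, the sketch below builds a tool entry as a plain dictionary. The field names (`type`, `server_label`, `server_url`, `require_approval`) are assumed to match the remote MCP tool in the OpenAI Responses API, the model identifier is assumed, and the server label and URL are placeholders rather than a real endpoint.

```python
def mcp_tool(server_label: str, server_url: str, require_approval: str = "always") -> dict:
    """Describe a remote MCP server as a tool entry (shape assumed, see above)."""
    if require_approval not in ("always", "never"):
        raise ValueError("require_approval must be 'always' or 'never'")
    return {
        "type": "mcp",
        "server_label": server_label,
        "server_url": server_url,
        "require_approval": require_approval,
    }


request = {
    "model": "gpt-5.1-codex-mini",  # assumed API name for the Codex-Mini tier
    "input": "List the open TODOs in the repository.",
    # Placeholder server; substitute a real MCP endpoint.
    "tools": [mcp_tool("repo-tools", "https://example.com/mcp")],
}
```

Defaulting `require_approval` to "always" is a conservative choice for automated pipelines: tool calls are held for approval until the server is explicitly trusted.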
GPT-5.1-Codex-Max (November 19) targets large-scale agentic coding. The apply_patch tool and shell execution capability made it a direct competitor to agents like Devin and Replit Agent on multi-file, multi-step repository work.
GPT-5.1 Pro (November 19) replaced GPT-5 Pro for subscribers on the $200/month plan.
Benchmark Performance
| Benchmark | GPT-5.1 | Claude 4.5 | Gemini 3 Pro |
|---|---|---|---|
| GPQA Diamond | 87.3% | Not reported | Not reported |
| MMLU-Pro | 87.0% | Not reported | Not reported |
| SWE-bench Verified | 76.3% | 77.2% | Not reported |
| AIME 2025 | 94.0% | ~88% | Not reported |
| ARC-AGI-2 | 17.6% | Not reported | Not reported |
| HLE (no tools) | 26.5% | Not reported | Not reported |
| LiveCodeBench | 86.8% | Not reported | Not reported |
| tau-bench | 81.9% | Not reported | Not reported |
Scores are from the Thinking variant where reasoning effort applies. Competitor scores represent best publicly available results at time of GPT-5.1 launch.
The numbers show a model that's strong at math and structured coding but hits a ceiling on the hardest novel reasoning problems. The 94% AIME 2025 score stands out, placing GPT-5.1 about six points ahead of Claude 4.5's ~88% on the same benchmark. Coding is more competitive: Claude 4.5 held a slim edge on SWE-bench Verified at 77.2% versus GPT-5.1's 76.3%, a 0.9-point gap small enough to dissolve in run-to-run variance.
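The size of the SWE-bench gap is easier to judge as a task count. The sketch below converts the difference into tasks on a 500-task benchmark and compares it with a binomial standard error. That error estimate is a simplification: it treats tasks as independent and is only a rough proxy for run-to-run variance.

```python
from decimal import Decimal
from math import sqrt

TASKS = 500  # size of SWE-bench Verified, as cited in this article

gpt_51 = Decimal("76.3")
claude_45 = Decimal("77.2")

# Gap in percentage points, then converted to a number of tasks.
gap_points = claude_45 - gpt_51
gap_tasks = gap_points * TASKS / 100

# Binomial standard error for a pass rate p measured over n tasks.
p = float(gpt_51) / 100
std_error_points = sqrt(p * (1 - p) / TASKS) * 100

print(f"gap: {gap_points} points, about {gap_tasks} tasks")
print(f"standard error: about {std_error_points:.1f} points")
```

Under these assumptions the 0.9-point gap corresponds to about 4.5 tasks and sits well inside one standard error of roughly 1.9 points.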
The ARC-AGI-2 score of 17.6% is where things get uncomfortable. GPT-5.2 jumped to 52.9% a month later, which suggests OpenAI made a meaningful architectural change rather than just tuning between releases. For tasks requiring genuine novel reasoning on visual puzzles, GPT-5.1 is near the bottom of the frontier pack. See the reasoning benchmarks leaderboard for context on how the family evolved from here.
The Humanity's Last Exam score of 26.5% is consistent with the pattern: GPT-5.1 is a strong practitioner model, not a research-grade reasoning system. GPT-5.2 raised that to 34.5%, still trailing Gemini 3.1 Pro (44.4%) and Claude Opus 4.6 (41.2%).
Key Capabilities
GPT-5.1's main value proposition is agentic coding reliability. At 76.3% on SWE-bench Verified, it resolved roughly 381 of 500 real GitHub issues without human intervention - a production-relevant number for teams building autonomous code review or bug-fix pipelines. The apply_patch tool on Codex variants removes a friction point that caused earlier models to mangle JSON-escaped patches; shell execution lets the model verify its own fixes by running tests before committing.
Instruction following saw targeted improvement over GPT-5. Reviewers at launch consistently noted better prompt adherence on structured workflows - fewer hallucinated JSON fields, more reliable tool call sequencing. For agentic pipelines where the model orchestrates multiple tool calls in a loop, that consistency compounds across steps.
The eight personality presets (Default, Friendly, Efficient, Professional, Candid, Quirky, Nerdy, Cynical) got some attention at launch but matter little for API users building products. They're more relevant for ChatGPT consumer use where OpenAI was competing on warmth and engagement against Claude's conversational style. In practice, the "Efficient" and "Professional" presets reduce verbosity - useful for cost-sensitive applications billed on output tokens.
Multi-agent behavior has been independently studied. The arXiv paper 2606.14923 found that larger frontier models including GPT-5.1 reduce verification behavior by 60-85% when paired with reliable partners - trusting teammates rather than re-checking their work. This aligns with what teams running GPT-5.1 in multi-agent setups reported: the model cooperates efficiently but doesn't maintain independent verification unless explicitly prompted to do so.
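One way to keep independent verification in place is to enforce it in the orchestration layer rather than leaving the decision to the model. The sketch below is a hypothetical wrapper: `partner` and `verifier` are stand-in callables for agent calls, and nothing here reflects the setup used in the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Handoff:
    output: str
    verified: bool


def checked_handoff(
    task: str,
    partner: Callable[[str], str],
    verifier: Callable[[str, str], bool],
    max_attempts: int = 2,
) -> Handoff:
    """Always verify a partner agent's output, however reliable it has been."""
    output = ""
    for _ in range(max_attempts):
        output = partner(task)
        if verifier(task, output):
            return Handoff(output=output, verified=True)
    # Surface the unverified result instead of silently trusting it.
    return Handoff(output=output, verified=False)


# Toy stand-ins: the partner answers an arithmetic task and the verifier
# recomputes the sum independently.
result = checked_handoff(
    "2 + 2",
    partner=lambda task: "4",
    verifier=lambda task, out: out == str(sum(int(x) for x in task.split("+"))),
)
```

Because the check runs on every handoff, a partner's track record never reduces how often its work is re-examined; the cost is one extra verification call per step.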
Pricing and Availability
GPT-5.1 launched at $1.25/M input and $10.00/M output through the OpenAI API, with a 90% cached input discount at $0.125/M. At launch, this made it cheaper than Claude 4.5 on a per-token basis while delivering comparable coding benchmark scores.
| Model | Input (per M tokens) | Output (per M tokens) |
|---|---|---|
| GPT-5.1 | $1.25 | $10.00 |
| GPT-5.1 (cached) | $0.125 | $10.00 |
| Claude Sonnet 4.5 | $3.00 | $15.00 |
| GPT-5.2 Thinking | $1.75 | $14.00 |
ChatGPT availability came first on November 12, with the API following a day later, which is why the versioned snapshot carries a November 13 date. Free users got limited GPT-5.1 Instant access; Plus ($20/month) users could select both Instant and Thinking; Pro ($200/month) subscribers got GPT-5.1 Pro starting November 19. Microsoft Copilot also picked up GPT-5.1, giving enterprise Microsoft 365 customers access through their existing licensing.
Note: GPT-5.1 was retired from ChatGPT on March 11, 2026, replaced by GPT-5.2 and later models. API access via the versioned snapshot gpt-5.1-2025-11-13 may remain available, but production users should assess migrating to GPT-5.4 or GPT-5.5.
Strengths
- Agentic coding focus: apply_patch tool, shell execution, and MCP server support on Codex variants target real production agentic pipelines
- Extended prompt caching: 24-hour cache retention reduces cost for long-running agentic sessions
- Configurable reasoning: reasoning_effort from none to high lets developers tune latency vs. depth per request
- Strong math: 94% AIME 2025 beats Claude 4.5 and sets a high bar for math reasoning at launch
- Competitive pricing: $1.25/M input with 90% cache discount undercuts many frontier competitors
Weaknesses
- ARC-AGI-2 gap: 17.6% shows a real ceiling on novel abstract reasoning that GPT-5.2 addressed one month later
- SWE-bench marginally behind Claude: 76.3% vs Claude 4.5's 77.2% - not a large gap, but it's second place on the benchmark that matters most for the model's stated use case
- No fine-tuning or predicted outputs: API lacks these capabilities available on some competitors
- Knowledge cutoff: September 30, 2024 means real-world knowledge gaps for anything released in late 2024 or 2025
- Now superseded: GPT-5.2, GPT-5.4, and GPT-5.5 have all passed GPT-5.1 on benchmark rankings; for new projects there's limited reason to choose it over successors
Related Coverage
- Model overview: GPT-5.2 - the December 2025 successor with 3x improvement on ARC-AGI-2
- Later releases: GPT-5.3 Codex | GPT-5.4 | GPT-5.5
- Leaderboards: Coding Benchmarks | Reasoning Benchmarks | Agentic AI Benchmarks
- Compared here: Claude Opus 4.6 | Claude Sonnet 4.6 | Gemini 3.1 Pro
FAQ
Is GPT-5.1 still available?
GPT-5.1 was removed from ChatGPT on March 11, 2026. The versioned API snapshot (gpt-5.1-2025-11-13) may still be accessible, but most users should migrate to GPT-5.2 or later.
What is the difference between GPT-5.1 Instant and GPT-5.1 Thinking?
Instant uses adaptive reasoning that decides on the fly whether to think before responding - faster for simple queries. Thinking is the reasoning variant: reasoning_effort can be set to low, medium, or high, or to none to skip reasoning for latency-critical calls.
How does GPT-5.1 handle multi-agent tasks?
Research shows GPT-5.1 reduces verification by 60-85% when working with reliable agent partners. It cooperates efficiently but needs explicit prompting to maintain independent verification. The model supports MCP server configs as native inputs.
What makes GPT-5.1-Codex-Max different from GPT-5.1 Thinking?
Codex-Max is purpose-built for agentic coding: it includes apply_patch for reliable code edits, shell execution for running tests, and native MCP server support. These are absent in the base GPT-5.1 Thinking variant.
Why did ARC-AGI-2 improve so much from GPT-5.1 to GPT-5.2?
GPT-5.1 scored 17.6% on ARC-AGI-2; GPT-5.2 jumped to 52.9% one month later. OpenAI hasn't detailed the architectural change, but the magnitude of the jump suggests more than incremental tuning.
Sources:
- GPT-5.1 Model Card - OpenAI Developer Docs
- Announcing GPT-5.1 in the API - OpenAI Community
- GPT-5.1 Release: Key Upgrades, Performance, and API Guide - Apidog
- GPT-5.1 Performance Review: 76.3% SWE-bench (November 2025) - Claude5.com
- GPT-5.1 vs GPT-4.1 Pricing, Benchmarks, Context - Requesty
- GPT-5.1 - Wikipedia
- GPT-5.2 Crosses 90% ARC-AGI: Infrastructure Implications - Introl Blog
- Trust Between AI Agents: arXiv 2606.14923
- GPT-5.1 on MindStudio
✓ Last verified June 16, 2026
