GPT-4.1
OpenAI's coding-optimized API model with a 1M token context window, 54.6% SWE-bench Verified score, and $2/$8 per million token pricing.

Overview
OpenAI released GPT-4.1 on April 14, 2025, as an API-only model built around two specific developer complaints about GPT-4o: it didn't follow complex multi-step instructions reliably, and it struggled with coding tasks that required diff-format outputs. GPT-4.1 addresses both directly. On OpenAI's internal instruction-following eval, it scores 49.1% compared to GPT-4o's 29.2% - a gap wide enough to change behavior in production agentic systems, not just improve scores on a leaderboard.
The model comes in three variants: GPT-4.1 (flagship), GPT-4.1 Mini, and GPT-4.1 Nano. All three support a 1 million token context window - 8x the 128K limit in GPT-4o. At launch, the main model was priced 26% cheaper than GPT-4o for median queries, which made the upgrade straightforward for most API customers. OpenAI positioned GPT-4.1 as a non-reasoning model, meaning it doesn't run extended chain-of-thought like OpenAI o3 or o4-mini. The tradeoff is lower latency and cost at the expense of deep multi-step reasoning.
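To make the tiering concrete, here is a minimal sketch of sending the same request to each variant through the Chat Completions API, assuming the `openai` Python SDK (v1.x) and an `OPENAI_API_KEY` in the environment; the prompt is purely illustrative:

```python
# Minimal sketch: same request against all three GPT-4.1 variants.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

for model in ("gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize RFC 2119 in one sentence."}],
    )
    print(f"{model}: {resp.choices[0].message.content}")
```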
The model's clearest positioning relative to competitors is its coding and format-adherence story. SWE-bench Verified jumps from 33.2% on GPT-4o to 54.6% - an absolute improvement of 21.4 points. Still, Gemini 2.5 Pro reached 63.8% and Claude 3.7 Sonnet hit 62.3% on the same benchmark around the same time. GPT-4.1 isn't the top coding model, but its combination of pricing, context size, and tool-use reliability made it a competitive default for mid-complexity agentic pipelines.
TL;DR
- Best for agentic coding tasks, long-document processing, and instruction-following pipelines where GPT-4o was unreliable
- 1M token context, $2.00/$8.00 per million tokens input/output, knowledge cutoff June 2024
- 54.6% SWE-bench Verified - a big jump from GPT-4o's 33.2%, but trails Gemini 2.5 Pro (63.8%) and Claude 3.7 Sonnet (62.3%)
Key Specifications
| Specification | Details |
|---|---|
| Provider | OpenAI |
| Model Family | GPT-4.1 |
| Architecture | Transformer-based; internals not disclosed |
| Parameters | Not disclosed |
| Context Window | 1,047,576 tokens |
| Max Output | 32,768 tokens |
| Input Modalities | Text, Images |
| Output Modality | Text (with structured output support) |
| Function Calling | Supported (parallel function calls) |
| Knowledge Cutoff | June 2024 |
| Input Price | $2.00/M tokens |
| Output Price | $8.00/M tokens |
| Cached Input | $0.50/M tokens |
| Release Date | April 14, 2025 |
| License | Proprietary (API access) |
| Availability | OpenAI API, ChatGPT (Plus/Pro/Team), Azure OpenAI Service |
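The spec table lists structured output support; below is a hedged sketch of what that looks like via the Chat Completions API's JSON Schema response format (the schema and prompt are invented for illustration):

```python
# Sketch: constrained JSON output via Structured Outputs.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Extract: 'Ada Lovelace, born 1815.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "birth_year": {"type": "integer"},
                },
                "required": ["name", "birth_year"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)  # e.g. {"name": "Ada Lovelace", "birth_year": 1815}
```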
Benchmark Performance
GPT-4.1's benchmark profile shows a model that's clearly improved over GPT-4o but sits below the top tier in most categories. Its strongest results are in coding and instruction following. The comparison uses models that were directly competitive at the time of GPT-4.1's release:
| Benchmark | GPT-4.1 | GPT-4o | Claude 3.7 Sonnet | Gemini 2.5 Pro |
|---|---|---|---|---|
| SWE-bench Verified (coding) | 54.6% | 33.2% | 62.3% | 63.8% |
| Aider polyglot diff accuracy | 52.9% | 18.3% | - | - |
| MMLU (general knowledge) | 90.2% | 87.2% | - | - |
| GPQA Diamond (PhD science) | 66.3% | 53.6% | - | - |
| IFEval (instruction following) | 87.4% | 81.0% | - | - |
| MultiChallenge (instruction) | 38.3% | 27.8% | - | - |
| Video-MME (video understanding) | 72.0% | 65.3% | - | - |
| MMMU (visual reasoning) | 74.8% | 68.7% | - | - |
| Graphwalks (long context) | 61.7% | 41.7% | - | - |
Two patterns are worth noting. First, the coding gains are real but not best-in-class. GPT-4.1's 54.6% on SWE-bench Verified is 9 points below Gemini 2.5 Pro's 63.8%, and both Google and Anthropic maintained a lead at the benchmark's top end. Second, the instruction-following numbers are the most distinctive gains: the jump from 27.8% to 38.3% on MultiChallenge (a benchmark measuring sustained compliance across complex, multi-constraint prompts) represents meaningful improvement for agent builders. See the coding benchmarks leaderboard for a broader view of how GPT-4.1 ranks across the field.
One context limitation to flag: accuracy on very long inputs degrades. OpenAI's own long-context recall testing (OpenAI-MRCR) showed accuracy declining from about 84% at 8,000 tokens to roughly 50% at the full 1 million tokens. The 1M window is real, but the model's ability to use all of it reliably depends on the specific task.
GPT-4.1 was trained specifically on diff-format outputs and agentic coding workflows, reducing extraneous edits from 9% to 2% vs GPT-4o.
Key Capabilities
Coding and Agentic Workflows
Coding is GPT-4.1's primary design goal. OpenAI trained the model to follow diff formats reliably, which matters for agents that only need to output changed lines rather than rewriting entire files. The extraneous edit rate dropped from 9% with GPT-4o to 2% with GPT-4.1 - a measurable improvement in cost and latency for tools like Windsurf, Cursor, and Copilot that rely on diff-based code updates. The Aider polyglot diff accuracy of 52.9% (vs GPT-4o's 18.3%) is the starkest single-number illustration of this.
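A sketch of what a diff-first prompt can look like; the system prompt below is an illustrative assumption rather than OpenAI's recommended format (the official GPT-4.1 prompting guide documents its own diff conventions):

```python
# Sketch: requesting a unified diff instead of a full-file rewrite.
from openai import OpenAI

client = OpenAI()
source = open("app.py").read()  # hypothetical file to patch

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system",
         "content": "Return ONLY a unified diff (---/+++/@@ hunks) against the "
                    "provided file. Never restate unchanged lines outside hunks."},
        {"role": "user",
         "content": f"Rename the function `load` to `load_config`:\n\n{source}"},
    ],
)
print(resp.choices[0].message.content)  # pipe to `git apply` or `patch -p0`
```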
For agentic pipelines, GPT-4.1's improved instruction adherence changes what's actually buildable. Agents running multi-step workflows need models that follow complex tool-use patterns without deviating. The 20-point gain on OpenAI's internal instruction-following eval translates directly to fewer agent failures in practice. Windsurf reported that GPT-4.1 scored 60% higher than GPT-4o on its internal coding benchmark and was roughly 30% more efficient in tool calling.
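As a minimal sketch of the tool-use loop where that adherence pays off (the `get_ticket` tool and its stubbed backend are hypothetical):

```python
# Sketch: agent loop with function calling on the Chat Completions API.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket",  # hypothetical support-desk tool
        "description": "Fetch a support ticket by ID.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Summarize ticket T-1234."}]
resp = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
msg = resp.choices[0].message

while msg.tool_calls:  # loop until the model stops requesting tools
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = {"ticket_id": args["ticket_id"], "status": "resolved"}  # stub backend
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(result)})
    resp = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
    msg = resp.choices[0].message

print(msg.content)
```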
Long Context Processing
The 1M token context window opens up use cases that were previously impractical: processing entire codebases in a single prompt, analyzing year-long document archives, or running customer service resolution across full ticket histories. On the "needle in a haystack" test - locating specific information in a massive document - GPT-4.1 reached 100% accuracy across all context lengths in OpenAI's internal testing. The Graphwalks score of 61.7% (vs 41.7% for GPT-4o) shows improved long-range reasoning in structured contexts.
For reference, the ChatGPT interface caps context at 32K tokens even when using GPT-4.1 - the full 1M window is only accessible via the API.
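Given the degradation noted earlier, it's worth budget-checking token counts before sending near-window-sized inputs. A sketch, assuming GPT-4.1 shares GPT-4o's o200k_base tokenizer:

```python
# Sketch: verify a large input fits the context window before calling the API.
import tiktoken
from openai import OpenAI

CONTEXT_LIMIT = 1_047_576  # GPT-4.1 context window
MAX_OUTPUT = 32_768        # max output tokens

enc = tiktoken.get_encoding("o200k_base")  # assumed tokenizer for GPT-4.1
document = open("codebase_dump.txt").read()  # hypothetical concatenated repo

n_tokens = len(enc.encode(document))
if n_tokens + MAX_OUTPUT > CONTEXT_LIMIT:
    raise ValueError(f"{n_tokens} input tokens leave no room for output")

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user",
               "content": f"List every TODO comment in this code:\n\n{document}"}],
)
print(resp.choices[0].message.content)
```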
Vision and Multimodal Input
GPT-4.1 accepts text and image inputs. Video-MME improved from 65.3% (GPT-4o) to 72.0%, and MMMU visual reasoning went from 68.7% to 74.8%. These are solid gradual gains rather than a step-change in vision capability. The model supports combining images with function calling, which enables multimodal agent use cases: interpreting screenshots to drive UI automation, extracting structured data from documents with diagrams, or analyzing images to trigger downstream actions.
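A minimal sketch of the image-plus-text input pattern (the URL and prompt are placeholders):

```python
# Sketch: multimodal request mixing text and an image URL.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What error does this screenshot show?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/screenshot.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```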
Pricing and Availability
GPT-4.1 launched at $2.00/$8.00 per million input/output tokens, with prompt caching at $0.50/M for cached input. The mini and nano variants offer lower-cost options for lighter workloads.
| Model | Input/M | Output/M | Cached Input/M | Context |
|---|---|---|---|---|
| GPT-4.1 | $2.00 | $8.00 | $0.50 | 1M |
| GPT-4.1 Mini | $0.40 | $1.60 | $0.10 | 1M |
| GPT-4.1 Nano | $0.10 | $0.40 | $0.025 | 1M |
| GPT-4o mini | $0.15 | $0.60 | - | 128K |
The flagship GPT-4.1 launched 26% cheaper than GPT-4o for median queries, and Mini cuts costs by 83% relative to GPT-4o. Nano, at $0.10/$0.40, was OpenAI's cheapest and fastest model at launch.
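For back-of-the-envelope budgeting against this table, a small helper (prices are the launch rates quoted above, in dollars per million tokens):

```python
# Sketch: estimate request cost from the launch pricing table.
PRICES = {  # model: (input, output, cached input), $/M tokens
    "gpt-4.1":      (2.00, 8.00, 0.50),
    "gpt-4.1-mini": (0.40, 1.60, 0.10),
    "gpt-4.1-nano": (0.10, 0.40, 0.025),
}

def request_cost(model, input_tokens, output_tokens, cached_tokens=0):
    inp, out, cached = PRICES[model]
    fresh = input_tokens - cached_tokens
    return (fresh * inp + cached_tokens * cached + output_tokens * out) / 1e6

# 200K-token prompt, half served from the prompt cache, 2K-token answer:
print(f"${request_cost('gpt-4.1', 200_000, 2_000, cached_tokens=100_000):.4f}")  # $0.2660
```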
The model is available through the OpenAI API, Azure OpenAI Service, and in ChatGPT for Plus, Pro, and Team subscribers (accessible via the model picker). Free users get access to GPT-4.1 Mini as the default fallback once their GPT-4o usage is exhausted. Fine-tuning is available for the main model and Mini at premium training rates ($25/M tokens for the full model).
OpenAI confirmed plans to retire GPT-4.5 by mid-July 2025, citing GPT-4.1's superior performance at lower cost. GPT-4o retirement followed later in the year. The instruction following leaderboard tracks how GPT-4.1 compares as the baseline has shifted with each new model generation.
Strengths
- 21-point SWE-bench improvement over GPT-4o - the largest single-generation coding gain for the GPT-4 series
- 1M token context at $2.00 input - 8x the context of GPT-4o at a lower price
- Reliable diff-format output for agentic coding, with extraneous edit rate down from 9% to 2%
- Strong instruction following - IFEval at 87.4%, MultiChallenge at 38.3% vs GPT-4o's 27.8%
- GPT-4.1 Mini and Nano provide cost-effective options with the same 1M context window
- Prompt caching at $0.50/M for repeated context, reducing costs in multi-turn applications
- Broad ecosystem support via OpenAI API and Azure OpenAI Service
Weaknesses
- SWE-bench Verified of 54.6% trails Claude 3.7 Sonnet (62.3%) and Gemini 2.5 Pro (63.8%)
- Accuracy degrades at very long inputs - reported drop from 84% at 8K tokens to ~50% at 1M tokens on recall tasks
- No image generation or voice output - text-only output
- Knowledge cutoff of June 2024 was already ten months old at release and has only grown more dated since
- GPQA Diamond at 66.3% is competitive but not leading on hard science reasoning
- Parameters and architecture not disclosed; no self-hosting option
- User feedback notes a tendency to hallucinate sources, especially in citation-heavy outputs
- ChatGPT UI limits context to 32K even when using GPT-4.1; full 1M window requires direct API access
Related Coverage
- OpenAI o3 - OpenAI's full reasoning model, for tasks where extended chain-of-thought matters more than latency
- OpenAI o4-mini - Lightweight reasoning option when you need more than GPT-4.1 but less cost than o3
- GPT-4o mini - The predecessor budget model, useful context on the pricing and capability baseline
- Coding Benchmarks Leaderboard - Full rankings showing how GPT-4.1 fits in the current coding landscape
- Instruction Following Leaderboard - Where GPT-4.1's key gains show up in head-to-head comparisons
- How to Choose an LLM in 2026 - Decision guide for when GPT-4.1 makes sense vs. reasoning models or open-source options
Sources
- Introducing GPT-4.1 in the API - OpenAI
- GPT-4.1 Model Documentation - OpenAI
- GPT-4.1: Full Developer Guide with Benchmarks - Helicone
- GPT-4.1: Features, Access, GPT-4o Comparison - DataCamp
- GPT-4.1 Comparison with Claude 3.7 Sonnet and Gemini 2.5 Pro - Bind AI
- GPT-4.1 API Pricing - OpenRouter
- OpenAI API Pricing
- GPT-4.1 explained - TechTarget
- HAL GAIA Leaderboard - GPT-4.1 April 2025
✓ Last verified May 15, 2026
