GPT-4.1
OpenAI's coding-optimized API model with a 1M token context window, 54.6% SWE-bench Verified score, and $2/$8 per million token pricing.

Overview
OpenAI released GPT-4.1 on April 14, 2025, as an API-only model built around two specific developer complaints about GPT-4o: it didn't follow complex multi-step instructions reliably, and it struggled with coding tasks that required diff-format outputs. GPT-4.1 addresses both directly. On OpenAI's internal instruction-following eval, it scores 49.1% compared to GPT-4o's 29.2% - a gap wide enough to change behavior in production agentic systems, not just improve scores on a leaderboard.
The model comes in three variants: GPT-4.1 (flagship), GPT-4.1 Mini, and GPT-4.1 Nano. All three support a 1 million token context window - 8x the 128K limit in GPT-4o. At launch, the main model was priced 26% cheaper than GPT-4o for median queries, which made the upgrade straightforward for most API customers. OpenAI positioned GPT-4.1 as a non-reasoning model, meaning it doesn't run extended chain-of-thought like OpenAI o3 or o4-mini. The tradeoff is lower latency and cost at the expense of deep multi-step reasoning.
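To make the tiering concrete, here is a minimal sketch of sending the same request to each variant through the Chat Completions API, assuming the `openai` Python SDK (v1.x) and an `OPENAI_API_KEY` in the environment; the prompt is purely illustrative:

```python
# Minimal sketch: same request against all three GPT-4.1 variants.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

for model in ("gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize RFC 2119 in one sentence."}],
    )
    print(f"{model}: {resp.choices[0].message.content}")
```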
The model's clearest positioning relative to competitors is its coding and format-adherence story. SWE-bench Verified jumps from 33.2% on GPT-4o to 54.6% - an absolute improvement of 21.4 points. Still, Gemini 2.5 Pro reached 63.8% and Claude 3.7 Sonnet hit 62.3% on the same benchmark around the same time. GPT-4.1 isn't the top coding model, but its combination of pricing, context size, and tool-use reliability made it a competitive default for mid-complexity agentic pipelines.
TL;DR
- Best for agentic coding tasks, long-document processing, and instruction-following pipelines where GPT-4o was unreliable
- 1M token context, $2.00/$8.00 per million tokens input/output, knowledge cutoff June 2024
- 54.6% SWE-bench Verified - a big jump from GPT-4o's 33.2%, but trails Gemini 2.5 Pro (63.8%) and Claude 3.7 Sonnet (62.3%)
Key Specifications
| Specification | Details |
|---|---|
| Provider | OpenAI |
| Model Family | GPT-4.1 |
| Architecture | Transformer-based; internals not disclosed |
| Parameters | Not disclosed |
| Context Window | 1,047,576 tokens |
| Max Output | 32,768 tokens |
| Input Modalities | Text, Images |
| Output Modality | Text (with structured output support) |
| Function Calling | Supported (parallel function calls) |
| Knowledge Cutoff | June 2024 |
| Input Price | $2.00/M tokens |
| Output Price | $8.00/M tokens |
| Cached Input | $0.50/M tokens |
| Release Date | April 14, 2025 |
| License | Proprietary (API access) |
| Availability | OpenAI API, ChatGPT (Plus/Pro/Team), Azure OpenAI Service |
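The spec table lists structured output support; below is a hedged sketch of what that looks like via the Chat Completions API's JSON Schema response format (the schema and prompt are invented for illustration):

```python
# Sketch: constrained JSON output via Structured Outputs.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Extract: 'Ada Lovelace, born 1815.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "birth_year": {"type": "integer"},
                },
                "required": ["name", "birth_year"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)  # e.g. {"name": "Ada Lovelace", "birth_year": 1815}
```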
Benchmark Performance
GPT-4.1's benchmark profile shows a model that's clearly improved over GPT-4o but sits below the top tier in most categories. Its strongest results are in coding and instruction following. The comparison uses models that were directly competitive at the time of GPT-4.1's release:
| Benchmark | GPT-4.1 | GPT-4o | Claude 3.7 Sonnet | Gemini 2.5 Pro |
|---|---|---|---|---|
| SWE-bench Verified (coding) | 54.6% | 33.2% | 62.3% | 63.8% |
| Aider polyglot diff accuracy | 52.9% | 18.3% | - | - |
| MMLU (general knowledge) | 90.2% | 87.2% | - | - |
| GPQA Diamond (PhD science) | 66.3% | 53.6% | - | - |
| IFEval (instruction following) | 87.4% | 81.0% | - | - |
| MultiChallenge (instruction) | 38.3% | 27.8% | - | - |
| Video-MME (video understanding) | 72.0% | 65.3% | - | - |
| MMMU (visual reasoning) | 74.8% | 68.7% | - | - |
| Graphwalks (long context) | 61.7% | 41.7% | - | - |
Two patterns are worth noting. First, the coding gains are real but not best-in-class. GPT-4.1's 54.6% on SWE-bench Verified is 9 points below Gemini 2.5 Pro's 63.8%, and both Google and Anthropic maintained a lead at the benchmark's top end. Second, the instruction-following numbers are the most distinctive gains: the jump from 27.8% to 38.3% on MultiChallenge (a benchmark measuring sustained compliance across complex, multi-constraint prompts) represents meaningful improvement for agent builders. See the coding benchmarks leaderboard for a broader view of how GPT-4.1 ranks across the field.
One context limitation to flag: accuracy on very long inputs degrades. OpenAI's own long-context recall testing (OpenAI-MRCR) showed accuracy declining from about 84% at 8,000 tokens to roughly 50% at the full 1 million tokens. The 1M window is real, but the model's ability to use all of it reliably depends on the specific task.
GPT-4.1 was trained specifically on diff-format outputs and agentic coding workflows, reducing extraneous edits from 9% to 2% vs GPT-4o.
Key Capabilities
Coding and Agentic Workflows
Coding is GPT-4.1's primary design goal. OpenAI trained the model to follow diff formats reliably, which matters for agents that only need to output changed lines rather than rewriting entire files. The extraneous edit rate dropped from 9% with GPT-4o to 2% with GPT-4.1 - a measurable improvement in cost and latency for tools like Windsurf, Cursor, and Copilot that rely on diff-based code updates. The Aider polyglot diff accuracy of 52.9% (vs GPT-4o's 18.3%) is the starkest single-number illustration of this.
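A sketch of what a diff-first prompt can look like; the system prompt below is an illustrative assumption rather than OpenAI's recommended format (the official GPT-4.1 prompting guide documents its own diff conventions):

```python
# Sketch: requesting a unified diff instead of a full-file rewrite.
from openai import OpenAI

client = OpenAI()
source = open("app.py").read()  # hypothetical file to patch

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system",
         "content": "Return ONLY a unified diff (---/+++/@@ hunks) against the "
                    "provided file. Never restate unchanged lines outside hunks."},
        {"role": "user",
         "content": f"Rename the function `load` to `load_config`:\n\n{source}"},
    ],
)
print(resp.choices[0].message.content)  # pipe to `git apply` or `patch -p0`
```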
For agentic pipelines, GPT-4.1's improved instruction adherence changes what's actually buildable. Agents running multi-step workflows need models that follow complex tool-use patterns without deviating. The 20-point gain on OpenAI's internal instruction-following eval translates directly to fewer agent failures in practice. Windsurf reported that GPT-4.1 scored 60% higher than GPT-4o on its internal coding benchmark and was roughly 30% more efficient in tool calling.
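As a minimal sketch of the tool-use loop where that adherence pays off (the `get_ticket` tool and its stubbed backend are hypothetical):

```python
# Sketch: agent loop with function calling on the Chat Completions API.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket",  # hypothetical support-desk tool
        "description": "Fetch a support ticket by ID.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Summarize ticket T-1234."}]
resp = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
msg = resp.choices[0].message

while msg.tool_calls:  # loop until the model stops requesting tools
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = {"ticket_id": args["ticket_id"], "status": "resolved"}  # stub backend
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(result)})
    resp = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
    msg = resp.choices[0].message

print(msg.content)
```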
Long Context Processing
The 1M token context window opens up use cases that were previously impractical: processing entire codebases in a single prompt, analyzing year-long document archives, or running customer service resolution across full ticket histories. On the "needle in a haystack" test - locating specific information in a massive document - GPT-4.1 reached 100% accuracy across all context lengths in OpenAI's internal testing. The Graphwalks score of 61.7% (vs 41.7% for GPT-4o) shows improved long-range reasoning in structured contexts.
For reference, the ChatGPT interface caps context at 32K tokens even when using GPT-4.1 - the full 1M window is only accessible via the API.
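Given the degradation noted earlier, it's worth budget-checking token counts before sending near-window-sized inputs. A sketch, assuming GPT-4.1 shares GPT-4o's o200k_base tokenizer:

```python
# Sketch: verify a large input fits the context window before calling the API.
import tiktoken
from openai import OpenAI

CONTEXT_LIMIT = 1_047_576  # GPT-4.1 context window
MAX_OUTPUT = 32_768        # max output tokens

enc = tiktoken.get_encoding("o200k_base")  # assumed tokenizer for GPT-4.1
document = open("codebase_dump.txt").read()  # hypothetical concatenated repo

n_tokens = len(enc.encode(document))
if n_tokens + MAX_OUTPUT > CONTEXT_LIMIT:
    raise ValueError(f"{n_tokens} input tokens leave no room for output")

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user",
               "content": f"List every TODO comment in this code:\n\n{document}"}],
)
print(resp.choices[0].message.content)
```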
Vision and Multimodal Input
GPT-4.1 accepts text and image inputs. Video-MME improved from 65.3% (GPT-4o) to 72.0%, and MMMU visual reasoning went from 68.7% to 74.8%. These are solid gradual gains rather than a step-change in vision capability. The model supports combining images with function calling, which enables multimodal agent use cases: interpreting screenshots to drive UI automation, extracting structured data from documents with diagrams, or analyzing images to trigger downstream actions.
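A minimal sketch of the image-plus-text input pattern (the URL and prompt are placeholders):

```python
# Sketch: multimodal request mixing text and an image URL.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What error does this screenshot show?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/screenshot.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```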
Pricing and Availability
GPT-4.1 launched at $2.00/$8.00 per million input/output tokens, with prompt caching at $0.50/M for cached input. The mini and nano variants offer lower-cost options for lighter workloads.
| Model | Input/M | Output/M | Cached Input/M | Context |
|---|---|---|---|---|
| GPT-4.1 | $2.00 | $8.00 | $0.50 | 1M |
| GPT-4.1 Mini | $0.40 | $1.60 | $0.10 | 1M |
| GPT-4.1 Nano | $0.10 | $0.40 | $0.025 | 1M |
| GPT-4o mini | $0.15 | $0.60 | - | 128K |
The flagship GPT-4.1 launched 26% cheaper than GPT-4o for median queries, and Mini cuts costs by 83% relative to GPT-4o. Nano, at $0.10/$0.40, was OpenAI's cheapest and fastest model at launch.
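For back-of-the-envelope budgeting against this table, a small helper (prices are the launch rates quoted above, in dollars per million tokens):

```python
# Sketch: estimate request cost from the launch pricing table.
PRICES = {  # model: (input, output, cached input), $/M tokens
    "gpt-4.1":      (2.00, 8.00, 0.50),
    "gpt-4.1-mini": (0.40, 1.60, 0.10),
    "gpt-4.1-nano": (0.10, 0.40, 0.025),
}

def request_cost(model, input_tokens, output_tokens, cached_tokens=0):
    inp, out, cached = PRICES[model]
    fresh = input_tokens - cached_tokens
    return (fresh * inp + cached_tokens * cached + output_tokens * out) / 1e6

# 200K-token prompt, half served from the prompt cache, 2K-token answer:
print(f"${request_cost('gpt-4.1', 200_000, 2_000, cached_tokens=100_000):.4f}")  # $0.2660
```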
The model is available through the OpenAI API, Azure OpenAI Service, and in ChatGPT for Plus, Pro, and Team subscribers (accessible via the model picker). Free users get access to GPT-4.1 Mini as the default fallback once their GPT-4o usage is exhausted. Fine-tuning is available for the main model and Mini at premium training rates ($25/M tokens for the full model).
OpenAI confirmed plans to retire GPT-4.5 by mid-July 2025, citing GPT-4.1's superior performance at lower cost. GPT-4o retirement followed later in the year. The instruction following leaderboard tracks how GPT-4.1 compares as the baseline has shifted with each new model generation.
Strengths
- 21-point SWE-bench improvement over GPT-4o - the largest single-generation coding gain for the GPT-4 series
- 1M token context at $2.00 input - 8x the context of GPT-4o at a lower price
- Reliable diff-format output for agentic coding, with extraneous edit rate down from 9% to 2%
- Strong instruction following - IFEval at 87.4%, MultiChallenge at 38.3% vs GPT-4o's 27.8%
- GPT-4.1 Mini and Nano provide cost-effective options with the same 1M context window
- Prompt caching at $0.50/M for repeated context, reducing costs in multi-turn applications
- Broad ecosystem support via OpenAI API and Azure OpenAI Service
Weaknesses
- SWE-bench Verified of 54.6% trails Claude 3.7 Sonnet (62.3%) and Gemini 2.5 Pro (63.8%)
- Accuracy degrades at very long inputs - reported drop from 84% at 8K tokens to ~50% at 1M tokens on recall tasks
- No image generation or voice output - text-only output
- Knowledge cutoff of June 2024 was already ten months old at release and has only grown more dated since
- GPQA Diamond at 66.3% is competitive but not leading on hard science reasoning
- Parameters and architecture not disclosed; no self-hosting option
- User feedback notes a tendency to hallucinate sources, especially in citation-heavy outputs
- ChatGPT UI limits context to 32K even when using GPT-4.1; full 1M window requires direct API access
Related Coverage
- OpenAI o3 - OpenAI's full reasoning model, for tasks where extended chain-of-thought matters more than latency
- OpenAI o4-mini - Lightweight reasoning option when you need more than GPT-4.1 but less cost than o3
- GPT-4o mini - The predecessor budget model, useful context on the pricing and capability baseline
- Coding Benchmarks Leaderboard - Full rankings showing how GPT-4.1 fits in the current coding landscape
- Instruction Following Leaderboard - Where GPT-4.1's key gains show up in head-to-head comparisons
- How to Choose an LLM in 2026 - Decision guide for when GPT-4.1 makes sense vs. reasoning models or open-source options
Sources
- Introducing GPT-4.1 in the API - OpenAI
- GPT-4.1 Model Documentation - OpenAI
- GPT-4.1: Full Developer Guide with Benchmarks - Helicone
- GPT-4.1: Features, Access, GPT-4o Comparison - DataCamp
- GPT-4.1 Comparison with Claude 3.7 Sonnet and Gemini 2.5 Pro - Bind AI
- GPT-4.1 API Pricing - OpenRouter
- OpenAI API Pricing
- GPT-4.1 explained - TechTarget
- HAL GAIA Leaderboard - GPT-4.1 April 2025
✓ Last verified May 15, 2026
