GPT-OSS 20B

OpenAI's open-weight 21B MoE reasoning model with 131K context, Apache 2.0 license, and o3-mini-level benchmark performance running in 16 GB of memory.


OpenAI released gpt-oss-20b on August 5, 2025, alongside its larger sibling gpt-oss-120b. It's the first open-weight model from OpenAI since GPT-2 and a deliberate answer to the traction that DeepSeek R1, Qwen3 30B A3B, and Kimi K2 had been getting on open benchmarks. The Apache 2.0 license means commercial self-hosting is unrestricted - no additional terms, no usage caps in the license itself.

The model uses a Mixture-of-Experts transformer with 20.9 billion total parameters but only 3.6 billion active per token. That ratio keeps memory demand low enough that MXFP4 quantization brings the full checkpoint to 12.8 GiB, and the model runs on any 16 GB consumer GPU. That's an unusual combination: frontier-trained weights, in a form factor that fits on a gaming card.
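
The arithmetic behind that form factor is worth a quick check. Here's a rough sketch in Python, assuming roughly 90% of the parameters sit in the MXFP4-quantized expert weights at ~4.25 bits per weight (4-bit values plus block scales) and the rest stays in bf16; the split is an assumption, not a figure from the model card:

```python
# Rough checkpoint-size arithmetic for gpt-oss-20b.
TOTAL_PARAMS = 20.9e9
MOE_FRACTION = 0.90      # assumed split, not from the model card
MXFP4_BITS = 4.25        # 4-bit values plus per-block scale overhead
BF16_BITS = 16

moe_gib = TOTAL_PARAMS * MOE_FRACTION * MXFP4_BITS / 8 / 2**30
dense_gib = TOTAL_PARAMS * (1 - MOE_FRACTION) * BF16_BITS / 8 / 2**30

print(f"MoE weights:   {moe_gib:.1f} GiB")              # ~9.3 GiB
print(f"Dense weights: {dense_gib:.1f} GiB")            # ~3.9 GiB
print(f"Total:         {moe_gib + dense_gib:.1f} GiB")  # ~13 GiB
# Close to the published 12.8 GiB, leaving headroom on a 16 GB
# card for KV cache and activations.
```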

Training followed the same recipe OpenAI used for o3 and o4-mini: pretraining on a large STEM-heavy corpus with data through May 2024, then post-training with chain-of-thought reinforcement learning to produce the variable-effort reasoning behavior OpenAI calls "reasoning levels."

TL;DR

  • MoE architecture: 20.9B total parameters, 3.6B active - runs in 16 GB memory with MXFP4 quantization
  • 131K context window, three reasoning levels (low/medium/high), full tool-use stack built in
  • Matches OpenAI o3-mini on most standard benchmarks; beats it on AIME 2024 and 2025 math

Key Specifications

| Specification | Details |
|---|---|
| Provider | OpenAI |
| Model Family | GPT-OSS |
| Parameters | 20.9B total, 3.6B active (MoE) |
| Architecture | 24-layer MoE transformer, 32 experts, Top-4 routing, GQA with 8 KV heads |
| Context Window | 131,072 tokens (YaRN extension) |
| Max Output | 8,192 tokens |
| Input Price (API) | $0.03/M tokens |
| Output Price (API) | $0.10/M tokens |
| Release Date | August 5, 2025 |
| Knowledge Cutoff | May 31, 2024 |
| License | Apache 2.0 |
| Quantization | MXFP4 (MoE weights) |
| Checkpoint Size | 12.8 GiB |

[Image: GPT-OSS 20B on Hugging Face, showing download stats and model card] The model page on Hugging Face, where gpt-oss-20b recorded over 7 million downloads in its first month. Source: huggingface.co

Benchmark Performance

All scores below are at the "high" reasoning level unless noted. Figures come from the official model card (arXiv 2508.10925) and the Artificial Analysis benchmark suite.

| Benchmark | GPT-OSS 20B | o3-mini (high) | DeepSeek R1 |
|---|---|---|---|
| MMLU | 85.3% | ~86% | 90.8% |
| GPQA Diamond (no tools) | 71.5% | 79.7% | 71.5% |
| GPQA Diamond (with tools) | 74.2% | – | – |
| AIME 2024 (with tools) | 96.0% | 90.0% | 87.5% |
| AIME 2025 (with tools) | 98.7% | ~91% | 87.5% |
| SWE-Bench Verified | 60.7% | 49.3% | ~49% |
| Codeforces Elo (with tools) | 2516 | ~1820 | ~2029 |

The math numbers are striking for a 20B model. GPT-OSS 20B scores 98.7% on AIME 2025 - edging out its larger 120B sibling (97.9%) and coming close to the proprietary o3 result. That suggests the RL post-training transfers well to competition math at this scale, and the smaller active parameter count may actually help by forcing tighter reasoning paths.

On SWE-Bench, 60.7% at "high" reasoning puts it well past o3-mini and into territory previously occupied only by much larger closed models. That result aligns with what you'd expect from a model trained with tool-use RL: it's been explicitly optimized to navigate real code repositories, not just create plausible patches in isolation. Compare that to the SWE-bench coding agent leaderboard where most smaller models cluster under 40%.

GPQA Diamond at 74.2% (with tools) is solid for a 20B model but trails larger open competitors like DeepSeek R1 on knowledge-heavy science tasks. The MMLU gap is consistent with the same pattern: knowledge retrieval scales with parameters, and 3.6B active weights can't match a 70B+ dense model's breadth.

Key Capabilities

The model ships with three runtime reasoning modes: low, medium, and high. At low effort it behaves like a fast chat model; at high effort it produces extended chains of thought. The trade-off is end-to-end latency - higher effort means more reasoning tokens per answer, even though raw throughput is strong: Artificial Analysis measured 244 tokens per second at high reasoning on shared infrastructure, ranking it fourth fastest among models in its class. For local inference, throughput depends on GPU and quantization choice.
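
In practice the reasoning level is selected through the system prompt rather than a sampling parameter. A minimal sketch against a local OpenAI-compatible server follows; the vLLM-style port and model name are assumptions, and hosted APIs may expose an explicit reasoning-effort setting instead:

```python
# Sketch: pick a reasoning level via the system prompt on a local
# OpenAI-compatible server; adjust endpoint and model name for
# your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        # gpt-oss reads its effort level from the system prompt.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "How many primes are below 100?"},
    ],
)
print(resp.choices[0].message.content)
```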

Tool use is native, not a wrapper. The model supports function calling with defined schemas, web search, Python execution, and structured output, all integrated through OpenAI's Harmony response format. That format is required for correct multi-turn agentic operation - vLLM, Ollama, LM Studio, SGLang, and the Transformers library all have Harmony support. For anyone building agents locally, this is a meaningful differentiator over models that require external scaffolding for the same capabilities.
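
To illustrate, here is a hedged sketch of schema-based function calling through one of those OpenAI-compatible stacks; the get_weather tool, port, and server choice are hypothetical, not from OpenAI's documentation:

```python
# Hypothetical function-calling sketch against a local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather in Chicago?"}],
    tools=tools,
)
# The serving layer converts Harmony tool-call tokens into the
# standard tool_calls field, so agent code stays portable.
print(resp.choices[0].message.tool_calls)
```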

Fine-tuning is fully supported, from parameter-efficient methods like LoRA up to full-parameter training. The Apache 2.0 license allows derivative models without restrictions, and OpenAI published a reference implementation with the weights. Several fine-tuned variants targeting domain-specific coding and instruction following appeared within weeks of release.
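
As a rough illustration, a parameter-efficient run might look like the sketch below, using Hugging Face transformers plus peft; the r/alpha values and the "all-linear" target choice are assumptions to keep the example short, not a recipe from OpenAI's reference implementation:

```python
# Minimal LoRA sketch with transformers + peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",  # assumed; you may narrow this
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# From here, train with your usual Trainer/TRL loop on chat data
# rendered through the model's Harmony chat template.
```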

Pricing and Availability

The weights are free on Hugging Face under Apache 2.0. OpenAI also hosts the model through their API at $0.03 per million input tokens and $0.10 per million output tokens, which undercuts o4-mini notably. OpenRouter offers a free tier with rate limits.

Third-party inference is available across Azure, AWS Bedrock, Google Cloud, Fireworks AI, Together AI, Groq, Baseten, Databricks, Cloudflare Workers AI, Oracle OCI, and Vercel. For self-hosting, Ollama distributes it as gpt-oss:20b - pulling and running on a 16 GB GPU takes about ten minutes from scratch. The MXFP4 checkpoint means even consumer cards like an RTX 4080 can handle it at full context.
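
Once pulled, Ollama exposes an OpenAI-compatible endpoint on its default port, so a smoke test is a few lines of Python (the model tag is from the Ollama distribution above; the rest is standard client boilerplate):

```python
# Smoke test after `ollama pull gpt-oss:20b`; port 11434 is
# Ollama's default OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)
```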

At $0.03 input / $0.10 output through the OpenAI API, it's cheaper than every o-series model and cheaper than GPT-4o. For cost-sensitive applications that don't need 200K context or image understanding, the pricing profile is hard to argue with.
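
For a sense of scale, here is an illustrative monthly spend at those prices; the request volume and token counts are invented for the example:

```python
# Illustrative monthly cost at the listed API prices.
INPUT_PER_M = 0.03    # $ per million input tokens
OUTPUT_PER_M = 0.10   # $ per million output tokens

requests_per_day = 50_000
in_tok, out_tok = 800, 300          # assumed tokens per request

m_in = requests_per_day * 30 * in_tok / 1e6    # input Mtok/month
m_out = requests_per_day * 30 * out_tok / 1e6  # output Mtok/month
print(f"~${m_in * INPUT_PER_M + m_out * OUTPUT_PER_M:,.2f}/month")
# ≈ $81/month at these volumes.
```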

[Image: GPT-OSS 20B GitHub repository with deployment guides and model code] The official GitHub repository at openai/gpt-oss, covering inference setup for vLLM, Ollama, LM Studio, and SGLang. Source: github.com

Strengths and Weaknesses

Strengths

  • AIME 2024/2025 performance beats o3-mini and equals frontier models on competition math
  • Runs in 16 GB GPU memory with MXFP4; no special hardware required
  • Native tool use (function calling, code execution, web search) without external scaffolding
  • Apache 2.0 license with no additional commercial restrictions
  • Three reasoning levels let you trade cost for quality at inference time
  • SWE-Bench Verified at 60.7% puts it in the top tier among self-hostable models
  • 244 tokens/second output throughput - fast for a reasoning model

Weaknesses

  • 3.6B active parameters limits knowledge breadth; GPQA and MMLU trail larger dense models
  • Knowledge cutoff is May 2024 - over a year behind current frontier models
  • Text-only: no image, audio, or video input
  • Context window (131K) is smaller than those of GPT-5, Gemini 3 Pro, and Claude Opus 4.7, which all offer 200K+
  • Max 8,192 output tokens per turn constrains long-document generation
  • Harmony response format is required for full functionality and isn't yet universal across all deployment stacks


About the Author

James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.