Microsoft Phi-4
Microsoft's 14B dense transformer that consistently beats models 5x its size on MATH and GPQA, available under the MIT license for unrestricted commercial use.

Overview
Microsoft released Phi-4 on December 12, 2024, and it landed with a simple thesis: you do not need hundreds of billions of parameters to do serious reasoning. At 14 billion parameters, Phi-4 is a dense decoder-only transformer that scores 80.4% on MATH (competition-level problems) and 56.1% on GPQA Diamond (graduate-level science) - both scores higher than those of GPT-4o, the model that served as its teacher during training. That is not a typo. The student outscored the teacher on the hardest benchmarks.
The key to Phi-4's performance is not some architectural breakthrough. It is data engineering. Microsoft built the training pipeline around high-quality synthetic datasets, carefully curated organic data from academic sources, and a multi-stage post-training process. The result is a model that punches well above its weight class on STEM and reasoning tasks while remaining small enough to run on a single consumer GPU with quantization. Under the MIT license, there are zero restrictions on commercial deployment, fine-tuning, or redistribution.
That said, Phi-4 has clear blind spots. It scores just 3.0 on SimpleQA (factual recall), 63.0 on IFEval (instruction following), and 75.5 on DROP (reading comprehension with discrete reasoning). This is not a general-purpose chatbot replacement. It is a specialist - excellent at structured reasoning, mediocre at the kind of open-ended knowledge retrieval that users expect from an assistant. If your workload is math-heavy, code-heavy, or STEM-heavy, Phi-4 delivers outsized value. If you need broad factual coverage and strong instruction following, look elsewhere.
TL;DR
- 14B dense transformer under MIT license - fully open for commercial use, fine-tuning, and redistribution
- Beats GPT-4o on MATH (80.4% vs 74.6%) and GPQA Diamond (56.1% vs 50.6%) despite being a fraction of the size
- 16K context window (expandable), 82.6% HumanEval for code generation
- Weak on factual recall (SimpleQA: 3.0) and instruction following (IFEval: 63.0) - this is a STEM specialist, not a generalist
Key Specifications
| Specification | Details |
|---|---|
| Provider | Microsoft |
| Model Family | Phi |
| Architecture | Dense decoder-only transformer (full attention) |
| Parameters | 14 billion (40 layers) |
| Context Window | 16,384 tokens (expandable to 64K with Phi-4-reasoning variants) |
| Max Output | Context-dependent |
| Input Modalities | Text only |
| Tokenizer | tiktoken (vocabulary: 100,352) |
| Training Data | Synthetic datasets, curated organic data, academic sources |
| Release Date | December 12, 2024 |
| License | MIT |
| Availability | Hugging Face, Azure AI Foundry, Ollama |
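The 16K window is the main sizing constraint when building prompts around these specs. A minimal pre-flight check is sketched below, using the common rough heuristic of about 4 characters per token; the real count depends on Phi-4's tiktoken tokenizer, so treat this as an estimate, not an exact measurement:

```python
# Rough check that a prompt plus a generation budget fits Phi-4's
# 16,384-token context window. The 4-chars-per-token ratio is a common
# heuristic, not the model's actual tiktoken count.
CONTEXT_WINDOW = 16_384

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate token count from character count."""
    return int(len(text) / chars_per_token) + 1

def fits_in_context(prompt: str, max_new_tokens: int = 1024) -> bool:
    """True if the prompt plus the generation budget fits in the window."""
    return estimate_tokens(prompt) + max_new_tokens <= CONTEXT_WINDOW

print(fits_in_context("Prove that sqrt(2) is irrational."))
```

For anything near the limit, swap the heuristic for a real tokenizer count before sending the prompt.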
Benchmark Performance
Phi-4's benchmark profile is unusual - it dominates reasoning and math while lagging on knowledge and instruction tasks. The comparison against Qwen 2.5 14B (its closest open-weight competitor) and the much larger Llama 3.3 70B tells the story:
| Benchmark | Phi-4 (14B) | Qwen 2.5 14B | Llama 3.3 70B |
|---|---|---|---|
| MMLU (general knowledge) | 84.8 | 79.9 | 86.3 |
| GPQA Diamond (PhD-level science) | 56.1 | 42.9 | 49.1 |
| MATH (competition math) | 80.4 | 75.6 | 66.3 |
| HumanEval (code generation) | 82.6 | 72.1 | 78.9 |
| MGSM (multilingual math) | 80.6 | 79.6 | 89.1 |
| HumanEval+ (code, stricter) | 82.8 | - | - |
| SimpleQA (factual recall) | 3.0 | - | - |
| DROP (reading comprehension) | 75.5 | - | - |
| IFEval (instruction following) | 63.0 | - | - |
The GPQA result is the standout: 56.1% versus 42.9% for Qwen 2.5 14B and 49.1% for Llama 3.3 70B. That is a 14B model beating a 70B model by 7 points on graduate-level science questions. On MATH, Phi-4 leads both competitors by wide margins (80.4% vs 75.6% for Qwen and 66.3% for Llama). Microsoft's synthetic data strategy is clearly paying off on structured reasoning tasks.
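The margins quoted above come straight from the comparison table; a quick sanity check of the point gaps:

```python
# GPQA Diamond and MATH scores, copied from the benchmark table above.
scores = {
    "GPQA Diamond": {"Phi-4": 56.1, "Qwen 2.5 14B": 42.9, "Llama 3.3 70B": 49.1},
    "MATH":         {"Phi-4": 80.4, "Qwen 2.5 14B": 75.6, "Llama 3.3 70B": 66.3},
}

# Print Phi-4's lead over each rival on each benchmark.
for bench, by_model in scores.items():
    phi = by_model["Phi-4"]
    for rival, score in by_model.items():
        if rival != "Phi-4":
            print(f"{bench}: Phi-4 leads {rival} by {phi - score:.1f} points")
```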
Key Capabilities
Phi-4's core strength is STEM reasoning at an efficiency that makes local deployment practical. At 14B parameters, the model fits comfortably on GPUs with 16GB+ VRAM in 4-bit quantization, and on Apple Silicon Macs with 32GB+ unified memory at higher precision. For research labs, startups, and individual developers who need strong math and science capabilities without API costs, this is one of the best options available. The MIT license means you can embed it in commercial products, fine-tune it on proprietary data, and redistribute derivatives without any licensing overhead.
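The VRAM claim is easy to sanity-check with back-of-the-envelope arithmetic. A sketch, counting weights only - real usage adds KV cache, activations, and framework overhead, so these are lower bounds:

```python
# Back-of-the-envelope VRAM estimate for Phi-4's 14B weights at common
# precisions. KV cache and runtime overhead come on top of these figures.
PARAMS = 14e9  # 14 billion parameters

def weight_gb(bits_per_param: float) -> float:
    """Gigabytes needed just to hold the weights at a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{weight_gb(bits):.1f} GB for weights alone")
```

At 4-bit the weights come to roughly 7 GB, which is why a 16GB GPU is comfortable and why fp16 (around 28 GB) pushes the model onto higher-memory hardware like a 32GB Apple Silicon Mac.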
The code generation numbers (82.6% HumanEval, 82.8% HumanEval+) put Phi-4 ahead of every open-weight model in its size class and competitive with Llama 3.3 at 5x the parameter count. Combined with the MATH scores, this makes Phi-4 a strong candidate for automated reasoning pipelines, educational tools, and scientific computing assistants. The coding benchmarks leaderboard provides broader context on where it sits relative to the field.
Where Phi-4 falls short is anywhere that requires broad world knowledge or nuanced instruction following. The SimpleQA score of 3.0 means it cannot reliably answer factual questions. The IFEval score of 63.0 means it will sometimes ignore formatting constraints or output structure requirements. These are not minor gaps - they make Phi-4 unsuitable as a general-purpose assistant. For the open-source vs proprietary tradeoff, Phi-4 represents the extreme case: spectacular on its specialties, limited outside them.
Pricing and Availability
Phi-4 is free to use under the MIT license. No API costs, no usage limits, no licensing negotiations.
| Option | Cost | Context | Notes |
|---|---|---|---|
| Phi-4 (self-hosted) | Free | 16K | MIT license, Hugging Face / Ollama |
| Azure AI Foundry | Pay-per-token | 16K | Managed hosting on Azure |
| GPT-4o mini (comparison) | $0.15/$0.60 per M tokens | 128K | OpenAI hosted API |
| Qwen 2.5 14B (comparison) | Free | 128K | Apache 2.0 license |
For self-hosting, GGUF quantizations are available on Hugging Face (microsoft/phi-4-gguf) for use with llama.cpp, Ollama, and other local inference frameworks. The model runs at reasonable speeds on consumer hardware - expect 15-30 tokens/second on a modern GPU with 4-bit quantization.
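The quoted 15-30 tokens/second translates directly into wall-clock latency for a given response length - worth working out before committing to local inference for interactive use. An illustrative calculation (actual throughput depends on GPU, quantization level, and prompt length):

```python
# Wall-clock time to generate a response at the 15-30 tok/s range quoted
# for 4-bit local inference on a modern consumer GPU. Illustrative only.
def generation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Seconds to produce n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_sec

for tps in (15, 30):
    print(f"500-token answer at {tps} tok/s: ~{generation_seconds(500, tps):.0f} s")
```

A 500-token answer lands somewhere between roughly 17 and 33 seconds at that range - fine for batch pipelines, borderline for chat-style latency expectations.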
Strengths
- Exceptional STEM and math reasoning for its size - 80.4% MATH, 56.1% GPQA Diamond
- MIT license with zero restrictions on commercial use or modification
- 14B parameters fits on consumer GPUs and Apple Silicon with quantization
- Outperforms Qwen 2.5 14B on 9 of 12 benchmarks in Microsoft's evaluation
- Strong code generation (82.6% HumanEval) competitive with 70B-class models
- Active ecosystem with GGUF, Ollama, and Azure deployment options
Weaknesses
- 16K context window is small by 2025-2026 standards - many competitors offer 128K+
- Extremely weak factual recall (SimpleQA: 3.0) - not suitable as a knowledge assistant
- Poor instruction following (IFEval: 63.0) limits reliability for structured output tasks
- Text-only input - no vision or multimodal capabilities
- Multilingual performance (MGSM: 80.6) trails larger models significantly
- Synthetic data dependency raises questions about benchmark contamination, though Microsoft addresses this in the technical report
Related Coverage
- Open Source vs Proprietary AI - Where open-weight models like Phi-4 fit in the broader landscape
- Open Source LLM Leaderboard - Current rankings for open-weight models
- Coding Benchmarks Leaderboard - HumanEval and SWE-bench comparisons
