Microsoft Phi-4
Microsoft's 14B dense transformer that consistently beats models 5x its size on MATH and GPQA, available under the MIT license for unrestricted commercial use.

Overview
Microsoft released Phi-4 on December 12, 2024, and it landed with a simple thesis: you do not need hundreds of billions of parameters to do serious reasoning. At 14 billion parameters, Phi-4 is a dense decoder-only transformer that scores 80.4% on MATH (competition-level problems) and 56.1% on GPQA Diamond (graduate-level science) - both scores higher than those of GPT-4o, the model that served as its teacher during training. That is not a typo. The student outscored the teacher on the hardest benchmarks.
The key to Phi-4's performance is not some architectural breakthrough. It is data engineering. Microsoft built the training pipeline around high-quality synthetic datasets, carefully curated organic data from academic sources, and a multi-stage post-training process. The result is a model that punches well above its weight class on STEM and reasoning tasks while remaining small enough to run on a single consumer GPU with quantization. Under the MIT license, there are zero restrictions on commercial deployment, fine-tuning, or redistribution.
That said, Phi-4 has clear blind spots. It scores just 3.0 on SimpleQA (factual recall), 63.0 on IFEval (instruction following), and 75.5 on DROP (reading comprehension with discrete reasoning). This is not a general-purpose chatbot replacement. It is a specialist - excellent at structured reasoning, mediocre at the kind of open-ended knowledge retrieval that users expect from an assistant. If your workload is math-heavy, code-heavy, or STEM-heavy, Phi-4 delivers outsized value. If you need broad factual coverage and strong instruction following, look elsewhere.
TL;DR
- 14B dense transformer under MIT license - fully open for commercial use, fine-tuning, and redistribution
- Beats GPT-4o on MATH (80.4% vs 74.6%) and GPQA Diamond (56.1% vs 50.6%) despite being a fraction of the size
- 16K context window (expandable), 82.6% HumanEval for code generation
- Weak on factual recall (SimpleQA: 3.0) and instruction following (IFEval: 63.0) - this is a STEM specialist, not a generalist
Key Specifications
| Specification | Details |
|---|---|
| Provider | Microsoft |
| Model Family | Phi |
| Architecture | Dense decoder-only transformer (full attention) |
| Parameters | 14 billion (40 layers) |
| Context Window | 16,384 tokens (expandable to 64K with Phi-4-reasoning variants) |
| Max Output | Context-dependent |
| Input Modalities | Text only |
| Tokenizer | tiktoken (vocabulary: 100,352) |
| Training Data | Synthetic datasets, curated organic data, academic sources |
| Release Date | December 12, 2024 |
| License | MIT |
| Availability | Hugging Face, Azure AI Foundry, Ollama |
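The 16K window is the main sizing constraint when building prompts around these specs. A minimal pre-flight check is sketched below, using the common rough heuristic of about 4 characters per token; the real count depends on Phi-4's tiktoken tokenizer, so treat this as an estimate, not an exact measurement:

```python
# Rough check that a prompt plus a generation budget fits Phi-4's
# 16,384-token context window. The 4-chars-per-token ratio is a common
# heuristic, not the model's actual tiktoken count.
CONTEXT_WINDOW = 16_384

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate token count from character count."""
    return int(len(text) / chars_per_token) + 1

def fits_in_context(prompt: str, max_new_tokens: int = 1024) -> bool:
    """True if the prompt plus the generation budget fits in the window."""
    return estimate_tokens(prompt) + max_new_tokens <= CONTEXT_WINDOW

print(fits_in_context("Prove that sqrt(2) is irrational."))
```

For anything near the limit, swap the heuristic for a real tokenizer count before sending the prompt.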
Benchmark Performance
Phi-4's benchmark profile is unusual - it dominates reasoning and math while lagging on knowledge and instruction tasks. The comparison against Qwen 2.5 14B (its closest open-weight competitor) and the much larger Llama 3.3 70B tells the story:
| Benchmark | Phi-4 (14B) | Qwen 2.5 14B | Llama 3.3 70B |
|---|---|---|---|
| MMLU (general knowledge) | 84.8 | 79.9 | 86.3 |
| GPQA Diamond (PhD-level science) | 56.1 | 42.9 | 49.1 |
| MATH (competition math) | 80.4 | 75.6 | 66.3 |
| HumanEval (code generation) | 82.6 | 72.1 | 78.9 |
| MGSM (multilingual math) | 80.6 | 79.6 | 89.1 |
| HumanEval+ (code, stricter) | 82.8 | - | - |
| SimpleQA (factual recall) | 3.0 | - | - |
| DROP (reading comprehension) | 75.5 | - | - |
| IFEval (instruction following) | 63.0 | - | - |
The GPQA result is the standout: 56.1% versus 42.9% for Qwen 2.5 14B and 49.1% for Llama 3.3 70B. That is a 14B model beating a 70B model by 7 points on graduate-level science questions. On MATH, Phi-4 leads both competitors by wide margins (80.4% vs 75.6% for Qwen and 66.3% for Llama). Microsoft's synthetic data strategy is clearly paying off on structured reasoning tasks.
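The margins quoted above come straight from the comparison table; a quick sanity check of the point gaps:

```python
# GPQA Diamond and MATH scores, copied from the benchmark table above.
scores = {
    "GPQA Diamond": {"Phi-4": 56.1, "Qwen 2.5 14B": 42.9, "Llama 3.3 70B": 49.1},
    "MATH":         {"Phi-4": 80.4, "Qwen 2.5 14B": 75.6, "Llama 3.3 70B": 66.3},
}

# Print Phi-4's lead over each rival on each benchmark.
for bench, by_model in scores.items():
    phi = by_model["Phi-4"]
    for rival, score in by_model.items():
        if rival != "Phi-4":
            print(f"{bench}: Phi-4 leads {rival} by {phi - score:.1f} points")
```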
Key Capabilities
Phi-4's core strength is STEM reasoning at an efficiency that makes local deployment practical. At 14B parameters, the model fits comfortably on GPUs with 16GB+ VRAM in 4-bit quantization, and on Apple Silicon Macs with 32GB+ unified memory at higher precision. For research labs, startups, and individual developers who need strong math and science capabilities without API costs, this is one of the best options available. The MIT license means you can embed it in commercial products, fine-tune it on proprietary data, and redistribute derivatives without any licensing overhead.
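The VRAM claim is easy to sanity-check with back-of-the-envelope arithmetic. A sketch, counting weights only - real usage adds KV cache, activations, and framework overhead, so these are lower bounds:

```python
# Back-of-the-envelope VRAM estimate for Phi-4's 14B weights at common
# precisions. KV cache and runtime overhead come on top of these figures.
PARAMS = 14e9  # 14 billion parameters

def weight_gb(bits_per_param: float) -> float:
    """Gigabytes needed just to hold the weights at a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{weight_gb(bits):.1f} GB for weights alone")
```

At 4-bit the weights come to roughly 7 GB, which is why a 16GB GPU is comfortable and why fp16 (around 28 GB) pushes the model onto higher-memory hardware like a 32GB Apple Silicon Mac.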
The code generation numbers (82.6% HumanEval, 82.8% HumanEval+) put Phi-4 ahead of every open-weight model in its size class and competitive with Llama 3.3 at 5x the parameter count. Combined with the MATH scores, this makes Phi-4 a strong candidate for automated reasoning pipelines, educational tools, and scientific computing assistants. The coding benchmarks leaderboard provides broader context on where it sits relative to the field.
Where Phi-4 falls short is anywhere that requires broad world knowledge or nuanced instruction following. The SimpleQA score of 3.0 means it cannot reliably answer factual questions. The IFEval score of 63.0 means it will sometimes ignore formatting constraints or output structure requirements. These are not minor gaps - they make Phi-4 unsuitable as a general-purpose assistant. For the open-source vs proprietary tradeoff, Phi-4 represents the extreme case: spectacular on its specialties, limited outside them.
Pricing and Availability
Phi-4 is free to use under the MIT license. No API costs, no usage limits, no licensing negotiations.
| Option | Cost | Context | Notes |
|---|---|---|---|
| Phi-4 (self-hosted) | Free | 16K | MIT license, Hugging Face / Ollama |
| Azure AI Foundry | Pay-per-token | 16K | Managed hosting on Azure |
| GPT-4o mini (comparison) | $0.15/$0.60 per M tokens | 128K | OpenAI hosted API |
| Qwen 2.5 14B (comparison) | Free | 128K | Apache 2.0 license |
For self-hosting, GGUF quantizations are available on Hugging Face (microsoft/phi-4-gguf) for use with llama.cpp, Ollama, and other local inference frameworks. The model runs at reasonable speeds on consumer hardware - expect 15-30 tokens/second on a modern GPU with 4-bit quantization.
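The quoted 15-30 tokens/second translates directly into wall-clock latency for a given response length - worth working out before committing to local inference for interactive use. An illustrative calculation (actual throughput depends on GPU, quantization level, and prompt length):

```python
# Wall-clock time to generate a response at the 15-30 tok/s range quoted
# for 4-bit local inference on a modern consumer GPU. Illustrative only.
def generation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Seconds to produce n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_sec

for tps in (15, 30):
    print(f"500-token answer at {tps} tok/s: ~{generation_seconds(500, tps):.0f} s")
```

A 500-token answer lands somewhere between roughly 17 and 33 seconds at that range - fine for batch pipelines, borderline for chat-style latency expectations.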
Strengths
- Exceptional STEM and math reasoning for its size - 80.4% MATH, 56.1% GPQA Diamond
- MIT license with zero restrictions on commercial use or modification
- 14B parameters fits on consumer GPUs and Apple Silicon with quantization
- Outperforms Qwen 2.5 14B on 9 of 12 benchmarks in Microsoft's evaluation
- Strong code generation (82.6% HumanEval) competitive with 70B-class models
- Active ecosystem with GGUF, Ollama, and Azure deployment options
Weaknesses
- 16K context window is small by 2025-2026 standards - many competitors offer 128K+
- Extremely weak factual recall (SimpleQA: 3.0) - not suitable as a knowledge assistant
- Poor instruction following (IFEval: 63.0) limits reliability for structured output tasks
- Text-only input - no vision or multimodal capabilities
- Multilingual performance (MGSM: 80.6) trails larger models significantly
- Synthetic data dependency raises questions about benchmark contamination, though Microsoft addresses this in the technical report
Related Coverage
- Open Source vs Proprietary AI - Where open-weight models like Phi-4 fit in the broader landscape
- Open Source LLM Leaderboard - Current rankings for open-weight models
- Coding Benchmarks Leaderboard - HumanEval and SWE-bench comparisons
