Microsoft Phi-4

Microsoft's 14B dense transformer that consistently beats models 5x its size on MATH and GPQA, available under the MIT license for unrestricted commercial use.

Overview

Microsoft released Phi-4 on December 12, 2024, and it landed with a simple thesis: you do not need hundreds of billions of parameters to do serious reasoning. At 14 billion parameters, Phi-4 is a dense decoder-only transformer that scores 80.4% on MATH (competition-level problems) and 56.1% on GPQA Diamond (graduate-level science) - both scores exceeding those of GPT-4o, the model that served as its teacher during training. That is not a typo. The student outscored the teacher on the hardest benchmarks.

The key to Phi-4's performance is not some architectural breakthrough. It is data engineering. Microsoft built the training pipeline around high-quality synthetic datasets, carefully curated organic data from academic sources, and a multi-stage post-training process. The result is a model that punches well above its weight class on STEM and reasoning tasks while remaining small enough to run on a single consumer GPU with quantization. Under the MIT license, there are zero restrictions on commercial deployment, fine-tuning, or redistribution.

That said, Phi-4 has clear blind spots. It scores just 3.0 on SimpleQA (factual recall), 63.0 on IFEval (instruction following), and 75.5 on DROP (reading comprehension with discrete reasoning). This is not a general-purpose chatbot replacement. It is a specialist - excellent at structured reasoning, mediocre at the kind of open-ended knowledge retrieval that users expect from an assistant. If your workload is math-heavy, code-heavy, or STEM-heavy, Phi-4 delivers outsized value. If you need broad factual coverage and strong instruction following, look elsewhere.

TL;DR

  • 14B dense transformer under MIT license - fully open for commercial use, fine-tuning, and redistribution
  • Beats GPT-4o on MATH (80.4% vs 74.6%) and GPQA Diamond (56.1% vs 50.6%) despite being a fraction of the size
  • 16K context window (expandable), 82.6% HumanEval for code generation
  • Weak on factual recall (SimpleQA: 3.0) and instruction following (IFEval: 63.0) - this is a STEM specialist, not a generalist

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Microsoft |
| Model Family | Phi |
| Architecture | Dense decoder-only transformer (full attention) |
| Parameters | 14 billion (40 layers) |
| Context Window | 16,384 tokens (expandable to 64K with Phi-4-reasoning variants) |
| Max Output | Context-dependent |
| Input Modalities | Text only |
| Tokenizer | tiktoken (vocabulary: 100,352) |
| Training Data | Synthetic datasets, curated organic data, academic sources |
| Release Date | December 12, 2024 |
| License | MIT |
| Availability | Hugging Face, Azure AI Foundry, Ollama |

Benchmark Performance

Phi-4's benchmark profile is unusual - it dominates reasoning and math while lagging on knowledge and instruction tasks. The comparison against Qwen 2.5 14B (its closest open-weight competitor) and the much larger Llama 3.3 70B tells the story:

| Benchmark | Phi-4 (14B) | Qwen 2.5 14B | Llama 3.3 70B |
| --- | --- | --- | --- |
| MMLU (general knowledge) | 84.8 | 79.9 | 86.3 |
| GPQA Diamond (PhD-level science) | 56.1 | 42.9 | 49.1 |
| MATH (competition math) | 80.4 | 75.6 | 66.3 |
| HumanEval (code generation) | 82.6 | 72.1 | 78.9 |
| MGSM (multilingual math) | 80.6 | 79.6 | 89.1 |
| HumanEval+ (code, stricter) | 82.8 | - | - |
| SimpleQA (factual recall) | 3.0 | - | - |
| DROP (reading comprehension) | 75.5 | - | - |
| IFEval (instruction following) | 63.0 | - | - |

The GPQA result is the standout: 56.1% versus 42.9% for Qwen 2.5 14B and 49.1% for Llama 3.3 70B. That is a 14B model beating a 70B model by 7 points on graduate-level science questions. On MATH, Phi-4 leads both competitors by wide margins (80.4% vs 75.6% and 66.3%). Microsoft's synthetic data strategy is clearly paying off on structured reasoning tasks.

Key Capabilities

Phi-4's core strength is STEM reasoning at an efficiency that makes local deployment practical. At 14B parameters, the model fits comfortably on GPUs with 16GB+ VRAM in 4-bit quantization, and on Apple Silicon Macs with 32GB+ unified memory at higher precision. For research labs, startups, and individual developers who need strong math and science capabilities without API costs, this is one of the best options available. The MIT license means you can embed it in commercial products, fine-tune it on proprietary data, and redistribute derivatives without any licensing overhead.
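The VRAM claims above follow from simple arithmetic on the parameter count. The sketch below is a back-of-envelope estimate of the memory needed just to hold 14B weights at common precision levels; it deliberately ignores the KV cache, activations, and framework overhead, which is why a 16GB card (rather than an 8GB one) is the practical floor for 4-bit inference:

```python
# Rough weight-memory estimate for a dense 14B-parameter model.
# Ignores KV cache, activations, and runtime overhead, which can add
# several GB depending on context length and batch size.

PARAMS = 14e9  # Phi-4 parameter count

def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Memory needed to hold the weights alone, in gigabytes."""
    return params * bits_per_param / 8 / 1e9

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(PARAMS, bits):.1f} GB")
```

At 4-bit the weights alone come to about 7 GB, which is why the model fits in 16GB of VRAM with room for the KV cache, and why full FP16 (about 28 GB) pushes it onto 32GB-class hardware such as higher-memory Apple Silicon Macs.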

The code generation numbers (82.6% HumanEval, 82.8% HumanEval+) put Phi-4 ahead of every open-weight model in its size class and competitive with Llama 3.3 at 5x the parameter count. Combined with the MATH scores, this makes Phi-4 a strong candidate for automated reasoning pipelines, educational tools, and scientific computing assistants. The coding benchmarks leaderboard provides broader context on where it sits relative to the field.

Where Phi-4 falls short is anywhere that requires broad world knowledge or nuanced instruction following. The SimpleQA score of 3.0 means it cannot reliably answer factual questions. The IFEval score of 63.0 means it will sometimes ignore formatting constraints or output structure requirements. These are not minor gaps - they make Phi-4 unsuitable as a general-purpose assistant. For the open-source vs proprietary tradeoff, Phi-4 represents the extreme case: spectacular on its specialties, limited outside them.

Pricing and Availability

Phi-4 is free to use under the MIT license. No API costs, no usage limits, no licensing negotiations.

| Option | Cost | Context | Notes |
| --- | --- | --- | --- |
| Phi-4 (self-hosted) | Free | 16K | MIT license, Hugging Face / Ollama |
| Azure AI Foundry | Pay-per-token | 16K | Managed hosting on Azure |
| GPT-4o mini (comparison) | $0.15 / $0.60 per M tokens | 128K | OpenAI hosted API |
| Qwen 2.5 14B (comparison) | Free | 128K | Apache 2.0 license |
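To put "no API costs" in concrete terms, here is a hedged back-of-envelope comparison against the GPT-4o mini rates listed above ($0.15 input / $0.60 output per million tokens). The token volumes in the example are illustrative assumptions, not measurements of any real workload:

```python
# Hosted-API cost at per-million-token rates, versus $0 in licensing
# fees for self-hosted Phi-4 (hardware and electricity not included).

def hosted_cost_usd(input_tokens: int, output_tokens: int,
                    in_rate: float = 0.15, out_rate: float = 0.60) -> float:
    """API cost in dollars given per-million-token input/output rates."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Hypothetical workload: 50M input + 10M output tokens per month.
print(hosted_cost_usd(50_000_000, 10_000_000))  # -> 13.5
```

At these volumes the hosted bill is modest, so self-hosting pays off mainly at much larger scale, or when the MIT license's fine-tuning and redistribution rights are the point.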

For self-hosting, GGUF quantizations are available on Hugging Face (microsoft/phi-4-gguf) for use with llama.cpp, Ollama, and other local inference frameworks. The model runs at reasonable speeds on consumer hardware - expect 15-30 tokens/second on a modern GPU with 4-bit quantization.
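Once a local runtime like Ollama is serving the model, querying it is a single HTTP call. The sketch below is a minimal example against Ollama's REST API; it assumes Ollama is running on its default port (11434) and that the model tag is `phi4` - check `ollama list` for the exact tag on your install:

```python
# Minimal sketch of querying a locally served Phi-4 via Ollama's HTTP API.
# Assumes an Ollama server on localhost:11434 with a model tagged "phi4".
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "phi4") -> dict:
    """Request body for a single non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str) -> str:
    """Send the prompt to the local server and return the completion text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the server up, `generate("Prove that the sum of two odd integers is even.")` returns the model's completion as a plain string; setting `"stream": True` instead yields incremental JSON chunks for interactive use.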

Strengths

  • Exceptional STEM and math reasoning for its size - 80.4% MATH, 56.1% GPQA Diamond
  • MIT license with zero restrictions on commercial use or modification
  • 14B parameters fits on consumer GPUs and Apple Silicon with quantization
  • Outperforms Qwen 2.5 14B on 9 of 12 benchmarks in Microsoft's evaluation
  • Strong code generation (82.6% HumanEval) competitive with 70B-class models
  • Active ecosystem with GGUF, Ollama, and Azure deployment options

Weaknesses

  • 16K context window is small by 2025-2026 standards - many competitors offer 128K+
  • Extremely weak factual recall (SimpleQA: 3.0) - not suitable as a knowledge assistant
  • Poor instruction following (IFEval: 63.0) limits reliability for structured output tasks
  • Text-only input - no vision or multimodal capabilities
  • Multilingual performance (MGSM: 80.6) trails larger models significantly
  • Synthetic data dependency raises questions about benchmark contamination, though Microsoft addresses this in the technical report
