Microsoft Phi-4 Reasoning: Small Model, Big Math
Microsoft's Phi-4 reasoning family delivers near-70B-class math performance in a 14B open-weight package, but the overthinking problem is real and the use case is narrower than the benchmarks suggest.

When Microsoft released Phi-4-reasoning in April 2025, the headline numbers were striking enough to warrant scrutiny: a 14-billion-parameter model that scores 75.3% on AIME 2024, beating DeepSeek-R1-Distill-70B (69.3%) by six points and landing within striking distance of the full 671-billion-parameter DeepSeek-R1. The stronger sibling, Phi-4-reasoning-plus, pushed that number to 81.3%. Both ship under MIT license, run locally via Ollama on a consumer GPU, and cost nothing to download.
TL;DR
- 8.1/10 - The best open-weight reasoning model under 15B parameters, but only if math and science are your primary use case
- Phi-4-reasoning-plus matches or beats 70B-class models on AIME and OmniMath at a fraction of the compute cost
- The overthinking problem is severe: the model runs long chain-of-thought traces on trivial queries, with no equivalent of Qwen 3's /nothink mode
- Best for researchers, STEM developers, and teams running constrained inference hardware; skip it if you need multilingual support or general-purpose chat
The question I kept asking across several weeks of testing wasn't whether the benchmarks hold up - broadly, they do - but whether the model is actually useful for anything beyond STEM competitions and math olympiad prep. The answer is more complicated than Microsoft's announcement blog lets on.
What the Phi-4 Reasoning Family Actually Is
The family has five models, each targeting a different point on the size-vs-capability curve.
The core text models are Phi-4-reasoning and Phi-4-reasoning-plus, both 14B parameters with a 32K token context window. They share the same base: the original Phi-4 instruct model, which already punched above its weight on GPQA and MATH benchmarks. The reasoning variant was fine-tuned on roughly 8.3 billion unique tokens of synthetic chain-of-thought traces created by OpenAI's o3-mini, covering STEM, coding, and logic problems. The "plus" variant then added a short reinforcement learning phase using GRPO (Group Relative Policy Optimization), rewarding correct answers by +1 and penalizing wrong ones by -0.5.
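The group-relative part of GRPO is worth making concrete: instead of a learned value baseline, rewards are normalized within a group of sampled completions for the same prompt. A minimal sketch, using the +1/-0.5 reward scheme described above (function name and group size are illustrative, not from the report):

```python
def grpo_advantages(correct_flags, r_correct=1.0, r_wrong=-0.5):
    """Group-relative advantages for one prompt's sampled completions.

    Each completion gets a scalar reward (+1 correct, -0.5 wrong);
    GRPO normalizes rewards within the group rather than against a
    learned value baseline.
    """
    rewards = [r_correct if ok else r_wrong for ok in correct_flags]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# A group of 4 sampled answers to one problem, 2 of them correct:
advs = grpo_advantages([True, False, True, False])
```

Correct completions end up with positive advantage and wrong ones negative, so the policy gradient pushes toward the behaviors that separated them within the group.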
At the smaller end, Phi-4-mini-reasoning (3.8B) is explicitly a math-only model. Microsoft trained it via distillation from DeepSeek-R1, generating 150 billion tokens of synthetic traces across a million math problems and keeping only verified-correct solutions. It's not general-purpose and the model card is upfront about this.
The newest addition, Phi-4-reasoning-vision-15B, dropped March 4, 2026, and adds a SigLIP-2 vision encoder to the Phi-4-reasoning backbone. I covered its announcement when it launched. This review focuses mainly on the text models, with a section on the vision variant below.
Phi-4-mini-flash-reasoning (3.8B) ships a completely different architecture - a hybrid Mamba-plus-attention design called SambaY that reaches up to 10x higher throughput than the standard mini model. It's interesting for production inference but not something you'd run for general tasks.
- December 2024 - Original Phi-4 (14B) released under MIT license. Establishes the base model now used across the reasoning family.
- April 30, 2025 - Phi-4-reasoning and Phi-4-reasoning-plus released simultaneously. Phi-4-mini-reasoning follows shortly after.
- July 9, 2025 - Phi-4-mini-flash-reasoning released with SambaY hybrid architecture, targeting production throughput.
- March 4, 2026 - Phi-4-reasoning-vision-15B released on HuggingFace and Azure AI Foundry.
The Phi-4-reasoning model page on HuggingFace, where all five family members are available under MIT license.
Source: huggingface.co
Benchmark Reality Check
The AIME numbers are the ones Microsoft leads with, so they deserve close reading. Phi-4-reasoning-plus scores 81.3% on AIME 2024 and somewhere between 78% and 82.5% on AIME 2025, depending on sampling methodology. The variance is real - the technical report explicitly notes that AIME runs can differ by five to ten percentage points across random seeds. Treating any single number as conclusive is a mistake.
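Given that seed-to-seed spread, the honest way to cite AIME results is as a mean with a standard deviation over several runs rather than a single number. A sketch with hypothetical per-seed accuracies (the values below are invented to match the reported 78-82.5% band, not actual run logs):

```python
import statistics

# Hypothetical per-seed AIME 2025 accuracies; the point is to report
# spread across seeds, not to trust any single run.
seed_scores = [0.78, 0.825, 0.80, 0.79, 0.81]

mean = statistics.mean(seed_scores)
std = statistics.stdev(seed_scores)  # sample std across seeds
print(f"AIME 2025: {mean:.1%} +/- {std:.1%} over {len(seed_scores)} seeds")
```

With only 30 problems per AIME run, a single flipped answer moves the score by more than three points, which is why the reported variance is as large as it is.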
Still, the core claim is defensible. A 14B model beating DeepSeek-R1-Distill-70B on AIME is genuine, and it reflects smart training rather than benchmark overfitting. You can verify this by probing with novel problems not in any public dataset.
| Benchmark | Phi-4-reasoning | Phi-4-reasoning-plus | DeepSeek-R1-Distill-70B | QwQ-32B |
|---|---|---|---|---|
| AIME 2024 | 75.3% | 81.3% | 69.3% | 79.5% |
| AIME 2025 | 62.9% | ~80% | 51.5% | 65.8% |
| GPQA Diamond | 65.8% | 68.9% | 66.2% | 59.5% |
| MATH-500 | 94.6% | ~95% | - | - |
| OmniMath | 76.6% | 81.9% | 63.4% | - |
| LiveCodeBench | 53.8% | 53.1% | 57.5% | 63.4% |
The coding numbers are the one area where the model doesn't lead. On LiveCodeBench, QwQ-32B (63.4%) and DeepSeek-R1-Distill-70B (57.5%) both beat Phi-4-reasoning (53.8%). If you mainly need a coding assistant rather than a math reasoner, this affects the decision. The coding benchmarks leaderboard tracks the current standings.
GPQA Diamond is a bright spot: 65.8% and 68.9% for the two text variants, putting both ahead of o1-mini (60.0%) and QwQ-32B (59.5%). Graduate-level science questions are where this model earns its keep. For more context on reasoning model rankings, see the reasoning benchmarks leaderboard.
The Training Method Is the Story
What makes Phi-4-reasoning interesting technically isn't just the final numbers - it's how Microsoft got there with a relatively small model.
The key design choice is targeting "teachable" prompts during training: problems where the base Phi-4 model scored around 50% accuracy. Too easy and the model learns nothing; too hard and training stalls. This calibration, described in the technical report (arXiv 2504.21318), is why the model generalizes well rather than memorizing specific competition problems.
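The selection idea is simple enough to sketch: sample the base model several times per candidate problem and keep only problems whose pass rate sits in a band around 50%. The band and sample count below are illustrative choices, not values from the report:

```python
def is_teachable(pass_count, n_samples=8, lo=0.3, hi=0.7):
    """Keep prompts the base model solves sometimes but not always.

    The report targets problems where base Phi-4 lands near 50%
    accuracy; the 30-70% band here is an illustrative stand-in.
    """
    rate = pass_count / n_samples
    return lo <= rate <= hi

# Pass counts out of 8 base-model samples per problem:
counts = {"p1": 0, "p2": 4, "p3": 8, "p4": 5}
kept = [p for p, c in counts.items() if is_teachable(c)]
# p1 is too hard and p3 too easy; p2 and p4 survive the filter
```

Problems the base model always fails contribute no learning signal under a correctness reward, and problems it always solves waste training compute, so both ends get filtered out.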
A 14B model beating DeepSeek-R1-Distill-70B on AIME is genuine, and it reflects smart training rather than benchmark overfitting.
Using o3-mini as the teacher for producing reasoning traces is a notable choice. It means Phi-4-reasoning is, in a sense, distilling proprietary OpenAI capabilities into an open-weight MIT-licensed model. Microsoft is doing this while maintaining its OpenAI partnership - an arrangement that creates some interesting tensions the company hasn't publicly addressed.
The GRPO reinforcement learning step on top of SFT produces the "plus" variant's additional accuracy, at the cost of producing roughly 50% more tokens per response. That tradeoff matters in production: you're paying for longer outputs to get a few points of AIME improvement.
AIME and OmniMath are the benchmarks where Phi-4-reasoning-plus earns its strongest results, handling competition-level math problems that trip up models many times its size.
Source: unsplash.com
The Overthinking Problem
I'd be doing a disservice not to address this directly. When you send a simple message like "What's the capital of France?" to Phi-4-reasoning-plus, you get a long internal reasoning trace before the one-word answer. The default system prompt instructs the model to run "comprehensive analysis cycles." Simon Willison documented a case where prompting with "hi" produced 56 sentences of reasoning before a single sentence of response.
This isn't a corner case - it's the model's default behavior, and it creates real friction for anything outside of STEM tasks. Compare this with Mistral Small 4 or Qwen 3, which both offer direct-response modes; Phi-4-reasoning has no equivalent /nothink flag.
There's also a more troubling failure mode: the model will occasionally produce long, detailed reasoning chains that arrive at wrong answers. One HackerNews thread demonstrated the model reasoning for over two minutes about a simple letter-counting task and still getting it wrong. Elaborate reasoning doesn't guarantee correct answers - a lesson that applies across all chain-of-thought models but is especially visible here because the traces are so verbose.
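If you deploy the model anyway, you'll usually want to separate the trace from the answer in post-processing. Phi-4-reasoning emits its chain of thought inside <think>...</think> tags (the same convention the DeepSeek-R1-style reasoning parsers consume); a minimal splitter, with the helper name being my own:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Separate the chain-of-thought trace from the final answer.

    Assumes the model wraps its trace in <think>...</think> tags;
    if no tags are present, the whole text is treated as the answer.
    """
    match = THINK_RE.search(text)
    trace = match.group(1).strip() if match else ""
    answer = THINK_RE.sub("", text).strip()
    return trace, answer

raw = "<think>Paris is the capital of France. Double-checking...</think>Paris."
trace, answer = split_reasoning(raw)
```

Hiding the trace from end users doesn't recover the latency or token cost, of course - the model still generates every word of it.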
Local Deployment and Availability
Running these models locally is truly straightforward. Phi-4-reasoning-plus is available via Ollama:
ollama run phi4-reasoning:plus
The download is 11GB for the full model. On a machine with a modern GPU, responses for math problems normally arrive within 30-60 seconds. The model has 1.3 million Ollama pulls, which is a strong adoption signal for an open-weight reasoning model. Community quantizations in GGUF format are available via bartowski and unsloth for lower-memory setups.
For vLLM deployments:
vllm serve microsoft/Phi-4-reasoning-plus --enable-reasoning --reasoning-parser deepseek_r1
Azure AI Foundry hosts all variants, though Microsoft hasn't published explicit per-token pricing for the reasoning models. The base Phi-4 on Azure runs $0.13 per million input tokens and $0.50 per million output tokens. Given that reasoning models produce substantially longer outputs, effective cost per task is higher than the listed rate suggests. For comparison with other small model options, the small language model leaderboard tracks the current field.
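Back-of-envelope arithmetic makes the output-length effect concrete. Using the base Phi-4 rates as a proxy (since reasoning-model pricing isn't published) and invented token counts for a single task:

```python
def cost_usd(in_tokens, out_tokens, in_rate=0.13, out_rate=0.50):
    """Cost at the base Phi-4 Azure rates (USD per million tokens),
    used as a proxy because reasoning-model pricing isn't listed."""
    return in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate

# Hypothetical task: 500-token prompt. A chatty reasoner might emit
# 4,000 output tokens where a direct model emits 400.
direct = cost_usd(500, 400)
reasoner = cost_usd(500, 4000)
print(f"direct ~ ${direct:.6f}, reasoning ~ ${reasoner:.6f}")
```

Under those assumptions the reasoning run costs nearly 8x the direct one despite identical per-token rates - the verbose traces, not the price list, dominate the bill.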
The Vision Model
Phi-4-reasoning-vision-15B adds a SigLIP-2 vision encoder and a hybrid reasoning mode where the model decides when to apply chain-of-thought (<think>) versus direct response (<nothink>). Roughly 20% of training samples included reasoning traces; the remaining 80% were tagged for direct response, teaching the model to match mode to task type.
The vision benchmarks are mixed. ScreenSpot-V2 is strong at 88.2%, making it a credible option for GUI navigation agents. ChartQA (83.3%) and AI2D (84.8%) are solid. MathVista (75.2%) and OCRBench (76.0%) lag behind Qwen3-VL-32B, which scores 83.8% and 85.0% on those tasks respectively. The context window shrinks to 16K tokens compared to 32K for the text models - a constraint worth noting for document analysis workloads.
Early community reports described the vision performance as "far below expectations" in some image-to-text tasks. My testing found it capable but inconsistent: strong on structured data (tables, charts, diagrams) and weaker on open-ended image description. The model isn't yet competitive with top multimodal models for general vision tasks.
Limitations Worth Knowing
A few hard constraints before rolling out these models:
English only. Performance degrades notably on non-English languages. The model cards acknowledge this explicitly. If multilingual support matters, look elsewhere.
Knowledge cutoff. March 2025 for the text reasoning models; February 2025 for the vision variant. Questions about events after those dates will produce hallucinations or refusals.
Python-centric coding. The training emphasis is on Python. Other languages work but get less reliable results, especially for complex packages and frameworks.
Election content. Every model card notes an elevated defect rate on election-related queries. Microsoft flags this prominently.
The mini variants (Phi-4-mini-reasoning and flash-reasoning) carry an additional constraint: they're math-only models. Don't try to use them as general-purpose assistants - they aren't designed for that.
Verdict
Phi-4-reasoning-plus earns its place as the best open-weight reasoning model under 15B parameters for math and STEM tasks. Running it locally on a single GPU, getting AIME performance that would have required a 70B model six months ago, under MIT license - that's truly useful for researchers, educators, and developers working on constrained hardware.
The gap closes notably outside of math competition problems. On coding, QwQ-32B and the distilled 70B models lead. On general chat, the overthinking problem makes it awkward. The English-only limitation rules it out for any multilingual deployment.
Score: 8.1/10. A strong, specific tool that rewards users who know what they're asking it to do. Treat it as a dedicated math and science reasoner, not a GPT replacement, and it delivers.
Sources
- Phi-4-reasoning on HuggingFace - Model card, benchmark scores, and training details for the base reasoning model
- Phi-4-reasoning technical report (arXiv 2504.21318) - Full paper covering training methodology, data curation, and evaluation
- Microsoft Research blog: Phi Reasoning - Official announcement and benchmark context
- Phi-4-reasoning-vision-15B research blog - Vision model training methodology and benchmark results
- Azure blog: Phi-4-mini-flash-reasoning - SambaY architecture and throughput claims
- Simon Willison's analysis of Phi-4-reasoning - Independent testing including the overthinking documentation
- Azure AI Foundry model pricing - Base Phi-4 pricing, used as proxy for reasoning model cost estimates
- Phi-4-reasoning-plus on Ollama - Local deployment instructions and download statistics
