Microsoft's Phi-4 Vision Matches Models 10x Its Size

Microsoft releases Phi-4-reasoning-vision-15B - a 15B open-weight multimodal model trained on 240 GPUs in 4 days that competes with 100B+ parameter models on math, science, and GUI understanding.

Microsoft released Phi-4-reasoning-vision-15B on March 4 - a 15 billion parameter open-weight model that combines visual understanding with mathematical reasoning. The model was trained on just 240 NVIDIA B200 GPUs over 4 days using 200 billion multimodal tokens, a fraction of the training data used by competitors like Qwen 2.5 VL and Gemma 3 (each trained on more than 1 trillion tokens). Despite the efficiency, it competes with models many times its size on math, science, and GUI understanding benchmarks.

TL;DR

  • 15B parameter open-weight multimodal model combining vision + reasoning
  • Trained on 240 B200 GPUs in 4 days using 200B tokens (vs 1T+ for competitors)
  • Architecture: Phi-4-Reasoning backbone + SigLIP-2 vision encoder, mid-fusion design
  • Key feature: adaptive reasoning - decides when to think deeply vs answer instantly
  • Scores: 84.8 AI2D, 83.3 ChartQA, 75.2 MathVista, 88.2 ScreenSpot v2
  • Available on HuggingFace, Microsoft Foundry, GitHub under MIT license

Architecture

Phi-4-reasoning-vision-15B is built on two components: the Phi-4-Reasoning language model backbone (which handles text-based reasoning) and the SigLIP-2 vision encoder (which processes images). These are connected via a mid-fusion architecture that lets visual and textual information interact throughout the model's layers rather than only at the input.

The vision encoder uses a SigLIP-2 Naflex variant with natively dynamic resolution, supporting up to 3,600 visual tokens (~720p). A cross-modality MLP projector connects visual tokens into the language model's embedding space. Bidirectional attention within images (but not across images) reduces overfitting. Context length is 16,384 tokens.
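The relationship between the 3,600-token cap and "~720p" follows from simple patch arithmetic. A minimal sketch, assuming 16x16 pixel patches (a common SigLIP-style patch size; the exact patch size for this model is an assumption, not stated in the announcement):

```python
# Rough visual-token arithmetic for a dynamic-resolution encoder.
# ASSUMPTION: 16x16 pixel patches, typical for SigLIP-style encoders.

PATCH = 16           # assumed patch edge in pixels
TOKEN_BUDGET = 3600  # stated cap on visual tokens

def tokens_for(width: int, height: int, patch: int = PATCH) -> int:
    """Number of patch tokens for an image, rounding partial patches up."""
    cols = -(-width // patch)   # ceiling division
    rows = -(-height // patch)
    return cols * rows

def fits_budget(width: int, height: int) -> bool:
    return tokens_for(width, height) <= TOKEN_BUDGET

# A 1280x720 (720p) frame divides exactly into 80 x 45 patches:
print(tokens_for(1280, 720))    # 3600 - right at the stated budget
print(fits_budget(1920, 1080))  # False - 1080p would need 120 * 68 = 8160 tokens
```

Under that assumption, 720p sits exactly at the budget, which is consistent with the "~720p" framing; higher resolutions would need downscaling or tiling.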

The model's distinguishing feature is adaptive reasoning. It operates as a single unified system that decides when to activate extended chain-of-thought (wrapped in <think>...</think> tags) and when to answer directly (<nothink> mode). For a simple image captioning task, it responds right away. For a complex math problem embedded in a diagram, it activates extended reasoning. Users can override the defaults via prompts to force either mode.
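Downstream code typically wants the final answer with the chain-of-thought separated out. A minimal parsing sketch, assuming responses wrap extended reasoning in literal <think>...</think> tags as described (the helper is hypothetical, not part of any Microsoft SDK):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate chain-of-thought from the final answer.

    Assumes the model emits reasoning inside literal <think>...</think>
    tags, per the adaptive-reasoning description. Returns
    (reasoning, answer); a direct answer has empty reasoning.
    """
    match = THINK_RE.search(text)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer

# Extended-reasoning response vs a direct one:
r, a = split_reasoning("<think>Area = 1/2 * b * h = 6</think>The area is 6.")
r2, a2 = split_reasoning("A cat sitting on a windowsill.")
```

Treating the direct case (no tags at all) as valid output matters here, since the model decides per-query whether to think at length.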

Benchmark Results

Benchmark                  Phi-4-Vision 15B   Qwen3-VL-8B   Qwen3-VL-32B
AI2D (science diagrams)    84.8               83.9          85.0
ChartQA                    83.3               83.9          84.0
MathVista                  75.2               72.4          81.8
MMMU                       54.3               55.8          70.6
MMStar                     64.5               65.0          71.5
OCRBench                   76.0               86.5          88.1
ScreenSpot v2 (GUI)        88.2               80.4          93.9
MathVerse                  44.9               -             -
MathVision                 36.2               -             -

The results show a model that's truly competitive with the 8B-class Qwen models and not far behind the 32B version on several benchmarks - despite being trained on a fifth of the data. The ScreenSpot v2 score (88.2) is especially strong, beating the 8B Qwen model by nearly 8 points on GUI element grounding.

Where it falls short: MMMU (broad multimodal understanding) and OCRBench, where the larger Qwen models hold a significant lead. The model excels at structured reasoning tasks but shows less breadth of raw visual recognition than models trained on far more data.

Training Efficiency

The headline efficiency claim: 200 billion multimodal tokens on 240 B200 GPUs for 4 days. Microsoft specifically contrasts this with competing models:

  • Qwen 2.5 VL: 1 trillion+ tokens
  • Qwen 3 VL: 1 trillion+ tokens
  • Kimi-VL: 1 trillion+ tokens
  • Gemma 3: 1 trillion+ tokens

The training used a three-stage recipe: Stage 1 trained only the projector on 2M samples (1.4B tokens) with both vision and LLM frozen. Stage 2 unfroze everything and trained on 62.8M samples (188.5B tokens). Stage 3 fine-tuned on 3.2M specialized samples (12B tokens). Inaccurate captions in training data were regenerated using GPT-4o and o1-mini.
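The stage breakdown can be sanity-checked against the 200B headline figure. A sketch of the recipe as data, with the token totals summed (component names are illustrative, taken from the description above):

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    samples_m: float            # millions of samples
    tokens_b: float             # billions of tokens
    trainable: tuple[str, ...]  # components unfrozen in this stage

# Three-stage recipe as described; component names are illustrative.
STAGES = [
    Stage("projector warmup", 2.0,  1.4,   ("projector",)),
    Stage("full training",    62.8, 188.5, ("vision", "projector", "llm")),
    Stage("specialized SFT",  3.2,  12.0,  ("vision", "projector", "llm")),
]

total_tokens_b = sum(s.tokens_b for s in STAGES)
print(f"total: {total_tokens_b:.1f}B tokens")  # 201.9B - matching the ~200B headline
```

The sum (1.4 + 188.5 + 12 = 201.9B) lines up with the stated 200 billion multimodal tokens, with Stage 2 accounting for over 90% of the budget.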

A 5x reduction in training data while maintaining competitive scores suggests either better data curation, more efficient architecture, or both. Microsoft's Phi team has consistently highlighted data quality over data volume as the driver of their small model performance.

What It Can Do

The model handles four main categories of visual tasks:

Math and science reasoning - Reads mathematical notation, diagrams, charts, and figures, then reasons through multi-step solutions. This is where the chain-of-thought reasoning matters most.

Document understanding - Processes receipts, forms, tables, and structured documents. The OCR score (76.0) is decent but not class-leading.

GUI navigation - Identifies and locates interface elements on screens. The ScreenSpot v2 score (88.2) suggests strong potential for computer-use applications at a fraction of the cost of frontier models.

General visual Q&A - Photo captioning, object identification, scene description. Competent but not the model's primary strength.

Availability

The model is released under the MIT license - fully open for commercial use. Weights are available on HuggingFace, and it's deployable through Microsoft Foundry and Azure AI. Training code and evaluation scripts are on GitHub.

At 15B parameters, the model runs on a single consumer GPU with quantization, making it accessible for researchers and developers who can't afford API calls to frontier models. That's the Phi series' ongoing value proposition: near-frontier quality for tasks that don't need the full weight of a 100B+ parameter model.
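The "single consumer GPU with quantization" claim follows from back-of-envelope memory math. A weights-only estimate, using standard bytes-per-parameter for each precision (this ignores KV cache and activations, so real usage runs somewhat higher):

```python
PARAMS = 15e9  # 15 billion parameters

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(precision: str, params: float = PARAMS) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes).

    Excludes KV cache and activation memory, so treat the result
    as a lower bound on actual VRAM usage.
    """
    return params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_gb(precision):.1f} GB")
# fp16: 30.0 GB - beyond any single consumer card
# int8: 15.0 GB - borderline on a 16 GB card
# int4:  7.5 GB - fits a 12-24 GB consumer GPU with headroom
```

At 4-bit quantization the weights drop to roughly 7.5 GB, which is what puts the model in reach of a single consumer card.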

About the author

Sophie, AI Infrastructure & Open Source Reporter, is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.