Small Language Model Leaderboard: Best Under 10B

Rankings of the best small language models under 10 billion parameters, comparing Phi-4, Gemma 3, Qwen 3.5, and more across key benchmarks.

The most capable AI models get the headlines. A 685B-parameter DeepSeek or a 405B Llama captures attention because the numbers are enormous and the benchmarks are impressive. But here's the thing: most real-world AI workloads don't need a model that large. They need something that runs on a phone, fits in 8 GB of RAM, or processes requests at a fraction of a cent per thousand tokens.

Small language models - those under 10 billion parameters - have gotten remarkably good in the past year. Models that once struggled with basic reasoning now handle complex instruction following, multilingual tasks, and even code generation with scores that would've beaten last generation's 30B+ flagships. If you're building a product that needs AI on the edge, in a mobile app, or at scale without burning through GPU budgets, this is the leaderboard that matters.

TL;DR

  • Qwen3.5-9B takes the top spot with an MMLU-Pro of 82.5 and GPQA Diamond of 81.7 - beating models 3x its size
  • Gemma 3 4B IT is the biggest surprise, posting a GSM8K of 89.2% and HumanEval of 71.3% from just 4 billion parameters
  • For on-device deployment, Gemma 3n E4B runs on 3 GB of memory and is the first sub-10B model to break 1300 on LMArena

Methodology

This leaderboard ranks models with 10 billion or fewer parameters (or, for MoE architectures, 10B or fewer effective parameters). All scores come from official model cards, technical reports, or verified third-party evaluations. Where models offer both thinking and non-thinking modes, I used the best available score.

The benchmarks:

  • MMLU-Pro - Multi-task language understanding with harder questions than classic MMLU. Tests broad knowledge and reasoning.
  • HumanEval - Code generation. The model writes Python functions from docstrings.
  • GSM8K - Grade-school math word problems requiring multi-step reasoning.
  • ARC-C - Science questions from grade-school exams. Tests common-sense reasoning.
  • GPQA Diamond - Graduate-level science questions. A genuine measure of deep reasoning.

RAM requirements are estimated for Q4_K_M GGUF quantization using llama.cpp, which is how most people actually run these models locally.
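As a rough sanity check on those figures: Q4_K_M averages about 4.5 bits per weight (it mixes 4-bit and 6-bit blocks), and the runtime adds overhead for the KV cache and compute buffers. Here's a minimal back-of-the-envelope estimator; the 4.5-bit average and the flat 1 GB overhead are simplifying assumptions of mine, not llama.cpp internals:

```python
def estimate_q4_ram_gb(params_billion: float, overhead_gb: float = 1.0) -> float:
    """Rough RAM estimate for a Q4_K_M GGUF model under llama.cpp.

    Assumes ~4.5 bits per weight on average, plus a flat allowance
    for KV cache and compute buffers.
    """
    bits_per_weight = 4.5
    weight_gb = params_billion * bits_per_weight / 8  # 1e9 params and 1e9 bytes/GB cancel
    return weight_gb + overhead_gb

# Roughly reproduces the table: ~6 GB for a 9B model, ~3 GB for a 4B model
print(estimate_q4_ram_gb(9))  # 6.0625
print(estimate_q4_ram_gb(4))  # 3.25
```

Real usage shifts with context length and the exact quantization variant, but this gets you within half a gigabyte of the table's numbers.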

Rankings

| Rank | Model | Params | Provider | MMLU-Pro | HumanEval | GSM8K | ARC-C | RAM (Q4) |
|------|-------|--------|----------|----------|-----------|-------|-------|----------|
| 1 | Qwen3.5-9B | 9B | Alibaba | 82.5 | - | - | - | ~6 GB |
| 2 | Qwen3.5-4B | 4B | Alibaba | 79.1 | - | - | - | ~3 GB |
| 3 | Gemma 3n E4B | 8B (eff. 4B) | Google | - | - | - | - | ~3 GB |
| 4 | Gemma 3 4B IT | 4B | Google | 43.6 | 71.3 | 89.2 | 56.2 | ~3 GB |
| 5 | Phi-4-mini | 3.8B | Microsoft | 52.8 | - | 88.6 | 83.7 | ~2.5 GB |
| 6 | Llama 3.2 3B | 3B | Meta | - | - | 77.7 | 78.6 | ~2 GB |
| 7 | Qwen3.5-2B | 2B | Alibaba | - | - | - | - | ~1.5 GB |
| 8 | Gemma 3 1B IT | 1B | Google | 14.7 | 41.5 | 62.8 | 38.4 | ~0.8 GB |
| 9 | SmolLM2 1.7B | 1.7B | Hugging Face | - | - | - | - | ~1.2 GB |
| 10 | Qwen3.5-0.8B | 0.8B | Alibaba | - | - | - | - | ~0.6 GB |

Note on missing scores: Several models - particularly the Qwen 3.5 small series and Gemma 3n - were evaluated on different benchmark suites than the classic MMLU-Pro/HumanEval/GSM8K trio. Qwen 3.5 models report MMLU-Pro, GPQA Diamond, and IFEval. Gemma 3n reports LMArena Elo and vision benchmarks. I've ranked based on the totality of available evidence, cross-referencing where benchmarks overlap, rather than leaving a model out because it didn't report one specific test.

Key Takeaways

Qwen 3.5 owns the top of the small model stack

The Qwen 3.5 small series is the most complete lineup in this class. Four models from 0.8B to 9B, all sharing the same Gated DeltaNet hybrid architecture from the 397B flagship, all natively multimodal (text, images, video), all Apache 2.0 licensed. The 9B variant scores 82.5 on MMLU-Pro and 81.7 on GPQA Diamond - numbers that would've placed it in the top five of the open-source LLM leaderboard just six months ago. It beats the previous-gen Qwen3-30B on most benchmarks despite being a third of the size.

Gemma 3 4B punches hard on code and math

Google's Gemma 3 4B IT posts a 71.3% on HumanEval and 89.2% on GSM8K. Those are strong numbers for a 4B model. For coding and math-heavy use cases where you don't need the full reasoning depth of a 9B model, Gemma 3 4B is the pick. It's also multimodal (images and text) with a 128K context window, which is generous for its size class.

Phi-4-mini is the ARC champion

Microsoft's Phi-4-mini (3.8B) scores an impressive 83.7% on ARC-C - the highest in this entire leaderboard for that benchmark. It also posts 88.6% on GSM8K, making it a strong contender for reasoning-heavy workloads. The 128K context window and competitive code performance make it a solid all-rounder, even if Qwen 3.5 takes the overall crown.

Gemma 3n redefines what fits on a phone

Google's Gemma 3n is architecturally distinct from the rest of this list. Its E4B variant has 8B raw parameters but runs with the memory footprint of a 4B model - roughly 3 GB. It's the first model under 10B parameters to break a 1300 LMArena Elo score. The trick is Per-Layer Embeddings, which allows the model to dynamically manage its memory. For mobile deployment specifically, Gemma 3n is purpose-built in a way that general models aren't.

The 1-3B tier is now genuinely useful

A year ago, models under 3B parameters were toys. That's no longer true. Llama 3.2 3B handles basic instruction following and simple reasoning with respectable scores (77.7% GSM8K, 78.6% ARC-C). Gemma 3 1B runs at over 2,500 tokens per second on a mobile GPU and scores 62.8% on GSM8K. Qwen3.5-2B adds native multimodal support. These aren't replacing 9B models, but they're perfectly adequate for autocomplete, summarization, simple Q&A, and basic coding assistance on constrained hardware.

[Image: Silicon chip wafer close-up - the physical substrate where small language models run]

On-Device Deployment Guide

The whole point of small models is running them where large models can't. Here's what actually works in March 2026.

Phones (4-8 GB RAM)

| Device Tier | Best Models | Framework | Notes |
|-------------|-------------|-----------|-------|
| Flagship (8+ GB RAM) | Qwen3.5-4B Q4, Gemma 3n E4B, Phi-4-mini Q4 | ExecuTorch, MediaPipe, MLX | Full instruction following, multimodal |
| Midrange (6 GB RAM) | Qwen3.5-2B Q4, Gemma 3 1B, Llama 3.2 3B Q4 | ExecuTorch, AI Edge | Basic chat, summarization |
| Budget (4 GB RAM) | Qwen3.5-0.8B, Gemma 3 270M | LiteRT, TFLite | Autocomplete, classification |
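If your app has to pick a model at runtime based on the device it lands on, the tiers above reduce to a simple lookup. A hypothetical helper, sketched in Python (the function name is mine; the cutoffs and model lists come straight from the table, not from any framework API):

```python
def models_for_device(ram_gb: float) -> list[str]:
    """Suggest on-device models for a given amount of RAM, per the tier table."""
    tiers = [
        (8, ["Qwen3.5-4B Q4", "Gemma 3n E4B", "Phi-4-mini Q4"]),  # flagship
        (6, ["Qwen3.5-2B Q4", "Gemma 3 1B", "Llama 3.2 3B Q4"]),  # midrange
        (4, ["Qwen3.5-0.8B", "Gemma 3 270M"]),                    # budget
    ]
    for min_ram, models in tiers:
        if ram_gb >= min_ram:
            return models
    return []  # under 4 GB, on-device LLM inference isn't practical

print(models_for_device(6))  # midrange picks
```

In a real app the same branching would live behind whatever device-capability API your platform exposes; the point is that the decision is a static table, not a benchmark run.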

Qualcomm's Snapdragon 8 Gen 5 NPU delivers up to 46% faster AI inference than the previous generation. Samsung, Google, and Motorola flagship phones shipping in 2026 all support on-device inference for models up to roughly 4B parameters in Q4 quantization. Apple's Foundation Models framework provides native access to a ~3B on-device model on iPhone 16 and later.

Meta's ExecuTorch hit 1.0 GA in October 2025 and now supports 12+ hardware backends - Apple, Qualcomm, Arm, MediaTek, and Vulkan - with over 80% of popular edge models working out of the box.

Laptops and Desktops (8-32 GB RAM)

Any machine with 8 GB of RAM can run any model in this leaderboard at Q4 quantization. A typical laptop with 16 GB of RAM can comfortably run Qwen3.5-9B at Q4_K_M through llama.cpp or one of the local LLM tools with room to spare. That's the current best-in-class small model, running locally, with no internet connection and no API costs.

For GPU-equipped machines, you can run several of these models at full BF16 precision. The Qwen3.5-4B needs about 8 GB VRAM, Phi-4-mini about 7.5 GB, and Gemma 3 4B roughly 8 GB.
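Those VRAM figures are just arithmetic: BF16 stores each weight in 16 bits, so weight memory in GB is roughly 2x the parameter count in billions. A quick sketch (weights only; activations and KV cache are extra and grow with context length):

```python
def bf16_weight_vram_gb(params_billion: float) -> float:
    """Weight memory for BF16 inference: 2 bytes per parameter."""
    return params_billion * 2

print(bf16_weight_vram_gb(4.0))  # 8.0 -> matches the ~8 GB for Qwen3.5-4B
print(bf16_weight_vram_gb(3.8))  # 7.6 -> close to the ~7.5 GB cited for Phi-4-mini
```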

[Image: On-device AI technology - small models running directly on consumer hardware]

Edge and IoT

Gemma 3 270M (not in the main leaderboard since it's under 1B) is worth mentioning for extreme edge cases. At 529 MB, it handles basic text classification, entity extraction, and simple Q&A on hardware as minimal as a Raspberry Pi 5. SmolLM2's 135M and 360M variants target similar use cases.

Practical Guidance

Best for coding: Gemma 3 4B IT. The 71.3% HumanEval is strong for a model this size, and the 128K context window helps with larger codebases. Phi-4-mini is a close second.

Best for reasoning: Qwen3.5-9B. The 82.5 MMLU-Pro and 81.7 GPQA Diamond are hard to argue with. If you can spare 6 GB of RAM, this is the model to use.

Best for mobile deployment: Gemma 3n E4B. Purpose-built for phones, 3 GB memory footprint, first sub-10B model above 1300 LMArena Elo. If you're building an iOS or Android app, start here.

Best for general chat on a laptop: Qwen3.5-4B at Q4 quantization. Strong across the board, natively multimodal, 262K context, and runs comfortably on any modern laptop. Check the home GPU leaderboard for more recommendations at this tier.

Best ultra-light model: Llama 3.2 3B. At 2 GB in Q4, it's the lightest model in this leaderboard that still delivers genuinely useful performance across tasks. Gemma 3 1B is lighter still but trades off noticeable capability.

Best for privacy-first deployments: Any model on this list runs fully offline. But if your priority is maximizing capability per watt on air-gapped hardware, Phi-4-mini offers the best reasoning-per-GB ratio thanks to its exceptional ARC-C and GSM8K scores at just 3.8B parameters.

What's Next

The trajectory is clear. Two years ago, a 7B model couldn't follow basic instructions. One year ago, a 3B model was barely useful. Today, a 4B model matches what last year's 30B models reached. The performance-per-parameter curve continues to steepen as training techniques, architecture innovations like Gated DeltaNet and Per-Layer Embeddings, and better data curation compound.

Watch for the next wave of on-device models optimized specifically for NPU hardware rather than adapted from general-purpose architectures. Google's Gemma 3n is the template - models designed from the ground up for constrained inference, not miniaturized versions of cloud models. That's where the real gains are coming.

Last verified March 9, 2026

About the author

James, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.