Edge and Mobile LLM Leaderboard 2026
Rankings of the best LLMs for on-device edge inference - phones, laptops without GPUs, Raspberry Pi, and Jetson - scored by quality benchmarks and real tokens/sec on iPhone, MacBook, and Raspberry Pi 5.

Edge hardware for on-device LLM inference: from Raspberry Pi 5 to iPhone 16 Pro to MacBook M4 Air
Cloud inference is the default. Send tokens out, get tokens back. For casual use that is fine, but there are real reasons to run models locally on constrained hardware - privacy, offline capability, latency, and cost. A phone that processes your medical notes without ever talking to a server is a different product than one that uploads them to an API. A Raspberry Pi running a local assistant at a remote site without reliable internet access solves a problem that cloud inference cannot.
This leaderboard covers the tier that the home GPU leaderboard and the small language model leaderboard don't quite address: models optimized for on-device edge inference on hardware that has no discrete GPU, limited RAM, and significant thermal constraints. We're talking about mobile SoCs like the Apple A17/A18 Pro and Qualcomm Snapdragon 8 Elite, low-power laptop silicon like the Apple M4 (base configuration), single-board computers like the Raspberry Pi 5, and embedded AI boards like the NVIDIA Jetson Orin Nano.
The metric that matters here is different from the GPU leaderboard. On a workstation with an RTX 4090, the bottleneck is memory bandwidth. On an iPhone or a Raspberry Pi, the constraint is thermal headroom - the device will throttle after a few minutes of sustained load - plus the interplay between NPU acceleration, power draw, and the number of tokens you can generate per second before the user gives up.
TL;DR
- Gemma 3 4B is the best all-around edge model: best-in-class IFEval (73.4), the highest GSM8K in the table (89.2), and the fastest iPhone 16 Pro throughput in its size class at ~27 tok/s with the Google AI Edge SDK
- Phi-4-Mini (3.8B) is the strongest reasoning pick below 4B, with 88.6% GSM8K and 83.7% ARC-C, and runs at ~22 tok/s on MacBook M4 Air via llama.cpp
- Qwen 2.5 3B delivers the best quality-per-parameter ratio under 4B with solid MMLU (65.6) and acceptable Raspberry Pi 5 performance at ~8 tok/s
- MobileLLM-350M is the only sub-1B model with credible architecture research behind it; TinyLlama-1.1B is largely obsolete for quality tasks
What This Leaderboard Covers
This is specifically an on-device edge inference ranking. The models here must be capable of running on hardware with one or more of these constraints:
- Mobile SoCs (Apple A17/A18 Pro, Qualcomm Snapdragon 8 Elite / Gen 3, MediaTek Dimensity 9300)
- Low-power laptop silicon without discrete GPU (Apple M4 base 8-16 GB, Intel Core Ultra with iGPU)
- Single-board computers (Raspberry Pi 5 with 8 GB RAM, no GPU)
- Embedded AI hardware (NVIDIA Jetson Orin Nano 8GB, Qualcomm Robotics RB3)
Not in scope: workstation GPUs (RTX series, AMD RX series), Mac Pro / Mac Studio configurations, or any setup requiring more than 16 GB of RAM for a laptop. If you want those rankings, see the home GPU leaderboard.
The general small language model leaderboard covers quality benchmarks for models under 10B parameters. This leaderboard is the performance complement: which of those models actually run at useful speeds on the specific hardware most people carry in their pockets or put in the field.
Hardware Tiers Explained
Different edge hardware has very different characteristics, and the tier you're targeting significantly changes which model is the right choice.
Phone SoC (Apple A17/A18 Pro, Snapdragon 8 Elite)
Flagship smartphones in 2026 ship with NPUs rated at roughly 35-45 TOPS: the Apple A18 Pro neural engine at 35 TOPS, the Qualcomm Snapdragon 8 Elite at 45 TOPS. Both have tight thermal budgets - sustained AI inference at maximum NPU utilization will trigger throttling within 2-3 minutes, a shorter practical working window than most people expect.
On-device RAM is shared between the OS, applications, and model weights. An iPhone 16 Pro with 8 GB of total RAM realistically has 3-4 GB available for model weights after the OS takes its share. That limits viable models to 3B parameters at Q4 quantization, or smaller. Android flagships with 12-16 GB of RAM have more headroom - a Snapdragon 8 Elite device with 12 GB can hold a Q4-quantized 4B model with room for context.
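To make the RAM budgeting concrete, here is a rough footprint estimator. The ~4.85 bits/weight average for Q4_K_M and the FP16 KV cache layout are my approximations, and the 28-layer / 8-KV-head / 128-dim example architecture is hypothetical, not any specific model:

```python
# Rough memory-budget estimator for a Q4-quantized model on a phone or SBC.
# Assumptions (mine, not vendor specs): Q4_K_M averages ~4.85 bits/weight,
# and the KV cache is stored in FP16 (2 bytes per element).

def model_footprint_gb(params_b: float, bits_per_weight: float = 4.85) -> float:
    """Approximate weight memory in GB for a quantized model."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim x context."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 1e9

# A 3B model at Q4_K_M: ~1.8 GB of weights...
weights = model_footprint_gb(3.0)
# ...plus KV cache for a hypothetical 28-layer, 8-KV-head, 128-dim model at 4K:
cache = kv_cache_gb(28, 8, 128, 4096)
print(f"weights ~{weights:.1f} GB, KV cache ~{cache:.2f} GB")
```

The weights-plus-cache total is what has to fit inside that 3-4 GB phone budget, which is why 3B at Q4 is the practical ceiling on an 8 GB iPhone.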
Frameworks: Apple's Core ML / AI Edge SDK, Google's MediaPipe LLM inference API, Qualcomm's AI Hub, Meta's ExecuTorch.
Low-Power Laptop Silicon (Apple M4 Air 16 GB, Intel Core Ultra iGPU)
The Apple M4 base chip (not Pro, not Max) has a 10-core GPU and ships with 8 or 16 GB of unified memory. At 16 GB, it comfortably holds a Q4-quantized 4B model and can fit a 7-8B model in a pinch. Memory bandwidth is 120 GB/s - well below the M4 Pro's 273 GB/s - which directly limits token generation speed.
This tier is the most practical for local developer use: a laptop that isn't a workstation, running llama.cpp or MLX-LM with no discrete GPU. Intel Core Ultra 200V series laptops (Lunar Lake) with Xe2 iGPU also fall here, though Intel's AI inference stack for LLMs is less mature than Apple's.
Raspberry Pi 5 (8 GB RAM, no GPU)
The Raspberry Pi 5 has a quad-core ARM Cortex-A76 CPU running at 2.4 GHz and up to 8 GB LPDDR4X RAM. There is no GPU acceleration for general ML inference - everything runs on CPU via llama.cpp with ARM NEON SIMD. The practical ceiling is a 3B model at Q4 quantization (~2 GB), which generates tokens at 6-12 tok/s depending on context length. That is slow but usable for async tasks, batch processing, or applications where latency is not user-facing.
The Raspberry Pi 5 is not a great interactive inference device. It is, however, the dominant hardware in edge IoT deployments, robotics, and offline field computing. The question for this tier is not "is it fast enough for chat?" but "can it run a useful inference task at all?"
Jetson Orin Nano 8GB
The NVIDIA Jetson Orin Nano 8GB has a 1024-core Ampere GPU and 8 GB of LPDDR5 shared between CPU and GPU. At 40 TOPS, it outperforms the Raspberry Pi 5 significantly for AI inference but is more expensive (~$250) and consumes more power (7-15W). It runs llama.cpp with CUDA acceleration and handles 3B models at 15-25 tok/s - roughly twice the Raspberry Pi 5 speed. I don't have comprehensive Jetson benchmarks for all models in this table, so the primary hardware reference columns are iPhone 16 Pro, MacBook M4 Air 16 GB, and Raspberry Pi 5.
Benchmark Explainers
The quality benchmarks in the main table come from official model cards and published technical reports. Here is what each measures:
MMLU (Massive Multitask Language Understanding): 57 subjects spanning STEM, humanities, social sciences. Tests breadth of knowledge. A good proxy for general-purpose usefulness. Classic 5-shot format, 0-100% scale.
IFEval (Instruction Following Evaluation): Tests whether a model can follow explicit formatting instructions like "write 150 words" or "use bullet points." Critical for on-device applications where output format matters. Scores are prompt-level accuracy, 0-100%.
GSM8K: Grade-school math word problems requiring multi-step arithmetic. 8-shot chain-of-thought. A solid proxy for reasoning ability at this size class.
Tokens/sec figures are for Q4_K_M GGUF quantization via llama.cpp unless otherwise noted; iPhone numbers use the Core ML or AI Edge SDK path instead. All throughput numbers are generation speed (output tokens per second), not prefill/prompt processing speed. Where I have multiple sources, I report the median or note the variance.
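For readers reproducing these numbers, the measurement convention can be sketched as a backend-agnostic timing harness. Here `generate_fn` is a stand-in for whatever streaming API you use (llama-cpp-python, MLX, ExecuTorch); the key detail is that timing starts at the first generated token, so prefill is excluded:

```python
import time

def generation_tok_s(generate_fn, prompt: str, max_tokens: int) -> float:
    """Measure decode speed only. generate_fn must return an iterator that
    yields tokens one at a time; prefill is excluded by starting the clock
    when the first token arrives."""
    stream = generate_fn(prompt, max_tokens)
    next(stream)                      # prefill ends at the first token
    start = time.perf_counter()
    count = 1
    for _ in stream:
        count += 1
    elapsed = time.perf_counter() - start
    return (count - 1) / elapsed if elapsed > 0 else float("inf")
```

Reporting prefill and decode separately matters: a model with slow prompt processing can still post a good decode tok/s, and vice versa.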
Rankings
| Rank | Model | Params | MMLU | IFEval | GSM8K | tok/s iPhone 16 Pro | tok/s MacBook M4 Air 16 GB | tok/s Raspberry Pi 5 | Min RAM (Q4) | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemma 3 4B IT | 4B | 43.6 | 73.4 | 89.2 | ~27 | ~28 | ~6 | ~3.0 GB | Best phone model; Google AI Edge optimized |
| 2 | Phi-4-Mini | 3.8B | 52.8 | 68.1 | 88.6 | ~20 | ~22 | ~5.5 | ~2.5 GB | Top reasoning at sub-4B; ARC-C 83.7% |
| 3 | Qwen 2.5 3B | 3B | 65.6 | 58.3 | 79.1 | ~22 | ~26 | ~8 | ~2.0 GB | Best MMLU under 4B; strong all-rounder |
| 4 | Gemma 3 1B IT | 1B | 32.8 | 54.2 | 62.8 | ~55 | ~62 | ~14 | ~0.8 GB | Fastest useful model on phone; lightest footprint |
| 5 | Phi-3.5-Mini | 3.8B | 69.0 | 62.3 | 86.5 | ~18 | ~21 | ~5 | ~2.5 GB | High MMLU; slower than Phi-4-Mini on same hardware |
| 6 | SmolLM3-3B | 3B | 62.4 | 60.1 | 78.4 | ~25 | ~28 | ~9 | ~2.0 GB | Strong for size; Apache 2.0; excellent for laptop |
| 7 | MiniCPM3-4B | 4B | 67.2 | 68.8 | 81.1 | Not reported | ~24 | ~5 | ~3.0 GB | Best MMLU at 4B; no official phone benchmarks |
| 8 | Llama 3.2 3B | 3B | 63.4 | 60.2 | 77.7 | ~23 | ~26 | ~8 | ~2.0 GB | Wide ecosystem; ExecuTorch support; solid baseline |
| 9 | Qwen 2.5 1.5B | 1.5B | 60.9 | 52.4 | 73.2 | ~38 | ~44 | ~11 | ~1.1 GB | Exceptional quality for 1.5B; best sub-2B option |
| 10 | Granite 3.1 MoE 3B (A800M) | 3B (800M active) | 55.3 | 56.2 | Not reported | Not reported | ~45 | ~18 | ~1.8 GB | MoE: fast on laptop/Pi; quality trades off slightly |
| 11 | SmolLM2-1.7B | 1.7B | 48.8 | 53.1 | 51.1 | ~42 | ~50 | ~13 | ~1.2 GB | Good for phones; lower GSM8K limits reasoning tasks |
| 12 | Llama 3.2 1B | 1B | 49.3 | 53.5 | 44.4 | ~52 | ~60 | ~16 | ~0.8 GB | Very fast; limited reasoning; good for summarization |
| 13 | MiniCPM-2B | 2B | 53.5 | Not reported | 53.8 | Not reported | ~38 | ~10 | ~1.5 GB | Solid reasoning for 2B; Chinese-English bilingual |
| 14 | TinyLlama-1.1B | 1.1B | 28.1 | Not reported | 30.4 | ~55 | ~64 | ~17 | ~0.8 GB | Fast but limited quality; largely superseded |
| 15 | MobileLLM-350M | 350M | Not reported | Not reported | Not reported | ~90 | ~110 | ~30 | ~0.4 GB | Research model; classification/completion only |
Notes on Scores and Sources
MMLU, IFEval, GSM8K scores are drawn from official model cards, published technical reports, and the Open LLM Leaderboard (Hugging Face) where available. For Gemma 3, I used Google's official Gemma 3 Technical Report. For Phi models, Microsoft's Phi-3.5 and Phi-4 technical reports. For Qwen models, Alibaba's Qwen 2.5 technical report. For MobileLLM, the original Meta paper at arXiv:2402.14905 - that model has no publicly reported MMLU or instruction-following scores in a format comparable to the others.
Tokens/sec on iPhone 16 Pro figures are from Qualcomm AI Hub published benchmarks for Snapdragon-accelerated models, Apple developer documentation for Core ML paths, and community benchmarks. Many models do not yet have official on-device phone benchmarks. Where I have only a single community data point, I mark it as approximate (~). Gemma 3 phone speeds come from Google AI Edge SDK documentation. SmolLM phone speeds come from Hugging Face's on-device benchmark posts.
MacBook M4 Air 16 GB figures use llama.cpp for consistency, measured at 4K context with Q4_K_M quantization. MLX will run 30-50% faster than these figures on Apple Silicon - see the MLX section below.
Raspberry Pi 5 figures are community-reported llama.cpp benchmarks, ARM NEON only (no GPU acceleration), Q4_K_M quantization, 4K context. Sources include the llama.cpp discussion thread #4167 and community benchmarks on the Raspberry Pi forums.
Apple OpenELM (1B/3B) is not in the main table because it underperforms Llama 3.2 at the same parameter counts on MMLU and GSM8K while offering no throughput advantage. OpenELM's contribution was the architecture research, not deployment-ready quality. Gemma 3 1B and Llama 3.2 1B are strictly better on-device choices.
StableLM 2 1.6B / Zephyr is not in the main table because its MMLU (45.1) and GSM8K (57.0) are below Qwen 2.5 1.5B on both axes, and its throughput profile is similar. No meaningful advantage in this lineup.
Qwen 2.5 0.5B was evaluated for inclusion. At 0.5B parameters and 37.2% MMLU, the quality is too low for useful real-world tasks. It runs at ~100 tok/s on a Raspberry Pi 5 but lacks the reasoning to do much beyond simple completion. Not included in the ranked table.
Best-for-Hardware Decision Matrix
If you know your target hardware and just want a recommendation:
| Use Case | Best Model | Why |
|---|---|---|
| Best ~1B model | Gemma 3 1B IT | 62.8% GSM8K, ~55 tok/s on iPhone; best quality at this size |
| Best on iPhone (iOS app) | Gemma 3 4B IT | Google AI Edge SDK, ~27 tok/s, best IFEval and GSM8K in the table |
| Best on Android flagship (12GB RAM) | Qwen 2.5 3B or Gemma 3 4B IT | High MMLU, MediaPipe or ONNX runtime |
| Best for MacBook M4 Air (no dGPU) | Phi-4-Mini or SmolLM3-3B | Quality/speed balance at 16 GB; MLX gives 30%+ boost |
| Best for Raspberry Pi 5 | Qwen 2.5 3B or Llama 3.2 3B | ~8 tok/s, solid reasoning, fits in 2 GB |
| Best quantized INT4 quality | MiniCPM3-4B | Highest MMLU at 4B (67.2); designed for quantization |
| Best MoE speed on Pi / laptop | Granite 3.1 MoE 3B | 800M active params; ~18 tok/s on Pi vs ~5-8 for dense 3B |
| Best for privacy-first mobile app | Llama 3.2 3B | Meta license, widest framework support, no telemetry |
Quantization Impact on Edge Hardware
Quantization is not optional on edge hardware - it is required. The question is which quantization level you can use before quality degrades noticeably.
At Q4_K_M (the llama.cpp default for on-device work), quality loss versus full BF16 precision is typically 1-3 MMLU points for models in the 1B-4B range. That is acceptable. Below Q4, the tradeoffs get worse fast.
| Quantization | Size vs FP16 | Quality vs FP16 | Practical Use |
|---|---|---|---|
| Q8_0 | 50% | ~0% loss | Best quality; only if RAM allows |
| Q4_K_M | 27% | 1-3 point MMLU loss | Standard for edge; recommended |
| Q4_0 | 25% | 2-4 point MMLU loss | Slightly faster than Q4_K_M; marginally worse quality |
| Q3_K_M | 20% | 4-7 point MMLU loss | Use only when Q4 doesn't fit |
| Q2_K | 14% | 8-15 point MMLU loss | Last resort; visible quality degradation |
INT4 (ONNX / Core ML / AI Edge): Dedicated on-device inference frameworks use their own INT4 quantization schemes tuned for specific NPU hardware. These are generally better than llama.cpp Q4_0 for phone deployments and sometimes match Q4_K_M. Always prefer the native framework's quantization path over GGUF when targeting phone NPUs.
For Raspberry Pi 5 specifically: Q4_K_M is the right call. Q3_K_M is noticeably worse and the speed gain (~5-10%) isn't worth it. Don't bother with Q2_K unless you're trying to fit a 3B model on a 4 GB Pi.
MLX on Apple Silicon - a Separate Lane
The throughput numbers in the main table use llama.cpp for consistency. If you are on a Mac, those numbers are the floor, not the ceiling.
MLX-LM - the LLM package built on MLX, Apple's own machine learning framework - consistently runs 30-50% faster than llama.cpp on Apple Silicon for token generation. The gains come from MLX's direct access to the unified memory architecture and its optimized Metal kernels. On a MacBook M4 Air 16 GB, that translates to roughly:
- Gemma 3 4B: ~28 tok/s llama.cpp → ~38-40 tok/s MLX
- Phi-4-Mini 3.8B: ~22 tok/s llama.cpp → ~30-33 tok/s MLX
- Qwen 2.5 3B: ~26 tok/s llama.cpp → ~34-38 tok/s MLX
- SmolLM3-3B: ~28 tok/s llama.cpp → ~36-40 tok/s MLX
For interactive use on a MacBook without a discrete GPU, use MLX. Install via `pip install mlx-lm` and run with `mlx_lm.generate`. Most models in this table have MLX-compatible versions available on Hugging Face.
The MLX vs llama.cpp throughput advantage narrows at larger model sizes (above 7B) but for the 1-4B range covered here, MLX is definitively faster.
Thermal Throttling - the Hidden Variable
Every number in this table assumes a cold device running a single inference. Real-world edge inference has a thermal component that doesn't show up in point-in-time benchmarks.
iPhone 16 Pro: At maximum NPU utilization, the A18 Pro will throttle to ~60-70% of peak performance after 2-3 minutes of continuous inference. A chatbot session with sustained back-and-forth will see 20-30% lower throughput by the fifth or sixth exchange. Google's AI Edge SDK includes throttle detection; model authors who test on-device should note sustained vs burst throughput separately.
MacBook M4 Air: The M4 Air has no fan. It relies entirely on passive heat dissipation. Under continuous sustained inference (e.g., batch processing 500 documents), it will throttle. For interactive use (one prompt at a time with natural pauses), this is rarely a problem. For batch workloads, the M4 Air will sustain roughly 70-80% of peak throughput indefinitely. The M4 Pro and above have fans and don't throttle.
Raspberry Pi 5: Without active cooling, the Pi 5 will throttle under sustained load. With a fan or heatsink, it sustains full clock speed. If you're doing serious edge inference on a Pi, buy the active cooler: it costs about $5 and roughly doubles sustained throughput.
Memory bandwidth is the real ceiling: On all these devices, token generation throughput is memory-bound. The formula is simple: generation speed in tok/s scales roughly as memory bandwidth / model size in bytes. A model that requires 2 GB at Q4 and runs on hardware with 40 GB/s effective memory bandwidth will generate tokens at roughly 20 tok/s before thermal and other overheads. When comparing hardware, look at the memory bandwidth spec.
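The back-of-envelope calculation above, as code. Note this is a ceiling, and real devices land well below it once compute, KV cache reads, and thermal overheads are included; the M4 Air bandwidth figure is from the laptop section, while the 2.5 GB model size is a hypothetical round number:

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed: every generated token must stream the
    full weight set through memory once."""
    return bandwidth_gb_s / model_gb

# The example from the text: 2 GB model, 40 GB/s effective bandwidth
print(bandwidth_ceiling_tok_s(40, 2.0))          # 20.0 tok/s ceiling

# MacBook M4 Air (120 GB/s) with a ~2.5 GB Q4 model
print(round(bandwidth_ceiling_tok_s(120, 2.5)))  # 48 tok/s, before overheads
```

The main table's M4 Air numbers (~22-28 tok/s for models of roughly that size) sit at about half the theoretical ceiling, which is the usual gap between this formula and practice.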
Methodology
Rankings in the main table weight quality and throughput roughly equally. The quality score is an average of MMLU, IFEval, and GSM8K where all three are reported. Models missing one or two benchmarks are ranked on the available evidence plus qualitative notes from model card evaluations.
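As an illustration of that weighting - not the exact scoring used for this table - an equally weighted combination over a few rows of the rankings might look like this; the min-max normalization and log-throughput choices are mine:

```python
import math

models = {
    # name: ((MMLU, IFEval, GSM8K), tok/s on MacBook M4 Air), from the table
    "Gemma 3 4B IT": ((43.6, 73.4, 89.2), 28),
    "Phi-4-Mini":    ((52.8, 68.1, 88.6), 22),
    "Qwen 2.5 3B":   ((65.6, 58.3, 79.1), 26),
    "Llama 3.2 1B":  ((49.3, 53.5, 44.4), 60),
}

def scores(models):
    """Average the quality benchmarks, min-max normalize quality and
    log-throughput across the field, then weight the two halves equally."""
    quality = {m: sum(q) / len(q) for m, (q, _) in models.items()}
    speed = {m: math.log(t) for m, (_, t) in models.items()}
    def norm(d):
        lo, hi = min(d.values()), max(d.values())
        return {m: (v - lo) / (hi - lo) for m, v in d.items()}
    nq, ns = norm(quality), norm(speed)
    return {m: 0.5 * nq[m] + 0.5 * ns[m] for m in models}

for name, s in sorted(scores(models).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.2f}")
```

Even this crude version puts Gemma 3 4B on top, though the exact ordering below it differs from the table because the published ranking also folds in qualitative notes.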
Throughput scores are from:
- iPhone 16 Pro: Qualcomm AI Hub published benchmarks, Google AI Edge SDK documentation, community benchmarks via ExecuTorch and Core ML paths
- MacBook M4 Air 16 GB: llama.cpp Q4_K_M, 4K context, measured or community-reported in the llama.cpp Apple Silicon benchmark thread
- Raspberry Pi 5: Community llama.cpp benchmarks, Q4_K_M, ARM NEON, no GPU, same discussion thread
Models were not re-tested in a controlled environment for this article. Numbers are drawn from public sources and are approximate within ~15-20%. Where I have multiple data points for the same model on the same hardware, I report the median.
Quality benchmark sources: Official model cards (Microsoft Phi-3.5, Phi-4-Mini technical report; Google Gemma 3 Technical Report; Alibaba Qwen 2.5 Technical Report; Meta Llama 3.2 Model Card; IBM Granite model cards); Hugging Face Open LLM Leaderboard (where available); MobileLLM paper arXiv:2402.14905.
Caveats
Benchmark saturation at the top: MMLU and GSM8K are showing signs of saturation for models above 3B parameters. Phi-3.5-Mini's 69% MMLU and Qwen 2.5 3B's 65.6% are near the ceiling of what these benchmarks distinguish meaningfully at this parameter count. For ranking models in the 3-4B range, IFEval and task-specific evaluations increasingly matter more than MMLU.
No single benchmark represents real use: A model that scores 88.6% on GSM8K may still struggle with real-world word problems that require understanding context over multiple sentences. Benchmarks are proxies, not ground truth.
On-device throughput varies by prompt length: All throughput numbers assume 4K context. Longer prompts slow generation - at 16K context, expect 30-50% lower tok/s on these devices. The memory bandwidth constraint intensifies as the KV cache grows.
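The context penalty follows from the same memory-bandwidth model used earlier: each decoded token streams the weights plus the current KV cache. The 2 GB / 0.5 GB figures below are hypothetical round numbers for a ~3B Q4 model, with the KV cache scaling 4x from 4K to 16K context:

```python
def decode_tok_s(bandwidth_gb_s: float, weights_gb: float, kv_gb: float) -> float:
    """Decode speed when each token streams weights plus KV cache."""
    return bandwidth_gb_s / (weights_gb + kv_gb)

# Hypothetical 3B model on 40 GB/s hardware: ~2 GB weights,
# ~0.5 GB KV cache at 4K context, so ~2 GB at 16K.
at_4k = decode_tok_s(40, 2.0, 0.5)    # 16.0 tok/s
at_16k = decode_tok_s(40, 2.0, 2.0)   # 10.0 tok/s
print(f"slowdown at 16K: {1 - at_16k / at_4k:.0%}")
```

That works out to a slowdown in the high 30s of percent, consistent with the 30-50% range quoted above.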
Framework maturity differs: Gemma 3 has first-party support in Google AI Edge with production-quality Android and iOS deployment. Phi-4-Mini has ONNX export and ExecuTorch paths but these are less mature than the Gemma 3 mobile stack. SmolLM3 is primarily a research release with llama.cpp as the main deployment path. Factor framework support into your decision if you're shipping a product.
MoE caveats: The Granite 3.1 MoE 3B has 800M active parameters but loads all 3B of weights. On a Raspberry Pi with 8 GB RAM, it fits, and the active-parameter count means it's fast - but the full weight load is still in memory. "800M active" does not mean "800M model size."
Quantization quality loss is model-specific: The general Q4_K_M quality loss estimate of 1-3 MMLU points is an average. Some models degrade more gracefully than others. MiniCPM3-4B was specifically designed with quantization in mind and reportedly loses less than 1 MMLU point at Q4. Phi models are reported to be more sensitive; verify against the model card before deploying at Q3 or below.
What to Watch
The trajectory for edge LLM quality is steep. Gemma 3n - Google's architecture designed specifically for on-device inference using Per-Layer Embeddings - represents a design philosophy where the model is purpose-built for constrained memory rather than adapted from a larger architecture. The E4B variant has 8B total parameters but a 4B memory footprint; it is not in this table because its on-device throughput benchmarks on the specific hardware above are not yet comprehensively published, but watch for it in the next revision.
Apple's Foundation Models (~3B on-device) and its AI infrastructure work point toward tighter hardware-software co-design in the next generation of Apple Silicon. The on-device model baked into iOS 18/macOS 15 already handles basic tasks; the question is how much Apple opens that stack to third-party applications.
Qualcomm's AI Hub is the most useful single resource for Snapdragon-specific benchmarks. If you're targeting Android with a Snapdragon chip, check there before deciding on a model.
For inference tooling on these devices, see the best open-source LLM inference servers overview, and for broader workstation-class setup recommendations, the best AI home workstations guide.
Sources
- Gemma 3 Technical Report - Google (arXiv)
- Phi-4-Mini Technical Report - Microsoft (arXiv)
- Phi-3.5-Mini-Instruct - Hugging Face
- Qwen 2.5 Technical Report - Alibaba (arXiv)
- Llama 3.2 Model Card - Meta
- SmolLM3-3B - Hugging Face
- SmolLM2-1.7B - Hugging Face
- MiniCPM3-4B - Hugging Face
- Granite 3.1 MoE 3B A800M Instruct - Hugging Face
- MobileLLM Paper - Meta (arXiv:2402.14905)
- MobileLLM-350M - Hugging Face
- TinyLlama-1.1B - Hugging Face
- Qualcomm AI Hub - On-Device Models
- MLX-LM - GitHub
- llama.cpp Apple Silicon Benchmarks - Discussion #4167
- llama.cpp CUDA Benchmarks - Discussion #15013
