Edge and Mobile LLM Leaderboard 2026
Rankings of the best LLMs for on-device edge inference - phones, laptops without GPUs, Raspberry Pi, and Jetson - scored by quality benchmarks and real tokens/sec on iPhone, MacBook, and Raspberry Pi 5.

Edge hardware for on-device LLM inference: from Raspberry Pi 5 to iPhone 16 Pro to MacBook M4 Air
Cloud inference is the default. Send tokens out, get tokens back. For casual use that is fine, but there are real reasons to run models locally on constrained hardware - privacy, offline capability, latency, and cost. A phone that processes your medical notes without ever talking to a server is a different product than one that uploads them to an API. A Raspberry Pi running a local assistant at a remote site without reliable internet access solves a problem that cloud inference cannot.
This leaderboard covers the tier that the home GPU leaderboard and the small language model leaderboard don't quite address: models optimized for on-device edge inference on hardware that has no discrete GPU, limited RAM, and significant thermal constraints. We're talking about mobile SoCs like the Apple A17/A18 Pro and Qualcomm Snapdragon 8 Elite, low-power laptop silicon like the Apple M4 (base configuration), single-board computers like the Raspberry Pi 5, and embedded AI boards like the NVIDIA Jetson Orin Nano.
The metric that matters here is different from the GPU leaderboard. On a workstation with an RTX 4090, the bottleneck is memory bandwidth. On an iPhone or a Raspberry Pi, the constraint is thermal headroom - the device will throttle after a few minutes of sustained load - plus the interplay between NPU acceleration, power draw, and the number of tokens you can generate per second before the user gives up.
TL;DR
- Gemma 3 4B is the best all-around edge model: best-in-class IFEval (73.4), the highest GSM8K in the table (89.2), and the fastest iPhone 16 Pro throughput in its size class at ~27 tok/s with the Google AI Edge SDK
- Phi-4-Mini (3.8B) is the strongest reasoning pick below 4B, with 88.6% GSM8K and 83.7% ARC-C, and runs at ~22 tok/s on MacBook M4 Air via llama.cpp
- Qwen 2.5 3B delivers the best quality-per-parameter ratio under 4B with solid MMLU (65.6) and acceptable Raspberry Pi 5 performance at ~8 tok/s
- MobileLLM-350M is the only sub-1B model with credible architecture research behind it; TinyLlama-1.1B is largely obsolete for quality tasks
What This Leaderboard Covers
This is specifically an on-device edge inference ranking. The models here must be capable of running on hardware with one or more of these constraints:
- Mobile SoCs (Apple A17/A18 Pro, Qualcomm Snapdragon 8 Elite / Gen 3, MediaTek Dimensity 9300)
- Low-power laptop silicon without discrete GPU (Apple M4 base 8-16 GB, Intel Core Ultra with iGPU)
- Single-board computers (Raspberry Pi 5 with 8 GB RAM, no GPU)
- Embedded AI hardware (NVIDIA Jetson Orin Nano 8GB, Qualcomm Robotics RB3)
Not in scope: workstation GPUs (RTX series, AMD RX series), Mac Pro / Mac Studio configurations, or any setup requiring more than 16 GB of RAM for a laptop. If you want those rankings, see the home GPU leaderboard.
The general small language model leaderboard covers quality benchmarks for models under 10B parameters. This leaderboard is the performance complement: which of those models actually run at useful speeds on the specific hardware most people carry in their pockets or put in the field.
Hardware Tiers Explained
Different edge hardware has very different characteristics, and the tier you're targeting significantly changes which model is the right choice.
Phone SoC (Apple A17/A18 Pro, Snapdragon 8 Elite)
Flagship smartphones in 2026 ship with NPUs rated at roughly 35-45 TOPS: the Apple A18 Pro neural engine at 35 TOPS, the Qualcomm Snapdragon 8 Elite at 45 TOPS. Both have tight thermal budgets - sustained AI inference at maximum NPU utilization will trigger throttling within 2-3 minutes, a shorter practical working window than most people expect.
On-device RAM is shared between the OS, applications, and model weights. An iPhone 16 Pro with 8 GB of total RAM realistically has 3-4 GB available for model weights after the OS takes its share. That limits viable models to 3B parameters at Q4 quantization, or smaller. Android flagships with 12-16 GB of RAM have more headroom - a Snapdragon 8 Elite device with 12 GB can hold a Q4-quantized 4B model with room for context.
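To make the RAM budgeting concrete, here is a rough footprint estimator. The ~4.85 bits/weight average for Q4_K_M and the FP16 KV cache layout are my approximations, and the 28-layer / 8-KV-head / 128-dim example architecture is hypothetical, not any specific model:

```python
# Rough memory-budget estimator for a Q4-quantized model on a phone or SBC.
# Assumptions (mine, not vendor specs): Q4_K_M averages ~4.85 bits/weight,
# and the KV cache is stored in FP16 (2 bytes per element).

def model_footprint_gb(params_b: float, bits_per_weight: float = 4.85) -> float:
    """Approximate weight memory in GB for a quantized model."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim x context."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 1e9

# A 3B model at Q4_K_M: ~1.8 GB of weights...
weights = model_footprint_gb(3.0)
# ...plus KV cache for a hypothetical 28-layer, 8-KV-head, 128-dim model at 4K:
cache = kv_cache_gb(28, 8, 128, 4096)
print(f"weights ~{weights:.1f} GB, KV cache ~{cache:.2f} GB")
```

The weights-plus-cache total is what has to fit inside that 3-4 GB phone budget, which is why 3B at Q4 is the practical ceiling on an 8 GB iPhone.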
Frameworks: Apple's Core ML / AI Edge SDK, Google's MediaPipe LLM inference API, Qualcomm's AI Hub, Meta's ExecuTorch.
Low-Power Laptop Silicon (Apple M4 Air 16 GB, Intel Core Ultra iGPU)
The Apple M4 base chip (not Pro, not Max) has a 10-core GPU and ships with 8 or 16 GB of unified memory. At 16 GB, it comfortably holds a Q4-quantized 4B model and can fit a 7-8B model in a pinch. Memory bandwidth is 120 GB/s - well below the M4 Pro's 273 GB/s - which directly limits token generation speed.
This tier is the most practical for local developer use: a laptop that isn't a workstation, running llama.cpp or MLX-LM with no discrete GPU. Intel Core Ultra 200V series laptops (Lunar Lake) with Xe2 iGPU also fall here, though Intel's AI inference stack for LLMs is less mature than Apple's.
Raspberry Pi 5 (8 GB RAM, no GPU)
The Raspberry Pi 5 has a quad-core ARM Cortex-A76 CPU running at 2.4 GHz and up to 8 GB LPDDR4X RAM. There is no GPU acceleration for general ML inference - everything runs on CPU via llama.cpp with ARM NEON SIMD. The practical ceiling is a 3B model at Q4 quantization (~2 GB), which generates tokens at 6-12 tok/s depending on context length. That is slow but usable for async tasks, batch processing, or applications where latency is not user-facing.
The Raspberry Pi 5 is not a great interactive inference device. It is, however, the dominant hardware in edge IoT deployments, robotics, and offline field computing. The question for this tier is not "is it fast enough for chat?" but "can it run a useful inference task at all?"
Jetson Orin Nano 8GB
The NVIDIA Jetson Orin Nano 8GB has a 1024-core Ampere GPU and 8 GB of LPDDR5 shared between CPU and GPU. At 40 TOPS, it outperforms the Raspberry Pi 5 significantly for AI inference but is more expensive (~$250) and consumes more power (7-15W). It runs llama.cpp with CUDA acceleration and handles 3B models at 15-25 tok/s - roughly twice the Raspberry Pi 5 speed. I don't have comprehensive Jetson benchmarks for all models in this table, so the primary hardware reference columns are iPhone 16 Pro, MacBook M4 Air 16 GB, and Raspberry Pi 5.
Benchmark Explainers
The quality benchmarks in the main table come from official model cards and published technical reports. Here is what each measures:
MMLU (Massive Multitask Language Understanding): 57 subjects spanning STEM, humanities, social sciences. Tests breadth of knowledge. A good proxy for general-purpose usefulness. Classic 5-shot format, 0-100% scale.
IFEval (Instruction Following Evaluation): Tests whether a model can follow explicit formatting instructions like "write 150 words" or "use bullet points." Critical for on-device applications where output format matters. Scores are prompt-level accuracy, 0-100%.
GSM8K: Grade-school math word problems requiring multi-step arithmetic. 8-shot chain-of-thought. A solid proxy for reasoning ability at this size class.
Tokens/sec figures are for Q4_K_M GGUF quantization via llama.cpp unless otherwise noted; iPhone numbers use the Core ML or AI Edge SDK path instead. All throughput numbers are generation speed (output tokens per second), not prefill/prompt processing speed. Where I have multiple sources, I report the median or note the variance.
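For readers reproducing these numbers, the measurement convention can be sketched as a backend-agnostic timing harness. Here `generate_fn` is a stand-in for whatever streaming API you use (llama-cpp-python, MLX, ExecuTorch); the key detail is that timing starts at the first generated token, so prefill is excluded:

```python
import time

def generation_tok_s(generate_fn, prompt: str, max_tokens: int) -> float:
    """Measure decode speed only. generate_fn must return an iterator that
    yields tokens one at a time; prefill is excluded by starting the clock
    when the first token arrives."""
    stream = generate_fn(prompt, max_tokens)
    next(stream)                      # prefill ends at the first token
    start = time.perf_counter()
    count = 1
    for _ in stream:
        count += 1
    elapsed = time.perf_counter() - start
    return (count - 1) / elapsed if elapsed > 0 else float("inf")
```

Reporting prefill and decode separately matters: a model with slow prompt processing can still post a good decode tok/s, and vice versa.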
Rankings
| Rank | Model | Params | MMLU | IFEval | GSM8K | tok/s iPhone 16 Pro | tok/s MacBook M4 Air 16 GB | tok/s Raspberry Pi 5 | Min RAM (Q4) | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemma 3 4B IT | 4B | 43.6 | 73.4 | 89.2 | ~27 | ~28 | ~6 | ~3.0 GB | Best phone model; Google AI Edge optimized |
| 2 | Phi-4-Mini | 3.8B | 52.8 | 68.1 | 88.6 | ~20 | ~22 | ~5.5 | ~2.5 GB | Top reasoning at sub-4B; ARC-C 83.7% |
| 3 | Qwen 2.5 3B | 3B | 65.6 | 58.3 | 79.1 | ~22 | ~26 | ~8 | ~2.0 GB | Best MMLU under 4B; strong all-rounder |
| 4 | Gemma 3 1B IT | 1B | 32.8 | 54.2 | 62.8 | ~55 | ~62 | ~14 | ~0.8 GB | Fastest useful model on phone; lightest footprint |
| 5 | Phi-3.5-Mini | 3.8B | 69.0 | 62.3 | 86.5 | ~18 | ~21 | ~5 | ~2.5 GB | High MMLU; slower than Phi-4-Mini on same hardware |
| 6 | SmolLM3-3B | 3B | 62.4 | 60.1 | 78.4 | ~25 | ~28 | ~9 | ~2.0 GB | Strong for size; Apache 2.0; excellent for laptop |
| 7 | MiniCPM3-4B | 4B | 67.2 | 68.8 | 81.1 | Not reported | ~24 | ~5 | ~3.0 GB | Best MMLU at 4B; no official phone benchmarks |
| 8 | Llama 3.2 3B | 3B | 63.4 | 60.2 | 77.7 | ~23 | ~26 | ~8 | ~2.0 GB | Wide ecosystem; ExecuTorch support; solid baseline |
| 9 | Qwen 2.5 1.5B | 1.5B | 60.9 | 52.4 | 73.2 | ~38 | ~44 | ~11 | ~1.1 GB | Exceptional quality for 1.5B; best sub-2B option |
| 10 | Granite 3.1 MoE 3B (A800M) | 3B (800M active) | 55.3 | 56.2 | Not reported | Not reported | ~45 | ~18 | ~1.8 GB | MoE: fast on laptop/Pi; quality trades off slightly |
| 11 | SmolLM2-1.7B | 1.7B | 48.8 | 53.1 | 51.1 | ~42 | ~50 | ~13 | ~1.2 GB | Good for phones; lower GSM8K limits reasoning tasks |
| 12 | Llama 3.2 1B | 1B | 49.3 | 53.5 | 44.4 | ~52 | ~60 | ~16 | ~0.8 GB | Very fast; limited reasoning; good for summarization |
| 13 | MiniCPM-2B | 2B | 53.5 | Not reported | 53.8 | Not reported | ~38 | ~10 | ~1.5 GB | Solid reasoning for 2B; Chinese-English bilingual |
| 14 | TinyLlama-1.1B | 1.1B | 28.1 | Not reported | 30.4 | ~55 | ~64 | ~17 | ~0.8 GB | Fast but limited quality; largely superseded |
| 15 | MobileLLM-350M | 350M | Not reported | Not reported | Not reported | ~90 | ~110 | ~30 | ~0.4 GB | Research model; classification/completion only |
Notes on Scores and Sources
MMLU, IFEval, GSM8K scores are drawn from official model cards, published technical reports, and the Open LLM Leaderboard (Hugging Face) where available. For Gemma 3, I used Google's official Gemma 3 Technical Report. For Phi models, Microsoft's Phi-3.5 and Phi-4 technical reports. For Qwen models, Alibaba's Qwen 2.5 technical report. For MobileLLM, the original Meta paper at arXiv:2402.14905 - that model has no publicly reported MMLU or instruction-following scores in a format comparable to the others.
Tokens/sec on iPhone 16 Pro figures are from Qualcomm AI Hub published benchmarks for Snapdragon-accelerated models, Apple developer documentation for Core ML paths, and community benchmarks. Many models do not yet have official on-device phone benchmarks. Where I have only a single community data point, I mark it as approximate (~). Gemma 3 phone speeds come from Google AI Edge SDK documentation. SmolLM phone speeds come from Hugging Face's on-device benchmark posts.
MacBook M4 Air 16 GB figures use llama.cpp for consistency, measured at 4K context with Q4_K_M quantization. MLX will run 30-50% faster than these figures on Apple Silicon - see the MLX section below.
Raspberry Pi 5 figures are community-reported llama.cpp benchmarks, ARM NEON only (no GPU acceleration), Q4_K_M quantization, 4K context. Sources include the llama.cpp discussion thread #4167 and community benchmarks on the Raspberry Pi forums.
Apple OpenELM (1B/3B) is not in the main table because it underperforms Llama 3.2 at the same parameter counts on MMLU and GSM8K while offering no throughput advantage. OpenELM's contribution was the architecture research, not deployment-ready quality. Gemma 3 1B and Llama 3.2 1B are strictly better on-device choices.
StableLM 2 1.6B / Zephyr is not in the main table because its MMLU (45.1) and GSM8K (57.0) are below Qwen 2.5 1.5B on both axes, and its throughput profile is similar. No meaningful advantage in this lineup.
Qwen 2.5 0.5B was evaluated for inclusion. At 0.5B parameters and 37.2% MMLU, the quality is too low for useful real-world tasks. It runs at ~100 tok/s on a Raspberry Pi 5 but lacks the reasoning to do much beyond simple completion. Not included in the ranked table.
Best-for-Hardware Decision Matrix
If you know your target hardware and just want a recommendation:
| Use Case | Best Model | Why |
|---|---|---|
| Best ~1B model | Gemma 3 1B IT | 62.8% GSM8K, ~55 tok/s on iPhone; best quality at this size |
| Best on iPhone (iOS app) | Gemma 3 4B IT | Google AI Edge SDK, ~27 tok/s, best IFEval and GSM8K in the table |
| Best on Android flagship (12GB RAM) | Qwen 2.5 3B or Gemma 3 4B IT | High MMLU, MediaPipe or ONNX runtime |
| Best for MacBook M4 Air (no dGPU) | Phi-4-Mini or SmolLM3-3B | Quality/speed balance at 16 GB; MLX gives 30%+ boost |
| Best for Raspberry Pi 5 | Qwen 2.5 3B or Llama 3.2 3B | ~8 tok/s, solid reasoning, fits in 2 GB |
| Best quantized INT4 quality | MiniCPM3-4B | Highest MMLU at 4B (67.2); designed for quantization |
| Best MoE speed on Pi / laptop | Granite 3.1 MoE 3B | 800M active params; ~18 tok/s on Pi vs ~5-8 for dense 3B |
| Best for privacy-first mobile app | Llama 3.2 3B | Meta license, widest framework support, no telemetry |
Quantization Impact on Edge Hardware
Quantization is not optional on edge hardware - it is required. The question is which quantization level you can use before quality degrades noticeably.
At Q4_K_M (the llama.cpp default for on-device work), quality loss versus full BF16 precision is typically 1-3 MMLU points for models in the 1B-4B range. That is acceptable. Below Q4, the tradeoffs get worse fast.
| Quantization | Size vs FP16 | Quality vs FP16 | Practical Use |
|---|---|---|---|
| Q8_0 | 50% | ~0% loss | Best quality; only if RAM allows |
| Q4_K_M | 27% | 1-3 point MMLU loss | Standard for edge; recommended |
| Q4_0 | 25% | 2-4 point MMLU loss | Slightly faster than Q4_K_M; marginally worse quality |
| Q3_K_M | 20% | 4-7 point MMLU loss | Use only when Q4 doesn't fit |
| Q2_K | 14% | 8-15 point MMLU loss | Last resort; visible quality degradation |
INT4 (ONNX / Core ML / AI Edge): Dedicated on-device inference frameworks use their own INT4 quantization schemes tuned for specific NPU hardware. These are generally better than llama.cpp Q4_0 for phone deployments and sometimes match Q4_K_M. Always prefer the native framework's quantization path over GGUF when targeting phone NPUs.
For Raspberry Pi 5 specifically: Q4_K_M is the right call. Q3_K_M is noticeably worse and the speed gain (~5-10%) isn't worth it. Don't bother with Q2_K unless you're trying to fit a 3B model on a 4 GB Pi.
MLX on Apple Silicon - a Separate Lane
The throughput numbers in the main table use llama.cpp for consistency. If you are on a Mac, those numbers are the floor, not the ceiling.
MLX-LM - the LLM package built on MLX, Apple's own machine learning framework - consistently runs 30-50% faster than llama.cpp on Apple Silicon for token generation. The gains come from MLX's direct access to the unified memory architecture and its optimized Metal kernels. On a MacBook M4 Air 16 GB, that translates to roughly:
- Gemma 3 4B: ~28 tok/s llama.cpp → ~38-40 tok/s MLX
- Phi-4-Mini 3.8B: ~22 tok/s llama.cpp → ~30-33 tok/s MLX
- Qwen 2.5 3B: ~26 tok/s llama.cpp → ~34-38 tok/s MLX
- SmolLM3-3B: ~28 tok/s llama.cpp → ~36-40 tok/s MLX
For interactive use on a MacBook without a discrete GPU, use MLX. Install via `pip install mlx-lm` and run with `mlx_lm.generate`. Most models in this table have MLX-compatible versions available on Hugging Face.
The MLX vs llama.cpp throughput advantage narrows at larger model sizes (above 7B) but for the 1-4B range covered here, MLX is definitively faster.
Thermal Throttling - the Hidden Variable
Every number in this table assumes a cold device running a single inference. Real-world edge inference has a thermal component that doesn't show up in point-in-time benchmarks.
iPhone 16 Pro: At maximum NPU utilization, the A18 Pro will throttle to ~60-70% of peak performance after 2-3 minutes of continuous inference. A chatbot session with sustained back-and-forth will see 20-30% lower throughput by the fifth or sixth exchange. Google's AI Edge SDK includes throttle detection; model authors who test on-device should note sustained vs burst throughput separately.
MacBook M4 Air: The M4 Air has no fan. It relies entirely on passive heat dissipation. Under continuous sustained inference (e.g., batch processing 500 documents), it will throttle. For interactive use (one prompt at a time with natural pauses), this is rarely a problem. For batch workloads, the M4 Air will sustain roughly 70-80% of peak throughput indefinitely. The M4 Pro and above have fans and don't throttle.
Raspberry Pi 5: Without active cooling, the Pi 5 will throttle under sustained load. With a fan or heatsink, it sustains full clock speed. If you're doing serious edge inference on a Pi, buy the active cooler: it costs about $5 and roughly doubles sustained throughput.
Memory bandwidth is the real ceiling: On all these devices, token generation throughput is memory-bound. The formula is simple: generation speed in tok/s scales roughly as memory bandwidth / model size in bytes. A model that requires 2 GB at Q4 and runs on hardware with 40 GB/s effective memory bandwidth will generate tokens at roughly 20 tok/s before thermal and other overheads. When comparing hardware, look at the memory bandwidth spec.
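The back-of-envelope calculation above, as code. Note this is a ceiling, and real devices land well below it once compute, KV cache reads, and thermal overheads are included; the M4 Air bandwidth figure is from the laptop section, while the 2.5 GB model size is a hypothetical round number:

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed: every generated token must stream the
    full weight set through memory once."""
    return bandwidth_gb_s / model_gb

# The example from the text: 2 GB model, 40 GB/s effective bandwidth
print(bandwidth_ceiling_tok_s(40, 2.0))          # 20.0 tok/s ceiling

# MacBook M4 Air (120 GB/s) with a ~2.5 GB Q4 model
print(round(bandwidth_ceiling_tok_s(120, 2.5)))  # 48 tok/s, before overheads
```

The main table's M4 Air numbers (~22-28 tok/s for models of roughly that size) sit at about half the theoretical ceiling, which is the usual gap between this formula and practice.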
Methodology
Rankings in the main table weight quality and throughput roughly equally. The quality score is an average of MMLU, IFEval, and GSM8K where all three are reported. Models missing one or two benchmarks are ranked on the available evidence plus qualitative notes from model card evaluations.
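As an illustration of that weighting - not the exact scoring used for this table - an equally weighted combination over a few rows of the rankings might look like this; the min-max normalization and log-throughput choices are mine:

```python
import math

models = {
    # name: ((MMLU, IFEval, GSM8K), tok/s on MacBook M4 Air), from the table
    "Gemma 3 4B IT": ((43.6, 73.4, 89.2), 28),
    "Phi-4-Mini":    ((52.8, 68.1, 88.6), 22),
    "Qwen 2.5 3B":   ((65.6, 58.3, 79.1), 26),
    "Llama 3.2 1B":  ((49.3, 53.5, 44.4), 60),
}

def scores(models):
    """Average the quality benchmarks, min-max normalize quality and
    log-throughput across the field, then weight the two halves equally."""
    quality = {m: sum(q) / len(q) for m, (q, _) in models.items()}
    speed = {m: math.log(t) for m, (_, t) in models.items()}
    def norm(d):
        lo, hi = min(d.values()), max(d.values())
        return {m: (v - lo) / (hi - lo) for m, v in d.items()}
    nq, ns = norm(quality), norm(speed)
    return {m: 0.5 * nq[m] + 0.5 * ns[m] for m in models}

for name, s in sorted(scores(models).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.2f}")
```

Even this crude version puts Gemma 3 4B on top, though the exact ordering below it differs from the table because the published ranking also folds in qualitative notes.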
Throughput scores are from:
- iPhone 16 Pro: Qualcomm AI Hub published benchmarks, Google AI Edge SDK documentation, community benchmarks via ExecuTorch and Core ML paths
- MacBook M4 Air 16 GB: llama.cpp Q4_K_M, 4K context, measured or community-reported in the llama.cpp Apple Silicon benchmark thread
- Raspberry Pi 5: Community llama.cpp benchmarks, Q4_K_M, ARM NEON, no GPU, same discussion thread
Models were not re-tested in a controlled environment for this article. Numbers are drawn from public sources and are approximate within ~15-20%. Where I have multiple data points for the same model on the same hardware, I report the median.
Quality benchmark sources: Official model cards (Microsoft Phi-3.5, Phi-4-Mini technical report; Google Gemma 3 Technical Report; Alibaba Qwen 2.5 Technical Report; Meta Llama 3.2 Model Card; IBM Granite model cards); Hugging Face Open LLM Leaderboard (where available); MobileLLM paper arXiv:2402.14905.
Caveats
Benchmark saturation at the top: MMLU and GSM8K are showing signs of saturation for models above 3B parameters. Phi-3.5-Mini's 69% MMLU and Qwen 2.5 3B's 65.6% are near the ceiling of what these benchmarks distinguish meaningfully at this parameter count. For ranking models in the 3-4B range, IFEval and task-specific evaluations increasingly matter more than MMLU.
No single benchmark represents real use: A model that scores 88.6% on GSM8K may still struggle with real-world word problems that require understanding context over multiple sentences. Benchmarks are proxies, not ground truth.
On-device throughput varies by prompt length: All throughput numbers assume 4K context. Longer prompts slow generation - at 16K context, expect 30-50% lower tok/s on these devices. The memory bandwidth constraint intensifies as the KV cache grows.
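The context penalty follows from the same memory-bandwidth model used earlier: each decoded token streams the weights plus the current KV cache. The 2 GB / 0.5 GB figures below are hypothetical round numbers for a ~3B Q4 model, with the KV cache scaling 4x from 4K to 16K context:

```python
def decode_tok_s(bandwidth_gb_s: float, weights_gb: float, kv_gb: float) -> float:
    """Decode speed when each token streams weights plus KV cache."""
    return bandwidth_gb_s / (weights_gb + kv_gb)

# Hypothetical 3B model on 40 GB/s hardware: ~2 GB weights,
# ~0.5 GB KV cache at 4K context, so ~2 GB at 16K.
at_4k = decode_tok_s(40, 2.0, 0.5)    # 16.0 tok/s
at_16k = decode_tok_s(40, 2.0, 2.0)   # 10.0 tok/s
print(f"slowdown at 16K: {1 - at_16k / at_4k:.0%}")
```

That works out to a slowdown in the high 30s of percent, consistent with the 30-50% range quoted above.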
Framework maturity differs: Gemma 3 has first-party support in Google AI Edge with production-quality Android and iOS deployment. Phi-4-Mini has ONNX export and ExecuTorch paths but these are less mature than the Gemma 3 mobile stack. SmolLM3 is primarily a research release with llama.cpp as the main deployment path. Factor framework support into your decision if you're shipping a product.
MoE caveats: The Granite 3.1 MoE 3B has 800M active parameters but loads all 3B of weights. On a Raspberry Pi with 8 GB RAM, it fits, and the active-parameter count means it's fast - but the full weight load is still in memory. "800M active" does not mean "800M model size."
Quantization quality loss is model-specific: The general Q4_K_M quality loss estimate of 1-3 MMLU points is an average. Some models degrade more gracefully than others. MiniCPM3-4B was specifically designed with quantization in mind and reportedly loses less than 1 MMLU point at Q4. Phi models are reported to be more sensitive; verify against the model card before deploying at Q3 or below.
What to Watch
The trajectory for edge LLM quality is steep. Gemma 3n - Google's architecture designed specifically for on-device inference using Per-Layer Embeddings - represents a design philosophy where the model is purpose-built for constrained memory rather than adapted from a larger architecture. The E4B variant has 8B total parameters but a 4B memory footprint; it is not in this table because its on-device throughput benchmarks on the specific hardware above are not yet comprehensively published, but watch for it in the next revision.
Apple's Foundation Models (~3B on-device) and its AI infrastructure work point toward tighter hardware-software co-design in the next generation of Apple Silicon. The on-device model baked into iOS 18/macOS 15 already handles basic tasks; the question is how much Apple opens that stack to third-party applications.
Qualcomm's AI Hub is the most useful single resource for Snapdragon-specific benchmarks. If you're targeting Android with a Snapdragon chip, check there before deciding on a model.
For inference tooling on these devices, see the best open-source LLM inference servers overview, and for broader workstation-class setup recommendations, the best AI home workstations guide.
Sources
- Gemma 3 Technical Report - Google (arXiv)
- Phi-4-Mini Technical Report - Microsoft (arXiv)
- Phi-3.5-Mini-Instruct - Hugging Face
- Qwen 2.5 Technical Report - Alibaba (arXiv)
- Llama 3.2 Model Card - Meta
- SmolLM3-3B - Hugging Face
- SmolLM2-1.7B - Hugging Face
- MiniCPM3-4B - Hugging Face
- Granite 3.1 MoE 3B A800M Instruct - Hugging Face
- MobileLLM Paper - Meta (arXiv:2402.14905)
- MobileLLM-350M - Hugging Face
- TinyLlama-1.1B - Hugging Face
- Qualcomm AI Hub - On-Device Models
- MLX-LM - GitHub
- llama.cpp Apple Silicon Benchmarks - Discussion #4167
- llama.cpp CUDA Benchmarks - Discussion #15013
