Home GPU LLM Leaderboard: Best Open Source Models by VRAM Tier with Token/s Benchmarks
Rankings of the best open source LLMs you can run on home hardware - RTX 4090, RTX 3090, Apple M3/M4 Max - organized by VRAM tier with real-world token/s benchmarks and quality scores.

Running large language models locally has gone from a niche hobby to a practical reality. With the right hardware and the right model, you can get genuinely useful AI inference on a consumer GPU or a Mac without sending a single token to the cloud. But the landscape is overwhelming: dozens of model families, multiple quantization formats, and wildly different performance characteristics across hardware. This leaderboard cuts through the noise.
I have organized everything by VRAM tier - because that is the hard constraint that determines what you can actually run. For each tier, I rank the top models by quality across coding, reasoning, and general knowledge benchmarks, then show real-world token generation speeds on popular consumer hardware so you know exactly what to expect.
The Hardware People Are Actually Using
Before diving into models, here is the consumer hardware landscape for local LLM inference in 2026. The critical spec is not compute power - it is memory bandwidth. Token generation is memory-bound: the GPU spends most of its time loading model weights from VRAM, not computing. Higher bandwidth means faster tokens.
| GPU / Chip | VRAM | Bandwidth | Street Price (USD) | Notes |
|---|---|---|---|---|
| RTX 5090 | 32 GB GDDR7 | 1,790 GB/s | ~$2,000 | New king. 77% more bandwidth than 4090 |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | ~$1,600 | Still the enthusiast standard |
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | ~$800 used | Best value for 24GB VRAM |
| RTX 4080 | 16 GB GDDR6X | 717 GB/s | ~$900 | Good middle ground |
| RTX 4070 Ti Super | 16 GB GDDR6X | 672 GB/s | ~$750 | Popular 16GB option |
| RTX 4060 | 8 GB GDDR6 | 272 GB/s | ~$300 | Budget entry point |
| RX 7900 XTX | 24 GB GDDR6 | 960 GB/s | ~$750 | AMD's best; ROCm/Vulkan support improving |
| RX 9070 XT | 16 GB GDDR6 | ~640 GB/s | ~$550 | RDNA 4, doubled AI throughput per CU |
| M4 Max (40-core GPU) | 36-128 GB unified | 546 GB/s | ~$3,500+ (laptop) | Runs 70B models if you have 128GB |
| M3 Max (40-core GPU) | 36-128 GB unified | 400 GB/s | ~$2,800+ (laptop) | Slightly slower but same capacity |
| M4 Pro | 24-48 GB unified | 273 GB/s | ~$2,000+ (laptop) | Good for 14B-32B models |
| M2 Ultra | 64-192 GB unified | 800 GB/s | ~$4,000+ (Studio) | Can run anything that fits |
Key insight: The RTX 3090 has higher memory bandwidth (936 GB/s) than the newer RTX 4080 (717 GB/s). This is why it sometimes outperforms the 4080 in token generation despite being a generation older. For local LLM work, a used 3090 at $800 is one of the best deals on the market.
VRAM Tiers and What Fits
The amount of VRAM determines which models you can run. Quantization (compressing model weights from 16-bit to 4-bit or lower) is what makes large models fit on consumer hardware. Q4_K_M is the sweet spot: roughly 75% smaller than full precision with minimal quality loss.
| VRAM Tier | Model Sizes (Q4_K_M) | Example GPUs |
|---|---|---|
| 8 GB | Up to 7-8B | RTX 4060, RX 7600 |
| 12 GB | Up to 12-14B | RTX 4070, RTX 3060 12GB |
| 16 GB | Up to 14B comfortably, 20B tight | RTX 4080, RTX 4070 Ti Super, RX 9070 XT |
| 24 GB | Up to 32B (tight), or 14B at higher-precision quants | RTX 4090, RTX 3090, RX 7900 XTX |
| 32 GB | Up to 32B comfortably | RTX 5090, Mac M3 Pro 36GB |
| 48-64 GB | 70B (Q3-Q4) | Mac M3/M4 Max 64GB |
| 96-128 GB | 70B+ at higher-precision quants | Mac M3/M4 Max 96-128GB, M2 Ultra |
VRAM Requirements by Model and Quantization (8K Context)
| Model | Q3_K_M | Q4_K_M | Q5_K_M |
|---|---|---|---|
| Llama 3.2 3B | 2.7 GB | 3.6 GB | 4.2 GB |
| Qwen 3 4B | 3.2 GB | 4.0 GB | 4.8 GB |
| Llama 3.1 8B | 5.8 GB | 6.2 GB | 7.3 GB |
| Gemma 3 12B | 9.5 GB | 12.4 GB | 14.0 GB |
| Qwen 3 14B | 9.0 GB | 10.7 GB | 12.0 GB |
| Phi-4 14B | 8.5 GB | 11.0 GB | 12.5 GB |
| DeepSeek-R1 14B | 9.5 GB | 11.0 GB | 12.5 GB |
| Gemma 3 27B | 19.4 GB | 22.5 GB | 26.0 GB |
| Qwen 3 32B | 18.6 GB | 22.2 GB | 26.0 GB |
| Llama 3.3 70B | 37.5 GB | 45.6 GB | 55.0 GB |
| Qwen 2.5 72B | 40.9 GB | 50.5 GB | 62.0 GB |
Important caveat: These numbers are for 8K context. The KV cache scales linearly with context length and is the hidden VRAM killer. At 32K context, add 30-50% more VRAM. You can run a big model or a long context - rarely both on consumer hardware.
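If you want to sanity-check a model against your card before downloading anything, both numbers can be estimated with simple arithmetic: quantized weights at roughly 4.8 bits per weight for Q4_K_M, plus a KV cache of 2 x layers x KV heads x head dim x context length x bytes per element. A minimal Python sketch - the architecture constants below are illustrative values for a Llama-3.1-8B-shaped model, not measured figures:

```python
def quantized_weight_gb(params_b: float, bits_per_weight: float = 4.8) -> float:
    """Rough weight footprint: Q4_K_M averages ~4.8 bits/weight including scales."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) * layers * KV heads * head dim * tokens * element size."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Illustrative shape for an 8B-class model (32 layers, 8 KV heads via GQA, head_dim 128).
# These are assumptions for the sketch -- check the model card for real values.
weights = quantized_weight_gb(8.0)           # ~4.8 GB
kv_8k   = kv_cache_gb(32, 8, 128, 8_192)     # ~1.1 GB
kv_32k  = kv_cache_gb(32, 8, 128, 32_768)    # ~4.3 GB

print(f"weights ~{weights:.1f} GB, +KV@8K ~{kv_8k:.1f} GB, +KV@32K ~{kv_32k:.1f} GB")
```

Add a bit of runtime overhead and the 8K total lands right around the ~6.2 GB the table above lists for Llama 3.1 8B at Q4_K_M.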
Rankings by VRAM Tier
8 GB Tier (7-8B Models)
The entry-level tier. These models fit on any modern GPU and deliver surprisingly strong results for their size.
| Rank | Model | MMLU | GSM8K | MATH | HumanEval | GPQA | Best For |
|---|---|---|---|---|---|---|---|
| 1 | Qwen 3 8B | 76.9 | 89.8 | 60.8 | 67.7 | 44.4 | All-rounder, best overall quality |
| 2 | Llama 3.1 8B | 69.4 | 84.5 | 51.9 | 72.6 | 30.4 | Coding (highest HumanEval) |
| 3 | Gemma 3 4B | 59.6 | 38.4 | 24.2 | 36.0 | 15.0 | Ultralight, fits in 4GB |
Pick: Qwen 3 8B dominates this tier on reasoning and math. But if coding is your primary use case, Llama 3.1 8B's 72.6% HumanEval score is hard to beat at this size. For extremely constrained hardware (8GB total system RAM, no dedicated GPU), Gemma 3 4B is the only viable option and still handles basic tasks.
12-16 GB Tier (12-14B Models)
The sweet spot for most users. A 14B model with Q4 quantization delivers a dramatic quality jump over 8B while fitting on mainstream GPUs.
| Rank | Model | MMLU | GSM8K | MATH | HumanEval | GPQA | Best For |
|---|---|---|---|---|---|---|---|
| 1 | Qwen 3 14B | 81.1 | 92.5 | 62.0 | 72.2 | 39.9 | Best quality at this size |
| 2 | DeepSeek-R1 14B | ~79 | ~90 | ~65 | ~68 | ~38 | Reasoning chains, step-by-step |
| 3 | Phi-4 14B | ~80 | ~91 | ~63 | ~70 | ~36 | Microsoft's compact powerhouse |
Honorable mention: Gemma 3 12B scores lower on paper (MMLU 74.5, MATH 43.3) but excels at instruction-following and multilingual tasks. If you work in non-English languages, consider it.
Pick: Qwen 3 14B is the clear winner. It outperforms or matches models twice its size on math and reasoning benchmarks. On a 16GB GPU with Q4_K_M quantization, it uses about 10.7 GB, leaving headroom for context. DeepSeek-R1 14B is the pick if you specifically want chain-of-thought reasoning (it is a distilled version of the full DeepSeek-R1 reasoning model).
24 GB Tier (27-32B Models)
This is where things get interesting. A 32B model at Q4 barely fits in 24GB (22.2 GB for Qwen 3 32B), leaving very little room for context. A 27B model is more comfortable. The RTX 4090, RTX 3090, and RX 7900 XTX all have 24GB.
| Rank | Model | MMLU | GSM8K | MATH | HumanEval | GPQA | VRAM (Q4) |
|---|---|---|---|---|---|---|---|
| 1 | Qwen 3 32B | 83.6 | 93.4 | 61.6 | 72.1 | 49.5 | 22.2 GB |
| 2 | DeepSeek-R1 32B | ~82 | ~92 | ~68 | ~70 | ~46 | 22.0 GB |
| 3 | Gemma 3 27B | 78.6 | 82.6 | 50.0 | 48.8 | 24.3 | 22.5 GB |
The MoE wildcard: Qwen 3 30B-A3B. This is a Mixture-of-Experts model with 30B total parameters but only ~3B active per token. It fits in 24GB easily and runs at 196 tok/s on an RTX 4090 - faster than the 8B dense models. Quality is competitive with 14B dense models. If you value speed over peak quality, this is remarkable.
Pick: Qwen 3 32B is the best 24GB-tier model, but at 22.2 GB Q4_K_M, your context window is severely limited on a 24GB card. If you need longer context, drop to Gemma 3 27B (22.5 GB but more efficient KV cache) or the Qwen 3 30B MoE for the best speed/quality tradeoff. On the RTX 5090 (32GB), the Qwen 3 32B runs comfortably with room to spare.
48-64 GB Tier (70B Models)
You need a Mac with 64GB+ unified memory or dual GPUs to play here. The 70B class represents a massive quality leap - these models compete with frontier commercial APIs from a year ago.
| Rank | Model | MMLU | GSM8K | MATH | HumanEval | GPQA | VRAM (Q4) |
|---|---|---|---|---|---|---|---|
| 1 | Qwen 2.5 72B | ~84 | ~94 | ~70 | ~80 | ~48 | 50.5 GB |
| 2 | Llama 3.3 70B | 83.6 | 95.1 | 68.0 | 80.5 | 46.7 | 45.6 GB |
| 3 | DeepSeek-R1 70B | ~83 | ~93 | ~72 | ~78 | ~50 | 46.0 GB |
Pick: This tier is tight. Qwen 2.5 72B and Llama 3.3 70B trade blows across benchmarks. Llama 3.3 70B is slightly more VRAM-efficient (45.6 GB vs 50.5 GB at Q4) and has the better ecosystem (more fine-tunes, better tool support). DeepSeek-R1 70B is the reasoning specialist if you need step-by-step problem solving.
Token Generation Speed: Real Hardware Benchmarks
This is where theory meets reality. All speeds below are for token generation (output tokens/s), which is the speed you experience while the model responds. Measured with llama.cpp or Ollama unless noted otherwise.
NVIDIA GPUs - Generation Speed (tok/s)
| Model (Q4_K_M) | RTX 4060 (8GB) | RTX 4070 Ti (12GB) | RTX 4080 (16GB) | RTX 4090 (24GB) | RTX 5090 (32GB) |
|---|---|---|---|---|---|
| Llama 3.1 8B | 42 | 60 | ~100 | 113 | ~190 |
| Qwen 3 8B | ~40 | ~58 | ~95 | 131 | ~213 |
| Qwen 2.5 14B | - | ~42 | ~60 | 64 | ~105 |
| Phi-4 14B | - | - | ~62 | 69 | ~110 |
| DeepSeek-R1 14B | - | - | ~55 | 59 | ~95 |
| Gemma 2 27B | - | - | - | 38 | ~62 |
| Qwen 3 32B | - | - | - | 34 | ~58 |
| DeepSeek-R1 32B | - | - | - | 34 | ~55 |
| Qwen 3 30B MoE | - | - | - | 196 | ~320 |
| Llama 3.3 70B (Q2-Q3) | - | - | - | ~20 | ~35 |
RTX 5090 impact: The 5090 delivers 28-67% faster token generation than the 4090 across all model sizes, thanks to its 1,790 GB/s memory bandwidth (77% higher than the 4090's 1,008 GB/s). For a Qwen 2.5 Coder 7B, CloudRift measured 5,841 tok/s prompt processing - a 72% improvement.
Apple Silicon - Generation Speed (tok/s, llama.cpp)
| Model (Q4) | M4 Pro (273 GB/s) | M3 Max (400 GB/s) | M4 Max (546 GB/s) | M2 Ultra (800 GB/s) |
|---|---|---|---|---|
| Llama 8B | ~50 | ~66 | 70-83 | 94 |
| Qwen 2.5 14B | ~25 | ~35 | ~45 | ~60 |
| Qwen 2.5 32B | - | ~15 | ~25 | ~40 |
| Llama 3.3 70B | - | - | 8-15 | ~12 |
MLX vs llama.cpp on Apple Silicon: MLX (Apple's native framework) achieves up to 50% higher throughput than llama.cpp on Apple Silicon. On an M2 Ultra, MLX sustains ~230 tok/s vs ~150 tok/s for llama.cpp for smaller models. On M1 Max 64GB, Qwen 2.5 7B runs at 63.7 tok/s on MLX vs 40.75 tok/s on Ollama (GGUF). If you are on a Mac, use MLX.
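Getting started with MLX takes a few lines via the mlx-lm package. A minimal sketch - the 4-bit community conversion named below is just an example repo, and the call signature reflects recent mlx-lm releases:

```python
# pip install mlx-lm   (Apple Silicon only)
from mlx_lm import load, generate

# Example repo name -- any 4-bit conversion from the mlx-community org works the same way.
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Explain KV cache growth in one paragraph.",
    max_tokens=256,
    verbose=True,  # prints tokens/s so you can compare against the tables above
)
```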
Context Length vs Speed Tradeoff
Context length dramatically impacts generation speed. Here is Qwen 3 8B Q4 on RTX 4090:
| Context Length | Generation (tok/s) | Prompt Processing (tok/s) |
|---|---|---|
| 4K | 131 | 9,121 |
| 16K | 96 | 5,697 |
| 32K | 77 | 4,224 |
| 65K | 53 | 2,614 |
| 131K | 32 | 1,451 |
Generation speed drops by 75% from 4K to 131K context. Plan your context budget carefully.
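To turn these throughput numbers into perceived latency, combine both columns: prompt processing determines time to first token, generation determines the rest. A quick sketch using the 32K row above and a hypothetical request shape (20K prompt tokens, 500 output tokens):

```python
def response_time_s(prompt_tokens: int, output_tokens: int,
                    pp_tok_s: float, gen_tok_s: float) -> tuple[float, float]:
    """Returns (time to first token, total response time) in seconds."""
    ttft = prompt_tokens / pp_tok_s
    total = ttft + output_tokens / gen_tok_s
    return ttft, total

# Qwen 3 8B Q4 on RTX 4090 at 32K context (numbers from the table above).
ttft, total = response_time_s(20_000, 500, pp_tok_s=4_224, gen_tok_s=77)
print(f"first token after ~{ttft:.1f}s, full reply in ~{total:.1f}s")
# ~4.7s to first token, ~11.2s total -- long prompts cost you twice:
# slower prefill *and* slower generation.
```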
Best Models by Task
Best for Coding
| VRAM Tier | Model | HumanEval | Why |
|---|---|---|---|
| 8 GB | Llama 3.1 8B | 72.6% | Highest coding score at this size |
| 16 GB | Qwen 3 14B | 72.2% | Near-identical to 8B but better reasoning |
| 24 GB | Qwen 3 32B | 72.1% | Better at complex multi-file tasks |
| 64 GB | Llama 3.3 70B | 80.5% | Significant jump in coding ability |
Best for Math and Reasoning
| VRAM Tier | Model | MATH | GSM8K | Why |
|---|---|---|---|---|
| 8 GB | Qwen 3 8B | 60.8 | 89.8 | Crushes math at this size |
| 16 GB | DeepSeek-R1 14B | ~65 | ~90 | Chain-of-thought reasoning |
| 24 GB | DeepSeek-R1 32B | ~68 | ~92 | Best reasoning at this tier |
| 64 GB | DeepSeek-R1 70B | ~72 | ~93 | Strongest open reasoning model |
Best for General Knowledge
| VRAM Tier | Model | MMLU | Why |
|---|---|---|---|
| 8 GB | Qwen 3 8B | 76.9 | Significantly ahead of alternatives |
| 16 GB | Qwen 3 14B | 81.1 | Approaches frontier model territory |
| 24 GB | Qwen 3 32B | 83.6 | Competitive with GPT-4 class on knowledge |
| 64 GB | Qwen 2.5 72B | ~84 | Marginal gains over 32B on MMLU |
Key Takeaways
Qwen 3 Dominates Nearly Every Tier
Alibaba's Qwen 3 family is the model to beat at every parameter count from 4B to 32B. The 8B model matches or exceeds Llama 3.1 8B on everything except HumanEval. The 14B model punches well above its weight, scoring 81.1 on MMLU - territory that required 70B parameters just a year ago. The 32B model is competitive with early GPT-4 on knowledge benchmarks.
MoE Models Change the Economics
The Qwen 3 30B-A3B Mixture-of-Experts model is a game-changer for local inference. It generates tokens at 196 tok/s on an RTX 4090 - faster than the 8B dense model - while delivering quality closer to the 14B class. The catch: MoE models still load all parameters into VRAM, so the memory requirement does not shrink proportionally. But if the model fits, the speed advantage is dramatic.
Memory Bandwidth Matters More Than Compute Generation
The RTX 3090 (936 GB/s, $800 used) often outperforms the RTX 4080 (717 GB/s, $900 new) in token generation. The M3 Max (400 GB/s) outperforms the newer M4 Pro (273 GB/s). When shopping for LLM hardware, look at the memory bandwidth spec first. The formula is simple: generation speed in tok/s is approximately (memory bandwidth in GB/s) / (model size in GB).
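That rule of thumb is an upper bound - measured speeds land at roughly 70-80% of it - but it reproduces the numbers above surprisingly well. A sketch, with the efficiency factor as an assumption (dense models only; MoE models read far fewer bytes per token than they hold in VRAM, which is why they blow past this estimate):

```python
def est_gen_tok_s(bandwidth_gb_s: float, model_size_gb: float,
                  efficiency: float = 0.75) -> float:
    """Upper bound: each token requires streaming all weights from memory once.
    Real runs land below the bound, so apply a rough efficiency factor (assumed)."""
    return bandwidth_gb_s / model_size_gb * efficiency

# Qwen 3 32B at Q4_K_M is ~22.2 GB (see the VRAM table); measured speeds were
# 34 tok/s on the RTX 4090 and ~58 tok/s on the RTX 5090.
for name, bw in [("RTX 3090", 936), ("RTX 4090", 1008), ("RTX 5090", 1790)]:
    print(f"{name}: ~{est_gen_tok_s(bw, 22.2):.0f} tok/s estimated")
```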
The 14B Sweet Spot
For most users with a 16GB GPU, a 14B model with Q4_K_M quantization (~10-11 GB) is the sweet spot. It leaves enough VRAM for 8-16K context, delivers dramatically better quality than 8B models, and generates at 60-70 tok/s on an RTX 4080/4090 - fast enough for interactive use. If you only run one model locally, make it Qwen 3 14B.
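If you want to act on that recommendation, a minimal sketch using the Ollama Python client is below. The qwen3:14b tag pulls a 4-bit build by default, and the 8K context setting is the assumption here - raise it only if you have VRAM to spare:

```python
# pip install ollama   (and run `ollama pull qwen3:14b` beforehand)
from ollama import chat

stream = chat(
    model="qwen3:14b",                 # ~10-11 GB at the default 4-bit quant
    messages=[{"role": "user", "content": "Summarize the tradeoffs of Q4 vs Q5 quantization."}],
    options={"num_ctx": 8192},         # keep context modest to stay inside 16 GB
    stream=True,
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
```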
Apple Silicon Is Viable for 70B
A Mac with 96-128GB unified memory can run Llama 3.3 70B or Qwen 2.5 72B at 8-15 tok/s with Q4 quantization. That is slow but usable for batch processing, creative writing, or tasks where latency is not critical. No single consumer NVIDIA GPU can run a 70B at Q4 - you need a multi-GPU setup or an aggressive Q2-Q3 quant. This is Apple Silicon's unique advantage: massive unified memory at the cost of lower bandwidth per dollar.
Use MLX on Mac, Not Ollama
Framework choice matters on Apple Silicon. MLX delivers up to 50% higher throughput than llama.cpp for the same model on the same hardware, and roughly 1.5x Ollama (63.7 vs 40.75 tok/s for Qwen 2.5 7B on an M1 Max). If you are running models on a Mac, MLX is the obvious choice.
Honorable Mention: Distributed Inference with Exo
Everything above assumes a single device. But what if you could pool multiple machines together?
Exo is an open source distributed inference framework that lets you combine multiple devices - Macs, NVIDIA GPUs, even phones - into a single cluster that runs models no individual machine could handle. It uses automatic peer discovery, topology-aware parallelization, and supports both pipeline parallelism (splitting layers across devices) and tensor parallelism (sharding individual layers).
The results are genuinely interesting for anyone with old hardware collecting dust.
What Exo Makes Possible
Three RTX 3090s ($2,400 total used) running a 120B model at 124 tok/s. That is over 3x faster than a single NVIDIA DGX Spark ($3,000) at 38.5 tok/s on the same workload. Three used GPUs from 2020 outperforming a purpose-built AI appliance from 2025.
DGX Spark + Mac Studio M3 Ultra running Llama 3.1 8B at 2.8x the speed of either device alone. The DGX Spark handles prefill 3.8x faster than the Mac (it has more compute), while the Mac generates tokens 3.4x faster (it has more memory bandwidth). Exo automatically routes each phase to the device that is best at it.
3x M4 Mac Mini cluster processing 108.8 tok/s on Llama 3.2 3B - 2.2x the throughput of a single device for concurrent requests. Four M3 Ultra Mac Studios running Qwen3 235B (8-bit) at 28.3 tok/s with RDMA over Thunderbolt 5.
The Speculation: What Old Hardware Could Do
This is where it gets exciting for people with hardware graveyards. Consider what you could build:
2x RTX 3090 (48GB combined, ~$1,600 used): Run Llama 3.3 70B at Q4 (45.6 GB) entirely in GPU memory, or Qwen 2.5 72B if you drop to Q3. Based on the single-card bandwidth of 936 GB/s and Exo's measured multi-device scaling, you could expect roughly 15-20 tok/s generation - comparable to a $4,000+ Mac Studio with 192GB. That is a 70B model at interactive speeds for the price of a mid-range laptop.
RTX 3090 + Mac Mini M4 Pro: The GPU handles compute-heavy prefill while the Mac (with its 48GB unified memory option) handles generation for models that spill beyond 24GB. Exo routes each phase to the stronger device automatically.
3x RTX 3060 12GB (36GB combined, ~$750 used): Enough combined VRAM for Qwen 3 32B at Q4 with context headroom. Lower bandwidth per card (360 GB/s) limits generation speed, but at ~30-40 tok/s estimated for a 32B model across three cards, that is still very usable - and the hardware cost is less than a single RTX 4070 Ti.
Mac Mini M4 + old gaming PC with RTX 3080: Hybrid setups are Exo's sweet spot. The Mac contributes unified memory capacity, the NVIDIA card contributes raw compute for prefill. Models that are too large for either device alone become runnable.
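The estimates in these scenarios come from the same bandwidth arithmetic as the single-GPU rule of thumb, extended to multiple cards. A rough sketch, assuming an even split of the model across cards and zero interconnect overhead (which Exo's own benchmarks show is not free):

```python
def pipeline_tok_s(model_gb: float, card_bandwidths: list[float]) -> float:
    """Pipeline parallelism: shards run one after another per token, so the
    per-token time is the sum of each card's shard-read time (even split assumed)."""
    shard = model_gb / len(card_bandwidths)
    return 1 / sum(shard / bw for bw in card_bandwidths)

def tensor_tok_s(model_gb: float, card_bandwidths: list[float]) -> float:
    """Tensor parallelism: cards read their shards concurrently, so the
    ceiling is pooled bandwidth divided by model size."""
    return sum(card_bandwidths) / model_gb

# 2x RTX 3090 running Llama 3.3 70B Q4 (45.6 GB from the VRAM table)
print(f"2x3090, pipeline: ~{pipeline_tok_s(45.6, [936, 936]):.0f} tok/s ceiling")
# 3x RTX 3060 12GB running Qwen 3 32B Q4 (22.2 GB)
print(f"3x3060, pipeline: ~{pipeline_tok_s(22.2, [360] * 3):.0f} tok/s ceiling")
print(f"3x3060, tensor:   ~{tensor_tok_s(22.2, [360] * 3):.0f} tok/s ceiling")
```

Note what the math says: pipeline parallelism over identical cards is no faster than a single card that could hold the whole model - the win is capacity, not speed. Tensor parallelism is what buys throughput, at the cost of far more interconnect traffic.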
The catch: network overhead is real. Single-request latency increases with more devices (39.7 tok/s on 3 devices vs 49.3 on 1 for small models). Exo shines when the model does not fit on a single device, or when you are serving multiple concurrent requests. For a model that already fits in one GPU's VRAM, adding more devices hurts rather than helps.
But for running 70B+ models on a budget? A stack of used RTX 3090s with Exo is arguably the best price/performance option in existence. The software is still maturing (Linux GPU support is under development, and Thunderbolt 5 RDMA is Mac-only for now), but the fundamental approach - pooling commodity hardware for distributed inference - is the most interesting development in local LLM hardware since Apple Silicon unified memory.
Methodology Notes
Benchmark scores come from official model cards (Meta, Alibaba, Google) and the Qwen 3 Technical Report. Token generation speeds are from llama.cpp GitHub benchmarks, Hardware-Corner, DatabaseMart, singhajit.com, and the MLX vs llama.cpp comparative study (arXiv:2511.05502). All speeds use Q4_K_M quantization unless noted otherwise. Scores marked with ~ are from aggregated third-party benchmarks where official numbers were not available for the specific instruct variant.
VRAM requirements are measured at 8K context windows using Ollama. Longer context windows increase VRAM usage significantly due to the KV cache.
Sources:
- llama.cpp CUDA GPU Benchmarks - Discussion #15013
- llama.cpp Apple Silicon Benchmarks - Discussion #4167
- Hardware-Corner - RTX 4090 LLM Benchmarks
- Hardware-Corner - GPU Ranking for LLMs
- Qwen 3 Technical Report (arXiv)
- Gemma 3 Technical Report (arXiv)
- Meta Llama 3.2 Model Card
- Meta Llama 3.1 Model Card
- MLX vs llama.cpp Comparative Study (arXiv:2511.05502)
- DatabaseMart - RTX 4090 Ollama Benchmarks
- DatabaseMart - RTX 4060 Ollama Benchmarks
- DatabaseMart - RTX 5090 Ollama Benchmarks
- CloudRift - RTX 4090 vs 5090 LLM Inference Benchmarks
- LocalLLM.in - Ollama VRAM Requirements
- LocalLLM.in - Best GPUs for LLM Inference 2025
- Singhajit - LLM Inference Speed Comparison
- LocalScore.ai - RX 9070 XT Results
- XiongjieDai - GPU Benchmarks on LLM Inference
- Exo - Distributed AI Inference
- Exo Labs - Transparent Benchmarks (Day 1)
- Exo Labs - DGX Spark + Mac Studio Benchmarks
- Tom's Hardware - Exo Labs Disaggregated AI Inference