Home GPU LLM Leaderboard: Best Open Source Models by VRAM Tier with Token/s Benchmarks
Rankings of the best open source LLMs you can run on home hardware - RTX 4090, RTX 3090, Apple M3/M4 Max - organized by VRAM tier with real-world token/s benchmarks and quality scores.

Running large language models locally has gone from a niche hobby to a practical reality. With the right hardware and the right model, you can get genuinely useful AI inference on a consumer GPU or a Mac without sending a single token to the cloud. But the landscape is overwhelming: dozens of model families, multiple quantization formats, and wildly different performance characteristics across hardware. This leaderboard cuts through the noise.
I have organized everything by VRAM tier - because that is the hard constraint that determines what you can actually run. For each tier, I rank the top models by quality across coding, reasoning, and general knowledge benchmarks, then show real-world token generation speeds on popular consumer hardware so you know exactly what to expect.
The Hardware People Are Actually Using
Before diving into models, here is the consumer hardware landscape for local LLM inference in 2026. The critical spec is not compute power - it is memory bandwidth. Token generation is memory-bound: the GPU spends most of its time loading model weights from VRAM, not computing. Higher bandwidth means faster tokens.
| GPU / Chip | VRAM | Bandwidth | Street Price (USD) | Notes |
|---|---|---|---|---|
| RTX 5090 | 32 GB GDDR7 | 1,790 GB/s | ~$2,000 | New king. 77% more bandwidth than 4090 |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | ~$1,600 | Still the enthusiast standard |
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | ~$800 used | Best value for 24GB VRAM |
| RTX 4080 | 16 GB GDDR6X | 717 GB/s | ~$900 | Good middle ground |
| RTX 4070 Ti Super | 16 GB GDDR6X | 672 GB/s | ~$750 | Popular 16GB option |
| RTX 4060 | 8 GB GDDR6 | 272 GB/s | ~$300 | Budget entry point |
| RX 7900 XTX | 24 GB GDDR6 | 960 GB/s | ~$750 | AMD's best; ROCm/Vulkan support improving |
| RX 9070 XT | 16 GB GDDR6 | ~640 GB/s | ~$550 | RDNA 4, doubled AI throughput per CU |
| M4 Max (40-core GPU) | 36-128 GB unified | 546 GB/s | ~$3,500+ (laptop) | Runs 70B models if you have 128GB |
| M3 Max (40-core GPU) | 36-128 GB unified | 400 GB/s | ~$2,800+ (laptop) | Slightly slower but same capacity |
| M4 Pro | 24-48 GB unified | 273 GB/s | ~$2,000+ (laptop) | Good for 14B-32B models |
| M2 Ultra | 64-192 GB unified | 800 GB/s | ~$4,000+ (Studio) | Can run anything that fits |
Key insight: The RTX 3090 has higher memory bandwidth (936 GB/s) than the newer RTX 4080 (717 GB/s). This is why it sometimes outperforms the 4080 in token generation despite being a generation older. For local LLM work, a used 3090 at $800 is one of the best deals on the market.
VRAM Tiers and What Fits
The amount of VRAM determines which models you can run. Quantization (compressing model weights from 16-bit to 4-bit or lower) is what makes large models fit on consumer hardware. Q4_K_M is the sweet spot: roughly 75% smaller than full precision with minimal quality loss.
| VRAM Tier | Model Sizes (Q4_K_M) | Example GPUs |
|---|---|---|
| 8 GB | Up to 7-8B | RTX 4060, RX 7600 |
| 12 GB | Up to 12-14B | RTX 4070, RTX 3060 12GB |
| 16 GB | Up to 14B comfortably, 20B tight | RTX 4080, RTX 4070 Ti Super, RX 9070 XT |
| 24 GB | Up to 32B (tight), or 14B at higher-precision quants | RTX 4090, RTX 3090, RX 7900 XTX |
| 32 GB | Up to 32B comfortably | RTX 5090, Mac M3 Pro 36GB |
| 48-64 GB | 70B (Q3-Q4) | Mac M3/M4 Max 64GB |
| 96-128 GB | 70B+ at higher-precision quants | Mac M3/M4 Max 96-128GB, M2 Ultra |
VRAM Requirements by Model and Quantization (8K Context)
| Model | Q3_K_M | Q4_K_M | Q5_K_M |
|---|---|---|---|
| Llama 3.2 3B | 2.7 GB | 3.6 GB | 4.2 GB |
| Qwen 3 4B | 3.2 GB | 4.0 GB | 4.8 GB |
| Llama 3.1 8B | 5.8 GB | 6.2 GB | 7.3 GB |
| Gemma 3 12B | 9.5 GB | 12.4 GB | 14.0 GB |
| Qwen 3 14B | 9.0 GB | 10.7 GB | 12.0 GB |
| Phi-4 14B | 8.5 GB | 11.0 GB | 12.5 GB |
| DeepSeek-R1 14B | 9.5 GB | 11.0 GB | 12.5 GB |
| Gemma 3 27B | 19.4 GB | 22.5 GB | 26.0 GB |
| Qwen 3 32B | 18.6 GB | 22.2 GB | 26.0 GB |
| Llama 3.3 70B | 37.5 GB | 45.6 GB | 55.0 GB |
| Qwen 2.5 72B | 40.9 GB | 50.5 GB | 62.0 GB |
Important caveat: These numbers are for 8K context. The KV cache scales linearly with context length and is the hidden VRAM killer. At 32K context, add 30-50% more VRAM. You can run a big model or a long context - rarely both on consumer hardware.
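If you want to sanity-check a model against your card before downloading anything, both numbers can be estimated with simple arithmetic: quantized weights at roughly 4.8 bits per weight for Q4_K_M, plus a KV cache of 2 x layers x KV heads x head dim x context length x bytes per element. A minimal Python sketch - the architecture constants below are illustrative values for a Llama-3.1-8B-shaped model, not measured figures:

```python
def quantized_weight_gb(params_b: float, bits_per_weight: float = 4.8) -> float:
    """Rough weight footprint: Q4_K_M averages ~4.8 bits/weight including scales."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) * layers * KV heads * head dim * tokens * element size."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Illustrative shape for an 8B-class model (32 layers, 8 KV heads via GQA, head_dim 128).
# These are assumptions for the sketch -- check the model card for real values.
weights = quantized_weight_gb(8.0)           # ~4.8 GB
kv_8k   = kv_cache_gb(32, 8, 128, 8_192)     # ~1.1 GB
kv_32k  = kv_cache_gb(32, 8, 128, 32_768)    # ~4.3 GB

print(f"weights ~{weights:.1f} GB, +KV@8K ~{kv_8k:.1f} GB, +KV@32K ~{kv_32k:.1f} GB")
```

Add a bit of runtime overhead and the 8K total lands right around the ~6.2 GB the table above lists for Llama 3.1 8B at Q4_K_M.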
Rankings by VRAM Tier
8 GB Tier (7-8B Models)
The entry-level tier. These models fit on any modern GPU and deliver surprisingly strong results for their size.
| Rank | Model | MMLU | GSM8K | MATH | HumanEval | GPQA | Best For |
|---|---|---|---|---|---|---|---|
| 1 | Qwen 3 8B | 76.9 | 89.8 | 60.8 | 67.7 | 44.4 | All-rounder, best overall quality |
| 2 | Llama 3.1 8B | 69.4 | 84.5 | 51.9 | 72.6 | 30.4 | Coding (highest HumanEval) |
| 3 | Gemma 3 4B | 59.6 | 38.4 | 24.2 | 36.0 | 15.0 | Ultralight, fits in 4GB |
Pick: Qwen 3 8B dominates this tier on reasoning and math. But if coding is your primary use case, Llama 3.1 8B's 72.6% HumanEval score is hard to beat at this size. For extremely constrained hardware (8GB total system RAM, no dedicated GPU), Gemma 3 4B is the only viable option and still handles basic tasks.
12-16 GB Tier (12-14B Models)
The sweet spot for most users. A 14B model with Q4 quantization delivers a dramatic quality jump over 8B while fitting on mainstream GPUs.
| Rank | Model | MMLU | GSM8K | MATH | HumanEval | GPQA | Best For |
|---|---|---|---|---|---|---|---|
| 1 | Qwen 3 14B | 81.1 | 92.5 | 62.0 | 72.2 | 39.9 | Best quality at this size |
| 2 | DeepSeek-R1 14B | ~79 | ~90 | ~65 | ~68 | ~38 | Reasoning chains, step-by-step |
| 3 | Phi-4 14B | ~80 | ~91 | ~63 | ~70 | ~36 | Microsoft's compact powerhouse |
Honorable mention: Gemma 3 12B scores lower on paper (MMLU 74.5, MATH 43.3) but excels at instruction-following and multilingual tasks. If you work in non-English languages, consider it.
Pick: Qwen 3 14B is the clear winner. It outperforms or matches models twice its size on math and reasoning benchmarks. On a 16GB GPU with Q4_K_M quantization, it uses about 10.7 GB, leaving headroom for context. DeepSeek-R1 14B is the pick if you specifically want chain-of-thought reasoning (it is a distilled version of the full DeepSeek-R1 reasoning model).
24 GB Tier (27-32B Models)
This is where things get interesting. A 32B model at Q4 barely fits in 24GB (22.2 GB for Qwen 3 32B), leaving very little room for context. A 27B model is more comfortable. The RTX 4090, RTX 3090, and RX 7900 XTX all have 24GB.
| Rank | Model | MMLU | GSM8K | MATH | HumanEval | GPQA | VRAM (Q4) |
|---|---|---|---|---|---|---|---|
| 1 | Qwen 3 32B | 83.6 | 93.4 | 61.6 | 72.1 | 49.5 | 22.2 GB |
| 2 | DeepSeek-R1 32B | ~82 | ~92 | ~68 | ~70 | ~46 | 22.0 GB |
| 3 | Gemma 3 27B | 78.6 | 82.6 | 50.0 | 48.8 | 24.3 | 22.5 GB |
The MoE wildcard: Qwen 3 30B-A3B. This is a Mixture-of-Experts model with 30B total parameters but only ~3B active per token. It fits in 24GB easily and runs at 196 tok/s on an RTX 4090 - faster than the 8B dense models. Quality is competitive with 14B dense models. If you value speed over peak quality, this is remarkable.
Pick: Qwen 3 32B is the best 24GB-tier model, but at 22.2 GB Q4_K_M, your context window is severely limited on a 24GB card. If you need longer context, drop to Gemma 3 27B (22.5 GB but more efficient KV cache) or the Qwen 3 30B MoE for the best speed/quality tradeoff. On the RTX 5090 (32GB), the Qwen 3 32B runs comfortably with room to spare.
48-64 GB Tier (70B Models)
You need a Mac with 64GB+ unified memory or dual GPUs to play here. The 70B class represents a massive quality leap - these models compete with frontier commercial APIs from a year ago.
| Rank | Model | MMLU | GSM8K | MATH | HumanEval | GPQA | VRAM (Q4) |
|---|---|---|---|---|---|---|---|
| 1 | Qwen 2.5 72B | ~84 | ~94 | ~70 | ~80 | ~48 | 50.5 GB |
| 2 | Llama 3.3 70B | 83.6 | 95.1 | 68.0 | 80.5 | 46.7 | 45.6 GB |
| 3 | DeepSeek-R1 70B | ~83 | ~93 | ~72 | ~78 | ~50 | 46.0 GB |
Pick: This tier is tight. Qwen 2.5 72B and Llama 3.3 70B trade blows across benchmarks. Llama 3.3 70B is slightly more VRAM-efficient (45.6 GB vs 50.5 GB at Q4) and has the better ecosystem (more fine-tunes, better tool support). DeepSeek-R1 70B is the reasoning specialist if you need step-by-step problem solving.
Token Generation Speed: Real Hardware Benchmarks
This is where theory meets reality. All speeds below are for token generation (output tokens/s), which is the speed you experience while the model responds. Measured with llama.cpp or Ollama unless noted otherwise.
NVIDIA GPUs - Generation Speed (tok/s)
| Model (Q4_K_M) | RTX 4060 (8GB) | RTX 4070 Ti (12GB) | RTX 4080 (16GB) | RTX 4090 (24GB) | RTX 5090 (32GB) |
|---|---|---|---|---|---|
| Llama 3.1 8B | 42 | 60 | ~100 | 113 | ~190 |
| Qwen 3 8B | ~40 | ~58 | ~95 | 131 | ~213 |
| Qwen 2.5 14B | - | ~42 | ~60 | 64 | ~105 |
| Phi-4 14B | - | - | ~62 | 69 | ~110 |
| DeepSeek-R1 14B | - | - | ~55 | 59 | ~95 |
| Gemma 2 27B | - | - | - | 38 | ~62 |
| Qwen 3 32B | - | - | - | 34 | ~58 |
| DeepSeek-R1 32B | - | - | - | 34 | ~55 |
| Qwen 3 30B MoE | - | - | - | 196 | ~320 |
| Llama 3.3 70B (Q2-Q3) | - | - | - | ~20 | ~35 |
RTX 5090 impact: The 5090 delivers 28-67% faster token generation than the 4090 across all model sizes, thanks to its 1,790 GB/s memory bandwidth (77% higher than the 4090's 1,008 GB/s). For a Qwen 2.5 Coder 7B, CloudRift measured 5,841 tok/s prompt processing - a 72% improvement.
Apple Silicon - Generation Speed (tok/s, llama.cpp)
| Model (Q4) | M4 Pro (273 GB/s) | M3 Max (400 GB/s) | M4 Max (546 GB/s) | M2 Ultra (800 GB/s) |
|---|---|---|---|---|
| Llama 8B | ~50 | ~66 | 70-83 | 94 |
| Qwen 2.5 14B | ~25 | ~35 | ~45 | ~60 |
| Qwen 2.5 32B | - | ~15 | ~25 | ~40 |
| Llama 3.3 70B | - | - | 8-15 | ~12 |
MLX vs llama.cpp on Apple Silicon: MLX (Apple's native framework) achieves up to 50% higher throughput than llama.cpp on Apple Silicon. On an M2 Ultra, MLX sustains ~230 tok/s vs ~150 tok/s for llama.cpp for smaller models. On M1 Max 64GB, Qwen 2.5 7B runs at 63.7 tok/s on MLX vs 40.75 tok/s on Ollama (GGUF). If you are on a Mac, use MLX.
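Getting started with MLX takes a few lines via the mlx-lm package. A minimal sketch - the 4-bit community conversion named below is just an example repo, and the call signature reflects recent mlx-lm releases:

```python
# pip install mlx-lm   (Apple Silicon only)
from mlx_lm import load, generate

# Example repo name -- any 4-bit conversion from the mlx-community org works the same way.
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Explain KV cache growth in one paragraph.",
    max_tokens=256,
    verbose=True,  # prints tokens/s so you can compare against the tables above
)
```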
Context Length vs Speed Tradeoff
Context length dramatically impacts generation speed. Here is Qwen 3 8B Q4 on RTX 4090:
| Context Length | Generation (tok/s) | Prompt Processing (tok/s) |
|---|---|---|
| 4K | 131 | 9,121 |
| 16K | 96 | 5,697 |
| 32K | 77 | 4,224 |
| 65K | 53 | 2,614 |
| 131K | 32 | 1,451 |
Generation speed drops by 75% from 4K to 131K context. Plan your context budget carefully.
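To turn these throughput numbers into perceived latency, combine both columns: prompt processing determines time to first token, generation determines the rest. A quick sketch using the 32K row above and a hypothetical request shape (20K prompt tokens, 500 output tokens):

```python
def response_time_s(prompt_tokens: int, output_tokens: int,
                    pp_tok_s: float, gen_tok_s: float) -> tuple[float, float]:
    """Returns (time to first token, total response time) in seconds."""
    ttft = prompt_tokens / pp_tok_s
    total = ttft + output_tokens / gen_tok_s
    return ttft, total

# Qwen 3 8B Q4 on RTX 4090 at 32K context (numbers from the table above).
ttft, total = response_time_s(20_000, 500, pp_tok_s=4_224, gen_tok_s=77)
print(f"first token after ~{ttft:.1f}s, full reply in ~{total:.1f}s")
# ~4.7s to first token, ~11.2s total -- long prompts cost you twice:
# slower prefill *and* slower generation.
```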
Best Models by Task
Best for Coding
| VRAM Tier | Model | HumanEval | Why |
|---|---|---|---|
| 8 GB | Llama 3.1 8B | 72.6% | Highest coding score at this size |
| 16 GB | Qwen 3 14B | 72.2% | Near-identical to 8B but better reasoning |
| 24 GB | Qwen 3 32B | 72.1% | Better at complex multi-file tasks |
| 64 GB | Llama 3.3 70B | 80.5% | Significant jump in coding ability |
Best for Math and Reasoning
| VRAM Tier | Model | MATH | GSM8K | Why |
|---|---|---|---|---|
| 8 GB | Qwen 3 8B | 60.8 | 89.8 | Crushes math at this size |
| 16 GB | DeepSeek-R1 14B | ~65 | ~90 | Chain-of-thought reasoning |
| 24 GB | DeepSeek-R1 32B | ~68 | ~92 | Best reasoning at this tier |
| 64 GB | DeepSeek-R1 70B | ~72 | ~93 | Strongest open reasoning model |
Best for General Knowledge
| VRAM Tier | Model | MMLU | Why |
|---|---|---|---|
| 8 GB | Qwen 3 8B | 76.9 | Significantly ahead of alternatives |
| 16 GB | Qwen 3 14B | 81.1 | Approaches frontier model territory |
| 24 GB | Qwen 3 32B | 83.6 | Competitive with GPT-4 class on knowledge |
| 64 GB | Qwen 2.5 72B | ~84 | Marginal gains over 32B on MMLU |
Key Takeaways
Qwen 3 Dominates Nearly Every Tier
Alibaba's Qwen 3 family is the model to beat at every parameter count from 4B to 32B. The 8B model matches or exceeds Llama 3.1 8B on everything except HumanEval. The 14B model punches well above its weight, scoring 81.1 on MMLU - territory that required 70B parameters just a year ago. The 32B model is competitive with early GPT-4 on knowledge benchmarks.
MoE Models Change the Economics
The Qwen 3 30B-A3B Mixture-of-Experts model is a game-changer for local inference. It generates tokens at 196 tok/s on an RTX 4090 - faster than the 8B dense model - while delivering quality closer to the 14B class. The catch: MoE models still load all parameters into VRAM, so the memory requirement does not shrink proportionally. But if the model fits, the speed advantage is dramatic.
Memory Bandwidth Matters More Than Compute Generation
The RTX 3090 (936 GB/s, $800 used) often outperforms the RTX 4080 (717 GB/s, $900 new) in token generation. The M3 Max (400 GB/s) outperforms the newer M4 Pro (273 GB/s). When shopping for LLM hardware, look at the memory bandwidth spec first. The formula is simple: generation speed in tok/s is approximately (memory bandwidth in GB/s) / (model size in GB).
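That rule of thumb is an upper bound - measured speeds land at roughly 70-80% of it - but it reproduces the numbers above surprisingly well. A sketch, with the efficiency factor as an assumption (dense models only; MoE models read far fewer bytes per token than they hold in VRAM, which is why they blow past this estimate):

```python
def est_gen_tok_s(bandwidth_gb_s: float, model_size_gb: float,
                  efficiency: float = 0.75) -> float:
    """Upper bound: each token requires streaming all weights from memory once.
    Real runs land below the bound, so apply a rough efficiency factor (assumed)."""
    return bandwidth_gb_s / model_size_gb * efficiency

# Qwen 3 32B at Q4_K_M is ~22.2 GB (see the VRAM table); measured speeds were
# 34 tok/s on the RTX 4090 and ~58 tok/s on the RTX 5090.
for name, bw in [("RTX 3090", 936), ("RTX 4090", 1008), ("RTX 5090", 1790)]:
    print(f"{name}: ~{est_gen_tok_s(bw, 22.2):.0f} tok/s estimated")
```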
The 14B Sweet Spot
For most users with a 16GB GPU, a 14B model with Q4_K_M quantization (~10-11 GB) is the sweet spot. It leaves enough VRAM for 8-16K context, delivers dramatically better quality than 8B models, and generates at 60-70 tok/s on an RTX 4080/4090 - fast enough for interactive use. If you only run one model locally, make it Qwen 3 14B.
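If you want to act on that recommendation, a minimal sketch using the Ollama Python client is below. The qwen3:14b tag pulls a 4-bit build by default, and the 8K context setting is the assumption here - raise it only if you have VRAM to spare:

```python
# pip install ollama   (and run `ollama pull qwen3:14b` beforehand)
from ollama import chat

stream = chat(
    model="qwen3:14b",                 # ~10-11 GB at the default 4-bit quant
    messages=[{"role": "user", "content": "Summarize the tradeoffs of Q4 vs Q5 quantization."}],
    options={"num_ctx": 8192},         # keep context modest to stay inside 16 GB
    stream=True,
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
```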
Apple Silicon Is Viable for 70B
A Mac with 96-128GB unified memory can run Llama 3.3 70B or Qwen 2.5 72B at 8-15 tok/s with Q4 quantization. That is slow but usable for batch processing, creative writing, or tasks where latency is not critical. No single consumer NVIDIA GPU can run a 70B at Q4 - you need a multi-GPU setup or an aggressive Q2-Q3 quant. This is Apple Silicon's unique advantage: massive unified memory at the cost of lower bandwidth per dollar.
Use MLX on Mac, Not Ollama
Framework choice matters on Apple Silicon. MLX delivers up to 50% higher throughput than llama.cpp for the same model on the same hardware, and roughly 1.5x Ollama (63.7 vs 40.75 tok/s for Qwen 2.5 7B on an M1 Max). If you are running models on a Mac, MLX is the obvious choice.
Honorable Mention: Distributed Inference with Exo
Everything above assumes a single device. But what if you could pool multiple machines together?
Exo is an open source distributed inference framework that lets you combine multiple devices - Macs, NVIDIA GPUs, even phones - into a single cluster that runs models no individual machine could handle. It uses automatic peer discovery, topology-aware parallelization, and supports both pipeline parallelism (splitting layers across devices) and tensor parallelism (sharding individual layers).
The results are genuinely interesting for anyone with old hardware collecting dust.
What Exo Makes Possible
Three RTX 3090s ($2,400 total used) running a 120B model at 124 tok/s. That is over 3x faster than a single NVIDIA DGX Spark ($3,000) at 38.5 tok/s on the same workload. Three used GPUs from 2020 outperforming a purpose-built AI appliance from 2025.
DGX Spark + Mac Studio M3 Ultra running Llama 3.1 8B at 2.8x the speed of either device alone. The DGX Spark handles prefill 3.8x faster than the Mac (it has more compute), while the Mac generates tokens 3.4x faster (it has more memory bandwidth). Exo automatically routes each phase to the device that is best at it.
3x M4 Mac Mini cluster processing 108.8 tok/s on Llama 3.2 3B - 2.2x the throughput of a single device for concurrent requests. Four M3 Ultra Mac Studios running Qwen3 235B (8-bit) at 28.3 tok/s with RDMA over Thunderbolt 5.
The Speculation: What Old Hardware Could Do
This is where it gets exciting for people with hardware graveyards. Consider what you could build:
2x RTX 3090 (48GB combined, ~$1,600 used): Run Llama 3.3 70B at Q4 (45.6 GB) entirely in GPU memory, or Qwen 2.5 72B if you drop to Q3. Based on the single-card bandwidth of 936 GB/s and Exo's measured multi-device scaling, you could expect roughly 15-20 tok/s generation - comparable to a $4,000+ Mac Studio with 192GB. That is a 70B model at interactive speeds for the price of a mid-range laptop.
RTX 3090 + Mac Mini M4 Pro: The GPU handles compute-heavy prefill while the Mac (with its 48GB unified memory option) handles generation for models that spill beyond 24GB. Exo routes each phase to the stronger device automatically.
3x RTX 3060 12GB (36GB combined, ~$750 used): Enough combined VRAM for Qwen 3 32B at Q4 with context headroom. Lower bandwidth per card (360 GB/s) limits generation speed, but at ~30-40 tok/s estimated for a 32B model across three cards, that is still very usable - and the hardware cost is less than a single RTX 4070 Ti.
Mac Mini M4 + old gaming PC with RTX 3080: Hybrid setups are Exo's sweet spot. The Mac contributes unified memory capacity, the NVIDIA card contributes raw compute for prefill. Models that are too large for either device alone become runnable.
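The estimates in these scenarios come from the same bandwidth arithmetic as the single-GPU rule of thumb, extended to multiple cards. A rough sketch, assuming an even split of the model across cards and zero interconnect overhead (which Exo's own benchmarks show is not free):

```python
def pipeline_tok_s(model_gb: float, card_bandwidths: list[float]) -> float:
    """Pipeline parallelism: shards run one after another per token, so the
    per-token time is the sum of each card's shard-read time (even split assumed)."""
    shard = model_gb / len(card_bandwidths)
    return 1 / sum(shard / bw for bw in card_bandwidths)

def tensor_tok_s(model_gb: float, card_bandwidths: list[float]) -> float:
    """Tensor parallelism: cards read their shards concurrently, so the
    ceiling is pooled bandwidth divided by model size."""
    return sum(card_bandwidths) / model_gb

# 2x RTX 3090 running Llama 3.3 70B Q4 (45.6 GB from the VRAM table)
print(f"2x3090, pipeline: ~{pipeline_tok_s(45.6, [936, 936]):.0f} tok/s ceiling")
# 3x RTX 3060 12GB running Qwen 3 32B Q4 (22.2 GB)
print(f"3x3060, pipeline: ~{pipeline_tok_s(22.2, [360] * 3):.0f} tok/s ceiling")
print(f"3x3060, tensor:   ~{tensor_tok_s(22.2, [360] * 3):.0f} tok/s ceiling")
```

Note what the math says: pipeline parallelism over identical cards is no faster than a single card that could hold the whole model - the win is capacity, not speed. Tensor parallelism is what buys throughput, at the cost of far more interconnect traffic.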
The catch: network overhead is real. Single-request latency increases with more devices (39.7 tok/s on 3 devices vs 49.3 on 1 for small models). Exo shines when the model does not fit on a single device, or when you are serving multiple concurrent requests. For a model that already fits in one GPU's VRAM, adding more devices hurts rather than helps.
But for running 70B+ models on a budget? A stack of used RTX 3090s with Exo is arguably the best price/performance option in existence. The software is still maturing (Linux GPU support is under development, and Thunderbolt 5 RDMA is Mac-only for now), but the fundamental approach - pooling commodity hardware for distributed inference - is the most interesting development in local LLM hardware since Apple Silicon unified memory.
Methodology Notes
Benchmark scores come from official model cards (Meta, Alibaba, Google) and the Qwen 3 Technical Report. Token generation speeds are from llama.cpp GitHub benchmarks, Hardware-Corner, DatabaseMart, singhajit.com, and the MLX vs llama.cpp comparative study (arXiv:2511.05502). All speeds use Q4_K_M quantization unless noted otherwise. Scores marked with ~ are from aggregated third-party benchmarks where official numbers were not available for the specific instruct variant.
VRAM requirements are measured at 8K context windows using Ollama. Longer context windows increase VRAM usage significantly due to the KV cache.
Sources:
- llama.cpp CUDA GPU Benchmarks - Discussion #15013
- llama.cpp Apple Silicon Benchmarks - Discussion #4167
- Hardware-Corner - RTX 4090 LLM Benchmarks
- Hardware-Corner - GPU Ranking for LLMs
- Qwen 3 Technical Report (arXiv)
- Gemma 3 Technical Report (arXiv)
- Meta Llama 3.2 Model Card
- Meta Llama 3.1 Model Card
- MLX vs llama.cpp Comparative Study (arXiv:2511.05502)
- DatabaseMart - RTX 4090 Ollama Benchmarks
- DatabaseMart - RTX 4060 Ollama Benchmarks
- DatabaseMart - RTX 5090 Ollama Benchmarks
- CloudRift - RTX 4090 vs 5090 LLM Inference Benchmarks
- LocalLLM.in - Ollama VRAM Requirements
- LocalLLM.in - Best GPUs for LLM Inference 2025
- Singhajit - LLM Inference Speed Comparison
- LocalScore.ai - RX 9070 XT Results
- XiongjieDai - GPU Benchmarks on LLM Inference
- Exo - Distributed AI Inference
- Exo Labs - Transparent Benchmarks (Day 1)
- Exo Labs - DGX Spark + Mac Studio Benchmarks
- Tom's Hardware - Exo Labs Disaggregated AI Inference