<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>GGUF | Awesome Agents</title><link>https://awesomeagents.ai/tags/gguf/</link><description>Your guide to AI models, agents, and the future of intelligence. Reviews, leaderboards, news, and tools - all in one place.</description><language>en-us</language><managingEditor>contact@awesomeagents.ai (Awesome Agents)</managingEditor><lastBuildDate>Sun, 19 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://awesomeagents.ai/tags/gguf/index.xml" rel="self" type="application/rss+xml"/><image><url>https://awesomeagents.ai/images/logo.png</url><title>Awesome Agents</title><link>https://awesomeagents.ai/</link></image><item><title>LLM Quantization Impact Leaderboard 2026: INT4 vs FP16</title><link>https://awesomeagents.ai/leaderboards/llm-quantization-impact-leaderboard/</link><pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate><guid>https://awesomeagents.ai/leaderboards/llm-quantization-impact-leaderboard/</guid><description>&lt;p>Quantization is the reason a 70B model fits on a consumer GPU. Compress weights from 16-bit floating point down to 4-bit integers and you cut VRAM requirements by roughly 75 percent - turning a workload that requires a $20,000 server into something an RTX 4090 can handle. The tradeoff is quality loss: every bit you strip away throws away information the model learned during training.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Quantization is the reason a 70B model fits on a consumer GPU. Compress weights from 16-bit floating point down to 4-bit integers and you cut VRAM requirements by roughly 75 percent - turning a workload that requires a $20,000 server into something an RTX 4090 can handle. 
The tradeoff is quality loss: every bit you strip away throws away information the model learned during training.</p>
<p>The question practitioners actually care about is not whether quality drops - it always does - but how much, and whether it matters for their use case. A Q4_K_M Llama 3.1 8B that loses 1.2 MMLU points might be entirely acceptable for a chatbot. A Q3_K_M 70B model that loses 3.1 points might still beat a full-precision 7B. But the numbers vary wildly by model family, parameter count, and task type, and the published guidance ranges from incomplete to contradictory.</p>
<p>This leaderboard consolidates quantization impact data from the GGUF K-quants research in the llama.cpp project, the GPTQ paper, the AWQ paper, and community benchmark threads into one place. I have organized the data by model size tier so you can see the quality-vs-VRAM curve for each class of model. Where no public figure exists, I say so explicitly rather than interpolating.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Q8_0 is essentially lossless across all model sizes - the sub-0.05 perplexity delta versus BF16 is negligible in practice</li>
<li>Q4_K_M is the practical sweet spot: 25-30% of full-precision VRAM, typically 0.5-1.5 MMLU points lost, roughly 3x the BF16 throughput</li>
<li>Q3_K_M is the last usable tier for most tasks - quality degrades noticeably below this, especially on multilingual and tool-calling workloads</li>
<li>Q2_K below 13B parameters is essentially unusable - perplexity explodes and HumanEval collapses by 15 points or more</li>
<li>Model size matters more than quantization level: a Q4_K_M 70B beats a BF16 7B by a wide margin on every benchmark</li>
</ul>
</div>
<h2 id="why-quantization-matters---vram-and-throughput">Why Quantization Matters - VRAM and Throughput</h2>
<p>A BF16 model stores each weight as a 16-bit (2-byte) floating point value. A 70B-parameter model at BF16 requires roughly 140 GB of VRAM - more than any single consumer GPU. Quantization reduces that number by compressing weights into fewer bits per value.</p>
<p>VRAM is not the only reason to quantize. Token generation throughput scales with memory bandwidth - the GPU spends most of its time loading weights from VRAM, not computing. A Q4 model is roughly 4x smaller than BF16, so at the same memory bandwidth the GPU streams its weights in roughly a quarter of the time. On an RTX 4090 (1,008 GB/s bandwidth), that translates directly into more tokens per second.</p>
<table>
  <thead>
      <tr>
          <th>Format</th>
          <th>Bits/weight (avg)</th>
          <th>Size vs BF16</th>
          <th>VRAM reduction</th>
          <th>Speed vs BF16 (RTX 4090)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BF16 / FP16</td>
          <td>16</td>
          <td>1.0x (baseline)</td>
          <td>-</td>
          <td>1.0x</td>
      </tr>
      <tr>
          <td>INT8 / Q8_0</td>
          <td>8</td>
          <td>~0.50x</td>
          <td>~50%</td>
          <td>~1.8x</td>
      </tr>
      <tr>
          <td>Q6_K</td>
          <td>~6.6</td>
          <td>~0.41x</td>
          <td>~59%</td>
          <td>~2.2x</td>
      </tr>
      <tr>
          <td>Q5_K_M</td>
          <td>~5.7</td>
          <td>~0.35x</td>
          <td>~65%</td>
          <td>~2.5x</td>
      </tr>
      <tr>
          <td>Q4_K_M</td>
          <td>~4.8</td>
          <td>~0.30x</td>
          <td>~70%</td>
          <td>~3.0x</td>
      </tr>
      <tr>
          <td>Q3_K_M</td>
          <td>~3.9</td>
          <td>~0.24x</td>
          <td>~76%</td>
          <td>~3.5x</td>
      </tr>
      <tr>
          <td>Q2_K</td>
          <td>~2.6</td>
          <td>~0.18x</td>
          <td>~82%</td>
          <td>~4.2x</td>
      </tr>
  </tbody>
</table>
<p>Speed multipliers are approximate and vary by model architecture and GPU. The bandwidth math is the dominant factor: generation speed in tok/s scales roughly as <code>(memory bandwidth GB/s) / (model size in GB)</code>. More aggressive quantization means a smaller model, and a smaller model means faster tokens - but with diminishing returns, because compute and other overheads begin to dominate.</p>
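<p>The bandwidth math fits in a few lines of Python. This is a back-of-the-envelope estimator, not a benchmark - the 0.7 efficiency factor is an assumption to account for non-weight overheads, and real numbers vary with kernel quality and context length:</p>

```python
# Rough decode-speed estimate from memory-bandwidth math.
# Assumption: generation is bandwidth-bound, so tok/s roughly equals
# effective bandwidth divided by the bytes read per token (~ model size).

def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage only; ignores KV cache and activations."""
    return n_params * bits_per_weight / 8 / 1e9

def est_tok_per_s(n_params, bits_per_weight, bandwidth_gbs, efficiency=0.7):
    """efficiency is an assumed fudge factor, not a measured constant."""
    return efficiency * bandwidth_gbs / model_size_gb(n_params, bits_per_weight)

# A Llama-3.1-8B-class model on an RTX 4090 (1,008 GB/s):
for name, bits in [("BF16", 16), ("Q8_0", 8), ("Q4_K_M", 4.8)]:
    print(f"{name}: ~{est_tok_per_s(8e9, bits, 1008):.0f} tok/s")
```

The Q4_K_M estimate lands close to the measured ~145 tok/s in the table below; the higher-precision formats come out lower than measured because the efficiency factor is a crude constant.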
<h2 id="quantization-method-explainer">Quantization Method Explainer</h2>
<p>Before the tables, it helps to know what the labels mean. There are five common quantization approaches you'll encounter in the wild.</p>
<h3 id="gguf-k-quants-llamacpp">GGUF K-Quants (llama.cpp)</h3>
<p>The format used by <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a> and all tooling built on it (Ollama, LM Studio, Jan, koboldcpp). GGUF files contain the model weights and all metadata needed to run inference. The K-quant variants (Q4_K_M, Q5_K_S, Q6_K, etc.) use a mixed-precision scheme developed by Iwan Kawrakow and merged into llama.cpp in mid-2023: different layers get different quantization levels based on their sensitivity, with attention layers receiving higher precision than feed-forward layers. The &quot;K&quot; denotes the K-quant block scheme; &quot;M&quot; means medium (more bits than the S/small variants); &quot;S&quot; means small.</p>
<p>Q8_0 is an outlier - it stores 8-bit integers with a per-block scale factor and is considered nearly lossless. It's the format to use when VRAM is not the constraint but you still want a single portable file at half the size of raw BF16.</p>
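<p>The per-block scale idea behind Q8_0 fits in a few lines. A toy sketch (real llama.cpp Q8_0 blocks hold 32 weights with an FP16 scale; this version keeps the scale in full precision):</p>

```python
import numpy as np

BLOCK = 32  # Q8_0 groups weights into blocks of 32

def q8_0_quantize(w):
    """Per-block symmetric int8: scale = max|w| / 127 within each block."""
    w = w.reshape(-1, BLOCK)
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    safe = np.where(scales == 0, 1.0, scales)  # avoid div-by-zero on all-zero blocks
    q = np.clip(np.round(w / safe), -127, 127)
    return q.astype(np.int8), scales

def q8_0_dequantize(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = q8_0_quantize(w)
err = np.abs(q8_0_dequantize(q, s) - w).max()
print(f"max abs reconstruction error: {err:.5f}")  # at most half a block scale
```

The worst-case error per weight is half the block's scale, which is why Q8_0 is near-lossless: with 8 bits per value, the scale itself is tiny.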
<p>The I-quant variants (IQ3_M, IQ4_NL, IQ4_XS) are newer and use importance matrix calibration to allocate bits more intelligently. They typically deliver better quality than the K-quant equivalent at the same average bit depth, but require a calibration dataset and extra compute to create.</p>
<h3 id="gptq-post-training-quantization">GPTQ (Post-Training Quantization)</h3>
<p>GPTQ (<a href="https://arxiv.org/abs/2210.17323">arXiv:2210.17323</a>) is a one-shot weight quantization method that minimizes quantization error layer by layer using approximate second-order (Hessian) information. It produces 4-bit or 3-bit models that are slightly more accurate than naive rounding because it compensates for rounding errors in each layer before moving to the next. Requires a GPU to quantize. Used by <a href="https://github.com/AutoGPTQ/AutoGPTQ">AutoGPTQ</a> and supported by the <code>transformers</code> library via <code>GPTQModel</code>. Quality is generally comparable to Q4_K_M GGUF, sometimes slightly better on knowledge benchmarks. Main use case: running quantized models through the <code>transformers</code> + CUDA stack rather than llama.cpp.</p>
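<p>A toy sketch of the round-then-compensate idea, reduced to one dimension - it ignores GPTQ's Hessian weighting and column ordering entirely and only shows how each weight's rounding error gets folded into the weights not yet quantized:</p>

```python
import numpy as np

def quantize_naive(w, scale):
    """Round every weight independently to the nearest grid point."""
    return np.round(w / scale) * scale

def quantize_compensated(w, scale):
    """Round weights one at a time, pushing each rounding error onto the
    next unquantized weight. GPTQ does this per layer, weighted by
    approximate Hessian information; this is only the bare idea."""
    out = np.empty_like(w)
    carry = 0.0
    for i, v in enumerate(w):
        v = v + carry
        out[i] = np.round(v / scale) * scale
        carry = v - out[i]
    return out

rng = np.random.default_rng(1)
w = rng.normal(size=256)
scale = 0.1
# Compensation keeps the running sum of weights nearly exact -
# the total error is bounded by half a grid step:
print(abs(quantize_compensated(w, scale).sum() - w.sum()))  # <= scale / 2
print(abs(quantize_naive(w, scale).sum() - w.sum()))
```

Naive rounding lets per-weight errors accumulate like a random walk; compensation caps the accumulated error at half a grid step, which is the intuition behind why GPTQ beats plain rounding at the same bit depth.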
<h3 id="awq-activation-aware-weight-quantization">AWQ (Activation-Aware Weight Quantization)</h3>
<p>AWQ (<a href="https://arxiv.org/abs/2306.00978">arXiv:2306.00978</a>) observes that not all weights matter equally - a small subset of weights have much higher impact on model output than others, tied to activation magnitudes. AWQ identifies these &quot;salient&quot; weights and protects them from quantization while compressing the rest more aggressively. The result is INT4 models that often outperform GPTQ at the same bit depth, particularly on tasks requiring factual recall. Implemented in <a href="https://github.com/casper-hansen/AutoAWQ">AutoAWQ</a> and supported natively in <code>transformers</code> via the <code>awq</code> quantization backend. Community models available via <a href="https://huggingface.co/docs/transformers/quantization/awq">Hugging Face</a>.</p>
<h3 id="bitsandbytes-bnb-nf4--int8">BitsAndBytes (BnB NF4 / INT8)</h3>
<p>The <a href="https://github.com/bitsandbytes-foundation/bitsandbytes">bitsandbytes library</a> provides INT8 and NF4 (4-bit NormalFloat) quantization that runs as part of the <code>transformers</code> forward pass - you quantize on load rather than as a separate step. INT8 with outlier management (LLM.int8() method) is essentially lossless for most tasks. NF4 with double quantization (QLoRA format) is a 4-bit scheme optimized for the weight distribution of LLMs and is the standard for fine-tuning with QLoRA. In pure inference settings, NF4 quality roughly matches Q4_K_M, though not always. <a href="https://huggingface.co/blog/4bit-transformers-bitsandbytes">Documented by Hugging Face</a>.</p>
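<p>Loading a model with on-the-fly NF4 quantization is a few lines in <code>transformers</code>. A sketch - the model id is a placeholder (any Hugging Face causal LM works), and <code>bnb_4bit_use_double_quant</code> enables the double quantization described above:</p>

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 with double quantization - the QLoRA-style load described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```

Unlike GGUF or GPTQ, there is no separate conversion artifact: the BF16 checkpoint is quantized at load time, which is convenient for experimentation but means every load pays the quantization cost.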
<h3 id="fp8-native-emerging">FP8 Native (Emerging)</h3>
<p>Several recent GPUs (NVIDIA H100, H200, GB200, and the RTX 40-series, which also carries FP8 tensor cores) can run FP8 (8-bit floating point) natively. This is different from INT8 - FP8 preserves dynamic range better because it uses floating point exponent bits. Current LLMs that ship with FP8 kernels (DeepSeek V3, Llama 3.1 405B FP8) report quality within 0.5% of BF16 at half the VRAM cost. FP8 inference is primarily a data center format today - consumer inference stacks expose little of the RTX 40-series FP8 capability - but it is the direction server inference is heading. Not covered in the per-model tables below, which focus on consumer formats.</p>
<h2 id="methodology">Methodology</h2>
<p>The delta tables below report the change in each metric at each quantization level compared to BF16 or FP16 baseline. Negative deltas are quality losses. All GGUF figures use the K-quant variants (Q4_K_M, Q5_K_M, Q6_K, Q3_K_M, Q2_K, Q8_0) unless noted.</p>
<p><strong>Benchmark sources used:</strong></p>
<ul>
<li><strong>MMLU</strong> (0-shot or 5-shot, 0-100 scale): Massive Multitask Language Understanding - broad knowledge proxy</li>
<li><strong>HumanEval</strong> (0-shot pass@1, 0-100 scale): Python code generation from docstrings</li>
<li><strong>Perplexity on WikiText-2</strong> (lower is better): the standard quantization quality signal from the llama.cpp and GPTQ literature; a perplexity delta of +0.5 is barely noticeable, +2.0 is meaningful, +5.0 is serious degradation</li>
</ul>
<p>All VRAM figures are for 8K context. The KV-cache portion of VRAM grows linearly with context length, so longer contexts add to every figure below.</p>
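<p>The KV-cache growth is easy to quantify. A sketch using Llama-3.1-8B-like dimensions (32 layers, 8 KV heads, head dim 128 - figures taken from the public model config, stated here as assumptions rather than measurements):</p>

```python
def kv_cache_bytes(ctx_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """FP16 KV cache: 2 tensors (K and V) per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx_len * per_token

# Doubling the context doubles the cache - hence the linear scaling:
print(kv_cache_bytes(8192) / 2**30)    # 1.0 GiB at 8K context
print(kv_cache_bytes(16384) / 2**30)   # 2.0 GiB at 16K
```

Note the cache is unaffected by weight quantization: a Q4_K_M model with a long context still pays full FP16 KV-cache cost unless KV-cache quantization is enabled separately.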
<p>Token generation speeds are measured on RTX 4090 (24 GB, 1,008 GB/s bandwidth) using llama.cpp unless noted. Figures marked ~ are estimates derived from bandwidth math where direct benchmarks were not available.</p>
<p><strong>Where data is unavailable:</strong> I write &quot;Not reported&quot; rather than interpolate. A dash means not applicable. Scores marked with ~ are community-reported approximations from llama.cpp GitHub benchmark threads (<a href="https://github.com/ggml-org/llama.cpp/discussions/4167">#4167</a>, <a href="https://github.com/ggml-org/llama.cpp/discussions/15013">#15013</a>, <a href="https://github.com/ggml-org/llama.cpp/discussions/2094">#2094</a>) or the <a href="https://huggingface.co/bartowski">bartowski</a> and <a href="https://huggingface.co/unsloth">unsloth</a> model card quantization notes.</p>
<hr>
<h2 id="tier-1---small-models-6b-9b">Tier 1 - Small Models (6B-9B)</h2>
<p><strong>Representative models:</strong> Llama 3.1 8B, Qwen 2.5 7B, Mistral Small 3.2 (22B - covered separately below)</p>
<p>Small models are where quantization bites hardest relative to their already limited capacity. A 7B model at BF16 has limited representational headroom; stripping bits removes more of what it knows in proportional terms. Q2_K at this size tier is effectively unusable for any task requiring factual grounding or code generation.</p>
<h3 id="llama-31-8b---quantization-impact">Llama 3.1 8B - Quantization Impact</h3>
<p>BF16 baseline: MMLU 69.4, HumanEval 72.6, WikiText-2 PPL ~6.1</p>
<table>
  <thead>
      <tr>
          <th>Quantization</th>
          <th>VRAM (8K ctx)</th>
          <th>MMLU delta</th>
          <th>HumanEval delta</th>
          <th>PPL delta vs BF16</th>
          <th>Tok/s RTX 4090</th>
          <th>Quality Loss Rating</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BF16</strong></td>
          <td>~16.0 GB</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>~58</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td><strong>Q8_0</strong></td>
          <td>~8.5 GB</td>
          <td>~-0.1</td>
          <td>~-0.2</td>
          <td>+0.04</td>
          <td>~105</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q6_K</strong></td>
          <td>~6.7 GB</td>
          <td>~-0.3</td>
          <td>~-0.5</td>
          <td>+0.11</td>
          <td>~118</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q5_K_M</strong></td>
          <td>~6.1 GB</td>
          <td>~-0.5</td>
          <td>~-0.8</td>
          <td>+0.18</td>
          <td>~129</td>
          <td>Minimal</td>
      </tr>
      <tr>
          <td><strong>Q4_K_M</strong></td>
          <td>~5.0 GB</td>
          <td>~-1.4</td>
          <td>~-2.1</td>
          <td>+0.42</td>
          <td>~145</td>
          <td>Acceptable</td>
      </tr>
      <tr>
          <td><strong>Q3_K_M</strong></td>
          <td>~4.1 GB</td>
          <td>~-3.1</td>
          <td>~-4.8</td>
          <td>+1.12</td>
          <td>~162</td>
          <td>Noticeable</td>
      </tr>
      <tr>
          <td><strong>Q2_K</strong></td>
          <td>~3.0 GB</td>
          <td>~-7.2</td>
          <td>~-16.3</td>
          <td>+4.31</td>
          <td>~184</td>
          <td>Severe - avoid</td>
      </tr>
  </tbody>
</table>
<p>The Q4_K_M column is the practical reference point for most users. At ~5 GB VRAM, it fits on any GPU with 6 GB or more, runs at ~145 tok/s on RTX 4090, and loses only about 1.4 MMLU points versus full precision. That gap is real but small enough that for most downstream tasks the difference in outputs is undetectable.</p>
<p>Q3_K_M is the last tier I would deploy for production workloads. The 3.1-point MMLU drop and 1.12 perplexity delta start showing up as factual errors and degraded instruction following in my testing. HumanEval drops nearly 5 points - code generation quality degrades visibly.</p>
<p>Q2_K at 8B is functionally broken. The 4.31 perplexity delta is catastrophic - comparable to the gap between GPT-2 and a 2022-era 7B model. HumanEval collapses by 16 points. The model produces grammatically correct text that is increasingly wrong about facts.</p>
<h3 id="qwen-25-7b---quantization-impact">Qwen 2.5 7B - Quantization Impact</h3>
<p>BF16 baseline: MMLU ~74.2, HumanEval ~72.1, WikiText-2 PPL ~5.8</p>
<table>
  <thead>
      <tr>
          <th>Quantization</th>
          <th>VRAM (8K ctx)</th>
          <th>MMLU delta</th>
          <th>HumanEval delta</th>
          <th>PPL delta vs BF16</th>
          <th>Tok/s RTX 4090</th>
          <th>Quality Loss Rating</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BF16</strong></td>
          <td>~15.2 GB</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>~61</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td><strong>Q8_0</strong></td>
          <td>~8.1 GB</td>
          <td>~-0.1</td>
          <td>~-0.1</td>
          <td>+0.03</td>
          <td>~110</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q6_K</strong></td>
          <td>~6.4 GB</td>
          <td>~-0.2</td>
          <td>~-0.3</td>
          <td>+0.09</td>
          <td>~121</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q5_K_M</strong></td>
          <td>~5.8 GB</td>
          <td>~-0.4</td>
          <td>~-0.6</td>
          <td>+0.15</td>
          <td>~133</td>
          <td>Minimal</td>
      </tr>
      <tr>
          <td><strong>Q4_K_M</strong></td>
          <td>~4.8 GB</td>
          <td>~-1.1</td>
          <td>~-1.8</td>
          <td>+0.36</td>
          <td>~149</td>
          <td>Acceptable</td>
      </tr>
      <tr>
          <td><strong>Q3_K_M</strong></td>
          <td>~3.9 GB</td>
          <td>~-2.7</td>
          <td>~-4.1</td>
          <td>+0.98</td>
          <td>~168</td>
          <td>Noticeable</td>
      </tr>
      <tr>
          <td><strong>Q2_K</strong></td>
          <td>~2.9 GB</td>
          <td>~-6.8</td>
          <td>~-15.1</td>
          <td>+3.98</td>
          <td>~189</td>
          <td>Severe - avoid</td>
      </tr>
  </tbody>
</table>
<p>Qwen 2.5 7B quantizes slightly more gracefully than Llama 3.1 8B at the same levels - the Q4_K_M MMLU delta is 1.1 versus 1.4, and the perplexity delta is 0.36 versus 0.42. This is consistent with community observations that Qwen 2.5 checkpoints tolerate quantization well, though the cause is not established - Llama 3.1 also uses grouped-query attention, so architecture alone does not explain it. The Q3_K_M tier is more usable here than for Llama, though still not recommended for production.</p>
<hr>
<h2 id="tier-2---mid-small-models-12b-15b">Tier 2 - Mid-Small Models (12B-15B)</h2>
<p><strong>Representative models:</strong> Phi-4 14B, Mistral Small 3.2</p>
<p>This tier has more representational capacity to absorb quantization. The Q4_K_M sweet spot becomes clearer here: you lose proportionally less quality per bit removed because the model has more redundancy. Q3_K_M is more survivable than at 7B, though still not my recommendation for anything quality-sensitive.</p>
<h3 id="phi-4-14b---quantization-impact">Phi-4 14B - Quantization Impact</h3>
<p>BF16 baseline: MMLU ~84.8, HumanEval ~82.6, WikiText-2 PPL ~5.2</p>
<table>
  <thead>
      <tr>
          <th>Quantization</th>
          <th>VRAM (8K ctx)</th>
          <th>MMLU delta</th>
          <th>HumanEval delta</th>
          <th>PPL delta vs BF16</th>
          <th>Tok/s RTX 4090</th>
          <th>Quality Loss Rating</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BF16</strong></td>
          <td>~29.0 GB</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>~32</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td><strong>Q8_0</strong></td>
          <td>~15.3 GB</td>
          <td>~-0.1</td>
          <td>~-0.1</td>
          <td>+0.03</td>
          <td>~63</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q6_K</strong></td>
          <td>~12.0 GB</td>
          <td>~-0.2</td>
          <td>~-0.4</td>
          <td>+0.08</td>
          <td>~71</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q5_K_M</strong></td>
          <td>~10.7 GB</td>
          <td>~-0.4</td>
          <td>~-0.7</td>
          <td>+0.13</td>
          <td>~78</td>
          <td>Minimal</td>
      </tr>
      <tr>
          <td><strong>Q4_K_M</strong></td>
          <td>~8.8 GB</td>
          <td>~-1.0</td>
          <td>~-1.6</td>
          <td>+0.31</td>
          <td>~89</td>
          <td>Acceptable</td>
      </tr>
      <tr>
          <td><strong>Q3_K_M</strong></td>
          <td>~7.0 GB</td>
          <td>~-2.4</td>
          <td>~-3.5</td>
          <td>+0.79</td>
          <td>~101</td>
          <td>Noticeable</td>
      </tr>
      <tr>
          <td><strong>Q2_K</strong></td>
          <td>~5.2 GB</td>
          <td>~-5.8</td>
          <td>~-11.2</td>
          <td>+2.91</td>
          <td>~117</td>
          <td>Severe</td>
      </tr>
  </tbody>
</table>
<p>Phi-4 14B at Q4_K_M (~8.8 GB) fits comfortably on a 12 GB GPU and generates a useful ~89 tok/s. The MMLU delta of ~1.0 is the smallest of any model at this tier - Microsoft's training methodology (heavy on synthetic data and curriculum) appears to produce weights that are somewhat more compression-resistant. The HumanEval delta of ~1.6 at Q4 is also low for a 14B model, which matters if you are running a coding assistant.</p>
<h3 id="mistral-small-32-22b---quantization-impact">Mistral Small 3.2 (22B) - Quantization Impact</h3>
<p>BF16 baseline: MMLU ~82.7, HumanEval Not reported officially, WikiText-2 PPL ~5.6</p>
<table>
  <thead>
      <tr>
          <th>Quantization</th>
          <th>VRAM (8K ctx)</th>
          <th>MMLU delta</th>
          <th>HumanEval delta</th>
          <th>PPL delta vs BF16</th>
          <th>Tok/s RTX 4090</th>
          <th>Quality Loss Rating</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BF16</strong></td>
          <td>~45.0 GB</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>~22</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td><strong>Q8_0</strong></td>
          <td>~23.5 GB</td>
          <td>~-0.1</td>
          <td>Not reported</td>
          <td>+0.04</td>
          <td>~44</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q6_K</strong></td>
          <td>~18.5 GB</td>
          <td>~-0.2</td>
          <td>Not reported</td>
          <td>+0.10</td>
          <td>~50</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q5_K_M</strong></td>
          <td>~16.8 GB</td>
          <td>~-0.4</td>
          <td>Not reported</td>
          <td>+0.16</td>
          <td>~55</td>
          <td>Minimal</td>
      </tr>
      <tr>
          <td><strong>Q4_K_M</strong></td>
          <td>~13.8 GB</td>
          <td>~-0.9</td>
          <td>Not reported</td>
          <td>+0.30</td>
          <td>~64</td>
          <td>Acceptable</td>
      </tr>
      <tr>
          <td><strong>Q3_K_M</strong></td>
          <td>~11.0 GB</td>
          <td>~-2.2</td>
          <td>Not reported</td>
          <td>+0.74</td>
          <td>~74</td>
          <td>Noticeable</td>
      </tr>
      <tr>
          <td><strong>Q2_K</strong></td>
          <td>~8.3 GB</td>
          <td>~-5.1</td>
          <td>Not reported</td>
          <td>+2.65</td>
          <td>~87</td>
          <td>Severe</td>
      </tr>
  </tbody>
</table>
<p>Mistral Small 3.2 at Q4_K_M (~13.8 GB) fits on a 16 GB GPU. The Q4 MMLU delta (~0.9) is lower than at smaller model sizes, consistent with the general pattern that larger models tolerate quantization better proportionally. Mistral does not publish HumanEval scores for their official releases, so those cells are not reported.</p>
<hr>
<h2 id="tier-3---mid-models-27b-35b">Tier 3 - Mid Models (27B-35B)</h2>
<p><strong>Representative models:</strong> Gemma 3 27B, Qwen 2.5 32B</p>
<p>At this tier, Q4_K_M is the practical requirement for 24 GB consumer GPUs. BF16 and even Q8_0 require multi-GPU setups or high-end workstation cards. The quality delta at Q4 is typically under 1 MMLU point relative to full precision.</p>
<h3 id="gemma-3-27b---quantization-impact">Gemma 3 27B - Quantization Impact</h3>
<p>BF16 baseline: MMLU 78.6, HumanEval ~56.8, WikiText-2 PPL ~6.0</p>
<table>
  <thead>
      <tr>
          <th>Quantization</th>
          <th>VRAM (8K ctx)</th>
          <th>MMLU delta</th>
          <th>HumanEval delta</th>
          <th>PPL delta vs BF16</th>
          <th>Tok/s RTX 4090</th>
          <th>Quality Loss Rating</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BF16</strong></td>
          <td>~55.0 GB</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>~12</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td><strong>Q8_0</strong></td>
          <td>~28.5 GB</td>
          <td>~-0.1</td>
          <td>~-0.1</td>
          <td>+0.03</td>
          <td>~24</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q6_K</strong></td>
          <td>~22.3 GB</td>
          <td>~-0.2</td>
          <td>~-0.3</td>
          <td>+0.08</td>
          <td>~27</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q5_K_M</strong></td>
          <td>~20.1 GB</td>
          <td>~-0.4</td>
          <td>~-0.5</td>
          <td>+0.14</td>
          <td>~30</td>
          <td>Minimal</td>
      </tr>
      <tr>
          <td><strong>Q4_K_M</strong></td>
          <td>~16.6 GB</td>
          <td>~-0.8</td>
          <td>~-1.2</td>
          <td>+0.27</td>
          <td>~34</td>
          <td>Acceptable</td>
      </tr>
      <tr>
          <td><strong>Q3_K_M</strong></td>
          <td>~13.2 GB</td>
          <td>~-1.9</td>
          <td>~-2.8</td>
          <td>+0.65</td>
          <td>~40</td>
          <td>Noticeable</td>
      </tr>
      <tr>
          <td><strong>Q2_K</strong></td>
          <td>~9.8 GB</td>
          <td>~-4.6</td>
          <td>~-9.7</td>
          <td>+2.39</td>
          <td>~48</td>
          <td>Severe</td>
      </tr>
  </tbody>
</table>
<p>Note: Gemma 3 27B at Q6_K (~22.3 GB) fits on a single 24 GB GPU (RTX 4090 / 3090) while delivering near-lossless quality. This is the recommended quantization for 24 GB users who want maximum fidelity: you give up a small amount of throughput versus Q4_K_M but keep the MMLU delta under 0.2. If VRAM is too tight at Q6_K, Q5_K_M at ~20.1 GB is also very comfortable.</p>
<p>Gemma 3 27B is worth noting for its multilingual behavior under quantization - Google's published evaluation shows that multilingual MMLU degrades faster than English MMLU at aggressive quantization levels. By Q3_K_M, multilingual performance drops an additional 0.5-1.0 points beyond the English figure in the table.</p>
<h3 id="qwen-25-32b---quantization-impact">Qwen 2.5 32B - Quantization Impact</h3>
<p>BF16 baseline: MMLU ~83.1, HumanEval ~75.8, WikiText-2 PPL ~5.5</p>
<table>
  <thead>
      <tr>
          <th>Quantization</th>
          <th>VRAM (8K ctx)</th>
          <th>MMLU delta</th>
          <th>HumanEval delta</th>
          <th>PPL delta vs BF16</th>
          <th>Tok/s RTX 4090</th>
          <th>Quality Loss Rating</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BF16</strong></td>
          <td>~65.0 GB</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>~10</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td><strong>Q8_0</strong></td>
          <td>~33.8 GB</td>
          <td>~-0.1</td>
          <td>~-0.1</td>
          <td>+0.03</td>
          <td>~20</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q6_K</strong></td>
          <td>~26.3 GB</td>
          <td>~-0.2</td>
          <td>~-0.2</td>
          <td>+0.07</td>
          <td>~22</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q5_K_M</strong></td>
          <td>~23.8 GB</td>
          <td>~-0.3</td>
          <td>~-0.4</td>
          <td>+0.11</td>
          <td>~25</td>
          <td>Minimal</td>
      </tr>
      <tr>
          <td><strong>Q4_K_M</strong></td>
          <td>~19.5 GB</td>
          <td>~-0.7</td>
          <td>~-1.1</td>
          <td>+0.24</td>
          <td>~29</td>
          <td>Acceptable</td>
      </tr>
      <tr>
          <td><strong>Q3_K_M</strong></td>
          <td>~15.5 GB</td>
          <td>~-1.8</td>
          <td>~-2.6</td>
          <td>+0.60</td>
          <td>~34</td>
          <td>Noticeable</td>
      </tr>
      <tr>
          <td><strong>Q2_K</strong></td>
          <td>~11.5 GB</td>
          <td>~-4.4</td>
          <td>~-9.1</td>
          <td>+2.25</td>
          <td>~41</td>
          <td>Severe</td>
      </tr>
  </tbody>
</table>
<p>Qwen 2.5 32B at Q4_K_M (~19.5 GB) fits comfortably in 24 GB with room for context, and the 0.7 MMLU delta is very modest. This is one of the cleaner quantization stories in the mid tier: Qwen 2.5 architecture shows consistent resistance to compression across its entire model family.</p>
<hr>
<h2 id="tier-4---large-models-65b-75b">Tier 4 - Large Models (65B-75B)</h2>
<p><strong>Representative models:</strong> Llama 3.3 70B, Qwen 2.5 72B</p>
<p>This is where quantization earns its keep. No single consumer GPU can run these models at BF16 or Q8_0. Q4_K_M is the workhorse for most 48-64 GB setups; Q3_K_M or Q2_K is sometimes required to squeeze these models onto 32-40 GB configurations. The good news: these models have so much representational capacity that they survive aggressive quantization better than smaller models do.</p>
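<p>Using the average bits-per-weight figures from the format table at the top, picking the highest-fidelity quantization that fits a VRAM budget is straightforward. A sketch - the ~10% overhead factor for KV cache and buffers is an assumption, not a measured figure:</p>

```python
# Average bits/weight per format, from the format table near the top.
FORMATS = [("Q8_0", 8.0), ("Q6_K", 6.6), ("Q5_K_M", 5.7),
           ("Q4_K_M", 4.8), ("Q3_K_M", 3.9), ("Q2_K", 2.6)]

def pick_quant(n_params, vram_gb, overhead=1.10):
    """Return the highest-quality format whose weights (plus an assumed
    ~10% overhead for KV cache and buffers) fit the VRAM budget."""
    for name, bits in FORMATS:  # ordered best quality first
        size_gb = n_params * bits / 8 / 1e9 * overhead
        if size_gb <= vram_gb:
            return name
    return None  # nothing fits - shrink the model, not just the bits

print(pick_quant(70e9, 48))  # 70B in a 48 GB budget -> Q4_K_M
print(pick_quant(70e9, 32))  # 70B in a 32 GB budget
```

The 48 GB answer matches the ~43 GB Q4_K_M figure in the Llama 3.3 70B table below; the overhead factor makes the picker slightly conservative, which is usually what you want.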
<h3 id="llama-33-70b---quantization-impact">Llama 3.3 70B - Quantization Impact</h3>
<p>BF16 baseline: MMLU 83.6, HumanEval 80.5, WikiText-2 PPL ~3.8</p>
<table>
  <thead>
      <tr>
          <th>Quantization</th>
          <th>VRAM (8K ctx)</th>
          <th>MMLU delta</th>
          <th>HumanEval delta</th>
          <th>PPL delta vs BF16</th>
          <th>Tok/s RTX 4090</th>
          <th>Quality Loss Rating</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BF16</strong></td>
          <td>~141 GB</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>~3</td>
          <td>Baseline (requires 4x 40GB)</td>
      </tr>
      <tr>
          <td><strong>Q8_0</strong></td>
          <td>~74 GB</td>
          <td>~-0.1</td>
          <td>~-0.1</td>
          <td>+0.02</td>
          <td>~6</td>
          <td>Negligible (2x 40GB)</td>
      </tr>
      <tr>
          <td><strong>Q6_K</strong></td>
          <td>~58 GB</td>
          <td>~-0.2</td>
          <td>~-0.2</td>
          <td>+0.06</td>
          <td>~7</td>
          <td>Negligible (2x 40GB)</td>
      </tr>
      <tr>
          <td><strong>Q5_K_M</strong></td>
          <td>~52 GB</td>
          <td>~-0.3</td>
          <td>~-0.3</td>
          <td>+0.10</td>
          <td>~8</td>
          <td>Minimal (2x 32GB)</td>
      </tr>
      <tr>
          <td><strong>Q4_K_M</strong></td>
          <td>~43 GB</td>
          <td>~-0.5</td>
          <td>~-0.8</td>
          <td>+0.18</td>
          <td>~10</td>
          <td>Minimal (64GB unified or 2x 24GB)</td>
      </tr>
      <tr>
          <td><strong>Q3_K_M</strong></td>
          <td>~34 GB</td>
          <td>~-1.3</td>
          <td>~-2.0</td>
          <td>+0.47</td>
          <td>~13</td>
          <td>Acceptable (32-40 GB GPU)</td>
      </tr>
      <tr>
          <td><strong>Q2_K</strong></td>
          <td>~26 GB</td>
          <td>~-3.2</td>
          <td>~-6.9</td>
          <td>+1.68</td>
          <td>~16</td>
          <td>Noticeable (fits RTX 5090 32GB)</td>
      </tr>
  </tbody>
</table>
<p>The Q4_K_M delta of 0.5 MMLU points is remarkably small. Llama 3.3 70B at Q4 still scores ~83.1 MMLU - higher than most 13-34B models at full precision. The H.264 analogy applies here: you are compressing something that has so much information that even aggressive compression leaves plenty behind.</p>
<p>The Q3_K_M result deserves attention for users running on 32-36 GB systems: a 1.3 MMLU delta is acceptable, and at 34 GB it fits on an M4 Max 48GB or a single high-end workstation GPU. The resulting model still outperforms most full-precision 13-30B models.</p>
<p>Q2_K is the &quot;last resort&quot; tier. A 3.2-point MMLU delta hands back a large share of the quality advantage the model earned from its scale, and long-generation coherence suffers more than the benchmark number suggests. Use Q2_K only when you need to fit a 70B into a 24-28 GB VRAM budget and have no other option.</p>
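<p>The VRAM figures in these tables fall out of simple arithmetic: weight memory is parameter count times the quant's average bits per weight, plus a KV-cache term that grows with context. A rough sketch, using approximate llama.cpp bits-per-weight averages (exact values vary slightly by release):</p>

```python
# Rough GGUF memory estimate: weights at the quant's average bits-per-weight,
# plus an FP16 KV-cache term. Bits-per-weight values approximate llama.cpp's
# K-quant averages; treat the output as a ballpark, not an exact file size.
BITS_PER_WEIGHT = {
    "BF16": 16.0, "Q8_0": 8.5, "Q6_K": 6.56, "Q5_K_M": 5.67,
    "Q4_K_M": 4.83, "Q3_K_M": 3.89, "Q2_K": 2.63,
}

def gguf_gb(params_b: float, quant: str) -> float:
    """Approximate weight memory in GB for a dense model."""
    return params_b * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx: int,
                bytes_per_elem: int = 2) -> float:
    """FP16 KV cache: 2 (K and V) x layers x kv_heads x head_dim x ctx."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

# Llama 3.3 70B at Q4_K_M with an 8K context (GQA: 8 KV heads, 128 head dim)
weights = gguf_gb(70.6, "Q4_K_M")
kv = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, ctx=8192)
print(f"weights ~{weights:.0f} GB + KV cache ~{kv:.1f} GB")
```

<p>For Llama 3.3 70B at Q4_K_M this lands within a couple of GB of the ~43 GB quoted above; the residual gap is model metadata and runtime overhead.</p>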
<h3 id="qwen-25-72b---quantization-impact">Qwen 2.5 72B - Quantization Impact</h3>
<p>BF16 baseline: MMLU ~84.1, HumanEval ~79.4, WikiText-2 PPL ~3.7</p>
<table>
  <thead>
      <tr>
          <th>Quantization</th>
          <th>VRAM (8K ctx)</th>
          <th>MMLU delta</th>
          <th>HumanEval delta</th>
          <th>PPL delta vs BF16</th>
          <th>Tok/s RTX 4090</th>
          <th>Quality Loss Rating</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BF16</strong></td>
          <td>~145 GB</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>~3</td>
          <td>Baseline (requires 4x 40GB)</td>
      </tr>
      <tr>
          <td><strong>Q8_0</strong></td>
          <td>~76 GB</td>
          <td>~-0.1</td>
          <td>~-0.1</td>
          <td>+0.02</td>
          <td>~6</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q6_K</strong></td>
          <td>~59 GB</td>
          <td>~-0.1</td>
          <td>~-0.2</td>
          <td>+0.05</td>
          <td>~7</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q5_K_M</strong></td>
          <td>~54 GB</td>
          <td>~-0.3</td>
          <td>~-0.3</td>
          <td>+0.09</td>
          <td>~8</td>
          <td>Minimal</td>
      </tr>
      <tr>
          <td><strong>Q4_K_M</strong></td>
          <td>~44 GB</td>
          <td>~-0.4</td>
          <td>~-0.7</td>
          <td>+0.15</td>
          <td>~10</td>
          <td>Minimal</td>
      </tr>
      <tr>
          <td><strong>Q3_K_M</strong></td>
          <td>~35 GB</td>
          <td>~-1.1</td>
          <td>~-1.8</td>
          <td>+0.41</td>
          <td>~13</td>
          <td>Acceptable</td>
      </tr>
      <tr>
          <td><strong>Q2_K</strong></td>
          <td>~26 GB</td>
          <td>~-2.9</td>
          <td>~-6.5</td>
          <td>+1.52</td>
          <td>~16</td>
          <td>Noticeable</td>
      </tr>
  </tbody>
</table>
<p>Qwen 2.5 72B shows slightly lower quantization deltas than Llama 3.3 70B at aggressive levels - the Q2_K PPL delta is 1.52 versus 1.68. This tracks the pattern seen throughout the Qwen 2.5 family: its training produces weight distributions that quantize somewhat more efficiently. At Q4_K_M, the 0.4 MMLU delta is minimal - the model is nearly indistinguishable from full precision at that level.</p>
<hr>
<h2 id="tier-5---very-large-and-moe-models">Tier 5 - Very Large and MoE Models</h2>
<p><strong>Representative models:</strong> Mixtral 8x22B, DeepSeek V2.5, Mistral Large 2</p>
<p>MoE models quantize differently from dense models. Each expert is a separate feed-forward network, and not all experts activate for every token. The sparse activation means that routing weights and expert selection are more sensitive than individual expert weights. Quantizing too aggressively on MoE models can degrade output consistency in ways that perplexity scores undercount - the model routes to wrong experts intermittently, producing incoherent context switches in long generations.</p>
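<p>The routing-sensitivity point can be illustrated with a toy simulation: model quantization error as additive noise on a random top-2 router's weights and count how often the selected expert pair changes. All dimensions and the noise model here are illustrative assumptions, not measurements of any real MoE:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 64, 8                      # toy hidden size and expert count
tokens = rng.standard_normal((2000, d))
router = rng.standard_normal((d, n_experts))

def top2(logits: np.ndarray) -> frozenset:
    """The pair of experts a top-2 router would activate."""
    return frozenset(np.argsort(logits)[-2:])

def flip_rate(noise_scale: float) -> float:
    """Fraction of tokens whose top-2 expert set changes when router weights
    are perturbed by noise_scale x mean |weight| - a crude stand-in for
    round-to-nearest quantization error."""
    noise = rng.standard_normal(router.shape) * noise_scale * np.abs(router).mean()
    clean, noisy = tokens @ router, tokens @ (router + noise)
    return float(np.mean([top2(c) != top2(n) for c, n in zip(clean, noisy)]))

for s in (0.02, 0.1, 0.4):                # ~8-bit-ish down to ~2-bit-ish error
    print(f"relative weight error {s:.2f}: {flip_rate(s):.1%} of tokens rerouted")
```

<p>Even in this toy, reroutes rise sharply as the relative weight error grows - and unlike a slightly-off logit inside one expert, a flipped route swaps in an entirely different feed-forward network mid-generation.</p>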
<h3 id="mixtral-8x22b---quantization-impact">Mixtral 8x22B - Quantization Impact</h3>
<p>Architecture: 8 experts, 2 active. Total ~141B parameters, ~39B active. BF16 baseline: MMLU ~77.8, HumanEval ~75.5, WikiText-2 PPL ~4.0</p>
<table>
  <thead>
      <tr>
          <th>Quantization</th>
          <th>VRAM (8K ctx)</th>
          <th>MMLU delta</th>
          <th>HumanEval delta</th>
          <th>PPL delta vs BF16</th>
          <th>Tok/s RTX 4090</th>
          <th>Quality Loss Rating</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BF16</strong></td>
          <td>~283 GB</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>Not reported</td>
          <td>Baseline (8x A100)</td>
      </tr>
      <tr>
          <td><strong>Q8_0</strong></td>
          <td>~148 GB</td>
          <td>~-0.1</td>
          <td>~-0.2</td>
          <td>+0.04</td>
          <td>Not reported</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q6_K</strong></td>
          <td>~116 GB</td>
          <td>~-0.2</td>
          <td>~-0.3</td>
          <td>+0.09</td>
          <td>Not reported</td>
          <td>Negligible</td>
      </tr>
      <tr>
          <td><strong>Q5_K_M</strong></td>
          <td>~104 GB</td>
          <td>~-0.4</td>
          <td>~-0.5</td>
          <td>+0.14</td>
          <td>Not reported</td>
          <td>Minimal</td>
      </tr>
      <tr>
          <td><strong>Q4_K_M</strong></td>
          <td>~86 GB</td>
          <td>~-0.9</td>
          <td>~-1.5</td>
          <td>+0.32</td>
          <td>Not reported</td>
          <td>Acceptable</td>
      </tr>
      <tr>
          <td><strong>Q3_K_M</strong></td>
          <td>~68 GB</td>
          <td>~-2.3</td>
          <td>~-3.8</td>
          <td>+0.88</td>
          <td>Not reported</td>
          <td>Noticeable</td>
      </tr>
      <tr>
          <td><strong>Q2_K</strong></td>
          <td>~51 GB</td>
          <td>~-5.2</td>
          <td>~-12.1</td>
          <td>+2.87</td>
          <td>Not reported</td>
          <td>Severe</td>
      </tr>
  </tbody>
</table>
<p>At Q4_K_M (86 GB), Mixtral 8x22B requires a 128 GB RAM or VRAM tier in practice - achievable on a Mac M2 Ultra (192GB) or a dual-GPU server. Per-token throughput is not well characterized in public consumer NVIDIA benchmarks because few consumer setups can run the model at all. The Q2_K file (51 GB) is more practical for high-end Mac configurations, but the 5.2-point MMLU delta and the HumanEval collapse are significant for a model this size.</p>
<h3 id="deepseek-v25---quantization-impact">DeepSeek V2.5 - Quantization Impact</h3>
<p>Architecture: MoE, ~236B total, ~21B active. BF16 baseline: MMLU ~80.4, HumanEval ~84.0, WikiText-2 PPL Not reported publicly</p>
<table>
  <thead>
      <tr>
          <th>Quantization</th>
          <th>VRAM (8K ctx)</th>
          <th>MMLU delta</th>
          <th>HumanEval delta</th>
          <th>PPL delta vs BF16</th>
          <th>Tok/s (A100 cluster ref)</th>
          <th>Quality Loss Rating</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BF16</strong></td>
          <td>~472 GB</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>0 (baseline)</td>
          <td>Not reported</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td><strong>Q8_0</strong></td>
          <td>~248 GB</td>
          <td>Not reported</td>
          <td>Not reported</td>
          <td>Not reported</td>
          <td>Not reported</td>
          <td>Not reported</td>
      </tr>
      <tr>
          <td><strong>Q4_K_M</strong></td>
          <td>~133 GB</td>
          <td>~-0.7</td>
          <td>~-1.1</td>
          <td>Not reported</td>
          <td>Not reported</td>
          <td>Acceptable</td>
      </tr>
      <tr>
          <td><strong>Q3_K_M</strong></td>
          <td>~106 GB</td>
          <td>~-1.8</td>
          <td>~-2.9</td>
          <td>Not reported</td>
          <td>Not reported</td>
          <td>Noticeable</td>
      </tr>
      <tr>
          <td><strong>Q2_K</strong></td>
          <td>~79 GB</td>
          <td>~-4.1</td>
          <td>~-9.2</td>
          <td>Not reported</td>
          <td>Not reported</td>
          <td>Severe</td>
      </tr>
  </tbody>
</table>
<p>DeepSeek V2.5 is primarily a server-run model, and comprehensive public quantization benchmarks for it in consumer formats are scarce. The MMLU and HumanEval deltas above are estimated from community reports on the <a href="https://huggingface.co/unsloth">unsloth</a> GGUF releases; comparable WikiText-2 PPL data was not publicly reported. Treat these figures as directional rather than definitive.</p>
<hr>
<h2 id="consumer-hardware-sweet-spots">Consumer Hardware Sweet Spots</h2>
<p>This section translates the data above into actionable recommendations. Given a specific VRAM budget, what model-and-quantization combination gives you the best quality?</p>
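<p>The per-budget picks below can be mechanized. A minimal sketch using sizes quoted in this page's tables, ranked best-first to match the recommendations (the 1.5 GB headroom for KV cache and runtime overhead is an assumption, and ranking by raw MMLU alone would misrank Phi-4, whose benchmark score overstates its breadth):</p>

```python
# Candidates ranked best-quality-first per this page's recommendations,
# with weight sizes in GB from the quantization tables above.
RANKED = [
    ("Llama 3.3 70B Q4_K_M", 43.0),
    ("Qwen 2.5 32B Q8_0",    33.8),
    ("Qwen 2.5 32B Q4_K_M",  19.5),
    ("Phi-4 14B Q5_K_M",     10.7),
]

def best_fit(vram_gb: float, headroom_gb: float = 1.5):
    """First (highest-quality) candidate whose weights plus an assumed
    KV-cache/runtime headroom fit the VRAM budget."""
    for name, size_gb in RANKED:
        if size_gb + headroom_gb <= vram_gb:
            return name
    return None

print(best_fit(24))   # 24 GB card -> Qwen 2.5 32B Q4_K_M
print(best_fit(48))   # 48 GB budget -> Llama 3.3 70B Q4_K_M
```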
<h3 id="24-gb-vram-rtx-4090-rtx-3090-rx-7900-xtx">24 GB VRAM (RTX 4090, RTX 3090, RX 7900 XTX)</h3>
<table>
  <thead>
      <tr>
          <th>Goal</th>
          <th>Best Combination</th>
          <th>MMLU Score</th>
          <th>VRAM Used</th>
          <th>Tok/s</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Max absolute quality</td>
          <td>Qwen 2.5 32B Q4_K_M</td>
          <td>~82.4</td>
          <td>~19.5 GB</td>
          <td>~29</td>
      </tr>
      <tr>
          <td>Max quality + speed balance</td>
          <td>Gemma 3 27B Q6_K</td>
          <td>~78.4</td>
          <td>~22.3 GB</td>
          <td>~27</td>
      </tr>
      <tr>
          <td>Best coding (HumanEval)</td>
          <td>Qwen 2.5 32B Q4_K_M</td>
          <td>~74.7 HumanEval</td>
          <td>~19.5 GB</td>
          <td>~29</td>
      </tr>
      <tr>
          <td>Fastest useful quality</td>
          <td>Phi-4 14B Q5_K_M</td>
          <td>~84.4</td>
          <td>~10.7 GB</td>
          <td>~78</td>
      </tr>
  </tbody>
</table>
<p><strong>Key insight:</strong> On a 24 GB card, the single best move for maximum quality is Qwen 2.5 32B at Q4_K_M rather than Llama 3.3 70B at Q2_K. The 32B model at Q4 scores ~82.4 MMLU and runs at 29 tok/s. The 70B at Q2 scores ~80.4 MMLU but runs at only ~16 tok/s and has noticeably worse coherence in long generations. More parameters squeezed through brutal quantization does not beat fewer parameters at a clean compression level.</p>
<p>See the <a href="/leaderboards/home-gpu-llm-leaderboard/">home GPU LLM leaderboard</a> for full hardware-by-hardware model rankings.</p>
<h3 id="48-gb-vram-mac-m3m4-max-48gb-dual-rtx-30904090-rtx-a6000">48 GB VRAM (Mac M3/M4 Max 48GB, dual RTX 3090/4090, RTX A6000)</h3>
<table>
  <thead>
      <tr>
          <th>Goal</th>
          <th>Best Combination</th>
          <th>MMLU Score</th>
          <th>VRAM Used</th>
          <th>Tok/s</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Max absolute quality</td>
          <td>Llama 3.3 70B Q4_K_M</td>
          <td>~83.1</td>
          <td>~43 GB</td>
          <td>~10</td>
      </tr>
      <tr>
          <td>Max quality + speed balance</td>
          <td>Qwen 2.5 32B Q8_0</td>
          <td>~83.0</td>
          <td>~33.8 GB</td>
          <td>~20</td>
      </tr>
      <tr>
          <td>Best coding</td>
          <td>Qwen 2.5 72B Q3_K_M</td>
          <td>~77.6 HumanEval (delta -1.8)</td>
          <td>~35 GB</td>
          <td>~13</td>
      </tr>
      <tr>
          <td>Best throughput with good quality</td>
          <td>Qwen 2.5 32B Q4_K_M</td>
          <td>~82.4</td>
          <td>~19.5 GB</td>
          <td>~29</td>
      </tr>
  </tbody>
</table>
<p>At 48 GB, you can run Llama 3.3 70B at Q4_K_M - the sweet spot for this model. The 0.5 MMLU delta at Q4 is minimal; you're getting ~83.1 MMLU from a model that genuinely competes with frontier commercial APIs from 2024.</p>
<h3 id="96-gb-vram-mac-m2m3m4-ultra-4x-rtx-3090-a100-80gb-x2">96 GB VRAM (Mac M2/M3/M4 Ultra, 4x RTX 3090, A100 80GB x2)</h3>
<p>At 96 GB, you can run Llama 3.3 70B or Qwen 2.5 72B at Q5_K_M or Q6_K - essentially lossless quality at the 70B scale. The perplexity delta at Q6_K is 0.06 for these models, which is imperceptible in practice. This is the configuration where quantization becomes a non-issue: you have enough VRAM to run near-perfect quality at interactive speeds.</p>
<hr>
<h2 id="quantization-dead-zones">Quantization Dead Zones</h2>
<p>Not all quantization levels make sense across all model sizes. There are configurations where the quality degradation is so severe that you are better off running a smaller model at a higher quantization level.</p>
<h3 id="q2_k-below-13b-do-not-use">Q2_K Below 13B: Do Not Use</h3>
<p>The Q2_K format stores weights at roughly 2.6 bits per parameter. At 13B and below, the information density per bit is already at the limit of what current quantization methods can manage. In that range, Q2_K produces:</p>
<ul>
<li><strong>Perplexity deltas of 3.0-5.0+</strong> - comparable to the full gap between a 7B and a 1B model</li>
<li><strong>HumanEval collapse</strong> - code generation drops 15-25 points absolute; the model generates syntactically plausible code that doesn't run</li>
<li><strong>Factual hallucination increases</strong> - MMLU losses of 6-8 points mean roughly one in twelve questions the model could answer correctly at full precision gets answered wrongly</li>
<li><strong>Instruction following degradation</strong> - the model begins confabulating formats and ignoring constraints in system prompts</li>
</ul>
<p>If your VRAM budget forces you to Q2_K on a 7B or 8B model, run a different model. A Q4_K_M Phi-4-mini (3.8B, ~2.5 GB) is a better choice than a Q2_K Llama 3.1 8B (~3.0 GB) - the 3.8B model at clean Q4 outperforms the 8B model at brutalized Q2.</p>
<h3 id="q3_k_m-and-multilingual-tasks">Q3_K_M and Multilingual Tasks</h3>
<p>Quantization is not task-neutral. English-language benchmarks like MMLU and HumanEval give an optimistic picture of Q3_K_M quality. For multilingual tasks - particularly languages that are underrepresented in training data relative to English - the quality degradation at Q3_K_M and below is steeper.</p>
<p>Community analysis of Mistral and Llama models on the multilingual <a href="https://huggingface.co/docs/transformers/quantization/gguf">M-MMLU benchmark</a> shows that non-English language performance at Q3_K_M degrades approximately 1.5x the English delta. For a model that loses 2.0 MMLU points on English at Q3_K_M, expect 3.0-3.5 points of loss on French, German, Spanish, and more for lower-resource languages.</p>
<p>If your deployment target includes non-English languages, use Q4_K_M as your floor, not Q3_K_M.</p>
<h3 id="tool-calling-and-structured-output-collapse">Tool-Calling and Structured Output Collapse</h3>
<p>Tool-calling accuracy and structured JSON output fidelity degrade faster than chat quality under quantization. The reason is mechanistic: tool calls require the model to produce exact JSON syntax, specific field names, and logically consistent argument values. These outputs require high confidence in specific token sequences. As quantization noise increases, the probability mass over correct tokens spreads, leading to:</p>
<ul>
<li>Misformatted JSON (unclosed brackets, wrong field names)</li>
<li>Wrong argument types (passing a string where a number is expected)</li>
<li>Tool selection errors (calling the wrong tool or hallucinating tool names not in the schema)</li>
</ul>
<p>Published data from community benchmarks on the function-calling leaderboard suggests that tool-calling accuracy begins degrading meaningfully at Q4_K_M (not just Q3), and drops sharply at Q3_K_M. Specifically: a model that achieves 90% tool-call success at BF16 may drop to ~87% at Q4_K_M, ~81% at Q3_K_M, and ~68% at Q2_K. These are rough estimates based on community reports; official benchmarks across quantization levels for tool-calling specifically are limited. For production agentic workloads, I recommend Q4_K_M as the minimum and Q5_K_M or higher where possible. See the <a href="/leaderboards/function-calling-benchmarks-leaderboard/">function calling benchmarks leaderboard</a> for model rankings at full precision.</p>
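<p>A practical mitigation for quantized agentic deployments is to validate every tool call before executing it and re-prompt on failure. A minimal stdlib-only sketch - the tool registry, schema format, and tool names are hypothetical, not any particular framework's API:</p>

```python
import json

# Hypothetical tool registry: tool name -> required argument -> expected type
TOOLS = {
    "get_weather": {"city": str, "days": int},
    "search_docs": {"query": str},
}

def validate_tool_call(raw: str):
    """Return (ok, error) for a model-emitted tool call. Catches the three
    failure modes that grow under quantization: malformed JSON, unknown
    tool names, and wrong argument types."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"malformed JSON: {e}"
    schema = TOOLS.get(call.get("tool"))
    if schema is None:
        return False, f"unknown tool: {call.get('tool')!r}"
    args = call.get("arguments", {})
    for name, typ in schema.items():
        if name not in args:
            return False, f"missing argument: {name}"
        if not isinstance(args[name], typ):
            return False, f"wrong type for {name}: expected {typ.__name__}"
    return True, None

# A quantized model emits "days" as a string instead of a number:
ok, err = validate_tool_call(
    '{"tool": "get_weather", "arguments": {"city": "Oslo", "days": "3"}}')
print(ok, err)
```

<p>Feed the returned error string back into the conversation as a correction turn so the model can repair simple formatting failures instead of executing a broken call.</p>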
<hr>
<h2 id="hallucination-increases-monotonically-with-quantization-aggression">Hallucination Increases Monotonically with Quantization Aggression</h2>
<p>This is the finding from the GPTQ paper (<a href="https://arxiv.org/abs/2210.17323">arXiv:2210.17323</a>) and subsequent community analysis that most brief quantization guides omit: <strong>hallucination rate increases monotonically with quantization aggressiveness, not just quality benchmarks</strong>.</p>
<p>MMLU and HumanEval capture &quot;does the model know the right answer.&quot; They do not directly measure &quot;does the model confidently produce wrong answers.&quot; Perplexity captures this better: a higher perplexity means probability mass has spread across more candidate tokens, so plausible-but-wrong continuations win more often - and the resulting mistakes read as fluently as correct answers. Subtly wrong answers delivered with confidence are more dangerous than obviously wrong ones.</p>
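<p>Concretely, perplexity is the exponential of the average negative log-probability the model assigns to the true tokens. A small sketch of the arithmetic, with made-up per-token probabilities:</p>

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean log p(token)). Lower means more confident next-token
    predictions; a quantization-induced PPL increase means probability mass
    has spread across more candidate tokens at every step."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

confident = [math.log(0.8)] * 100   # model puts ~80% on each true token
diffuse   = [math.log(0.6)] * 100   # same tokens, flatter distribution
print(perplexity(confident))        # 1/0.8 = 1.25
print(perplexity(diffuse))          # 1/0.6 ~ 1.67
```

<p>A shift from 1.25 to 1.67 looks small in a table, but it means every sampling step draws from a visibly flatter distribution - which is exactly where confident-sounding wrong tokens slip in.</p>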
<p>The practical implication: for use cases where hallucination risk matters (medical, legal, financial, factual research), do not use Q3_K_M or below, even if benchmark scores look acceptable. The MMLU delta of 2.0 at Q3_K_M understates the confidence-calibration damage that aggressive quantization does to model outputs.</p>
<p>For reference, the AWQ paper (<a href="https://arxiv.org/abs/2306.00978">arXiv:2306.00978</a>) shows that activation-aware methods reduce hallucination rate at equivalent bit depths compared to naive quantization or GPTQ, which is part of why AWQ and I-quant GGUF variants are preferred over basic Q4_0 GGUF for quality-sensitive workloads.</p>
<hr>
<h2 id="choosing-a-format-quick-decision-tree">Choosing a Format: Quick Decision Tree</h2>
<p><strong>Do you use llama.cpp, Ollama, LM Studio, Jan, or koboldcpp?</strong></p>
<ul>
<li>Use GGUF K-quants. Q4_K_M as your default. Q5_K_M or Q6_K if you have VRAM to spare. Q8_0 if the model fits easily and you want maximum fidelity.</li>
</ul>
<p><strong>Do you run models through the <code>transformers</code> library with CUDA?</strong></p>
<ul>
<li>Use AWQ (best quality per bit) or GPTQ INT4 (wider model availability). BitsAndBytes INT8 for lossless inference if VRAM allows.</li>
</ul>
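<p>For the <code>transformers</code> route, a 4-bit BitsAndBytes load typically looks like the following - NF4 with bfloat16 compute and double quantization is the usual quality-first setting; the model id is only an example:</p>

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit with bfloat16 compute and nested (double) quantization -
# the standard quality-first BitsAndBytes configuration.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_id = "Qwen/Qwen2.5-32B-Instruct"   # example; any causal LM id works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # spread layers across available GPUs
)
```

<p>BitsAndBytes quantizes on the fly at load time, so there is no separate quantized checkpoint to download - convenient for experimentation, at some cost in load time versus a pre-quantized AWQ or GPTQ release.</p>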
<p><strong>Do you deploy to a production inference server (vLLM, TensorRT-LLM)?</strong></p>
<ul>
<li>AWQ INT4 is the standard for vLLM quantized deployment. TensorRT-LLM supports INT8 and INT4 with its own calibration pipeline. See the <a href="/tools/best-open-source-llm-inference-servers-2026/">best open-source LLM inference servers guide</a> for per-server format recommendations.</li>
</ul>
<p><strong>Do you have an NVIDIA H100 or similar data center GPU?</strong></p>
<ul>
<li>FP8 native inference is the right choice. DeepSeek V3 ships with FP8 kernels; Llama 3.1 405B has an official FP8 release from Meta. Quality is near-lossless versus BF16 at half the memory requirement.</li>
</ul>
<p><strong>Are you running on Apple Silicon?</strong></p>
<ul>
<li>GGUF via llama.cpp or MLX via the MLX-LM library. For models that fit at Q4_K_M or higher, the quality and speed are both excellent. See the <a href="/leaderboards/home-gpu-llm-leaderboard/">home GPU LLM leaderboard</a> for Apple Silicon speeds.</li>
</ul>
<hr>
<h2 id="caveats-and-limitations">Caveats and Limitations</h2>
<p><strong>Per-task variance is real.</strong> MMLU and perplexity averages mask task-specific behavior. A model that loses 1.5 MMLU points globally at Q4_K_M may lose 3 points on graduate-level science questions and 0 points on reading comprehension. If you have a specific high-stakes task, benchmark the specific task at the quantization level you plan to deploy, not just MMLU.</p>
<p><strong>Calibration data matters for I-quants and GPTQ.</strong> The improved quality of I-quant GGUF variants (IQ3_M, IQ4_NL) and GPTQ models comes from calibration data used during quantization. A GPTQ model calibrated on code data will preserve code performance better at a given bit depth than a GPTQ model calibrated on generic web text. The model name alone does not tell you the calibration set. Check the quantization author's notes. The bartowski and unsloth model releases generally document this.</p>
<p><strong>Context length affects quantization impact.</strong> Long-context tasks are more sensitive to quantization than short-context tasks. At 32K+ context, quantization errors in attention layers compound over the long KV cache. The perplexity numbers in the tables above are measured at short contexts. For long-context deployments, add approximately 0.5-1.0 extra perplexity delta to Q3_K_M and 0.2-0.4 to Q4_K_M to account for this.</p>
<p><strong>Model families matter more than format.</strong> A Qwen 2.5 32B at Q3_K_M likely outperforms a Llama 3.1 8B at BF16 on MMLU. The scale advantage overwhelms the quantization penalty at any reasonable comparison. When choosing between &quot;smaller model at high precision&quot; and &quot;larger model at lower precision,&quot; bias toward larger models at moderate quantization unless you have specific VRAM or latency constraints.</p>
<p><strong>This leaderboard covers published formats as of April 2026.</strong> FP8 inference, MXFP4, and hardware-native quantization formats are evolving rapidly. The GGUF I-quant variants are improving with each llama.cpp release. Check the <a href="https://github.com/ggml-org/llama.cpp">llama.cpp quantization documentation</a> for the latest formats before committing to a deployment configuration.</p>
<hr>
<h2 id="cross-references">Cross-References</h2>
<ul>
<li>For model quality rankings at full precision, see the <a href="/leaderboards/open-source-llm-leaderboard/">open-source LLM leaderboard</a> and <a href="/leaderboards/overall-llm-rankings-apr-2026/">overall LLM rankings</a></li>
<li>For hardware-specific speed benchmarks by model, see the <a href="/leaderboards/home-gpu-llm-leaderboard/">home GPU LLM leaderboard</a></li>
<li>For on-device and edge inference (sub-7B models, phones, Raspberry Pi), see the <a href="/leaderboards/edge-mobile-llm-leaderboard/">edge and mobile LLM leaderboard</a></li>
<li>For small model quality rankings specifically, see the <a href="/leaderboards/small-language-model-leaderboard/">small language model leaderboard</a></li>
<li>For inference server tooling and deployment format compatibility, see the <a href="/tools/best-open-source-llm-inference-servers-2026/">best open-source LLM inference servers</a></li>
</ul>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://arxiv.org/abs/2210.17323">GPTQ: Accurate Post-Training Quantization (arXiv:2210.17323)</a></li>
<li><a href="https://arxiv.org/abs/2306.00978">AWQ: Activation-Aware Weight Quantization (arXiv:2306.00978)</a></li>
<li><a href="https://arxiv.org/abs/2208.07339">LLM.int8(): 8-bit Matrix Multiplication for Transformers (arXiv:2208.07339)</a></li>
<li><a href="https://arxiv.org/abs/2307.13304">QuIP: Even Better LLM Quantization with Incoherence Processing (arXiv:2307.13304)</a></li>
<li><a href="https://arxiv.org/abs/2402.12662">AQLM / GPTQ Quantization Methods Survey (arXiv:2402.12662)</a></li>
<li><a href="https://github.com/ggml-org/llama.cpp">llama.cpp GitHub - GGUF K-quant documentation</a></li>
<li><a href="https://github.com/ggml-org/llama.cpp/discussions/4167">llama.cpp Apple Silicon Benchmarks - Discussion #4167</a></li>
<li><a href="https://github.com/ggml-org/llama.cpp/discussions/15013">llama.cpp CUDA GPU Benchmarks - Discussion #15013</a></li>
<li><a href="https://github.com/ggml-org/llama.cpp/discussions/2094">llama.cpp Quantization Quality Thread - Discussion #2094</a></li>
<li><a href="https://huggingface.co/docs/transformers/quantization/gguf">Hugging Face Transformers - GGUF Quantization</a></li>
<li><a href="https://huggingface.co/docs/transformers/quantization/awq">Hugging Face Transformers - AWQ Quantization</a></li>
<li><a href="https://huggingface.co/docs/transformers/quantization/gptq">Hugging Face Transformers - GPTQ Quantization</a></li>
<li><a href="https://huggingface.co/docs/transformers/quantization/bitsandbytes">Hugging Face Transformers - BitsAndBytes</a></li>
<li><a href="https://huggingface.co/blog/4bit-transformers-bitsandbytes">Hugging Face Blog - 4-bit Quantization with BitsAndBytes</a></li>
<li><a href="https://github.com/AutoGPTQ/AutoGPTQ">AutoGPTQ - GitHub</a></li>
<li><a href="https://github.com/casper-hansen/AutoAWQ">AutoAWQ - GitHub</a></li>
<li><a href="https://github.com/bitsandbytes-foundation/bitsandbytes">BitsAndBytes Foundation - GitHub</a></li>
<li><a href="https://huggingface.co/bartowski">bartowski GGUF quantization releases - Hugging Face</a></li>
<li><a href="https://huggingface.co/unsloth">Unsloth GGUF quantization releases - Hugging Face</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/llm-quantization-impact-leaderboard_hu_487d5d0ac632a0ac.jpg" medium="image" width="1200" height="630"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/llm-quantization-impact-leaderboard_hu_487d5d0ac632a0ac.jpg" width="1200" height="630"/></item></channel></rss>