Google's TurboQuant Cuts LLM Memory 6x With Zero Loss

Google Research's TurboQuant compresses LLM key-value cache by 6x and delivers 8x speedup on H100 GPUs with zero accuracy loss - no fine-tuning required.

Google Research published TurboQuant, a compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup on H100 GPUs. The claim that matters: zero accuracy loss. No fine-tuning. No training data required. Drop it into any transformer and the KV cache shrinks to a fraction of its original size with no measured drop in output quality.

If this holds in production, it changes the economics of long-context inference overnight.

TL;DR

  • 6x KV cache memory reduction, 8x speedup on H100 for attention logits (4-bit vs 32-bit)
  • Zero accuracy loss on LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, L-Eval
  • No fine-tuning or training data required - data-oblivious, works on any transformer
  • Two-stage algorithm: PolarQuant (coordinate transform) + QJL (1-bit error correction)
  • Tested on Gemma, Mistral, Llama-3.1-8B-Instruct
  • Presented at ICLR 2026

How It Works

The KV cache is the memory bottleneck for long-context LLM inference. Every token the model has seen gets stored as key-value pairs in memory, and attention computes over all of them for every new token generated. For a model serving a 1M-token context window, the KV cache can consume more memory than the model weights themselves.
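To make the bottleneck concrete, here's a back-of-envelope sketch (my own, not from the paper) of KV cache size at a 1M-token context. The layer, head, and dimension counts below are a Llama-3.1-8B-like configuration assumed for illustration:

```python
# Back-of-envelope KV cache size: 2 tensors (K and V) per layer, stored for
# every token. Config values are illustrative assumptions, not from the paper.
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

# 32 layers, 8 KV heads (grouped-query attention), head_dim 128, fp16:
full = kv_cache_bytes(32, 8, 128, 1_000_000)   # ~131 GB at 1M tokens
compressed = full / 6                          # ~22 GB with a 6x reduction
```

At fp16 the cache alone outweighs the ~16 GB of 8B-model weights many times over, which is why compressing it, not the weights, is the lever for long-context serving.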

TurboQuant compresses this cache through two sequential stages.

Stage 1: PolarQuant

Standard quantization struggles with KV cache because the value distributions across channels vary wildly - some channels have large values, others are near zero. Normalizing per-channel is expensive and adds overhead.

PolarQuant sidesteps this by converting from Cartesian to polar coordinates. It first randomly rotates the data vectors (a computationally cheap operation), then represents each vector as a radius (magnitude) and a set of angles (direction). The key insight: after rotation, the angles follow a known, concentrated distribution that maps cleanly to a fixed circular grid. No per-channel normalization needed. No calibration data required.

This stage alone hits 3-bit quantization without training - competitive with methods that require expensive calibration passes over datasets.
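As a toy illustration of the idea (my own NumPy sketch, not the paper's algorithm: this version pairs coordinates two at a time, while the paper works with full high-dimensional directions, and the grid size here is an assumption):

```python
import numpy as np

# Toy PolarQuant-style sketch: randomly rotate, then store each coordinate
# pair as an exact radius plus an angle snapped to a fixed circular grid.
rng = np.random.default_rng(0)

def random_rotation(d):
    # QR factorization of a Gaussian matrix yields a random orthogonal matrix
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def polar_quantize(x, rot, angle_bits=3):
    y = rot @ x                                # cheap randomized rotation
    pairs = y.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)          # radii kept in full precision
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    grid = 2 * np.pi / (1 << angle_bits)
    theta_q = np.round(theta / grid) * grid    # snap angles to circular grid
    return r, theta_q

def polar_dequantize(r, theta_q, rot):
    pairs = np.stack([r * np.cos(theta_q), r * np.sin(theta_q)], axis=1)
    return rot.T @ pairs.reshape(-1)           # undo the rotation

d = 8
rot = random_rotation(d)
x = rng.standard_normal(d)
x_hat = polar_dequantize(*polar_quantize(x, rot), rot)
```

Because only angles are quantized and radii are kept exactly, the reconstruction preserves the vector's norm exactly, and the error is bounded by the angular grid spacing, with no per-channel statistics or calibration data involved.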

Stage 2: QJL Error Correction

The residual error left by PolarQuant is reduced to single sign bits (+1 or -1) using a quantized Johnson-Lindenstrauss (JL) transform - a random projection that approximately preserves distances and inner products between high-dimensional vectors when they are mapped to lower dimensions.

The QJL stage adds what the authors describe as "zero memory overhead" because it stores only sign bits. A specialized estimator balances high-precision queries against low-precision stored data to maintain accurate attention scores.
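A minimal sketch of the sign-bit idea, assuming a Gaussian projection and the standard one-bit estimator identity E[sign(s·k)(s·q)] = sqrt(2/pi)·(q·k)/||k||. The dimensions and the exact estimator form here are my assumptions for illustration, not taken from the paper:

```python
import numpy as np

# Toy QJL-style asymmetric estimator: keys are stored as 1-bit signs of a
# Gaussian projection (plus one norm scalar); queries stay full precision.
rng = np.random.default_rng(1)

d, m = 64, 4096                  # m projections -> m stored bits per key
S = rng.standard_normal((m, d))  # shared random Gaussian projection

def encode_key(k):
    # m sign bits + a single full-precision norm per key
    return np.sign(S @ k), np.linalg.norm(k)

def estimate_dot(q, key_bits, key_norm):
    # sqrt(pi/2) cancels the sqrt(2/pi) bias of the sign estimator
    return key_norm * np.sqrt(np.pi / 2) * (S @ q) @ key_bits / m

q = rng.standard_normal(d)
k = rng.standard_normal(d)
bits, norm = encode_key(k)
approx = estimate_dot(q, bits, norm)   # estimated attention logit q.k
exact = q @ k
```

The asymmetry is the point: the stored side is crushed to one bit per projection, while the query side keeps full precision, so the inner-product (attention logit) estimate stays accurate on average.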

The combination: PolarQuant compresses aggressively, QJL corrects the errors at negligible cost.

The Numbers

Metric                         TurboQuant          Baseline (FP32)
KV cache memory                6x smaller          1x
Attention logit speed (H100)   8x faster (4-bit)   1x (32-bit)
Accuracy loss                  Zero                N/A
Fine-tuning required           No                  N/A
Calibration data required      No                  N/A

Benchmark Results

Tested on Gemma, Mistral, and Llama-3.1-8B-Instruct across five long-context benchmarks:

Benchmark              Focus                                   Result
LongBench              Long-document QA                        No degradation
Needle in a Haystack   Information retrieval in long context   No degradation
ZeroSCROLLS            Long-context understanding              No degradation
RULER                  Multi-hop reasoning over context        No degradation
L-Eval                 Long-form evaluation                    No degradation

TurboQuant outperformed KIVI and RaBitQ (existing quantization methods) across all benchmarks. On the GloVe vector-search dataset (d=200), TurboQuant achieved higher 1@k recall than all baselines.

What It Does Not Tell You

Only Tested on 8B Models

The published results cover Gemma, Mistral, and Llama-3.1-8B-Instruct. These are small models. Whether TurboQuant's "zero accuracy loss" claim holds on 70B+ models, mixture-of-experts architectures, or models with 1M token context windows remains undemonstrated.

"8x Speedup" Has Context

The 8x speedup specifically measures attention logit computation (4-bit TurboQuant versus 32-bit unquantized keys on an H100). This is not an 8x end-to-end inference speedup. Attention is a significant but not the sole bottleneck - the actual wall-clock improvement for full inference will be lower.

Production Integration Unknown

The paper demonstrates the algorithm. It doesn't ship an inference framework, a CUDA kernel, or integration with vLLM, TensorRT-LLM, or SGLang. The gap between "this algorithm works in a research setting" and "you can pip install this and see 6x memory savings" is measured in engineering months.

Google Published It But Has Not Deployed It

The blog post is from Google Research. It doesn't say TurboQuant is running in Gemini, Google Search, or any production system. Research papers from Google frequently describe techniques that never ship. Whether TurboQuant is headed for production or remains an academic contribution isn't stated.

Why It Matters

KV cache compression is the most impactful efficiency lever for long-context LLM serving. A 6x reduction means:

  • A model that needed 8 H100s for 1M-token context could potentially serve the same context on 2 H100s
  • Inference providers can serve 6x more concurrent long-context requests on the same hardware
  • The cost of long-context inference - currently prohibitive for many applications - drops proportionally

If TurboQuant's claims survive independent reproduction, it makes long-context inference economically viable at scales that are currently only affordable to frontier labs. The "no fine-tuning required" property is critical: existing quantization methods that require calibration runs on representative data are impractical for general-purpose serving across varied workloads.


Google Research published a KV cache compression algorithm that claims 6x memory reduction and 8x speedup with zero accuracy loss, no fine-tuning, and no calibration data. The math is elegant - polar coordinates plus Johnson-Lindenstrauss error correction. The benchmarks are clean across five long-context tasks. The limitations are real: only tested on 8B models, the speedup is for attention logits not end-to-end inference, and no production integration exists. But if TurboQuant works as published at larger scales, it's the most significant efficiency result for long-context LLM serving in 2026. The difference between "KV cache is the bottleneck" and "KV cache is 6x smaller" is the difference between long-context being a luxury and long-context being the default.

About the author: Elena is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.