llm-d Joins CNCF - Kubernetes Gets a Native LLM Inference Stack

IBM Research, Red Hat, and Google Cloud donated llm-d to the CNCF at KubeCon EU, giving Kubernetes a production-grade distributed LLM inference framework built on vLLM.

You have a DeepSeek-R1 deployment. It's 671 billion parameters, spread across eight H100 nodes. Prompt traffic arrives in bursts. When a long prompt lands, your decode latency spikes - because the same pods that spent 800ms grinding through the prompt now have to start generating tokens while the KV cache is still hot. You're either under-provisioning the prefill phase or wasting expensive GPU time on the decode side. Kubernetes doesn't understand the difference.

That's the class of problem llm-d was built to solve. At KubeCon + CloudNativeCon Europe in Amsterdam on March 24, IBM Research, Red Hat, and Google Cloud donated llm-d to the Cloud Native Computing Foundation as a sandbox project. Joining them as founding collaborators: NVIDIA, AMD, CoreWeave, Cisco, Hugging Face, Intel, Lambda, and Mistral AI.

TL;DR

  • IBM, Red Hat, Google Cloud donate llm-d to CNCF; Apache 2.0 licensed
  • Kubernetes-native distributed inference built on vLLM with disaggregated prefill and decode phases
  • v0.5 adds hierarchical KV offloading (GPU → CPU → SSD → S3), scale-to-zero, and active-active HA
  • Benchmarks: 50,000 output tokens/sec on 256 B200 GPUs; 40% lower per-output-token latency for DeepSeek V3 on H200
  • Supports NVIDIA A100+, AMD MI250+, Google TPU v5e+, Intel GPU Max; requires Kubernetes 1.29+

Why Kubernetes Inference Is Still Broken

82% of organizations have adopted Kubernetes for AI workloads. Only 7% deploy those workloads daily, according to CNCF's own data from KubeCon. The gap is real and well-known: Kubernetes is a general-purpose orchestrator that wasn't designed to reason about KV caches, GPU memory topology, or the statistical properties of LLM token generation.

The result is engineers bolting on workarounds: custom sidecar proxies, ad-hoc Redis caches for KV state, homegrown schedulers that almost-but-don't-quite understand model parallelism. Every team invents the same wheel, slightly differently.

llm-d is an attempt to solve this once, in the open, at the infrastructure layer.

"AI is being developed by data scientists...but eventually it's going to be a CIO's problem. And what language do CIOs speak these days? They speak Kubernetes," said Brian Stevens, Red Hat SVP and AI CTO, at KubeCon EU.

The Stack

llm-d sits between vLLM (the inference engine) and Kubernetes (the orchestrator). It doesn't replace either. It makes them work together in ways that neither does out of the box.

The architecture has three main components: a distributed inference scheduler built on Envoy, a KV cache management layer with hierarchical tiering, and primitives for prefill/decode disaggregation using NVIDIA's NIXL library for inter-pod KV transfer.

Prefill/Decode Disaggregation

This is the core insight. LLM inference has two computationally distinct phases:

  • Prefill processes the full input prompt in parallel. It's compute-bound and loves high FLOP/s.
  • Decode generates tokens one at a time. It's memory-bandwidth-bound and cares more about HBM speed than raw compute.

Running both phases on the same pod means you're always making the wrong tradeoff. llm-d disaggregates them: separate pods handle prefill and decode, independently autoscaled. A prompt arrives, gets routed to a prefill pod, and the resulting KV cache is transferred to a decode pod via NIXL - which abstracts over CPU memory, GPU HBM, and storage backends using libfabric, UCCL, and GPUDirect Storage.

The v0.5 release also proposes bidirectional KV transfer between prefill and decode nodes, which would eliminate extra recomputation on multi-turn conversations.
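The disaggregated flow can be sketched in a few lines of Python. Everything here is illustrative - the class and function names are not llm-d's API, and the KV "transfer" is a plain dict hand-off standing in for the NIXL hop:

```python
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    role: str          # "prefill" or "decode"
    queue_depth: int

def run_prefill(pod: Pod, prompt: str) -> dict:
    # Compute-bound phase: process the whole prompt in parallel,
    # producing KV cache blocks (stubbed as a word count here).
    return {"kv_blocks": len(prompt.split()), "source": pod.name}

def transfer_kv(kv: dict, dest: Pod) -> dict:
    # In llm-d this hop rides NIXL (GPUDirect Storage, UCCL, ...);
    # here it's just a hand-off that records the new owner.
    return {**kv, "resident_on": dest.name}

def serve(prompt: str, prefill_pods, decode_pods) -> dict:
    # Each phase is routed and scaled independently:
    # pick the least-loaded pod of the right role for each.
    p = min(prefill_pods, key=lambda x: x.queue_depth)
    d = min(decode_pods, key=lambda x: x.queue_depth)
    return transfer_kv(run_prefill(p, prompt), d)

result = serve(
    "Explain KV caches in one sentence.",
    [Pod("prefill-0", "prefill", 2), Pod("prefill-1", "prefill", 0)],
    [Pod("decode-0", "decode", 5), Pod("decode-1", "decode", 1)],
)
# result["source"] == "prefill-1", result["resident_on"] == "decode-1"
```

The point of the separation is visible even in the toy version: the prefill pool and the decode pool can be sized and autoscaled against entirely different load signals.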

The Inference Scheduler

The scheduler extends Envoy with LLM-specific routing logic. Standard load balancing treats every request as equivalent; that's wrong for LLMs. A request that matches a prefix already in cache should go to the pod holding that cache, not the pod with the lowest queue depth.

llm-d's scheduler does prefix-cache-aware routing - it hashes request prefixes and directs traffic to pods most likely to score a cache hit. It also supports load-aware balancing, per-user fairness, SLA-aware prioritization, and (new in v0.5) cache-aware LoRA routing, which routes LoRA-adapted requests to pods that already hold the adapter weights.
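A minimal sketch of that routing decision, assuming a toy whitespace tokenizer and an in-memory index (llm-d's actual implementation hashes at KV-block granularity inside the scheduler, not words in a dict):

```python
import hashlib

def prefix_key(prompt: str, block_tokens: int = 16) -> str:
    # Hash the first block of "tokens"; whitespace split is a
    # stand-in for a real tokenizer.
    block = " ".join(prompt.split()[:block_tokens])
    return hashlib.sha256(block.encode()).hexdigest()

def route(prompt: str, pods: dict, cache_index: dict) -> str:
    """pods: name -> queue depth; cache_index: prefix hash -> pod name."""
    key = prefix_key(prompt)
    owner = cache_index.get(key)
    if owner in pods:
        return owner                    # likely cache hit: reuse warm KV
    target = min(pods, key=pods.get)    # miss: fall back to least-loaded
    cache_index[key] = target           # remember where this prefix lives
    return target

pods = {"pod-a": 3, "pod-b": 1}
index = {}
first = route("You are a helpful assistant. Summarize:", pods, index)
second = route("You are a helpful assistant. Summarize:", pods, index)
# Both requests land on the same pod, even if its queue later grows.
```

The trade against plain least-loaded balancing is deliberate: a slightly deeper queue on a warm pod usually beats recomputing the prefill on a cold one.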

KV Cache Management

KV cache tiering: GPU HBM → CPU DRAM → local SSD → remote object storage (S3 and compatible). v0.5 adds hierarchical offloading with a custom filesystem backend that improves cross-pod cache hit rates. The goal is trading memory cost for latency without losing cache effectiveness.
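The tiering logic reduces to a lookup that walks tiers fastest-first, plus an offload step that demotes blocks down a level instead of evicting them. The tier names mirror the hierarchy above; the rest is an illustrative sketch, not llm-d's internals:

```python
# Tier order mirrors the hierarchy in the text: fastest first.
TIERS = ["gpu-hbm", "cpu-dram", "local-ssd", "s3"]

def lookup(block_id: str, tiers: dict) -> tuple:
    """Return (tier, hit) for the fastest tier holding the block."""
    for tier in TIERS:
        if block_id in tiers.get(tier, set()):
            return tier, True
    return None, False

def offload(block_id: str, tiers: dict, hbm_capacity: int) -> None:
    # When HBM is full, demote a resident block one level down
    # rather than evicting it outright: that is the memory-cost-
    # for-latency trade the text describes.
    hbm = tiers.setdefault("gpu-hbm", set())
    if len(hbm) >= hbm_capacity:
        victim = next(iter(hbm))
        hbm.discard(victim)
        tiers.setdefault("cpu-dram", set()).add(victim)
    hbm.add(block_id)

tiers = {"gpu-hbm": {"b1"}, "cpu-dram": set()}
offload("b2", tiers, hbm_capacity=1)   # b1 demoted to CPU DRAM, b2 takes HBM
```

A demoted block is slower to serve but still a hit - which is exactly why cross-pod hit rates can improve even as individual blocks move off the GPU.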

[Diagram: llm-d routes prefill and decode phases to separate pods and transfers KV cache state between them using NVIDIA's NIXL library. Source: thenewstack.io]

Multi-Node and MoE Support

For large MoE models like DeepSeek V3, llm-d uses LeaderWorkerSet (LWS) primitives for multi-node replica management and wide expert parallelism. Cross-node expert parallelism is supported via NVIDIA Multi-Node NVLink on Grace Blackwell systems.

Rolling out llm-d

The project uses Helm for deployment. A minimal install against an existing Kubernetes cluster with GPU nodes:

# Add the llm-d Helm repo
helm repo add llm-d https://llm-d.github.io/llm-d
helm repo update

# Install with P/D disaggregation enabled
helm install llm-d llm-d/llm-d \
  --set inferenceScheduler.enabled=true \
  --set pdDisaggregation.enabled=true \
  --set kvCache.tiering.enabled=true \
  --set autoscaler.scaleToZero=true \
  --namespace llm-serving \
  --create-namespace

An InferencePool resource then defines the model and hardware target:

apiVersion: inference.llm-d.io/v1alpha1
kind: InferencePool
metadata:
  name: deepseek-v3-pool
spec:
  model: deepseek-ai/deepseek-v3
  replicas:
    prefill: 4
    decode: 8
  hardware:
    accelerator: nvidia-h100
  kvCache:
    tier: gpu-cpu-ssd

What's in v0.5

Released in February 2026, v0.5 is the current release as of the CNCF donation. Key additions:

  • Hierarchical KV offloading: tiered cache (GPU → CPU → SSD → S3)
  • Cache-aware LoRA routing: routes LoRA requests to pods already holding the adapter weights
  • Active-active HA: dual-primary availability without failover lag
  • Scale-to-zero autoscaling: idle pods release GPU resources entirely
  • UCCL transport: more reliable cross-pod KV cache transfer
  • Bidirectional P/D transfer: would remove KV recomputation on multi-turn conversations (proposed as an RFC)

Performance Numbers

CNCF's benchmarks, run on publicly reproducible workflows added in v0.5:

  • Qwen3-32B, 8 pods, 16 H100s: near-zero latency up to ~120,000 tokens/second under load
  • DeepSeek V3 on H200 (v0.4): 40% reduction in per-output-token latency vs. baseline Kubernetes
  • 256 B200 GPUs (16×16 P/D topology): up to 50,000 output tokens/second
  • ~3,100 tokens/second per B200 in wide expert-parallelism configurations

[Photo: KubeCon + CloudNativeCon EU 2026 drew 13,350 attendees to Amsterdam; inference workloads were the dominant theme. Source: cncf.io]

Platform Requirements

  • Kubernetes: 1.29+
  • NVIDIA GPUs: A100 or newer
  • AMD GPUs: MI250 or newer
  • Intel GPUs: GPU Max series
  • Google TPU: v5e or newer
  • vLLM: 0.6.0+
  • Model types: PyTorch and JAX generative models, 1B+ parameters

NVIDIA's Broader Kubernetes Push

llm-d wasn't the only CNCF donation at KubeCon. NVIDIA donated its GPU DRA (Dynamic Resource Allocation) Driver to the Kubernetes project, enabling fine-grained GPU resource requests - fractional usage, Multi-Instance GPU partitioning, and multi-node NVLink for Grace Blackwell systems. Previously vendor-managed, the DRA Driver is now under CNCF community governance.

NVIDIA also had KAI Scheduler accepted as a CNCF Sandbox project - a high-performance AI workload scheduler for GPU clusters - and announced Grove, an open-source Kubernetes API for orchestrating AI workloads, which integrates with the llm-d stack.

The practical effect: GPU resource management in Kubernetes is converging around a set of community-owned primitives, rather than each cloud vendor shipping their own scheduler.

Where It Falls Short

The project is 11 months old and honest about what it hasn't solved yet.

Documentation is thin outside the main architecture overview. The Helm chart defaults require non-trivial tuning for production deployments - the scale-to-zero feature in particular has a cold-start latency that makes it unsuitable for SLA-bound serving without warming strategies.

The NIXL-based KV cache transfer between P and D pods currently requires NVIDIA hardware on both sides for optimal performance. AMD and Intel paths use slower CPU-mediated transfers, which narrows the performance advantage of disaggregation on heterogeneous clusters.

Bidirectional P/D KV transfer - the feature that would eliminate multi-turn recomputation - is still an RFC at v0.5, not shipping code. It's the most important capability for conversational workloads and it isn't there yet.

The inference scheduler's prefix-cache routing also assumes reasonably predictable prefix distributions. For highly varied prompt traffic (common in API-as-a-service deployments), cache hit rates drop and the routing overhead starts to eat the latency gains.


None of that makes llm-d a bad project. It makes it a realistic one. The CNCF sandbox status brings in community contributors who'll close these gaps faster than IBM Research and Red Hat could alone. The vLLM community is already a collaborator - that's the most active inference engine project in open source and it's building toward llm-d's architecture natively.

The 7% daily deployment figure is what llm-d is actually trying to fix. The organizations in the gap between 82% adoption and 7% daily deployment don't have a model problem; they have an infrastructure problem. llm-d is betting that CNCF governance, a Helm install, and a vLLM backend are enough to close it.

About the author: Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.