NVIDIA Dynamo Snapshot Slashes Kubernetes AI Cold Starts

Kubernetes is good at scaling web servers. It's not good at scaling 70-billion-parameter language models. When traffic spikes and the autoscaler spins up a new inference pod, that pod has to load a model from scratch - pulling weights from storage, initializing CUDA contexts, allocating the KV cache. On a large model, that process takes minutes. During those minutes, GPUs sit idle and SLA timers run.

NVIDIA's answer is Dynamo Snapshot, a checkpoint-and-restore system that freezes a fully initialized inference container into a snapshot, then restores it on any available node in seconds. The experimental release, published May 27, 2026, supports single-GPU vLLM and SGLang workloads.

TL;DR

CRIU captures Linux process state; cuda-checkpoint dumps GPU device memory - restoring both puts a container back exactly where it left off
KV cache unmapping cuts checkpoint size from ~190 GiB to ~6 GiB for Qwen3-0.6B
gpt-oss-120b startup drops from several minutes to about 15 seconds on the standard (non-GMS) checkpoint path, and to under 5 seconds with the GPU Memory Service and local NVMe SSDs
Current limit: single-GPU, vLLM and SGLang only - multi-GPU and TensorRT-LLM support are on the roadmap

The Cold-Start Problem

Figure: NVIDIA Dynamo Snapshot architecture overview. Dynamo Snapshot combines CRIU and cuda-checkpoint to restore fully initialized inference containers in seconds. Source: developer.nvidia.com

Kubernetes autoscaling was designed around stateless services. An HTTP worker process starts in milliseconds and handles requests right away. Inference containers don't work that way.

Spinning up a vLLM server for a 120-billion-parameter model involves transferring tens or hundreds of gigabytes from storage to GPU memory, compiling CUDA kernels, and running warmup steps. By the time a new pod is ready, the traffic spike that triggered scaling may already have passed - or the cluster may have violated SLA thresholds on the existing pods trying to absorb the load.

The community has tried several partial fixes: keeping "warm" pods on standby (expensive), reducing model sizes (accuracy tradeoff), pre-provisioning GPU nodes (slow). Dynamo Snapshot takes a different approach: treat the initialized state of a running container as a reusable artifact.

Why freezing a GPU process is hard

Checkpointing a standard Linux process is solved. CRIU has been doing it since 2012. GPU processes add a second layer of complexity: CUDA contexts, device memory allocations, and streams all live on the GPU and aren't represented in the Linux process image. Any usable snapshot needs to capture both sides of that boundary simultaneously.

NVIDIA's cuda-checkpoint tool, released as part of the Dynamo project, handles the GPU side. It dumps the full CUDA device state - contexts, memory mappings, streams - into CPU memory before CRIU serializes the host-side process. At restore time, the sequence runs in reverse.
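The ordering described above can be sketched as a small command planner. This is an illustration only: it assumes the standalone `cuda-checkpoint --toggle --pid` invocation and plain `criu dump` / `criu restore` calls, while the snapshot-agent's real integration may look quite different. The PID and image directory are hypothetical, and the commands are only printed, never executed.

```python
from typing import List


def checkpoint_plan(pid: int, images_dir: str) -> List[List[str]]:
    """Commands for a checkpoint: GPU state first, then the host process."""
    return [
        # Move CUDA state off the device and into host memory (assumed CLI usage).
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # CRIU then serializes the process tree, which now carries the GPU state.
        ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir],
    ]


def restore_plan(pid: int, images_dir: str) -> List[List[str]]:
    """Commands for a restore: the same two steps in reverse order."""
    return [
        # Rebuild the host process first; CRIU restores it under its original PID.
        ["criu", "restore", "--images-dir", images_dir, "--restore-detached"],
        # Then push the saved CUDA state back onto the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


# Hypothetical PID and snapshot directory, for illustration only.
for cmd in checkpoint_plan(4242, "/snapshots/demo") + restore_plan(4242, "/snapshots/demo"):
    print(" ".join(cmd))
```

Keeping the plan as data makes the dependency visible: the GPU step has to come first on checkpoint and last on restore, because CRIU only sees what lives in host memory.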

Inside Dynamo Snapshot

CRIU handles the host side

CRIU (Checkpoint/Restore in Userspace) works by asking the Linux kernel to freeze a process tree, then dumping every relevant piece of kernel bookkeeping: memory maps, file descriptors, namespaces, threads. The result is a directory of files that, on restore, are fed back to the kernel to reconstruct the original process.

For inference containers, CRIU's main challenge is size. A vLLM process with a large model loaded can have hundreds of gigabytes of private memory. Dynamo Snapshot's first optimization addresses this directly.

KV cache unmapping cuts size dramatically

Before checkpointing, Dynamo's quiesce hook uses CUDA Virtual Memory Management APIs to unmap the KV cache buffer. Since no requests have been served yet from this fresh pod, the cache holds nothing useful. Unmapping it means CRIU never includes it in the snapshot.

The difference is large. Qwen3-0.6B's checkpoint drops from roughly 190 GiB to 6.2 GiB after unmapping. That is roughly a 30x reduction in checkpoint size, and checkpoint size directly drives how long restore takes.
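To make the size argument concrete, here is a back-of-the-envelope calculation using the two checkpoint sizes quoted above. The storage bandwidth is a made-up placeholder, not a measured figure; substitute whatever your snapshot volume actually delivers.

```python
def reduction_factor(before_gib: float, after_gib: float) -> float:
    """How many times smaller the checkpoint is after unmapping."""
    return before_gib / after_gib


def read_seconds(size_gib: float, bandwidth_gib_per_s: float) -> float:
    """Lower bound on restore time if reading the image is the only cost."""
    return size_gib / bandwidth_gib_per_s


BEFORE_GIB = 190.0  # with the KV cache included (from the article)
AFTER_GIB = 6.2     # with the KV cache unmapped (from the article)
ASSUMED_BANDWIDTH = 3.0  # GiB/s, hypothetical storage throughput

factor = reduction_factor(BEFORE_GIB, AFTER_GIB)
print(f"reduction: {factor:.1f}x")
print(f"read time before: {read_seconds(BEFORE_GIB, ASSUMED_BANDWIDTH):.1f} s")
print(f"read time after:  {read_seconds(AFTER_GIB, ASSUMED_BANDWIDTH):.1f} s")
```

At any fixed bandwidth the read time shrinks by the same factor as the checkpoint, which is why trimming the image matters more than any single I/O trick.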

cuda-checkpoint freezes GPU state

With the process quiesced, cuda-checkpoint serializes the GPU side. It walks the CUDA context, finds all active device allocations, and copies them to host memory in a format that CRIU can include in its dump. At restore, the process reconstructs those allocations and remaps them into the right CUDA virtual addresses.

The CUDA context itself - kernels, streams, synchronization primitives - is reconstructed from the serialized state. From the inference framework's perspective, the process wakes up exactly where it stopped.

Async I/O speeds up memory restore

A further optimization targets the restore path itself. Standard CRIU restores private memory pages using sequential preadv calls. Dynamo Snapshot replaces this with Linux native AIO, maintaining up to 128 concurrent read requests. On storage-bottlenecked restores, this saturates available bandwidth rather than leaving I/O largely idle.
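The idea of keeping many reads in flight can be illustrated in a few lines. Note that this sketch swaps in a thread pool issuing `os.pread` calls rather than Linux native AIO, so it shows the access pattern and not Dynamo Snapshot's actual implementation. The chunk size and the temporary file are arbitrary choices for the demo.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 64 * 1024    # arbitrary chunk size for the demo
MAX_IN_FLIGHT = 128  # mirrors the queue depth mentioned in the text


def read_sequential(fd: int, size: int) -> bytes:
    """Baseline: one positional read at a time."""
    parts = []
    for off in range(0, size, CHUNK):
        parts.append(os.pread(fd, min(CHUNK, size - off), off))
    return b"".join(parts)


def read_concurrent(fd: int, size: int, depth: int = MAX_IN_FLIGHT) -> bytes:
    """Issue up to `depth` positional reads at once and reassemble in order."""
    offsets = range(0, size, CHUNK)
    with ThreadPoolExecutor(max_workers=depth) as pool:
        # pool.map yields results in offset order even if reads finish out of order.
        parts = pool.map(lambda off: os.pread(fd, min(CHUNK, size - off), off), offsets)
        return b"".join(parts)


def demo() -> bool:
    """Write random data to a temp file and check both read paths agree."""
    payload = os.urandom(4 * 1024 * 1024 + 123)
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(payload)
        tmp.flush()
        fd = os.open(tmp.name, os.O_RDONLY)
        try:
            return read_sequential(fd, len(payload)) == payload == read_concurrent(fd, len(payload))
        finally:
            os.close(fd)


print(demo())
```

Positional reads are what make this safe: each request carries its own offset, so no shared file cursor has to be coordinated between workers. A production version would also have to handle short reads, which this sketch ignores.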

Shared memory buffers get similar treatment via parallel memfd restore, processing multiple buffers concurrently with a thread pool.

Performance Results

The benchmarks below come from NVIDIA's published technical blog. All tests used single-GPU workloads with the non-GMS checkpoint path unless noted.

Model	Checkpoint Size	Restore Time	vs. Upstream CRIU	vs. Cold Start
Qwen3-0.6B	6.2 GiB	2.4 s	2.8x faster	Much faster
Qwen3-8B	26 GiB	4.7 s	5.1x faster	Much faster
gpt-oss-120b	129 GiB	15 s	7.9x faster	21x faster (GMS)

The gpt-oss-120b row is the one that matters for production deployments. A 21x improvement over traditional cold starts - reached with the GPU Memory Service and striped local NVMe SSDs - brings a 120B model from several minutes to under 5 seconds.
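Two derived figures help when reading the table: the effective restore throughput (checkpoint size divided by restore time) and the upstream CRIU restore time implied by each speedup. The snippet below computes both from the published rows. It assumes "Nx faster" means a ratio of wall-clock times, and it treats the whole restore as I/O, which overstates the true storage throughput.

```python
ROWS = [
    # (model, checkpoint size in GiB, restore time in s, speedup vs. upstream CRIU)
    ("Qwen3-0.6B", 6.2, 2.4, 2.8),
    ("Qwen3-8B", 26.0, 4.7, 5.1),
    ("gpt-oss-120b", 129.0, 15.0, 7.9),
]


def derive(rows):
    """Return effective throughput and implied upstream CRIU time per model."""
    out = {}
    for model, size_gib, restore_s, speedup in rows:
        out[model] = {
            "throughput_gib_s": size_gib / restore_s,
            "upstream_criu_s": restore_s * speedup,
        }
    return out


for model, d in derive(ROWS).items():
    print(
        f"{model}: {d['throughput_gib_s']:.2f} GiB/s effective, "
        f"{d['upstream_criu_s']:.2f} s implied for upstream CRIU"
    )
```

Read this way, the larger checkpoints restore at a higher effective rate than the smallest one, which suggests fixed per-restore overhead dominates for small models.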

How to Deploy

The Dynamo Snapshot snapshot-agent runs as a privileged DaemonSet on each Kubernetes node. A minimal setup looks like this:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: snapshot-agent
  namespace: nvidia-dynamo
spec:
  selector:
    matchLabels:
      app: snapshot-agent
  template:
    metadata:
      labels:
        app: snapshot-agent  # must match spec.selector.matchLabels
    spec:
      hostPID: true
      hostIPC: true
      containers:
      - name: snapshot-agent
        image: nvcr.io/nvidia/ai-dynamo/snapshot-agent:latest
        securityContext:
          privileged: true
        volumeMounts:
        - name: criu-images
          mountPath: /snapshots
      volumes:
      - name: criu-images
        hostPath:
          path: /var/dynamo/snapshots

Inference workers signal readiness using annotations, and the snapshot-agent handles checkpoint/restore without requiring changes to the container runtime itself. The system uses runc-managed containers and the standard Kubernetes lifecycle hooks.

Requirements and Compatibility

Figure: Server racks in a data center. Dynamo Snapshot targets data center clusters running Kubernetes, not consumer or edge deployments. Source: pexels.com

Requirement	Detail
Inference framework	vLLM or SGLang (TensorRT-LLM: not yet)
GPU count	Single-GPU only (multi-GPU: roadmap)
Container runtime	runc (CRI-O supported, containerd in progress)
Kubernetes	Any cluster with DaemonSet and hostPID access
Storage (basic)	NFS or local disk for snapshots
Storage (GMS sub-5s)	Striped local NVMe SSDs + GPUDirect Storage
Privilege level	Snapshot-agent requires privileged DaemonSet
CUDA	cuda-checkpoint included in the Dynamo release

The requirement for privileged DaemonSets will raise flags in security-hardened clusters. This isn't an oversight - checkpoint/restore of GPU processes requires direct host PID namespace access. Clusters running strict pod security standards will need policy exceptions.

Where It Falls Short

The 21x speedup number applies specifically to gpt-oss-120b with GMS and local NVMe. Smaller models get more modest gains. Qwen3-0.6B restores in 2.4 seconds without GMS, which is fast, but a cold start on a small model in a well-provisioned cluster might already clear in 30-60 seconds - a narrower gap than the headline suggests.
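The "narrower gap" is easiest to see in absolute seconds rather than ratios. The sketch below compares the 2.4-second restore against the 30-60 second cold-start range quoted above. That range is the article's rough estimate, not a benchmark, so treat the output as illustrative.

```python
def compare(cold_start_s: float, restore_s: float) -> dict:
    """Absolute time saved and speedup ratio for one cold-start estimate."""
    return {
        "saved_s": cold_start_s - restore_s,
        "ratio": cold_start_s / restore_s,
    }


RESTORE_S = 2.4  # Qwen3-0.6B restore time without GMS (from the article)

for cold in (30.0, 60.0):  # rough cold-start estimates from the text
    r = compare(cold, RESTORE_S)
    print(f"cold start {cold:.0f} s: saves {r['saved_s']:.1f} s ({r['ratio']:.1f}x)")
```

The ratio still looks healthy for a small model, but the time actually saved per scale-up event is under a minute, compared with the minutes recovered on a 120B model.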

Single-GPU only is the bigger constraint for most production deployments. Models above 34B normally run across multiple GPUs. The Dynamo team has multi-GPU and multi-node support on its roadmap, but has given no timeline. Until those ship, Dynamo Snapshot doesn't help with the workloads that take longest to cold-start.

TensorRT-LLM support is also missing. Many teams running NVIDIA hardware use TensorRT-LLM specifically because of its throughput advantages on Blackwell. They won't benefit from this release in its current form.

The GMS path - needed for the most dramatic speedups - requires local NVMe SSDs with GPUDirect Storage. NFS works, but the results are substantially slower. Cloud Kubernetes clusters with networked block storage will see different numbers than NVIDIA's internal benchmarks.

This is still an experimental release. The API and snapshot format may change before it hits stable. Running this in production now means accepting migration work when it does.

Dynamo Snapshot joins a growing set of Kubernetes-native AI inference primitives - with llm-d's CNCF-donated inference stack and vLLM's elastic parallelism work - that are collectively making Kubernetes a more plausible substrate for inference workloads rather than a compromise. The cold-start problem isn't fully solved. But sub-5-second restoration of a 120B-parameter model is a different class of solution from anything available six months ago.

Code and documentation are available at github.com/ai-dynamo/dynamo.
