NVIDIA Dynamo Snapshot Slashes Kubernetes AI Cold Starts
NVIDIA's Dynamo Snapshot uses CRIU and cuda-checkpoint to freeze and restore GPU inference containers in seconds, cutting Kubernetes cold-start times by up to 21x for large models.

Kubernetes is good at scaling web servers. It's not good at scaling 70-billion-parameter language models. When traffic spikes and the autoscaler spins up a new inference pod, that pod has to load a model from scratch - pulling weights from storage, initializing CUDA contexts, allocating the KV cache, and running warmup passes. On a large model, that process takes minutes. During those minutes, GPUs sit idle and SLA timers run.
NVIDIA's answer is Dynamo Snapshot, a checkpoint-and-restore system that freezes a fully initialized inference container into a snapshot, then restores it on any available node in seconds. The experimental release, published May 27, 2026, supports single-GPU vLLM and SGLang workloads.
TL;DR
- CRIU captures Linux process state; cuda-checkpoint dumps GPU device memory - restoring both puts a container back exactly where it left off
- KV cache unmapping cuts checkpoint size from ~190 GiB to ~6 GiB for Qwen3-0.6B
- gpt-oss-120b startup drops from several minutes to about 15 seconds on the standard (non-GMS) CRIU path, and to under 5 seconds with the GPU Memory Service and local NVMe SSDs
- Current limit: single-GPU, vLLM and SGLang only - multi-GPU and TensorRT-LLM support are on the roadmap
The Cold-Start Problem
Figure: NVIDIA Dynamo Snapshot combines CRIU and cuda-checkpoint to restore fully initialized inference containers in seconds. (Source: developer.nvidia.com)
Kubernetes autoscaling was designed around stateless services. An HTTP worker process starts in milliseconds and handles requests right away. Inference containers don't work that way.
Spinning up a vLLM server for a 120-billion-parameter model involves transferring tens or hundreds of gigabytes from storage to GPU memory, compiling CUDA kernels, and running warmup steps. By the time a new pod is ready, the traffic spike that triggered scaling may already have passed - or the cluster may have violated SLA thresholds on the existing pods trying to absorb the load.
The community has tried several partial fixes: keeping "warm" pods on standby (expensive), reducing model sizes (accuracy tradeoff), pre-provisioning GPU nodes (slow). Dynamo Snapshot takes a different approach: treat the initialized state of a running container as a reusable artifact.
Why freezing a GPU process is hard
Checkpointing a standard Linux process is a solved problem. CRIU has been doing it since 2012. GPU processes add a second layer of complexity: CUDA contexts, device memory allocations, and streams all live on the GPU and aren't represented in the Linux process image. Any usable snapshot needs to capture both sides of that boundary in one consistent state.
NVIDIA's cuda-checkpoint tool, released as part of the Dynamo project, handles the GPU side. It dumps the full CUDA device state - contexts, memory mappings, streams - into CPU memory before CRIU serializes the host-side process. At restore time, the sequence runs in reverse.
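The ordering described above can be sketched as a pair of command plans. This is a minimal illustration of the manual sequence using the standalone `cuda-checkpoint` and `criu` command-line tools, not a description of how the snapshot-agent is implemented. The function names, the PID, and the image directory are hypothetical, and a real invocation would likely need further CRIU options that depend on how the process was launched.

```python
from typing import List


def checkpoint_plan(pid: int, images_dir: str) -> List[List[str]]:
    """Commands for checkpointing a CUDA process, in execution order."""
    return [
        # 1. GPU side first: move CUDA device state into host memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # 2. Host side second: dump the process tree, which now holds the GPU state.
        ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir],
    ]


def restore_plan(pid: int, images_dir: str) -> List[List[str]]:
    """Commands for restoring: the same two steps in reverse order."""
    return [
        # 1. Host side first: rebuild the process from the image directory.
        ["criu", "restore", "--images-dir", images_dir, "--restore-detached"],
        # 2. GPU side second: push the saved state back onto the device.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


if __name__ == "__main__":
    # Hypothetical PID and path, printed rather than executed.
    for cmd in checkpoint_plan(4242, "/var/dynamo/snapshots/demo"):
        print("checkpoint:", " ".join(cmd))
    for cmd in restore_plan(4242, "/var/dynamo/snapshots/demo"):
        print("restore:   ", " ".join(cmd))
```

The plans are returned as data instead of being run so the ordering can be inspected on a machine without a GPU or root access.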
Inside Dynamo Snapshot
CRIU handles the host side
CRIU (Checkpoint/Restore in Userspace) works by asking the Linux kernel to freeze a process tree, then dumping every relevant piece of kernel bookkeeping: memory maps, file descriptors, namespaces, threads. The result is a directory of files that, on restore, are fed back to the kernel to reconstruct the original process.
For inference containers, CRIU's main challenge is size. A vLLM process with a large model loaded can have hundreds of gigabytes of private memory. Dynamo Snapshot's first optimization addresses this directly.
KV cache unmapping cuts size dramatically
Before checkpointing, Dynamo's quiesce hook uses CUDA Virtual Memory Management APIs to unmap the KV cache buffer. Since no requests have been served yet from this fresh pod, the cache holds nothing useful. Unmapping it means CRIU never includes it in the snapshot.
The difference is large. Qwen3-0.6B's checkpoint drops from roughly 190 GiB to 6.2 GiB after unmapping. That is a roughly 30x reduction in checkpoint size, and checkpoint size largely determines how long restore takes.
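To make the arithmetic concrete, the short sketch below computes the reduction factor from the two sizes quoted above and a lower bound on read time for each. The 3 GiB/s storage bandwidth is an assumed, illustrative figure, not a number from NVIDIA's benchmarks, and the helper names are hypothetical.

```python
def reduction_factor(before_gib: float, after_gib: float) -> float:
    """How many times smaller the checkpoint is after unmapping."""
    return before_gib / after_gib


def min_read_seconds(size_gib: float, bandwidth_gib_per_s: float) -> float:
    """Lower bound on restore time: the bytes still have to be read once."""
    return size_gib / bandwidth_gib_per_s


# Sizes quoted in the article for Qwen3-0.6B.
BEFORE_GIB = 190.0
AFTER_GIB = 6.2

# Assumption for illustration only: sustained read bandwidth of the snapshot store.
ASSUMED_BANDWIDTH_GIB_PER_S = 3.0

factor = reduction_factor(BEFORE_GIB, AFTER_GIB)

if __name__ == "__main__":
    print(f"reduction: {factor:.1f}x")
    for label, size in (("with KV cache", BEFORE_GIB), ("unmapped", AFTER_GIB)):
        secs = min_read_seconds(size, ASSUMED_BANDWIDTH_GIB_PER_S)
        print(f"{label}: at least {secs:.1f} s to read {size} GiB")
```

Whatever the real bandwidth is, the read-time bound scales linearly with checkpoint size, which is why dropping an empty KV cache from the image pays off directly at restore.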
cuda-checkpoint freezes GPU state
With the process quiesced, cuda-checkpoint serializes the GPU side. It walks the CUDA context, finds all active device allocations, and copies them to host memory in a format that CRIU can include in its dump. At restore, the process reconstructs those allocations and remaps them into the right CUDA virtual addresses.
The CUDA context itself - kernels, streams, synchronization primitives - is reconstructed from the serialized state. From the inference framework's perspective, the process wakes up exactly where it stopped.
Async I/O speeds up memory restore
A further optimization targets the restore path itself. Standard CRIU restores private memory pages using sequential preadv calls. Dynamo Snapshot replaces this with Linux native AIO, maintaining up to 128 concurrent read requests. On storage-bottlenecked restores, this saturates available bandwidth rather than leaving I/O largely idle.
Shared memory buffers get similar treatment via parallel memfd restore, processing multiple buffers concurrently with a thread pool.
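The idea of keeping many reads in flight can be illustrated in a few lines. Note that this sketch swaps in a thread pool issuing `os.pread` calls rather than Linux native AIO, so it shows the concurrency pattern and not Dynamo Snapshot's actual restore code. The chunk size and function names are assumptions; the cap of 128 concurrent requests is the figure quoted above.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 64 * 1024      # assumed chunk size for the demo
MAX_IN_FLIGHT = 128    # concurrency cap quoted in the article


def parallel_read(path: str, chunk: int = CHUNK,
                  max_in_flight: int = MAX_IN_FLIGHT) -> bytes:
    """Read a file by issuing many positional reads concurrently."""
    size = os.path.getsize(path)
    offsets = range(0, size, chunk)
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
            # pread takes an explicit offset, so threads can share one fd
            # without racing on a file position. map() preserves order.
            parts = pool.map(lambda off: os.pread(fd, chunk, off), offsets)
            return b"".join(parts)
    finally:
        os.close(fd)


def demo() -> bool:
    """Write random bytes to a temp file and check the parallel read matches."""
    payload = os.urandom(CHUNK * 10 + 123)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
        path = f.name
    try:
        return parallel_read(path) == payload
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print("round trip ok:", demo())
```

A sequential loop would wait for each read to complete before issuing the next; with many requests outstanding, the storage device always has queued work, which is the effect the restore path is after.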
Performance Results
The benchmarks below come from NVIDIA's published technical blog. All tests used single-GPU workloads with the non-GMS checkpoint path unless noted.
| Model | Checkpoint Size | Restore Time | vs. Upstream CRIU | vs. Cold Start |
|---|---|---|---|---|
| Qwen3-0.6B | 6.2 GiB | 2.4 s | 2.8x faster | Much faster |
| Qwen3-8B | 26 GiB | 4.7 s | 5.1x faster | Much faster |
| gpt-oss-120b | 129 GiB | 15 s | 7.9x faster | 21x faster (GMS) |
The gpt-oss-120b row is the one that matters for production deployments. A 21x improvement over traditional cold starts - reached with the GPU Memory Service and striped local NVMe SSDs - brings a 120B model from several minutes to under 5 seconds.
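Two derived quantities help when reading the table: the effective restore throughput (checkpoint size divided by restore time) and the upstream CRIU restore time implied by each speedup factor. The sketch below computes both from the table values only. These are back-of-the-envelope inferences, not figures NVIDIA published, and the effective throughput covers all restore work, not raw storage bandwidth.

```python
from typing import List, Tuple

# (model, checkpoint size in GiB, restore time in s, speedup vs. upstream CRIU)
ROWS: List[Tuple[str, float, float, float]] = [
    ("Qwen3-0.6B", 6.2, 2.4, 2.8),
    ("Qwen3-8B", 26.0, 4.7, 5.1),
    ("gpt-oss-120b", 129.0, 15.0, 7.9),
]


def derive(size_gib: float, restore_s: float, speedup: float) -> Tuple[float, float]:
    """Return (effective GiB/s, implied upstream CRIU restore time in s)."""
    throughput = size_gib / restore_s
    upstream_s = restore_s * speedup
    return throughput, upstream_s


if __name__ == "__main__":
    for model, size, secs, speedup in ROWS:
        throughput, upstream = derive(size, secs, speedup)
        print(f"{model}: {throughput:.1f} GiB/s effective, "
              f"upstream CRIU about {upstream:.0f} s")
```

Read this way, the larger checkpoints also restore at higher effective throughput, which is consistent with the speedup over upstream CRIU growing with model size.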
How to Deploy
The Dynamo Snapshot snapshot-agent runs as a privileged DaemonSet on each Kubernetes node. A minimal setup looks like this:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: snapshot-agent
  namespace: nvidia-dynamo
spec:
  selector:
    matchLabels:
      app: snapshot-agent
  template:
    metadata:
      labels:
        app: snapshot-agent   # must match spec.selector.matchLabels
    spec:
      hostPID: true           # agent needs to see processes on the host
      hostIPC: true
      containers:
        - name: snapshot-agent
          image: nvcr.io/nvidia/ai-dynamo/snapshot-agent:latest
          securityContext:
            privileged: true  # required for checkpoint/restore
          volumeMounts:
            - name: criu-images
              mountPath: /snapshots
      volumes:
        - name: criu-images   # snapshots are stored on the node's disk
          hostPath:
            path: /var/dynamo/snapshots
```
Inference workers signal readiness using annotations, and the snapshot-agent handles checkpoint/restore without requiring changes to the container runtime itself. The system uses runc-managed containers and the standard Kubernetes lifecycle hooks.
Requirements and Compatibility
Figure: Dynamo Snapshot targets data center clusters running Kubernetes - not consumer or edge deployments. (Source: pexels.com)
| Requirement | Detail |
|---|---|
| Inference framework | vLLM or SGLang (TensorRT-LLM: not yet) |
| GPU count | Single-GPU only (multi-GPU: roadmap) |
| Container runtime | runc (CRI-O supported, containerd in progress) |
| Kubernetes | Any cluster with DaemonSet and hostPID access |
| Storage (basic) | NFS or local disk for snapshots |
| Storage (GMS sub-5s) | Striped local NVMe SSDs + GPUDirect Storage |
| Privilege level | Snapshot-agent requires privileged DaemonSet |
| CUDA | cuda-checkpoint included in the Dynamo release |
The requirement for privileged DaemonSets will raise flags in security-hardened clusters. This isn't an oversight - checkpoint/restore of GPU processes requires direct host PID namespace access. Clusters running strict pod security standards will need policy exceptions.
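One common form such an exception takes is a namespace-level label. The sketch below assumes the cluster enforces policy through Kubernetes' built-in Pod Security Admission; clusters using another policy engine would need that engine's own exemption mechanism. It generates a Namespace manifest for the `nvidia-dynamo` namespace used in the DaemonSet example, with the enforcement level set to `privileged`. The helper name is hypothetical.

```python
import json


def privileged_namespace(name: str) -> dict:
    """Build a Namespace manifest that opts out of restricted pod security."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                # Pod Security Admission label: allow privileged pods here.
                "pod-security.kubernetes.io/enforce": "privileged",
            },
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(privileged_namespace("nvidia-dynamo"), indent=2))
```

Scoping the exception to a dedicated namespace keeps the rest of the cluster under its stricter profile, so only the snapshot-agent runs with elevated privileges.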
Where It Falls Short
The 21x speedup number applies specifically to gpt-oss-120b with GMS and local NVMe. Smaller models get more modest gains. Qwen3-0.6B restores in 2.4 seconds without GMS, which is fast, but a cold start on a small model in a well-provisioned cluster might already clear in 30-60 seconds - a narrower gap than the headline suggests.
Single-GPU only is the bigger constraint for most production deployments. Models above 34B normally run across multiple GPUs. The Dynamo team has multi-GPU and multi-node support on its roadmap, but has given no timeline. Until those ship, Dynamo Snapshot doesn't help with the workloads that take longest to cold-start.
TensorRT-LLM support is also missing. Many teams running NVIDIA hardware use TensorRT-LLM specifically because of its throughput advantages on Blackwell. They won't benefit from this release in its current form.
The GMS path - needed for the most dramatic speedups - requires local NVMe SSDs with GPUDirect Storage. NFS works, but the results are substantially slower. Cloud Kubernetes clusters with networked block storage will see different numbers than NVIDIA's internal benchmarks.
This is still an experimental release. The API and snapshot format may change before it hits stable. Running this in production now means accepting migration work when it does.
Dynamo Snapshot joins a growing set of Kubernetes-native AI inference primitives - with llm-d's CNCF-donated inference stack and vLLM's elastic parallelism work - that are collectively making Kubernetes a more plausible substrate for inference workloads rather than a compromise. The cold-start problem isn't fully solved. But sub-5-second restoration of a 120B-parameter model is a different class of solution from anything available six months ago.
Code and documentation are available at github.com/ai-dynamo/dynamo.
