vLLM 0.17 Ships FlashAttention 4 and Live MoE Scaling
vLLM v0.17.0 adds FlashAttention 4, elastic expert parallelism for live MoE rescaling, full Qwen3.5 support, and a performance-mode flag, all in 699 commits from 272 contributors.

TL;DR
- vLLM v0.17.0 ships with FlashAttention 4 - 1,605 TFLOPS on Blackwell B200, 2.7x over Triton
- Elastic Expert Parallelism Milestone 2 lets MoE deployments rescale GPU counts without a restart
- Full Qwen3.5 family support including FP8 quant, GDN architecture, and realtime ASR streaming
- New `--performance-mode` flag: pick `balanced`, `interactivity`, or `throughput` at launch time
- PyTorch 2.10.0 upgrade is a breaking change - existing environments need updates before upgrading
vLLM v0.17.0 landed on March 7, 2026, two days after FlashAttention 4's own release. Two days from upstream publication to a production-grade integration inside the most widely deployed open-source inference engine isn't a marketing milestone; it reflects a community that has grown to 272 active contributors across 699 commits in a single cycle - 48 of them committing to vLLM for the first time.
The release covers a lot of ground: attention backends, parallelism strategies, model support, multi-hardware targets, and a small but useful operator experience improvement. Below is what matters and what doesn't.
FlashAttention 4 Integration
What FA4 Actually Delivers
FlashAttention 4, published by Together AI on March 5, was designed specifically for NVIDIA Blackwell's asymmetric architecture - tensor core throughput scaled from 1 to 2.25 petaFLOPS between H100 and B200, but shared memory bandwidth and exponential function units didn't keep pace. FA4's forward pass uses ping-pong scheduling with dual query tile processing to keep both pipelines saturated, and its backward pass exploits Blackwell's 2-CTA MMA mode to cut shared memory traffic in half.
The result: 1,605 TFLOPS on B200 with 71% hardware use, versus around 600 TFLOPS from FlashAttention 3 on Hopper. That's a 2.7x speedup over Triton implementations and a 1.1-1.3x improvement over cuDNN 9.13. Deterministic mode holds 85-90% of that ceiling.
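The scheduling details above are kernel-level CUDA work, but the core tiling idea that all FlashAttention-style kernels share - streaming over K/V blocks with a running (online) softmax so the full attention matrix is never materialized - can be sketched in plain NumPy. This is an illustrative sketch only, not FA4's implementation:

```python
import numpy as np

def tiled_attention(q, k, v, block=64):
    """Attention over K/V tiles with a running (online) softmax -
    the core trick behind FlashAttention-style kernels. Purely
    illustrative; the real FA4 kernel is CUDA with ping-pong
    scheduling across tensor-core and softmax pipelines."""
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    m = np.full(q.shape[0], -np.inf)   # running row max
    l = np.zeros(q.shape[0])           # running softmax denominator
    acc = np.zeros_like(q)             # running weighted sum of V
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                 # scores for this tile
        m_new = np.maximum(m, s.max(axis=1))
        correction = np.exp(m - m_new)         # rescale previous state
        p = np.exp(s - m_new[:, None])
        l = l * correction + p.sum(axis=1)
        acc = acc * correction[:, None] + p @ vb
        m = m_new
    return acc / l[:, None]
```

The output matches a one-shot `softmax(QK^T / sqrt(d)) @ V` up to floating-point error; the win is that each tile's scores are consumed immediately, which is what lets the fused kernel keep everything in fast on-chip memory.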
vLLM Blackwell throughput vs. latency on Llama 3.3 70B (1K input, 8K output) compared to Hopper-generation performance.
Source: blog.vllm.ai
For Hopper-class hardware (H100, H200) that doesn't benefit from the Blackwell-specific scheduling, FA4 still offers modest improvements through its CuTe-DSL compilation approach, which cuts compile times by 20-30x compared to prior releases.
Enabling FA4 in Practice
FA4 activates on supported hardware without extra configuration:
```shell
pip install vllm==0.17.0

vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --max-model-len 32768 \
    --tensor-parallel-size 4
```
The engine selects FA4 automatically on Blackwell. To confirm the backend is active:
```python
from vllm import LLM

llm = LLM(model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8")
print(llm.llm_engine.model_config.attention_backend)
# -> flashattention4 on SM100+ hardware
```
If you're on Hopper, the output will read `flashattention3` or `flashattention2`, depending on your CUDA version.
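The selection rule described above can be sketched as a simple dispatch on compute capability. This is a hypothetical illustration of the policy, not vLLM's actual dispatch code (on a live system the capability comes from `torch.cuda.get_device_capability()`):

```python
def pick_attention_backend(sm_major: int, sm_minor: int = 0) -> str:
    """Hypothetical sketch of the backend-selection rule described
    above - not vLLM's real dispatch code."""
    if sm_major >= 10:           # Blackwell (SM100+), e.g. B200
        return "flashattention4"
    if sm_major == 9:            # Hopper (SM90): H100 / H200
        return "flashattention3"
    return "flashattention2"     # Ampere and earlier
```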
Elastic Expert Parallelism - The Infrastructure Story
This is the feature most production teams should care about, and it received the least press. MoE models like Qwen3.5-122B-A10B and Llama 4 Maverick spread expert layers across GPUs, and until now that spread was static. To resize the cluster - add GPUs during a traffic spike, or remove them when utilization drops - you had to take the endpoint offline, reconfigure, and restart. In practice that meant either overprovisioning or accepting dropped requests during capacity changes.
Elastic EP Milestone 2
Elastic Expert Parallelism Milestone 2 introduces two mechanisms: an async EPLB (Expert Parallelism Load Balancer) whose rebalancing algorithm redistributes expert assignments in the background, and a sync enforcement layer that uses the NCCL backend to commit the rebalance without pausing in-flight requests.
The practical shape of a rebalance looks like this:
```python
from vllm.config import ParallelConfig

# Initial deployment: 8 GPUs, experts distributed 8-way
config = ParallelConfig(
    tensor_parallel_size=4,
    expert_parallel_size=8,
)

# Scale to 16 GPUs without restart - triggers the async EPLB
# rebalance plus NCCL sync enforcement
config.elastic_scale(target_expert_parallel_size=16)
```
The rebalance algorithm works by monitoring per-expert token load and migrating cold experts to newly added workers. Hot experts stay pinned to avoid unnecessary data movement. The NCCL sync enforces a consistent view across all workers before the new routing table activates.
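The migration policy just described - hot experts stay pinned, cold experts move to the new workers - can be sketched as a toy placement function. All names and the hot-fraction threshold here are invented for illustration; this is not vLLM's actual EPLB implementation:

```python
from typing import Dict, List

def rebalance(expert_load: Dict[int, int],
              placement: Dict[int, int],
              new_workers: List[int],
              hot_fraction: float = 0.25) -> Dict[int, int]:
    """Toy sketch of the policy described above: keep the hottest
    experts pinned where they are, round-robin the cold ones onto
    newly added workers. Invented names/threshold - not vLLM's EPLB."""
    ranked = sorted(expert_load, key=expert_load.get, reverse=True)
    n_hot = max(1, int(len(ranked) * hot_fraction))
    hot = set(ranked[:n_hot])           # pinned: avoid moving hot weights
    new_placement = dict(placement)
    for i, expert in enumerate(e for e in ranked if e not in hot):
        new_placement[expert] = new_workers[i % len(new_workers)]
    return new_placement
```

The real system additionally has to move expert weights and swap routing tables atomically - the part the NCCL sync enforcement handles.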
This doesn't completely solve MoE serving - hot-expert imbalance is still a real problem on heterogeneous workloads, and the rebalance itself adds a brief coordination latency. But it removes the hard requirement for offline restarts, which changes how operators can deploy and price MoE capacity.
Model Runner V2 at Full Maturity
Model Runner V2 has supported inference since v0.15, but v0.17.0 fills the last major gaps. Pipeline parallelism and decode context parallelism now work inside MR V2, along with Eagle3 speculative decoding backed by CUDA graphs. The practical effect: all the performance optimizations that shipped in V2's first iteration now apply across the full range of serving configurations, including multi-node pipeline-parallel deployments.
The `--performance-mode` flag is the user-visible interface to the underlying scheduler behavior:
```shell
# Optimized for interactive chat - minimize time to first token
vllm serve Qwen/Qwen3.5-32B \
    --performance-mode interactivity

# Maximize batch throughput for offline workloads
vllm serve Qwen/Qwen3.5-32B \
    --performance-mode throughput

# Default: balanced trade-off
vllm serve Qwen/Qwen3.5-32B \
    --performance-mode balanced
These map to internal scheduler knobs that previously required knowing which config parameters to set. The abstraction is sensible - most operators know whether they're building a chat product or a batch pipeline.
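Conceptually, the flag is a named preset over scheduler knobs. The mapping below is a hypothetical illustration - the knob names and values are invented, not vLLM's real preset table:

```python
# Hypothetical mapping from --performance-mode to scheduler knobs.
# Knob names and values are invented for illustration only.
PERFORMANCE_MODES = {
    "interactivity": {"max_batched_tokens": 2048, "prioritize_prefill": True},
    "throughput":    {"max_batched_tokens": 16384, "prioritize_prefill": False},
    "balanced":      {"max_batched_tokens": 8192, "prioritize_prefill": False},
}

def resolve_mode(mode: str) -> dict:
    """Look up the preset for a given performance mode."""
    if mode not in PERFORMANCE_MODES:
        raise ValueError(f"unknown performance mode: {mode}")
    return PERFORMANCE_MODES[mode]
```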
Qwen3.5 Full Stack Support
After the Qwen3.5 FP8 weights release we covered in February, v0.17.0 closes the loop on the full family. The GDN (Gated Delta Networks) architecture that the smaller Qwen3.5 models use is now natively supported, with:
- FP8 quantization for all Qwen3.5 variants
- MTP (Multi-Token Prediction) speculative decoding
- Realtime ASR streaming via Qwen3-ASR, FunASR, and FireRedASR2
The ASR streaming support is new territory for vLLM. The inference path runs audio through chunked preprocessing before entering the same batching engine used for text, which means audio and text requests can share GPU capacity on a single server without separate process management.
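The chunked-preprocessing step can be pictured as slicing the incoming stream into overlapping fixed-size windows before each window enters the shared batch queue. A minimal sketch - the window and overlap sizes here are illustrative assumptions, not vLLM's actual values:

```python
def chunk_audio(samples, chunk_ms=640, overlap_ms=80, sample_rate=16000):
    """Slice a mono sample stream into overlapping fixed-size chunks,
    the kind of preprocessing a streaming ASR path performs before
    chunks enter a shared batching engine. Sizes are illustrative."""
    size = chunk_ms * sample_rate // 1000          # samples per chunk
    step = (chunk_ms - overlap_ms) * sample_rate // 1000  # hop size
    chunks = []
    start = 0
    while start < len(samples):
        chunks.append(samples[start:start + size])
        start += step
    return chunks
```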
Together AI's FlashAttention 4 hits 71% hardware use on B200 - vLLM v0.17.0 integrates it two days after release.
Source: blockchain.news
Beyond Qwen3.5, nine new model architectures land in this release: COLQwen3, Ring 2.5, Ovis 2.6, plus expanded multimodal handling with audio chunking and video input. If you're running a Nemotron 3 Nano or similar MoE architecture, the elastic EP improvements apply directly.
Hardware Coverage
| Platform | New in v0.17.0 |
|---|---|
| NVIDIA SM100 (Blackwell) | FMHA FP8 prefill, MXFP8 blockscaled ops, NVFP4 kernel fixes |
| AMD ROCm | AITER fused RoPE, MXFP4 ops, bitsandbytes support |
| Intel XPU | CUDA graph support, GPUDirect RDMA via NIXL |
| CPU (ARM) | BF16 cross-compilation, KleidiAI INT8_W4A8 |
| CPU (x86/s390x) | FP16, s390x vector intrinsics |
Multi-hardware coverage has been a consistent vLLM priority, and this release extends it further. The Intel XPU additions, in particular, reflect real production demand from enterprise teams that can't use NVIDIA hardware for compliance or cost reasons.
Anthropic API Compatibility
An underreported addition: Anthropic API support now includes thinking blocks, the count_tokens API, tool_choice=none, and improved streaming and image handling. If you're using vLLM as a drop-in inference backend for Claude-compatible clients, this matters. It's also relevant for developers assessing whether to run Claude-style workflows on local hardware - see our guide on running open-source LLMs locally for the full setup picture.
Where It Falls Short
The performance gains in v0.17.0 are real but narrow. The 13.9% throughput improvement for pooling models and 2.9% E2E gain from async pipeline-parallel send/recv apply to specific workloads - you won't see these numbers on a standard chat endpoint running Llama.
Elastic EP is useful, but the NCCL sync enforcement adds coordination overhead during rebalance. Under sustained high load, a rebalance event will create a brief latency spike. The project's RFC for Elastic EP acknowledges this and lists it as a known limitation pending further optimization. Production teams should test rebalance behavior under load before relying on it for autoscaling.
The PyTorch 2.10.0 upgrade is a genuine friction point. Existing vLLM deployments using `pip install --upgrade vllm` will break if PyTorch isn't updated separately first. CUDA 12.9+ users also face an edge case: `CUBLAS_STATUS_INVALID_VALUE` errors that require removing system CUDA from `LD_LIBRARY_PATH` or using specific install flags. Neither issue is insurmountable, but the upgrade path is less clean than previous releases.
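For the `LD_LIBRARY_PATH` workaround, one possible approach is to filter system CUDA entries out of the variable before launching the server, so the libraries bundled with the PyTorch wheel take precedence. The path pattern below is illustrative; check which entries your environment actually carries:

```shell
# Sketch of one workaround for the CUDA 12.9+ CUBLAS error: drop
# system CUDA entries from LD_LIBRARY_PATH so the PyTorch wheel's
# bundled libraries are used instead. Path pattern is illustrative.
clean_ld_path() {
  printf '%s' "$1" | tr ':' '\n' | grep -v '/usr/local/cuda' | paste -sd: -
}
LD_LIBRARY_PATH="$(clean_ld_path "${LD_LIBRARY_PATH:-}")"
export LD_LIBRARY_PATH
```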
The vLLM project isn't slowing down. The combination of FA4 for Blackwell, elastic parallelism for MoE, and a growing multi-hardware matrix makes v0.17.0 the most production-capable release the project has shipped. Whether that translates to simplified operations depends on whether your team has the engineering capacity to absorb PyTorch migration and test elastic rebalancing under real load - both of which require hands-on work that the release notes don't make obvious. Check our comparison of local LLM tools if you're still deciding which inference stack fits your setup.