Qwen 3.5 FP8 Weights Drop - How to Actually Deploy a 397B Model on 8 GPUs

Alibaba releases official FP8-quantized weights for the Qwen 3.5 flagship and 27B dense model, cutting memory requirements roughly in half and enabling deployment on 8x H100 GPUs with native vLLM and SGLang support.

Alibaba's Qwen team has released official FP8-quantized weights for the Qwen 3.5 model family. The two models that matter most: Qwen3.5-397B-A17B-FP8 (the flagship MoE) and Qwen3.5-27B-FP8 (the dense workhorse). Both are live on HuggingFace with native support for vLLM and SGLang out of the box.

Key Specs

Models: Qwen3.5-397B-A17B-FP8, Qwen3.5-27B-FP8
FP8 format: F8_E4M3 (4-bit exponent, 3-bit mantissa)
Memory savings: ~50% vs BF16 weights
Flagship GPU requirement: 8x H100 (80GB) with tensor parallelism
Frameworks: vLLM, SGLang (native), HF Transformers
Context: 262K native, 1M+ with YaRN scaling
License: Apache 2.0

Why FP8 Matters Here

The BF16 version of Qwen3.5-397B-A17B weighs roughly 800GB. That's a two-node, 16-GPU problem for most teams, or an exercise in creative sharding that nobody wants to debug at 2 AM. FP8 cuts the weight storage roughly in half - about 400GB - which fits across 8x H100 80GB GPUs with room left for KV cache.
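A back-of-the-envelope check of those figures, counting weight bytes only (real deployments add overhead for embeddings, activation buffers, and the KV cache):

```python
# Napkin math for weight storage of a 397B-parameter model.
PARAMS = 397e9

bf16_gb = PARAMS * 2 / 1e9   # BF16: 2 bytes per parameter
fp8_gb = PARAMS * 1 / 1e9    # FP8: 1 byte per parameter

print(f"BF16 weights: ~{bf16_gb:.0f} GB")  # ~794 GB
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")   # ~397 GB

# Per-GPU share across 8x H100 80GB under tensor parallelism:
per_gpu = fp8_gb / 8
print(f"Per GPU: ~{per_gpu:.0f} GB of 80 GB")  # ~50 GB, leaving ~30 GB for KV cache
```

The per-GPU remainder is what becomes KV cache and scheduling headroom, which is why FP8 makes a single 8-GPU node viable where BF16 was not.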

The quantization uses the E4M3 format (4-bit exponent, 3-bit mantissa), which preserves more dynamic range than INT8 while still halving memory. Qwen trained with an FP8-native pipeline that applies low-precision arithmetic to activations, MoE routing, and matrix operations from the start, so these aren't post-hoc quantized weights - they're a first-class artifact from the training run.
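The dynamic-range claim can be made concrete. A sketch of the E4M3 number line (the "FN" variant used for FP8 inference, which has no infinities) versus symmetric INT8; the constants below follow from the format definition itself, not from anything Qwen has published:

```python
# E4M3FN: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
# Largest normal value: mantissa 1.110 at the top usable exponent,
# i.e. 1.75 * 2^8 (exp field 1111 with mantissa 111 is reserved for NaN).
e4m3_max = (1 + 6 / 8) * 2 ** (15 - 7)  # 448.0

# Smallest positive subnormal: (1/8) * 2^(1 - 7) = 2^-9.
e4m3_min = 2 ** (1 - 7) * 2 ** -3       # ~0.00195

print(e4m3_max)            # 448.0
print(e4m3_max / e4m3_min) # 229376.0: ratio of largest to smallest magnitude

# Symmetric INT8 spans only -127..127 in uniform steps; its largest-to-
# smallest nonzero ratio is 127. Floating point trades uniform precision
# for coverage across magnitudes, which suits weight/activation tails.
print(127)
```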

For the 27B dense model, FP8 is less dramatic but still useful - it drops the model from ~56GB to ~28GB, making single-GPU deployment feasible on an A100 or H100.

Deployment Configurations

vLLM

The standard launch for the 397B flagship:

vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --port 8000 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

For tool calling (agentic workflows):

vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --port 8000 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Multi-token prediction (MTP) for speculative decoding - this is where the throughput gains get interesting:

vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --port 8000 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --speculative-config \
    '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

If you're running text-only workloads and want to reclaim the memory the vision encoder consumes:

vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --port 8000 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --language-model-only

SGLang

SGLang auto-detects the hybrid Gated DeltaNet attention layers and loads optimized kernels. Standard launch:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B-FP8 \
  --port 8000 \
  --tp-size 8 \
  --mem-fraction-static 0.8 \
  --context-length 262144 \
  --reasoning-parser qwen3

MTP speculative decoding on SGLang:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B-FP8 \
  --port 8000 \
  --tp-size 8 \
  --mem-fraction-static 0.8 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --speculative-algo NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4

Both frameworks expose an OpenAI-compatible API, so swapping in the FP8 model behind an existing chat or completion endpoint is a config change, not a rewrite.
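Concretely, the only client-side difference is the model string in the request body. A minimal sketch of the chat payload (the prompt text here is illustrative; POST it to /v1/chat/completions on whichever server you launched):

```python
import json

# The same OpenAI-style chat payload works against vLLM's and SGLang's
# /v1/chat/completions endpoint; switching to the FP8 build is just a
# different "model" value.
payload = {
    "model": "Qwen/Qwen3.5-397B-A17B-FP8",  # previously the BF16 model name
    "messages": [
        {"role": "user", "content": "Summarize FP8 E4M3 in one sentence."}
    ],
    "max_tokens": 128,
}
body = json.dumps(payload)
print(body)
# Send with Content-Type: application/json; no API key is needed unless
# the server was started with one configured.
```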

Extending to 1M+ Context

YaRN RoPE scaling pushes context to 1,010,000 tokens. On vLLM:

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve \
  Qwen/Qwen3.5-397B-A17B-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 1010000 \
  --reasoning-parser qwen3 \
  --hf-overrides '{"text_config": {"rope_parameters": {
    "rope_type": "yarn", "factor": 4.0,
    "original_max_position_embeddings": 262144}}}'

You'll need substantially more GPU memory for the KV cache at this length. Plan accordingly.
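To see why, the KV cache for full-attention layers grows linearly with sequence length. A sketch with illustrative dimensions - the layer count, KV-head count, and head size below are placeholders, not Qwen3.5's published architecture, which also mixes in Gated DeltaNet layers that cache state differently:

```python
# Per-token KV cache: 2 tensors (K and V) * layers * kv_heads * head_dim
# * bytes per value. All structural numbers here are assumed for illustration.
def kv_cache_gb(tokens, layers=60, kv_heads=8, head_dim=128, dtype_bytes=1):
    # dtype_bytes=1 assumes an FP8 KV cache; use 2 for BF16.
    per_token_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes
    return tokens * per_token_bytes / 1e9

print(kv_cache_gb(262_144))    # per-sequence cost at 262K context
print(kv_cache_gb(1_010_000))  # per-sequence cost at 1M context

# Whatever the per-layer constants are, the growth is strictly linear:
print(kv_cache_gb(1_010_000) / kv_cache_gb(262_144))  # ~3.85x
```

The ratio is fixed by the token counts alone, which is why stretching context 3.85x eats memory long before it stresses compute.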

Hardware Requirements

Configuration: GPUs / total VRAM / context budget

397B FP8 (full): 8x H100 80GB / 640GB / 262K tokens
397B FP8 (text-only): 8x H100 80GB / 640GB / 262K+ (more KV headroom)
397B FP8 (1M context): 8x H200 141GB+ / 1TB+ / 1M tokens
27B FP8: 1x H100 80GB / 80GB / 262K tokens
27B FP8 (reduced): 1x A100 40GB / 40GB / ~64K tokens

The --mem-fraction-static 0.8 flag on SGLang reserves 80% of GPU memory for model weights and KV cache. Adjust down if you're running other processes on the same GPUs.
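Concretely, on the 8x H100 node that static fraction sets the ceiling for weights plus KV cache. Rough numbers, taking ~400GB as the FP8 weight footprint:

```python
gpus, vram_gb = 8, 80
mem_fraction_static = 0.8  # SGLang's --mem-fraction-static

# Pool reserved for model weights + KV cache across the node:
static_budget = gpus * vram_gb * mem_fraction_static
weights_gb = 400  # approximate FP8 weight footprint

kv_headroom = static_budget - weights_gb
print(static_budget)  # 512.0
print(kv_headroom)    # 112.0 GB left for KV cache node-wide
# Dropping the fraction to 0.7 shrinks the budget to 448 GB, leaving
# only ~48 GB of KV headroom - a direct cut to max batch size/context.
```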

Benchmark Sanity Check

FP8 quantization typically costs 0.1-0.5% on benchmarks versus full precision. Qwen's reported numbers for the 397B FP8 variant:

MMLU-Pro: 87.8 (knowledge)
GPQA Diamond: 88.4 (reasoning)
HMMT Feb 25: 94.8 (math)
SWE-bench Verified: 76.4 (coding)
LiveCodeBench v6: 83.6 (coding)
MMMU: 85.0 (vision)
MathVision: 88.6 (vision)
These are strong numbers across the board. The SWE-bench score of 76.4 puts it in the same tier as Claude Sonnet 4.6 and ahead of GPT-5-mini. The MMMU 85.0 is competitive with the best proprietary vision-language models.

Where It Falls Short

No FP8 for the MoE medium models yet. The 35B-A3B and 122B-A10B variants - the medium series models that made headlines last week - don't have official FP8 weights from Qwen. Community quantizations exist (GGUF from Unsloth, etc.), but if you want Qwen-blessed FP8 for the MoE mediums, you're waiting.

Gated DeltaNet framework support is still catching up. vLLM and SGLang have native support, but TensorRT-LLM and other inference stacks may not fully optimize for the hybrid attention architecture yet. If your deployment pipeline is locked to a specific framework, check compatibility before committing.

8x H100 isn't cheap. FP8 halves the memory floor, but the floor was already high. Running this model in production still requires serious GPU infrastructure. For teams without H100 clusters, the 27B-FP8 on a single GPU or the community-quantized medium models are more realistic starting points.

Context scaling has sharp cost curves. Going from 262K to 1M tokens with YaRN is technically supported, but the KV cache memory requirements scale linearly with context length. At 1M tokens across 8 GPUs, you're memory-constrained well before you're compute-constrained.

The FP8 weights are a deployment story, not a capability story. The model is the same Qwen3.5-397B that launched two weeks ago. What's new is that running it in production just got meaningfully more accessible - if your definition of accessible includes an 8-GPU cluster.

About the author

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.