DeepSeek V3.2 vs V4 - What Changes With a Trillion Parameters
A pre-release comparison of DeepSeek V3.2 and V4 - examining the generational leap from 671B text-only to a trillion-parameter natively multimodal model with 1M context.

DeepSeek V3.2 is the model that broke the pricing floor in late 2025 - 671 billion parameters, 37 billion active per token, MIT licensed, and API pricing so aggressive it forced every competitor to rethink their economics. Now DeepSeek V4 is arriving in the first week of March with roughly 1 trillion parameters, native multimodal input, and a 1 million token context window.
This is not a point release. V4 is a generational leap - new architecture, new modalities, new hardware target, and potentially even cheaper pricing. Every major dimension of the model has changed.
Note: V4 has not been officially released. This comparison uses leaked benchmarks, reporting from FT/Reuters/CNBC, and the V4 Lite leak. We will update this article with verified data after DeepSeek publishes official specifications.
TL;DR
- V3.2 is the known quantity - fully released, benchmarked, MIT licensed, and the cheapest frontier API available. If you need a production model today, V3.2 is shipping.
- V4 is the upgrade worth waiting a few days for - if the leaked benchmarks hold, it closes V3.2's biggest gaps (SWE-bench, multimodal input, context length) while potentially being even cheaper.
Quick Comparison
| Feature | DeepSeek V3.2 | DeepSeek V4 (pre-release) |
|---|---|---|
| Status | Released (Dec 2025) | Expected March 3-7, 2026 |
| Architecture | MoE + MLA + DSA | MoE + MLA + mHC + Engram Memory + DSA/Lightning |
| Total Parameters | 671B | ~1T |
| Active Parameters | 37B | ~32B |
| Expert Routing | Top-2/top-4 | 16 experts per token |
| Context Window | 128K | 1M |
| Input Modalities | Text only | Text, Image, Video, Audio |
| API Pricing (Input) | $0.28/M (cache miss) | ~$0.14/M (estimated) |
| API Pricing (Output) | $0.42/M | ~$0.28/M (estimated) |
| License | MIT | Expected MIT or Apache 2.0 |
| Hardware | Nvidia primary | Huawei Ascend primary |
| SWE-bench Verified | 73.1% | 80%+ (leaked) |
| HumanEval | - | ~90% (leaked) |
| AIME 2025 | 93.1 | TBD |
| Codeforces | 2386 | TBD |
| MMLU-Pro | 85.0 | TBD |
| BrowseComp | 51.4-67.6% | TBD |
V3.2: The Current Champion
V3.2 landed in September 2025 (experimental) and December 2025 (official release), and it reshaped the industry's assumptions about what open-weight models could cost. At $0.28 per million input tokens on cache miss - dropping to $0.028 on cache hit - it is the cheapest frontier-quality API in the world. The automatic context caching alone makes it the obvious choice for any workload with repeated system prompts or overlapping context.
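To see why the caching matters, here is a back-of-the-envelope sketch of the blended input price at different cache-hit rates. The prices are V3.2's published rates from the table above; the 80% hit rate is a hypothetical workload parameter, not a measured figure.

```python
# Blended V3.2 input cost per million tokens, mixing cache hits and misses.
# Prices are DeepSeek's published V3.2 rates; the hit rate is illustrative.
CACHE_MISS_PRICE = 0.28   # $/M input tokens on cache miss
CACHE_HIT_PRICE = 0.028   # $/M input tokens on cache hit

def effective_input_price(hit_rate: float) -> float:
    """Blended $/M input tokens for a given cache-hit fraction (0..1)."""
    return hit_rate * CACHE_HIT_PRICE + (1 - hit_rate) * CACHE_MISS_PRICE

# An agent loop that resends a large system prompt on every call might
# hit the cache ~80% of the time, cutting the effective rate sharply.
print(f"${effective_input_price(0.8):.4f}/M")
```

At an 80% hit rate the blended price lands around $0.078/M - roughly a quarter of the headline cache-miss rate, which is where the caching advantage compounds for prompt-heavy workloads.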
The benchmark profile is strong but has clear gaps. On competition math (AIME 93.1, Codeforces 2386) and coding evaluation (LiveCodeBench 83.3), V3.2 competes with or beats proprietary models costing 10-30x more. On SWE-bench Verified (73.1%), it trails Claude Opus 4.6 by nearly 8 points - a meaningful gap for production code repair. On agentic tasks like BrowseComp (51.4-67.6%), the gap to the proprietary frontier is even larger. For our full assessment, see the V3.2 review.
V3.2 is text-only. No image input, no video, no audio. If your workload involves any visual understanding, V3.2 is out of the running and you need a separate model - DeepSeek-VL, Kimi K2.5, or one of the proprietary options.
V4: What the Leaks Tell Us
V4 changes nearly everything except the core philosophy of maximizing capability per dollar.
The architecture scales from 671B to approximately 1 trillion total parameters while actually reducing active parameters from 37B to ~32B per token. This means per-token inference costs should be comparable to or lower than V3.2's, despite the model being roughly 50% larger overall. The expert routing system expands from V3.2's top-2/top-4 selection to 16 expert pathways per token, drawn from hundreds of available experts per MoE layer.
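The routing mechanism both generations use is top-k gating: score every expert in a layer, keep the k best, and weight their outputs. The sketch below shows that generic mechanism with V4's reported k=16 over a toy pool of 256 experts - the dimensions and softmax gate are illustrative assumptions, not DeepSeek's actual implementation.

```python
import numpy as np

# Generic top-k MoE routing - the mechanism behind V3.2's top-2/top-4
# selection and V4's reported 16-of-N selection. Sizes are toy values.
def route_token(hidden: np.ndarray, gate_weights: np.ndarray, k: int = 16):
    """Pick the k highest-scoring experts for one token; return
    (expert indices, routing weights that sum to 1)."""
    scores = hidden @ gate_weights                 # one score per expert
    topk = np.argsort(scores)[-k:]                 # indices of the k best experts
    w = np.exp(scores[topk] - scores[topk].max())  # softmax over the selected k
    return topk, w / w.sum()

rng = np.random.default_rng(0)
hidden = rng.standard_normal(64)                   # token hidden state (toy size)
gate = rng.standard_normal((64, 256))              # 256 experts in this layer
experts, weights = route_token(hidden, gate, k=16)
print(len(experts))                                # 16 experts activated per token
```

Only the selected experts' parameters are touched per token, which is why a ~1T-parameter model can keep active parameters - and therefore per-token compute - near or below V3.2's level.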
According to the leaks, three new architectural innovations address V3.2's limitations:
- Manifold-Constrained Hyper-Connections (mHC) solves the training stability challenges that historically made trillion-parameter models unreliable to optimize
- Engram Conditional Memory provides the retrieval mechanism for the 8x expansion from 128K to 1M tokens of context
- DeepSeek Sparse Attention with Lightning Indexer extends V3.2's DSA to handle million-token sequences efficiently
The multimodal capability is the most significant functional change. V3.2 is text-only. V4 is described as natively multimodal - trained from the start on text, image, video, and audio data. This means V4 can replace both V3.2 and DeepSeek-VL in a single model, simplifying infrastructure for teams that currently run separate text and vision models.
The Coding Gap May Close
V3.2's most cited weakness is its SWE-bench Verified score of 73.1%, which trails Claude Opus 4.6 (80.8%) and GPT-5.3 Codex (80.0%) by 7-8 points. For teams building automated code repair pipelines, that gap matters.
The leaked V4 benchmarks suggest SWE-bench Verified above 80%. If that holds, V4 would match the proprietary frontier on the single most important real-world coding benchmark - while costing a fraction of the price. HumanEval at ~90% corroborates this picture.
The V4 Lite leak provides additional evidence. The smaller variant demonstrated breakthrough SVG code generation and was described by one inference provider as producing "more optimized code than DeepSeek 3.2, Claude Opus 4.6, and Gemini 3.1." If the smaller variant already surpasses V3.2 on code quality, the full V4 should be a significant step up.
Benchmark Comparison
| Benchmark | DeepSeek V3.2 | DeepSeek V4 (leaked) | Delta |
|---|---|---|---|
| SWE-bench Verified | 73.1% | 80%+ | V4 +7 points or more |
| HumanEval | - | ~90% | V4 (no V3.2 baseline) |
| AIME 2025 | 93.1 | TBD | - |
| MMLU-Pro | 85.0 | TBD | - |
| GPQA Diamond | 82.4 | TBD | - |
| Codeforces | 2386 | TBD | - |
| LiveCodeBench | 83.3 | TBD | - |
| BrowseComp | 51.4-67.6% | TBD | - |
| Context Window | 128K | 1M | V4 (8x longer) |
| Active Params | 37B | ~32B | V4 (fewer, more efficient) |
| Total Params | 671B | ~1T | V4 (50% larger) |
| Modalities | Text | Text, Image, Video, Audio | V4 (native multimodal) |
The comparison is frustratingly incomplete because most V4 benchmark data hasn't leaked. What we can say: on the two leaked benchmarks (SWE-bench and HumanEval), V4 appears to be a substantial improvement. On the architectural dimensions - context length, modality, parameter efficiency - V4 is unambiguously superior.
Pricing Analysis
| Cost Factor | DeepSeek V3.2 | DeepSeek V4 (estimated) |
|---|---|---|
| Input (per 1M tokens) | $0.28 (cache miss) | ~$0.14 |
| Input (cache hit) | $0.028 | TBD |
| Output (per 1M tokens) | $0.42 | ~$0.28 |
| License | MIT | Expected MIT or Apache 2.0 |
If the estimated pricing holds, V4 would be roughly 50% cheaper on input and 33% cheaper on output than V3.2 - despite being a larger, multimodal model with 8x the context window. DeepSeek's efficiency at reducing per-token costs with each generation has been the single most disruptive force in AI pricing. V4 would continue that trend.
For a team processing 10 million output tokens per day, the annual cost drops from ~$1,533 (V3.2) to ~$1,022 (V4 estimated). The savings are meaningful but not dramatic at the per-token level - the real value is getting multimodal, 1M context, and better coding for less money.
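The arithmetic behind those annual figures is straightforward to reproduce. V3.2's output rate is published; V4's is the estimate from the pricing table, not an official rate.

```python
# Reproducing the annual output-cost comparison above.
# V3.2's rate is published; V4's rate is an unofficial estimate.
DAILY_OUTPUT_TOKENS_M = 10        # 10M output tokens per day
V32_OUTPUT = 0.42                 # $/M output tokens, published
V4_OUTPUT_EST = 0.28              # $/M output tokens, estimated

def annual_cost(price_per_m: float) -> float:
    """Yearly spend in dollars at a given $/M output-token rate."""
    return DAILY_OUTPUT_TOKENS_M * price_per_m * 365

print(f"V3.2:      ${annual_cost(V32_OUTPUT):,.0f}/yr")
print(f"V4 (est.): ${annual_cost(V4_OUTPUT_EST):,.0f}/yr")
```

Note the calculation covers output tokens only; input-side savings scale separately and depend heavily on the cache-hit rate.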
DeepSeek V3.2: Pros and Cons
Pros:
- Released, verified, and production-stable - known quantity with months of real-world use
- Cheapest frontier API available today ($0.028/M on cache hit)
- MIT license with massive community and third-party ecosystem
- AIME 93.1 and Codeforces 2386 - best-in-class competition math and coding
- Extensive documentation, model card, and technical report available
- Runs on Nvidia GPUs with optimized inference tooling
Cons:
- Text-only - no vision, video, or audio input
- 128K context window is 8x smaller than V4's expected 1M
- SWE-bench 73.1% trails the proprietary frontier by 7-8 points
- BrowseComp 51.4-67.6% shows real weakness on agentic tasks
- Will likely be superseded by V4 within days
DeepSeek V4: Pros and Cons
Pros (expected):
- ~1T parameters with only ~32B active - potentially better performance per FLOP than V3.2
- Native multimodal replaces the need for separate vision models
- 1M context window with purpose-built Engram Conditional Memory
- Leaked SWE-bench 80%+ would match the proprietary frontier on coding
- Potentially 33-50% cheaper than V3.2 on per-token pricing
- Three new architectural innovations that advance MoE design
Cons (expected/potential):
- Not yet released - all claims are unverified
- Huawei Ascend optimization means potentially slower Nvidia inference at launch
- No leaked data on agentic tasks, math, or reasoning benchmarks
- Trillion-parameter model requires even more infrastructure to self-host than V3.2
- Community tooling and third-party integrations will take time to develop
- V3.2's BrowseComp weakness may persist if the improvements focus on coding
Verdict
If you need a model today, use V3.2. It is released, stable, MIT licensed, thoroughly benchmarked, and the cheapest frontier API available. The community tooling is mature, the documentation is extensive, and it works on Nvidia GPUs without friction.
If you can wait a few days, wait for V4. The generational improvements - native multimodality, an 8x context window, leaked SWE-bench scores closing the proprietary gap, potentially cheaper pricing - are substantial enough that deploying V3.2 for a new project right now only makes sense if your timeline cannot tolerate even a one-week delay.
If multimodal matters, the choice is already clear. V3.2 is text-only. V4 is not. If your workload involves any visual or audio input, V4 eliminates the need to run a separate model entirely.
The biggest unknown is whether V4's agentic performance improves over V3.2. BrowseComp and sustained tool-use tasks were V3.2's weakest areas. If V4 addresses that gap alongside coding, it would be the most complete open-weight model ever released. If V4 focuses narrowly on reasoning and coding while leaving agentic performance flat, there will still be compelling reasons to pair it with proprietary models for complex workflows. We'll know in a matter of days.
