Open Source LLM Hosting Costs - June 2026
Verified June 2026: real cost per million tokens for self-hosting Llama 4 Scout, Maverick, Qwen3-235B, and DeepSeek V3.2 - GPU requirements, cost formulas, and when cheap APIs actually win.

TL;DR
- Llama 4 Scout on a single RunPod H100 PCIe ($2.89/hr) costs ~$0.18/MTok at 100% utilization and ~$0.36/MTok at 50% - competitive with mid-tier API providers
- DeepSeek V3.2 on 8x H100 costs $4.43/MTok at full utilization - the API at $0.27/$1.10 per MTok is roughly 5x cheaper, so self-host it only if you need privacy
- Self-hosting wins on compliance, fine-tuned models, and predictable monthly budgets - rarely on raw cost alone against 2026 API prices
The economics of running open-source models have shifted hard over the past year. API providers now compete aggressively on price, and commodity inference pricing for models like Llama 4 Scout starts at $0.08/MTok input - cheaper than what most teams can self-host. So the question isn't just "how much does self-hosting cost?" It's also "compared to what?"
This page walks through the hardware requirements, GPU cloud costs, cost-per-million-token math, and the honest comparison to current API pricing. The numbers here use verified GPU rates from RunPod and Lambda Labs as of June 2026 and estimated throughput figures from vLLM benchmarks. Throughput varies by workload, batch size, and context length - treat these as planning baselines, not guarantees.
What Each Model Actually Needs
Every model covered here uses a Mixture-of-Experts (MoE) architecture. MoE is efficient at inference time because only a fraction of the model's parameters activate per token - but the full weight set still lives in VRAM. You load all the experts, even though most sit idle for any given token. That disconnect between "active parameters" and "memory footprint" is what trips up hardware planning.
| Model | Total Params | Active Params | Weight VRAM (INT4 unless noted) | Min GPU Setup | Notes |
|---|---|---|---|---|---|
| Llama 4 Scout | 109B MoE | 17B | ~55GB | 1x H100 80GB | Comfortable fit with 25GB KV cache headroom |
| Llama 4 Maverick | 400B MoE | 17B | ~200GB | 3x H100 80GB | Tight at Q4, needs tensor parallelism |
| Qwen3-235B-A22B | 235B MoE | 22B | ~132GB | 2x H100 80GB (min) | 4x recommended for production KV cache |
| DeepSeek V3.2 | 685B MoE | 37B | ~350GB FP8 | 8x H100 80GB | Limited context headroom at FP8; 8x H200 for full 128K |
For Scout, the 1x H100 setup is truly comfortable - 55GB of weights leaves 25GB of KV cache headroom at 80GB total VRAM. At FP16 precision, Scout balloons to ~218GB and needs 4x H100s, but no production deployment should be running Scout at FP16.
Maverick is trickier. At Q4 quantization the weights run roughly 200GB, which means 3x H100 SXM (240GB total) barely fits. If you're running long contexts or large batches, you'll want 4x for headroom rather than the minimum of 3.
DeepSeek V3.2 is a data-center class deployment. There's no consumer path here. At FP8, 8x H100 SXM gives you 640GB with enough room for inference at moderate context lengths, but full 128K context runs require stepping up to H200 hardware with its 141GB per GPU.
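The sizing rule behind the table can be sketched in a few lines. This is a rough estimate, assuming weight memory is simply parameter count times bytes per parameter; the function names are illustrative, and the result ignores KV cache, activations, and framework overhead, so real deployments need more headroom than it reports.

```python
import math


def weight_footprint_gb(total_params_billion: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB: parameters x bytes per parameter."""
    return total_params_billion * bits_per_param / 8


def min_gpus_for_weights(weights_gb: float, gpu_vram_gb: float = 80.0) -> int:
    """Smallest GPU count whose combined VRAM holds the weights alone."""
    return math.ceil(weights_gb / gpu_vram_gb)


# All experts are loaded, so total parameters (not active ones) set the footprint.
scout_int4 = weight_footprint_gb(109, 4)      # 54.5 GB
scout_bf16 = weight_footprint_gb(109, 16)     # 218.0 GB
maverick_int4 = weight_footprint_gb(400, 4)   # 200.0 GB

print(f"Scout INT4: {scout_int4:.1f} GB, headroom on one H100: {80 - scout_int4:.1f} GB")
print(f"Scout BF16: {scout_bf16:.1f} GB")
print(f"Maverick INT4: {maverick_int4:.1f} GB on {min_gpus_for_weights(maverick_int4)} GPUs")
```

For Scout at BF16 the weights-only count comes out to three 80GB cards; the fourth card recommended above is what buys KV cache headroom.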
The H100 SXM remains the baseline compute unit for serious open-source model hosting in 2026. The PCIe variant (same die, different interconnect) costs less per hour but scales less efficiently in multi-GPU configurations.
GPU Cloud Rates - June 2026
Prices verified directly from provider pricing pages. On-demand rates are single-GPU unless noted.
| Provider | H100 PCIe | H100 SXM | A100 80GB | B200 | Spot Available |
|---|---|---|---|---|---|
| RunPod | $2.89/hr | $3.29/hr | $1.49/hr | $5.89/hr | Yes |
| Vast.ai | $2.40/hr | $2.33/hr | varies | - | Yes (~$1.00/hr H100) |
| Lambda Labs | $3.29/hr | $4.29/hr (1x) | $2.79/hr (8x) | $6.99/hr (1x) | No |
| Nebius | - | $2.95/hr | - | $5.50/hr | No |
| CoreWeave | - | ~$6.16/hr | - | - | No (reserved only) |
| Lambda reserved | - | $1.89/hr (12mo) | - | - | - |
| AWS p5 | - | ~$6.88/hr | - | ~$14.24/hr | Yes (partial) |
Lambda's 8x H100 SXM cluster configuration comes out to $3.99/hr per GPU, not the single-GPU rate - so a full 8-GPU pod runs $31.92/hr. That's the relevant number for DeepSeek V3.2 deployment planning. See the GPU rental pricing page for the full 20-provider breakdown including spot pricing depth.
Two things stand out in June 2026 pricing: CoreWeave remains expensive on-demand (reserved contracts can bring that down 50-60%, but require enterprise agreements), and the B200 GPU is now available but still at a significant premium - $5.89-6.99/hr on-demand against the H100's $2.89-3.29/hr range.
Multi-GPU server configurations for running models like Maverick or DeepSeek V3.2 require rack-scale deployments. Most teams access this via cloud providers rather than owning the hardware.
The Cost Per Million Tokens Formula
Converting GPU hourly rates into per-token costs requires two inputs: the hourly rate and the throughput in tokens per second. The formula:
cost_per_MTok = (hourly_rate / (throughput_tps × 3600)) × 1,000,000
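This formula gives the cost at 100% GPU utilization. Real workloads sit lower, so divide the result by your utilization fraction - 50% is a common planning figure for workloads with predictable demand curves, and it doubles the per-token cost.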
The throughput figures below are vLLM estimates at high batch sizes (output-dominated workloads). MoE models with fewer active parameters achieve higher throughput than dense models of equivalent total size. A 17B-active MoE achieves throughput closer to a 17B dense model than to a 400B dense model.
| Model | GPU Setup | Hourly Cost | Est. Throughput | Cost/MTok (100%) | Cost/MTok (50%) |
|---|---|---|---|---|---|
| Llama 4 Scout | 1x RunPod H100 PCIe | $2.89 | ~4,500 tps | ~$0.18 | ~$0.36 |
| Llama 4 Maverick | 3x RunPod H100 PCIe | $8.67 | ~3,500 tps | ~$0.69 | ~$1.38 |
| Qwen3-235B-A22B | 4x RunPod H100 PCIe | $11.56 | ~3,000 tps | ~$1.07 | ~$2.14 |
| DeepSeek V3.2 | 8x Lambda H100 SXM | $31.92 | ~2,000 tps | ~$4.43 | ~$8.87 |
These figures apply equally to input and output tokens - a self-hosted GPU bills by the hour, not by token type. KV cache hits reduce the work needed for subsequent tokens in long contexts, so the effective cost per token may be lower for context-heavy workloads.
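The formula and the table rows above can be reproduced with a small helper. This is a minimal sketch: the function name and the `utilization` parameter are illustrative, and the throughput inputs are the same estimates used in the table, not measured values.

```python
def cost_per_mtok(hourly_rate_usd: float, throughput_tps: float,
                  utilization: float = 1.0) -> float:
    """USD per million tokens for a GPU setup billed by the hour.

    utilization is the fraction of each hour spent serving tokens (0 < u <= 1).
    """
    if throughput_tps <= 0:
        raise ValueError("throughput_tps must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = throughput_tps * 3600 * utilization
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# (model, hourly cost in USD, estimated tokens/sec) taken from the table above.
setups = [
    ("Llama 4 Scout", 2.89, 4500),
    ("Llama 4 Maverick", 8.67, 3500),
    ("Qwen3-235B-A22B", 11.56, 3000),
    ("DeepSeek V3.2", 31.92, 2000),
]

for name, rate, tps in setups:
    full = cost_per_mtok(rate, tps)
    half = cost_per_mtok(rate, tps, utilization=0.5)
    print(f"{name}: ${full:.2f}/MTok at 100%, ${half:.2f}/MTok at 50%")
```

Swapping in your own measured throughput matters more than any other input: halving tokens per second has the same effect on cost as halving utilization.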
The Scout number at 100% utilization ($0.18/MTok) looks competitive until you compare it to current API prices.
Self-Hosting vs. API: The Honest Comparison
API pricing for open-weight models has compressed dramatically in 2026. Commodity inference is cheap.
| Model | API Provider | Input /MTok | Output /MTok | Blended /MTok (1:2 input:output) |
|---|---|---|---|---|
| Llama 4 Scout | DeepInfra (cheapest) | $0.08 | $0.14 | $0.12 |
| Llama 4 Scout | Groq | $0.11 | $0.34 | $0.26 |
| Llama 4 Maverick | DeepInfra / Fireworks | $0.15 | $0.60 | $0.45 |
| Qwen3-235B-A22B | OpenRouter | $0.455 | $1.82 | $1.37 |
| DeepSeek V3.2 | DeepSeek API | $0.27 | $1.10 | $0.82 |
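The blended column is a weighted average of the input and output prices. A minimal sketch, assuming the 1:2 ratio means one input token for every two output tokens; the function name is illustrative.

```python
def blended_price(input_per_mtok: float, output_per_mtok: float,
                  input_parts: float = 1, output_parts: float = 2) -> float:
    """Weighted average USD/MTok for a given input:output token mix."""
    total_parts = input_parts + output_parts
    weighted = input_per_mtok * input_parts + output_per_mtok * output_parts
    return weighted / total_parts


print(f"Scout on DeepInfra: ${blended_price(0.08, 0.14):.2f}/MTok")
print(f"Scout on Groq: ${blended_price(0.11, 0.34):.2f}/MTok")
print(f"DeepSeek V3.2 API: ${blended_price(0.27, 1.10):.2f}/MTok")
```

If your traffic is input-heavy (long prompts, short answers), pass your own ratio: output tokens are the expensive side on every provider listed, so the blended price drops as the input share grows.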
Comparing against self-hosting:
Scout: Self-hosted at full utilization costs $0.18/MTok. The cheapest API (DeepInfra) charges $0.12/MTok blended. Self-hosting costs 50% more at perfect utilization - and most workloads aren't at perfect utilization. Against Groq ($0.26/MTok blended), self-hosting wins at 70%+ utilization.
Maverick: Self-hosted at full utilization runs $0.69/MTok. The cheapest API runs $0.45/MTok blended. Self-hosting costs 53% more even at 100% GPU utilization. There is no utilization rate at which Maverick self-hosting beats the cheapest API on raw cost.
DeepSeek V3.2: Self-hosted at full utilization costs $4.43/MTok. The DeepSeek API charges $0.82/MTok blended. The API is 5.4x cheaper at full utilization. Self-hosting DeepSeek V3.2 isn't a cost optimization - it's a sovereignty decision.
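The break-even points in these comparisons follow from one ratio. Self-hosted cost at utilization u is the full-utilization cost divided by u, so self-hosting is cheaper only when u exceeds the full-utilization cost divided by the API's blended price. A small sketch using the rounded figures above; the function name is illustrative.

```python
def break_even_utilization(self_host_full_util: float, api_blended: float) -> float:
    """Utilization fraction above which self-hosting undercuts the API.

    A result above 1.0 means no utilization level is enough.
    """
    return self_host_full_util / api_blended


comparisons = [
    ("Scout vs DeepInfra", 0.18, 0.12),
    ("Scout vs Groq", 0.18, 0.26),
    ("Maverick vs cheapest API", 0.69, 0.45),
    ("DeepSeek V3.2 vs DeepSeek API", 4.43, 0.82),
]

for label, self_cost, api_cost in comparisons:
    u = break_even_utilization(self_cost, api_cost)
    verdict = f"wins above {u:.0%} utilization" if u <= 1 else "never wins on raw cost"
    print(f"{label}: self-hosting {verdict}")
```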
The math is clear: self-hosting 2026 open-weight models almost never saves money compared to current API prices. The decision to self-host is a privacy or compliance decision, not a cost decision.
The exception to this pattern is smaller dense models (7B-13B), where self-hosting on an older A100 at $1.49/hr can produce very low per-token costs at scale. A single A100 running a 13B model at 50% utilization hits roughly $0.08-0.10/MTok, competitive with the cheapest APIs. But for the frontier open-weight models most teams are actually trying to run in 2026, the economics favor APIs.
Hidden Costs That Skew the Math
The GPU hourly rate is the visible cost. Four other cost categories tend to be underestimated.
Engineering overhead
Maintaining a production vLLM deployment requires 10-20 hours per month for monitoring, version updates, CUDA compatibility fixes, and queue tuning. At $75-150/hr for a senior ML engineer or DevOps specialist, that's $750-3,000/month in labor that doesn't show up in your GPU spend. For a single H100 Scout deployment at $2.89/hr, the GPU costs $2,081/month running around the clock - adding just 10 hours of engineer time at $100/hr adds a 48% overhead.
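The overhead figure can be checked with a short calculation. This sketch assumes a 720-hour month (30 days), which is what the $2,081 figure implies; the names are illustrative.

```python
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours


def monthly_gpu_cost(hourly_rate_usd: float, gpu_count: int = 1) -> float:
    """Always-on GPU spend for one month."""
    return hourly_rate_usd * gpu_count * HOURS_PER_MONTH


def labor_overhead_fraction(engineer_hours: float, engineer_rate_usd: float,
                            gpu_monthly_usd: float) -> float:
    """Engineering labor as a fraction of monthly GPU spend."""
    return engineer_hours * engineer_rate_usd / gpu_monthly_usd


gpu_spend = monthly_gpu_cost(2.89)
overhead = labor_overhead_fraction(10, 100, gpu_spend)
print(f"GPU: ${gpu_spend:,.0f}/month, labor overhead: {overhead:.0%}")
```

The same labor cost is a much smaller fraction of an 8-GPU deployment, which is one reason overhead hurts small self-hosted setups the most.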
Idle GPU time
Cloud APIs scale to zero. Your GPU doesn't. If your workload peaks during business hours and runs light on nights and weekends, you're paying for 24 hours while using meaningful capacity for maybe 10-12. At 40% average utilization, your effective per-token cost is 2.5x the 100% utilization figure in the table above (it doubles at 50%).
Modal's serverless model (scale-to-zero, ~$3.95/hr for H100 but billed per second) partially addresses this for inference workloads where cold starts under 15 seconds are acceptable.
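A quick way to compare an always-on GPU with per-second serverless billing is to find the number of busy hours per day at which the two cost the same. This is a simplified sketch: it assumes the serverless GPU is billed only while it is serving requests, that both options deliver the same throughput, and that cold starts and idle timeouts add no billed time. The function name is illustrative.

```python
def serverless_break_even_hours(always_on_rate_usd: float,
                                serverless_rate_usd: float) -> float:
    """Busy hours per day below which serverless billing is cheaper."""
    always_on_daily = always_on_rate_usd * 24
    return always_on_daily / serverless_rate_usd


hours = serverless_break_even_hours(2.89, 3.95)
print(f"Serverless is cheaper below about {hours:.1f} busy hours per day")

# A workload that is busy 10-12 hours a day, as described above.
for busy_hours in (10, 12):
    print(f"{busy_hours}h busy: serverless ${3.95 * busy_hours:.2f}/day "
          f"vs always-on ${2.89 * 24:.2f}/day")
```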
Storage and networking
Model weights for a Q4 Scout deployment are ~55GB. At $0.10-0.20/GB/month, persistent storage for the model runs $5-11/month - trivial. But if you're running checkpoints, extended context caches, or logging full inference traces, storage costs grow fast. Networking egress is the real trap: most GPU cloud providers charge $0.05-0.12/GB out. If your application returns long outputs to downstream services, run the math before committing.
Quantization tradeoffs
Moving from BF16 to INT4 cuts memory requirements roughly 75% - Scout from ~218GB to ~55GB - but quality degrades measurably on complex reasoning tasks. The cost calculation above uses INT4. If your use case requires BF16 precision, multiply your GPU count by 4 and recalculate.
The visible cost is the GPU rack. The hidden costs include the engineering team, egress charges, and the idle time between your peak and your trough.
When Self-Hosting Actually Wins
The privacy and compliance case is real and frequently underweighted:
Medical and legal data - Prompts containing PHI, legal strategy, or client-privileged information can't route through third-party inference providers without data processing agreements. Many teams self-host specifically to avoid managing those agreements.
Custom fine-tuned models - If you've fine-tuned Scout or Maverick on proprietary data, no API provider hosts your specific checkpoint. Self-hosting is the only option.
Predictable monthly cost - An API bill varies with usage. A RunPod reservation gives you a fixed monthly number that finance can model. The predictability has value independent of whether the fixed cost is lower than the variable alternative.
Latency-critical applications - Shared inference APIs experience queue saturation during peak hours. Reserved or dedicated GPU capacity gives you consistent throughput when your SLA requires sub-200ms first-token latency.
For cost-only decisions on publicly available model weights, the LLM API pricing comparison should be the first stop. Cheap commodity inference has made the "we'll self-host to save money" argument much harder to justify in 2026.
FAQ
What's the cheapest way to self-host a capable open-source model in 2026?
Llama 4 Scout on a single RunPod H100 PCIe at $2.89/hr. At 50% utilization that's about $0.36/MTok. For comparison, the cheapest API access to Scout-equivalent quality runs $0.08-0.15/MTok, so self-hosting carries a cost premium even at full utilization ($0.18/MTok). It only undercuts pricier providers, and only at sustained high utilization.
How many H100s do I need for DeepSeek V3.2?
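Eight is the minimum this page plans around: eight 80GB cards provide 640GB of VRAM against the ~350GB weight footprint in the requirements table. Check the precision of the checkpoint you intend to serve before sizing, since 685B parameters at a full byte each would come to roughly 685GB, which exceeds 640GB. For full 128K context with generous KV cache headroom, you need eight H200s instead. At the $31.92/hr rate for an 8x H100 pod, running around the clock costs about $280,000 per year - a six-figure annual infrastructure commitment at market GPU rates.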
Does spot pricing make self-hosting cheaper?
For batch workloads that tolerate interruption, yes. Vast.ai H100 spot pricing starts around $1.00/hr - roughly a third of the $2.89/hr on-demand rate used above, which cuts the Scout per-token cost by about 65%. For production inference serving users synchronously, spot instances are unsuitable. An instance preemption mid-inference causes user-visible failures.
What inference framework should I use?
vLLM is the default for production throughput. SGLang leads by roughly 29% throughput on H100 benchmarks for standard workloads and is worth testing if throughput is your primary constraint. Ollama works well for development environments and single-user deployments but isn't designed for multi-user production serving.
Is the B200 worth using for open-source model hosting?
Not yet for most teams. At $5.89-6.99/hr on-demand versus $2.89/hr for the H100 PCIe, the B200 costs roughly 2x to 2.4x as much per hour. Its performance advantage is real for training workloads, but for inference on current open-weight models, the cost premium isn't justified unless you're running at very large scale where the compute efficiency improvement translates directly to lower per-token cost.
How does self-hosting compare to fine-tuning costs?
Different category. Self-hosting covers ongoing inference costs. Fine-tuning is a one-time training spend on top of that. See the fine-tuning costs comparison for training cost estimates across providers.
Sources:
- RunPod GPU Pricing
- Lambda AI Cloud Pricing
- Vast.ai GPU Pricing
- Llama 4 GPU Requirements - willitrunai.com
- DeepSeek V3.2 Hardware Requirements - spheron.network
- Qwen3-235B VRAM Requirements - willitrunai.com
- Llama 4 Scout API Pricing - artificialanalysis.ai
- Llama 4 Maverick API Pricing - artificialanalysis.ai
- Qwen3-235B Pricing - OpenRouter
- DeepSeek API Pricing
- vLLM vs Ollama Throughput 2026 - tech-insider.org
- Self-Host LLM vs API Cost 2026 - devtk.ai
- GPU Cloud Pricing 2026 - spheron.network
✓ Last verified June 1, 2026
