Open Source LLM Hosting Costs - June 2026

Verified June 2026: real cost per million tokens for self-hosting Llama 4 Scout, Maverick, Qwen3-235B, and DeepSeek V3.2 - GPU requirements, cost formulas, and when cheap APIs actually win.

Cheapest: Llama 4 Scout on 1x H100 (~$0.18/MTok at full utilization)
Best Value: Llama 4 Scout for privacy-required workloads - single H100, no per-token billing
Updated monthly

TL;DR

  • Llama 4 Scout on a single RunPod H100 PCIe ($2.89/hr) costs ~$0.18/MTok at 100% utilization and ~$0.36/MTok at 50% - competitive with mid-tier API providers
  • DeepSeek V3.2 on 8x H100 costs $4.43/MTok at full use - the API at $0.27/$1.10 per MTok is cheaper by 5x unless you need privacy
  • Self-hosting wins on compliance, fine-tuned models, and predictable monthly budgets - rarely on raw cost alone against 2026 API prices

The economics of running open-source models have shifted hard over the past year. API providers now compete aggressively on price, and commodity inference pricing for models like Llama 4 Scout starts at $0.08/MTok input - cheaper than what most teams can self-host. So the question isn't just "how much does self-hosting cost?" It's also "compared to what?"

This page walks through the hardware requirements, GPU cloud costs, cost-per-million-token math, and the honest comparison to current API pricing. The numbers here use verified GPU rates from RunPod and Lambda Labs as of June 2026 and estimated throughput figures from vLLM benchmarks. Throughput varies by workload, batch size, and context length - treat these as planning baselines, not guarantees.

What Each Model Actually Needs

All four models covered here use a Mixture-of-Experts (MoE) architecture. MoE is efficient at inference time because only a fraction of the model's parameters activates per token - but the full weight set still lives in VRAM. You load all the experts, even though most sit idle for any given token. That gap between "active parameters" and "memory footprint" is what trips up hardware planning.

| Model | Total Params | Active Params | Weight VRAM (INT4 unless noted) | Min GPU Setup | Notes |
| --- | --- | --- | --- | --- | --- |
| Llama 4 Scout | 109B MoE | 17B | ~55GB | 1x H100 80GB | Comfortable fit with 25GB KV cache headroom |
| Llama 4 Maverick | 400B MoE | 17B | ~200GB | 3x H100 80GB | Tight at Q4, needs tensor parallelism |
| Qwen3-235B-A22B | 235B MoE | 22B | ~132GB | 2x H100 80GB (min) | 4x recommended for production KV cache |
| DeepSeek V3.2 | 685B MoE | 37B | ~350GB FP8 | 8x H100 80GB | Limited context headroom at FP8; 8x H200 for full 128K |

For Scout, the 1x H100 setup is truly comfortable - 55GB of weights leaves 25GB of KV cache headroom at 80GB total VRAM. At FP16 precision, Scout balloons to ~218GB and needs 4x H100s, but no production deployment should be running Scout at FP16.

Maverick is trickier. At Q4 quantization the weights run roughly 200GB, which means 3x H100 SXM (240GB total) barely fits. If you're running long contexts or large batches, you'll want 4x for headroom rather than the minimum of 3.

DeepSeek V3.2 is a data-center class deployment. There's no consumer path here. At FP8, 8x H100 SXM gives you 640GB with enough room for inference at moderate context lengths, but full 128K context runs require stepping up to H200 hardware with its 141GB per GPU.
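As a rough cross-check on the figures above, weight memory can be approximated as parameter count times bytes per parameter. The sketch below is illustrative only: the function names are invented for this example, and the estimate ignores quantization metadata, activations, and framework overhead, so real checkpoints can come out larger (the Qwen3 figure in the table sits above the naive estimate).

```python
import math


def weight_footprint_gb(total_params_billions: float, bits_per_param: int) -> float:
    """Naive weight size in GB: 1B parameters at 8 bits is about 1 GB."""
    return total_params_billions * bits_per_param / 8


def min_gpus(weights_gb: float, gpu_vram_gb: float = 80.0,
             kv_headroom_gb: float = 0.0) -> int:
    """Smallest GPU count whose combined VRAM holds weights plus KV headroom."""
    return math.ceil((weights_gb + kv_headroom_gb) / gpu_vram_gb)


# Parameter counts are taken from the table above; everything else is arithmetic.
scout_int4 = weight_footprint_gb(109, 4)      # about 55 GB
scout_fp16 = weight_footprint_gb(109, 16)     # about 218 GB
maverick_int4 = weight_footprint_gb(400, 4)   # about 200 GB

print(f"Scout INT4:    {scout_int4:.1f} GB -> {min_gpus(scout_int4)} x 80GB GPU")
print(f"Maverick INT4: {maverick_int4:.1f} GB -> {min_gpus(maverick_int4)} x 80GB GPU")
# Assumption: reserve the same 25 GB of KV cache headroom Scout has at INT4.
print(f"Scout FP16:    {scout_fp16:.1f} GB -> "
      f"{min_gpus(scout_fp16, kv_headroom_gb=25)} x 80GB GPU")
```

Treat the output as a floor for planning: the GPU count only says the weights fit, not that the deployment has enough room for your batch sizes and context lengths.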

[Image: NVIDIA H100 GPU chip, close-up view of the SXM variant used in dense inference clusters]
The H100 SXM remains the baseline compute unit for serious open-source model hosting in 2026. The PCIe variant (same die, different interconnect) costs less per hour but scales less efficiently in multi-GPU configurations. Source: commons.wikimedia.org

GPU Cloud Rates - June 2026

Prices verified directly from provider pricing pages. On-demand rates are single-GPU unless noted.

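| Provider | H100 PCIe | H100 SXM | A100 80GB | B200 | Spot Available |
| --- | --- | --- | --- | --- | --- |
| RunPod | $2.89/hr | $3.29/hr | $1.49/hr | $5.89/hr | Yes |
| Vast.ai | $2.40/hr | $2.33/hr | varies | - | Yes (~$1.00/hr H100) |
| Lambda Labs | $3.29/hr | $4.29/hr (1x) | $2.79/hr (8x) | $6.99/hr (1x) | No |
| Nebius | - | $2.95/hr | - | $5.50/hr | No |
| CoreWeave | - | ~$6.16/hr | - | - | No (reserved only) |
| Lambda reserved | - | $1.89/hr (12mo) | - | - | - |
| AWS p5 | - | ~$6.88/hr | - | ~$14.24/hr | Yes (partial) |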

Lambda's 8x H100 SXM cluster configuration comes out to $3.99/hr per GPU, not the single-GPU rate - so a full 8-GPU pod runs $31.92/hr. That's the relevant number for DeepSeek V3.2 deployment planning. See the GPU rental pricing page for the full 20-provider breakdown including spot pricing depth.

Two things stand out in June 2026 pricing: CoreWeave remains expensive on-demand (reserved contracts can bring that down 50-60%, but require enterprise agreements), and the B200 GPU is now available but still at a significant premium - $5.89-6.99/hr on-demand against the H100's $2.89-3.29/hr range.

[Image: A dense server rack installation with GPU nodes]
Multi-GPU server configurations for running models like Maverick or DeepSeek V3.2 require rack-scale deployments. Most teams access this via cloud providers rather than owning the hardware. Source: unsplash.com

The Cost Per Million Tokens Formula

Converting GPU hourly rates into per-token costs requires two inputs: the hourly rate and the throughput in tokens per second. The formula:

cost_per_MTok = (hourly_rate / (throughput_tps × 3600)) × 1,000,000

The formula assumes 100% GPU utilization. Real workloads sit lower, so divide the result by your utilization fraction; 50% is a common planning figure for workloads with predictable demand curves.

The throughput figures below are vLLM estimates at high batch sizes (output-dominated workloads). MoE models with fewer active parameters achieve higher throughput than dense models of equivalent total size. A 17B-active MoE achieves throughput closer to a 17B dense model than to a 400B dense model.

| Model | GPU Setup | Hourly Cost | Est. Throughput | Cost/MTok (100%) | Cost/MTok (50%) |
| --- | --- | --- | --- | --- | --- |
| Llama 4 Scout | 1x RunPod H100 PCIe | $2.89 | ~4,500 tps | ~$0.18 | ~$0.36 |
| Llama 4 Maverick | 3x RunPod H100 PCIe | $8.67 | ~3,500 tps | ~$0.69 | ~$1.38 |
| Qwen3-235B-A22B | 4x RunPod H100 PCIe | $11.56 | ~3,000 tps | ~$1.07 | ~$2.14 |
| DeepSeek V3.2 | 8x Lambda H100 SXM | $31.92 | ~2,000 tps | ~$4.43 | ~$8.87 |

These figures apply equally to input and output tokens - a self-hosted GPU bills by the hour, not by token type. KV cache hits speed up subsequent tokens in long contexts, so the effective cost per token may be lower for context-heavy workloads.
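The formula and table above can be reproduced with a few lines of Python. This is a minimal sketch: the function name and the `utilization` parameter are choices made for this example, and the hourly rates and throughput estimates are copied from the table rather than measured.

```python
def cost_per_mtok(hourly_rate: float, throughput_tps: float,
                  utilization: float = 1.0) -> float:
    """Dollars per million tokens for a GPU setup billed by the hour."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens actually served per billed hour at this utilization level.
    tokens_per_hour = throughput_tps * 3600 * utilization
    return hourly_rate / tokens_per_hour * 1_000_000


# (hourly cost in USD, estimated tokens/sec), taken from the table above.
setups = {
    "Llama 4 Scout": (2.89, 4500),
    "Llama 4 Maverick": (8.67, 3500),
    "Qwen3-235B-A22B": (11.56, 3000),
    "DeepSeek V3.2": (31.92, 2000),
}

for name, (rate, tps) in setups.items():
    full = cost_per_mtok(rate, tps)
    half = cost_per_mtok(rate, tps, utilization=0.5)
    print(f"{name}: ${full:.2f}/MTok at 100%, ${half:.2f}/MTok at 50%")
```

Swapping in your own measured throughput is the most useful change to make here, since the throughput estimate dominates the result.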

The Scout number at 100% use ($0.18/MTok) looks competitive until you compare it to current API prices.

Self-Hosting vs. API: The Honest Comparison

API pricing for open-weight models has compressed dramatically in 2026. Commodity inference is cheap.

| Model | API Provider | Input /MTok | Output /MTok | Blended (1:2 input:output) |
| --- | --- | --- | --- | --- |
| Llama 4 Scout | DeepInfra (cheapest) | $0.08 | $0.14 | $0.12 |
| Llama 4 Scout | Groq | $0.11 | $0.34 | $0.26 |
| Llama 4 Maverick | DeepInfra / Fireworks | $0.15 | $0.60 | $0.45 |
| Qwen3-235B-A22B | OpenRouter | $0.455 | $1.82 | $1.37 |
| DeepSeek V3.2 | DeepSeek API | $0.27 | $1.10 | $0.82 |
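The blended column is a weighted average of input and output prices. A small sketch of that calculation follows, assuming one input token for every two output tokens; the function name and the ratio parameters are illustrative, and your own traffic mix may differ a lot.

```python
def blended_price(input_per_mtok: float, output_per_mtok: float,
                  input_parts: float = 1, output_parts: float = 2) -> float:
    """Weighted average price per million tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    weighted = input_per_mtok * input_parts + output_per_mtok * output_parts
    return weighted / total_parts


# Input/output prices in USD per MTok, copied from the table above.
print(f"Scout (DeepInfra): ${blended_price(0.08, 0.14):.2f}")
print(f"Scout (Groq):      ${blended_price(0.11, 0.34):.2f}")
print(f"Maverick:          ${blended_price(0.15, 0.60):.2f}")
print(f"DeepSeek V3.2:     ${blended_price(0.27, 1.10):.2f}")
```

A retrieval-heavy workload with long prompts and short answers would use something like `input_parts=10, output_parts=1`, which pulls the blended price toward the cheaper input rate.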

Comparing against self-hosting:

Scout: Self-hosted at full utilization costs $0.18/MTok. The cheapest API (DeepInfra) charges $0.12/MTok blended, so self-hosting costs 50% more even at perfect utilization, and most workloads run well below that. Against Groq ($0.26/MTok blended), self-hosting wins above roughly 70% utilization.

Maverick: Self-hosted at full utilization runs $0.69/MTok. The cheapest API runs $0.45/MTok blended. Self-hosting costs 53% more even at 100% GPU utilization, so there is no utilization level at which Maverick self-hosting beats the cheapest API on raw cost.

DeepSeek V3.2: Self-hosted at full utilization costs $4.43/MTok. The DeepSeek API charges $0.82/MTok blended. The API is 5.4x cheaper at full utilization. Self-hosting DeepSeek V3.2 isn't a cost optimization - it's a sovereignty decision.

The math is clear: self-hosting 2026 open-weight models almost never saves money compared to current API prices. The decision to self-host is a privacy or compliance decision, not a cost decision.
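The comparisons above reduce to one question: at what utilization does the self-hosted cost per token equal the API's blended price? Because self-hosted cost scales as 1/utilization, the break-even point is the full-utilization cost divided by the API price. The helper below is a sketch with an invented name, using the rounded figures quoted in this section.

```python
from typing import Optional


def break_even_utilization(self_hosted_at_full: float,
                           api_blended: float) -> Optional[float]:
    """Utilization fraction at which self-hosting matches the API price.

    Returns None when even 100% utilization is more expensive than the API.
    """
    utilization = self_hosted_at_full / api_blended
    return utilization if utilization <= 1 else None


# (self-hosted $/MTok at 100% utilization, API blended $/MTok) from this section.
cases = {
    "Scout vs DeepInfra": (0.18, 0.12),
    "Scout vs Groq": (0.18, 0.26),
    "Maverick vs cheapest API": (0.69, 0.45),
    "DeepSeek V3.2 vs DeepSeek API": (4.43, 0.82),
}

for name, (self_hosted, api) in cases.items():
    u = break_even_utilization(self_hosted, api)
    verdict = "never breaks even" if u is None else f"breaks even at {u:.0%} utilization"
    print(f"{name}: {verdict}")
```

This ignores the hidden costs covered in the next section, so a real break-even point sits higher than what the function returns.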

The exception to this pattern is smaller dense models (7B-13B), where self-hosting on an older A100 at $1.49/hr can produce very low per-token costs at scale. A single A100 running a 13B model at 50% utilization hits roughly $0.08-0.10/MTok, competitive with the cheapest APIs. But for the frontier open-weight models most teams are actually trying to run in 2026, the economics favor APIs.

Hidden Costs That Skew the Math

The GPU hourly rate is the visible cost. Four other cost categories tend to be underestimated.

Engineering overhead

Maintaining a production vLLM deployment requires 10-20 hours per month for monitoring, version updates, CUDA compatibility fixes, and queue tuning. At $75-150/hr for a senior ML engineer or DevOps specialist, that's $750-3,000/month in labor that doesn't show up in your GPU spend. For a single H100 Scout deployment at $2.89/hr, the GPU costs $2,081/month running around the clock (720 hours) - adding just 10 hours of engineer time at $100/hr adds 48% on top of that.
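The overhead arithmetic in this paragraph is easy to rerun with your own rates. The sketch below assumes a 30-day month of continuous rental; the function names are illustrative.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, GPU rented around the clock


def monthly_gpu_cost(hourly_rate: float, gpu_count: int = 1) -> float:
    """GPU rental cost for one month of continuous use."""
    return hourly_rate * gpu_count * HOURS_PER_MONTH


def labor_overhead_ratio(engineer_hours: float, engineer_rate: float,
                         gpu_monthly: float) -> float:
    """Engineering labor as a fraction of the monthly GPU bill."""
    return engineer_hours * engineer_rate / gpu_monthly


gpu_bill = monthly_gpu_cost(2.89)  # single RunPod H100 PCIe
low = labor_overhead_ratio(10, 100, gpu_bill)
high = labor_overhead_ratio(20, 150, gpu_bill)

print(f"GPU bill: ${gpu_bill:,.0f}/month")
print(f"10 hrs at $100/hr adds {low:.0%} on top")
print(f"20 hrs at $150/hr adds {high:.0%} on top")
```

The ratio shrinks as the deployment grows, since the same maintenance hours are spread over a larger GPU bill.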

Idle GPU time

Cloud APIs scale to zero. Your GPU doesn't. If your workload peaks during business hours and runs light on nights and weekends, you're paying for 24 hours while using meaningful capacity for maybe 10-12. Per-token cost scales with the inverse of utilization: at 40% average utilization, your effective per-token cost is 2.5x the 100% figure in the table above (it doubles at 50%).

Modal's serverless model (scale-to-zero, ~$3.95/hr for H100 but billed per second) partially addresses this for inference workloads where cold starts under 15 seconds are acceptable.
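To see when scale-to-zero pays off, compare a full day of an always-on GPU against serverless billing for busy hours only. This is a simplified sketch: it assumes the serverless GPU is billed only while serving, delivers the same throughput, and has no cold-start cost, none of which is guaranteed in practice. Function names are illustrative.

```python
def daily_cost_always_on(hourly_rate: float) -> float:
    """An always-on GPU bills for all 24 hours regardless of load."""
    return hourly_rate * 24


def daily_cost_serverless(hourly_rate: float, active_hours: float) -> float:
    """Per-second billing approximated as rate times hours actually busy."""
    return hourly_rate * active_hours


def break_even_active_hours(always_on_rate: float, serverless_rate: float) -> float:
    """Busy hours per day above which the always-on GPU becomes cheaper."""
    return 24 * always_on_rate / serverless_rate


ALWAYS_ON = 2.89   # RunPod H100 PCIe, $/hr
SERVERLESS = 3.95  # Modal H100, approximate $/hr

print(f"Always-on:        ${daily_cost_always_on(ALWAYS_ON):.2f}/day")
for hours in (10, 12):
    print(f"Serverless, {hours}h:  ${daily_cost_serverless(SERVERLESS, hours):.2f}/day")
print(f"Break-even: {break_even_active_hours(ALWAYS_ON, SERVERLESS):.1f} busy hours/day")
```

Under these assumptions, a workload that is busy 10-12 hours a day comes out cheaper on serverless despite the higher hourly rate.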

Storage and networking

Model weights for a Q4 Scout deployment are ~55GB. At $0.10-0.20/GB/month, persistent storage for the model runs $5-11/month - trivial. But if you're running checkpoints, extended context caches, or logging full inference traces, storage costs grow fast. Networking egress is the real trap: most GPU cloud providers charge $0.05-0.12/GB out. If your application returns long outputs to downstream services, run the math before committing.
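Running the math can look like the sketch below. The storage figures come from the paragraph above; the 2,000 GB/month egress volume is a made-up placeholder to replace with your own traffic estimate, and the function names are illustrative.

```python
def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Persistent storage cost for one month."""
    return size_gb * price_per_gb_month


def monthly_egress_cost(egress_gb: float, price_per_gb: float) -> float:
    """Outbound data transfer cost for one month."""
    return egress_gb * price_per_gb


WEIGHTS_GB = 55            # Q4 Scout weights, from the text
EGRESS_GB_PER_MONTH = 2000  # hypothetical volume, not from the source

storage_low = monthly_storage_cost(WEIGHTS_GB, 0.10)
storage_high = monthly_storage_cost(WEIGHTS_GB, 0.20)
egress_low = monthly_egress_cost(EGRESS_GB_PER_MONTH, 0.05)
egress_high = monthly_egress_cost(EGRESS_GB_PER_MONTH, 0.12)

print(f"Weights storage: ${storage_low:.2f} to ${storage_high:.2f}/month")
print(f"Egress at {EGRESS_GB_PER_MONTH} GB: ${egress_low:.2f} to ${egress_high:.2f}/month")
```

Plain-text completions are small, so egress usually only matters when you also ship logs, traces, or other bulk data out of the provider's network.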

Quantization tradeoffs

Moving from BF16 to INT4 cuts memory requirements roughly 75% - Scout from ~218GB to ~55GB - but quality degrades measurably on complex reasoning tasks. The cost calculation above uses INT4. If your use case requires BF16 precision, multiply your GPU count by 4 and recalculate.

[Image: Modern data center corridor with illuminated server racks]
The visible cost is the GPU rack. The hidden costs include the engineering team, egress charges, and the idle time between your peak and your trough. Source: pexels.com

When Self-Hosting Actually Wins

The privacy and compliance case is real and frequently underweighted:

  1. Medical and legal data - Prompts containing PHI, legal strategy, or client-privileged information can't route through third-party inference providers without data processing agreements. Many teams self-host specifically to avoid managing those agreements.

  2. Custom fine-tuned models - If you've fine-tuned Scout or Maverick on proprietary data, no API provider hosts your specific checkpoint. Self-hosting is the only option.

  3. Predictable monthly cost - An API bill varies with usage. A RunPod reservation gives you a fixed monthly number that finance can model. The predictability has value independent of whether the fixed cost is lower than the variable alternative.

  4. Latency-critical applications - Shared inference APIs experience queue saturation during peak hours. Reserved or dedicated GPU capacity gives you consistent throughput when your SLA requires sub-200ms first-token latency.

For cost-only decisions on publicly available model weights, the LLM API pricing comparison should be the first stop. Cheap commodity inference has made the "we'll self-host to save money" argument much harder to justify in 2026.

FAQ

What's the cheapest way to self-host a capable open-source model in 2026?

Llama 4 Scout on a single RunPod H100 PCIe at $2.89/hr. At 50% utilization that's about $0.36/MTok. For comparison, the cheapest API access to Scout (DeepInfra) runs $0.08/MTok input and $0.14/MTok output, so self-hosting carries a cost premium even with sustained high utilization.

How many H100s do I need for DeepSeek V3.2?

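Eight at minimum, and only with the ~350GB weight footprint listed in the requirements table, which fits within the 640GB that eight 80GB cards provide. At a full byte per parameter, 685B parameters would take roughly 685GB, more than eight H100s hold, so check the footprint of the exact checkpoint you plan to serve. For full 128K context with generous KV cache headroom, you need eight H200s instead. At $31.92/hr for an 8-GPU pod running continuously, this is a six-figure-per-year infrastructure commitment (roughly $280,000).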

Does spot pricing make self-hosting cheaper?

For batch workloads that tolerate interruption, yes. Vast.ai H100 spot pricing starts around $1.00/hr, about a third of RunPod's $2.89/hr on-demand rate, which cuts the Scout per-token cost by roughly 65% (to about $0.06/MTok at full utilization). For production inference serving users synchronously, spot instances are unsuitable: a preemption mid-inference causes user-visible failures.

What inference framework should I use?

vLLM is the default for production throughput. SGLang leads by roughly 29% throughput on H100 benchmarks for standard workloads and is worth testing if throughput is your primary constraint. Ollama works well for development environments and single-user deployments but isn't designed for multi-user production serving.

Is the B200 worth using for open-source model hosting?

Not yet for most teams. At $5.89-6.99/hr on-demand versus $2.89/hr for the H100 PCIe, the B200 costs roughly 2x to 2.4x as much per hour. Its performance advantage is real for training workloads, but for inference on current open-weight models, the cost premium isn't justified unless you're running at very large scale where the compute efficiency improvement translates directly to lower per-token cost.

How does self-hosting compare to fine-tuning costs?

Different category. Self-hosting covers ongoing inference costs. Fine-tuning is a one-time training spend on top of that. See the fine-tuning costs comparison for training cost estimates across providers.



✓ Last verified June 1, 2026

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.