Open-Source LLM Hosting Costs - March 2026
Real costs of self-hosting Llama, Mistral, Qwen, and DeepSeek models on cloud GPUs vs. API access - with break-even analysis and hardware pricing.

TL;DR
- Self-hosting a 70B model on two cloud A100s costs ~$1,700/month (RunPod) to ~$5,900/month (AWS) and delivers ~$0.44/MTok at sustained utilization
- Break-even vs. API: roughly 50-70M+ tokens/day for a 70B model, 200M+/day for 8B models
- Cheapest cloud GPUs: RunPod A100 PCIe at $1.19/hr, Lambda H100 at $2.99/hr
- For most teams, API providers (Groq at $0.11/MTok for Llama 4 Scout, DeepSeek at $0.28/MTok) are cheaper than self-hosting until you hit serious volume
When Self-Hosting Beats API Pricing
The math is straightforward but the variables aren't. Self-hosting has a fixed cost (GPU rental) and near-zero marginal cost per token. APIs have zero fixed cost and a constant per-token rate. The crossover depends on your model size, throughput needs, and utilization rate.
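That crossover can be sketched in a few lines. This is a rental-only view: it treats self-hosted marginal cost as zero and ignores the engineering and egress overheads covered later, so it lands near (not exactly on) the break-even figures quoted in this article. The rates plugged in are the RunPod and Groq prices used throughout.

```python
def break_even_tokens_per_day(gpu_cost_per_hr: float, api_cost_per_mtok: float) -> float:
    """Daily token volume at which 24/7 GPU rental equals API spend.

    Rental-only model: self-hosted marginal cost per token is treated as
    zero, so break-even is simply daily rental divided by the API rate.
    """
    daily_gpu_cost = gpu_cost_per_hr * 24
    return daily_gpu_cost / api_cost_per_mtok * 1_000_000

# 2x A100 on RunPod ($2.38/hr) vs. Groq's blended Llama 3.3 70B rate (~$0.69/MTok)
volume = break_even_tokens_per_day(2.38, 0.69)
print(f"{volume / 1e6:.0f}M tokens/day")  # ~83M tokens/day
```

Below that volume, every rented GPU-hour costs more than the equivalent API calls would have; above it, the fixed rental amortizes in your favor.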
For an 8B model on a single RTX 4090 ($0.44/hr on RunPod), you'll process roughly 5,000 tokens/second. That's ~13 billion tokens/month for about $320. The equivalent via Groq's Llama 3.1 8B API ($0.05/$0.08 per MTok) would cost ~$845 - so self-hosting saves 62% at full utilization. But if your GPU sits idle 70% of the time, the API wins.
For 70B models, the break-even volume is roughly 50-70 million tokens per day against Groq's Llama 3.3 70B pricing ($0.59/$0.79 per MTok). Below that, just use the API. For a detailed guide on getting started, see our how to run open-source LLMs locally.
Cloud GPU Pricing
Prices per GPU-hour as of March 2026. Sorted by price within each tier.
Data Center GPUs (For Production Workloads)
| GPU | VRAM | RunPod | Lambda | Jarvislabs | AWS (On-Demand) |
|---|---|---|---|---|---|
| A100 PCIe 80GB | 80 GB | $1.19/hr | $2.06/hr | $1.49/hr | ~$3.67/hr |
| A100 SXM 80GB | 80 GB | $1.39/hr | $2.06/hr | $1.49/hr | ~$4.10/hr |
| H100 SXM 80GB | 80 GB | $2.69/hr | $2.99/hr | N/A | ~$4.35/hr |
| H200 SXM 141GB | 141 GB | Coming soon | N/A | N/A | N/A |
Consumer/Prosumer GPUs (For Dev/Small-Scale)
| GPU | VRAM | RunPod | Vast.ai (Spot) | Buy New |
|---|---|---|---|---|
| RTX 3090 | 24 GB | $0.22/hr | $0.15/hr | ~$800 |
| RTX 4090 | 24 GB | $0.44/hr | $0.30/hr | ~$1,800 |
| RTX A6000 | 48 GB | $0.79/hr | $0.50/hr | ~$3,500 |
| RTX 5090 | 32 GB | N/A | N/A | ~$2,000 |
Hyperscalers (AWS, GCP, Azure) charge 40-100% more than specialist providers like RunPod, Lambda, and Jarvislabs. The premium buys you enterprise SLAs, managed Kubernetes, and compliance certifications. If you don't need those, you're overpaying.
Self-Hosted Cost Per Token
Estimated effective cost per million tokens at the listed sustained throughput (i.e., near-continuous utilization), assuming a vLLM/TGI serving framework. These are estimates - your actual throughput will vary with batch size, context length, and quantization.
| Model | Parameters | GPU Required | GPU Cost/hr | Throughput | Est. Cost/MTok |
|---|---|---|---|---|---|
| Llama 3.1 8B (Q4) | 8B | 1x RTX 4090 | $0.44 | ~5,000 tok/s | ~$0.024 |
| Mistral Small 3.1 | 24B | 1x A100 80GB | $1.19 | ~2,500 tok/s | ~$0.13 |
| Qwen3 32B (Q4) | 32B | 1x A100 80GB | $1.19 | ~2,000 tok/s | ~$0.17 |
| Llama 4 Scout (Q4) | 109B MoE | 2x A100 80GB | $2.38 | ~1,800 tok/s | ~$0.37 |
| Llama 3.3 70B (Q4) | 70B | 2x A100 80GB | $2.38 | ~1,500 tok/s | ~$0.44 |
| Llama 4 Maverick | 400B MoE | 4x A100 80GB | $4.76 | ~1,000 tok/s | ~$1.32 |
| DeepSeek V3.2 | 685B MoE | 8x H100 80GB | $21.52 | ~800 tok/s | ~$7.47 |
The takeaway: small models (8B-32B) are where self-hosting shines. At the 70B+ level, managed inference providers like Groq ($0.59/MTok for Llama 3.3 70B) are hard to beat because they've optimized hardware specifically for throughput. For the largest models, self-hosting is impractical for all but the largest organizations.
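The per-token figures in the table follow from a simple throughput calculation, sketched here with the Llama 3.3 70B row's numbers:

```python
def cost_per_mtok(gpu_cost_per_hr: float, tokens_per_sec: float) -> float:
    """GPU rental cost per million tokens at a sustained generation rate."""
    tokens_per_hr = tokens_per_sec * 3600
    return gpu_cost_per_hr / tokens_per_hr * 1_000_000

# Llama 3.3 70B (Q4) on 2x A100 80GB: $2.38/hr at ~1,500 tok/s
print(f"${cost_per_mtok(2.38, 1500):.2f}/MTok")  # $0.44/MTok
```

Swap in any row's GPU price and throughput to reproduce its estimate - e.g. the RTX 4090 at $0.44/hr and ~5,000 tok/s yields ~$0.024/MTok for the 8B model.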
API vs. Self-Hosted Break-Even
| Model | API Provider | API Cost/MTok | Self-Hosted Cost/MTok | Break-Even Volume/Day |
|---|---|---|---|---|
| Llama 3.1 8B | Groq ($0.05/$0.08) | ~$0.065 | ~$0.024 | ~200M tokens |
| Llama 4 Scout | Groq ($0.11/$0.34) | ~$0.225 | ~$0.37 | Never (API cheaper) |
| Llama 3.3 70B | Groq ($0.59/$0.79) | ~$0.69 | ~$0.44 | ~50M tokens |
| DeepSeek V3.2 | DeepSeek ($0.28/$0.42) | ~$0.35 | ~$7.47 | Never (API far cheaper) |
A critical point: DeepSeek V3.2 and Llama 4 Scout are so aggressively priced that self-hosting them doesn't make economic sense unless you have privacy constraints or need guaranteed latency. For more on how open-source models compare, see our open-source LLM leaderboard.
Local Hardware Options
Not everyone needs cloud. Running models on your own hardware eliminates recurring rental costs after the initial purchase - you pay only for electricity.
| Setup | Cost | Models It Can Run | Use Case |
|---|---|---|---|
| RTX 4090 (24 GB) | ~$1,800 | 8B-13B (full), 30B (Q4) | Solo developer, prototyping |
| RTX 5090 (32 GB) | ~$2,000 | 8B-30B (full), 70B (Q3) | Solo dev, slightly larger models |
| 2x RTX 4090 | ~$3,600 | 70B (Q4), MoE models | Small team, production testing |
| Mac Studio M4 Ultra (192 GB) | ~$6,000 | 70B (full), 120B (Q4) | Quiet, power-efficient inference |
| NVIDIA DGX Spark | ~$3,000 | Up to 200B (Q4) | Compact AI workstation |
For our hands-on experience with the DGX Spark, see the NVIDIA DGX Spark review. The Mac Studio M4 Ultra deserves special attention - its 192 GB unified memory lets it run 70B models at full precision, something no consumer GPU can match. The tradeoff is lower throughput (~30-50 tok/s on 70B) compared to a properly configured vLLM setup on A100s.
For gaming GPU setups running Ollama or vLLM, see the home GPU LLM leaderboard and our best local LLM tools guide.
Hidden Costs of Self-Hosting
Engineering Time
Setting up vLLM, configuring load balancing, managing GPU failures, and tuning batch sizes isn't free. Budget 20-40 hours of engineer time for initial setup and 5-10 hours per month for ongoing maintenance. At $75-150/hour fully loaded engineering cost, that's $2,000-6,000 in the first month alone.
Idle GPU Costs
Cloud GPUs charge by the hour whether you're using them or not. If your workload is bursty (heavy during business hours, quiet at night), you'll pay for idle capacity. Spot instances help (30-70% cheaper) but can be preempted. Auto-scaling with Kubernetes helps but adds complexity.
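The idle-capacity penalty is easy to quantify: because you're billed around the clock, effective cost per token scales inversely with utilization. A sketch using the 70B figures from earlier in this article:

```python
def effective_cost_per_mtok(gpu_cost_per_hr: float, tokens_per_sec: float,
                            utilization: float) -> float:
    """Effective $/MTok when the GPU serves traffic only a fraction of the
    time but is billed 24/7."""
    busy_tokens_per_hr = tokens_per_sec * 3600 * utilization
    return gpu_cost_per_hr / busy_tokens_per_hr * 1_000_000

# 2x A100 running a 70B model ($2.38/hr, ~1,500 tok/s at peak)
for u in (1.0, 0.5, 0.3):
    print(f"{u:.0%} busy -> ${effective_cost_per_mtok(2.38, 1500, u):.2f}/MTok")
```

At 30% utilization the effective rate climbs to ~$1.47/MTok - more than double Groq's $0.69 blended rate for the same model, which is why bursty workloads usually belong on an API.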
Model Updates
When a new model version drops, you need to download weights, re-test, and potentially reconfigure serving parameters. API providers handle this transparently. With self-hosting, every model update is a deployment event.
Networking and Egress
Cloud providers charge $0.05-0.12 per GB for data egress. If your application sends large contexts to the model, egress costs can add 10-20% to your effective GPU cost. Lambda and RunPod are more generous than AWS here.
FAQ
Is self-hosting an LLM cheaper than using an API?
It depends completely on volume. Below 50M tokens/day, APIs are almost always cheaper. Above 100M tokens/day with consistent utilization, self-hosting on A100s can save 40-60%. The break-even shifts depending on model size and the API provider you're comparing against.
What's the cheapest way to run a 70B model?
Two A100 80GB GPUs on RunPod at $2.38/hr total (~$1,714/month). With vLLM and Q4 quantization, you'll get ~1,500 tokens/second. For lighter usage, try Groq's API at $0.59/$0.79 per MTok - no infrastructure required.
Can I run LLMs on a consumer GPU?
Yes. An RTX 4090 (24 GB) handles 8B-13B models comfortably and can run quantized 30B models. The RTX 5090 (32 GB) extends this to quantized 70B at reduced context. See our home GPU LLM leaderboard for benchmarks.
Which inference framework should I use?
vLLM for production serving (continuous batching, PagedAttention). Ollama for local development and prototyping. TensorRT-LLM for maximum throughput on NVIDIA hardware. SGLang for advanced prompt programming.
How do managed providers like Groq compare?
Groq's custom LPU hardware delivers 394-840 tokens/second at prices competitive with self-hosting. For Llama 4 Scout, Groq charges $0.11/$0.34 per MTok - cheaper than renting your own GPUs for most workloads. The tradeoff is less control over model configuration and quantization.
✓ Last verified March 11, 2026
