Fine-Tuning Costs Comparison - Train Your Own AI

May 2026: Together AI added Llama 4 and DeepSeek fine-tuning, Fireworks raised on-demand deployment prices by $1/hr, and H100 rentals fell to under $2.40/hr.

Cheapest: Together AI (Llama 3.1 8B LoRA)
Best value: Together AI (Llama 4 Scout LoRA)
Updated weekly

TL;DR

  • Together AI now fine-tunes Llama 4, DeepSeek V3, Qwen 3.5, GLM, and Kimi model families at model-specific prices from $3-40/1M
  • Fireworks raised on-demand GPU deployment prices by $1/hr on May 1, 2026: A100 is now $2.90/hr, H100 $6.00/hr, B200 $9.00/hr - training rates unchanged
  • H100 cloud rental dropped to $1.50-$2.39/hr on RunPod and $1.55-$1.87/hr on Vast.ai - the self-hosted break-even point is now closer to 35M tokens
  • Together AI DPO starts at $0.54/1M LoRA for models up to 16B - about 12% above their SFT rate, not the steep 2x premium often cited

The Bottom Line

If you want the cheapest API fine-tuning, Together AI's $0.48/1M LoRA rate on models up to 16B still leads the field. For teams needing something more capable, Llama 4 Scout is now available for fine-tuning on Together AI at $3-8/1M - a 35B model that didn't exist in this slot three months ago.

Fireworks' May 1 deployment price hike shifts the self-hosted versus API calculus. Training costs didn't change, but serving your fine-tuned model on Fireworks just got $1/hr more expensive across all GPU tiers. Combined with lower H100 rental rates, the break-even for moving training in-house now sits around 35M tokens rather than the 50M threshold that held in early 2026.

API Fine-Tuning Pricing Table

All prices in USD per million tokens. Training cost covers tokens processed during fine-tuning (dataset size multiplied by epochs). Inference costs apply when you call your fine-tuned model afterward. Prices verified against official documentation on May 4, 2026.
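The billing rule above reduces to simple arithmetic. A minimal sketch (the helper name is ours, not any provider's API):

```python
def training_cost(dataset_tokens: int, epochs: int, price_per_million: float) -> float:
    """API fine-tuning bill in USD: tokens processed = dataset tokens x epochs,
    billed at the provider's per-1M training rate."""
    return dataset_tokens * epochs / 1_000_000 * price_per_million

# 2M-token dataset, 3 epochs, Together AI Llama 3.1 8B LoRA at $0.48/1M
print(round(training_cost(2_000_000, 3, 0.48), 2))  # 2.88
```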

| Provider | Model | Training (/1M) | Inference In (/1M) | Inference Out (/1M) | Min Examples | Method |
|---|---|---|---|---|---|---|
| OpenAI | GPT-4.1 Nano | $0.20 | $0.20 | $0.80 | 10 | SFT |
| OpenAI | GPT-4o-mini | $0.30 | $0.30 | $1.20 | 10 | SFT |
| Together AI | Llama 3.1 8B | $0.48 | $0.18 | $0.18 | 1 | LoRA |
| Together AI | Mistral 7B | $0.48 | $0.20 | $0.20 | 1 | LoRA |
| Fireworks | ≤16B models | $0.50 | base rate | base rate | 1 | LoRA |
| Together AI | Llama 3.1 8B | $0.54 | $0.18 | $0.18 | 1 | Full |
| OpenAI | GPT-4.1 Mini | $0.80 | $0.80 | $3.20 | 10 | SFT |
| Fireworks | ≤16B models | $1.00 | base rate | base rate | 1 | Full |
| Mistral | Mistral 7B | $1.00 | $0.25 | $0.25 | 1 | SFT |
| Google | Gemini 2.0 Flash Lite | $1.00 | $0.075 | $0.30 | 10 | SFT |
| Together AI | Llama 3.1 70B | $1.50 | $0.88 | $0.88 | 1 | LoRA |
| Together AI | Llama 3.1 70B | $1.65 | $0.88 | $0.88 | 1 | Full |
| Mistral | Mistral Small | $2.00 | $0.20 | $0.60 | 1 | SFT |
| Together AI | 70-100B models | $2.90 | varies | varies | 1 | LoRA |
| Cohere | Command R | $3.00 | $0.30 | $1.20 | 2 | SFT |
| OpenAI | GPT-4.1 | $3.00 | $3.00 | $12.00 | 10 | SFT |
| Fireworks | 16.1B-80B models | $3.00 | base rate | base rate | 1 | LoRA |
| Google | Gemini 2.0 Flash | $3.00 | $0.15 | $0.60 | 10 | SFT |
| Together AI | Llama 4 Scout | $3-8 | varies | varies | 1 | SFT/LoRA |
| Together AI | Qwen 3.5 models | $3-9 | varies | varies | 1 | SFT/LoRA |
| Fireworks | 80.1B-300B models | $6.00 | base rate | base rate | 1 | LoRA |
| OpenAI | GPT-3.5 Turbo | $8.00 | $3.00 | $6.00 | 10 | SFT |
| Together AI | Llama 4 Maverick | $8 | varies | varies | 1 | SFT/LoRA |
| Together AI | GLM 5.x models | $9-40 | varies | varies | 1 | SFT |
| Fireworks | >300B models | $10.00 | base rate | base rate | 1 | LoRA |
| Together AI | DeepSeek V3/R1 | $10-25 | varies | varies | 1 | SFT |
| Together AI | Kimi K2 | $15 | varies | varies | 1 | SFT |
| OpenAI | GPT-4o | $25.00 | $3.75 | $15.00 | 10 | SFT |

Fine-tuned inference on Fireworks and Together AI uses the base model rate for the specific model you trained. OpenAI charges a 50% premium on fine-tuned inference versus base rates. Google doesn't apply any surcharge - tuned model inference stays at base model pricing. For base model inference pricing without fine-tuning, see our LLM API pricing comparison.

What Stands Out

OpenAI's pricing spread remains enormous. GPT-4.1 Nano at $0.20/1M for training is 125x cheaper than GPT-4o at $25/1M. If you're still on GPT-4o fine-tuning for any task, migrating to GPT-4.1 is the easiest cost reduction available.

Fireworks expanding to models above 300B is a meaningful new tier. Their >300B LoRA pricing at $10/1M covers architectures like DeepSeek V3's 685B MoE - a capability bracket that was effectively inaccessible through managed fine-tuning APIs six months ago. You're still paying Fireworks-tier prices (more expensive than Together AI at equivalent sizes), but the option now exists.

Q2 2026: Frontier Models Join the Fine-Tuning Market

The most significant change since March isn't price movement on existing models - it's which models you can fine-tune at all.

Together AI opened fine-tuning access to several frontier families during Q2. Llama 4 Maverick is available at roughly $8/1M SFT, making it one of the few ways to fine-tune a 128-expert mixture-of-experts model without managing full-scale training infrastructure. DeepSeek V3 fine-tuning at $10-25/1M is relevant for organizations where DeepSeek's baseline performance on code or Chinese-language tasks is strong. Qwen 3.5 variants (multiple sizes from 7B to 122B) start at $3/1M.

These aren't necessarily the absolute newest models. They're frontier in the practical sense: they were unavailable through any managed fine-tuning API three months ago. If your use case benefits from an architecture outside the Llama 3.1 family, the options got substantially wider in Q2.

One thing to account for: all frontier model jobs on Together AI carry minimums ranging from $6-60 per run. Exploratory experiments are more expensive than the per-token rate implies. Budget for 3-5 runs before settling on a configuration that works.
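Per-job minimums change the math for small pilots. A sketch of the effective cost (the $60 minimum and run count below are illustrative figures from this section, not a provider API):

```python
def run_cost(tokens: int, rate_per_million: float, minimum: float) -> float:
    """Cost of one fine-tuning run, with the provider's per-job minimum applied."""
    return max(tokens * rate_per_million / 1_000_000, minimum)

# A 500K-token pilot on a frontier model at $8/1M with a $60 per-run minimum:
# the per-token math says $4, but the minimum dominates.
pilot = run_cost(500_000, 8.0, 60.0)
budget = 4 * pilot  # plan for ~3-5 exploratory runs
print(pilot, budget)  # 60.0 240.0
```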

For teams deciding between fine-tuning and retrieval-augmented generation for domain adaptation, our RAG vs fine-tuning guide covers where each approach makes sense.

API Fine-Tuning vs Self-Hosted Training

The API approach bundles infrastructure, tooling, and hosting into one per-token price. Self-hosted training means renting GPUs and running the training job yourself using frameworks like Hugging Face TRL, Axolotl, or LLaMA-Factory.

[Image: GPU server rack in a data center. Self-hosted fine-tuning requires renting cloud GPUs or on-premise hardware. Source: pexels.com]

GPU Cloud Pricing for Training

H100 rental rates dropped across the board since Q1 2026. The table below reflects May 2026 pricing:

| Provider | GPU | VRAM | Hourly Rate | Best For |
|---|---|---|---|---|
| Vast.ai | H100 SXM | 80GB | $1.55-$1.87 | Budget spot pricing |
| RunPod | H100 SXM | 80GB | $1.50-$2.39 | On-demand training |
| Jarvislabs | H100 SXM | 80GB | $2.69 | Per-minute billing |
| Lambda | H100 SXM | 80GB | $2.99 | Managed environment |
| RunPod | A100 | 80GB | $1.64 | Budget LoRA training |
| Vast.ai | A100 | 80GB | $1.00-$1.80 | Spot pricing |
| Lambda | B200 SXM | 192GB | $6.69 | Massive models |

For a complete comparison of cloud GPU providers, see our GPU rental pricing breakdown.

When Self-Hosted Wins

The crossover point moved. With H100s on RunPod at $1.50/hr, fine-tuning a 7B model on 35M tokens costs about $22 in GPU time versus $16.80 through Together AI's API. The gap has closed enough that the break-even now sits around 35M processed tokens for a 7B model - down from roughly 50M in early 2026.
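The comparison above can be sketched as two cost curves. The training throughput figure is an assumption chosen to match the $22 estimate in this section; real throughput depends on sequence length, batch size, and framework:

```python
def self_hosted_cost(tokens: int, tokens_per_hour: float, gpu_rate: float) -> float:
    """GPU rental cost to process `tokens` at an assumed training throughput."""
    return tokens / tokens_per_hour * gpu_rate

def api_cost(tokens: int, rate_per_million: float) -> float:
    """Managed API cost at a per-1M-token training rate."""
    return tokens * rate_per_million / 1_000_000

# Assumption: a 7B LoRA job pushes ~2.4M tokens/hour on one H100 at $1.50/hr.
tokens = 35_000_000
print(round(self_hosted_cost(tokens, 2_400_000, 1.50), 2))  # ~$21.88 in GPU time
print(round(api_cost(tokens, 0.48), 2))                     # $16.80 on Together AI
```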

For 70B LoRA jobs, self-hosted typically wins above 20M tokens. Renting an 8xA100 cluster on Vast.ai at $8-10/hr for 3-4 hours usually undercuts Together AI's API by 30-50%, depending on spot pricing availability.

When APIs Win

API fine-tuning wins on setup time and serving convenience. Zero CUDA configuration, no distributed training debugging, and the fine-tuned model is available at the same endpoint once training finishes. For teams running occasional jobs - monthly or quarterly retraining - the engineering overhead is the real cost, not the per-token rate.

LoRA vs Full Fine-Tuning Cost Breakdown

Understanding the LoRA versus full fine-tuning cost gap matters because it's often the biggest lever available. For more technical background on these techniques, read our fine-tuning and distillation guide.

[Image: financial analysis dashboard. The real costs across methods and model sizes show where the savings live. Source: pexels.com]

Cost by Model Size and Method

| Model Size | LoRA (Together AI) | Full (Together AI) | Savings | Quality Retention |
|---|---|---|---|---|
| Up to 16B | $0.48/1M | $0.54/1M | 11% | 90-95% |
| 17-69B | $1.50/1M | $1.65/1M | 9% | 85-95% |
| 70-100B | $2.90/1M | $3.20/1M | 9% | 80-95% |

Together AI's LoRA-to-full pricing gap is modest at 9-11% on training costs. The real savings with LoRA show up in self-hosted scenarios, where GPU memory requirements drop notably. A 7B model needs a single 24GB GPU for LoRA versus four 80GB GPUs for full fine-tuning - a hardware gap that translates to 4-10x cost reduction on self-hosted training.
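The memory gap can be roughed out with a common heuristic: full fine-tuning needs on the order of 16 bytes per parameter (bf16 weights and gradients plus fp32 Adam states), while LoRA only holds the frozen bf16 base at ~2 bytes per parameter plus a small adapter. These multipliers are back-of-envelope assumptions and exclude activation memory, which pushes real requirements higher:

```python
def full_ft_vram_gb(params_b: float) -> float:
    """Rough VRAM (GB) for full fine-tuning at ~16 bytes/param,
    activations and KV buffers excluded."""
    return params_b * 16

def lora_vram_gb(params_b: float) -> float:
    """Rough VRAM (GB) for LoRA: frozen bf16 base (~2 bytes/param)
    plus a couple of GB for the adapter and its optimizer states."""
    return params_b * 2 + 2

print(full_ft_vram_gb(7))  # 112 GB -> multiple 80GB GPUs once activations are added
print(lora_vram_gb(7))     # 16 GB -> fits a single 24GB card
```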

DPO Training Costs

Direct Preference Optimization (DPO) - used for aligning models with human preferences - costs slightly more than standard supervised fine-tuning. Current Together AI DPO rates:

  • Up to 16B: $0.54/1M (LoRA), $1.35/1M (Full)
  • 17-69B: $1.65/1M (LoRA), $4.12/1M (Full)
  • 70-100B: $3.20/1M (LoRA), $8.00/1M (Full)

Fireworks also offers DPO, though at much higher rates across all size tiers:

  • ≤16B: $1.00/1M (LoRA), $2.00/1M (Full)
  • 16.1B-80B: $6.00/1M (LoRA), $12.00/1M (Full)
  • 80.1B-300B: $12.00/1M (LoRA), $24.00/1M (Full)
  • >300B: $20.00/1M (LoRA), $40.00/1M (Full)

DPO makes sense for explicit preference alignment tasks - safety tuning, ranking outputs, or adjusting tone when simple instruction following falls short. For style matching or domain adaptation, SFT is sufficient and substantially cheaper.

Hidden Costs Most Guides Skip

Data Preparation

Budget 10-15% of your total fine-tuning spend on data preparation. Cleaning, formatting, deduplication, and quality filtering take real engineering hours. OpenAI requires a minimum of 10 training examples, but effective fine-tuning normally needs 500-10,000 high-quality examples. Producing that dataset - especially for specialized domains - often costs more than the training itself.

Failed Experiments

Plan for 3-5 training runs before landing on a working configuration. That means your actual training cost is 3-5x the single-run estimate. Hyperparameter sweeps over learning rate, LoRA rank, and epoch count add up fast. Together AI and Fireworks charge per token processed, so every abandoned run still hits your bill.

Serving Cost Increases

If you're launching fine-tuned models on Fireworks, the May 1 price change matters. On-demand A100 deployment is now $2.90/hr (up from $1.90/hr), H100 is $6.00/hr (up from $5.00/hr), and B200 is $9.00/hr (up from $8.00/hr). Fine-tuning training rates didn't change - only hosted serving costs went up. For high-traffic fine-tuned models running on multiple GPUs, that $1/hr increase per GPU compounds quickly.

Inference Cost Multipliers

Fine-tuned models on OpenAI cost more to run than base models. GPT-4.1 base inference is $2/1M input, but fine-tuned GPT-4.1 inference runs $3/1M - a 50% premium. Google doesn't apply this markup: tuned Gemini 2.0 Flash inference stays at $0.15/$0.60, identical to the base model. For high-volume production workloads, that ongoing inference premium can exceed the one-time training cost in a few weeks.

Storage and Checkpoints

Self-hosted training produces checkpoint files at every evaluation step. A 70B model checkpoint occupies roughly 140GB uncompressed. Five checkpoints from a single run means 700GB of storage. Cloud storage at $0.023/GB/month (AWS S3 standard) adds $16/month per run's worth of checkpoints - small on its own, but building up across experiments.
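The checkpoint figures above multiply out as follows (S3 standard pricing assumed, as in the text):

```python
def checkpoint_storage_cost(ckpt_gb: float, n_checkpoints: int,
                            price_per_gb_month: float = 0.023) -> float:
    """Monthly storage bill in USD for one run's worth of checkpoints."""
    return ckpt_gb * n_checkpoints * price_per_gb_month

# Five 140GB checkpoints from a single 70B run
print(round(checkpoint_storage_cost(140, 5), 2))  # 16.1 per month
```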

OpenAI Data Sharing Discount

OpenAI offers reduced inference pricing if you enable data sharing when creating the fine-tune job. This halves fine-tuned inference costs for both standard and batch tiers. The trade-off: your training data and completions may be used to improve OpenAI's models. For many teams, the privacy concern outweighs the cost savings.

Practical Cost Examples

Example 1: Customer Support Bot (Small Scale)

  • Goal: Fine-tune for domain-specific Q&A, 5,000 examples (~2M tokens), 3 epochs
  • API route: GPT-4.1 Mini on OpenAI = 6M tokens × $0.80/1M = $4.80 training
  • Inference: 500K tokens/month at $0.80/$3.20 = $2.00/month
  • Total first year: ~$29

Example 2: Code Assistant on Frontier Model (Medium Scale)

  • Goal: Fine-tune Llama 4 Scout for internal codebase, 50,000 examples (~25M tokens), 3 epochs
  • API route: Together AI LoRA = 75M tokens × $5/1M (mid-range estimate) = $375 training
  • Self-hosted: 8xA100 on Vast.ai at ~$10/hr for ~6 hours = $60
  • Savings from self-hosted: 84%

The gap is large because Llama 4 Scout's specialized pricing on Together AI sits at $3-8/1M, while rented A100 GPU time costs the same no matter which model you train.

Example 3: Enterprise Classification (Large Scale)

  • Goal: Fine-tune for document classification, 200,000 examples (~100M tokens), 4 epochs
  • GPT-4o route: OpenAI = 400M tokens × $25/1M = $10,000 training
  • GPT-4.1 migration: 400M × $3/1M = $1,200 (88% savings, same model family)
  • Open-source alternative: Llama 3.1 70B LoRA on Together AI = 400M × $1.50/1M = $600

For anyone still running GPT-4o fine-tuning for classification, switching to GPT-4.1 is the clearest cost reduction on the table right now.
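The three examples above share one formula. A sketch that reproduces Example 3, assuming ~500 tokens per example (the figure implied by 200K examples totaling ~100M tokens):

```python
def fine_tune_bill(examples: int, tokens_per_example: float,
                   epochs: int, rate_per_million: float) -> float:
    """Total training bill: dataset tokens x epochs at the per-1M rate."""
    return examples * tokens_per_example * epochs / 1_000_000 * rate_per_million

# Example 3: 200K examples, ~500 tokens each, 4 epochs
print(fine_tune_bill(200_000, 500, 4, 25.0))  # GPT-4o: 10000.0
print(fine_tune_bill(200_000, 500, 4, 3.0))   # GPT-4.1: 1200.0
print(fine_tune_bill(200_000, 500, 4, 1.5))   # Llama 3.1 70B LoRA: 600.0
```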

Decision Framework

Choosing between fine-tuning approaches comes down to dataset size, serving needs, and how much ML infrastructure expertise you have. For general guidance on picking the right model in the first place, see our how to choose an LLM guide.

Use API fine-tuning when:

  • Dataset is under 35M tokens
  • Your team lacks ML infrastructure expertise
  • You need quick iteration cycles (hours, not days)
  • You want managed serving included

Use self-hosted training when:

  • Dataset tops 100M tokens
  • Data privacy requirements rule out third-party APIs
  • You're running many experimental iterations
  • You need full control over hyperparameters and training loop

Start with LoRA when:

  • You're fine-tuning for the first time
  • Budget is constrained
  • The task is style transfer, formatting, or domain adaptation

Use full fine-tuning when:

  • You've tested LoRA first and quality doesn't meet your threshold
  • The task requires deep behavioral changes that LoRA can't reach

For a deeper explanation of what fine-tuning is and when it makes sense over prompt engineering, see the dedicated guide.

FAQ

Which fine-tuning provider is cheapest?

Together AI at $0.48/1M for LoRA on models up to 16B. OpenAI's GPT-4.1 Nano trains at $0.20/1M but inference costs are higher and minimum 10 examples are required.

Is LoRA good enough for production?

LoRA retains 80-95% of full fine-tuning quality depending on the task. For style matching, format compliance, and domain adaptation, LoRA performs nearly identically to full training.

How much data do I need for fine-tuning?

OpenAI requires a minimum of 10 examples. Practically, 500-5,000 high-quality examples produce meaningful improvements. Quality matters more than quantity past a certain threshold.

Does fine-tuning cost more than prompt engineering?

Training is a one-time cost. If fine-tuning reduces your prompt by 500 tokens per request and you run 1M requests/month, the token savings normally pay back training costs within weeks.
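The payback claim can be checked with a small calculator. The $50 training bill below is a hypothetical figure for illustration; the token savings and request volume come from the answer above:

```python
def payback_weeks(training_cost: float, tokens_saved_per_request: int,
                  requests_per_month: int, input_price_per_million: float) -> float:
    """Weeks until prompt-token savings repay a one-time training cost."""
    monthly_savings = (tokens_saved_per_request * requests_per_month
                       / 1_000_000 * input_price_per_million)
    return training_cost / (monthly_savings / 4.33)  # ~4.33 weeks per month

# 500 tokens saved per request, 1M requests/month, GPT-4.1 Mini input at $0.80/1M
print(round(payback_weeks(50.0, 500, 1_000_000, 0.80), 1))  # well under a week
```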

Can I fine-tune open-source models for free?

Training always requires compute. QLoRA on a consumer GPU (RTX 4090, ~$0.40/hr on Vast.ai) can fine-tune a 7B model for under $5.

How long does API fine-tuning take?

Most jobs under 10M tokens complete in 1-3 hours. Jobs above 100M tokens can take 6-24 hours depending on provider and model size.



✓ Last verified May 4, 2026

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.