Fine-Tuning Costs Comparison - Train Your Own AI

Side-by-side fine-tuning costs for OpenAI, Google, Together AI, Fireworks, Mistral, and self-hosted GPU options with LoRA vs full training breakdowns.

Cheapest: Together AI (Llama 3.1 8B LoRA) · Best value: Together AI (70B LoRA) · Updated weekly

TL;DR

  • Together AI offers the cheapest API fine-tuning at $0.48/1M tokens for LoRA on models up to 16B parameters
  • OpenAI's GPT-4o training runs $25/1M tokens - but the new GPT-4.1 drops that to $3/1M, an 88% cut
  • Self-hosted LoRA on a single H100 ($2.65-$3.99/hr) beats API pricing for datasets above roughly 50M tokens
  • LoRA achieves 80-95% of full fine-tuning quality at 70-90% lower cost, making it the default starting point

The Bottom Line

If you want the cheapest path to a custom model, Together AI's LoRA fine-tuning at $0.48 per million tokens on Llama 3.1 8B is hard to beat. For teams committed to OpenAI's ecosystem, the GPT-4.1 family changed the economics: training at $3/1M tokens (GPT-4.1) or $0.80/1M tokens (GPT-4.1-mini) makes fine-tuning accessible where GPT-4o's $25/1M was prohibitive. Google's Vertex AI sits in the middle at $3/1M tokens for Gemini 2.0 Flash tuning, with the bonus that inference pricing stays identical to the base model.

The real question isn't which provider is cheapest per token. It's whether API fine-tuning or self-hosted training makes more sense for your workload. We break down both paths below.

API Fine-Tuning Pricing Table

All prices in USD per million tokens. Training cost covers tokens processed during fine-tuning (dataset size multiplied by epochs). Inference costs apply when you call your fine-tuned model afterward. Prices verified against official documentation on March 26, 2026.

| Provider | Model | Training (/1M) | Inference In (/1M) | Inference Out (/1M) | Min Examples | Method |
|---|---|---|---|---|---|---|
| Together AI | Llama 3.1 8B | $0.48 | $0.18 | $0.18 | 1 | LoRA |
| Together AI | Mistral 7B | $0.48 | $0.20 | $0.20 | 1 | LoRA |
| Together AI | Llama 3.1 8B | $0.54 | $0.18 | $0.18 | 1 | Full |
| Fireworks | Llama 3.1 8B | $0.50 | $0.20 | $0.20 | 1 | LoRA |
| OpenAI | GPT-4.1 Nano | $0.20 | $0.20 | $0.80 | 10 | SFT |
| OpenAI | GPT-4o-mini | $0.30 | $0.30 | $1.20 | 10 | SFT |
| OpenAI | GPT-4.1 Mini | $0.80 | $0.80 | $3.20 | 10 | SFT |
| Mistral | Mistral 7B | $1.00 | $0.25 | $0.25 | 1 | SFT |
| Google | Gemini 2.0 Flash Lite | $1.00 | $0.075 | $0.30 | 10 | SFT |
| Together AI | Llama 3.1 70B | $1.50 | $0.88 | $0.88 | 1 | LoRA |
| Together AI | Llama 3.1 70B | $1.65 | $0.88 | $0.88 | 1 | Full |
| Mistral | Mistral Small | $2.00 | $0.20 | $0.60 | 1 | SFT |
| Together AI | 70-100B models | $2.90 | varies | varies | 1 | LoRA |
| Cohere | Command R | $3.00 | $0.30 | $1.20 | 2 | SFT |
| OpenAI | GPT-4.1 | $3.00 | $3.00 | $12.00 | 10 | SFT |
| Fireworks | Llama 3.1 70B | $3.00 | $0.90 | $0.90 | 1 | LoRA |
| Google | Gemini 2.0 Flash | $3.00 | $0.15 | $0.60 | 10 | SFT |
| OpenAI | GPT-3.5 Turbo | $8.00 | $3.00 | $6.00 | 10 | SFT |
| OpenAI | GPT-4o | $25.00 | $3.75 | $15.00 | 10 | SFT |

For base model inference pricing without fine-tuning, see our LLM API pricing comparison.
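The training-cost arithmetic from the table (dataset tokens multiplied by epochs, billed at the per-1M-token rate) can be sketched in a few lines. The rate dictionary below is illustrative, copied from the comparison table; the key names are made up for this example.

```python
# Estimate API fine-tuning cost: dataset tokens x epochs x per-token training rate.
# Rates are USD per 1M tokens, taken from the comparison table above.

RATES_PER_1M = {
    "together-llama-3.1-8b-lora": 0.48,
    "openai-gpt-4.1-mini": 0.80,
    "openai-gpt-4o": 25.00,
}

def training_cost(dataset_tokens: int, epochs: int, rate_per_1m: float) -> float:
    """Total processed tokens = dataset_tokens * epochs; billed per 1M tokens."""
    processed = dataset_tokens * epochs
    return processed / 1_000_000 * rate_per_1m

# 2M-token dataset, 3 epochs, GPT-4.1 Mini -> 6M tokens x $0.80/1M
print(round(training_cost(2_000_000, 3, RATES_PER_1M["openai-gpt-4.1-mini"]), 2))  # 4.8
```

Swapping in a different rate is all it takes to compare providers for a fixed dataset, which is how the practical examples later in this page are computed.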

What Stands Out

OpenAI's pricing spread is enormous. GPT-4.1 Nano training at $0.20/1M tokens is 125x cheaper than GPT-4o at $25/1M. If you were fine-tuning GPT-4o and haven't re-evaluated since GPT-4.1 launched, you're likely overpaying.

Together AI and Fireworks compete directly on open-source model training, with Together slightly cheaper and offering both LoRA and full fine-tuning options. Google's tuned model inference staying at base model prices is a meaningful advantage - OpenAI charges a premium for fine-tuned model inference (GPT-4.1 fine-tuned inference runs $3/1M input vs $2/1M for the base model).

API Fine-Tuning vs Self-Hosted Training

The API approach bundles infrastructure, tooling, and hosting into one per-token price. Self-hosted training means renting GPUs and running the training job yourself using frameworks like Hugging Face TRL, Axolotl, or LLaMA-Factory.

[Image: GPU server rack in a data center with blinking status lights] Self-hosted fine-tuning requires renting cloud GPUs or maintaining on-premise hardware. Source: pexels.com

GPU Cloud Pricing for Training

| Provider | GPU | VRAM | Hourly Rate | Best For |
|---|---|---|---|---|
| Vast.ai | H100 SXM | 80GB | $1.49-$2.00 | Budget training, spot |
| RunPod | H100 SXM | 80GB | $2.65 | Reliable on-demand |
| Lambda | H100 SXM | 80GB | $3.99 | Managed environment |
| Lambda | A100 SXM | 80GB | $2.79 | Budget large models |
| RunPod | A100 | 80GB | $1.64 | Budget LoRA training |
| Vast.ai | A100 | 80GB | $1.00-$1.80 | Spot pricing |
| Lambda | B200 SXM | 192GB | $6.69 | Massive models |

For a deeper look at GPU cloud costs, see our open-source hosting costs breakdown.

When Self-Hosted Wins

Self-hosted training becomes cheaper when your dataset is large enough that the fixed overhead of setting up infrastructure amortizes across many tokens. A rough crossover point: if you're processing more than 50M training tokens on a 7B model, renting a single H100 on Vast.ai at $1.49/hr for a few hours costs less than Together AI's $0.48/1M API price.

The math on a concrete example: fine-tuning Llama 3.1 70B with LoRA on 10M tokens (3 epochs = 30M processed tokens) costs $45 through Together AI's API at $1.50/1M. The same job on a rented 8xA100 cluster takes roughly 2-4 hours at $13-$22/hr, running $26-$88 total. At the low end (Vast.ai spot pricing), self-hosted is cheaper. At the high end (Lambda on-demand), the API wins on convenience.
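The comparison above reduces to two small formulas. One caveat: training throughput (tokens per hour) varies widely with model size, sequence length, and hardware, so the figure used here is an assumption for illustration; plug in a number you have measured yourself.

```python
# Compare API fine-tuning cost against renting GPUs, using this page's figures.

def api_cost(processed_tokens: int, rate_per_1m: float) -> float:
    """Cost of a managed fine-tuning job billed per 1M processed tokens."""
    return processed_tokens / 1_000_000 * rate_per_1m

def self_hosted_cost(processed_tokens: int, tokens_per_hour: float,
                     gpu_rate_per_hour: float) -> float:
    """Cost of the same job on rented GPUs, given an assumed throughput."""
    hours = processed_tokens / tokens_per_hour
    return hours * gpu_rate_per_hour

# 70B LoRA, 30M processed tokens: API at $1.50/1M vs an 8xA100 cluster at
# ~$13/hr finishing in ~3 hours (i.e. ~10M tokens/hour - an assumption).
print(api_cost(30_000_000, 1.50))                      # 45.0
print(self_hosted_cost(30_000_000, 10_000_000, 13.0))  # 39.0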

When APIs Win

API fine-tuning wins on three fronts. First, zero setup - no configuring CUDA, managing dependencies, or debugging distributed training. Second, built-in evaluation and monitoring dashboards. Third, your fine-tuned model is immediately available for serving at the same endpoint, with no separate deployment step.

For teams running occasional fine-tuning jobs (monthly or quarterly retraining), the engineering time saved by using an API almost always outweighs the per-token premium.

LoRA vs Full Fine-Tuning Cost Breakdown

Understanding the cost gap between LoRA (Low-Rank Adaptation) and full fine-tuning matters because it's often the single biggest lever you can pull. For more technical background on these techniques, read our fine-tuning and distillation guide.

[Image: Financial analysis dashboard showing cost comparisons and data charts] Breaking down the real costs across methods and model sizes uncovers where the savings actually live. Source: pexels.com

Cost by Model Size and Method

| Model Size | LoRA (Together AI) | Full (Together AI) | Savings | Quality Retention |
|---|---|---|---|---|
| Up to 16B | $0.48/1M | $0.54/1M | 11% | 90-95% |
| 17-69B | $1.50/1M | $1.65/1M | 9% | 85-95% |
| 70-100B | $2.90/1M | $3.20/1M | 9% | 80-95% |

Together AI's LoRA-to-full pricing gap is modest at 9-11%. The real savings with LoRA show up in self-hosted scenarios, where GPU memory requirements drop dramatically. A 7B model needs a single 24GB GPU for LoRA versus 4x 80GB GPUs for full fine-tuning. That hardware gap translates to 4-10x cost reduction.
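The hardware gap can be estimated with a common rule of thumb: mixed-precision full fine-tuning with Adam needs roughly 16 bytes per parameter (16-bit weights and gradients plus 32-bit optimizer state), while LoRA over a frozen 16-bit base needs about 2 bytes per parameter, and QLoRA about 0.5. These multipliers are approximations, not exact figures for any specific framework, and they exclude activations and KV cache, which add more on top.

```python
# Rough minimum VRAM for weights + gradients + optimizer state only.
# Multipliers are common rules of thumb, not framework-exact numbers.

BYTES_PER_PARAM = {
    "full_adam_mixed_precision": 16,  # fp16 weights/grads + fp32 Adam states
    "lora_16bit_base": 2,             # frozen bf16 base; adapter overhead is tiny
    "qlora_4bit_base": 0.5,           # 4-bit quantized frozen base
}

def min_vram_gb(params_billion: float, method: str) -> float:
    # 1B params at N bytes/param is N GB (using 1 GB = 1e9 bytes).
    return params_billion * BYTES_PER_PARAM[method]

print(min_vram_gb(7, "full_adam_mixed_precision"))  # 112.0 -> needs multiple 80GB GPUs
print(min_vram_gb(7, "lora_16bit_base"))            # 14.0  -> fits a single 24GB card
```

This is where the 4-10x self-hosted cost reduction comes from: a 7B model drops from a multi-GPU node to a single consumer card.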

DPO Training Costs

Direct Preference Optimization (DPO) - used for aligning models with human preferences - costs substantially more than standard supervised fine-tuning. Together AI's DPO pricing runs 2.5x higher than SFT:

  • Up to 16B: $1.20/1M (LoRA), $1.35/1M (Full)
  • 17-69B: $3.75/1M (LoRA), $4.12/1M (Full)
  • 70-100B: $7.25/1M (LoRA), $8.00/1M (Full)

If your fine-tuning goal is style matching or domain knowledge, start with SFT. Reserve DPO for cases where you need explicit preference alignment, such as safety tuning or output ranking.

Hidden Costs Most Guides Skip

Data Preparation

Budget 10-15% of your total fine-tuning spend on data preparation. Cleaning, formatting, deduplication, and quality filtering take real engineering hours. OpenAI requires a minimum of 10 training examples, but effective fine-tuning normally needs 500-10,000 high-quality examples. Producing that dataset - especially for specialized domains - often costs more than the training itself.
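A first-pass cleanup script is often enough to catch the worst dataset problems: malformed records and exact duplicates. The sketch below assumes the common OpenAI-style chat JSONL format (a "messages" list with role/content pairs); adapt the validation to whatever schema your provider expects.

```python
import hashlib
import json

# Minimal dataset-prep pass: validate chat-format JSONL records and drop
# exact duplicates. The "messages" schema mirrors the common OpenAI-style
# fine-tuning format; this is a sketch, not a full quality-filtering pipeline.

def clean_dataset(lines):
    seen, kept = set(), []
    for line in lines:
        record = json.loads(line)
        msgs = record.get("messages", [])
        roles = {m.get("role") for m in msgs}
        # Require at least one user turn and one assistant turn.
        if not {"user", "assistant"} <= roles:
            continue
        digest = hashlib.sha256(line.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        kept.append(record)
    return kept

raw = [
    '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]}',
    '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]}',
    '{"messages": [{"role": "system", "content": "be brief"}]}',
]
print(len(clean_dataset(raw)))  # 1 (one duplicate and one invalid record dropped)
```

Near-duplicate detection, length filtering, and manual spot checks come on top of this, which is where the 10-15% budget figure goes.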

Failed Experiments

Plan for 3-5 training runs before landing on a configuration that works. That means your actual training cost is 3-5x the single-run estimate. Hyperparameter sweeps over learning rate, LoRA rank, and epoch count add up fast. Together AI and Fireworks charge per token processed, so every abandoned run still hits your bill.

Inference Cost Multipliers

Fine-tuned models on OpenAI cost more to run than base models. GPT-4.1 base inference is $2/1M input, but fine-tuned GPT-4.1 inference jumps to $3/1M input - a 50% premium. Google doesn't charge this premium: tuned Gemini 2.0 Flash inference stays at the same $0.15/$0.60 as the base model. This ongoing inference premium can dwarf the one-time training cost for high-volume production workloads.
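To see when the premium dominates, you can compute how many months of inference it takes for the $1/1M input surcharge to exceed a given training bill. The rates below are the GPT-4.1 figures quoted above; the output-side premium is ignored for simplicity.

```python
# Months until OpenAI's fine-tuned-inference premium exceeds a training cost.
# Rates from this page: base GPT-4.1 input $2/1M, fine-tuned $3/1M.

def months_to_exceed(training_cost_usd: float, monthly_input_tokens: int,
                     base_rate: float = 2.0, tuned_rate: float = 3.0) -> float:
    premium_per_month = monthly_input_tokens / 1_000_000 * (tuned_rate - base_rate)
    return training_cost_usd / premium_per_month

# $1,200 training run, 100M input tokens/month -> premium is $100/month
print(months_to_exceed(1200, 100_000_000))  # 12.0
```

At 100M input tokens a month, the inference premium matches a $1,200 training run within a year; heavier workloads cross over in weeks.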

Storage and Checkpoints

Self-hosted training generates checkpoint files at every evaluation step. A 70B model checkpoint occupies roughly 140GB uncompressed. Five checkpoints from a single run means 700GB of storage. Cloud storage at $0.023/GB/month (AWS S3 standard) adds $16/month per run's worth of checkpoints - small individually, but it builds up across experiments.

OpenAI Data Sharing Discount

OpenAI offers reduced inference pricing if you enable data sharing when creating the fine-tune job. This halves fine-tuned inference costs for both standard and batch tiers. The trade-off: your training data and completions may be used to improve OpenAI's models. For many teams, the privacy concern outweighs the savings.

Practical Cost Examples

Example 1: Customer Support Bot (Small Scale)

  • Goal: Fine-tune for domain-specific Q&A, 5,000 examples (~2M tokens), 3 epochs
  • API route: GPT-4.1 Mini on OpenAI = 6M tokens x $0.80/1M = $4.80 training
  • Inference: 500K input + 500K output tokens/month at $0.80/$3.20 = $2.00/month
  • Total first year: ~$29

Example 2: Code Assistant (Medium Scale)

  • Goal: Fine-tune Llama 3.1 70B for internal codebase, 50,000 examples (~25M tokens), 3 epochs
  • API route: Together AI LoRA = 75M tokens x $1.50/1M = $112.50 training
  • Self-hosted: 8xA100 on Vast.ai at ~$10/hr for ~6 hours = $60
  • Savings from self-hosted: 47%

Example 3: Enterprise Classification (Large Scale)

  • Goal: Fine-tune GPT-4o for document classification, 200,000 examples (~100M tokens), 4 epochs
  • API route: OpenAI = 400M tokens x $25/1M = $10,000 training
  • Alternative: GPT-4.1 at $3/1M = 400M x $3 = $1,200 (88% savings, same model family)
  • Open-source alternative: Llama 3.1 70B LoRA on Together AI = 400M x $1.50/1M = $600

The GPT-4.1 launch reshuffled the economics for OpenAI-locked teams. If you're still on GPT-4o fine-tuning, migrating to GPT-4.1 should be your first move.

Decision Framework

Choosing between fine-tuning approaches comes down to four variables. For general guidance on picking the right model in the first place, see our how to choose an LLM guide.

Use API fine-tuning when:

  • Your team doesn't have ML infrastructure expertise
  • Dataset is under 50M tokens
  • You need quick iteration cycles (hours, not days)
  • You want managed serving included in the price

Use self-hosted training when:

  • Dataset passes 100M tokens
  • You need full control over hyperparameters and training loop
  • Data privacy requirements rule out third-party APIs
  • You're running many experimental iterations and want to minimize per-run cost

Start with LoRA when:

  • You're fine-tuning for the first time
  • Budget is constrained
  • The task is style transfer, formatting, or domain adaptation
  • You want to swap adapters at inference time without redeploying

Use full fine-tuning when:

  • LoRA quality doesn't meet your threshold after testing
  • The task requires deep behavioral changes across the full weight space
  • You've validated the approach with LoRA first and need that final 5-10% quality

For a deeper explanation of what fine-tuning is and when it makes sense over prompt engineering, we have a dedicated guide.

FAQ

Which fine-tuning provider is cheapest?

Together AI at $0.48/1M tokens for LoRA on models up to 16B parameters. OpenAI's cheapest option is GPT-4.1 Nano at $0.20/1M, but inference costs are higher.

Is LoRA good enough for production?

LoRA retains 80-95% of full fine-tuning quality depending on the task. For style matching, format compliance, and domain adaptation, LoRA performs nearly identically to full training.

How much data do I need for fine-tuning?

OpenAI requires a minimum of 10 examples. Practically, 500-5,000 high-quality examples produce meaningful improvements. More data helps, but quality matters more than quantity.

Does fine-tuning cost more than prompt engineering?

Training is a one-time cost. If fine-tuning reduces your prompt by 500 tokens per request and you process 1M requests/month, the token savings pay for training within weeks.
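The payback claim is easy to check with the page's own numbers. The 4.33 weeks-per-month factor and the example figures (500 tokens saved, 1M requests/month, GPT-4.1 Mini rates) are assumptions for illustration.

```python
# Payback period for fine-tuning that shortens prompts: value the saved
# input tokens at the model's input rate and divide into the training cost.

def payback_weeks(training_cost: float, tokens_saved_per_request: int,
                  requests_per_month: int, input_rate_per_1m: float) -> float:
    monthly_savings = (tokens_saved_per_request * requests_per_month
                       / 1_000_000 * input_rate_per_1m)
    return training_cost / monthly_savings * 4.33  # ~4.33 weeks per month

# $4.80 training (Example 1 above), 500 tokens saved per request,
# 1M requests/month at $0.80/1M input -> ~$400/month saved.
print(payback_weeks(4.80, 500, 1_000_000, 0.80))  # well under one week
```

Even at a $1,200 training cost (the GPT-4.1 example), the same savings rate pays it back in about three months.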

Can I fine-tune open-source models for free?

Training is never truly free - you still need compute. But using QLoRA on a consumer GPU (RTX 4090, ~$0.40/hr on Vast.ai) can fine-tune a 7B model for under $5.

How long does API fine-tuning take?

Most jobs under 10M tokens complete within 1-3 hours. Larger jobs (100M+ tokens) can take 6-24 hours depending on the provider and model size.


About the author: James, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.