A Developer's Guide to Finetuning and Distilling Language Models

A practical, hands-on guide for software developers who want to finetune open-source LLMs and distill larger models into smaller, faster ones - covering techniques, tools, datasets, and cloud GPU options.

You've been using ChatGPT or Claude at work, maybe piping prompts through an API, and now you're wondering: can I make a model that just does my thing, faster and cheaper? The answer is yes - and it's more accessible than you think.

Finetuning takes an existing pretrained model and further trains it on your specific data so it learns your domain, your tone, or your task. Distillation takes a big, expensive model (the "teacher") and transfers its knowledge into a smaller, cheaper model (the "student") that you can actually deploy without burning through your budget.

This guide will walk you through both processes, from zero to a working model. No PhD required - just a willingness to run some Python scripts and rent a GPU for a few hours.

Why Finetune? Why Distill?

Before we get into the how, let's clarify the when.

Finetune when:

  • You need a model that follows a specific output format (JSON schemas, code style, medical reports)
  • You have domain-specific knowledge the base model lacks (legal jargon, internal APIs, niche scientific data)
  • You want to reduce latency and cost by using a smaller model that performs well on your task
  • Prompt engineering alone isn't cutting it - you've tried few-shot examples and system prompts and the model still doesn't behave

Distill when:

  • You're paying too much for GPT-4 or Claude API calls and want a local or cheaper alternative
  • You need to run inference on edge devices or with strict latency requirements
  • You want the quality of a large model but the speed and cost of a small one
  • You need a model you fully control, with no API dependency

Often you'll do both: distill knowledge from a powerful teacher model into a synthetic dataset, then finetune a small open-source model on that dataset.

The Three Finetuning Approaches

Not all finetuning is created equal. Here are the three main approaches, ranked from most to least resource-intensive:

Full Finetuning

This updates every single parameter in the model. A 7B-parameter model has 7 billion weights, and full finetuning adjusts all of them.

Pros:

  • Maximum learning capacity - the model can adapt deeply to your data
  • Best results when you have a lot of training data (10K+ examples)

Cons:

  • Requires enormous VRAM: a 7B model needs ~60GB just for training (optimizer states, gradients, activations)
  • Slow: hours to days even on expensive hardware
  • Risk of catastrophic forgetting - the model can "forget" general knowledge while learning your task
  • Produces a full-size model copy (14GB+ for a 7B model)

When to use: Almost never for most developers. Reserved for when you have massive datasets, big budgets, and need every last drop of performance.

LoRA (Low-Rank Adaptation)

LoRA is the technique that democratized finetuning. Instead of updating all parameters, it freezes the original model and injects small trainable matrices (called "adapters") into specific layers.

Think of it this way: the base model is a textbook, and LoRA adds sticky notes in the margins. The textbook stays unchanged, but the sticky notes modify the model's behavior for your task.
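
Under the hood: for a frozen weight matrix W of shape (d, k), LoRA learns two small matrices B (d × r) and A (r × k), with the rank r much smaller than d or k, and the effective weight becomes W + (alpha/r) * BA. A back-of-the-envelope sketch of the parameter savings (the 4096×4096 shape is just an illustrative attention projection):

d, k, r = 4096, 4096, 16           # illustrative projection size, LoRA rank 16
full_params = d * k                # parameters touched by full finetuning
lora_params = d * r + r * k        # parameters in the trainable B and A matrices
print(full_params, lora_params, round(lora_params / full_params, 4))
# 16777216 131072 0.0078  -> under 1% of that matrix is trainable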

Pros:

  • Dramatically less VRAM: a 7B model can be finetuned with ~16GB VRAM
  • Fast: minutes to a couple of hours on a single GPU
  • Produces tiny adapter files (10-100MB) instead of full model copies
  • You can merge the adapter back into the base model for deployment, or swap adapters for different tasks
  • Minimal catastrophic forgetting - the base weights don't change

Cons:

  • Slightly less expressive than full finetuning (rarely matters in practice)
  • Quality depends on choosing the right hyperparameters (rank, alpha, target modules)

When to use: This is the default choice for 90% of finetuning tasks. Start here.

QLoRA (Quantized LoRA)

QLoRA combines LoRA with 4-bit quantization. The base model is loaded in 4-bit precision (shrinking a 7B model from ~14GB to ~4GB in memory), and LoRA adapters are trained on top in full precision.

Pros:

  • Even less VRAM: finetune a 7B model with 6-8GB VRAM (a consumer GPU like an RTX 3060)
  • Finetune 13B-34B models on a single 24GB GPU (RTX 3090/4090), and 70B models on a single 48-80GB card (A100)
  • Results are surprisingly close to full LoRA

Cons:

  • Slightly slower than standard LoRA (quantization/dequantization overhead)
  • Small accuracy loss from quantization (usually negligible)

When to use: When your GPU doesn't have enough VRAM for standard LoRA, or when you want to finetune larger models without renting expensive multi-GPU setups.
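
Outside of Unsloth, the same 4-bit setup is available directly through Hugging Face transformers and bitsandbytes. A minimal loading sketch (the model name and dtype are just example choices):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the forward/backward pass in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")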

Quick Comparison

|  | Full Finetune | LoRA | QLoRA |
| --- | --- | --- | --- |
| VRAM for 7B model | ~60GB | ~16GB | ~6GB |
| Training speed | Hours to days | Minutes to hours | Minutes to hours |
| Output size | 14GB+ (full copy) | 10-100MB (adapter) | 10-100MB (adapter) |
| Quality | Best | Near-best | Very good |
| Min. GPU | A100 80GB | RTX 3090 / A100 40GB | RTX 3060 12GB |
| Best for | Large orgs, big data | Most use cases | Budget/consumer GPUs |

Building Your Dataset

Your model is only as good as its training data. The dataset is where most finetuning projects succeed or fail, so let's get this right.

Dataset Formats

There are two dominant formats in the finetuning world:

Alpaca Format (Single-Turn)

Best for instruction-following tasks. Each example has an instruction, optional input, and expected output:

{
  "instruction": "Convert this SQL query to a MongoDB aggregation pipeline",
  "input": "SELECT department, AVG(salary) FROM employees GROUP BY department HAVING AVG(salary) > 50000",
  "output": "db.employees.aggregate([\n  { $group: { _id: \"$department\", avgSalary: { $avg: \"$salary\" } } },\n  { $match: { avgSalary: { $gt: 50000 } } },\n  { $project: { department: \"$_id\", avgSalary: 1, _id: 0 } }\n])"
}

ShareGPT / Conversational Format (Multi-Turn)

Best for chatbot-style interactions. Each example is a multi-turn conversation:

{
  "conversations": [
    {"from": "human", "value": "I have a Python FastAPI app and I'm getting CORS errors when calling it from my React frontend."},
    {"from": "gpt", "value": "You need to add CORS middleware to your FastAPI app. Here's how:\n\n```python\nfrom fastapi.middleware.cors import CORSMiddleware\n\napp.add_middleware(\n    CORSMiddleware,\n    allow_origins=[\"http://localhost:3000\"],\n    allow_credentials=True,\n    allow_methods=[\"*\"],\n    allow_headers=[\"*\"],\n)\n```\n\nReplace `http://localhost:3000` with your actual frontend URL. In production, avoid using `*` for origins."},
    {"from": "human", "value": "What if I have multiple frontend URLs?"},
    {"from": "gpt", "value": "Pass them as a list:\n\n```python\nallow_origins=[\n    \"http://localhost:3000\",\n    \"https://myapp.com\",\n    \"https://staging.myapp.com\",\n]\n```\n\nEach origin that needs access should be explicitly listed."}
  ]
}

How Many Examples Do You Need?

This depends on the complexity of your task:

| Task Type | Examples Needed | Notes |
| --- | --- | --- |
| Format compliance (JSON output, specific template) | 50-200 | Model already knows the content, just needs the format |
| Tone/style transfer | 200-500 | Company voice, technical writing style |
| Domain knowledge | 500-2,000 | Medical, legal, domain-specific terminology |
| Complex reasoning | 2,000-10,000 | Multi-step tasks, code generation |
| General-purpose assistant | 10,000+ | Broad coverage across many topics |

Creating Your Dataset: Three Approaches

1. Manual Curation

Write examples by hand or adapt existing data. This gives you the highest quality but is slow.

Tip: Start with 50-100 hand-crafted examples. Finetune on those, test the model, identify failure cases, and write more examples targeting those weaknesses. This iterative approach is much more efficient than trying to write thousands of examples upfront.

2. Synthetic Data Generation (Distillation)

Use a powerful model (GPT-4, Claude, DeepSeek-R1) to generate training examples. This is the most common approach and bridges finetuning with distillation:

import openai

client = openai.OpenAI()

system_prompt = """You are a senior software engineer.
Given a coding question, provide a clear, concise answer
with a working code example. Use Python unless otherwise specified."""

questions = [
    "How do I implement rate limiting in Flask?",
    "How do I set up WebSocket connections in FastAPI?",
    "How do I write a custom pytest fixture?",
    # ... hundreds more questions
]

dataset = []
for question in questions:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question}
        ],
        temperature=0.7
    )

    dataset.append({
        "instruction": question,
        "input": "",
        "output": response.choices[0].message.content
    })

Quality filtering: Not every generated example will be good. Filter by the criteria below; a minimal filtering sketch follows the list:

  • Length (too short = probably incomplete)
  • Format compliance (does it actually contain code?)
  • Deduplication (remove near-duplicates)
  • Optionally: have a second model score the outputs (rejection sampling)
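
A minimal filtering pass over the generated Alpaca-format records might look like this (it reuses the dataset list from the generation script above; the length threshold and the code-fence check are illustrative heuristics, not fixed rules):

import json

seen = set()
filtered = []
for example in dataset:
    output = example["output"].strip()
    if len(output) < 100:                       # too short: probably incomplete
        continue
    if "```" not in output:                     # we expect a code block in this dataset
        continue
    key = example["instruction"].strip().lower()
    if key in seen:                             # drop exact-duplicate instructions
        continue                                # (near-duplicates need fuzzy or embedding matching)
    seen.add(key)
    filtered.append(example)

print(f"Kept {len(filtered)} / {len(dataset)} examples")

with open("dataset.jsonl", "w") as f:
    for example in filtered:
        f.write(json.dumps(example) + "\n")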

3. Existing Public Datasets

Great for getting started or supplementing your own data. Browse thousands of instruction-tuning and conversational datasets on Hugging Face Datasets.

Preparing Your Dataset File

Most finetuning tools expect a JSONL file (one JSON object per line):

# dataset.jsonl
{"instruction": "What is dependency injection?", "input": "", "output": "Dependency injection is a design pattern..."}
{"instruction": "Convert this to async", "input": "def fetch_data(): ...", "output": "async def fetch_data(): ..."}

Or for conversational format:

# dataset.jsonl
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}

Always split your data: Keep 10-15% as a validation set to monitor for overfitting during training.
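
With Hugging Face datasets this split is a single call (15% matches the guideline above; the file name is the one used throughout this guide):

from datasets import load_dataset

dataset = load_dataset("json", data_files="dataset.jsonl", split="train")
splits = dataset.train_test_split(test_size=0.15, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]
print(len(train_dataset), "training examples,", len(eval_dataset), "validation examples")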

The Finetuning Toolbox

The ecosystem has matured rapidly. Here are the tools the community actually uses and loves:

Unsloth - The Speed King

Unsloth (35K+ GitHub stars) has become the go-to tool for most developers. It's a drop-in optimization layer that makes finetuning 2x faster while using 70% less memory - with zero accuracy loss.

Why developers love it:

  • Works on a single GPU (no multi-GPU headaches)
  • Free and open-source
  • Supports Llama, Mistral, Gemma, Qwen, Phi, and most popular architectures
  • Has ready-made Google Colab notebooks - you can finetune without owning a GPU
  • The API is dead simple

Here's a complete Unsloth finetuning script:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# 1. Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,       # QLoRA
)

# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                     # LoRA rank
    lora_alpha=32,            # LoRA scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj",
                    "o_proj", "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # memory optimization
)

# 3. Load your dataset
dataset = load_dataset("json", data_files="dataset.jsonl", split="train")

# 4. Render each example into a "text" field using the model's chat template
def format_example(example):
    example["text"] = tokenizer.apply_chat_template(
        [
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["output"]},
        ],
        tokenize=False,
    )
    return example

dataset = dataset.map(format_example)

# 5. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        output_dir="./output",
        save_strategy="epoch",
    ),
)

trainer.train()

# 6. Save - choose your format
model.save_pretrained("my-finetuned-model")        # LoRA adapter only
model.save_pretrained_merged("merged-model", tokenizer)  # Full merged model
model.save_pretrained_gguf("model.gguf", tokenizer)      # GGUF for llama.cpp

That's it. With ~500 quality examples, this typically trains in 20-40 minutes on a single GPU and produces a model noticeably better at your task than the base model.
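
Before moving on, it's worth sanity-checking the adapter with a quick generation. A minimal sketch using Unsloth's inference helper (the prompt is just an example; it reuses the model and tokenizer from the script above):

FastLanguageModel.for_inference(model)  # switch the Unsloth model into inference mode

messages = [{"role": "user", "content": "How do I add CORS middleware to a FastAPI app?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids=inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))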

Axolotl - The Swiss Army Knife

Axolotl (8K+ GitHub stars) is the power user's choice. It's more configurable than Unsloth and supports advanced features like multi-GPU training (DeepSpeed, FSDP) and preference tuning (DPO, ORPO, reward modeling).

Why developers love it:

  • Everything is configured via a single YAML file
  • Supports every finetuning technique (full, LoRA, QLoRA, DPO, ORPO, reward modeling)
  • Enterprise-ready: used by several AI companies in production
  • Excellent documentation and active Discord community

Example Axolotl config (config.yml):

base_model: meta-llama/Llama-3.2-3B-Instruct
model_type: LlamaForCausalLM
load_in_4bit: true

adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj

datasets:
  - path: my_dataset.jsonl
    type: alpaca

sequence_len: 2048
micro_batch_size: 4
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_torch
lr_scheduler: cosine
warmup_steps: 10

output_dir: ./output

Then run: accelerate launch -m axolotl.cli.train config.yml

LLaMA-Factory - The Zero-Code Option

LLaMA-Factory (45K+ GitHub stars) is the most popular finetuning tool overall, especially for its web UI that lets you configure and launch training runs without writing any code.

Why developers love it:

  • Beautiful web interface - point and click to finetune
  • Supports 100+ model architectures
  • Built-in dataset management and preprocessing
  • One-click deployment after training

# Install
pip install llamafactory

# Launch the web UI
llamafactory-cli webui

# Or train via CLI
llamafactory-cli train config.yaml

Which Tool Should You Pick?

| Scenario | Recommended Tool |
| --- | --- |
| First time finetuning, want it simple | Unsloth (Colab notebook) |
| Single GPU, want maximum speed | Unsloth |
| Need a web UI, don't want to write code | LLaMA-Factory |
| Multi-GPU, advanced techniques (DPO/RLHF) | Axolotl |
| Production pipeline, need YAML config | Axolotl |

My recommendation for beginners: Start with Unsloth. Open one of their Google Colab notebooks, swap in your dataset, and run. You'll have a finetuned model in 30 minutes with zero setup.

Understanding LoRA Hyperparameters

When finetuning with LoRA, there are a few key numbers you need to understand:

Rank (r)

The rank controls how much the LoRA adapters can learn. Higher rank = more capacity = larger adapter.

| Rank | Adapter Size (7B model) | Best For |
| --- | --- | --- |
| 8 | ~20MB | Simple format/style changes |
| 16 | ~40MB | Most tasks (good default) |
| 32 | ~80MB | Complex domain knowledge |
| 64 | ~160MB | Near full-finetune quality |
| 128+ | ~320MB+ | Rarely needed |

Start with r=16 and increase only if your validation loss plateaus.

Alpha (lora_alpha)

The alpha parameter controls the scaling of the LoRA updates. A common rule of thumb: set alpha to 2× the rank.

  • r=16, alpha=32 - standard, works well
  • r=16, alpha=16 - more conservative updates (if training is unstable)

Learning Rate

The learning rate controls how aggressively the model updates its weights. Too high and it overshoots; too low and it barely learns.

| Method | Recommended LR |
| --- | --- |
| Full finetune | 1e-5 to 5e-5 |
| LoRA | 1e-4 to 3e-4 |
| QLoRA | 1e-4 to 2e-4 |

Start with 2e-4 for LoRA/QLoRA. If training loss spikes or oscillates, lower it.

Epochs

How many times the model sees your entire dataset.

  • 1-3 epochs for large datasets (10K+ examples)
  • 3-5 epochs for medium datasets (1K-10K examples)
  • 5-10 epochs for small datasets (< 1K examples)

Watch for overfitting: If training loss keeps dropping but validation loss starts climbing, you've trained too long. Use early stopping or just use fewer epochs.
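
In TRL/transformers, early stopping is a callback plus a few TrainingArguments flags. A minimal sketch, assuming you pass your validation split as eval_dataset (argument names vary slightly across TRL/transformers versions; on older releases eval_strategy is spelled evaluation_strategy):

from transformers import EarlyStoppingCallback, TrainingArguments
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=10,                  # upper bound; early stopping usually ends sooner
    eval_strategy="epoch",                # evaluate on the validation split every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,          # restore the checkpoint with the best eval loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=args,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # stop after 2 epochs with no improvement
)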

Running Locally vs. Cloud

Local GPU Training

If you have a gaming GPU, you might be able to finetune locally:

| GPU | VRAM | Can Finetune (QLoRA) |
| --- | --- | --- |
| RTX 3060 | 12GB | Up to 7B models |
| RTX 3090 | 24GB | Up to 13B models |
| RTX 4090 | 24GB | Up to 13B models, faster |
| 2× RTX 3090 | 48GB | Up to 34B models |

Setup (Linux):

# Install CUDA toolkit (Ubuntu/Debian)
sudo apt install nvidia-cuda-toolkit

# Create a virtual environment
python -m venv finetune-env
source finetune-env/bin/activate

# Install dependencies
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install unsloth transformers datasets trl

Cloud GPU Services

For most developers, renting a cloud GPU for a few hours is the practical choice. Here's what's available:

| Service | GPU | Price/Hour | Best For |
| --- | --- | --- | --- |
| Google Colab | T4 (free), A100 (Pro) | Free / $10/mo | Getting started, experiments |
| RunPod | H100 80GB | ~$3.49/hr | Serious training, great UI |
| Lambda Labs | H100 80GB | ~$2.49/hr | Research, on-demand |
| Vast.ai | Varies (marketplace) | ~$1.50/hr (H100) | Budget, if you don't mind variability |
| Together AI | Managed | ~$5/hr | Finetuning API, no GPU management |
| Modal | H100, A100 | Pay-per-second | Serverless, great DX |

My recommendation: Start with Google Colab (free T4 GPU - enough for 3B-7B QLoRA). When you need more power, RunPod offers the best balance of price, reliability, and user experience.

Finetuning-as-a-Service

If you don't want to manage GPUs at all, several platforms offer managed finetuning APIs (OpenAI and Together AI, for example). These are the simplest option but give you the least control (and your data goes to a third party).

Model Distillation: Making Big Models Small

Now let's talk about the other half of this guide: distillation. The idea is simple - take the intelligence of a 70B or 400B parameter model and compress it into something you can actually deploy.

How Distillation Works

The core concept:

  1. Teacher model: A large, powerful model (GPT-4o, Claude, Llama 3.1 405B, DeepSeek-R1)
  2. Student model: A smaller model you want to make smarter (Llama 3.2 3B, Qwen 2.5 7B, Phi-4)
  3. Knowledge transfer: The teacher generates high-quality outputs, and you finetune the student on those outputs

There are two main approaches:

Approach 1: Synthetic Data Distillation

This is the most common and practical approach. You use the teacher to generate a training dataset, then finetune the student on it. We already covered this in the dataset section - it's the same process:

Teacher generates answers → Filter for quality → Finetune student on the data

The key insight: when training on teacher outputs, the student doesn't just learn the answers - it learns the teacher's reasoning patterns. This is why distilled models often punch far above their weight class.

Chain-of-Thought Distillation: A particularly powerful technique. Ask the teacher to show its reasoning step-by-step, then train the student on those reasoning traces:

# Generate training data with chain-of-thought reasoning
prompt = """Solve this step by step, showing your reasoning:

Question: A company has 3 servers. Each handles 1000 requests/second.
They want to add caching that reduces load by 40%.
How many requests per second can the system handle after caching?

Think through this step by step, then give the final answer."""

# Teacher response includes reasoning:
# "Step 1: Current total capacity = 3 × 1000 = 3,000 req/s
#  Step 2: Caching reduces load by 40%, meaning each server
#           only needs to handle 60% of requests
#  Step 3: Effective capacity = 3,000 / 0.6 = 5,000 req/s
#  Answer: 5,000 requests per second"

The student model trained on this data learns not just the answer but how to reason - and generalizes that reasoning to new problems.

This is exactly how the DeepSeek-R1 distilled models were created. DeepSeek used their massive R1 model (671B parameters with mixture-of-experts) as a teacher to generate 800,000 reasoning samples, then finetuned smaller models (1.5B to 70B) on that data. The distilled 7B model achieved reasoning performance rivaling much larger models.

Approach 2: Logit-Based Distillation

This is a more technical approach where the student learns to match the teacher's probability distributions (soft labels), not just its final outputs. It's more efficient but requires access to the teacher's logits (output probabilities), which means you need to run the teacher model yourself.
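
The standard objective here is a temperature-scaled KL divergence between the two distributions. A minimal PyTorch sketch of that loss term (the temperature value is illustrative):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean matches the mathematical definition of KL; T^2 rescales the gradients
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2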

Frameworks for this:

  • DistillKit by Arcee AI - supports both logit-based and synthetic data distillation
  • EasyDistill - Part of MS-Swift, designed for on-device model compression

For most developers, synthetic data distillation (Approach 1) is the way to go. It's simpler, doesn't require running the teacher model locally, and the results are excellent.

A Complete Distillation Pipeline

Let's walk through a real-world example: distilling GPT-4o's coding knowledge into a Llama 3.2 3B model.

Step 1: Generate seed questions

# Create a diverse set of coding questions
import random

topics = ["Python", "JavaScript", "SQL", "Docker", "Git", "APIs", "Testing"]
difficulties = ["beginner", "intermediate", "advanced"]

templates = [
    "How do I handle errors in {topic}?",
    "What's the best way to structure a {topic} project?",
    "Give me a {difficulty}-level {topic} exercise with a worked solution.",
    # ... add more templates for variety
]

# Expand the templates into concrete questions (str.format ignores unused keywords)
seed_questions = sorted({
    template.format(topic=topic, difficulty=difficulty)
    for template in templates
    for topic in topics
    for difficulty in difficulties
})
random.shuffle(seed_questions)

Step 2: Generate teacher responses

import openai
import json

client = openai.OpenAI()
dataset = []

for question in seed_questions:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert software engineer. "
             "Provide clear, practical answers with working code examples. "
             "Think through problems step by step."},
            {"role": "user", "content": question}
        ],
        temperature=0.7,
        max_tokens=2048,
    )

    answer = response.choices[0].message.content

    # Basic quality filter
    if len(answer) > 100 and "```" in answer:  # Has substance and code
        dataset.append({
            "instruction": question,
            "input": "",
            "output": answer
        })

# Save dataset
with open("coding_dataset.jsonl", "w") as f:
    for item in dataset:
        f.write(json.dumps(item) + "\n")

print(f"Generated {len(dataset)} training examples")

Step 3: Quality filtering with a second model

# Use a second model to score the teacher's outputs
scored_dataset = []

for item in dataset:
    score_response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for scoring
        messages=[
            {"role": "system", "content": "Rate the following coding answer "
             "from 1-10 on accuracy, completeness, and clarity. "
             "Reply with just the number."},
            {"role": "user", "content": f"Question: {item['instruction']}\n\n"
             f"Answer: {item['output']}"}
        ],
        temperature=0,
    )

    try:
        score = int(score_response.choices[0].message.content.strip())
        if score >= 7:  # Only keep high-quality examples
            scored_dataset.append(item)
    except ValueError:
        continue

print(f"Kept {len(scored_dataset)} / {len(dataset)} examples after filtering")

Step 4: Finetune the student model

Use the Unsloth script from earlier, but point it at your distilled dataset:

dataset = load_dataset("json", data_files="coding_dataset.jsonl", split="train")

That's the entire pipeline. In practice, you'd generate 1,000-5,000 examples, filter to the best 70-80%, and finetune for 2-3 epochs. Total cost: maybe $20-50 in API calls for dataset generation, plus $5-10 in GPU time for training.

Evaluating Your Model

After training, you need to check that your model actually got better. Here's a practical evaluation approach:

1. Manual Spot-Checking

Run 20-30 test prompts through both the base model and your finetuned model. Compare outputs side by side. This is surprisingly effective for catching obvious issues.
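
One low-effort way to do this is to run the same prompts through both models and print the answers side by side. A sketch using the transformers text-generation pipeline (the model paths are examples; "./merged-model" assumes you merged the LoRA adapter as in the export step):

from transformers import pipeline

test_prompts = [
    "How do I add CORS middleware to a FastAPI app?",
    # ... 20-30 prompts that mirror your real use cases
]

base = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct", device_map="auto")
tuned = pipeline("text-generation", model="./merged-model", device_map="auto")

for prompt in test_prompts:
    messages = [{"role": "user", "content": prompt}]
    base_reply = base(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]
    tuned_reply = tuned(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]
    print(f"### {prompt}\n--- base ---\n{base_reply}\n--- finetuned ---\n{tuned_reply}\n")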

2. Automated Evaluation

Use a strong model (GPT-4o, Claude) as a judge:

def evaluate_response(question, base_response, finetuned_response):
    judge_prompt = f"""Compare these two responses to the question:

Question: {question}

Response A (base model): {base_response}

Response B (finetuned model): {finetuned_response}

Which is better and why? Rate each from 1-10 on:
- Accuracy
- Completeness
- Clarity
- Formatting"""

    # Send the comparison to a strong judge model and return its verdict
    # (reuses the OpenAI client from the distillation pipeline above)
    judge = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return judge.choices[0].message.content

3. Task-Specific Metrics

If your task has measurable outcomes, use them (a format-compliance sketch follows the list):

  • Code generation: Pass rate on test cases
  • Classification: Accuracy, F1 score
  • Format compliance: Percentage of outputs matching your schema
  • Factual Q&A: Match rate against known answers
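
For example, format compliance for a JSON-output task can be scored by simply attempting to parse each response (the required keys and the sample responses are illustrative):

import json

REQUIRED_KEYS = {"department", "avgSalary"}   # whatever your schema demands

def is_compliant(response_text):
    """True if the output parses as JSON and contains the required keys."""
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)

responses = [
    '{"department": "engineering", "avgSalary": 75000}',
    "Sure! Here is the JSON you asked for: ...",
]
compliance_rate = sum(is_compliant(r) for r in responses) / len(responses)
print(f"Format compliance: {compliance_rate:.1%}")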

Common Problems and Fixes

| Problem | Symptom | Fix |
| --- | --- | --- |
| Overfitting | Perfect on training data, bad on new inputs | Reduce epochs, add more data, increase dropout |
| Underfitting | Model barely changed from base | Increase epochs, increase rank, check data format |
| Catastrophic forgetting | Lost general capabilities | Lower the learning rate, use fewer epochs, mix in general data |
| Repetitive outputs | Model repeats phrases/sentences | Lower temperature at inference, add more diverse training data |
| Wrong format | Output doesn't match expected structure | Add more format-specific examples, use stricter system prompts |

Putting It All Together: A Step-by-Step Checklist

Here's the complete workflow, from idea to deployed model:

1. Define your task clearly

  • What inputs will the model receive?
  • What outputs do you expect?
  • What does "success" look like?

2. Create your dataset

  • Start with 100-500 examples
  • Use synthetic generation from a teacher model if needed
  • Filter for quality
  • Split into 85% train / 15% validation

3. Choose your base model

  • 3B parameters (Llama 3.2 3B, Phi-4-mini): Fast, runs on laptops, good for simple tasks
  • 7B parameters (Qwen 2.5 7B, Llama 3.1 8B): Best balance of quality and speed
  • 14B+ parameters (Qwen 2.5 14B): When you need more capability and have the hardware

4. Finetune with QLoRA + Unsloth

  • Start with r=16, alpha=32, lr=2e-4, epochs=3
  • Monitor training and validation loss
  • Save checkpoints every epoch

5. Evaluate and iterate

  • Compare against base model
  • Identify failure cases
  • Add targeted examples and retrain

6. Export and deploy

  • Merge LoRA into base model (or use GGUF for local deployment)
  • Deploy via vLLM, llama.cpp, or Ollama (see the quick vLLM sketch after this list)
  • Monitor quality in production
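
As a quick illustration of that last step: a merged model can be served with vLLM's OpenAI-compatible server and queried with the regular OpenAI client. A minimal sketch (the port is vLLM's default and the model path is an example):

# In another terminal: vllm serve ./merged-model --port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="./merged-model",  # vLLM registers the model under the name/path it was launched with
    messages=[{"role": "user", "content": "How do I write a custom pytest fixture?"}],
    temperature=0.2,
)
print(response.choices[0].message.content)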

Finetuning and distillation aren't magic - they're just training with focus. You're taking a general-purpose model and sharpening it for your specific use case. The tools are free, the cloud GPUs are cheap, and the barrier to entry has never been lower. Start small: take a 3B model, finetune it on 200 examples of your task, and compare it to the base model. You'll be surprised how much better it gets - and once you see the results, you'll understand why everyone's doing this.
