A Developer's Guide to Finetuning and Distilling Language Models

A practical, hands-on guide for software developers who want to finetune open-source LLMs and distill larger models into smaller, faster ones - covering techniques, tools, datasets, and cloud GPU options.

You've been using ChatGPT or Claude at work, maybe piping prompts through an API, and now you're wondering: can I make a model that just does my thing, faster and cheaper? The answer is yes - and it's more accessible than you think.

Finetuning takes an existing pretrained model and further trains it on your specific data so it learns your domain, your tone, or your task. Distillation takes a big, expensive model (the "teacher") and transfers its knowledge into a smaller, cheaper model (the "student") that you can actually deploy without burning through your budget.

This guide will walk you through both processes, from zero to a working model. No PhD required - just a willingness to run some Python scripts and rent a GPU for a few hours.

Why Finetune? Why Distill?

Before we get into the how, let's clarify the when.

Finetune when:

  • You need a model that follows a specific output format (JSON schemas, code style, medical reports)
  • You have domain-specific knowledge the base model lacks (legal jargon, internal APIs, niche scientific data)
  • You want to reduce latency and cost by using a smaller model that performs well on your task
  • Prompt engineering alone isn't cutting it - you've tried few-shot examples and system prompts and the model still doesn't behave

Distill when:

  • You're paying too much for GPT-4 or Claude API calls and want a local or cheaper alternative
  • You need to run inference on edge devices or with strict latency requirements
  • You want the quality of a large model but the speed and cost of a small one
  • You need a model you fully control, with no API dependency

Often you'll do both: distill knowledge from a powerful teacher model into a synthetic dataset, then finetune a small open-source model on that dataset.

The Three Finetuning Approaches

Not all finetuning is created equal. Here are the three main approaches, ranked from most to least resource-intensive:

Full Finetuning

This updates every single parameter in the model. A 7B-parameter model has 7 billion weights, and full finetuning adjusts all of them.

Pros:

  • Maximum learning capacity - the model can adapt deeply to your data
  • Best results when you have a lot of training data (10K+ examples)

Cons:

  • Requires enormous VRAM: a 7B model needs ~60GB just for training (optimizer states, gradients, activations)
  • Slow: hours to days even on expensive hardware
  • Risk of catastrophic forgetting - the model can "forget" general knowledge while learning your task
  • Produces a full-size model copy (14GB+ for a 7B model)

When to use: Almost never for most developers. Reserved for when you have massive datasets, big budgets, and need every last drop of performance.

LoRA (Low-Rank Adaptation)

LoRA is the technique that democratized finetuning. Instead of updating all parameters, it freezes the original model and injects small trainable matrices (called "adapters") into specific layers.

Think of it this way: the base model is a textbook, and LoRA adds sticky notes in the margins. The textbook stays unchanged, but the sticky notes modify the model's behavior for your task.
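
Under the hood: for a frozen weight matrix W of shape (d, k), LoRA learns two small matrices B (d × r) and A (r × k), with the rank r much smaller than d or k, and the effective weight becomes W + (alpha/r) * BA. A back-of-the-envelope sketch of the parameter savings (the 4096×4096 shape is just an illustrative attention projection):

d, k, r = 4096, 4096, 16           # illustrative projection size, LoRA rank 16
full_params = d * k                # parameters touched by full finetuning
lora_params = d * r + r * k        # parameters in the trainable B and A matrices
print(full_params, lora_params, round(lora_params / full_params, 4))
# 16777216 131072 0.0078  -> under 1% of that matrix is trainable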

Pros:

  • Dramatically less VRAM: a 7B model can be finetuned with ~16GB VRAM
  • Fast: minutes to a couple of hours on a single GPU
  • Produces tiny adapter files (10-100MB) instead of full model copies
  • You can merge the adapter back into the base model for deployment, or swap adapters for different tasks
  • Minimal catastrophic forgetting - the base weights don't change

Cons:

  • Slightly less expressive than full finetuning (rarely matters in practice)
  • Quality depends on choosing the right hyperparameters (rank, alpha, target modules)

When to use: This is the default choice for 90% of finetuning tasks. Start here.

QLoRA (Quantized LoRA)

QLoRA combines LoRA with 4-bit quantization. The base model is loaded in 4-bit precision (shrinking a 7B model from ~14GB to ~4GB in memory), and LoRA adapters are trained on top in full precision.

Pros:

  • Even less VRAM: finetune a 7B model with 6-8GB VRAM (a consumer GPU like an RTX 3060)
  • Finetune 13B-34B models on a single 24GB GPU (RTX 3090/4090), and 70B models on a single 48-80GB card (A100)
  • Results are surprisingly close to full LoRA

Cons:

  • Slightly slower than standard LoRA (quantization/dequantization overhead)
  • Small accuracy loss from quantization (usually negligible)

When to use: When your GPU doesn't have enough VRAM for standard LoRA, or when you want to finetune larger models without renting expensive multi-GPU setups.
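
Outside of Unsloth, the same 4-bit setup is available directly through Hugging Face transformers and bitsandbytes. A minimal loading sketch (the model name and dtype are just example choices):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the forward/backward pass in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")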

Quick Comparison

|  | Full Finetune | LoRA | QLoRA |
| --- | --- | --- | --- |
| VRAM for 7B model | ~60GB | ~16GB | ~6GB |
| Training speed | Hours to days | Minutes to hours | Minutes to hours |
| Output size | 14GB+ (full copy) | 10-100MB (adapter) | 10-100MB (adapter) |
| Quality | Best | Near-best | Very good |
| Min. GPU | A100 80GB | RTX 3090 / A100 40GB | RTX 3060 12GB |
| Best for | Large orgs, big data | Most use cases | Budget/consumer GPUs |

Building Your Dataset

Your model is only as good as its training data. The dataset is where most finetuning projects succeed or fail, so let's get this right.

Dataset Formats

There are two dominant formats in the finetuning world:

Alpaca Format (Single-Turn)

Best for instruction-following tasks. Each example has an instruction, optional input, and expected output:

{
  "instruction": "Convert this SQL query to a MongoDB aggregation pipeline",
  "input": "SELECT department, AVG(salary) FROM employees GROUP BY department HAVING AVG(salary) > 50000",
  "output": "db.employees.aggregate([\n  { $group: { _id: \"$department\", avgSalary: { $avg: \"$salary\" } } },\n  { $match: { avgSalary: { $gt: 50000 } } },\n  { $project: { department: \"$_id\", avgSalary: 1, _id: 0 } }\n])"
}

ShareGPT / Conversational Format (Multi-Turn)

Best for chatbot-style interactions. Each example is a multi-turn conversation:

{
  "conversations": [
    {"from": "human", "value": "I have a Python FastAPI app and I'm getting CORS errors when calling it from my React frontend."},
    {"from": "gpt", "value": "You need to add CORS middleware to your FastAPI app. Here's how:\n\n```python\nfrom fastapi.middleware.cors import CORSMiddleware\n\napp.add_middleware(\n    CORSMiddleware,\n    allow_origins=[\"http://localhost:3000\"],\n    allow_credentials=True,\n    allow_methods=[\"*\"],\n    allow_headers=[\"*\"],\n)\n```\n\nReplace `http://localhost:3000` with your actual frontend URL. In production, avoid using `*` for origins."},
    {"from": "human", "value": "What if I have multiple frontend URLs?"},
    {"from": "gpt", "value": "Pass them as a list:\n\n```python\nallow_origins=[\n    \"http://localhost:3000\",\n    \"https://myapp.com\",\n    \"https://staging.myapp.com\",\n]\n```\n\nEach origin that needs access should be explicitly listed."}
  ]
}

How Many Examples Do You Need?

This depends on the complexity of your task:

| Task Type | Examples Needed | Notes |
| --- | --- | --- |
| Format compliance (JSON output, specific template) | 50-200 | Model already knows the content, just needs the format |
| Tone/style transfer | 200-500 | Company voice, technical writing style |
| Domain knowledge | 500-2,000 | Medical, legal, domain-specific terminology |
| Complex reasoning | 2,000-10,000 | Multi-step tasks, code generation |
| General-purpose assistant | 10,000+ | Broad coverage across many topics |

Creating Your Dataset: Three Approaches

1. Manual Curation

Write examples by hand or adapt existing data. This gives you the highest quality but is slow.

Tip: Start with 50-100 hand-crafted examples. Finetune on those, test the model, identify failure cases, and write more examples targeting those weaknesses. This iterative approach is much more efficient than trying to write thousands of examples upfront.

2. Synthetic Data Generation (Distillation)

Use a powerful model (GPT-4, Claude, DeepSeek-R1) to generate training examples. This is the most common approach and bridges finetuning with distillation:

import openai

client = openai.OpenAI()

system_prompt = """You are a senior software engineer.
Given a coding question, provide a clear, concise answer
with a working code example. Use Python unless otherwise specified."""

questions = [
    "How do I implement rate limiting in Flask?",
    "How do I set up WebSocket connections in FastAPI?",
    "How do I write a custom pytest fixture?",
    # ... hundreds more questions
]

dataset = []
for question in questions:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question}
        ],
        temperature=0.7
    )

    dataset.append({
        "instruction": question,
        "input": "",
        "output": response.choices[0].message.content
    })

Quality filtering: Not every generated example will be good. Filter by the criteria below; a minimal filtering sketch follows the list:

  • Length (too short = probably incomplete)
  • Format compliance (does it actually contain code?)
  • Deduplication (remove near-duplicates)
  • Optionally: have a second model score the outputs (rejection sampling)
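
A minimal filtering pass over the generated Alpaca-format records might look like this (it reuses the dataset list from the generation script above; the length threshold and the code-fence check are illustrative heuristics, not fixed rules):

import json

seen = set()
filtered = []
for example in dataset:
    output = example["output"].strip()
    if len(output) < 100:                       # too short: probably incomplete
        continue
    if "```" not in output:                     # we expect a code block in this dataset
        continue
    key = example["instruction"].strip().lower()
    if key in seen:                             # drop exact-duplicate instructions
        continue                                # (near-duplicates need fuzzy or embedding matching)
    seen.add(key)
    filtered.append(example)

print(f"Kept {len(filtered)} / {len(dataset)} examples")

with open("dataset.jsonl", "w") as f:
    for example in filtered:
        f.write(json.dumps(example) + "\n")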

3. Existing Public Datasets

Great for getting started or supplementing your own data. Browse thousands of instruction-tuning and conversational datasets on Hugging Face Datasets.

Preparing Your Dataset File

Most finetuning tools expect a JSONL file (one JSON object per line):

# dataset.jsonl
{"instruction": "What is dependency injection?", "input": "", "output": "Dependency injection is a design pattern..."}
{"instruction": "Convert this to async", "input": "def fetch_data(): ...", "output": "async def fetch_data(): ..."}

Or for conversational format:

# dataset.jsonl
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}

Always split your data: Keep 10-15% as a validation set to monitor for overfitting during training.
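
With Hugging Face datasets this split is a single call (15% matches the guideline above; the file name is the one used throughout this guide):

from datasets import load_dataset

dataset = load_dataset("json", data_files="dataset.jsonl", split="train")
splits = dataset.train_test_split(test_size=0.15, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]
print(len(train_dataset), "training examples,", len(eval_dataset), "validation examples")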

The Finetuning Toolbox

The ecosystem has matured rapidly. Here are the tools the community actually uses and loves:

Unsloth - The Speed King

Unsloth (35K+ GitHub stars) has become the go-to tool for most developers. It's a drop-in optimization layer that makes finetuning 2x faster while using 70% less memory - with zero accuracy loss.

Why developers love it:

  • Works on a single GPU (no multi-GPU headaches)
  • Free and open-source
  • Supports Llama, Mistral, Gemma, Qwen, Phi, and most popular architectures
  • Has ready-made Google Colab notebooks - you can finetune without owning a GPU
  • The API is dead simple

Here's a complete Unsloth finetuning script:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# 1. Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,       # QLoRA
)

# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                     # LoRA rank
    lora_alpha=32,            # LoRA scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj",
                    "o_proj", "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # memory optimization
)

# 3. Load your dataset
dataset = load_dataset("json", data_files="dataset.jsonl", split="train")

# 4. Render each example into a "text" field using the model's chat template
def format_example(example):
    example["text"] = tokenizer.apply_chat_template(
        [
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["output"]},
        ],
        tokenize=False,
    )
    return example

dataset = dataset.map(format_example)

# 5. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        output_dir="./output",
        save_strategy="epoch",
    ),
)

trainer.train()

# 6. Save - choose your format
model.save_pretrained("my-finetuned-model")        # LoRA adapter only
model.save_pretrained_merged("merged-model", tokenizer)  # Full merged model
model.save_pretrained_gguf("model.gguf", tokenizer)      # GGUF for llama.cpp

That's it. With ~500 quality examples, this typically trains in 20-40 minutes on a single GPU and produces a model noticeably better at your task than the base model.
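
Before moving on, it's worth sanity-checking the adapter with a quick generation. A minimal sketch using Unsloth's inference helper (the prompt is just an example; it reuses the model and tokenizer from the script above):

FastLanguageModel.for_inference(model)  # switch the Unsloth model into inference mode

messages = [{"role": "user", "content": "How do I add CORS middleware to a FastAPI app?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids=inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))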

Axolotl - The Swiss Army Knife

Axolotl (8K+ GitHub stars) is the power user's choice. It's more configurable than Unsloth and supports advanced features like multi-GPU training (DeepSpeed, FSDP) and preference tuning (DPO, ORPO, reward modeling).

Why developers love it:

  • Everything is configured via a single YAML file
  • Supports every finetuning technique (full, LoRA, QLoRA, DPO, ORPO, reward modeling)
  • Enterprise-ready: used by several AI companies in production
  • Excellent documentation and active Discord community

Example Axolotl config (config.yml):

base_model: meta-llama/Llama-3.2-3B-Instruct
model_type: LlamaForCausalLM
load_in_4bit: true

adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj

datasets:
  - path: my_dataset.jsonl
    type: alpaca

sequence_len: 2048
micro_batch_size: 4
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_torch
lr_scheduler: cosine
warmup_steps: 10

output_dir: ./output

Then run: accelerate launch -m axolotl.cli.train config.yml

LLaMA-Factory - The Zero-Code Option

LLaMA-Factory (45K+ GitHub stars) is the most popular finetuning tool overall, especially for its web UI that lets you configure and launch training runs without writing any code.

Why developers love it:

  • Beautiful web interface - point and click to finetune
  • Supports 100+ model architectures
  • Built-in dataset management and preprocessing
  • One-click deployment after training

# Install
pip install llamafactory

# Launch the web UI
llamafactory-cli webui

# Or train via CLI
llamafactory-cli train config.yaml

Which Tool Should You Pick?

| Scenario | Recommended Tool |
| --- | --- |
| First time finetuning, want it simple | Unsloth (Colab notebook) |
| Single GPU, want maximum speed | Unsloth |
| Need a web UI, don't want to write code | LLaMA-Factory |
| Multi-GPU, advanced techniques (DPO/RLHF) | Axolotl |
| Production pipeline, need YAML config | Axolotl |

My recommendation for beginners: Start with Unsloth. Open one of their Google Colab notebooks, swap in your dataset, and run. You'll have a finetuned model in 30 minutes with zero setup.

Understanding LoRA Hyperparameters

When finetuning with LoRA, there are a few key numbers you need to understand:

Rank (r)

The rank controls how much the LoRA adapters can learn. Higher rank = more capacity = larger adapter.

| Rank | Adapter Size (7B model) | Best For |
| --- | --- | --- |
| 8 | ~20MB | Simple format/style changes |
| 16 | ~40MB | Most tasks (good default) |
| 32 | ~80MB | Complex domain knowledge |
| 64 | ~160MB | Near full-finetune quality |
| 128+ | ~320MB+ | Rarely needed |

Start with r=16 and increase only if your validation loss plateaus.

Alpha (lora_alpha)

The alpha parameter controls the scaling of the LoRA updates. A common rule of thumb: set alpha to 2× the rank.

  • r=16, alpha=32 - standard, works well
  • r=16, alpha=16 - more conservative updates (if training is unstable)

Learning Rate

The learning rate controls how aggressively the model updates its weights. Too high and it overshoots; too low and it barely learns.

| Method | Recommended LR |
| --- | --- |
| Full finetune | 1e-5 to 5e-5 |
| LoRA | 1e-4 to 3e-4 |
| QLoRA | 1e-4 to 2e-4 |

Start with 2e-4 for LoRA/QLoRA. If training loss spikes or oscillates, lower it.

Epochs

How many times the model sees your entire dataset.

  • 1-3 epochs for large datasets (10K+ examples)
  • 3-5 epochs for medium datasets (1K-10K examples)
  • 5-10 epochs for small datasets (< 1K examples)

Watch for overfitting: If training loss keeps dropping but validation loss starts climbing, you've trained too long. Use early stopping or just use fewer epochs.
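
In TRL/transformers, early stopping is a callback plus a few TrainingArguments flags. A minimal sketch, assuming you pass your validation split as eval_dataset (argument names vary slightly across TRL/transformers versions; on older releases eval_strategy is spelled evaluation_strategy):

from transformers import EarlyStoppingCallback, TrainingArguments
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=10,                  # upper bound; early stopping usually ends sooner
    eval_strategy="epoch",                # evaluate on the validation split every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,          # restore the checkpoint with the best eval loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=args,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # stop after 2 epochs with no improvement
)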

Running Locally vs. Cloud

Local GPU Training

If you have a gaming GPU, you might be able to finetune locally:

| GPU | VRAM | Can Finetune (QLoRA) |
| --- | --- | --- |
| RTX 3060 | 12GB | Up to 7B models |
| RTX 3090 | 24GB | Up to 13B models |
| RTX 4090 | 24GB | Up to 13B models, faster |
| 2× RTX 3090 | 48GB | Up to 34B models |

Setup (Linux):

# Install CUDA toolkit (Ubuntu/Debian)
sudo apt install nvidia-cuda-toolkit

# Create a virtual environment
python -m venv finetune-env
source finetune-env/bin/activate

# Install dependencies
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install unsloth transformers datasets trl

Cloud GPU Services

For most developers, renting a cloud GPU for a few hours is the practical choice. Here's what's available:

| Service | GPU | Price/Hour | Best For |
| --- | --- | --- | --- |
| Google Colab | T4 (free), A100 (Pro) | Free / $10/mo | Getting started, experiments |
| RunPod | H100 80GB | ~$3.49/hr | Serious training, great UI |
| Lambda Labs | H100 80GB | ~$2.49/hr | Research, on-demand |
| Vast.ai | Varies (marketplace) | ~$1.50/hr (H100) | Budget, if you don't mind variability |
| Together AI | Managed | ~$5/hr | Finetuning API, no GPU management |
| Modal | H100, A100 | Pay-per-second | Serverless, great DX |

My recommendation: Start with Google Colab (free T4 GPU - enough for 3B-7B QLoRA). When you need more power, RunPod offers the best balance of price, reliability, and user experience.

Finetuning-as-a-Service

If you don't want to manage GPUs at all, several platforms offer managed finetuning APIs (OpenAI and Together AI, for example). These are the simplest option but give you the least control (and your data goes to a third party).

Model Distillation: Making Big Models Small

Now let's talk about the other half of this guide: distillation. The idea is simple - take the intelligence of a 70B or 400B parameter model and compress it into something you can actually deploy.

How Distillation Works

The core concept:

  1. Teacher model: A large, powerful model (GPT-4o, Claude, Llama 3.1 405B, DeepSeek-R1)
  2. Student model: A smaller model you want to make smarter (Llama 3.2 3B, Qwen 2.5 7B, Phi-4)
  3. Knowledge transfer: The teacher generates high-quality outputs, and you finetune the student on those outputs

There are two main approaches:

Approach 1: Synthetic Data Distillation

This is the most common and practical approach. You use the teacher to generate a training dataset, then finetune the student on it. We already covered this in the dataset section - it's the same process:

Teacher generates answers → Filter for quality → Finetune student on the data

The key insight: when training on teacher outputs, the student doesn't just learn the answers - it learns the teacher's reasoning patterns. This is why distilled models often punch far above their weight class.

Chain-of-Thought Distillation: A particularly powerful technique. Ask the teacher to show its reasoning step-by-step, then train the student on those reasoning traces:

# Generate training data with chain-of-thought reasoning
prompt = """Solve this step by step, showing your reasoning:

Question: A company has 3 servers. Each handles 1000 requests/second.
They want to add caching that reduces load by 40%.
How many requests per second can the system handle after caching?

Think through this step by step, then give the final answer."""

# Teacher response includes reasoning:
# "Step 1: Current total capacity = 3 × 1000 = 3,000 req/s
#  Step 2: Caching reduces load by 40%, meaning each server
#           only needs to handle 60% of requests
#  Step 3: Effective capacity = 3,000 / 0.6 = 5,000 req/s
#  Answer: 5,000 requests per second"

The student model trained on this data learns not just the answer but how to reason - and generalizes that reasoning to new problems.

This is exactly how the DeepSeek-R1 distilled models were created. DeepSeek used their massive R1 model (671B parameters with mixture-of-experts) as a teacher to generate 800,000 reasoning samples, then finetuned smaller models (1.5B to 70B) on that data. The distilled 7B model achieved reasoning performance rivaling much larger models.

Approach 2: Logit-Based Distillation

This is a more technical approach where the student learns to match the teacher's probability distributions (soft labels), not just its final outputs. It's more efficient but requires access to the teacher's logits (output probabilities), which means you need to run the teacher model yourself.
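
The standard objective here is a temperature-scaled KL divergence between the two distributions. A minimal PyTorch sketch of that loss term (the temperature value is illustrative):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean matches the mathematical definition of KL; T^2 rescales the gradients
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2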

Frameworks for this:

  • DistillKit by Arcee AI - supports both logit-based and synthetic data distillation
  • EasyDistill - Part of MS-Swift, designed for on-device model compression

For most developers, synthetic data distillation (Approach 1) is the way to go. It's simpler, doesn't require running the teacher model locally, and the results are excellent.

A Complete Distillation Pipeline

Let's walk through a real-world example: distilling GPT-4o's coding knowledge into a Llama 3.2 3B model.

Step 1: Generate seed questions

# Create a diverse set of coding questions
import random

topics = ["Python", "JavaScript", "SQL", "Docker", "Git", "APIs", "Testing"]
difficulties = ["beginner", "intermediate", "advanced"]

templates = [
    "How do I handle errors in {topic}?",
    "What's the best way to structure a {topic} project?",
    "Give me a {difficulty}-level {topic} exercise with a worked solution.",
    # ... add more templates for variety
]

# Expand the templates into concrete questions (str.format ignores unused keywords)
seed_questions = sorted({
    template.format(topic=topic, difficulty=difficulty)
    for template in templates
    for topic in topics
    for difficulty in difficulties
})
random.shuffle(seed_questions)

Step 2: Generate teacher responses

import openai
import json

client = openai.OpenAI()
dataset = []

for question in seed_questions:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert software engineer. "
             "Provide clear, practical answers with working code examples. "
             "Think through problems step by step."},
            {"role": "user", "content": question}
        ],
        temperature=0.7,
        max_tokens=2048,
    )

    answer = response.choices[0].message.content

    # Basic quality filter
    if len(answer) > 100 and "```" in answer:  # Has substance and code
        dataset.append({
            "instruction": question,
            "input": "",
            "output": answer
        })

# Save dataset
with open("coding_dataset.jsonl", "w") as f:
    for item in dataset:
        f.write(json.dumps(item) + "\n")

print(f"Generated {len(dataset)} training examples")

Step 3: Quality filtering with a second model

# Use a second model to score the teacher's outputs
scored_dataset = []

for item in dataset:
    score_response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for scoring
        messages=[
            {"role": "system", "content": "Rate the following coding answer "
             "from 1-10 on accuracy, completeness, and clarity. "
             "Reply with just the number."},
            {"role": "user", "content": f"Question: {item['instruction']}\n\n"
             f"Answer: {item['output']}"}
        ],
        temperature=0,
    )

    try:
        score = int(score_response.choices[0].message.content.strip())
        if score >= 7:  # Only keep high-quality examples
            scored_dataset.append(item)
    except ValueError:
        continue

print(f"Kept {len(scored_dataset)} / {len(dataset)} examples after filtering")

Step 4: Finetune the student model

Use the Unsloth script from earlier, but point it at your distilled dataset:

dataset = load_dataset("json", data_files="coding_dataset.jsonl", split="train")

That's the entire pipeline. In practice, you'd generate 1,000-5,000 examples, filter to the best 70-80%, and finetune for 2-3 epochs. Total cost: maybe $20-50 in API calls for dataset generation, plus $5-10 in GPU time for training.

Evaluating Your Model

After training, you need to check that your model actually got better. Here's a practical evaluation approach:

1. Manual Spot-Checking

Run 20-30 test prompts through both the base model and your finetuned model. Compare outputs side by side. This is surprisingly effective for catching obvious issues.
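
One low-effort way to do this is to run the same prompts through both models and print the answers side by side. A sketch using the transformers text-generation pipeline (the model paths are examples; "./merged-model" assumes you merged the LoRA adapter as in the export step):

from transformers import pipeline

test_prompts = [
    "How do I add CORS middleware to a FastAPI app?",
    # ... 20-30 prompts that mirror your real use cases
]

base = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct", device_map="auto")
tuned = pipeline("text-generation", model="./merged-model", device_map="auto")

for prompt in test_prompts:
    messages = [{"role": "user", "content": prompt}]
    base_reply = base(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]
    tuned_reply = tuned(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]
    print(f"### {prompt}\n--- base ---\n{base_reply}\n--- finetuned ---\n{tuned_reply}\n")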

2. Automated Evaluation

Use a strong model (GPT-4o, Claude) as a judge:

def evaluate_response(question, base_response, finetuned_response):
    judge_prompt = f"""Compare these two responses to the question:

Question: {question}

Response A (base model): {base_response}

Response B (finetuned model): {finetuned_response}

Which is better and why? Rate each from 1-10 on:
- Accuracy
- Completeness
- Clarity
- Formatting"""

    # Send the comparison to a strong judge model and return its verdict
    # (reuses the OpenAI client from the distillation pipeline above)
    judge = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return judge.choices[0].message.content

3. Task-Specific Metrics

If your task has measurable outcomes, use them (a format-compliance sketch follows the list):

  • Code generation: Pass rate on test cases
  • Classification: Accuracy, F1 score
  • Format compliance: Percentage of outputs matching your schema
  • Factual Q&A: Match rate against known answers
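
For example, format compliance for a JSON-output task can be scored by simply attempting to parse each response (the required keys and the sample responses are illustrative):

import json

REQUIRED_KEYS = {"department", "avgSalary"}   # whatever your schema demands

def is_compliant(response_text):
    """True if the output parses as JSON and contains the required keys."""
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)

responses = [
    '{"department": "engineering", "avgSalary": 75000}',
    "Sure! Here is the JSON you asked for: ...",
]
compliance_rate = sum(is_compliant(r) for r in responses) / len(responses)
print(f"Format compliance: {compliance_rate:.1%}")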

Common Problems and Fixes

| Problem | Symptom | Fix |
| --- | --- | --- |
| Overfitting | Perfect on training data, bad on new inputs | Reduce epochs, add more data, increase dropout |
| Underfitting | Model barely changed from base | Increase epochs, increase rank, check data format |
| Catastrophic forgetting | Lost general capabilities | Lower the learning rate, use fewer epochs, mix in general data |
| Repetitive outputs | Model repeats phrases/sentences | Lower temperature at inference, add more diverse training data |
| Wrong format | Output doesn't match expected structure | Add more format-specific examples, use stricter system prompts |

Putting It All Together: A Step-by-Step Checklist

Here's the complete workflow, from idea to deployed model:

1. Define your task clearly

  • What inputs will the model receive?
  • What outputs do you expect?
  • What does "success" look like?

2. Create your dataset

  • Start with 100-500 examples
  • Use synthetic generation from a teacher model if needed
  • Filter for quality
  • Split into 85% train / 15% validation

3. Choose your base model

  • 3B parameters (Llama 3.2 3B, Phi-4-mini): Fast, runs on laptops, good for simple tasks
  • 7B parameters (Qwen 2.5 7B, Llama 3.1 8B): Best balance of quality and speed
  • 14B+ parameters (Qwen 2.5 14B): When you need more capability and have the hardware

4. Finetune with QLoRA + Unsloth

  • Start with r=16, alpha=32, lr=2e-4, epochs=3
  • Monitor training and validation loss
  • Save checkpoints every epoch

5. Evaluate and iterate

  • Compare against base model
  • Identify failure cases
  • Add targeted examples and retrain

6. Export and deploy

  • Merge LoRA into base model (or use GGUF for local deployment)
  • Deploy via vLLM, llama.cpp, or Ollama (see the quick vLLM sketch after this list)
  • Monitor quality in production
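
As a quick illustration of that last step: a merged model can be served with vLLM's OpenAI-compatible server and queried with the regular OpenAI client. A minimal sketch (the port is vLLM's default and the model path is an example):

# In another terminal: vllm serve ./merged-model --port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="./merged-model",  # vLLM registers the model under the name/path it was launched with
    messages=[{"role": "user", "content": "How do I write a custom pytest fixture?"}],
    temperature=0.2,
)
print(response.choices[0].message.content)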

Finetuning and distillation aren't magic - they're just training with focus. You're taking a general-purpose model and sharpening it for your specific use case. The tools are free, the cloud GPUs are cheap, and the barrier to entry has never been lower. Start small: take a 3B model, finetune it on 200 examples of your task, and compare it to the base model. You'll be surprised how much better it gets - and once you see the results, you'll understand why everyone's doing this.
