A Group of College Students Distilled Claude, GPT, and Gemini Into Open-Source Models for $52

TeichAI, a four-person non-profit, generated 250 reasoning samples from Claude Opus 4.5, fine-tuned open-weight models on the result, and racked up 67,000 downloads. The legal and technical implications are more interesting than the benchmarks.

Four people. 250 training samples. $52.30 in API costs. That is what it took for TeichAI, a non-profit group of college students, to distill Claude Opus 4.5's reasoning patterns into open-weight models that have been downloaded over 67,000 times on Hugging Face.

The group has published 102 models and 41 datasets, systematically distilling outputs from Claude 4.5 Opus, GPT-5.2, Gemini 3 Pro, DeepSeek V3.2, and Kimi K2 into smaller open-source bases like Qwen3, GLM-4.7, Nemotron, and Devstral. Their tagline is "Collect. Distill. Release." They mean it literally.

How It Works

TeichAI's process is not knowledge distillation in the traditional ML sense - there are no logits, no teacher-student training loops, no access to model internals. What they do is closer to behavior cloning through synthetic data.

The pipeline:

  1. Generate prompts - Using PromptGen, an open-source tool one of their members built, they generate hundreds of complex reasoning problems spanning coding, science, and research tasks
  2. Collect responses - Send those prompts to a frontier model's API (Claude Opus 4.5, GPT-5.2, etc.) with reasoning effort set to high, capturing the full chain-of-thought output including <think> tags
  3. Package the dataset - The raw prompt-response pairs become a JSONL training dataset. Their Claude dataset contains 250 samples totaling 2.13 million tokens, generated for $52.30 (steps 2 and 3 are sketched in code after this list)
  4. Fine-tune - Using Unsloth for supervised fine-tuning (SFT), they train open-weight base models on this data. Convergence happens fast - 400 to 600 training steps
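
To make the middle of that pipeline concrete, here is a minimal Python sketch of steps 2 and 3: collecting high-reasoning responses from an OpenRouter-compatible endpoint and packaging them as JSONL. TeichAI's own DataGen tool is a TypeScript CLI, so this is an illustrative stand-in; the model slug, prompt file, and reasoning parameter are assumptions, not their exact configuration.

```python
# Sketch of steps 2 and 3: collect chain-of-thought responses and write JSONL.
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # any OpenAI-compatible endpoint works
    api_key="YOUR_API_KEY",
)

with open("prompts.txt") as f:                # one reasoning problem per line
    prompts = [line.strip() for line in f if line.strip()]

with open("distill_dataset.jsonl", "w") as out:
    for prompt in prompts:
        resp = client.chat.completions.create(
            model="anthropic/claude-opus-4.5",             # illustrative model slug
            messages=[{"role": "user", "content": prompt}],
            extra_body={"reasoning": {"effort": "high"}},  # provider-specific knob
        )
        answer = resp.choices[0].message.content           # may include <think> traces
        out.write(json.dumps({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": answer},
            ]
        }) + "\n")
```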

The key insight is that they are not trying to make a small model as smart as Claude. They are trying to make it reason like Claude - adopting the structured thinking patterns, the step-by-step problem decomposition, the self-correction habits that show up in extended thinking outputs.

Developer armand0e, one of the four members, put it plainly in a Hugging Face discussion: "not a knowledge distillation or a full scale distillation by any means." It is reasoning pattern transfer on a shoestring budget.
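
For step 4, a fine-tuning run of this shape is a short script with Unsloth. The sketch below is an assumption-laden illustration rather than TeichAI's actual configuration: the base model, LoRA settings, and hyperparameters are placeholders, the 500-step budget simply lands inside the 400 to 600 step convergence window they report, and the exact SFTTrainer argument layout varies across trl versions.

```python
# Minimal Unsloth SFT sketch: LoRA fine-tuning on the collected JSONL dataset.
# Hyperparameters are illustrative, not TeichAI's published configuration.
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-14B",   # illustrative open-weight base
    max_seq_length=16384,          # room for long <think> traces
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("json", data_files="distill_dataset.jsonl", split="train")

def to_text(example):
    # Flatten each prompt/response pair into the base model's chat template.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=16384,
    args=TrainingArguments(
        max_steps=500,                     # inside the reported 400-600 step window
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```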

The Model Inventory

TeichAI has distilled nearly every major frontier model into a range of open-weight architectures:

| Model | Base | Parameters | Source | Monthly Downloads |
|---|---|---|---|---|
| GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill | GLM-4.7 | 30B | Claude 4.5 Opus | 67,330 |
| Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill | Qwen3 | 14B | Claude 4.5 Opus | 8,740 |
| Devstral-Small-2505-Deepseek-V3.2-Speciale-Distill | Devstral | 24B | DeepSeek V3.2 | 5,990 |
| Qwen3-32B-Kimi-K2-Thinking-Distill | Qwen3 | 32B | Kimi K2 | 4,920 |
| Qwen3-4B-Thinking-2507-GPT-5.1-Codex-Max-Distill | Qwen3 | 4B | GPT-5.1 | 1,400 |

The Claude 4.5 Opus collection alone spans 15 releases across GLM, Qwen3 (4B, 8B, 14B), and Nemotron architectures. A GPT 5.2 collection is in progress.

Every model ships in GGUF format with quantizations from 2-bit to 16-bit, designed for local deployment on consumer hardware. The 14B Qwen3 variant fits in 9 GB at Q4 quantization - comfortably within the reach of a 16 GB laptop.
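
As a rough sanity check on that figure: 14 billion parameters at roughly 4.5 bits per weight is about 7.9 GB of weights, with KV cache and runtime overhead accounting for the rest. Running the GGUF locally might look like the following llama-cpp-python sketch; the file name and quant label are illustrative, so check the repository for the exact artifacts.

```python
# Local-inference sketch with llama-cpp-python; the GGUF file name is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-Q4_K_M.gguf",
    n_ctx=8192,        # leave room for long <think> traces
    n_gpu_layers=-1,   # offload all layers if a GPU is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Walk me through the birthday paradox step by step."}],
    max_tokens=1024,
)
print(out["choices"][0]["message"]["content"])
```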

What the Benchmarks Actually Show

The GLM-4.7 Claude distillation is the only model with published benchmarks, evaluated at 4-bit quantization using LM Evaluation Harness:

| Benchmark | Base Model | Distilled | Change |
|---|---|---|---|
| GPQA Diamond | 0.2626 | 0.2929 | +11.5% |
| Winogrande | 0.4688 | 0.5043 | +7.6% |
| MMLU | 0.2295 | 0.2407 | +4.9% |
| IFEval | 0.1091 | 0.1128 | +3.4% |
| ARC Challenge | 0.2244 | 0.2176 | -3.0% |
| HellaSwag | 0.2578 | 0.2567 | -0.4% |
| TruthfulQA | 0.4676 | 0.4668 | -0.2% |

Four wins and three losses in the head-to-head against the base model, with all three losses under a percentage point. The average improvement is modest - about one percentage point across the board. The GPQA Diamond result is the standout: an 11.5% relative improvement on graduate-level reasoning questions, which is exactly where you would expect reasoning distillation to have the most impact.

But let us be honest about what these numbers mean. An MMLU score of 0.24 is not competitive with frontier models. A GPQA score of 0.29 is not either. The distillation improves the base model's reasoning ability in measurable ways, but it does not transform a 30B open-weight model into Claude. It transfers style and structure, not capability.

MMLU subject-level breakdowns show where the gains concentrate: anatomy (+6.7%), clinical knowledge (+5.3%), philosophy (+4.1%), jurisprudence (+4.6%). These are domains where structured reasoning - the thing Claude's thinking traces demonstrate - matters most.
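
The evaluation setup is straightforward to reproduce with the harness's Python API. A sketch, assuming the Hugging Face model ID and using a trimmed task list; task names and the 4-bit loading flag may differ slightly across harness versions.

```python
# Re-run a subset of the reported benchmarks with lm-evaluation-harness.
# The model ID, task names, and 4-bit flag are assumptions to verify locally.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=TeichAI/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill,"
        "load_in_4bit=True"
    ),
    tasks=["winogrande", "mmlu", "ifeval"],
    batch_size=1,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```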

What Users Report

The 67,000 downloads have generated enough real-world feedback to paint an honest picture.

The positive case: users report the model produces more structured, thoughtful responses than the base GLM-4.7. The reasoning traces give it a distinctive "Claude-like" quality in how it approaches problems - laying out assumptions, considering edge cases, self-correcting.

The negative case is instructive. Multiple users on Hugging Face reported the model producing extended, irrelevant monologues when given simple inputs. One user described "verbal diarrhea" - the model launching into lectures about unrelated topics when asked basic questions. The cause is straightforward: the training data consists entirely of complex reasoning problems with long, detailed responses. The model learned to always produce that kind of output, even when a two-word answer would suffice.

Coding performance is another weak spot. One user reported the model "does pretty well at first when coding but it always misses many small silly things." The developer confirmed there is no coding-specific data in the training set: "Not any data in the dataset to advance or reinforce agentic coding capability." The model likely regressed in areas where the base model was competent because the fine-tuning overwrote those behaviors.

TeichAI is planning a v2 that includes coding data and simpler Q&A pairs to address both issues.
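
One plausible shape for that fix is simply rebalancing the training mix so the model also sees short, direct answers alongside the long traces. A hypothetical sketch; the file names and the mixing ratio are illustrative, not TeichAI's plan.

```python
# Hypothetical v2-style data mix: dilute long reasoning traces with short Q&A
# pairs so the model stops writing essays in response to two-word questions.
import json
import random

def read_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

reasoning = read_jsonl("claude_reasoning_traces.jsonl")  # the 250 long samples
short_qa = read_jsonl("short_qa_pairs.jsonl")            # brief, direct answers

# Illustrative ratio: keep every reasoning trace, add up to twice as many short pairs.
mixed = reasoning + random.sample(short_qa, min(len(short_qa), 2 * len(reasoning)))
random.shuffle(mixed)

with open("distill_dataset_v2.jsonl", "w") as out:
    for sample in mixed:
        out.write(json.dumps(sample) + "\n")
```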

The Legal Question

Here is where it gets interesting. What TeichAI is doing almost certainly violates the terms of service of every frontier model provider they are distilling from.

Anthropic's terms are explicit: using Claude outputs to train AI models is prohibited without written permission. The banned uses specifically include "general-purpose chatbots," "open-ended text generation models," and "using Claude Outputs as training targets." TeichAI's work is a textbook example of all three.

OpenAI, Google, and xAI have similar clauses. Anthropic has already enforced theirs - they revoked OpenAI's API access in August 2025 for terms of service violations, and restricted access for Chinese-controlled companies globally in September.

But enforcement against a four-person non-profit releasing free models is a different calculus than enforcement against well-funded competitors. And even if terms of service are breached, the underlying copyright claims are weak. AI outputs generally lack copyrightability because copyright requires human authorship. The student model has a completely different architecture than the teacher. Copyright protects expression, not numerical weight values. Legal analysts at Fenwick and Winston & Strawn have both concluded that terms of service violations are the strongest available claim - not copyright, not trade secrets.

The result is a situation where the practice is clearly against the rules but the rules may not be enforceable in any meaningful way. DeepSeek demonstrated this at scale. TeichAI is demonstrating it can be done by anyone with an API key and $52.

The Bigger Picture

TeichAI is not unique. They are just unusually transparent about what the entire open-source model ecosystem has been doing quietly for years.

DeepSeek's R1 was widely believed to have been trained partly on reasoning traces from OpenAI's models. Microsoft's Phi series has been called "GPT-4 distillation with extra steps." The open-source community routinely fine-tunes models on ShareGPT datasets - conversation logs scraped from ChatGPT interactions. What TeichAI did differently is publish the cost ($52.30), the dataset size (250 samples), the exact source (Claude Opus 4.5 with high reasoning), and the results (modest but real improvements).

The uncomfortable truth this reveals is that reasoning patterns - the structured thinking, self-correction, and problem decomposition that make frontier models valuable - can be partially transferred with remarkably little data. Two hundred fifty examples is not a lot. The fact that a 30B model can meaningfully improve its GPQA performance from exposure to 250 Claude reasoning traces suggests that what frontier models are doing is not magic. It is a learnable pattern, and the pattern can be taught cheaply.

This does not mean the gap between frontier and open-source models is closing. A distilled 30B model scoring 0.29 on GPQA is not competing with Claude scoring above 0.65. But it means the floor is rising. Every time a frontier model gets better at reasoning, its outputs become better training data for the open-source models chasing it. The gap may stay constant in absolute terms while shrinking in practical terms for everyday tasks.

What TeichAI Built

Beyond the models themselves, TeichAI open-sourced their tooling:

  • DataGen - A TypeScript CLI that reads prompts from a text file, sends them to any OpenRouter-compatible API, and outputs structured JSONL training data. Configurable for model, concurrency, system prompt, and reasoning effort. This is the tool they used to generate their datasets.
  • Model-Benchmark-Suite - A Streamlit UI for running lm_eval benchmarks against their models.

The entire pipeline from prompt generation to dataset creation to model training to benchmark evaluation is open source. Anyone can replicate what TeichAI did. That is arguably more significant than the models themselves.

What This Means

TeichAI's work is a proof of concept that reasoning distillation is accessible to anyone. Four students, $52, 250 examples, and a weekend is all it takes to produce a model that demonstrably reasons better than its base. The improvements are modest. The models are not frontier-competitive. The legal ground is shaky.

But the point is not whether these specific models are good enough to replace Claude. The point is that the barrier to extracting and transferring reasoning capabilities from proprietary models is essentially zero. The frontier labs are producing outputs that, by design, demonstrate their most valuable capability - structured reasoning - and anyone with API access can capture that capability and give it away for free.

For the open-source community, this is validation. For frontier labs building moats around reasoning capability, it is a problem that terms of service alone cannot solve.


About the author

AI Infrastructure & Open Source Reporter

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.