Claude Opus Reasoning Distilled Into Open 27B Model
A community fine-tune distills Claude Opus 4.6 chain-of-thought reasoning into Qwen3.5-27B via LoRA, racking up an estimated 57,000+ downloads in its first days. No benchmarks yet - but the approach raises familiar questions.

A community developer has distilled Claude Opus 4.6's reasoning patterns into an open-weight 27B model, and the AI community is paying attention. Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled - built by HuggingFace user Jackrong - uses supervised fine-tuning with LoRA on Qwen3.5-27B to replicate Claude's chain-of-thought reasoning behavior. The model collection has racked up an estimated 57,000+ downloads across 13 variants (the GGUF version alone hit 20,500) and 134 likes in its first days on HuggingFace.
Key Specs
| Spec | Value |
|---|---|
| Base Model | Qwen3.5-27B |
| Parameters | 28B |
| Distilled From | Claude Opus 4.6 reasoning traces |
| Method | SFT + LoRA (rank 64) |
| Context | 8,192 tokens |
| License | Apache 2.0 |
| Benchmarks | None published |
How It Works
Training Method
The model uses supervised fine-tuning with LoRA (rank 64) via the Unsloth framework. The key design choice is train_on_responses_only - the loss function only operates on <think> reasoning sequences and final solutions, ignoring instruction tokens. This focuses the model's learning specifically on Claude's reasoning patterns rather than general instruction following.
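The masking idea behind train_on_responses_only can be sketched in a few lines. This is a simplified illustration of the general technique, not Unsloth's actual implementation: label positions covering the instruction are set to the ignore index so cross-entropy loss is computed only over the response (the reasoning trace and final answer).

```python
# Simplified sketch of response-only loss masking (toy token IDs, not
# Unsloth's internals). Tokens before the response boundary get the
# ignore index, so only <think> reasoning + answer tokens drive the loss.
IGNORE_INDEX = -100  # the default ignore_index for PyTorch CrossEntropyLoss

def mask_instruction_tokens(input_ids, response_start):
    """Copy input_ids into labels, masking every token before the response."""
    labels = list(input_ids)
    for i in range(min(response_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Toy example: positions 0-3 are the instruction, 4+ are the response.
ids = [101, 2023, 2003, 102, 501, 502, 503]
print(mask_instruction_tokens(ids, 4))
# → [-100, -100, -100, -100, 501, 502, 503]
```

Gradient updates then flow only from the response tokens, which is what concentrates learning on the reasoning style rather than on instruction following.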
Training Data
Three datasets power the distillation:
| Dataset | Size | Content |
|---|---|---|
| Opus-4.6-Reasoning-3000x-filtered | 2,330 samples | Claude 4.6 Opus reasoning traces (refusals filtered) |
| claude-4.5-opus-high-reasoning-250x | 250 samples | High-intensity structured reasoning |
| Qwen3.5-reasoning-700x | 700 samples | Curated Qwen reasoning samples |
The total training set is modest - roughly 3,280 samples. The model learns to produce outputs in Claude's <think>...</think> format, showing its reasoning chain before delivering a final answer.
What You Lose
The base Qwen3.5-27B has a 262K native context window (extendable to 1M with YaRN), supports multimodal inputs (text, image, video), and covers 201 languages. The distilled version drops to 8,192 tokens of context, loses multimodal capabilities completely, and the creator describes it as a "preview version" with potential bugs and integration edge cases.
What To Watch
No Benchmarks
The most conspicuous absence: there are no published benchmark scores. No MMLU, no GPQA, no coding evaluations. The model card describes capabilities in qualitative terms - "modular and structured thinking," "extended context support" - but provides no numbers to verify these claims. With roughly 3,280 training samples and a LoRA rank of 64, the distillation is fairly lightweight. Whether that's enough to meaningfully transfer Claude-level reasoning into a 27B model is an open question that only benchmarks can answer.
The Distillation Debate
This model follows a pattern that started with DeepSeek's controversial use of proprietary model outputs for training. Anthropic has publicly objected to distillation of its models, calling it a violation of terms of service. A HuggingFace discussion thread already flagged the TOS issue - one commenter pointed out the creator used pre-existing public datasets rather than personally querying Claude at scale, but that distinction may not hold legally. Anthropic's terms explicitly prohibit using Claude outputs to "train or develop AI models without written permission," and the Apache 2.0 license on the dataset doesn't override those terms.
A Hacker News thread drew the same distinction - "some rando fine-tuning a model on Claude reasoning traces collected by other randos" is a different scale of operation than what Anthropic accused DeepSeek of (24,000 fraudulent accounts, 16 million+ interactions).
Community Reception
The numbers tell the story: 57,000+ downloads across 13 variants, with the GGUF quantization pulling 20,500 alone. The creator - HuggingFace user Jackrong (JIRONG), an individual contributor with 97 models and a focus on Chinese-language CoT data - has built a following in the local inference community. Seven quantized variants are available, fitting on consumer GPUs with 4-bit quantization.
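The consumer-GPU claim checks out on a back-of-envelope basis. The sketch below estimates VRAM for the weights alone, using the nominal 27B parameter count; it ignores KV cache and activation memory, so treat the figures as lower bounds:

```python
def weight_vram_gib(params_billion, bits_per_weight):
    """Rough VRAM needed to hold the weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_vram_gib(27, bits):.1f} GiB")
# 16-bit: ~50.3 GiB, 8-bit: ~25.1 GiB, 4-bit: ~12.6 GiB
```

At 4-bit, the weights fit in the 16-24 GB of VRAM common on consumer cards, with headroom left for the (short) 8K context.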
For context, DeepSeek's distilled Qwen variants used 800,000 high-quality reasoning samples and full fine-tuning. Jackrong used roughly 3,280 samples and LoRA rank 64 - orders of magnitude less data and a much lighter training approach.
See the full model page for specifications and our comparison with the base Qwen3.5-27B for a detailed analysis of what distillation adds and what it costs.