Claude Opus Reasoning Distilled Into Open 27B Model
A community fine-tune distills Claude Opus 4.6 chain-of-thought reasoning into Qwen3.5-27B via LoRA, racking up an estimated 57,000+ downloads in its first days. No benchmarks yet - but the approach raises familiar questions.

A community developer has distilled Claude Opus 4.6's reasoning patterns into an open-weight 27B model, and the AI community is paying attention. Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled - built by HuggingFace user Jackrong - uses supervised fine-tuning with LoRA on Qwen3.5-27B to replicate Claude's chain-of-thought reasoning behavior. The model collection has racked up an estimated 57,000+ downloads across 13 variants (the GGUF version alone hit 20,500) and 134 likes in its first days on HuggingFace.
Key Specs
| Spec | Value |
|---|---|
| Base Model | Qwen3.5-27B |
| Parameters | 28B |
| Distilled From | Claude Opus 4.6 reasoning traces |
| Method | SFT + LoRA (rank 64) |
| Context | 8,192 tokens |
| License | Apache 2.0 |
| Benchmarks | None published |
How It Works
Training Method
The model uses supervised fine-tuning with LoRA (rank 64) via the Unsloth framework. The key design choice is train_on_responses_only - the loss function only operates on <think> reasoning sequences and final solutions, ignoring instruction tokens. This focuses the model's learning specifically on Claude's reasoning patterns rather than general instruction following.
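The masking idea behind train_on_responses_only can be sketched in a few lines. This is a simplified illustration of the general technique, not Unsloth's actual implementation: label positions covering the instruction are set to the ignore index so cross-entropy loss is computed only over the response (the reasoning trace and final answer).

```python
# Simplified sketch of response-only loss masking (toy token IDs, not
# Unsloth's internals). Tokens before the response boundary get the
# ignore index, so only <think> reasoning + answer tokens drive the loss.
IGNORE_INDEX = -100  # the default ignore_index for PyTorch CrossEntropyLoss

def mask_instruction_tokens(input_ids, response_start):
    """Copy input_ids into labels, masking every token before the response."""
    labels = list(input_ids)
    for i in range(min(response_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Toy example: positions 0-3 are the instruction, 4+ are the response.
ids = [101, 2023, 2003, 102, 501, 502, 503]
print(mask_instruction_tokens(ids, 4))
# → [-100, -100, -100, -100, 501, 502, 503]
```

Gradient updates then flow only from the response tokens, which is what concentrates learning on the reasoning style rather than on instruction following.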
Training Data
Three datasets power the distillation:
| Dataset | Size | Content |
|---|---|---|
| Opus-4.6-Reasoning-3000x-filtered | 2,330 samples | Claude 4.6 Opus reasoning traces (refusals filtered) |
| claude-4.5-opus-high-reasoning-250x | 250 samples | High-intensity structured reasoning |
| Qwen3.5-reasoning-700x | 700 samples | Curated Qwen reasoning samples |
The total training set is modest - roughly 3,280 samples. The model learns to produce outputs in Claude's <think>...</think> format, showing its reasoning chain before delivering a final answer.
What You Lose
The base Qwen3.5-27B has a 262K native context window (extendable to 1M with YaRN), supports multimodal inputs (text, image, video), and covers 201 languages. The distilled version drops to 8,192 tokens of context, loses multimodal capabilities completely, and the creator describes it as a "preview version" with potential bugs and integration edge cases.
What To Watch
No Benchmarks
The most conspicuous absence: there are no published benchmark scores. No MMLU, no GPQA, no coding evaluations. The model card describes capabilities in qualitative terms - "modular and structured thinking," "extended context support" - but provides no numbers to verify these claims. With roughly 3,280 training samples and a LoRA rank of 64, the distillation is fairly lightweight. Whether that's enough to meaningfully transfer Claude-level reasoning into a 27B model is an open question that only benchmarks can answer.
The Distillation Debate
This model follows a pattern that started with DeepSeek's controversial use of proprietary model outputs for training. Anthropic has publicly objected to distillation of its models, calling it a violation of terms of service. A HuggingFace discussion thread already flagged the TOS issue - one commenter pointed out the creator used pre-existing public datasets rather than personally querying Claude at scale, but that distinction may not hold legally. Anthropic's terms explicitly prohibit using Claude outputs to "train or develop AI models without written permission," and the Apache 2.0 license on the dataset doesn't override those terms.
A Hacker News thread drew the same distinction - "some rando fine-tuning a model on Claude reasoning traces collected by other randos" is a different scale of operation than what Anthropic accused DeepSeek of (24,000 fraudulent accounts, 16 million+ interactions).
Community Reception
The numbers tell the story: 57,000+ downloads across 13 variants, with the GGUF quantization pulling 20,500 alone. The creator - HuggingFace user Jackrong (JIRONG), an individual contributor with 97 models and a focus on Chinese-language CoT data - has built a following in the local inference community. Seven quantized variants are available, fitting on consumer GPUs with 4-bit quantization.
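The consumer-GPU claim checks out on a back-of-envelope basis. The sketch below estimates VRAM for the weights alone, using the nominal 27B parameter count; it ignores KV cache and activation memory, so treat the figures as lower bounds:

```python
def weight_vram_gib(params_billion, bits_per_weight):
    """Rough VRAM needed to hold the weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_vram_gib(27, bits):.1f} GiB")
# 16-bit: ~50.3 GiB, 8-bit: ~25.1 GiB, 4-bit: ~12.6 GiB
```

At 4-bit, the weights fit in the 16-24 GB of VRAM common on consumer cards, with headroom left for the (short) 8K context.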
For context, DeepSeek's distilled Qwen variants used 800,000 high-quality reasoning samples and full fine-tuning. Jackrong used roughly 3,280 samples and LoRA rank 64 - orders of magnitude less data and a much lighter training approach.
See the full model page for specifications and our comparison with the base Qwen3.5-27B for a detailed analysis of what distillation adds and what it costs.