StepFun's Step 3.5 Flash Uses Just 11B of Its 196B Parameters and Still Rivals GPT-5.2
Shanghai AI lab StepFun open-sources Step 3.5 Flash, a 196B sparse MoE model that activates only 11B parameters per token while matching frontier models on reasoning, coding, and agentic benchmarks.

A relatively unknown Shanghai-based lab has just dropped an open-source model that should make every frontier AI lab uncomfortable.
StepFun's Step 3.5 Flash is a 196-billion-parameter Mixture of Experts (MoE) model that activates only 11 billion parameters per token. It ships under the Apache 2.0 license, runs on a Mac Studio, and posts benchmark scores within striking distance of GPT-5.2, Claude Opus 4.5, and Gemini 3.0 Pro. The model has been trending on Hacker News and is already available on GitHub, Hugging Face, and OpenRouter.
This is not incremental progress. It is the clearest signal yet that the open-source MoE approach can close the gap with proprietary frontier systems while dramatically cutting inference costs.
The numbers that matter
Step 3.5 Flash's headline result is an overall intelligence average of 81.0 across eight major evaluations. For context, GPT-5.2 (xhigh compute) scores 82.2, Claude Opus 4.5 sits at 80.6, and Gemini 3.0 Pro lands at 80.7.
The model's reasoning results are particularly striking. On AIME 2025 it scores 97.3%, climbing to 99.9% with an enhanced parallel thinking mode. HMMT 2025 averages 96.2%. These are competition-level math scores that would have been unthinkable for an open-source model just a year ago.
On coding benchmarks, Step 3.5 Flash hits 74.4% on SWE-bench Verified - the standard test for real-world software engineering - and 51.0% on Terminal-Bench 2.0. LiveCodeBench-V6 comes in at 86.4%.
The agentic scores are where things get really interesting for anyone building AI-powered workflows. It posts 88.2 on the agent reliability benchmark and 69.0 on BrowseComp with a context manager, suggesting strong web navigation and tool-use capabilities. StepFun claims it integrates with over 80 Model Context Protocol (MCP) tools out of the box.
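For developers who want to kick the tires, the model is already callable through OpenRouter's OpenAI-compatible chat completions endpoint. A minimal tool-calling request might look like the sketch below - note that the model slug and the get_weather tool are illustrative assumptions, not confirmed identifiers.

```python
# Hedged sketch: calling Step 3.5 Flash via OpenRouter's OpenAI-compatible API
# with one tool definition. The model slug is an assumption - check
# OpenRouter's listing for the exact ID. Requires OPENROUTER_API_KEY.
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "stepfun/step-3.5-flash",  # hypothetical slug
        "messages": [{"role": "user", "content": "What's the weather in Shanghai?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for illustration
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    },
    timeout=60,
)
# If the model decides to call the tool, the message carries a tool_calls entry.
print(resp.json()["choices"][0]["message"])
```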
How it works: 288 experts, 8 chosen
The architecture is a sparse MoE transformer with 45 layers. Each layer contains 288 routed experts plus one shared expert, with a Top-8 routing strategy selecting eight specialists per token. The result is a model that carries 196B parameters of "knowledge" but only runs 11B worth of computation for each token generated.
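To make the routing concrete, here is a minimal PyTorch sketch of the general technique: a router scores every expert, keeps the top eight per token, and blends their outputs with a shared expert. Dimensions are toy-sized and all names are illustrative - this shows the Top-K MoE pattern, not StepFun's actual implementation.

```python
# Minimal Top-8 MoE routing with a shared expert. Toy dimensions; all names
# illustrative. Only TOP_K of the NUM_EXPERTS MLPs run per token - the same
# trick that lets a 196B-parameter model activate only ~11B per token.
import torch
import torch.nn as nn

NUM_EXPERTS, TOP_K, D_MODEL, D_FF = 288, 8, 64, 128

class SparseMoELayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.router = nn.Linear(D_MODEL, NUM_EXPERTS, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(D_MODEL, D_FF), nn.SiLU(), nn.Linear(D_FF, D_MODEL))
            for _ in range(NUM_EXPERTS)
        )
        # One shared expert that every token passes through regardless of routing.
        self.shared = nn.Sequential(nn.Linear(D_MODEL, D_FF), nn.SiLU(), nn.Linear(D_FF, D_MODEL))

    def forward(self, x):                          # x: (tokens, D_MODEL)
        probs = self.router(x).softmax(dim=-1)     # score all 288 experts
        weights, idx = probs.topk(TOP_K, dim=-1)   # keep the best 8 per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize
        rows = []
        for t in range(x.size(0)):                 # naive per-token loop for clarity
            mix = self.shared(x[t])
            for k in range(TOP_K):
                mix = mix + weights[t, k] * self.experts[int(idx[t, k])](x[t])
            rows.append(mix)
        return torch.stack(rows)

layer = SparseMoELayer()
print(layer(torch.randn(4, D_MODEL)).shape)        # torch.Size([4, 64])
```

Production systems batch tokens by expert instead of looping, but the routing math is the same: the other 280 experts simply never run for that token.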
Three technical choices stand out.
First, Multi-Token Prediction (MTP-3) lets the model predict three tokens per forward pass, which, combined with speculative decoding, pushes throughput to 100-300 tokens per second in typical use. StepFun claims peaks of 350 tok/s for single-stream coding tasks. That is fast.
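StepFun has not published its decoding loop, but the general idea behind pairing MTP with speculative decoding is simple: extra heads draft several tokens cheaply, and the main head verifies them, keeping the prefix that matches. A greedy toy sketch, with hypothetical draft and verify stand-ins:

```python
# Toy sketch of multi-token prediction paired with greedy speculative decoding.
# draft() stands in for the three extra MTP heads proposing tokens in one pass;
# verify() stands in for the main head re-scoring those positions. Both are
# hypothetical callables, not StepFun's API.
def generate(prompt_ids, steps, draft, verify, k=3):
    ids = list(prompt_ids)
    for _ in range(steps):
        drafted = draft(ids, k)            # k candidate tokens from one pass
        checked = verify(ids, drafted)     # main head's pick at each slot
        for d, c in zip(drafted, checked):
            ids.append(c)                  # the verifier's token always lands
            if d != c:                     # first disagreement: discard the
                break                      # rest of the draft and redraft
    return ids

# Demo with stub heads: the draft agrees on two of three tokens per step,
# so each forward pass yields three tokens instead of one.
out = generate([1, 2], steps=2,
               draft=lambda ids, k: [7, 8, 9][:k],
               verify=lambda ids, drafted: [7, 8, 0][:len(drafted)])
print(out)  # [1, 2, 7, 8, 0, 7, 8, 0]
```

The speedup comes from acceptance rate: when the drafts usually survive verification, three tokens land per pass instead of one, which is how 100-300 tok/s becomes plausible at 11B active parameters.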
Second, a hybrid attention scheme mixes sliding-window attention with full attention in a 3:1 ratio. This keeps the 256K context window usable without the memory costs of full attention at every layer.
Third, the early transformer layers use dense computation rather than sparse routing, which StepFun calls "intelligence density" - front-loading capacity where it matters most for understanding the input.
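Put together, a 45-layer plan mixing these choices might look like the sketch below. The count of dense early layers and the window size are assumptions for illustration - StepFun has not published the exact per-layer assignment.

```python
# Sketch of a 45-layer plan combining the three choices above. The number of
# dense early layers (4 here) and the sliding-window size are illustrative
# assumptions; StepFun has not published the exact per-layer assignment.
NUM_LAYERS, DENSE_PREFIX, WINDOW = 45, 4, 4096

def layer_plan():
    plan = []
    for i in range(NUM_LAYERS):
        # Early layers compute densely; the rest route to 8 of 288 experts.
        ffn = "dense" if i < DENSE_PREFIX else "moe(top-8 of 288)"
        # 3:1 hybrid attention: every fourth layer sees the full 256K context,
        # the others attend within a sliding window to cap KV-cache memory.
        attn = "full" if i % 4 == 3 else f"sliding({WINDOW})"
        plan.append((i, attn, ffn))
    return plan

for i, attn, ffn in layer_plan():
    print(f"layer {i:2d}: attn={attn:15} ffn={ffn}")
```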
Runs on your desk (sort of)
One of Step 3.5 Flash's most compelling claims is local deployment on consumer hardware. StepFun provides INT4 quantized GGUF weights (111.5 GB) that run via llama.cpp on machines with 128GB+ of unified or GPU memory. Validated platforms include Apple's Mac Studio with M4 Max, NVIDIA's DGX Spark, and AMD's Ryzen AI Max+ 395 workstations.
For anyone interested in running open-source models locally, the barrier here is still high - these are not cheap machines. But getting frontier-class reasoning on hardware you physically own, with no API calls and no data leaving your network, is a meaningful shift for enterprises and researchers with privacy requirements.
On a DGX Spark with INT4 weights, StepFun reports 20 tok/s sustained generation with the full 256K context window. Not blazing, but usable.
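For anyone with qualifying hardware, a minimal local-inference script using llama-cpp-python (the Python bindings for llama.cpp) might look like this. The GGUF filename and context settings below are placeholders, not StepFun's published defaults:

```python
# Hedged sketch of local inference via llama-cpp-python. The model filename is
# hypothetical - use whatever StepFun's GGUF release actually ships. A 111.5 GB
# INT4 file needs 128GB+ of unified or GPU memory, as noted above.
from llama_cpp import Llama

llm = Llama(
    model_path="step-3.5-flash-q4.gguf",  # hypothetical filename
    n_ctx=32768,        # raise toward 262144 if memory allows
    n_gpu_layers=-1,    # offload all layers to GPU / unified memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```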
Who is StepFun?
StepFun (Shanghai Jieyue Xingchen Intelligent Technology) was founded in April 2023 by Jiang Daxin, who spent 16 years as chief scientist at Microsoft Research Asia, where he led work on Bing, Cortana, and Azure Cognitive Services. The other co-founders came from Microsoft as well.
The company has raised aggressively. Its B+ round in January 2026 pulled in over $718 million - a record for China's large-model sector over the past year. Backers include Tencent, state-backed Fortera Capital, China Life Private Equity, and Qiming Venture Partners. StepFun is counted among China's "AI tiger" startups and competes with better-known names like DeepSeek and Moonshot.
StepFun has stayed focused on multimodality from day one - the only company in the "six tigers" group to have maintained that focus consistently. Step 3.5 Flash is its most capable open-source release so far, and the company says training for Step 4 has already begun.
The competitive picture
Step 3.5 Flash drops into an increasingly crowded open-source landscape. Alibaba released Qwen 3.5 last week with 397B parameters and its own MoE architecture (17B active). DeepSeek continues to iterate. Meta's Llama 4 Maverick remains strong. The open-source LLM leaderboard is getting reshuffled almost weekly.
What sets Step 3.5 Flash apart is the efficiency story. With only 11B active parameters, it lands within striking distance of GPT-5.2 while spending far less compute per token. StepFun claims a 1.0x baseline decoding cost where competing models range from 1.2x to 18.9x for equivalent tasks. If those numbers hold up in independent testing, this model could become the default choice for cost-sensitive agent deployments.
The model also ships with native integration into the OpenClaw agentic workflow platform, which is gaining traction as an open-source alternative for multi-step AI automation.
What to watch
Step 3.5 Flash is not without caveats. StepFun acknowledges that the model can exhibit "repetitive reasoning, mixed-language outputs, or inconsistencies in time and identity awareness" during long multi-turn conversations. It is optimized primarily for coding and professional tasks, not general chat. And the benchmarks are self-reported - independent evaluation will be critical.
The 128GB memory floor for local inference also limits who can actually run this thing at home. Most developers will still access it through cloud APIs or OpenRouter's free tier.
But none of that changes the core takeaway. A startup from Shanghai that most people in the West have never heard of just open-sourced a model that competes with GPT-5.2, Claude Opus 4.5, and Gemini 3.0 Pro on reasoning, coding, and agentic tasks - while using a fraction of the compute at inference. The open-source MoE playbook is working, and it is working fast.
Training for Step 4 is underway. The frontier labs should be paying attention.
Sources:
- Step 3.5 Flash: Fast Enough to Think. Reliable Enough to Act. - StepFun official blog
- StepFun Releases Step 3.5 Flash, an Open-Source Foundation Model Built for AI Agents - Pandaily
- Step 3.5 Flash - GitHub - StepFun GitHub repository
- StepFun launched Step 3.5 Flash open-source model - TestingCatalog
- Shanghai AI unicorn StepFun raises over $718 million in B+ round - TechNode
- stepfun-ai/Step-3.5-Flash - Hugging Face
- step-3.5-flash Model by Stepfun-ai - NVIDIA NIM