Name: North Mini Code
Author: Cohere

North Mini Code is Cohere's first model built for developers rather than enterprise API customers. Released June 9, 2026, it's a 30-billion-parameter sparse mixture-of-experts architecture with just 3B parameters active per token - designed specifically for agentic software engineering tasks. It fits on a single H100 at FP8 precision and ships under an Apache 2.0 license with free API access.

TL;DR

Coding-specialist MoE that beats 120B+ dense models on the AA Coding Index (33.4) while activating only 3B parameters per inference step
256K context, 64K max output, free via Cohere API and OpenRouter, Apache 2.0 weights on HuggingFace
Trades evenly with Devstral Small 2 on SWE-Bench (67.6% vs ~68% pass@1) at 2.8x higher throughput

The model is the first entry in Cohere's new "North" family - a line aimed at developers who need to self-host or run coding agents on their own infrastructure. That's a different audience than Cohere Command A+, which targets enterprise API customers with retrieval and citation workflows. North Mini Code is built for terminal agents, code reviews, and multi-step software engineering workflows.

The efficiency story is the main talking point. At 3B active parameters, it matches or beats models activating 10-20x as many parameters on coding benchmarks. That ratio matters if you're running many agent instances in parallel or optimizing for cost per completed task.

Key Specifications

Specification	Details
Provider	Cohere
Model Family	North
Total Parameters	30B
Active Parameters	3B (per inference step)
Expert Count	128 experts, 8 activated per token
Context Window	256K tokens
Max Output	64K tokens
Input Price	Free
Output Price	Free
Release Date	June 9, 2026
License	Apache 2.0
Quantizations	BF16, FP8, W4A16
Min Hardware	1x H100 80GB (FP8) or 2x A100 40GB (BF16)

The architecture is a decoder-only Transformer with a sigmoid-routed MoE feed-forward. Attention layers use an interleaved 3:1 ratio of sliding-window local attention (with RoPE) to full global attention - a design that keeps long-context costs reasonable without sacrificing quality on longer sequences.

Code on a developer monitor with syntax highlighting The model targets terminal-based agentic tasks and multi-file software engineering workflows. Source: unsplash.com

Benchmark Performance

Cohere tested North Mini Code against open-weight models of similar and larger size. The clearest performance story is on the Artificial Analysis Coding Index, where the model reaches 33.4 - above Qwen3.5 35B-A3B, Gemma 4 26B-A4B, Devstral Small 2 (24B dense), and larger models like Nemotron 3 Super (120B-A12B). Qwen 3.6 35B-A3B edges it at 35.2.

Benchmark	North Mini Code	Devstral Small 2 (24B)	Devstral 2 (123B)	Qwen 3.6 35B-A3B
AA Coding Index	33.4	-	-	35.2
SWE-Bench Verified (pass@1)	67.6%	~68%	72.2%	-
SWE-Bench Verified (pass@10)	80.2%	-	-	-
Mini-SWE-Agent (pass@1)	61.0%	-	-	-
AA Intelligence Index	21	-	-	-
Output Speed (Cohere API)	~210 t/s	~75 t/s	-	-

SWE-Bench Verified is where the real comparison sits. At 67.6% pass@1, North Mini Code is functionally level with Devstral Small 2 (~68%), but it hits that while running 2.8x faster in throughput on the Cohere API. That gap compounds if you're running dozens of parallel agent instances.

The weaker numbers appear on non-coding agentic benchmarks: GDPval-AA at 14%, τ²-Bench Telecom at 37%, and an overall Agentic Index of 21.7. The model was clearly optimized for software engineering tasks, and it doesn't generalize well to other agentic domains. If you need a general-purpose agent, this isn't it. Check our SWE-bench coding agent leaderboard for a broader view of where this sits against proprietary models.

Cohere also flags that the model is verbose - it uses more output tokens to complete evaluations than comparable models. In practice that means higher context consumption per task, which is a relevant tradeoff when running multi-step agents.

Circuit board close-up illustrating the multi-expert routing in MoE models North Mini Code uses 128 experts with 8 activated per token - only a fraction of parameters fire on any given inference. Source: unsplash.com

Key Capabilities

Agentic Software Engineering

North Mini Code was trained specifically on agentic workflows. The two-phase training pipeline used 70K+ verifiable tasks across roughly 5K unique repositories, with Stage 1 data weighted toward code (70%), agentic tool-use (43%), and competitive/scientific programming (27%). Stage 2 added 4.5 billion tokens of agentic and reasoning-driven samples.

That training mix shows up in the model's strongest use cases: sub-agent orchestration, multi-file code reviews, architecture mapping, and terminal-based task execution. It handles structured tool-call outputs via JSON schema and supports interleaved reasoning with tool use.

Hardware Flexibility

The three quantization formats are a meaningful practical feature. FP8 on a single H100 is the target setup for most cloud inference providers. The W4A16 format (4-bit weights, 16-bit activations) cuts memory further for edge or constrained deployments, with Cohere reporting negligible quality loss over BF16. Running the full BF16 weights requires two A100 40GB cards.

For anyone building on top of North Mini Code locally, vLLM has native support for the 128-expert routing, so there's no need for custom inference code.

Tool-Use and Multi-Turn Reasoning

The model supports structured tool-call outputs and multi-step reasoning chains. In the training data, agentic tool-use accounted for a large share of Stage 1, which means the model has been explicitly trained to reason across tool invocations rather than treating each call as isolated. That makes it better suited to orchestration roles where an agent needs to plan across several steps before calling any individual tool.

Pricing and Availability

North Mini Code is free. Cohere has published no per-token pricing - access via the Cohere API is free at launch, and the model is also available on OpenRouter under the slug cohere/north-mini-code:free. That won't last forever, but it makes the model easy to evaluate without budget risk.

Weights are on HuggingFace at CohereLabs/North-Mini-Code-1.0 in three formats:

BF16: Full precision, 2x A100 40GB minimum
FP8: Halved memory, 1x H100 80GB, recommended for most deployments
W4A16: Further compressed, lowest hardware requirement

The Apache 2.0 license permits commercial use without restrictions. For comparison, Devstral 2's 123B version uses a modified MIT license; North Mini Code's clean Apache 2.0 is cleaner for enterprise legal reviews.

"North Mini Code was built to run where you need it - on your infrastructure, under your control." - Cohere, launch post

See our open-source LLM leaderboard and coding benchmarks leaderboard for updated rankings across the open-weight coding model space.

Strengths and Weaknesses

Strengths

Beats models 4-5x its active parameter size on the AA Coding Index
2.8x faster throughput than Devstral Small 2 under identical hardware
Clean Apache 2.0 license, free API access at launch
Runs on a single H100 at FP8 - accessible hardware for self-hosting
Trained on 70K+ verifiable coding tasks for genuine agentic capability

Weaknesses

Non-coding agentic benchmarks are weak (Agentic Index: 21.7, GDPval-AA: 14%)
Text-only - no vision input
Verbose: uses more output tokens per task than comparable models
Qwen 3.6 35B-A3B still edges it on the AA Coding Index (35.2 vs 33.4)
No disclosed pricing post-launch; free tier may not persist

Devstral 2 model profile - Mistral's competing agentic coding model
Cohere Command A+ model profile - Cohere's general-purpose MoE flagship
SWE-bench Coding Agent Leaderboard
Coding Benchmarks Leaderboard
Open-Source LLM Leaderboard