Name: Arcee Trinity
Author: Arcee AI

Arcee AI shipped Trinity-Large-Thinking on April 1, 2026 - the reasoning-capable variant of its 400B sparse Mixture-of-Experts model. The base Trinity-Large-Preview had launched in late January, but the Thinking release is what put Arcee on the short list of labs building frontier-tier open models. It holds the #2 spot on PinchBench, an agent-focused evaluation run by Kilo that tests real-world agentic task performance, trailing only Claude Opus 4.6 among all public models.

TL;DR

#2 on PinchBench (91.9), seconds behind Claude Opus 4.6's 93.3 - at 28x lower cost
398B params, 13B active per token; 256K context (512K extended); Apache 2.0 license
Designed for long-horizon agents - better instruction coherence over multi-turn tool calls than the Preview variant

Arcee is a Miami-based startup of roughly 39 people, founded in 2023 by Jacob Solawetz, Brian Benedict, and Mark McQuade. They trained Trinity entirely in-house, spending around $20 million total on compute, salaries, and data - a figure that looks implausible for a 400B model until you factor in the sparse MoE architecture. With only 4 of 256 experts active per token, effective compute per forward pass is closer to a 13B dense model than a 400B one. The result is inference throughput 2-3x faster than comparably-sized dense models on identical hardware.

Key Specifications

Specification	Details
Provider	Arcee AI
Model Family	Trinity
Parameters	398B total, 13B active per token
Architecture	Sparse MoE, 256 experts, 4 active per token
Context Window	256K tokens (512K extended)
Input Price	$0.22/M tokens
Output Price	$0.85/M tokens
Release Date	2026-01-27 (Preview), 2026-04-01 (Thinking)
License	Apache 2.0
Training Data	17 trillion tokens

The model runs on vLLM, SGLang, llama.cpp, LM Studio, and HuggingFace Transformers. Weights are on Hugging Face under arcee-ai/Trinity-Large-Thinking. OpenRouter also hosts it at $0.22/$0.85 per million tokens with a 262K context window in that deployment.

Benchmark Performance

The two Trinity variants serve different use cases: Preview is the lightly post-trained baseline suitable for general deployment; Thinking adds internal chain-of-thought reasoning before responding and is the right choice for complex agents and tasks.

Trinity-Large-Preview vs Llama 4 Maverick

Benchmark	Trinity-Large-Preview	Llama 4 Maverick
MMLU	87.2	85.5
MMLU-Pro	75.2	80.5
GPQA Diamond	63.3	69.8
AIME 2025	24.4	19.3

The preview model lands roughly at Llama 4 Maverick level on academic evals, trading wins and losses depending on the task. Arcee's own framing was accurate: "roughly in line with Llama-4-Maverick's Instruct model." The MMLU edge is real; the GPQA and MMLU-Pro gaps are also real.

Trinity benchmark comparison across MMLU-Pro, AIME 2025, and GPQA Diamond Trinity-Large-Preview benchmark scores across key academic evaluations. Source: arcee.ai

Trinity-Large-Thinking vs Frontier Models

Benchmark	Trinity-Large-Thinking	Claude Opus 4.6	GLM-5	MiniMax-M2.5	Kimi-K2.5
PinchBench	91.9	93.3	86.4	89.8	84.8
GPQA Diamond	76.3	89.2	81.6	86.2	86.9
MMLU-Pro	83.4	89.1	85.8	80.8	87.1
AIME 2025	96.3	99.8	93.3	80.0	96.3
SWE-bench Verified	63.2	75.6	72.8	75.4	70.8
LiveCodeBench	98.2	N/A	N/A	N/A	N/A
IFBench	52.3	53.1	72.3	75.7	70.2

The PinchBench result is the headline number and it's genuine. On agentic task completion, Trinity-Thinking sits within 1.4 points of Claude Opus 4.6 while costing 28x less per million output tokens. That gap matters for anyone running agent workloads at volume.

The weaknesses are equally real. GPQA Diamond at 76.3 is 13 points behind Claude Opus 4.6, and SWE-bench Verified at 63.2 is a full 12 points back. Instruction-following (IFBench: 52.3) is the biggest gap - both GLM-5 and Kimi-K2.5 beat it by more than 18 points there. If your workload involves precise format adherence or structured output at high volume, the Thinking model will frustrate.

Check the agentic AI benchmarks leaderboard for how Trinity ranks across the full agent benchmark suite.

Key Capabilities

Trinity-Large-Thinking was built for agents, not chat. Arcee trained it with long-horizon planning in mind: stable context coherence across hundreds of tool calls, clean multi-turn interaction without state drift, and structured reasoning that stays consistent through extended agent loops. The Thinking prefix keeps the model's intermediate steps in context so downstream tool calls can reference earlier reasoning rather than starting cold each turn.

Trinity inference throughput comparison Trinity hits 2-3x higher inference throughput than comparably-sized dense models. Source: arcee.ai

The context window is 256K in standard deployments, but Arcee's technical report shows 0.976 accuracy on needle-in-a-haystack tests at 512K tokens - beyond the training distribution. That extended window is relevant for agent tasks where tool outputs build up across many steps. OpenRouter caps it at 262K; self-hosted deployments on vLLM can go further.

LiveCodeBench at 98.2 is the model's strongest single number. Competitive coding performance is consistent with the training emphasis on synthetic code and math data - over 8 trillion of the 17 trillion training tokens were synthetic. The model trained across three phases with the Muon optimizer for hidden layers, a choice Arcee's report notes helps with MoE training stability.

"Getting here took difficult technical work, hard calls, and more than a few moments where the easy thing would have been to lower the ambition." - Arcee AI

From launch through early March 2026, OpenRouter served 3.37 trillion Trinity tokens, making it the most-used open model in the US and #4 globally by traffic. That adoption rate at free-preview pricing isn't directly transferable to paid-tier economics, but it confirms the model runs reliably at scale.

Pricing and Availability

Trinity comes in three variants with different tradeoffs:

Trinity-Large-Preview: Free on OpenRouter (128K context, 8-bit quantization). Good for evaluation and low-stakes workloads.
Trinity-Large-Base: Full pretraining checkpoint at 17T tokens. No post-training, aimed at researchers building their own fine-tunes.
Trinity-Large-Thinking: $0.22/M input, $0.85/M output on OpenRouter. This is the production reasoning model.

Weights for all three are on Hugging Face under Apache 2.0 - you can download and run them commercially with no restrictions. vLLM is the recommended serving framework. For the Thinking variant, Arcee recommends temperature 0.3 and keeping thinking tokens in context for multi-turn conversations.

The cost-per-token story is straightforward. At $0.90/M output (the figure Arcee uses in its own comparisons), Trinity-Thinking costs roughly 96% less than Claude Opus 4.6 at $25/M. That's not a marginal difference - it's an order of magnitude. For agentic workloads where output token counts run into the tens of millions per day, that gap determines whether a product is financially viable.

See the cost efficiency leaderboard for a full comparison of price-per-point across major models.

Strengths and Weaknesses

Strengths

#2 on PinchBench among all public models - competitive with closed frontier models on agentic tasks
28x lower cost per output token than Claude Opus 4.6
Apache 2.0 license with full weight access - no usage restrictions
2-3x faster inference throughput than comparable dense models
256K context with reliable 512K extension
LiveCodeBench at 98.2 is among the highest scores for any open model

Weaknesses

GPQA Diamond at 76.3 is 13 points behind Claude Opus 4.6 - not a substitute for hard science reasoning
SWE-bench Verified at 63.2 trails the frontier by 12+ points - coding agents hitting complex real-world repos will see the gap
Instruction-following (IFBench: 52.3) is a genuine weak spot - GLM-5 and Kimi-K2.5 are substantially better
No official fine-tuning infrastructure yet - downstream customization is community-driven
OpenRouter preview pricing may not reflect long-term production costs

Llama 4 Maverick - Meta's comparable open-weight MoE and Trinity's primary base-model benchmark target
Open-Source LLM Leaderboard - full rankings across open models including Trinity
Agentic AI Benchmarks Leaderboard - PinchBench and agent eval context
Coding Benchmarks Leaderboard - SWE-bench and LiveCodeBench context

FAQ

What is Trinity-Large-Thinking best for?

Long-horizon AI agents requiring multi-turn tool use, planning, and context coherence. Its #2 PinchBench score reflects real agentic capability at 28x lower cost than Claude Opus 4.6, making it practical for high-volume agent deployments.

How does Trinity compare to Llama 4 Maverick?

The Preview variant lands roughly at Maverick level on academic benchmarks. Trinity-Thinking is substantially stronger than Maverick on reasoning tasks, though Llama 4 Maverick has better instruction following.

Can Trinity be self-hosted?

Yes. All three variants (Preview, Base, Thinking) are available on Hugging Face under Apache 2.0. Compatible with vLLM, SGLang, llama.cpp, and LM Studio. Arcee's technical report recommends using Muon-based serving configs for best MoE routing efficiency.

What's the difference between the Preview and Thinking variants?

Preview is a lightly post-trained instruction model suitable for general tasks. Thinking adds internal chain-of-thought reasoning before responding, improving complex reasoning and multi-step planning at the cost of higher output token counts.

Is Trinity free to use commercially?

Yes. The Apache 2.0 license allows commercial use, modification, and distribution with no royalties or access restrictions. This applies to all Trinity variants including the full base checkpoint.

Sources: