ZAYA1-8B

Zyphra's ZAYA1-8B is an 8.4B-parameter MoE reasoning model with only 760M active parameters. It matches DeepSeek-R1-0528 on math and coding benchmarks while running at a fraction of the compute cost.


TL;DR

  • 8.4B total parameters, 760M active - MoE efficiency that delivers math and code scores competitive with models 10-100x larger
  • Apache 2.0 license, free serverless endpoint on Zyphra Cloud, weights on Hugging Face
  • With Markovian RSA test-time compute, it reaches 91.9 on AIME'25 and 89.6 on HMMT'25, surpassing Claude 4.5 Sonnet (79.2) on the latter

Overview

ZAYA1-8B is a small mixture-of-experts language model released by Zyphra on May 6, 2026. The headline number is the active parameter count: 760 million. On AIME'26 it scores 89.1, outrunning Mistral-Small-4-119B (86.4) and Qwen3-4B (77.5) while using less than 1 billion active parameters per forward pass. That combination - frontier-class reasoning scores from a sub-billion-active-parameter model - is what Zyphra's team means when they say the model maximizes "intelligence density per parameter."

760M active parameters. 89.1 on AIME'26. Less compute than any comparably scoring model in its weight class.

The model wasn't trained on the usual Nvidia stack. Zyphra built ZAYA1-8B entirely on a cluster of 1,024 AMD Instinct MI300X GPUs connected via AMD Pensando Pollara interconnect on IBM Cloud, as part of a multi-year collaboration announced in late 2025. That's a notable production test of AMD's ability to support pretraining at scale - and the results hold up against models trained on Nvidia H100/H200 infrastructure.

Three architectural changes separate ZAYA1-8B from a standard MoE transformer. First, Compressed Convolutional Attention (CCA) performs sequence mixing in a compressed latent space, delivering an 8x reduction in KV-cache size compared to full multi-head attention. Second, an MLP-based expert router with PID-controller bias balancing improves routing stability over standard linear routers. Third, learned residual scaling controls residual-norm growth through model depth with minimal extra parameters. Together these fit under the "MoE++" label Zyphra uses for the architecture.
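Zyphra doesn't publish the controller details in this write-up, but the PID idea maps naturally onto the familiar bias-balancing pattern: measure each expert's load error after a batch and nudge a per-expert bias that is added to the router logits before top-k selection. A minimal sketch, with the gains, state layout, and update rule all assumed for illustration:

```python
import numpy as np

num_experts = 32
bias = np.zeros(num_experts)                  # added to router logits before top-k
state = {"integral": np.zeros(num_experts),
         "prev_error": np.zeros(num_experts)}

def update_router_bias(bias, token_counts, state, kp=1e-2, ki=1e-3, kd=1e-3):
    """One PID step on the per-expert routing bias (illustrative gains).

    token_counts: tokens routed to each expert in the last batch. The
    error is each expert's traffic share minus the uniform share, so
    overloaded experts get their bias pushed down and the router's
    (logits + bias) top-k picks them less often on the next batch.
    """
    load = token_counts / token_counts.sum()
    error = load - 1.0 / len(load)            # positive = overloaded
    state["integral"] += error
    derivative = error - state["prev_error"]
    state["prev_error"] = error
    return bias - (kp * error + ki * state["integral"] + kd * derivative)

# After each training batch:
counts = np.random.multinomial(4096, np.ones(num_experts) / num_experts)
bias = update_router_bias(bias, counts, state)
```

Relative to a pure auxiliary-loss approach, a controller like this can react to load imbalance without adding a competing term to the training objective; the derivative term damps oscillation when expert loads swing batch to batch.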

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Zyphra |
| Model Family | ZAYA |
| Architecture | MoE++ with Compressed Convolutional Attention (CCA) |
| Total Parameters | 8.4B |
| Active Parameters | ~760M per token |
| Context Window | Not disclosed |
| Input Price | Free (Zyphra Cloud serverless) |
| Output Price | Free (Zyphra Cloud serverless) |
| Release Date | May 6, 2026 |
| License | Apache 2.0 |
| Training Hardware | 1,024 AMD Instinct MI300X GPUs |
| Tensor Type | BF16 |
| Input Modalities | Text |
| Output Modality | Text |

Benchmark Performance

ZAYA1-8B was evaluated across math, coding, reasoning, and instruction-following benchmarks. The base scores below reflect standard inference without extended test-time compute.

In-Class Comparison (vs. 4B-class models)

| Benchmark | ZAYA1-8B | Qwen3-4B | Qwen3.5-4B | Gemma-4-E4B |
| --- | --- | --- | --- | --- |
| AIME'26 | 89.1 | 77.5 | 84.5 | 50.3 |
| HMMT Feb'26 | 71.6 | 60.8 | 63.6 | 32.1 |
| IMO-AnswerBench | 59.3 | 50.9 | 48.7 | 27.3 |
| LiveCodeBench-v6 | 65.8 | 54.2 | - | 54.2 |
| GPQA-Diamond | 71.0 | 66.5 | 76.2 | 57.4 |
| MMLU-Pro | 74.2 | 74.3 | 79.1 | 70.2 |
| IFEval | 85.6 | 86.8 | 89.8 | 88.5 |

Against Larger Open Models

| Model | Active Params | Total Params | AIME'26 | HMMT'26 | LCB-v6 |
| --- | --- | --- | --- | --- | --- |
| ZAYA1-8B | 0.7B | 8B | 89.1 | 71.6 | 63.8 |
| Nemotron-3-Nano-30B-A3B | 3B | 30B | 90.1 | 75.5 | 64.6 |
| Qwen3-Next-80B | 3B | 80B | 90.2 | 79.3 | 67.8 |

ZAYA1-8B's base scores sit just behind NVIDIA's Nemotron-3-Nano-30B-A3B - a model with 4x the active parameters and nearly 4x the total size. The gap is 1 point on AIME, 3.9 on HMMT, and less than 1 on LCB. That's 4x the compute for single-digit point gains, which is where efficiency arguments become real.

With Markovian RSA (Extended Test-Time Compute)

When Markovian RSA is applied, scores improve substantially. On HMMT'25, the model reaches 89.6 - above Claude 4.5 Sonnet's 79.2 and GPT-5-High's 88.3. On AIME'25, the RSA configuration scores 91.9. Zyphra also reports that under extra-high compute, ZAYA1-8B surpasses DeepSeek V3.2 on APEX-shortlist, a demanding math benchmark.

These numbers need context. The Markovian RSA figures involve substantially more computation per answer than standard inference. A 760M-active-parameter model can match a frontier model if you're willing to run it many more times and aggregate. The efficiency gain is real - each call is much cheaper - but the total wall-clock time and compute per hard problem can still be substantial.

Key Capabilities

Markovian RSA. Zyphra's novel test-time compute method combines parallel trace generation with fixed-length Markovian chunking. Instead of extending context indefinitely as the model reasons, only the tail of the previous reasoning chunk is passed forward, so context size - and with it per-step memory cost - stays bounded no matter how long the model reasons. The practical result is that performance scales with compute budget rather than hitting a wall when context fills up.
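The article doesn't include reference code, but the two ingredients - parallel traces and fixed-length chunking with a carried tail - can be sketched. Everything here (the function names, the `FINAL ANSWER:` stop marker, the `model.generate_text` stand-in) is assumed for illustration, not Zyphra's implementation:

```python
from collections import Counter

def bounded_reasoning(model, prompt, chunk_tokens=2048, tail_chars=1024, max_chunks=8):
    """Sketch of fixed-length 'Markovian' chunked reasoning.

    Each generation call sees only the original prompt plus the tail of
    the previous chunk, so the context - and per-step memory - stays
    bounded however many chunks the model reasons for. The tail is
    sliced in characters here as a crude proxy for tokens.
    """
    carry = ""
    for _ in range(max_chunks):
        chunk = model.generate_text(prompt + "\n" + carry, max_new_tokens=chunk_tokens)
        if "FINAL ANSWER:" in chunk:                 # illustrative stop marker
            return chunk.split("FINAL ANSWER:")[-1].strip()
        carry = chunk[-tail_chars:]                  # only the tail moves forward
    return None

def rsa_style_answer(model, prompt, n_traces=8, **kwargs):
    # The parallel half of the method: run several independent bounded
    # traces and keep the most common final answer.
    answers = [bounded_reasoning(model, prompt, **kwargs) for _ in range(n_traces)]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None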

KV-Cache efficiency. CCA's 8x KV-cache compression matters most in deployment scenarios where you're batching many requests or running on limited VRAM. A model that fits more sequences into the same memory budget is directly cheaper to serve at scale. For teams looking at self-hosting on the small language model tier, this means ZAYA1-8B can serve more concurrent requests than comparable-quality dense models.
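To see why 8x matters, a back-of-envelope calculation helps. The dimensions below are assumed for a generic 8B-class model (the article doesn't disclose ZAYA1-8B's layer or head counts); only the compression ratio comes from Zyphra's claim:

```python
def kv_cache_gib(layers, heads, head_dim, seq_len, batch, bytes_per=2, compression=1):
    """Back-of-envelope KV-cache size in GiB: 2x for keys and values,
    divided by the claimed compression factor."""
    raw = 2 * layers * heads * head_dim * seq_len * batch * bytes_per
    return raw / compression / 2**30

# ASSUMED config with full multi-head attention: 32 layers, 32 heads of
# dim 128, BF16 (2 bytes), 32k context, batch of 16 concurrent sequences.
full = kv_cache_gib(32, 32, 128, 32_768, 16)                 # 256.0 GiB
cca  = kv_cache_gib(32, 32, 128, 32_768, 16, compression=8)  #  32.0 GiB
print(f"full MHA cache: {full:.0f} GiB -> with 8x CCA: {cca:.0f} GiB")
```

Under those assumptions the same VRAM budget holds roughly eight times as many concurrent sequences, which is the batching advantage the paragraph above describes.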

Post-training depth. The training pipeline uses a five-stage approach: supervised fine-tuning on chat, code, math, and reasoning data; a reasoning warmup phase; large-scale RLVE-Gym with a 400-task curriculum and dynamic difficulty; targeted math and code RL; and lightweight RLHF/RLAIF for behavior refinement. The staged RL cascade is a more deliberate pipeline than most open-source models of this size receive, and the reasoning benchmark results reflect it.
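Zyphra doesn't describe how the dynamic difficulty works, but a common pattern for verifier-based curricula is to weight task sampling toward problems the model currently solves at an intermediate rate. A hypothetical sketch of that idea - explicitly not Zyphra's published method:

```python
import random

def sample_task(tasks, pass_rates, target=0.5):
    """Hypothetical dynamic-difficulty sampler for a verifier-based RL
    curriculum. Tasks whose recent pass rate sits near `target` - hard
    but solvable - get the most weight; saturated or hopeless tasks get
    almost none.
    """
    weights = [max(1e-3, 1.0 - 2.0 * abs(pass_rates[t] - target)) for t in tasks]
    return random.choices(tasks, weights=weights, k=1)[0]

# Example: three tasks out of a 400-task curriculum, with running pass rates.
rates = {"gsm-hard": 0.45, "regex-golf": 0.02, "unit-conversion": 0.97}
task = sample_task(list(rates), rates)   # most likely "gsm-hard"
```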

Instruction following. IFEval scores (85.6) land slightly below the comparison models (Qwen3.5-4B at 89.8, Gemma-4-E4B at 88.5). For structured output and tool-calling workloads, the slightly weaker instruction adherence is worth testing before committing. The model isn't weak here - it's just not the standout it is on math and coding.

Pricing and Availability

ZAYA1-8B is free to test through a serverless endpoint at cloud.zyphra.com. Model weights are available on Hugging Face under an Apache 2.0 license, meaning unrestricted commercial and derivative use.

No token-based pricing has been announced for Zyphra Cloud as of this writing. The free tier appears to be the current offering, likely to drive adoption and gather usage data.

For self-hosting, the BF16 weights come in safetensors format. The 8x KV-cache compression from CCA makes the memory footprint more manageable than that of a comparable dense model at the same quality tier. Recommended inference settings per the model card: temperature 1.0, top-p 0.95 for general use; temperature 0.6, top-p 0.95 for code and agentic tasks.
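A minimal self-hosting sketch with Hugging Face Transformers, applying the code/agentic sampling settings quoted above. The repository id is an assumption - check Zyphra's Hugging Face page for the actual name - and a custom architecture like CCA will likely require trust_remote_code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Zyphra/ZAYA1-8B"  # ASSUMED repo id - verify on Hugging Face
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,   # weights ship in BF16
    device_map="auto",
    trust_remote_code=True,       # custom blocks like CCA usually need this
)

messages = [{"role": "user", "content": "Merge two sorted lists in O(n)."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Model-card settings for code/agentic tasks: temperature 0.6, top-p 0.95.
out = model.generate(inputs, max_new_tokens=512, do_sample=True,
                     temperature=0.6, top_p=0.95)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```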

For context on where this sits in the cost-efficiency range, see our cost efficiency leaderboard and small language model leaderboard.

Strengths

  • 760M active parameters with math scores that beat models using 4x more compute
  • 8x KV-cache compression from CCA makes serving cheaper than comparable quality models
  • Markovian RSA enables compute-scalable reasoning without blowing up context costs
  • Apache 2.0 license - fully open for commercial use and fine-tuning
  • Trained completely on AMD hardware, proving MI300X viability for pretraining at scale
  • Free serverless access on Zyphra Cloud, weights on Hugging Face

Weaknesses

  • Context window size not publicly disclosed - limits production planning
  • No API pricing announced yet - unclear cost at scale once free tier ends
  • IFEval (85.6) trails comparison models - structured output tasks need testing
  • GPQA-Diamond (71.0) lags behind Qwen3.5-4B (76.2) on graduate-level science questions
  • Markovian RSA compute gains require running the model repeatedly - higher latency per hard problem
  • Text-only - no multimodal input support
  • No independent third-party evaluation yet; all benchmark numbers from Zyphra's own report

