ZAYA1-8B

Zyphra's ZAYA1-8B is an 8.4B-parameter MoE reasoning model with only 760M active parameters. It matches DeepSeek-R1-0528 on math and coding benchmarks while running at a fraction of the compute cost.


TL;DR

  • 8.4B total parameters, 760M active - MoE efficiency that delivers math and code scores competitive with models 10-100x larger
  • Apache 2.0 license, free serverless endpoint on Zyphra Cloud, weights on Hugging Face
  • With Markovian RSA test-time compute, it reaches 91.9 on AIME'25 and 89.6 on HMMT'25, surpassing Claude 4.5 Sonnet (79.2) on the latter

Overview

ZAYA1-8B is a small mixture-of-experts language model released by Zyphra on May 6, 2026. The headline number is the active parameter count: 760 million. On AIME'26 it scores 89.1, outrunning Mistral-Small-4-119B (86.4) and Qwen3-4B (77.5) while using less than 1 billion active parameters per forward pass. That combination - frontier-class reasoning scores from a sub-billion-active-parameter model - is what Zyphra's team means when they say the model maximizes "intelligence density per parameter."

760M active parameters. 89.1 on AIME'26. Less compute than any comparably scoring model in its weight class.

The model wasn't trained on the usual Nvidia stack. Zyphra built ZAYA1-8B entirely on a cluster of 1,024 AMD Instinct MI300X GPUs connected via AMD Pensando Pollara interconnect on IBM Cloud, as part of a multi-year collaboration announced in late 2025. That's a notable production test of AMD's ability to support pretraining at scale - and the results hold up against models trained on Nvidia H100/H200 infrastructure.

Three architectural changes separate ZAYA1-8B from a standard MoE transformer. First, Compressed Convolutional Attention (CCA) performs sequence mixing in a compressed latent space, delivering an 8x reduction in KV-cache size compared to full multi-head attention. Second, an MLP-based expert router with PID-controller bias balancing improves routing stability over standard linear routers. Third, learned residual scaling controls residual-norm growth through model depth with minimal extra parameters. Together these fit under the "MoE++" label Zyphra uses for the architecture.
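Zyphra doesn't publish the controller details in this write-up, but the PID idea maps naturally onto the familiar bias-balancing pattern: measure each expert's load error after a batch and nudge a per-expert bias that is added to the router logits before top-k selection. A minimal sketch, with the gains, state layout, and update rule all assumed for illustration:

```python
import numpy as np

num_experts = 32
bias = np.zeros(num_experts)                  # added to router logits before top-k
state = {"integral": np.zeros(num_experts),
         "prev_error": np.zeros(num_experts)}

def update_router_bias(bias, token_counts, state, kp=1e-2, ki=1e-3, kd=1e-3):
    """One PID step on the per-expert routing bias (illustrative gains).

    token_counts: tokens routed to each expert in the last batch. The
    error is each expert's traffic share minus the uniform share, so
    overloaded experts get their bias pushed down and the router's
    (logits + bias) top-k picks them less often on the next batch.
    """
    load = token_counts / token_counts.sum()
    error = load - 1.0 / len(load)            # positive = overloaded
    state["integral"] += error
    derivative = error - state["prev_error"]
    state["prev_error"] = error
    return bias - (kp * error + ki * state["integral"] + kd * derivative)

# After each training batch:
counts = np.random.multinomial(4096, np.ones(num_experts) / num_experts)
bias = update_router_bias(bias, counts, state)
```

Relative to a pure auxiliary-loss approach, a controller like this can react to load imbalance without adding a competing term to the training objective; the derivative term damps oscillation when expert loads swing batch to batch.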

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Zyphra |
| Model Family | ZAYA |
| Architecture | MoE++ with Compressed Convolutional Attention (CCA) |
| Total Parameters | 8.4B |
| Active Parameters | ~760M per token |
| Context Window | Not disclosed |
| Input Price | Free (Zyphra Cloud serverless) |
| Output Price | Free (Zyphra Cloud serverless) |
| Release Date | May 6, 2026 |
| License | Apache 2.0 |
| Training Hardware | 1,024 AMD Instinct MI300X GPUs |
| Tensor Type | BF16 |
| Input Modalities | Text |
| Output Modality | Text |

Benchmark Performance

ZAYA1-8B was evaluated across math, coding, reasoning, and instruction-following benchmarks. The base scores below reflect standard inference without extended test-time compute.

In-Class Comparison (vs. 4B-class models)

| Benchmark | ZAYA1-8B | Qwen3-4B | Qwen3.5-4B | Gemma-4-E4B |
| --- | --- | --- | --- | --- |
| AIME'26 | 89.1 | 77.5 | 84.5 | 50.3 |
| HMMT Feb'26 | 71.6 | 60.8 | 63.6 | 32.1 |
| IMO-AnswerBench | 59.3 | 50.9 | 48.7 | 27.3 |
| LiveCodeBench-v6 | 65.8 | 54.2 | - | 54.2 |
| GPQA-Diamond | 71.0 | 66.5 | 76.2 | 57.4 |
| MMLU-Pro | 74.2 | 74.3 | 79.1 | 70.2 |
| IFEval | 85.6 | 86.8 | 89.8 | 88.5 |

Against Larger Open Models

| Model | Active Params | Total Params | AIME'26 | HMMT'26 | LCB-v6 |
| --- | --- | --- | --- | --- | --- |
| ZAYA1-8B | 0.7B | 8B | 89.1 | 71.6 | 63.8 |
| Nemotron-3-Nano-30B-A3B | 3B | 30B | 90.1 | 75.5 | 64.6 |
| Qwen3-Next-80B | 3B | 80B | 90.2 | 79.3 | 67.8 |

ZAYA1-8B's base scores sit just behind NVIDIA's Nemotron-3-Nano-30B-A3B - a model with 4x the active parameters and nearly 4x the total size. The gap is 1 point on AIME, 3.9 on HMMT, and less than 1 on LCB. That's 4x the compute for single-digit point gains, which is where efficiency arguments become real.

With Markovian RSA (Extended Test-Time Compute)

When Markovian RSA is applied, scores improve substantially. On HMMT'25, the model reaches 89.6 - above Claude 4.5 Sonnet's 79.2 and GPT-5-High's 88.3. On AIME'25, the RSA configuration scores 91.9. Zyphra also reports that under extra-high compute, ZAYA1-8B surpasses DeepSeek V3.2 on APEX-shortlist, a demanding math benchmark.

These numbers need context. The Markovian RSA figures involve substantially more computation per answer than standard inference. A 760M-active-parameter model can match a frontier model if you're willing to run it many more times and aggregate. The efficiency gain is real - each call is much cheaper - but the total wall-clock time and compute per hard problem can still be substantial.

Key Capabilities

Markovian RSA. Zyphra's novel test-time compute method combines parallel trace generation with fixed-length Markovian chunking. Instead of extending context indefinitely as the model reasons, only the tail of the previous reasoning chunk is passed forward, so context size - and with it per-step memory cost - stays bounded no matter how long the model reasons. The practical result is that performance scales with compute budget rather than hitting a wall when context fills up.
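The article doesn't include reference code, but the two ingredients - parallel traces and fixed-length chunking with a carried tail - can be sketched. Everything here (the function names, the `FINAL ANSWER:` stop marker, the `model.generate_text` stand-in) is assumed for illustration, not Zyphra's implementation:

```python
from collections import Counter

def bounded_reasoning(model, prompt, chunk_tokens=2048, tail_chars=1024, max_chunks=8):
    """Sketch of fixed-length 'Markovian' chunked reasoning.

    Each generation call sees only the original prompt plus the tail of
    the previous chunk, so the context - and per-step memory - stays
    bounded however many chunks the model reasons for. The tail is
    sliced in characters here as a crude proxy for tokens.
    """
    carry = ""
    for _ in range(max_chunks):
        chunk = model.generate_text(prompt + "\n" + carry, max_new_tokens=chunk_tokens)
        if "FINAL ANSWER:" in chunk:                 # illustrative stop marker
            return chunk.split("FINAL ANSWER:")[-1].strip()
        carry = chunk[-tail_chars:]                  # only the tail moves forward
    return None

def rsa_style_answer(model, prompt, n_traces=8, **kwargs):
    # The parallel half of the method: run several independent bounded
    # traces and keep the most common final answer.
    answers = [bounded_reasoning(model, prompt, **kwargs) for _ in range(n_traces)]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None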

KV-Cache efficiency. CCA's 8x KV-cache compression matters most in deployment scenarios where you're batching many requests or running on limited VRAM. A model that fits more sequences into the same memory budget is directly cheaper to serve at scale. For teams looking at self-hosting on the small language model tier, this means ZAYA1-8B can serve more concurrent requests than comparable-quality dense models.
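To see why 8x matters, a back-of-envelope calculation helps. The dimensions below are assumed for a generic 8B-class model (the article doesn't disclose ZAYA1-8B's layer or head counts); only the compression ratio comes from Zyphra's claim:

```python
def kv_cache_gib(layers, heads, head_dim, seq_len, batch, bytes_per=2, compression=1):
    """Back-of-envelope KV-cache size in GiB: 2x for keys and values,
    divided by the claimed compression factor."""
    raw = 2 * layers * heads * head_dim * seq_len * batch * bytes_per
    return raw / compression / 2**30

# ASSUMED config with full multi-head attention: 32 layers, 32 heads of
# dim 128, BF16 (2 bytes), 32k context, batch of 16 concurrent sequences.
full = kv_cache_gib(32, 32, 128, 32_768, 16)                 # 256.0 GiB
cca  = kv_cache_gib(32, 32, 128, 32_768, 16, compression=8)  #  32.0 GiB
print(f"full MHA cache: {full:.0f} GiB -> with 8x CCA: {cca:.0f} GiB")
```

Under those assumptions the same VRAM budget holds roughly eight times as many concurrent sequences, which is the batching advantage the paragraph above describes.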

Post-training depth. The training pipeline uses a five-stage approach: supervised fine-tuning on chat, code, math, and reasoning data; a reasoning warmup phase; large-scale RLVE-Gym with a 400-task curriculum and dynamic difficulty; targeted math and code RL; and lightweight RLHF/RLAIF for behavior refinement. The staged RL cascade is a more deliberate pipeline than most open-source models of this size receive, and the reasoning benchmark results reflect it.
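Zyphra doesn't describe how the dynamic difficulty works, but a common pattern for verifier-based curricula is to weight task sampling toward problems the model currently solves at an intermediate rate. A hypothetical sketch of that idea - explicitly not Zyphra's published method:

```python
import random

def sample_task(tasks, pass_rates, target=0.5):
    """Hypothetical dynamic-difficulty sampler for a verifier-based RL
    curriculum. Tasks whose recent pass rate sits near `target` - hard
    but solvable - get the most weight; saturated or hopeless tasks get
    almost none.
    """
    weights = [max(1e-3, 1.0 - 2.0 * abs(pass_rates[t] - target)) for t in tasks]
    return random.choices(tasks, weights=weights, k=1)[0]

# Example: three tasks out of a 400-task curriculum, with running pass rates.
rates = {"gsm-hard": 0.45, "regex-golf": 0.02, "unit-conversion": 0.97}
task = sample_task(list(rates), rates)   # most likely "gsm-hard"
```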

Instruction following. IFEval scores (85.6) land slightly below the comparison models (Qwen3.5-4B at 89.8, Gemma-4-E4B at 88.5). For structured output and tool-calling workloads, the slightly weaker instruction adherence is worth testing before committing. The model isn't weak here - it's just not the standout it is on math and coding.

Pricing and Availability

ZAYA1-8B is free to test through a serverless endpoint at cloud.zyphra.com. Model weights are available on Hugging Face under an Apache 2.0 license, meaning unrestricted commercial and derivative use.

No token-based pricing has been announced for Zyphra Cloud as of this writing. The free tier appears to be the current offering, likely to drive adoption and gather usage data.

For self-hosting, the BF16 weights come in safetensors format. The 8x KV-cache compression from CCA makes the memory footprint more manageable than that of a comparable dense model at the same quality tier. Recommended inference settings per the model card: temperature 1.0, top-p 0.95 for general use; temperature 0.6, top-p 0.95 for code and agentic tasks.
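A minimal self-hosting sketch with Hugging Face Transformers, applying the code/agentic sampling settings quoted above. The repository id is an assumption - check Zyphra's Hugging Face page for the actual name - and a custom architecture like CCA will likely require trust_remote_code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Zyphra/ZAYA1-8B"  # ASSUMED repo id - verify on Hugging Face
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,   # weights ship in BF16
    device_map="auto",
    trust_remote_code=True,       # custom blocks like CCA usually need this
)

messages = [{"role": "user", "content": "Merge two sorted lists in O(n)."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Model-card settings for code/agentic tasks: temperature 0.6, top-p 0.95.
out = model.generate(inputs, max_new_tokens=512, do_sample=True,
                     temperature=0.6, top_p=0.95)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```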

For context on where this sits in the cost-efficiency range, see our cost efficiency leaderboard and small language model leaderboard.

Strengths

  • 760M active parameters with math scores that beat models using 4x more compute
  • 8x KV-cache compression from CCA makes serving cheaper than comparable quality models
  • Markovian RSA enables compute-scalable reasoning without blowing up context costs
  • Apache 2.0 license - fully open for commercial use and fine-tuning
  • Trained completely on AMD hardware, proving MI300X viability for pretraining at scale
  • Free serverless access on Zyphra Cloud, weights on Hugging Face

Weaknesses

  • Context window size not publicly disclosed - limits production planning
  • No API pricing announced yet - unclear cost at scale once free tier ends
  • IFEval (85.6) trails comparison models - structured output tasks need testing
  • GPQA-Diamond (71.0) lags behind Qwen3.5-4B (76.2) on graduate-level science questions
  • Markovian RSA compute gains require running the model repeatedly - higher latency per hard problem
  • Text-only - no multimodal input support
  • No independent third-party evaluation yet; all benchmark numbers from Zyphra's own report

