Cohere's Tiny Aya Fits 70 Languages Into 3.35 Billion Parameters and Runs on a Phone
Cohere Labs releases Tiny Aya, a 3.35B open-weight multilingual model that beats Gemma 3 4B in 46 of 61 languages on translation and runs at 32 tokens/sec on an iPhone.

Most small language models treat multilingual support as a checkbox. Train on English, sprinkle in some Chinese and Spanish data, and call it "multilingual." Cohere Labs just shipped something different: a 3.35 billion parameter model that genuinely prioritizes the languages that frontier models treat as afterthoughts.
Tiny Aya, released February 17, is an open-weight family of multilingual models that covers 70+ languages - including Bengali, Hindi, Punjabi, Tamil, Telugu, Urdu, Gujarati, and Marathi alongside dozens of African and Southeast Asian languages that barely register in most training corpora. It runs locally on consumer hardware, clocking 32 tokens per second on an iPhone 17 Pro.
TL;DR
| Spec | Value |
|---|---|
| Parameters | 3.35 billion |
| Languages | 70+ (67 in post-training) |
| Context window | 8K tokens |
| On-device speed | 32 tokens/sec (iPhone 17 Pro) |
| Post-training infra | Single 64x H100 GPU cluster |
| Variants | Base, Global, Earth, Fire, Water |
| License | Open weights |
| Available on | HuggingFace, Kaggle, Ollama |
Five models, one architecture
Cohere didn't release a single model. Tiny Aya is a family of five:
- TinyAya-Base - the pretrained foundation (3.35B parameters)
- TinyAya-Global - instruction-tuned for balanced performance across all 67 post-training languages
- TinyAya-Earth - optimized for African and West Asian languages
- TinyAya-Fire - optimized for South Asian languages
- TinyAya-Water - optimized for Asia-Pacific and European languages
The regional variants share the same base architecture but receive targeted fine-tuning for specific language communities. The idea is that a researcher working on Yoruba or Tamil shouldn't have to use a model that allocated 90% of its capacity to English.
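The variant split lends itself to a simple routing rule: map a language to its region, then to a variant, falling back to the balanced Global model. A minimal sketch - the language-to-region assignments below are illustrative examples drawn from the article, not a complete or confirmed mapping:

```python
# Sketch: pick a Tiny Aya variant by language region.
# Variant names come from the article; the per-language region
# assignments below are ILLUSTRATIVE, not an official mapping.
VARIANT_BY_REGION = {
    "african": "TinyAya-Earth",       # African and West Asian languages
    "west_asian": "TinyAya-Earth",
    "south_asian": "TinyAya-Fire",    # e.g. Hindi, Tamil, Urdu
    "asia_pacific": "TinyAya-Water",  # Asia-Pacific and European languages
    "european": "TinyAya-Water",
}

REGION_BY_LANG = {
    "yo": "african",       # Yoruba
    "sw": "african",       # Swahili
    "hi": "south_asian",   # Hindi
    "ta": "south_asian",   # Tamil
    "vi": "asia_pacific",  # Vietnamese
    "fr": "european",      # French
}

def pick_variant(lang_code: str) -> str:
    """Return the regional variant for a known language,
    falling back to the balanced TinyAya-Global otherwise."""
    region = REGION_BY_LANG.get(lang_code)
    return VARIANT_BY_REGION.get(region, "TinyAya-Global")

print(pick_variant("ta"))  # TinyAya-Fire
print(pick_variant("en"))  # TinyAya-Global
```

The fallback matters: TinyAya-Global is tuned for balanced performance across all 67 post-training languages, so it is the sensible default when a language doesn't clearly belong to one regional bucket.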
How it compares
The meaningful comparison set in this size class is Google's Gemma 3 4B, Alibaba's Qwen 3 4B, and Mistral's Ministral 3B. Here's where Tiny Aya stands:
Translation: WMT24++
TinyAya-Global outperforms Gemma 3 4B in 46 of 61 languages on the WMT24++ translation benchmark. That's not a narrow lead on a few cherry-picked pairs - it's a decisive win across the majority of evaluated language directions.
The advantage is most pronounced in low-resource languages where Gemma and Qwen models have thinner training data coverage. Tiny Aya demonstrates stable performance even for languages with minimal Common Crawl representation, suggesting the tokenizer and training data pipeline were specifically engineered for these cases.
Mathematical reasoning: GlobalMGSM
On the GlobalMGSM benchmark for African languages, the gap is dramatic:
| Model | African language accuracy |
|---|---|
| TinyAya | 39.2% |
| Gemma 3 4B | 17.6% |
| Qwen 3 4B | 6.25% |
A 3.35B model more than doubling a 4B competitor on math reasoning - in languages those competitors barely handle - is a significant result. It suggests that thoughtful data curation and tokenizer design can matter more than raw parameter count for underserved languages.
Where it trails
Tiny Aya is not the best model at this scale for everything. Averaged across all 66 evaluated languages on general quality, Gemma 3 27B remains the top performer - but that model is 8x larger. At a comparable size, Gemma 3 4B's remaining advantages are its 128K context window (vs. Tiny Aya's 8K) and a nominal 140+ language count, though many of those languages receive minimal practical support.
For English-centric tasks, coding, or long-context workloads, Gemma 3 and Qwen 3 remain stronger choices. Tiny Aya's advantage is specifically in deep multilingual capability per parameter.
The tokenizer is the real innovation
The technical detail that matters most isn't the parameter count - it's the tokenizer. Tiny Aya achieves the most efficient tokenization across the vast majority of evaluated languages, meaning it needs fewer tokens to represent the same sentence in Hindi, Swahili, or Vietnamese than its competitors do.
This has cascading effects. Fewer tokens per sentence means:
- Lower compute cost per inference
- More content fits in the 8K context window
- Faster generation on constrained hardware
- Better quality, because the model sees more meaningful units within the same token budget
Tokenization fragmentation has been one of the biggest quiet problems in multilingual AI. A model might technically "support" Bengali, but if its tokenizer splits every Bengali word into 5 sub-tokens while English words get 1-2, the effective context window for Bengali users is a fraction of what English users get. Tiny Aya addresses this directly.
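The fragmentation effect is easy to make concrete. The sketch below computes how many words fit in an 8K window at a given "fertility" (average sub-tokens per word); the fertility figures are illustrative, taken from the article's hypothetical 5-vs-1.5 split, not measured values for any real tokenizer:

```python
# Toy illustration of tokenizer fertility vs. effective context.
# Fertility = average sub-tokens per word; figures are ILLUSTRATIVE only.
CONTEXT_TOKENS = 8192  # Tiny Aya's 8K window

def effective_words(context_tokens: int, fertility: float) -> int:
    """How many words fit in the context window at a given fertility."""
    return int(context_tokens / fertility)

# An English-tuned tokenizer might need ~1.5 tokens/word for English
# but ~5 tokens/word for Bengali (the article's hypothetical split).
english = effective_words(CONTEXT_TOKENS, 1.5)  # 5461 words
bengali = effective_words(CONTEXT_TOKENS, 5.0)  # 1638 words

print(english, bengali, round(english / bengali, 1))
```

Under those assumed fertilities, the English user effectively gets over 3x the context of the Bengali user from the same 8K window - which is why a tokenizer engineered for low fragmentation across languages pays off everywhere downstream.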
On-device and offline
The entire post-training pipeline ran on a single 64x H100 GPU cluster - modest by frontier model standards. The resulting model is small enough to run on phones via Ollama, enabling offline translation and local AI in regions where reliable internet access isn't guaranteed.
This matters for the use cases Cohere is explicitly targeting: a health worker in rural India who needs real-time translation between Tamil and Hindi, a researcher in Nigeria analyzing Yoruba-language documents, or a developer building a voice assistant in Urdu. These aren't hypothetical scenarios - they're the specific gaps that frontier models running in U.S. data centers don't address.
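Running the model locally through Ollama exposes its standard REST endpoint on port 11434. A minimal sketch of building an offline translation request - the model tag `tinyaya` is an assumption, so check the actual name on the Ollama registry before using it:

```python
import json
import urllib.request

def build_request(prompt: str, model: str = "tinyaya") -> urllib.request.Request:
    """Build a POST request for Ollama's local /api/generate endpoint.
    The model tag "tinyaya" is an ASSUMED name, not a confirmed registry tag."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
    }
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Translate to Hindi: The clinic opens at nine.")
print(req.full_url)  # http://localhost:11434/api/generate
```

Sending the request with `urllib.request.urlopen(req)` against a running Ollama instance returns a JSON body whose `response` field holds the generated text - no network connection beyond localhost required, which is the point for offline use.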
The business context
Tiny Aya lands as Cohere positions itself for a reported 2026 IPO, with multilingual and sovereign AI as its core differentiator against OpenAI, Google, and Anthropic. The company is also launching Expedition Tiny Aya - a mentor-supported research challenge designed to catalyze community projects built on the model family.
Open-weighting a capable multilingual model at this size is a strategic play: seed the ecosystem, build community goodwill, and create a pipeline of multilingual AI developers who might eventually need Cohere's commercial products.
Tiny Aya isn't trying to be the best model at everything. It's trying to be the best model for the 5 billion people whose languages have been an afterthought in AI development. At 3.35B parameters running on a phone, it's making a credible argument that multilingual AI doesn't need to be a frontier-scale problem.