Cohere's Tiny Aya Fits 70 Languages Into 3.35 Billion Parameters and Runs on a Phone
Cohere Labs releases Tiny Aya, a 3.35B open-weight multilingual model that beats Gemma 3 4B in 46 of 61 languages on translation and runs at 32 tokens/sec on an iPhone.

Most small language models treat multilingual support as a checkbox. Train on English, sprinkle in some Chinese and Spanish data, and call it "multilingual." Cohere Labs just shipped something different: a 3.35 billion parameter model that genuinely prioritizes the languages that frontier models treat as afterthoughts.
Tiny Aya, released February 17, is an open-weight family of multilingual models that covers 70+ languages - including Bengali, Hindi, Punjabi, Tamil, Telugu, Urdu, Gujarati, and Marathi alongside dozens of African and Southeast Asian languages that barely register in most training corpora. It runs locally on consumer hardware, clocking 32 tokens per second on an iPhone 17 Pro.
TL;DR
| Spec | Value |
|---|---|
| Parameters | 3.35 billion |
| Languages | 70+ (67 in post-training) |
| Context window | 8K tokens |
| On-device speed | 32 tokens/sec (iPhone 17 Pro) |
| Post-training infra | Single 64x H100 GPU cluster |
| Variants | Base, Global, Earth, Fire, Water |
| License | Open weights |
| Available on | HuggingFace, Kaggle, Ollama |
Five models, one architecture
Cohere didn't release a single model. Tiny Aya is a family of five:
- TinyAya-Base - the pretrained foundation (3.35B parameters)
- TinyAya-Global - instruction-tuned for balanced performance across all 67 post-training languages
- TinyAya-Earth - optimized for African and West Asian languages
- TinyAya-Fire - optimized for South Asian languages
- TinyAya-Water - optimized for Asia-Pacific and European languages
The regional variants share the same base architecture but receive targeted fine-tuning for specific language communities. The idea is that a researcher working on Yoruba or Tamil shouldn't have to use a model that allocated 90% of its capacity to English.
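The variant split lends itself to a simple routing rule: map a language to its region, then to a variant, falling back to the balanced Global model. A minimal sketch - the language-to-region assignments below are illustrative examples drawn from the article, not a complete or confirmed mapping:

```python
# Sketch: pick a Tiny Aya variant by language region.
# Variant names come from the article; the per-language region
# assignments below are ILLUSTRATIVE, not an official mapping.
VARIANT_BY_REGION = {
    "african": "TinyAya-Earth",       # African and West Asian languages
    "west_asian": "TinyAya-Earth",
    "south_asian": "TinyAya-Fire",    # e.g. Hindi, Tamil, Urdu
    "asia_pacific": "TinyAya-Water",  # Asia-Pacific and European languages
    "european": "TinyAya-Water",
}

REGION_BY_LANG = {
    "yo": "african",       # Yoruba
    "sw": "african",       # Swahili
    "hi": "south_asian",   # Hindi
    "ta": "south_asian",   # Tamil
    "vi": "asia_pacific",  # Vietnamese
    "fr": "european",      # French
}

def pick_variant(lang_code: str) -> str:
    """Return the regional variant for a known language,
    falling back to the balanced TinyAya-Global otherwise."""
    region = REGION_BY_LANG.get(lang_code)
    return VARIANT_BY_REGION.get(region, "TinyAya-Global")

print(pick_variant("ta"))  # TinyAya-Fire
print(pick_variant("en"))  # TinyAya-Global
```

The fallback matters: TinyAya-Global is tuned for balanced performance across all 67 post-training languages, so it is the sensible default when a language doesn't clearly belong to one regional bucket.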
How it compares
The meaningful comparison set in this size class is Google's Gemma 3 4B, Alibaba's Qwen 3 4B, and Mistral's Ministral 3B. Here's where Tiny Aya stands:
Translation: WMT24++
TinyAya-Global outperforms Gemma 3 4B in 46 of 61 languages on the WMT24++ translation benchmark. That's not a narrow lead on a few cherry-picked pairs - it's a decisive win across the majority of evaluated language directions.
The advantage is most pronounced in low-resource languages where Gemma and Qwen models have thinner training data coverage. Tiny Aya demonstrates stable performance even for languages with minimal Common Crawl representation, suggesting the tokenizer and training data pipeline were specifically engineered for these cases.
Mathematical reasoning: GlobalMGSM
On the GlobalMGSM benchmark for African languages, the gap is dramatic:
| Model | African language accuracy |
|---|---|
| TinyAya | 39.2% |
| Gemma 3 4B | 17.6% |
| Qwen 3 4B | 6.25% |
A 3.35B model more than doubling a 4B competitor on math reasoning - in languages those competitors barely handle - is a significant result. It suggests that thoughtful data curation and tokenizer design can matter more than raw parameter count for underserved languages.
Where it trails
Tiny Aya is not the best model at this scale for everything. Averaged across all 66 evaluated languages on general quality, Gemma 3 27B remains the top performer - but that model is 8x larger. At a comparable size, Gemma 3 4B's remaining advantages are its 128K context window (vs. Tiny Aya's 8K) and a nominal 140+ language count, though many of those languages receive minimal practical support.
For English-centric tasks, coding, or long-context workloads, Gemma 3 and Qwen 3 remain stronger choices. Tiny Aya's advantage is specifically in deep multilingual capability per parameter.
The tokenizer is the real innovation
The technical detail that matters most isn't the parameter count - it's the tokenizer. Tiny Aya achieves the most efficient tokenization across the vast majority of evaluated languages, meaning it needs fewer tokens to represent the same sentence in Hindi, Swahili, or Vietnamese than its competitors do.
This has cascading effects. Fewer tokens per sentence means:
- Lower compute cost per inference
- More content fits in the 8K context window
- Faster generation on constrained hardware
- Better quality, because the model sees more meaningful units within the same token budget
Tokenization fragmentation has been one of the biggest quiet problems in multilingual AI. A model might technically "support" Bengali, but if its tokenizer splits every Bengali word into 5 sub-tokens while English words get 1-2, the effective context window for Bengali users is a fraction of what English users get. Tiny Aya addresses this directly.
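The fragmentation effect is easy to make concrete. The sketch below computes how many words fit in an 8K window at a given "fertility" (average sub-tokens per word); the fertility figures are illustrative, taken from the article's hypothetical 5-vs-1.5 split, not measured values for any real tokenizer:

```python
# Toy illustration of tokenizer fertility vs. effective context.
# Fertility = average sub-tokens per word; figures are ILLUSTRATIVE only.
CONTEXT_TOKENS = 8192  # Tiny Aya's 8K window

def effective_words(context_tokens: int, fertility: float) -> int:
    """How many words fit in the context window at a given fertility."""
    return int(context_tokens / fertility)

# An English-tuned tokenizer might need ~1.5 tokens/word for English
# but ~5 tokens/word for Bengali (the article's hypothetical split).
english = effective_words(CONTEXT_TOKENS, 1.5)  # 5461 words
bengali = effective_words(CONTEXT_TOKENS, 5.0)  # 1638 words

print(english, bengali, round(english / bengali, 1))
```

Under those assumed fertilities, the English user effectively gets over 3x the context of the Bengali user from the same 8K window - which is why a tokenizer engineered for low fragmentation across languages pays off everywhere downstream.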
On-device and offline
The entire post-training pipeline ran on a single 64x H100 GPU cluster - modest by frontier model standards. The resulting model is small enough to run on phones via Ollama, enabling offline translation and local AI in regions where reliable internet access isn't guaranteed.
This matters for the use cases Cohere is explicitly targeting: a health worker in rural India who needs real-time translation between Tamil and Hindi, a researcher in Nigeria analyzing Yoruba-language documents, or a developer building a voice assistant in Urdu. These aren't hypothetical scenarios - they're the specific gaps that frontier models running in U.S. data centers don't address.
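Running the model locally through Ollama exposes its standard REST endpoint on port 11434. A minimal sketch of building an offline translation request - the model tag `tinyaya` is an assumption, so check the actual name on the Ollama registry before using it:

```python
import json
import urllib.request

def build_request(prompt: str, model: str = "tinyaya") -> urllib.request.Request:
    """Build a POST request for Ollama's local /api/generate endpoint.
    The model tag "tinyaya" is an ASSUMED name, not a confirmed registry tag."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
    }
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Translate to Hindi: The clinic opens at nine.")
print(req.full_url)  # http://localhost:11434/api/generate
```

Sending the request with `urllib.request.urlopen(req)` against a running Ollama instance returns a JSON body whose `response` field holds the generated text - no network connection beyond localhost required, which is the point for offline use.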
The business context
Tiny Aya lands as Cohere positions itself for a reported 2026 IPO, with multilingual and sovereign AI as its core differentiator against OpenAI, Google, and Anthropic. The company is also launching Expedition Tiny Aya - a mentor-supported research challenge designed to catalyze community projects built on the model family.
Open-weighting a capable multilingual model at this size is a strategic play: seed the ecosystem, build community goodwill, and create a pipeline of multilingual AI developers who might eventually need Cohere's commercial products.
Tiny Aya isn't trying to be the best model at everything. It's trying to be the best model for the 5 billion people whose languages have been an afterthought in AI development. At 3.35B parameters running on a phone, it's making a credible argument that multilingual AI doesn't need to be a frontier-scale problem.