Best Open-Source TTS Models for Self-Hosting in 2026

The gap between open-source TTS and commercial APIs closed fast in 2026. Resemble AI's Chatterbox beat ElevenLabs in blind preference tests. Mistral's Voxtral runs at 90ms latency on a single 16GB GPU. Nari Labs shipped a 1.6B-parameter dialogue model built by two undergraduates with zero external funding. The open-source options are no longer compromise choices.

TL;DR

Kokoro-82M (Apache 2.0) is the best starting point - 82M parameters, runs on CPU, 210x real-time on GPU, no voice cloning
Chatterbox-Turbo (MIT) beat ElevenLabs in 63.75% of blind preference tests and supports zero-shot cloning with a 5-20 second clip
Voxtral (Mistral, Creative Commons license) hits 70-90ms time-to-first-audio at 4B parameters and needs 16GB VRAM
Dia (Apache 2.0) is the best open-source option for multi-speaker dialogue with nonverbal cues like laughter
Fish Speech S2 uses a research license - check commercial terms before production use

Every model below is free to download and self-host. The real costs are GPU compute, engineering time to stand up an inference server, and the licensing review your legal team will want before production use. Free to run isn't the same as free of constraints.

For comparison against paid APIs including ElevenLabs, PlayHT, and Cartesia, see the best AI voice generators roundup and the best AI voice cloning tools comparison.

Kokoro-82M - Best for Low-Resource Deployment

Kokoro is the model that changed how practitioners think about size vs. quality tradeoffs in TTS. At 82M parameters, it runs on CPU with 3-5x real-time speed, reaches 50x real-time on a GTX 1660 Super, and hits 210x real-time on a modern data center GPU. VRAM requirements are under 2.5GB at the base configuration.
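To make the real-time multipliers concrete, here is a minimal sketch that converts them into wall-clock synthesis time. The function name and the dictionary are illustrative, and the CPU entry assumes the low end of the 3-5x range quoted above.

```python
def synthesis_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to synthesize `audio_seconds` of speech."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# Real-time factors quoted in the text; the CPU value is the low end of 3-5x.
KOKORO_SPEEDS = {
    "CPU": 3.0,
    "GTX 1660 Super": 50.0,
    "data center GPU": 210.0,
}

ONE_HOUR = 3600.0
for hardware, factor in KOKORO_SPEEDS.items():
    seconds = synthesis_seconds(ONE_HOUR, factor)
    print(f"{hardware}: {seconds:.0f} s to synthesize one hour of audio")
```

At these rates an hour of narration takes about 20 minutes on CPU and well under a minute on either GPU, which is why batch narration jobs rarely need more than the smallest card.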

The output quality surprises people who equate parameter count with fidelity. Kokoro outperforms Coqui XTTS v2 and MetaVoice in blind listening tests despite being a fraction of the size. The architecture is decoder-only, which avoids diffusion sampling delays and produces low, consistent latency.

The main limitation: Kokoro doesn't clone voices. You pick from 54 preset voices across English, French, Korean, Japanese, Mandarin, Spanish, Hindi, and Brazilian Portuguese. If your application needs arbitrary voice cloning from a reference clip, Kokoro isn't the right model.

License: Apache 2.0 - fully commercial. Install via pip (pip install kokoro-onnx) or run from the Hugging Face model page. No registration or API key required.

Hardware minimum: 2.5GB VRAM for GPU inference; CPU inference is viable on any modern system.

[Image: abstract waveform visualization] Open-source TTS models in 2026 span many parameter counts, from 82M models that run on CPU to 4B models that need a single high-end GPU. Source: unsplash.com

Chatterbox - Best for Commercial Voice Cloning

Resemble AI released Chatterbox in early 2026 under the MIT license, making it one of the few zero-shot voice cloning models you can legally use in commercial products without negotiating a separate agreement. The family now includes three variants: the base Chatterbox, Chatterbox-Turbo at 350M parameters, and Chatterbox-Multilingual covering 23+ languages.

The quality numbers come from Resemble's own blind evaluation: 63.75% of testers preferred Chatterbox over ElevenLabs Flash v2.5 in audio quality assessments. Independent evaluations on findskill.ai put that figure slightly higher at 65.3%. Resemble AI funded the original study, so treat the specific percentages as directional rather than final - but the subjective quality gap between Chatterbox and leading commercial APIs is real.

Chatterbox-Turbo ships with PerTh watermarking built into every generation. Every output is tagged at the bit level so deepfake attribution remains possible even after post-processing.

Voice cloning requires a 5-20 second reference clip. The zero-shot approach needs no fine-tuning. Emotion control is a distinguishing feature: you can set intensity on a continuous scale rather than picking from categorical presets. Latency is around 200ms, which matches or slightly beats commercial APIs.
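Since the reference clip length matters, it can help to check a clip before passing it to the model. The sketch below is a standalone pre-check using only Python's standard `wave` module. It is not part of the Chatterbox package, the function names are made up for illustration, and it assumes the clip is an uncompressed WAV file.

```python
import wave

# Bounds taken from the 5-20 second range quoted above.
MIN_CLIP_SECONDS = 5.0
MAX_CLIP_SECONDS = 20.0


def clip_duration_seconds(path: str) -> float:
    """Duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_usable_reference(path: str) -> bool:
    """True if the clip length falls inside the recommended range."""
    duration = clip_duration_seconds(path)
    return MIN_CLIP_SECONDS <= duration <= MAX_CLIP_SECONDS
```

Rejecting a 2-second clip up front is cheaper than discovering the problem from a poor clone after inference.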

Install:

pip install chatterbox-tts

License: MIT - commercial use permitted. PerTh watermarking is built in and applied to every generation; it is not an optional setting.

Hardware: Single consumer GPU. Runs in air-gapped environments without outbound network calls.

Voxtral - Best Quality at Production Latency

Mistral AI shipped Voxtral TTS on March 26, 2026 - a 4B-parameter model that reaches 70-90ms time-to-first-audio, clones voices from 3-5 seconds of reference audio, and supports 9 languages. The blind evaluation put Voxtral ahead of ElevenLabs Flash v2.5 in 68.4% of test cases, which makes it one of the highest-quality open-weight models available.

The hardware requirement is the thing to plan around: Voxtral needs a single GPU with 16GB VRAM to run comfortably. A consumer RTX 4080 or A4000 handles it. A 3090 with 24GB has headroom. Most laptop GPUs don't. If your deployment target is edge hardware or a CPU-only server, look at Kokoro or Piper instead.

Voxtral is available via Mistral's API ($0.016 per 1,000 characters) or as downloadable weights for self-hosting. The API option makes sense for low-volume applications where standing up an inference server isn't worth the operational overhead.
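For a rough sense of what "low-volume" means in money terms, the per-character price can be turned into a quick estimate. This is a sketch only: it assumes the 0.016 figure is in US dollars and that billing is linear in character count, with no minimums or tiers.

```python
# Price quoted in the text; the currency (USD) is an assumption.
PRICE_PER_1000_CHARS = 0.016


def api_cost(characters: int, price_per_1000: float = PRICE_PER_1000_CHARS) -> float:
    """Estimated API cost for synthesizing `characters` characters of text."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return characters / 1000 * price_per_1000


# Hypothetical monthly volumes, chosen only for illustration.
for volume in (100_000, 1_000_000, 50_000_000):
    print(f"{volume:>11,} characters: {api_cost(volume):,.2f}")
```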

License: Creative Commons - review specific terms for your commercial use case before deploying in production.

Hardware minimum: 16GB VRAM GPU for the full 4B model. Runs on a single H100, A100, or consumer RTX 4090/4080.

Dia - Best for Multi-Speaker Dialogue

Dia is a 1.6B-parameter model from Korean startup Nari Labs, built by two undergraduate students using Google's TPU Research Cloud access. It ships under Apache 2.0, which means commercial use is permitted without negotiating additional terms.

The capability that makes Dia different from every other model on this list: multi-speaker dialogue generation in a single pass. You write a transcript with speaker tags ([S1], [S2]) and Dia creates the full conversation, including nonverbal cues. The supported nonverbal tags include laughter, coughing, sighing, and screaming. No other open-source model handles those naturally at this quality level.
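A transcript in that format is easy to assemble programmatically. The helper below is a hypothetical sketch and not part of Dia's own code. It only builds the tagged text, and it assumes nonverbal cues are written inline in parentheses such as (laughs), so check the model's README for the exact cue spelling it expects.

```python
from typing import Iterable, Tuple


def build_transcript(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker, line) pairs into one string with [S1]/[S2] tags."""
    parts = []
    for speaker, line in turns:
        if speaker not in (1, 2):
            raise ValueError("this sketch only handles speakers 1 and 2")
        parts.append(f"[S{speaker}] {line.strip()}")
    return " ".join(parts)


script = build_transcript([
    (1, "Did you actually read the license file?"),
    (2, "Every line. (laughs) Twice."),
])
print(script)
```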

The application fit is narrow but excellent where it applies: podcast-style audio, interactive game dialogue, and audio drama production, all workflows that require two speakers reacting to each other in real time. Dia isn't a general narration model; the multi-speaker architecture is its reason to exist.

Hardware: ~10GB VRAM; runs on an NVIDIA A4000 at roughly 40 tokens/second. Streaming is supported from first tokens.

Languages: English only.

[Image: person wearing headphones listening to audio in a quiet room] Dia's multi-speaker output produces full dialogue - including nonverbal cues like laughter - in a single inference pass, which no other open-source model matches at this quality level. Source: unsplash.com

Fish Speech S2 - Highest Benchmark Scores, License Caveats

Fish Speech S2 from Fish Audio is technically the strongest model in this comparison on benchmark metrics. The Dual-Autoregressive architecture pairs a 4B-parameter "slow" model for semantic quality with a 400M-parameter "fast" model for acoustic detail. The result: 81.88% win rate on EmergentTTS-Eval, WER of 0.54% on Chinese and 0.99% on English, and sub-150ms latency on an H200.
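For readers unfamiliar with the metric: word error rate for TTS is typically measured by transcribing the synthesized audio with a speech recognizer and comparing that transcript to the input text. The sketch below shows the standard word-level edit-distance calculation. It is a generic illustration and not the scoring code behind the numbers above, and it splits on whitespace, so it does not suit Chinese text, where character-level error rates are normally used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)
```

A WER of 0.99% means roughly one word in a hundred is transcribed differently from the input text.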

Training ran on over 10 million hours of audio across approximately 80 languages, including cross-lingual voice cloning - clone an English speaker and read Japanese copy. Over 15,000 natural language tags cover emotion, tone, volume, and rhythm. The voice cloning pipeline needs 10-30 seconds of reference audio.

The catch is licensing. Fish Speech ships under the Fish Audio Research License, not MIT or Apache 2.0. The specific terms around commercial use are in the LICENSE file in the repository. Your legal team needs to review before any production deployment. If you need a clean commercial license without that review step, Chatterbox or Kokoro are better starting points.

Hardware: H200-class GPU for best performance. Consumer GPUs with 12-16GB VRAM can run it at reduced throughput.

CosyVoice 3 - Best Multilingual Open-Source Option

CosyVoice is Alibaba's open-source TTS family. CosyVoice 3.0, released December 15, 2025, covers 9 languages with 18+ dialects and adds pronunciation inpainting - the ability to re-synthesize specific words in an existing audio clip without regenerating the full sentence. The 0.5B-parameter variant is the most deployable: roughly 8GB VRAM, streaming at ~150ms latency.

CosyVoice 2.0 reduced pronunciation errors by 30-50% compared to version 1.0. CosyVoice 3 extends that with dialect coverage (regional variants across Chinese, Arabic, and Spanish) and finer prosody control. Among the r/LocalLLaMA community, CosyVoice 2 is a consistent top-3 recommendation for multilingual self-hosted TTS.

License: Open source (Apache 2.0 for the main codebase - verify specific model weights terms before commercial use).

Hardware: ~8GB VRAM for the 0.5B variant.

Piper - Best for Edge and CPU-Only Deployment

Piper is the right model if your target is a Raspberry Pi, a headless server without GPU, or an offline embedded device. It runs on a single CPU core with 1-2GB of RAM, supports 30+ languages, and processes text at real-time speeds on low-power hardware. The Rhasspy voice assistant project built Piper specifically for this constraint.

Quality is below every GPU-powered model in this comparison. Piper sounds robotic on prosody-heavy text and doesn't handle emotional nuance. For smart home voice responses, accessibility tools, audiobook narration on edge hardware, or privacy-critical offline deployments where sending audio through a cloud API is a non-starter, Piper is the practical choice.

License: MIT.

Hardware: 1GB RAM, single CPU core. GPU optional (CUDA for faster processing).

Languages: 30+ including English, French, German, Spanish, Russian, Chinese, and others.

Comparison Table

Model	Parameters	License	Min VRAM	Voice Cloning	Languages	Latency / Speed
Kokoro-82M	82M	Apache 2.0	2.5GB	No	8	210x realtime
Chatterbox-Turbo	350M	MIT	Consumer GPU	Yes (5-20s)	23+ (Multilingual variant)	~200ms
Voxtral	4B	Creative Commons	16GB	Yes (3-5s)	9	70-90ms
Dia	1.6B	Apache 2.0	~10GB	No	1 (EN)	Streaming
Fish Speech S2	4B + 400M	Research License	12-16GB	Yes (10-30s)	80+	<150ms
CosyVoice 3	0.5B+	Apache 2.0	~8GB	Yes	9 + dialects	~150ms
Piper	Small	MIT	CPU only	No	30+	Real-time
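The table can also be read as a filter. The sketch below encodes a few of its columns and shortlists models for a given VRAM budget. The data class and function are illustrative, and several values are assumptions: Fish Speech S2 uses the low end of its 12-16GB range, Chatterbox-Turbo has no numeric VRAM figure in this article so it is flagged as unverified, and Kokoro and Piper are marked CPU-capable based on the sections above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Model:
    name: str
    min_vram_gb: Optional[float]  # None where the article gives no number
    voice_cloning: bool
    cpu_ok: bool = False


MODELS = [
    Model("Kokoro-82M", 2.5, False, cpu_ok=True),
    Model("Chatterbox-Turbo", None, True),
    Model("Voxtral", 16.0, True),
    Model("Dia", 10.0, False),
    Model("Fish Speech S2", 12.0, True),  # low end of the 12-16GB range
    Model("CosyVoice 3", 8.0, True),
    Model("Piper", None, False, cpu_ok=True),
]


def shortlist(vram_gb: float, need_cloning: bool) -> List[str]:
    """Names of models that fit the VRAM budget (0 means CPU only)."""
    names = []
    for model in MODELS:
        if need_cloning and not model.voice_cloning:
            continue
        if model.cpu_ok:
            names.append(model.name)
        elif model.min_vram_gb is None:
            # No figure to compare against: include only if a GPU exists.
            if vram_gb > 0:
                names.append(f"{model.name} (VRAM unverified)")
        elif vram_gb >= model.min_vram_gb:
            names.append(model.name)
    return names


print(shortlist(vram_gb=8, need_cloning=True))
print(shortlist(vram_gb=0, need_cloning=False))
```

License is deliberately left out of the filter, since whether a given license is acceptable is a legal question rather than a numeric one.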

Which Model Fits Which Use Case

Narration and content creation (clean license, no cloning needed): Kokoro-82M at Apache 2.0 is the most practical starting point. Install it, pick a preset voice, run inference locally. No VRAM budget pressure, no licensing conversation required.

Voice cloning for production applications: Chatterbox-Turbo under MIT is the cleanest path. The benchmark numbers are favorable, the license requires no additional negotiation, and the 350M parameter footprint is manageable on a consumer GPU.

Highest possible output quality, GPU available: Voxtral at 4B parameters and 70-90ms latency is the strongest open-weight model if you have a 16GB+ GPU and can work within the Creative Commons license terms. Fish Speech S2 has higher benchmark scores but requires a licensing review.

Multi-speaker dialogue or podcast-style audio: Dia (Apache 2.0) has no real open-source competitor for this use case. The nonverbal cue support is a unique capability.

Multilingual at scale: CosyVoice 3 covers 9 languages and 18+ dialects with streaming and voice cloning. Fish Speech S2 and Piper list more languages, but Fish Speech S2 requires a licensing review and Piper trades away output quality.

Offline, CPU-only, or edge hardware: Piper. No other model in this list runs comfortably on a Raspberry Pi without modification.

The engineering overhead of self-hosting any of these models is real. Getting Kokoro running takes an afternoon. Getting Voxtral or Fish Speech S2 into production with proper batching, streaming, and fault tolerance takes longer. That gap between "runs on my machine" and "handles production load" is where the cost of open-source TTS actually lives. For teams without dedicated ML infrastructure, a paid API is often cheaper once engineering time is included.
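One way to test that last claim for your own team is a break-even estimate. The sketch below compares a monthly self-hosting bill against a per-character API price. Every input in the example is a hypothetical placeholder (GPU rate, hours, engineering rate), and the API price reuses the 0.016 per 1,000 characters quoted for Voxtral, so substitute your own numbers.

```python
def monthly_self_host_cost(
    gpu_hourly: float,
    gpu_hours: float,
    engineering_hours: float,
    engineering_rate: float,
) -> float:
    """GPU rental plus engineering time for one month."""
    return gpu_hourly * gpu_hours + engineering_hours * engineering_rate


def break_even_characters(self_host_cost: float, api_price_per_1000: float) -> float:
    """Monthly character volume at which the API bill equals self-hosting."""
    if api_price_per_1000 <= 0:
        raise ValueError("api_price_per_1000 must be positive")
    return self_host_cost / api_price_per_1000 * 1000


# Hypothetical inputs: one GPU at 1.00/hour running all month (720 h),
# plus 20 hours of maintenance at 100/hour.
cost = monthly_self_host_cost(1.00, 720, 20, 100)
volume = break_even_characters(cost, 0.016)
print(f"Self-hosting cost: {cost:,.0f} per month")
print(f"Break-even volume: {volume:,.0f} characters per month")
```

Below the break-even volume the API is the cheaper option. Above it, self-hosting starts to pay for itself, provided the engineering estimate holds.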