Best AI Voice Cloning Tools 2026
A hands-on comparison of the best AI voice cloning tools in 2026 - covering ElevenLabs, Resemble AI, Cartesia, PlayHT, open-source alternatives, and consent requirements.

Voice cloning is not the same thing as text-to-speech. Generic TTS gives you a choice from a catalog of pre-built voices. Voice cloning takes an audio recording of a specific person - a voice actor narrating their own material, a dubbing artist recording a consent script, a developer running a self-clone for an accessibility app - and synthesizes new speech in that voice from arbitrary text. One-shot and few-shot cloning have gotten remarkably good in 2025-2026; a 10-30 second sample is now enough for commercial-grade output from the best providers.
That capability gap is also why this article is separate from Best AI Voice Generators 2026, which covers generic TTS with pre-built voice catalogs. This piece is specifically about tools that let you replicate an input voice, with all the technical, pricing, and ethical dimensions that entails.
For normalized pricing across TTS APIs (including per-1M-character rates for every provider below), see Text-to-Speech API Pricing Compared. For quality benchmarks across voice models, see the AI Voice and Speech Leaderboard.
TL;DR - Best picks by category
- Best professional quality: ElevenLabs Professional Voice Clone - highest fidelity for voice actors, narrators, brands
- Best API for developers: PlayHT 2.0 Turbo or Cartesia Sonic - clean API, sub-500ms streaming, voice clone included
- Best open-source self-host: F5-TTS or Spark-TTS - no commercial restrictions, 3-10 second clone from a sample
- Best for dubbing and multilingual: Camb.ai - purpose-built dubbing pipeline, 100+ language lip-sync
- Best free option: ElevenLabs free tier (10K chars/month with instant clone) or Coqui xTTS v2 self-hosted at $0 API cost
What voice cloning actually means in 2026
There are three meaningfully different things that go by "voice cloning":
Instant / zero-shot cloning - You provide a short audio sample (5-60 seconds) at inference time. The model conditions on that sample and generates new speech in the same voice, without any fine-tuning. This is what ElevenLabs Instant Clone, PlayHT Ultra Instant, Cartesia, and most open-source models (xTTS v2, F5-TTS, Spark-TTS) do. Quality ranges from acceptable to excellent depending on the model and sample quality.
Few-shot cloning with fine-tuning - You provide minutes to hours of clean recordings, and the provider runs a training job to adapt a base model to the specific voice. Output is more stable, consistent, and accurate to the source voice than zero-shot methods, especially on edge cases. ElevenLabs Professional Voice Clone, Microsoft Custom Neural Voice, and WellSaid Labs custom voices use this approach.
On-device / embedded cloning - Models small enough to run on edge hardware (phones, embedded systems). Spark-TTS and Kokoro-82M fall into this category for inference; full voice cloning on-device remains early-stage.
The quality hierarchy follows this order: fine-tuned professional > instant/zero-shot commercial > instant/zero-shot open-source. But that gap is narrowing fast.
Feature comparison table
| Tool | Open source | Clone samples needed | Languages | Streaming | Consent verification | Starting price | Commercial rights |
|---|---|---|---|---|---|---|---|
| ElevenLabs Instant Clone | No | 30 sec | 32 | Yes | TOS only | Free (10K chars/mo) | Creator plan+ |
| ElevenLabs Pro Voice Clone | No | 30+ min | 32 | Yes | Required submission | Enterprise | Enterprise |
| Resemble AI | No | 3-10 sec | 10+ | Yes | TOS only | $0.006/sec | Yes (paid) |
| WellSaid Labs Custom Voice | No | Studio session | English | Yes (API) | Partnership required | Contact sales | Yes |
| Microsoft Custom Neural Voice | No | Hours (studio) | 60+ | Yes | Application + consent docs | Contact sales | Yes (enterprise) |
| Descript Overdub | No | 10-30 min | English | No | In-product video consent | Hobbyist $12/mo | Creator plan+ |
| PlayHT 2.0 Turbo | No | 3-10 sec | 142 | Yes | TOS only | Creator $39/mo | Creator plan+ |
| OpenAI Voice Engine | No | 15 sec | 57 | Yes | Gated access | Gated | Gated |
| Cartesia Sonic | No | Seconds | 15+ | Yes | TOS only | $5 credit (PAYG) | Yes |
| Camb.ai | No | Short clip | 100+ | No | TOS only | Contact sales | Yes |
| Coqui xTTS v2 | Yes (MPL-2.0) | 3-6 sec | 17 | Yes | None (self-hosted) | $0 API cost | Review MPL-2.0 |
| StyleTTS 2 | Yes (MIT) | Fine-tuning needed | English | No | None | $0 API cost | Yes (MIT) |
| OpenVoice v2 | Yes (MIT) | 1 short clip | 10+ | No | None | $0 API cost | Yes (MIT) |
| F5-TTS | Yes (MIT/CC-BY) | 3-10 sec | 10+ | Yes | None | $0 API cost | Check per model |
| MaskGCT | Yes (Apache-2.0) | Short clip | 12+ | No | None | $0 API cost | Yes (Apache) |
| Spark-TTS | Yes (Apache-2.0) | Short clip | 20+ | Yes | None | $0 API cost | Yes (Apache) |
| Kokoro-82M | Yes (Apache-2.0) | No cloning | English/Japanese | Yes | N/A | $0 API cost | Yes (Apache) |
| HeyGen Voice Clone | No | 2+ min | 40+ | No | In-product consent flow | Essentials $29/mo | Yes (paid) |
| Speechify Voice Clone | No | 30 sec | 30+ | No | In-product consent | Premium $139/yr | Yes (paid) |
| Murf.ai Voice Cloning | No | Studio session | English (custom) | No | Professional process | Contact sales | Yes |
Managed / high-quality services
ElevenLabs - Instant Clone and Professional Voice Clone
ElevenLabs is the reference point for commercial voice cloning quality in 2026. They offer two distinct products: Instant Clone and Professional Voice Clone, and the distinction matters.
Instant Clone conditions on a 30-second audio sample at inference time. No training job, no wait. You get a voice ID you can call via the API or use in the web studio. Available on Creator plan ($22/month) and above. The output is very good for narration and conversational content; it degrades a bit on long-form content with complex prosody or on source material with heavy background noise.
Professional Voice Clone is a separate, higher-fidelity product. You submit 30+ minutes of clean, labeled recordings - typically a voice actor reading a standardized script. ElevenLabs runs a full training job. Output is significantly more stable and consistent than the instant clone, especially on edge cases (whispering, shouting, long pauses, emotional range). This tier requires an enterprise agreement and carries substantially higher pricing.
ElevenLabs requires consent. Their terms of service prohibit cloning voices without permission. The Professional Voice Clone process includes a consent attestation. For instant clone, the consent responsibility sits with the user - ElevenLabs provides a permission-audit flow in the platform that logs voice creation events, useful for compliance documentation.
Quality metrics: ElevenLabs Flash v2.5 sits at Elo 1549 on the TTS Arena leaderboard, and Turbo v2.5 at 1545. Both are in the top 10 of the overall TTS Arena as of April 2026.
Pricing: Instant cloning is included on Creator ($22/month, 100K chars), Pro ($99/month, 500K chars), and Scale ($330/month, 2M chars) plans. Overage is $0.30/1K characters ($300/1M). Professional Voice Clone is enterprise-only - contact sales.
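To compare plans for a given monthly volume, the included-character and overage numbers above reduce to simple arithmetic. A minimal sketch - the plan figures are copied from the pricing above and should be treated as a snapshot, not live pricing:

```python
# ElevenLabs plan snapshot from the pricing above -- verify before relying on it.
PLANS = {
    "creator": (22.0, 100_000),   # (monthly fee USD, included characters)
    "pro":     (99.0, 500_000),
    "scale":   (330.0, 2_000_000),
}
OVERAGE_PER_1K_CHARS = 0.30       # $0.30 per 1K characters beyond the quota

def monthly_cost(plan: str, chars: int) -> float:
    """Estimated monthly bill for `chars` characters of synthesis."""
    fee, included = PLANS[plan]
    overage_chars = max(0, chars - included)
    return fee + (overage_chars / 1_000) * OVERAGE_PER_1K_CHARS
```

At 800K characters/month, for example, Pro comes to $99 + 300 x $0.30 = $189, still well under Scale's $330 flat fee - the crossover points are easy to find once the formula is explicit.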
Best for: Voice actors creating a digital clone of their own voice, content studios that need consistent character voices across thousands of lines of dialogue, brands building custom voice assistants.
Source: ElevenLabs Pricing
Resemble AI
Resemble AI is a developer-facing cloning platform that prices by the second of audio output rather than by input characters. At $0.006 per second, that works out to $21.60 per hour of audio produced. Voice cloning uses the same per-second rate, which makes clone-synthesis pricing unusually transparent.
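Per-second pricing maps directly from output duration to dollars, so budgeting is a one-line calculation (rate copied from above):

```python
RESEMBLE_RATE_PER_SEC = 0.006  # USD per second of generated audio (from above)

def resemble_cost(audio_seconds: float) -> float:
    """Cost of a given duration of synthesized audio at the per-second rate."""
    return audio_seconds * RESEMBLE_RATE_PER_SEC

# One hour of finished audio: 3600 s x $0.006 = $21.60
```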
The API handles voice creation from short sample uploads (3-10 seconds for instant mode, more for higher quality). Resemble supports streaming output. Language support covers 10+ languages but English is the primary quality target. The platform includes a voice designer tool and emotion controls via their API.
Resemble is self-serve - you can create a custom voice through their dashboard without going through an approval process. Consent enforcement is terms-based rather than technical. That simplicity is useful for developers moving fast; it also means the user carries all the compliance burden.
Best for: Developers building voice personalization into applications who want predictable per-second pricing and API-first design.
Source: Resemble AI Pricing
WellSaid Labs - Custom Voice Avatar
WellSaid Labs targets enterprise narration: corporate L&D, e-learning platforms, marketing video production. Their custom voice avatar process involves a partnership with WellSaid rather than self-serve upload - you arrange a recording session with WellSaid's team, they handle the training, and the resulting voice runs on their infrastructure. All voice talent is contracted under signed agreements.
This is not a developer API in the same sense as ElevenLabs or Resemble. It is closer to a managed service where you end up with a dedicated voice asset hosted by WellSaid. API access for programmatic synthesis is available on Teams/Enterprise plans; there is no real-time streaming on the base Maker plan ($49/month). English only.
Best for: Enterprise teams in regulated industries (finance, healthcare, HR training) where legally vetted voice talent and a documented consent chain are required.
Source: WellSaid Labs Pricing
Microsoft Custom Neural Voice
Microsoft Custom Neural Voice (CNV) is the most rigorous consent-enforcement environment in the commercial managed API space. It is not self-serve. To begin a CNV project:
- Submit an application describing your use case
- Receive approval from Microsoft
- Provide signed consent documentation from the voice talent
- Record a curated script (typically hours of studio-quality audio)
- Submit for training
Synthesis from a trained CNV voice runs at Azure's $24.00/1M character rate (see TTS API Pricing for full breakdown). Training cost is separate and quoted through sales. This process is expensive and slow compared to instant cloning services - and that is by design.
For enterprise deployments where a specific brand voice is required and the legal documentation must be airtight, CNV is the correct choice. For everything else, it is overkill.
Source: Azure Custom Neural Voice - Microsoft Learn
Descript Overdub
Descript's Overdub feature lets you create a voice clone that lives inside the Descript editor - you can correct a transcript and have Descript re-synthesize the corrected section in your voice. This is a podcast and video editing workflow tool, not a standalone API.
Setup requires recording a 10-30 minute consent script inside the Descript app. The consent flow is built into the product: you read statements explicitly acknowledging you are creating an AI voice of yourself. Cloning voices other than your own is prohibited.
Overdub quality is good for the use case - short phrase corrections in an existing recording session - but it is not designed for synthesizing new content from scratch. No streaming API. English only.
Best for: Podcasters and video creators who want seamless re-recording for editing mistakes, without the complexity of a standalone API integration.
Source: Descript Pricing
Developer APIs
PlayHT - 2.0 Turbo Voice Clone
PlayHT's 2.0 Turbo model delivers instant voice cloning from a 3-10 second audio sample via API, with support for 142 languages - the broadest multilingual coverage of any commercial instant-clone API. Their Ultra clone mode (Pro plan and above) runs a more thorough adaptation and produces more stable output on accent-heavy or emotionally expressive source material.
The API is REST + WebSocket. Streaming latency on 2.0 Turbo is ~300-500ms time-to-first-audio-byte. Creator plan ($39/month) includes instant cloning; Ultra clone requires Pro ($99/month).
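Time-to-first-audio-byte is easy to measure yourself rather than trusting published numbers. A provider-agnostic sketch - the fake stream below is a stand-in for whatever WebSocket or HTTP chunk iterator your client library exposes:

```python
import time

def measure_ttfab(chunk_iter):
    """Seconds from iteration start until the first non-empty audio chunk."""
    start = time.monotonic()
    for chunk in chunk_iter:
        if chunk:
            return time.monotonic() - start
    return None  # stream ended without producing audio

def fake_stream(first_byte_delay=0.05, n_chunks=3):
    """Simulated streaming TTS response; replace with the real iterator."""
    time.sleep(first_byte_delay)   # simulated model + network latency
    for _ in range(n_chunks):
        yield b"\x00" * 1024       # raw audio bytes

ttfab = measure_ttfab(fake_stream())
```

Run this against each candidate API with your actual text lengths; vendor latency figures usually assume short inputs and warm connections.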
PlayHT has a large library of pre-made voices alongside the cloning feature - 800+ across 142 languages. The combination makes it practical for multilingual products where you might use pre-built voices for most languages and cloned voices for primary-market languages.
Best for: Multilingual applications needing both instant voice cloning and a broad pre-built voice catalog in a single API.
Source: PlayHT API Docs
OpenAI Voice Engine (gated)
OpenAI's Voice Engine was previewed in March 2024 and remained in a closed beta through early 2026. As of April 2026 it is still gated - you can request access but there is no self-serve sign-up. OpenAI published a 15-second sample requirement for cloning and described a consent-first deployment policy.
The product is notable primarily because OpenAI wrote one of the clearest public explainers on the consent and safety challenges of synthetic voice, and they designed Voice Engine with consent verification, no-go lists for political figures, and built-in audio watermarking. They have been unusually conservative about releasing it at scale.
When and if Voice Engine becomes broadly available, the integration story for teams already using OpenAI's API will be compelling. Until then, it is a roadmap item rather than a usable product for most developers.
Best for: Teams who are already in the gated program, or developers planning future pipeline architecture around OpenAI's ecosystem.
Cartesia Sonic - Voice Cloning
Cartesia Sonic is primarily benchmarked for latency - ~80-90ms time-to-first-audio-byte in streaming mode makes it the fastest option for real-time voice agents. Voice cloning is included via the API at no extra cost beyond the standard $50/1M character rate on pay-as-you-go.
The cloning process takes a short audio sample at inference time (zero-shot). Output quality is solid for conversational and agent use cases but does not reach ElevenLabs-tier naturalness on expressive content. Language coverage is 15+ languages, with English as the primary optimization target. The free tier includes $5 of credit on signup.
What Cartesia adds over competitors: nonverbal expressiveness features (laughter, breathing, sighs) in the voice model itself, SOC2 compliance, on-premises deployment option, and a 99.9% uptime SLA. For production voice agent infrastructure, these matter.
Best for: Real-time voice agent pipelines where sub-100ms TTS latency is required and voice quality needs to be adequate but not perfect.
Source: Cartesia Pricing
Camb.ai - Dubbing and Voice Cloning
Camb.ai is purpose-built for dubbing - taking existing video or audio content and re-rendering it in another language while preserving the source speaker's voice. The pipeline handles transcription, translation, voice cloning, and (for video) lip-sync in one workflow.
Coverage is 100+ languages, which is the broadest in this comparison. The voice cloning used in dubbing is zero-shot from the source audio - no separate sample required. Output quality for dubbing use cases is strong; this is a different optimization target than a general-purpose TTS clone.
Camb.ai does not publish a self-serve pricing page as of April 2026 - the sales model is contact-based. This positions it as an enterprise/studio tool rather than a developer API for prototyping.
Best for: Content studios, streaming platforms, and video producers who need to dub content across many languages with voice consistency, and who are willing to work with an enterprise pricing model.
Source: Camb.ai
Open-source and self-hosted
Coqui xTTS v2 (historical - archived)
Coqui AI, the company, shut down in January 2024, and the coqui-ai/TTS GitHub repository is archived. xTTS v2, their flagship multilingual voice clone model, remains the most widely deployed open-source voice cloning model in 2026 - the community has maintained forks and it runs on Hugging Face.
xTTS v2 supports 17 languages and clones a voice from a 3-6 second audio sample at inference time, no training required. Running on an A10G GPU (24GB VRAM) it produces audio at roughly 3-4x real-time. Infrastructure cost: approximately $0.53-$1.06/1M characters, far below any commercial API. Licensed under Mozilla Public License 2.0 - commercial use is permitted but requires careful review.
The catch: no SLA, no support, quality that trails commercial offerings on naturalness, and a license that places obligations on redistribution of modified versions. As a foundation for building a custom voice synthesis pipeline it is solid; as a drop-in for production customer-facing audio it shows its age against 2026 alternatives.
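The per-million-character figure depends entirely on three inputs - GPU price, speaking rate, and synthesis speed - so it is worth making the arithmetic explicit rather than quoting a single number. A sketch; the example parameter values in the docstring are assumptions, not measurements:

```python
def cost_per_million_chars(gpu_hourly_usd: float,
                           chars_per_audio_sec: float,
                           realtime_factor: float) -> float:
    """Rough self-hosting cost to synthesize 1M characters of speech.

    gpu_hourly_usd:      what you actually pay for the GPU, not list price
    chars_per_audio_sec: speaking rate of the output (~15-17 for English)
    realtime_factor:     seconds of audio produced per GPU-second (e.g. 3-4)
    """
    audio_seconds = 1_000_000 / chars_per_audio_sec
    gpu_hours = audio_seconds / realtime_factor / 3600
    return gpu_hours * gpu_hourly_usd
```

Batching concurrent requests raises the effective realtime factor and is the main lever for pushing the cost down; plug in your own GPU rate and measured throughput rather than relying on any published estimate.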
Source: coqui/XTTS-v2 on Hugging Face
StyleTTS 2
StyleTTS 2 (MIT license) takes a different approach to voice cloning than xTTS v2. Rather than zero-shot inference, it uses style diffusion and adversarial training on reference audio to produce a voice that matches the target speaker's style - phonation, prosody, pitch range. The cloning quality, particularly for English, is excellent relative to its model size.
The tradeoff: StyleTTS 2 requires some fine-tuning to get the best clone quality, rather than pure zero-shot conditioning. This is more work than uploading a clip to ElevenLabs, but for developers building a product where they have hours of source audio (a voice actor's back catalog, for example), the output quality approaches commercial levels.
English only. MIT license - use commercially with no restrictions.
Source: StyleTTS2 on GitHub
OpenVoice v2 (MyShell)
OpenVoice v2, released by MyShell AI, is an MIT-licensed instant voice cloning model that achieves zero-shot voice cloning from a short audio clip and supports cross-lingual synthesis - meaning you can clone an English voice and synthesize Chinese output, or vice versa. Language support covers 10+ languages.
The key technical feature is a tone-color encoder that captures the timbre characteristics of the reference voice separately from other voice properties. This architecture makes cross-lingual transfer more reliable than many competing open-source models.
MIT license, self-hosted, no API cost. Quality for casual use cases is solid; it does not match ElevenLabs or commercial services on expressive or emotionally varied content.
Source: OpenVoice on GitHub
F5-TTS
F5-TTS (MIT-licensed, with CC-BY-4.0 on training data) is a recent open-source voice cloning model that achieves competitive zero-shot clone quality using a flow-matching approach. It clones from a 3-10 second reference audio and produces natural-sounding output in the source voice's style.
In community benchmarks, F5-TTS outperforms xTTS v2 on English naturalness and has better prosody transfer. It supports 10+ languages. The inference pipeline is faster than StyleTTS 2 for zero-shot use. A demo is available on Hugging Face.
The license on the model weights themselves should be verified at time of use - some fine-tuned variants carry different licenses from the base model, and the main repo notes non-commercial restrictions on some training-data derivatives.
Source: F5-TTS on GitHub
MaskGCT
MaskGCT (Apache-2.0) is released by the Amphion team at CUHK and achieves zero-shot TTS and voice cloning using a masked generative codec transformer. It clones from a short reference clip and supports 12+ languages. Apache-2.0 license means no restrictions on commercial use.
MaskGCT ranked competitively against commercial APIs in the Amphion paper's evaluations on MOS (Mean Opinion Score) and speaker similarity metrics. In community testing it performs noticeably better than xTTS v2 on naturalness for English, though the inference speed is slower than F5-TTS.
Source: MaskGCT in Amphion on GitHub
Spark-TTS
Spark-TTS (Apache-2.0) from SparkAudio supports 20+ languages and is designed to run efficiently on consumer hardware. It uses a bicodec approach that separates semantic and acoustic tokens, enabling both voice cloning from short clips and voice parameter control (adjust pitch, speaking rate, energy) without re-recording.
The voice cloning quality is strong for an open-source model, with notably good cross-lingual transfer. The Apache-2.0 license covers both code and released checkpoints. For teams building multilingual voice products who want full infrastructure control, Spark-TTS is the most capable current open-source option.
Source: Spark-TTS on GitHub
Kokoro-82M (baseline reference - no cloning)
Kokoro is included here as a quality baseline, not a voice cloning tool. At 82M parameters, Apache-2.0, it runs at 96x real-time on a basic GPU and produces genuinely good English and Japanese TTS from pre-built voices. It does not support voice cloning.
If you need a fast, free, no-restrictions TTS model for fixed voices (narration, testing, accessibility), Kokoro is the answer. If you need voice cloning, use F5-TTS, Spark-TTS, or xTTS v2. The reason to mention Kokoro here is that developers often start evaluating voice cloning tools and discover that a fixed-voice solution covers 80% of their actual use case.
Source: Kokoro on GitHub
Microsoft VibeVoice
Microsoft's VibeVoice (GitHub, research release) is a long-form voice synthesis system - designed specifically for generating hours-length audio (audiobooks, lectures) rather than short-form TTS. It maintains prosodic consistency over very long passages in a way that standard TTS models cannot.
VibeVoice is currently a research release rather than a production-ready product. It supports voice cloning via a reference audio approach but has higher quality requirements on the input sample than xTTS v2 or F5-TTS. Worth watching for audiobook production pipelines.
Source: VibeVoice on GitHub
Consumer-facing products
HeyGen - Voice Clone
HeyGen's voice clone feature is built into their AI video platform. You provide a 2+ minute recording, and HeyGen creates a voice model used to generate or dub video content - including synchronized lip movements on an AI avatar. The consent flow is built into the product and includes an explicit acknowledgment screen.
Voice clone on HeyGen requires the Essentials plan ($29/month) or above. The output works best for the HeyGen use case - short-to-medium avatar video with synchronized speech - and is not available as a standalone synthesis API. Language support covers 40+ languages for voice plus a subset for lip-sync.
Best for: Video content creators, marketing teams, and learning professionals who are already in the HeyGen ecosystem and need to maintain voice consistency across video content.
Source: HeyGen Pricing
Speechify - Voice Clone
Speechify's voice clone lets you record a short sample (around 30 seconds) and creates a voice that reads your documents, articles, and notes back to you in your own voice. The consent is captured during the in-app recording flow.
This is primarily a personal accessibility product. The synthesized voice is used within the Speechify app; there is no developer API for third-party use. Available on the Premium plan ($139/year). 30+ languages. Not relevant for building applications, but useful context for teams building accessibility tooling - Speechify shows the user experience bar for voice-first reading interfaces.
Source: Speechify Pricing
Murf.ai - Voice Cloning
Murf.ai provides AI voiceovers for marketing and e-learning content. Their voice cloning product creates a custom voice from a professional recording session. This is not a self-serve upload feature - it involves working with Murf's team, submitting clean recordings, and receiving a custom voice available in the Murf studio.
The custom voice lives within the Murf platform; API access at scale requires contacting sales. English primary, with limited multilingual options for custom voices. The target customer is the same as WellSaid Labs - enterprise content teams who need a consistent brand voice across high volumes of narrated content.
Best for: Marketing and L&D teams producing high volumes of narrated video content who want the voice consistency of a clone without managing self-hosted infrastructure.
Source: Murf.ai Pricing
Best for X - decision matrix
Best professional quality, budget not a constraint: ElevenLabs Professional Voice Clone (fine-tuned, enterprise). If budget is a concern but quality is still primary, ElevenLabs Instant Clone on the Pro plan gives ~80% of the quality at a fraction of the price.
Best developer API for a voice agent or real-time application: Cartesia Sonic. Sub-100ms TTFAB streaming with voice cloning included, SOC2 compliance, and a clean REST + WebSocket API. If multilingual coverage matters more than latency, switch to PlayHT 2.0 Turbo.
Best open-source self-hosted: Spark-TTS for multilingual coverage and parameter control, or F5-TTS for English-first zero-shot quality. Both are Apache-2.0 / MIT licensed with no commercial restrictions. xTTS v2 remains the most widely deployed and battle-tested but is showing age.
Best for dubbing and multilingual localization: Camb.ai, purpose-built for the dubbing pipeline with 100+ language coverage and lip-sync. For developers who want to build their own dubbing pipeline, PlayHT's 142-language clone API is the best self-serve option.
Best free option: ElevenLabs free tier (10K chars/month, instant clone included) is the best-managed-API free option for evaluation. For zero-cost production use, self-hosted F5-TTS or Spark-TTS.
Best for audiobooks and long-form narration: ElevenLabs Professional Voice Clone for commercial quality with a specific narrator's voice, or Microsoft VibeVoice (self-hosted) for long-form prosody consistency.
Best for legal defensibility and enterprise compliance: Microsoft Custom Neural Voice or WellSaid Labs Custom Voice - both enforce consent documentation before voice training begins, and the voice assets live in a managed, auditable environment.
Ethics, consent, and the legal landscape
Voice cloning is one of the fastest-moving legal and regulatory areas in AI. This section covers what you actually need to know if you are building a product.
Consent is the baseline - not a nice-to-have
Every legitimate voice cloning use case starts with explicit, informed consent from the person whose voice is being cloned. There are no exceptions worth taking seriously. The person must:
- Know their voice is being cloned
- Know what it will be used for
- Actively agree to that use
In practice, this means a written consent agreement before any recording begins, documentation of what the clone will be used for, and clear terms on what uses are prohibited (political content, deceptive impersonation, pornographic content, etc.). Voice actors, dubbing artists, and narrators should treat this documentation as they would a standard talent contract.
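For teams building products, it helps to treat that documentation as structured data from day one. A minimal sketch of a consent record - the field names are illustrative, not a legal template, and the actual agreement should be reviewed by counsel:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class VoiceConsentRecord:
    """Illustrative consent-documentation record for a cloning engagement."""
    speaker_name: str
    agreement_ref: str            # pointer to the signed written agreement
    permitted_uses: list          # e.g. ["e-learning narration"]
    prohibited_uses: list         # e.g. ["political content", "impersonation"]
    consent_confirmed: bool       # speaker actively agreed to the listed uses
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = VoiceConsentRecord(
    speaker_name="Jane Doe",
    agreement_ref="contracts/2026-04-jdoe-voice.pdf",
    permitted_uses=["product demo narration"],
    prohibited_uses=["political content", "deceptive impersonation"],
    consent_confirmed=True,
)
```

Logging a record like this at voice-creation time gives you the audit trail that platforms such as ElevenLabs and Descript maintain on your behalf, for the self-hosted case where no platform does it for you.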
The article you are reading is meant to help legitimate voice cloning use cases - voice actors building a digital replica of their own voice, accessibility applications, dubbing with licensed talent, brand voice work. It does not provide and will not provide workarounds for consent checks.
Platform consent verification tiers
The major platforms vary significantly in how they enforce consent:
Technical enforcement + application review (strongest):
- Microsoft Custom Neural Voice and WellSaid Labs Custom Voice both require documentation review and approval before a voice project begins. You cannot create a clone without going through their process.
In-product consent flow (strong for self-cloning):
- Descript Overdub and HeyGen require you to complete an explicit consent recording where you read a statement acknowledging what you are creating. This works well for the "clone your own voice" use case. It cannot verify consent when uploading third-party recordings.
- ElevenLabs Professional Voice Clone includes a consent attestation in the enterprise process.
Terms-only enforcement (weakest):
- ElevenLabs Instant Clone, PlayHT, Cartesia, Resemble, and all open-source models depend entirely on ToS agreements and user self-certification. The technical pipeline will accept any audio sample. Users are contractually prohibited from cloning without consent, but there is no technical gate.
Open-source models have no enforcement layer at all - no terms, no accounts, no audit trail. The responsibility sits entirely with the deploying organization.
US and international law - current state (April 2026)
Tennessee ELVIS Act (Ensuring Likeness, Voice, and Image Security): Effective July 2024. Specifically covers AI voice cloning of musical artists without consent. Civil and criminal penalties. First US state law to specifically address AI-generated voice in the music context.
California AB 2602 and AB 1836: AB 2602 (effective January 2025) requires contracts covering AI-generated digital replicas of performers to specifically describe the intended use of the replica. AB 1836 covers deceased personalities. Both apply to professional engagements in the entertainment industry.
EU AI Act: Came into force in August 2024. AI-generated synthetic media (including voice) requires disclosure when deployed in high-risk applications or to audiences who might be deceived. Deepfake audio intended to deceive is addressed explicitly. Member states are implementing enforcement mechanisms through 2026.
Federal (US): The TAKE IT DOWN Act (2025) covers non-consensual intimate imagery including AI-generated but does not specifically address voice cloning. Broader federal AI legislation is pending but not enacted as of April 2026.
Deepfake audio in political contexts: Multiple US states have passed laws prohibiting AI-generated audio or video of political candidates without disclosure within a specified window before an election. If your use case touches political content, get a lawyer - this landscape changes frequently.
YouTube and platform disclosure requirements
YouTube requires creators to disclose when AI tools are used to generate realistic synthetic audio that viewers might mistake for real - specifically, cloned voice content depicting real, identifiable people. The disclosure requirement was formalized in the YouTube Creator guidelines update in early 2025 and is applied through a disclosure setting in the video upload flow.
Platforms including TikTok, Instagram, and Spotify have implemented similar requirements. Violating these disclosure rules risks content removal and account penalties. This is separate from copyright or legal issues - it is a platform policy layer.
What ElevenLabs' permission-audit flow does
ElevenLabs operates a voice actor permission audit system where verified voice actors can block their voice identity from being cloned by other users. When ElevenLabs receives a takedown request or flagged content, they can trace the voice ID to the creator account. This audit trail is valuable for copyright enforcement but does not prevent initial unauthorized uploads - it enables post-hoc remediation.
Their platform also includes a feature for voice actors to submit their voice for detection - so if someone else uploads a clone of a registered voice actor, the system can flag the match. Not perfect, but a meaningful improvement over pure terms-enforcement.
Quality metrics: what MOS and speaker similarity actually mean
Two metrics come up most often in voice cloning evaluations:
Mean Opinion Score (MOS): Human listeners rate audio quality on a 1-5 scale. Commercial TTS APIs typically publish MOS scores in the 4.0-4.5 range for English. The problem: MOS scores are heavily dependent on the test set, listener pool, and reference audio quality. A vendor's self-published MOS score is not directly comparable to an independently run evaluation.
Speaker Similarity Score: Measures how closely the cloned voice matches the target speaker's voice, typically using a speaker verification model (cosine similarity of speaker embeddings). A score of 0.80+ on a standard speaker verification model indicates a strong match. Zero-shot models typically achieve 0.70-0.85 on good quality reference audio; fine-tuned professional clones can reach 0.90+.
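Speaker similarity in these evaluations is usually just the cosine similarity between embeddings produced by a speaker-verification model (which encoder you use - ECAPA-TDNN-style models are a common choice - is an assumption that shifts the absolute numbers). The metric itself is trivial:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embedding vectors (same length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Because thresholds like 0.80 are encoder-specific, only compare clone scores computed with the same verification model and reference audio.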
What these numbers miss: Neither MOS nor speaker similarity captures naturalness over long-form content, emotional expressiveness, or how well the voice handles domain-specific vocabulary (medical terminology, technical jargon). For production evaluation, run your actual content through the candidate models and have human raters score it blind.
The AI Voice and Speech Leaderboard tracks arena-style preference votes across commercial TTS and voice cloning models - preference testing on real content is more actionable than isolated benchmark scores.
FAQ
What is the difference between voice cloning and TTS?
Standard TTS (text-to-speech) synthesizes speech from a catalog of pre-built voices. Voice cloning creates a new synthesized voice that matches a specific person's vocal characteristics, derived from a sample recording of that person. The Best AI Voice Generators 2026 article covers the pre-built voice TTS side of the market.
How much audio do I need to clone a voice?
Zero-shot cloning tools (ElevenLabs Instant Clone, F5-TTS, xTTS v2, Spark-TTS) need as little as 3-30 seconds of clean audio. Fine-tuned professional cloning (ElevenLabs Professional Voice Clone, Microsoft Custom Neural Voice) typically requires 30+ minutes of studio-quality recordings. More audio generally improves stability and accuracy, especially on expressive or non-standard speech.
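A cheap pre-flight check before uploading a reference clip is to verify its duration falls in the zero-shot window. This sketch uses Python's standard-library `wave` module and writes a synthetic 5-second clip purely for illustration; the 3-30 second bounds reflect the zero-shot tools above, not any one provider's hard limits:

```python
import wave

def clip_duration_seconds(path: str) -> float:
    """Duration of a WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def meets_zero_shot_window(path: str, min_seconds: float = 3.0,
                           max_seconds: float = 30.0) -> bool:
    """Rough sanity check for a zero-shot cloning reference clip."""
    return min_seconds <= clip_duration_seconds(path) <= max_seconds

# Write a 5-second silent 16 kHz mono WAV as a stand-in reference clip.
with wave.open("ref.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000 * 5)

ok = meets_zero_shot_window("ref.wav")  # True: 5 s is inside 3-30 s
```

For professional fine-tuned clones the same check applies with a far higher floor (30+ minutes of audio, usually split across many files).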
Is voice cloning legal?
Cloning your own voice: yes, in all major jurisdictions. Cloning someone else's voice with their consent and for agreed purposes: yes, subject to proper documentation. Cloning someone's voice without their consent: potentially illegal under multiple US state laws (ELVIS Act, California AB 2602), the EU AI Act, and general tort law on right of publicity. Cloning for deceptive political purposes: specifically regulated or prohibited in many jurisdictions. Get legal review if you are building a commercial product using voice cloning of third parties.
What are the best open-source voice cloning tools?
Spark-TTS (Apache-2.0, 20+ languages, voice parameter control) and F5-TTS (MIT/CC-BY, strong English zero-shot quality) are the most capable current options. MaskGCT (Apache-2.0, 12+ languages) is close behind. xTTS v2 is the most battle-tested but is showing its age against 2025-2026 alternatives. See the open-source section above for details and licensing notes.
How does voice cloning pricing compare to standard TTS?
For managed APIs that include cloning (ElevenLabs, PlayHT, Cartesia, Resemble), synthesis from a cloned voice is billed at the same rate as standard synthesis - there is usually no per-character premium for using a cloned voice versus a library voice. The cost difference is in setup: instant cloning has no additional cost; professional fine-tuned voice training can cost thousands of dollars. See Text-to-Speech API Pricing Compared for full per-provider pricing details.
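The billing model above reduces to simple arithmetic: per-character synthesis plus an optional one-time training fee. The rates below are hypothetical placeholders, not any provider's real pricing:

```python
def synthesis_cost(characters: int, rate_per_million: float) -> float:
    """Synthesis cost in dollars, billed per character as managed TTS APIs do."""
    return characters / 1_000_000 * rate_per_million

def clone_project_cost(characters: int, rate_per_million: float,
                       one_time_training_fee: float = 0.0) -> float:
    """Total project cost: synthesis plus any one-time clone setup fee.

    Instant cloning typically has no setup fee; professional fine-tuned
    voices can carry a large one-time training charge.
    """
    return synthesis_cost(characters, rate_per_million) + one_time_training_fee

# Hypothetical: a 500K-character audiobook at $50 per 1M characters.
instant = clone_project_cost(500_000, 50.0)              # 25.0
professional = clone_project_cost(500_000, 50.0, 500.0)  # 525.0
```

The takeaway: at low volumes the training fee dominates a professional clone's cost, while at high volumes the per-character rate dominates and the two options converge.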
Which provider is best for multilingual voice cloning?
PlayHT 2.0 Turbo with 142 language support is the broadest commercial option. Camb.ai covers 100+ languages for dubbing pipelines. Spark-TTS (self-hosted, open-source) supports 20+ languages with good cross-lingual clone transfer. ElevenLabs covers 32 languages but with notably higher quality in each of them.
Does voice cloning require streaming support?
It depends on your use case. For voice agents and real-time conversations, streaming (low time-to-first-audio) is required - use Cartesia (80-90ms) or ElevenLabs (200-300ms). For batch content production (audiobooks, course narration, voiceovers), streaming is irrelevant - optimize for quality instead. See the AI Voice and Speech Leaderboard for latency benchmarks across models.
Sources:
- ElevenLabs Pricing
- Resemble AI Pricing
- WellSaid Labs Pricing
- Azure Custom Neural Voice - Microsoft Learn
- Descript Pricing
- PlayHT API Docs
- Cartesia Pricing
- Camb.ai
- coqui/XTTS-v2 on Hugging Face
- StyleTTS2 on GitHub
- OpenVoice on GitHub
- F5-TTS on GitHub
- MaskGCT in Amphion on GitHub
- Spark-TTS on GitHub
- Kokoro on GitHub
- VibeVoice on GitHub
- HeyGen Pricing
- Speechify Pricing
- Murf.ai Pricing
Also see: Best AI Voice Generators 2026, Text-to-Speech API Pricing Compared, Best AI Transcription Tools 2026, AI Voice and Speech Leaderboard.
✓ Last verified April 19, 2026
