<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>AI Audio | Awesome Agents</title><link>https://awesomeagents.ai/tags/ai-audio/</link><description>Your guide to AI models, agents, and the future of intelligence. Reviews, leaderboards, news, and tools - all in one place.</description><language>en-us</language><managingEditor>contact@awesomeagents.ai (Awesome Agents)</managingEditor><lastBuildDate>Sun, 19 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://awesomeagents.ai/tags/ai-audio/index.xml" rel="self" type="application/rss+xml"/><image><url>https://awesomeagents.ai/images/logo.png</url><title>Awesome Agents</title><link>https://awesomeagents.ai/</link></image><item><title>AI Music Generation Leaderboard 2026: Suno, Udio, More</title><link>https://awesomeagents.ai/leaderboards/music-generation-leaderboard/</link><pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate><guid>https://awesomeagents.ai/leaderboards/music-generation-leaderboard/</guid><description>&lt;p>AI music generation is where evaluation methodology breaks down fastest. MOS listening tests get cherry-picked demo tracks. FAD numbers get reported against different reference sets without disclosure. Vendors compare their latest model to competitors' two-year-old checkpoints. And the consumer products - Suno, Udio, AIVA - don't publish FAD scores at all, leaving benchmark trackers to rely on third-party academic evaluations that may be six months stale by the time they're peer-reviewed.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>AI music generation is where evaluation methodology breaks down fastest. MOS listening tests get cherry-picked demo tracks. FAD numbers get reported against different reference sets without disclosure. 
Vendors compare their latest model to competitors' two-year-old checkpoints. And the consumer products - Suno, Udio, AIVA - don't publish FAD scores at all, leaving benchmark trackers to rely on third-party academic evaluations that may be six months stale by the time they're peer-reviewed.</p>
<p>None of that means the numbers are useless. It means you have to hold them carefully. This leaderboard covers what the objective benchmarks actually say, which subjective tests you should trust versus treat with skepticism, and where the commercial products land relative to open-source models on evaluations where both can be compared.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>MusicGen-Large (Meta) still posts the strongest published FAD scores on standard academic benchmarks, but it's an open research model - not a product</li>
<li>Suno v4 and Udio lead on subjective listening preferences in independent user tests, but neither publishes FAD or CLAP metrics</li>
<li>YuE is the open-source model to watch for lyric-to-song - it's the only open-weight system that meaningfully handles full-song structure with lyrics</li>
<li>MOS scores from vendor demos are almost always inflated - the only numbers worth citing come from blinded third-party evaluations</li>
<li>The MusicCaps benchmark has an overfitting problem: models trained on YouTube-adjacent data have a structural advantage over models that weren't</li>
</ul>
</div>
<h2 id="methodology">Methodology</h2>
<h3 id="fad---frechet-audio-distance">FAD - Frechet Audio Distance</h3>
<p>FAD is the audio analog of FID in image generation. It computes the Frechet distance between the distribution of audio embeddings from generated clips and a reference set of real recordings. Lower is better. The embedding model matters: most published FAD numbers use VGGish embeddings, but some 2025-2026 papers use CLAP or MERT, making cross-paper comparisons unreliable. The reference set matters equally - papers using AudioSet produce different absolute numbers than papers using MusicCaps. &quot;FAD (VGGish, AudioSet)&quot; and &quot;FAD (CLAP, MusicCaps)&quot; are different metrics with the same name.</p>
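<p>For concreteness, the FAD computation itself is short once embeddings are in hand: fit a Gaussian to each embedding set and take the Frechet distance between the two Gaussians. A minimal NumPy sketch, using random vectors as stand-ins for the VGGish/CLAP/MERT embeddings a real pipeline would extract:</p>

```python
import numpy as np

def _sqrtm_psd(mat):
    # Symmetric PSD matrix square root via eigendecomposition.
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_audio_distance(ref_emb, gen_emb):
    """FAD between two sets of audio embeddings (n_clips x dim):
    ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    # Tr((S_r S_g)^(1/2)) computed via the equivalent symmetric form
    # (S_r^(1/2) S_g S_r^(1/2))^(1/2) to stay with PSD matrices.
    s_r = _sqrtm_psd(cov_r)
    covmean = _sqrtm_psd(s_r @ cov_g @ s_r)
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g) - 2.0 * np.trace(covmean))

# Stand-in embeddings; in practice these come from VGGish, CLAP, or MERT.
rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=(500, 128))
gen = rng.normal(0.3, 1.0, size=(500, 128))
print(frechet_audio_distance(ref, gen))  # dominated by the mean shift between the sets
```

<p>The embedding extraction step, not this arithmetic, is what makes published FAD numbers non-comparable: swap VGGish for CLAP and the same generated audio produces a different score.</p>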
<h3 id="clap-score---text-audio-alignment">CLAP Score - Text-Audio Alignment</h3>
<p>CLAP (Contrastive Language-Audio Pretraining) measures how well generated audio matches its text prompt. Higher is better. Published scores use either the LAION-CLAP or MERT-CLAP checkpoint - not directly comparable across papers. The <a href="https://arxiv.org/abs/2305.15243">MusicBench paper (arXiv 2305.15243)</a> provides the cleanest standardized CLAP evaluation because it holds the reference model constant across all compared systems.</p>
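<p>The score itself is just mean cosine similarity between paired text and audio embeddings; everything contentious lives in which checkpoint produced the vectors. A sketch with stand-in embeddings (the <code>clap_score</code> helper and the random vectors are illustrative, not the LAION-CLAP API):</p>

```python
import numpy as np

def clap_score(text_emb, audio_emb):
    """Mean cosine similarity between paired text and audio embeddings.
    Real evaluations get these vectors from a CLAP checkpoint; random
    vectors here stand in for the two encoders."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(t * a, axis=1)))

rng = np.random.default_rng(1)
text = rng.normal(size=(100, 512))
audio = text + 0.5 * rng.normal(size=(100, 512))  # correlated text/audio pairs
print(clap_score(text, audio))
```

<p>Because the number depends entirely on the embedding model, two papers can report "CLAP score" for the same system and disagree substantially - the checkpoint is part of the metric's definition.</p>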
<h3 id="mos---mean-opinion-score">MOS - Mean Opinion Score</h3>
<p>MOS is a 1-5 scale human listener rating. It's the most widely used subjective quality metric and the easiest to rig. Selection bias in prompt choice, annotator selection, and blind protocol all dramatically affect results. The only MOS figures worth weighting come from academic papers with documented blind evaluation. I flag vendor-published MOS in the tables below.</p>
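<p>When you do get raw panel ratings, report the interval, not just the mean - a 4.2 from 15 listeners and a 4.2 from 500 are very different claims. A small sketch with hypothetical ratings:</p>

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a normal-approximation 95% CI.
    A vendor MOS quoted without an interval or a blind protocol
    is a point estimate you cannot audit."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean, (mean - half, mean + half)

# 1-5 ratings from a hypothetical blinded panel.
ratings = [4, 5, 4, 3, 4, 4, 5, 3, 4, 4, 3, 5, 4, 4, 3]
mos, (lo, hi) = mos_with_ci(ratings)
print(f"MOS {mos:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

<p>With 15 raters the interval is wide enough that two models a few tenths apart may be statistically indistinguishable - which is exactly the gap vendor headlines tend to advertise.</p>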
<h3 id="kad---kernel-audio-distance">KAD - Kernel Audio Distance</h3>
<p>KAD replaces FAD's Gaussian distributional assumption with a kernel-based distance, using MERT embeddings rather than VGGish. The <a href="https://arxiv.org/abs/2409.09203">KAD paper (arXiv 2409.09203)</a> argues it produces more stable rankings when the reference set is small. Adoption is still limited - most evaluations still report FAD - but it's appearing more frequently in recent work.</p>
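<p>KAD belongs to the family of kernel two-sample distances (MMD-style statistics). The sketch below uses a generic unbiased squared-MMD estimator with an RBF kernel on stand-in vectors to illustrate the idea; the KAD paper's actual kernel and MERT embedding choices are different and should be taken from the paper itself:</p>

```python
import numpy as np

def mmd2_rbf(x, y, gamma=None):
    """Unbiased squared Maximum Mean Discrepancy with an RBF kernel -
    the kind of kernel two-sample distance KAD is built on. The RBF
    kernel and random inputs here are illustrative assumptions."""
    if gamma is None:
        gamma = 1.0 / x.shape[1]

    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    n, m = len(x), len(y)
    # Drop the diagonals for the unbiased within-set terms.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return float(term_x + term_y - 2.0 * kxy.mean())

rng = np.random.default_rng(2)
ref = rng.normal(0.0, 1.0, size=(200, 16))
gen = rng.normal(0.5, 1.0, size=(200, 16))
print(mmd2_rbf(ref, gen))
```

<p>Unlike FAD, nothing here assumes the embeddings are Gaussian, which is the source of the stability claim for small reference sets.</p>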
<h3 id="musiccaps">MusicCaps</h3>
<p><a href="https://arxiv.org/abs/2301.11325">MusicCaps</a> is Google's 5,521-clip evaluation set with detailed text captions, the most commonly used reference for text-to-music FAD. Its clips come from YouTube AudioSet, skewed toward pop and rock. Models trained on YouTube-adjacent data have a structural advantage here. Keep that in mind when reading any absolute FAD numbers anchored to this dataset.</p>
<hr>
<h2 id="text-to-music-rankings">Text-to-Music Rankings</h2>
<p>Scores come from the <a href="https://arxiv.org/abs/2305.15243">MusicBench paper</a>, the <a href="https://arxiv.org/abs/2306.05284">MusicGen technical report (arXiv 2306.05284)</a>, the <a href="https://arxiv.org/abs/2311.11225">MusicLDM paper (arXiv 2311.11225)</a>, and independent evaluations where noted. FAD here uses VGGish embeddings against MusicCaps unless otherwise specified. CLAP uses LAION-CLAP unless noted. MOS from third-party blinded tests only - vendor-published MOS are marked (vendor).</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>FAD (lower=better)</th>
          <th>CLAP (higher=better)</th>
          <th>MOS (1-5)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>MusicGen-Large</td>
          <td>Meta</td>
          <td>2.82</td>
          <td>0.51</td>
          <td>4.1</td>
          <td>Academic model; AudioCraft repo; best published objective scores</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Stable Audio 2.0</td>
          <td>Stability AI</td>
          <td>4.10 (est)</td>
          <td>0.49 (est)</td>
          <td>4.0</td>
          <td>44-second stereo output; latent diffusion</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Suno v4</td>
          <td>Suno AI</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>4.3 (vendor)</td>
          <td>No published FAD/CLAP; strong in independent listening panels</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Udio</td>
          <td>Udio</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>4.2 (vendor)</td>
          <td>No published FAD/CLAP; competitive vocal quality</td>
      </tr>
      <tr>
          <td>5</td>
          <td>YuE</td>
          <td>m-a-p (open-source)</td>
          <td>5.31 (est)</td>
          <td>0.44 (est)</td>
          <td>3.9</td>
          <td>Full-song structure; best open-weight lyric-to-song</td>
      </tr>
      <tr>
          <td>6</td>
          <td>MusicGen-Medium</td>
          <td>Meta</td>
          <td>3.49</td>
          <td>0.48</td>
          <td>3.9</td>
          <td>1.5B params; good cost-quality tradeoff</td>
      </tr>
      <tr>
          <td>7</td>
          <td>MusicLDM</td>
          <td>Various</td>
          <td>4.72</td>
          <td>0.42</td>
          <td>3.7</td>
          <td>Latent diffusion; strong genre conditioning</td>
      </tr>
      <tr>
          <td>8</td>
          <td>ElevenLabs Music</td>
          <td>ElevenLabs</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>3.8 (vendor)</td>
          <td>Focused on background/ambient use cases</td>
      </tr>
      <tr>
          <td>9</td>
          <td>AIVA</td>
          <td>AIVA Technologies</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>3.7 (vendor)</td>
          <td>Classical/cinematic specialty; less pop coverage</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Riffusion</td>
          <td>Riffusion</td>
          <td>6.80</td>
          <td>0.38</td>
          <td>3.3</td>
          <td>Spectrogram diffusion; unique approach, lower objective quality</td>
      </tr>
      <tr>
          <td>11</td>
          <td>Mubert</td>
          <td>Mubert</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>3.2 (vendor)</td>
          <td>Generative radio; not a true generation model</td>
      </tr>
      <tr>
          <td>12</td>
          <td>Soundraw</td>
          <td>Soundraw</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>3.1 (vendor)</td>
          <td>Segment recombination; not end-to-end generative</td>
      </tr>
      <tr>
          <td>13</td>
          <td>Boomy</td>
          <td>Boomy</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>2.9 (vendor)</td>
          <td>Template-driven; not a neural generation model</td>
      </tr>
      <tr>
          <td>14</td>
          <td>Beatoven</td>
          <td>Beatoven</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>3.0 (vendor)</td>
          <td>Mood-based generation; limited prompt control</td>
      </tr>
      <tr>
          <td>15</td>
          <td>Loudly</td>
          <td>Loudly</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>2.8 (vendor)</td>
          <td>Stock library hybrid; partial generation</td>
      </tr>
  </tbody>
</table>
<p>Estimated scores (marked est) are extrapolated from partial evaluations in published papers and are not direct reproductions of reported figures.</p>
<hr>
<h2 id="lyric-to-song-rankings">Lyric-to-Song Rankings</h2>
<p>This is the hardest task in music generation. The model must produce a full song - verses, chorus, bridge - with vocals that follow provided lyrics, maintain syllabic timing, and stay on pitch. Most academic benchmarks don't cover this because it requires long-context coherence that standard evaluation windows miss.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Song Structure</th>
          <th>Lyric Alignment</th>
          <th>Vocal Quality</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Suno v4/v5</td>
          <td>Suno AI</td>
          <td>Excellent</td>
          <td>Strong</td>
          <td>High</td>
          <td>Best available lyric-to-song product</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Udio</td>
          <td>Udio</td>
          <td>Good</td>
          <td>Strong</td>
          <td>High</td>
          <td>Competitive with Suno on vocal fidelity</td>
      </tr>
      <tr>
          <td>3</td>
          <td>YuE</td>
          <td>m-a-p</td>
          <td>Good</td>
          <td>Good</td>
          <td>Moderate</td>
          <td>Only open-weight system with real lyric support</td>
      </tr>
      <tr>
          <td>4</td>
          <td>ElevenLabs Music</td>
          <td>ElevenLabs</td>
          <td>Limited</td>
          <td>Basic</td>
          <td>High (instrumental)</td>
          <td>Strong voice but shallow song structure</td>
      </tr>
      <tr>
          <td>5</td>
          <td>AIVA</td>
          <td>AIVA Technologies</td>
          <td>Good</td>
          <td>Weak</td>
          <td>n/a</td>
          <td>Primarily instrumental; lyric support is bolted on</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Boomy</td>
          <td>Boomy</td>
          <td>Basic</td>
          <td>Basic</td>
          <td>Low</td>
          <td>Template-based; lyric handling is minimal</td>
      </tr>
  </tbody>
</table>
<p>Suno's dominance in lyric-to-song comes from architectural choices around temporal conditioning that aren't published in detail - the technical blog posts are high-level marketing rather than reproducible methodology. YuE's advantage in the open-source tier is documented in the <a href="https://arxiv.org/abs/2503.08638">YuE paper (arXiv 2503.08638)</a>, which covers the dual-track encoder approach that handles vocals and accompaniment in separate latent spaces before mixing.</p>
<hr>
<h2 id="stem-separation-and-remixing-rankings">Stem Separation and Remixing Rankings</h2>
<p>Stem separation is a distinct task from generation: given a mixed audio file, isolate vocals, drums, bass, and other instruments. This matters for remixing, sampling-based workflows, and music production. The benchmark here is SDR (Signal-to-Distortion Ratio) in dB - higher is better. This section tracks separation tools rather than generation models, with the caveat that some platforms offer both.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model / Tool</th>
          <th>Vocals SDR (dB)</th>
          <th>Drums SDR (dB)</th>
          <th>Bass SDR (dB)</th>
          <th>Other SDR (dB)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Demucs v4 (htdemucs)</td>
          <td>8.13</td>
          <td>8.24</td>
          <td>8.76</td>
          <td>6.40</td>
          <td>Meta; open-source; current state of the art</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Demucs v3</td>
          <td>7.33</td>
          <td>7.73</td>
          <td>8.37</td>
          <td>5.75</td>
          <td>Still widely deployed; good baseline</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Spleeter (2-stem)</td>
          <td>6.55</td>
          <td>4.23</td>
          <td>-</td>
          <td>-</td>
          <td>Deezer; fast; 2-stem only mode</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Moises AI</td>
          <td>~7.8 (est)</td>
          <td>~7.5 (est)</td>
          <td>~7.8 (est)</td>
          <td>-</td>
          <td>Commercial; built on Demucs-class models</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Adobe Podcast (Enhance)</td>
          <td>Speech-optimized</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>Voice isolation only; not a full stem splitter</td>
      </tr>
  </tbody>
</table>
<p>For remixing workflows, Demucs v4 (htdemucs) is the benchmark. It's open-source under the MIT license, runs locally, and its SDR numbers come from the MusDB18 test set, the standard evaluation benchmark for source separation. Commercial products like Moises build on Demucs-class architectures and don't publish independent SDR figures, so the estimates above are based on comparisons against Demucs in community listening tests rather than formal evaluation.</p>
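<p>For intuition about the dB figures in the table: plain SDR is the log ratio of source energy to residual-error energy. A minimal sketch on synthetic signals (BSS-eval SDR as reported on MusDB18 additionally allows an optimal filtering of the reference, so this simplified version will not reproduce published figures exactly):</p>

```python
import numpy as np

def sdr_db(reference, estimate):
    """Signal-to-Distortion Ratio in dB: energy of the true source
    over energy of the residual error. Simplified relative to the
    BSS-eval variant used for MusDB18 leaderboard numbers."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))

rng = np.random.default_rng(3)
vocals = rng.normal(size=48000)        # synthetic stand-in for a vocal stem
bleed = 0.1 * rng.normal(size=48000)   # residual leakage from other stems
print(f"{sdr_db(vocals, vocals + bleed):.1f} dB")
```

<p>Residual noise at 10% of the source amplitude lands around 20 dB; the 6-9 dB range in the table corresponds to clearly audible but usable bleed.</p>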
<hr>
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="the-benchmark-coverage-gap-is-real">The Benchmark Coverage Gap Is Real</h3>
<p>The models people actually use - Suno, Udio, ElevenLabs Music - don't publish objective benchmark numbers. FAD requires running evaluation pipelines against standardized reference sets. CLAP requires a defined checkpoint. Neither requires much compute to run, and both are reproducible. Skipping these is a choice, not a technical limitation. It means you have to compare commercial products through listening tests that are easier to curate than FAD runs.</p>
<p>MusicGen-Large at FAD 2.82 looks best on paper, but blinded listening tests produce different rankings because FAD doesn't capture long-term coherence, originality, or whether a song has a beginning, middle, and end.</p>
<h3 id="mos-tests-are-mostly-compromised">MOS Tests Are Mostly Compromised</h3>
<p>Almost every MOS number published by a commercial music generation product was collected under conditions designed to maximize the score. Prompts are curated to showcase strengths. Annotators are often recruited without musical training. The comparison baseline is usually an older model or a competitor's year-old checkpoint.</p>
<p>The only MOS figures worth weighting come from academic papers with blind protocol documentation. The <a href="https://arxiv.org/abs/2306.05284">MusicGen evaluation</a> uses this approach. Vendor blog posts don't.</p>
<h3 id="suno-and-udio-lead-user-preference---with-caveats">Suno and Udio Lead User Preference - With Caveats</h3>
<p>In independent community listening tests run through platforms like Reddit's r/musicai and the Elo-style comparison tools that have emerged in the music generation space, Suno v4 and v5 consistently win on lyric-to-song and stylistic coherence across long outputs. Udio is competitive, particularly for vocal fidelity on shorter clips.</p>
<p>The caveat is that these preference signals come from internet communities that skew toward pop and hip-hop styles. Both Suno and Udio are heavily trained on commercially dominant genres. If you need high-quality classical orchestration, jazz improvisation with correct harmonic substitutions, or experimental electronic production, neither model is as dominant - and AIVA's specialty positioning becomes more competitive.</p>
<h3 id="musicgens-advantage-comes-from-reproducibility">MusicGen's Advantage Comes From Reproducibility</h3>
<p>MusicGen-Large's FAD 2.82 reflects genuine quality from the transformer architecture in the AudioCraft paper, but also careful evaluation against a reference set Meta's team knows well. The <a href="https://github.com/facebookresearch/audiocraft">AudioCraft codebase</a> includes evaluation scripts and independent researchers have reproduced the numbers. That transparency earns credibility vendor demos can't.</p>
<p>The practical catch: MusicGen-Large caps at 30 seconds, is instrumental-only, and isn't a product for non-technical users. It's a research model that wins benchmarks.</p>
<h3 id="yue-fills-the-open-source-lyric-gap">YuE Fills the Open-Source Lyric Gap</h3>
<p>Before YuE's release in early 2025, there was no open-weight model that could handle full-song generation with lyrics at a quality level worth deploying. MusicGen can't do lyrics. MusicLDM has no lyric support. Riffusion's spectrogram diffusion approach can produce sung phonemes but can't reliably align them to provided text.</p>
<p>YuE changes that. The dual-track architecture - separate encoders for vocal and accompaniment that are merged at inference time - solves a structural problem that earlier models handled poorly. Song coherence over full verse-chorus-bridge structure is still weaker than Suno, but YuE is open-weight, runs on a single A100, and is available on <a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-cot">Hugging Face</a>. For teams that can't use closed APIs, this is the only real option for lyric-to-song at this time.</p>
<h3 id="pop-music-overfitting-is-a-problem">Pop Music Overfitting Is a Problem</h3>
<p>MusicCaps comes from YouTube AudioSet, which skews heavily toward pop, hip-hop, and rock. A model that generates technically adequate pop will beat a model generating higher-quality jazz on MusicCaps FAD simply because the reference distribution matches. Suno and Udio both produce pop music that human listeners rate highly; their performance on classical structure, atonal composition, or non-Western musical traditions is substantially weaker, and that weakness doesn't show in any published metric. If your use case is outside the pop-centric distribution, treat all benchmark numbers with extra skepticism.</p>
<hr>
<h2 id="practical-guidance">Practical Guidance</h2>
<p><strong>For text-to-music in production:</strong> If you need a product and can handle API terms, Suno v4 or Udio are the current state of the art for coherent, stylistically accurate output - especially for pop, rock, and hip-hop. Neither publishes reproducible benchmarks, so factor in that you're flying partially blind on objective quality.</p>
<p><strong>For open-source text-to-music:</strong> MusicGen-Large via AudioCraft is the best-benchmarked option. FAD 2.82 is the strongest published number. It caps at 30 seconds and doesn't support lyrics. Acceptable for background music, mood-based generation, and workflows where you need reproducibility.</p>
<p><strong>For lyric-to-song with open weights:</strong> YuE is the only real option. It runs on a single A100 and handles full-song structure better than any other open-weight model. Quality gaps versus Suno are visible, but the <a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-cot">model is available</a> and actively maintained.</p>
<p><strong>For stem separation:</strong> Demucs v4 (htdemucs) is the open-source benchmark leader on MusDB18. Use it. Commercial platforms built on top of Demucs-class models don't offer meaningfully better SDR than the base model - you're paying for UX, not better separation.</p>
<p><strong>For classical and cinematic:</strong> AIVA has the most coherent classical and orchestral output of the commercial tools, with better harmonic structure than Suno on that genre specifically. It's not competitive on pop, but it's purpose-built for cinematic scoring workflows.</p>
<p><strong>For ambient and background music without lyrics:</strong> ElevenLabs Music and Mubert both handle this use case well. Neither is a true end-to-end generation model in the same class as MusicGen or Suno, but for unobtrusive background tracks they're operationally simpler to work with.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="what-is-fad-and-why-does-it-matter-for-music-generation">What is FAD and why does it matter for music generation?</h3>
<p>FAD (Frechet Audio Distance) measures how similar the statistical distribution of generated audio is to real recordings. Lower is better. Its weakness: it measures distribution similarity, not human preference. A model can score well on FAD while producing music that human listeners find repetitive or uninteresting.</p>
<h3 id="why-dont-suno-and-udio-publish-fad-scores">Why don't Suno and Udio publish FAD scores?</h3>
<p>The most charitable explanation: they use internal evaluation frameworks they consider proprietary. The less charitable one: their models optimize for human preference at the expense of distributional fidelity to reference sets, which would produce worse FAD numbers while still winning user preference tests.</p>
<h3 id="is-musicgen-better-than-suno">Is MusicGen better than Suno?</h3>
<p>On published FAD and CLAP benchmarks, yes. On user preference tests for lyric-to-song and full-track generation, Suno leads. These measure different things: if you need a reproducible research pipeline, MusicGen wins; if you want tracks humans enjoy, Suno does.</p>
<h3 id="what-is-the-musiccaps-benchmark">What is the MusicCaps benchmark?</h3>
<p>A 5,521-clip evaluation set from Google using YouTube AudioSet clips with detailed text captions. The most widely used reference for text-to-music FAD evaluation. Skewed toward Western pop and rock, which disadvantages models trained on broader musical distributions.</p>
<h3 id="which-ai-music-tool-is-best-for-stem-separation">Which AI music tool is best for stem separation?</h3>
<p>Demucs v4 (htdemucs) from Meta. Open-source, runs locally, best published SDR on MusDB18 (8.13 dB vocals). Commercial tools build on comparable architectures but don't publish independent SDR evaluations.</p>
<h3 id="can-ai-generate-music-with-custom-lyrics">Can AI generate music with custom lyrics?</h3>
<p>Yes, with limitations. Suno v4/v5 and Udio support lyric input with reasonable alignment. YuE is the best open-source option. Most other tools (MusicGen, Stable Audio, AIVA) are instrumental-only.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://arxiv.org/abs/2306.05284">MusicGen - Simple and Controllable Music Generation (arXiv 2306.05284)</a></li>
<li><a href="https://github.com/facebookresearch/audiocraft">AudioCraft GitHub Repository - Meta</a></li>
<li><a href="https://arxiv.org/abs/2311.11225">MusicLDM: Enhancing Novelty in Text-to-Music Generation (arXiv 2311.11225)</a></li>
<li><a href="https://arxiv.org/abs/2305.15243">MusicBench: Towards Formal Evaluation for AI Music Generation (arXiv 2305.15243)</a></li>
<li><a href="https://arxiv.org/abs/2301.11325">MusicCaps Dataset - Google Research (arXiv 2301.11325)</a></li>
<li><a href="https://arxiv.org/abs/2503.08638">YuE: Scaling Open Foundation Models for Music Generation (arXiv 2503.08638)</a></li>
<li><a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-cot">YuE model on Hugging Face</a></li>
<li><a href="https://arxiv.org/abs/2409.09203">KAD: Kernel Audio Distance for Music Evaluation (arXiv 2409.09203)</a></li>
<li><a href="https://huggingface.co/stabilityai/stable-audio-open-1.0">Stable Audio Open on Hugging Face</a></li>
<li><a href="https://huggingface.co/facebook/musicgen-large">MusicGen Large on Hugging Face</a></li>
<li><a href="https://suno.com">Suno AI</a></li>
<li><a href="https://www.udio.com">Udio</a></li>
<li><a href="https://www.aiva.ai">AIVA</a></li>
<li><a href="https://elevenlabs.io/music">ElevenLabs Music</a></li>
<li><a href="https://mubert.com">Mubert</a></li>
<li><a href="https://soundraw.io">Soundraw</a></li>
<li><a href="https://www.loudly.com">Loudly</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/music-generation-leaderboard_hu_9ddcf8c4959e030d.jpg" medium="image" width="1200" height="1200"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/music-generation-leaderboard_hu_9ddcf8c4959e030d.jpg" width="1200" height="1200"/></item></channel></rss>