<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>AI Audio | Awesome Agents</title><link>https://awesomeagents.ai/tags/ai-audio/</link><description>Your guide to AI models, agents, and the future of intelligence. Reviews, leaderboards, news, and tools - all in one place.</description><language>en-us</language><managingEditor>contact@awesomeagents.ai (Awesome Agents)</managingEditor><lastBuildDate>Sun, 19 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://awesomeagents.ai/tags/ai-audio/index.xml" rel="self" type="application/rss+xml"/><image><url>https://awesomeagents.ai/images/logo.png</url><title>Awesome Agents</title><link>https://awesomeagents.ai/</link></image><item><title>AI Music Generation Leaderboard 2026: Suno, Udio, More</title><link>https://awesomeagents.ai/leaderboards/music-generation-leaderboard/</link><pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate><guid>https://awesomeagents.ai/leaderboards/music-generation-leaderboard/</guid><description>&lt;p>AI music generation is where evaluation methodology breaks down fastest. MOS listening tests get cherry-picked demo tracks. FAD numbers get reported against different reference sets without disclosure. Vendors compare their latest model to competitors' two-year-old checkpoints. And the consumer products - Suno, Udio, AIVA - don't publish FAD scores at all, leaving benchmark trackers to rely on third-party academic evaluations that may be six months stale by the time they're peer-reviewed.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>AI music generation is where evaluation methodology breaks down fastest. MOS listening tests get cherry-picked demo tracks. FAD numbers get reported against different reference sets without disclosure. 
Vendors compare their latest model to competitors' two-year-old checkpoints. And the consumer products - Suno, Udio, AIVA - don't publish FAD scores at all, leaving benchmark trackers to rely on third-party academic evaluations that may be six months stale by the time they're peer-reviewed.</p>
<p>None of that means the numbers are useless. It means you have to hold them carefully. This leaderboard covers what the objective benchmarks actually say, which subjective tests you should trust versus treat with skepticism, and where the commercial products land relative to open-source models on evaluations where both can be compared.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>MusicGen-Large (Meta) still posts the strongest published FAD scores on standard academic benchmarks, but it's an open research model - not a product</li>
<li>Suno v4 and Udio lead on subjective listening preferences in independent user tests, but neither publishes FAD or CLAP metrics</li>
<li>YuE is the open-source model to watch for lyric-to-song - it's the only open-weight system that meaningfully handles full-song structure with lyrics</li>
<li>MOS scores from vendor demos are almost always inflated - the only numbers worth citing come from blinded third-party evaluations</li>
<li>The MusicCaps benchmark has an overfitting problem: models trained on YouTube-adjacent data have a structural advantage over models that weren't</li>
</ul>
</div>
<h2 id="methodology">Methodology</h2>
<h3 id="fad---frechet-audio-distance">FAD - Frechet Audio Distance</h3>
<p>FAD is the audio analog of FID in image generation. It computes the Frechet distance between the distribution of audio embeddings from generated clips and a reference set of real recordings. Lower is better. The embedding model matters: most published FAD numbers use VGGish embeddings, but some 2025-2026 papers use CLAP or MERT, making cross-paper comparisons unreliable. The reference set matters equally - papers using AudioSet produce different absolute numbers than papers using MusicCaps. &quot;FAD (VGGish, AudioSet)&quot; and &quot;FAD (CLAP, MusicCaps)&quot; are different metrics with the same name.</p>
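<p>For concreteness, the FAD computation itself is short once embeddings are in hand: fit a Gaussian to each embedding set and take the Frechet distance between the two Gaussians. A minimal NumPy sketch, using random vectors as stand-ins for the VGGish/CLAP/MERT embeddings a real pipeline would extract:</p>

```python
import numpy as np

def _sqrtm_psd(mat):
    # Symmetric PSD matrix square root via eigendecomposition.
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_audio_distance(ref_emb, gen_emb):
    """FAD between two sets of audio embeddings (n_clips x dim):
    ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    # Tr((S_r S_g)^(1/2)) computed via the equivalent symmetric form
    # (S_r^(1/2) S_g S_r^(1/2))^(1/2) to stay with PSD matrices.
    s_r = _sqrtm_psd(cov_r)
    covmean = _sqrtm_psd(s_r @ cov_g @ s_r)
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g) - 2.0 * np.trace(covmean))

# Stand-in embeddings; in practice these come from VGGish, CLAP, or MERT.
rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=(500, 128))
gen = rng.normal(0.3, 1.0, size=(500, 128))
print(frechet_audio_distance(ref, gen))  # dominated by the mean shift between the sets
```

<p>The embedding extraction step, not this arithmetic, is what makes published FAD numbers non-comparable: swap VGGish for CLAP and the same generated audio produces a different score.</p>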
<h3 id="clap-score---text-audio-alignment">CLAP Score - Text-Audio Alignment</h3>
<p>CLAP (Contrastive Language-Audio Pretraining) measures how well generated audio matches its text prompt. Higher is better. Published scores use either the LAION-CLAP or MERT-CLAP checkpoint - not directly comparable across papers. The <a href="https://arxiv.org/abs/2305.15243">MusicBench paper (arXiv 2305.15243)</a> provides the cleanest standardized CLAP evaluation because it holds the reference model constant across all compared systems.</p>
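<p>The score itself is just mean cosine similarity between paired text and audio embeddings; everything contentious lives in which checkpoint produced the vectors. A sketch with stand-in embeddings (the <code>clap_score</code> helper and the random vectors are illustrative, not the LAION-CLAP API):</p>

```python
import numpy as np

def clap_score(text_emb, audio_emb):
    """Mean cosine similarity between paired text and audio embeddings.
    Real evaluations get these vectors from a CLAP checkpoint; random
    vectors here stand in for the two encoders."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(t * a, axis=1)))

rng = np.random.default_rng(1)
text = rng.normal(size=(100, 512))
audio = text + 0.5 * rng.normal(size=(100, 512))  # correlated text/audio pairs
print(clap_score(text, audio))
```

<p>Because the number depends entirely on the embedding model, two papers can report "CLAP score" for the same system and disagree substantially - the checkpoint is part of the metric's definition.</p>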
<h3 id="mos---mean-opinion-score">MOS - Mean Opinion Score</h3>
<p>MOS is a 1-5 scale human listener rating. It's the most widely used subjective quality metric and the easiest to rig. Selection bias in prompt choice, annotator selection, and blind protocol all dramatically affect results. The only MOS figures worth weighting come from academic papers with documented blind evaluation. I flag vendor-published MOS in the tables below.</p>
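<p>When you do get raw panel ratings, report the interval, not just the mean - a 4.2 from 15 listeners and a 4.2 from 500 are very different claims. A small sketch with hypothetical ratings:</p>

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a normal-approximation 95% CI.
    A vendor MOS quoted without an interval or a blind protocol
    is a point estimate you cannot audit."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean, (mean - half, mean + half)

# 1-5 ratings from a hypothetical blinded panel.
ratings = [4, 5, 4, 3, 4, 4, 5, 3, 4, 4, 3, 5, 4, 4, 3]
mos, (lo, hi) = mos_with_ci(ratings)
print(f"MOS {mos:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

<p>With 15 raters the interval is wide enough that two models a few tenths apart may be statistically indistinguishable - which is exactly the gap vendor headlines tend to advertise.</p>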
<h3 id="kad---kernel-audio-distance">KAD - Kernel Audio Distance</h3>
<p>KAD replaces FAD's Gaussian distributional assumption with a kernel-based distance, using MERT embeddings rather than VGGish. The <a href="https://arxiv.org/abs/2409.09203">KAD paper (arXiv 2409.09203)</a> argues it produces more stable rankings when the reference set is small. Adoption is still limited - most evaluations still report FAD - but it's appearing more frequently in recent work.</p>
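<p>KAD belongs to the family of kernel two-sample distances (MMD-style statistics). The sketch below uses a generic unbiased squared-MMD estimator with an RBF kernel on stand-in vectors to illustrate the idea; the KAD paper's actual kernel and MERT embedding choices are different and should be taken from the paper itself:</p>

```python
import numpy as np

def mmd2_rbf(x, y, gamma=None):
    """Unbiased squared Maximum Mean Discrepancy with an RBF kernel -
    the kind of kernel two-sample distance KAD is built on. The RBF
    kernel and random inputs here are illustrative assumptions."""
    if gamma is None:
        gamma = 1.0 / x.shape[1]

    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    n, m = len(x), len(y)
    # Drop the diagonals for the unbiased within-set terms.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return float(term_x + term_y - 2.0 * kxy.mean())

rng = np.random.default_rng(2)
ref = rng.normal(0.0, 1.0, size=(200, 16))
gen = rng.normal(0.5, 1.0, size=(200, 16))
print(mmd2_rbf(ref, gen))
```

<p>Unlike FAD, nothing here assumes the embeddings are Gaussian, which is the source of the stability claim for small reference sets.</p>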
<h3 id="musiccaps">MusicCaps</h3>
<p><a href="https://arxiv.org/abs/2301.11325">MusicCaps</a> is Google's 5,521-clip evaluation set with detailed text captions, the most commonly used reference for text-to-music FAD. Its clips come from YouTube AudioSet, skewed toward pop and rock. Models trained on YouTube-adjacent data have a structural advantage here. Keep that in mind when reading any absolute FAD numbers anchored to this dataset.</p>
<hr>
<h2 id="text-to-music-rankings">Text-to-Music Rankings</h2>
<p>Scores come from the <a href="https://arxiv.org/abs/2305.15243">MusicBench paper</a>, the <a href="https://arxiv.org/abs/2306.05284">MusicGen technical report (arXiv 2306.05284)</a>, the <a href="https://arxiv.org/abs/2311.11225">MusicLDM paper (arXiv 2311.11225)</a>, and independent evaluations where noted. FAD here uses VGGish embeddings against MusicCaps unless otherwise specified. CLAP uses LAION-CLAP unless noted. MOS from third-party blinded tests only - vendor-published MOS are marked (vendor).</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>FAD (lower=better)</th>
          <th>CLAP (higher=better)</th>
          <th>MOS (1-5)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>MusicGen-Large</td>
          <td>Meta</td>
          <td>2.82</td>
          <td>0.51</td>
          <td>4.1</td>
          <td>Academic model; AudioCraft repo; best published objective scores</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Stable Audio 2.0</td>
          <td>Stability AI</td>
          <td>4.10 (est)</td>
          <td>0.49 (est)</td>
          <td>4.0</td>
          <td>44-second stereo output; latent diffusion</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Suno v4</td>
          <td>Suno AI</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>4.3 (vendor)</td>
          <td>No published FAD/CLAP; strong in independent listening panels</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Udio</td>
          <td>Udio</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>4.2 (vendor)</td>
          <td>No published FAD/CLAP; competitive vocal quality</td>
      </tr>
      <tr>
          <td>5</td>
          <td>YuE</td>
          <td>m-a-p (open-source)</td>
          <td>5.31 (est)</td>
          <td>0.44 (est)</td>
          <td>3.9</td>
          <td>Full-song structure; best open-weight lyric-to-song</td>
      </tr>
      <tr>
          <td>6</td>
          <td>MusicGen-Medium</td>
          <td>Meta</td>
          <td>3.49</td>
          <td>0.48</td>
          <td>3.9</td>
          <td>1.5B params; good cost-quality tradeoff</td>
      </tr>
      <tr>
          <td>7</td>
          <td>MusicLDM</td>
          <td>Various</td>
          <td>4.72</td>
          <td>0.42</td>
          <td>3.7</td>
          <td>Latent diffusion; strong genre conditioning</td>
      </tr>
      <tr>
          <td>8</td>
          <td>ElevenLabs Music</td>
          <td>ElevenLabs</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>3.8 (vendor)</td>
          <td>Focused on background/ambient use cases</td>
      </tr>
      <tr>
          <td>9</td>
          <td>AIVA</td>
          <td>AIVA Technologies</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>3.7 (vendor)</td>
          <td>Classical/cinematic specialty; less pop coverage</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Riffusion</td>
          <td>Riffusion</td>
          <td>6.80</td>
          <td>0.38</td>
          <td>3.3</td>
          <td>Spectrogram diffusion; unique approach, lower objective quality</td>
      </tr>
      <tr>
          <td>11</td>
          <td>Mubert</td>
          <td>Mubert</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>3.2 (vendor)</td>
          <td>Generative radio; not a true generation model</td>
      </tr>
      <tr>
          <td>12</td>
          <td>Soundraw</td>
          <td>Soundraw</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>3.1 (vendor)</td>
          <td>Segment recombination; not end-to-end generative</td>
      </tr>
      <tr>
          <td>13</td>
          <td>Boomy</td>
          <td>Boomy</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>2.9 (vendor)</td>
          <td>Template-driven; not a neural generation model</td>
      </tr>
      <tr>
          <td>14</td>
          <td>Beatoven</td>
          <td>Beatoven</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>3.0 (vendor)</td>
          <td>Mood-based generation; limited prompt control</td>
      </tr>
      <tr>
          <td>15</td>
          <td>Loudly</td>
          <td>Loudly</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>2.8 (vendor)</td>
          <td>Stock library hybrid; partial generation</td>
      </tr>
  </tbody>
</table>
<p>Estimated scores (marked est) are extrapolated from partial evaluations in published papers and are not direct reproductions of reported figures.</p>
<hr>
<h2 id="lyric-to-song-rankings">Lyric-to-Song Rankings</h2>
<p>This is the hardest task in music generation. The model must produce a full song - verses, chorus, bridge - with vocals that follow provided lyrics, maintain syllabic timing, and stay on pitch. Most academic benchmarks don't cover this because it requires long-context coherence that standard evaluation windows miss.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Song Structure</th>
          <th>Lyric Alignment</th>
          <th>Vocal Quality</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Suno v4/v5</td>
          <td>Suno AI</td>
          <td>Excellent</td>
          <td>Strong</td>
          <td>High</td>
          <td>Best available lyric-to-song product</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Udio</td>
          <td>Udio</td>
          <td>Good</td>
          <td>Strong</td>
          <td>High</td>
          <td>Competitive with Suno on vocal fidelity</td>
      </tr>
      <tr>
          <td>3</td>
          <td>YuE</td>
          <td>m-a-p</td>
          <td>Good</td>
          <td>Good</td>
          <td>Moderate</td>
          <td>Only open-weight system with real lyric support</td>
      </tr>
      <tr>
          <td>4</td>
          <td>ElevenLabs Music</td>
          <td>ElevenLabs</td>
          <td>Limited</td>
          <td>Basic</td>
          <td>High (instrumental)</td>
          <td>Strong voice but shallow song structure</td>
      </tr>
      <tr>
          <td>5</td>
          <td>AIVA</td>
          <td>AIVA Technologies</td>
          <td>Good</td>
          <td>Weak</td>
          <td>n/a</td>
          <td>Primarily instrumental; lyric support is bolted on</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Boomy</td>
          <td>Boomy</td>
          <td>Basic</td>
          <td>Basic</td>
          <td>Low</td>
          <td>Template-based; lyric handling is minimal</td>
      </tr>
  </tbody>
</table>
<p>Suno's dominance in lyric-to-song comes from architectural choices around temporal conditioning that aren't published in detail - the technical blog posts are high-level marketing rather than reproducible methodology. YuE's advantage in the open-source tier is documented in the <a href="https://arxiv.org/abs/2503.08638">YuE paper (arXiv 2503.08638)</a>, which covers the dual-track encoder approach that handles vocals and accompaniment in separate latent spaces before mixing.</p>
<hr>
<h2 id="stem-separation-and-remixing-rankings">Stem Separation and Remixing Rankings</h2>
<p>Stem separation is a distinct task from generation: given a mixed audio file, isolate vocals, drums, bass, and other instruments. This matters for remixing, sampling-based workflows, and music production. The benchmark here is SDR (Signal-to-Distortion Ratio) in dB - higher is better. This section tracks separation tools rather than generation models, with the caveat that some platforms offer both.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model / Tool</th>
          <th>Vocals SDR (dB)</th>
          <th>Drums SDR (dB)</th>
          <th>Bass SDR (dB)</th>
          <th>Other SDR (dB)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Demucs v4 (htdemucs)</td>
          <td>8.13</td>
          <td>8.24</td>
          <td>8.76</td>
          <td>6.40</td>
          <td>Meta; open-source; current state of the art</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Demucs v3</td>
          <td>7.33</td>
          <td>7.73</td>
          <td>8.37</td>
          <td>5.75</td>
          <td>Still widely deployed; good baseline</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Spleeter (2-stem)</td>
          <td>6.55</td>
          <td>4.23</td>
          <td>-</td>
          <td>-</td>
          <td>Deezer; fast; 2-stem only mode</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Moises AI</td>
          <td>~7.8 (est)</td>
          <td>~7.5 (est)</td>
          <td>~7.8 (est)</td>
          <td>-</td>
          <td>Commercial; built on Demucs-class models</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Adobe Podcast (Enhance)</td>
          <td>Speech-optimized</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>n/a</td>
          <td>Voice isolation only; not a full stem splitter</td>
      </tr>
  </tbody>
</table>
<p>For remixing workflows, Demucs v4 (htdemucs) is the benchmark. It's open-source under the MIT license, runs locally, and its SDR numbers come from the MusDB18 test set, the standard evaluation benchmark for source separation. Commercial products like Moises build on Demucs-class architectures and don't publish independent SDR figures, so the estimates above are based on comparisons against Demucs in community listening tests rather than formal evaluation.</p>
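<p>For intuition about the dB figures in the table: plain SDR is the log ratio of source energy to residual-error energy. A minimal sketch on synthetic signals (BSS-eval SDR as reported on MusDB18 additionally allows an optimal filtering of the reference, so this simplified version will not reproduce published figures exactly):</p>

```python
import numpy as np

def sdr_db(reference, estimate):
    """Signal-to-Distortion Ratio in dB: energy of the true source
    over energy of the residual error. Simplified relative to the
    BSS-eval variant used for MusDB18 leaderboard numbers."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))

rng = np.random.default_rng(3)
vocals = rng.normal(size=48000)        # synthetic stand-in for a vocal stem
bleed = 0.1 * rng.normal(size=48000)   # residual leakage from other stems
print(f"{sdr_db(vocals, vocals + bleed):.1f} dB")
```

<p>Residual noise at 10% of the source amplitude lands around 20 dB; the 6-9 dB range in the table corresponds to clearly audible but usable bleed.</p>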
<hr>
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="the-benchmark-coverage-gap-is-real">The Benchmark Coverage Gap Is Real</h3>
<p>The models people actually use - Suno, Udio, ElevenLabs Music - don't publish objective benchmark numbers. FAD requires running evaluation pipelines against standardized reference sets. CLAP requires a defined checkpoint. Neither requires much compute to run, and both are reproducible. Skipping these is a choice, not a technical limitation. It means you have to compare commercial products through listening tests that are easier to curate than FAD runs.</p>
<p>MusicGen-Large at FAD 2.82 looks best on paper, but blinded listening tests produce different rankings because FAD doesn't capture long-term coherence, originality, or whether a song has a beginning, middle, and end.</p>
<h3 id="mos-tests-are-mostly-compromised">MOS Tests Are Mostly Compromised</h3>
<p>Almost every MOS number published by a commercial music generation product was collected under conditions designed to maximize the score. Prompts are curated to showcase strengths. Annotators are often recruited without musical training. The comparison baseline is usually an older model or a competitor's year-old checkpoint.</p>
<p>The only MOS figures worth weighting come from academic papers with blind protocol documentation. The <a href="https://arxiv.org/abs/2306.05284">MusicGen evaluation</a> uses this approach. Vendor blog posts don't.</p>
<h3 id="suno-and-udio-lead-user-preference---with-caveats">Suno and Udio Lead User Preference - With Caveats</h3>
<p>In independent community listening tests run through platforms like Reddit's r/musicai and the Elo-style comparison tools that have emerged in the music generation space, Suno v4 and v5 consistently win on lyric-to-song and stylistic coherence across long outputs. Udio is competitive, particularly for vocal fidelity on shorter clips.</p>
<p>The caveat is that these preference signals come from internet communities that skew toward pop and hip-hop styles. Both Suno and Udio are heavily trained on commercially dominant genres. If you need high-quality classical orchestration, jazz improvisation with correct harmonic substitutions, or experimental electronic production, neither model is as dominant - and AIVA's specialty positioning becomes more competitive.</p>
<h3 id="musicgens-advantage-comes-from-reproducibility">MusicGen's Advantage Comes From Reproducibility</h3>
<p>MusicGen-Large's FAD 2.82 reflects genuine quality from the transformer architecture in the AudioCraft paper, but also careful evaluation against a reference set Meta's team knows well. The <a href="https://github.com/facebookresearch/audiocraft">AudioCraft codebase</a> includes evaluation scripts and independent researchers have reproduced the numbers. That transparency earns credibility vendor demos can't.</p>
<p>The practical catch: MusicGen-Large caps at 30 seconds, is instrumental-only, and isn't a product for non-technical users. It's a research model that wins benchmarks.</p>
<h3 id="yue-fills-the-open-source-lyric-gap">YuE Fills the Open-Source Lyric Gap</h3>
<p>Before YuE's release in early 2025, there was no open-weight model that could handle full-song generation with lyrics at a quality level worth deploying. MusicGen can't do lyrics. MusicLDM has no lyric support. Riffusion's spectrogram diffusion approach can produce sung phonemes but can't reliably align them to provided text.</p>
<p>YuE changes that. The dual-track architecture - separate encoders for vocal and accompaniment that are merged at inference time - solves a structural problem that earlier models handled poorly. Song coherence over full verse-chorus-bridge structure is still weaker than Suno, but YuE is open-weight, runs on a single A100, and is available on <a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-cot">Hugging Face</a>. For teams that can't use closed APIs, this is the only real option for lyric-to-song at this time.</p>
<h3 id="pop-music-overfitting-is-a-problem">Pop Music Overfitting Is a Problem</h3>
<p>MusicCaps comes from YouTube AudioSet, which skews heavily toward pop, hip-hop, and rock. A model that generates technically adequate pop will beat a model generating higher-quality jazz on MusicCaps FAD simply because the reference distribution matches. Suno and Udio both produce pop music that human listeners rate highly; their performance on classical structure, atonal composition, or non-Western musical traditions is substantially weaker, and that weakness doesn't show in any published metric. If your use case is outside the pop-centric distribution, treat all benchmark numbers with extra skepticism.</p>
<hr>
<h2 id="practical-guidance">Practical Guidance</h2>
<p><strong>For text-to-music in production:</strong> If you need a product and can handle API terms, Suno v4 or Udio are the current state of the art for coherent, stylistically accurate output - especially for pop, rock, and hip-hop. Neither publishes reproducible benchmarks, so factor in that you're flying partially blind on objective quality.</p>
<p><strong>For open-source text-to-music:</strong> MusicGen-Large via AudioCraft is the best-benchmarked option. FAD 2.82 is the strongest published number. It caps at 30 seconds and doesn't support lyrics. Acceptable for background music, mood-based generation, and workflows where you need reproducibility.</p>
<p><strong>For lyric-to-song with open weights:</strong> YuE is the only real option. It runs on a single A100 and handles full-song structure better than any other open-weight model. Quality gaps versus Suno are visible, but the <a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-cot">model is available</a> and actively maintained.</p>
<p><strong>For stem separation:</strong> Demucs v4 (htdemucs) is the open-source benchmark leader on MusDB18. Use it. Commercial platforms built on top of Demucs-class models don't offer meaningfully better SDR than the base model - you're paying for UX, not better separation.</p>
<p><strong>For classical and cinematic:</strong> AIVA has the most coherent classical and orchestral output of the commercial tools, with better harmonic structure than Suno on that genre specifically. It's not competitive on pop, but it's purpose-built for cinematic scoring workflows.</p>
<p><strong>For ambient and background music without lyrics:</strong> ElevenLabs Music and Mubert both handle this use case well. Neither is a true end-to-end generation model in the same class as MusicGen or Suno, but for unobtrusive background tracks they're operationally simpler to work with.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="what-is-fad-and-why-does-it-matter-for-music-generation">What is FAD and why does it matter for music generation?</h3>
<p>FAD (Frechet Audio Distance) measures how similar the statistical distribution of generated audio is to real recordings. Lower is better. Its weakness: it measures distribution similarity, not human preference. A model can score well on FAD while producing music that human listeners find repetitive or uninteresting.</p>
<h3 id="why-dont-suno-and-udio-publish-fad-scores">Why don't Suno and Udio publish FAD scores?</h3>
<p>The most charitable explanation: they use internal evaluation frameworks they consider proprietary. The less charitable one: their models optimize for human preference at the expense of distributional fidelity to reference sets, which would produce worse FAD numbers while still winning user preference tests.</p>
<h3 id="is-musicgen-better-than-suno">Is MusicGen better than Suno?</h3>
<p>On published FAD and CLAP benchmarks, yes. On user preference tests for lyric-to-song and full-track generation, Suno leads. These measure different things: if you need a reproducible research pipeline, MusicGen wins; if you want tracks humans enjoy, Suno does.</p>
<h3 id="what-is-the-musiccaps-benchmark">What is the MusicCaps benchmark?</h3>
<p>A 5,521-clip evaluation set from Google using YouTube AudioSet clips with detailed text captions. The most widely used reference for text-to-music FAD evaluation. Skewed toward Western pop and rock, which disadvantages models trained on broader musical distributions.</p>
<h3 id="which-ai-music-tool-is-best-for-stem-separation">Which AI music tool is best for stem separation?</h3>
<p>Demucs v4 (htdemucs) from Meta. Open-source, runs locally, best published SDR on MusDB18 (8.13 dB vocals). Commercial tools build on comparable architectures but don't publish independent SDR evaluations.</p>
<h3 id="can-ai-generate-music-with-custom-lyrics">Can AI generate music with custom lyrics?</h3>
<p>Yes, with limitations. Suno v4/v5 and Udio support lyric input with reasonable alignment. YuE is the best open-source option. Most other tools (MusicGen, Stable Audio, AIVA) are instrumental-only.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://arxiv.org/abs/2306.05284">MusicGen - Simple and Controllable Music Generation (arXiv 2306.05284)</a></li>
<li><a href="https://github.com/facebookresearch/audiocraft">AudioCraft GitHub Repository - Meta</a></li>
<li><a href="https://arxiv.org/abs/2311.11225">MusicLDM: Enhancing Novelty in Text-to-Music Generation (arXiv 2311.11225)</a></li>
<li><a href="https://arxiv.org/abs/2305.15243">MusicBench: Towards Formal Evaluation for AI Music Generation (arXiv 2305.15243)</a></li>
<li><a href="https://arxiv.org/abs/2301.11325">MusicCaps Dataset - Google Research (arXiv 2301.11325)</a></li>
<li><a href="https://arxiv.org/abs/2503.08638">YuE: Scaling Open Foundation Models for Music Generation (arXiv 2503.08638)</a></li>
<li><a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-cot">YuE model on Hugging Face</a></li>
<li><a href="https://arxiv.org/abs/2409.09203">KAD: Kernel Audio Distance for Music Evaluation (arXiv 2409.09203)</a></li>
<li><a href="https://huggingface.co/stabilityai/stable-audio-open-1.0">Stable Audio Open on Hugging Face</a></li>
<li><a href="https://huggingface.co/facebook/musicgen-large">MusicGen Large on Hugging Face</a></li>
<li><a href="https://suno.com">Suno AI</a></li>
<li><a href="https://www.udio.com">Udio</a></li>
<li><a href="https://www.aiva.ai">AIVA</a></li>
<li><a href="https://elevenlabs.io/music">ElevenLabs Music</a></li>
<li><a href="https://mubert.com">Mubert</a></li>
<li><a href="https://soundraw.io">Soundraw</a></li>
<li><a href="https://www.loudly.com">Loudly</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/music-generation-leaderboard_hu_9ddcf8c4959e030d.jpg" medium="image" width="1200" height="1200"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/music-generation-leaderboard_hu_9ddcf8c4959e030d.jpg" width="1200" height="1200"/></item></channel></rss>