Stable Audio 3.0 Ships Open Weights, 6-Min Songs

Stability AI shipped Stable Audio 3.0 today as a four-model family, doubling the maximum track length to six minutes and twenty seconds and making three of the four variants available as open weights on Hugging Face. The release targets a meaningful gap in the market: most competing audio models are closed-source, and several face active copyright litigation over training data.

TL;DR

Four models released: Small-SFX (459M params), Small (459M), Medium (1.4B), Large (2.7B)
Max duration jumps from roughly 3 minutes to 6 minutes 20 seconds for Medium and Large
Three variants - Small-SFX, Small, and Medium - ship as open weights under the Stability AI Community License
Training data is 1.278 million fully licensed recordings from AudioSparx and Freesound, not from the major labels
New SAME autoencoder and elimination of classifier-free guidance at inference are the core architectural changes

Model	Parameters	Max Duration	Inference Time (H200)	Access
Small-SFX	459M diffusion + 108M autoencoder	2 minutes	0.44s	Open weights
Small	459M diffusion + 108M autoencoder	2 minutes	0.44s	Open weights
Medium	1.4B diffusion + 852M autoencoder	6m 20s	1.31s	Open weights
Large	2.7B diffusion + 852M autoencoder	6m 20s	1.80s	API / Enterprise

Four Models for Different Use Cases

The Small Tier

Small and Small-SFX are positioned as on-device models. At 459M diffusion parameters with the lighter SAME-S autoencoder (108M parameters), they're designed to run on consumer hardware - phones, MacBook M4 class machines - and create audio in under half a second on server-grade silicon. The SFX variant adds a sound effects conditioning path; the standard Small handles music. Both cap at two minutes.

For game developers, podcast producers, or app builders who need fast local generation without a server round-trip, these fill a slot that no prior Stability AI audio model occupied. The open weights under the Community License mean commercial use is permitted for organizations under $1 million in annual revenue.

Medium and Large - Full Tracks

Medium (1.4B diffusion parameters, 852M SAME-L autoencoder) and Large (2.7B diffusion, same SAME-L autoencoder) both hit 6 minutes 20 seconds - more than double what Stable Audio 2.0 could produce. Stability says both maintain melodic and structural consistency across that length. The inference numbers are striking: Large takes 1.80 seconds to generate a 6:20 track on an H200, and Medium is listed at 1.31 seconds.
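To put those latencies in perspective, here is a minimal sketch that converts them into a real-time factor (seconds of audio per second of compute). The figures come from the comparison table; pairing each inference time with the model's maximum duration is an assumption for every row except Large, which is the only pairing the release states explicitly. The function name is illustrative.

```python
# Figures copied from the comparison table above.
# Assumption: each inference time corresponds to generating a clip of the
# model's maximum duration (stated explicitly only for Large).
MODEL_FIGURES = {
    "Small-SFX": (120.0, 0.44),   # (max duration in s, inference time in s)
    "Small": (120.0, 0.44),
    "Medium": (380.0, 1.31),      # 6m 20s = 380 s
    "Large": (380.0, 1.80),
}


def realtime_factor(audio_seconds: float, inference_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    return audio_seconds / inference_seconds


for name, (duration, latency) in MODEL_FIGURES.items():
    print(f"{name:10s} {realtime_factor(duration, latency):6.0f}x real time")
```

Under that assumption, Large works out to roughly 211 times faster than real time on an H200, and Medium to roughly 290 times.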

Medium ships as open weights with the small models. Large is API-only via platform.stability.ai, with enterprise self-hosting requiring a commercial license for organizations above the $1M revenue threshold.

Image: The stableaudio.com interface for producing tracks with Stable Audio 3.0. Source: stableaudio.com

Architecture - What Changed From 2.0

SAME: A New Autoencoder

The most substantial architectural change is the SAME encoder, short for Semantic-Acoustic Music Encoder. Two variants exist: SAME-S (108M parameters, used in the Small models) and SAME-L (852M parameters, used in Medium and Large). Both use a 4096x downsampling ratio and produce 256-dimensional latent embeddings at roughly 10.76 Hz for 44.1kHz stereo input.
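The latent rate follows directly from the sample rate and the downsampling ratio. The sketch below reproduces that arithmetic and estimates how many latent frames a clip occupies. Rounding up to a whole frame is an assumption about padding, and the function names are illustrative rather than taken from Stability's code.

```python
import math

SAMPLE_RATE_HZ = 44_100   # 44.1 kHz input, per the release
DOWNSAMPLING = 4096       # SAME downsampling ratio
LATENT_DIM = 256          # size of each latent embedding


def latent_rate_hz(sample_rate: int = SAMPLE_RATE_HZ,
                   ratio: int = DOWNSAMPLING) -> float:
    """Latent frames per second of audio."""
    return sample_rate / ratio


def latent_frames(seconds: float,
                  sample_rate: int = SAMPLE_RATE_HZ,
                  ratio: int = DOWNSAMPLING) -> int:
    """Latent frames needed for a clip (assumes the tail is padded up)."""
    return math.ceil(seconds * sample_rate / ratio)


print(f"latent rate: {latent_rate_hz():.2f} Hz")
print(f"2:00 clip:   {latent_frames(120)} frames")
print(f"6:20 clip:   {latent_frames(380)} frames")
```

A full 6:20 track comes to about 4,092 latent frames of 256 values each, which is the sequence the diffusion model has to keep coherent.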

The important part is what SAME is trained to encode. Standard autoencoders for audio optimize for acoustic reconstruction fidelity. SAME is explicitly trained to capture semantic structure - phrasing, harmonic movement, rhythmic patterns - not just the acoustic signal. That shift is what allows Medium and Large to maintain coherence over six-plus minutes instead of drifting into mush past the three-minute mark.

No CFG at Inference

Stable Audio 2.0, like most diffusion models, required classifier-free guidance at inference: two forward passes through the model per denoising step. Stable Audio 3.0 eliminates that. The team baked guidance into the model through a distillation warmup stage, where a student model learns to match the CFG-enhanced outputs of a teacher, internalizing quality without needing the extra pass.

The practical effect is roughly a 2x inference speedup at quality parity, since each denoising step now needs one forward pass instead of two. The 1.80-second figure for a 6:20 track on an H200 reflects that.
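A toy sketch of why removing CFG halves the work: standard classifier-free guidance combines a conditional and an unconditional prediction at every step, while a distilled student is trained to produce the combined result in one pass. The guidance formula is the standard one. The mean-squared-error objective, the step count, and the guidance scale are assumptions for illustration, not details from the Stable Audio 3.0 paper.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def forward_passes(steps, uses_cfg):
    """Model evaluations for one generation: two per step with CFG, else one."""
    return steps * (2 if uses_cfg else 1)


def distillation_loss(student_out, teacher_cond, teacher_uncond, scale):
    """Assumed objective: MSE between student output and the teacher's CFG output."""
    target = cfg_combine(teacher_cond, teacher_uncond, scale)
    return sum((s - t) ** 2 for s, t in zip(student_out, target)) / len(target)


STEPS = 8  # hypothetical sampler step count
print("with CFG:   ", forward_passes(STEPS, uses_cfg=True), "forward passes")
print("without CFG:", forward_passes(STEPS, uses_cfg=False), "forward passes")
```

Whatever the real step count is, the ratio between the two cases stays at two, which matches the reported speedup.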

Inpainting

Stable Audio 2.0 could only generate from scratch. The 3.0 family adds inpainting: pass a mask specifying which segments to preserve, and the model regenerates only the unmasked regions. Stability trained with three mask types - full mask (80% of training examples), random segments (10%), and causal continuation (10%). This enables single-segment edits, multi-segment edits, and extending existing tracks.
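The 80/10/10 split can be pictured as a per-example mask sampler. The sketch below is a guess at the shape of such a sampler, not Stability's implementation: the mask polarity (True means "generate this frame"), the reading of "full mask" as generating everything, and the segment counts and lengths are all assumptions.

```python
import random

MASK_TYPES = ("full", "random_segments", "causal")
MASK_WEIGHTS = (0.8, 0.1, 0.1)  # training proportions reported by Stability


def sample_mask(num_frames, rng):
    """Return (kind, mask). Assumed polarity: True = generate, False = keep.

    Requires num_frames >= 2 so a causal split always keeps some context.
    """
    kind = rng.choices(MASK_TYPES, weights=MASK_WEIGHTS, k=1)[0]
    if kind == "full":
        # Assumed to mean plain generation: every frame is produced.
        mask = [True] * num_frames
    elif kind == "causal":
        # Keep a prefix as context and generate the continuation.
        split = rng.randrange(1, num_frames)
        mask = [False] * split + [True] * (num_frames - split)
    else:
        # Regenerate a few random segments (counts and lengths are made up).
        mask = [False] * num_frames
        for _ in range(rng.randint(1, 3)):
            start = rng.randrange(num_frames)
            length = rng.randint(1, max(1, num_frames // 4))
            for i in range(start, min(num_frames, start + length)):
                mask[i] = True
    return kind, mask


rng = random.Random(0)
kind, mask = sample_mask(16, rng)
print(kind, "".join("#" if m else "." for m in mask))
```

The causal case corresponds to track extension, and the random-segment case to single- or multi-segment edits.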

The Data Question

The training dataset is 1,278,902 audio recordings. AudioSparx contributed 806,284 fully licensed tracks; Freesound provided the remaining 472,618 under Creative Commons licenses (CC-0, CC-BY, CC-Sampling+).

This matters for competitive positioning. Suno and Udio are both defendants in copyright litigation from major labels over their training data. Stability AI signed partnership agreements with Warner Music Group (November 2025) and Universal Music Group in late 2025. Neither label's catalog appears in the Stable Audio 3.0 training dataset - those deals are described as future tool development partnerships, not retroactive licensing for this model family.

Stability is explicitly positioning that distinction as an enterprise sales point. The Large model's enterprise license includes legal indemnification for commercial outputs.

Image: A professional audio mixing console in a recording studio. The enterprise pitch: licensed training data is the differentiator Suno and Udio can't currently offer. Source: unsplash.com

Open Weights vs. the API Wall

The open-weight strategy here is deliberate and calibrated. Putting Small-SFX, Small, and Medium on Hugging Face (Medium lives at stabilityai/stable-audio-3-medium) gives developers a free, capable model to build against. The llama.cpp community has already shown appetite for running audio models locally, and Medium at 1.4B parameters is small enough to fit on consumer GPUs. Tencent's Covo-Audio and the broader field of open audio models give developers options, but none of the competitors offer 6-minute generation with inpainting as open weights.
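A back-of-the-envelope check on the consumer-GPU claim: the sketch below estimates the memory needed just to hold the weights, using the parameter counts from the comparison table. It assumes the weights can be loaded at the listed precisions, and it ignores the text encoder, activations, and framework overhead, so real usage will be higher.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# (diffusion params, autoencoder params) from the comparison table
PARAM_COUNTS = {
    "Small": (459e6, 108e6),
    "Medium": (1.4e9, 852e6),
    "Large": (2.7e9, 852e6),
}


def weight_gib(diffusion_params, autoencoder_params, dtype="fp16"):
    """GiB needed to store the weights alone at the given precision."""
    total_params = diffusion_params + autoencoder_params
    return total_params * BYTES_PER_PARAM[dtype] / 2**30


for name, (diffusion, autoencoder) in PARAM_COUNTS.items():
    sizes = ", ".join(
        f"{dtype}: {weight_gib(diffusion, autoencoder, dtype):.1f} GiB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name:7s} {sizes}")
```

At 16-bit precision, Medium's diffusion model plus autoencoder come to a little over 4 GiB of weights, which leaves headroom on an 8 GB card before activations are counted.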

Large stays closed. That's where enterprise revenue lives: film scores, ad production, TV episodic music - use cases that need the full context window, need indemnification, and need SLAs that a Hugging Face download can't provide.

What It Does Not Tell You

Stability AI made no direct benchmark comparisons to Suno, Udio, Google Lyria, or ElevenLabs in the official release materials. The arXiv paper (arXiv:2605.17991) uses Frechet Audio Distance and CLAP score as evaluation metrics, but the specific numeric results comparing to named competitors weren't included in publicly available summaries. The "6:20 with structural coherence" claim rests on Stability's own testing.

The WMG and UMG partnership announcements created an impression that the training data includes major-label music. It doesn't. Those deals fund future tool development; this model was trained on AudioSparx and Freesound. That distinction is buried in the paper, not the press release.

There are also no public quotes from CEO Prem Akkaraju or the newly hired professional music lead Ethan Kaplan (ex-Universal Audio, ex-Fender). The official blog post credits product writer Louisa Marshall. What the executive team actually believes this model does to their competitive position - versus Suno at its current Elo ratings - remains unstated.

The four-tier model family with open weights at Medium and below is a credible open-weights play in a market where most competitors are closed and several are being sued. Whether the SAME architecture's 6:20 coherence claim holds up under independent testing is the question worth watching when researchers start posting evaluations on the Hugging Face demo space in the coming days.
