# Hugging Face Launches Storage Buckets for ML Artifacts
Hugging Face introduced Storage Buckets, mutable S3-like object storage built on Xet deduplication for ML checkpoints, logs, and artifacts - starting at $8/TB/month at volume.

Hugging Face shipped Storage Buckets, the first new repository type on the Hub in four years. Buckets are mutable, non-versioned object storage containers designed for the artifacts that ML workflows generate constantly but Git was never built to handle - training checkpoints, optimizer states, processed dataset shards, agent traces, logs, and intermediate pipeline outputs.
## TL;DR
- Storage Buckets are S3-like mutable object storage on the Hugging Face Hub - non-versioned, optimized for high-throughput writes
- Built on Xet, HF's chunk-based deduplication engine - successive checkpoints sharing 80% of data only cost 20% extra storage
- Pricing starts at $12/TB/month for public storage, dropping to $8/TB/month at 500 TB+ volume - competitive with cloud object storage
- Full CLI (`hf buckets sync`), Python SDK (`huggingface_hub` v1.5.0+), JavaScript, and fsspec support
- Tested by Jasper, Arcee, IBM, and PixAI before launch; direct bucket-to-repo promotion is planned
## Why Not Git
Hugging Face's existing model and dataset repos use Git-based version control. That works well for publishing finished artifacts - a trained model, a curated dataset - but breaks down for the messy middle of ML work. Git struggles with concurrent writes from training clusters, frequent file overwrites, and the sheer volume of transient data that training runs produce.
Buckets solve this by stripping out versioning entirely. They are mutable containers. You write, overwrite, and delete objects freely. No commits, no diffs, no history. The tradeoff is intentional: checkpoints and logs don't need an audit trail; they need fast, cheap storage with concurrent access.
## Xet Deduplication
The technical story underneath is Xet, the chunk-based storage backend Hugging Face has been building. Instead of storing files as monolithic blobs, Xet breaks them into content-defined chunks and deduplicates across the storage layer.
This is a natural fit for ML workflows. Successive training checkpoints share the vast majority of their bytes. A 1 TB raw dataset processed into a 1.2 TB output with 80% overlap needs only roughly 400 GB of additional storage: 0.8 TB of the output deduplicates against chunks already stored, so only the remaining ~400 GB of new chunks are written. Transfers are faster too - the sync operation skips bytes already present on either end.
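The savings arithmetic above can be sketched with a toy deduplicating store. Xet itself uses content-defined chunk boundaries (so an insertion doesn't shift every later chunk); the fixed-size chunks below are a simplification for illustration only.

```python
import hashlib

CHUNK = 64 * 1024  # fixed-size chunks; real Xet uses content-defined boundaries

class DedupStore:
    """Toy chunk-level deduplicating store (illustrative, not the Xet API)."""

    def __init__(self):
        self.chunks = {}  # sha256 hex digest -> chunk bytes

    def put(self, data: bytes) -> int:
        """Store data, returning how many *new* bytes were actually written."""
        new_bytes = 0
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            key = hashlib.sha256(chunk).hexdigest()
            if key not in self.chunks:
                self.chunks[key] = chunk
                new_bytes += len(chunk)
        return new_bytes

store = DedupStore()
# A 10-chunk "checkpoint", then a successor sharing its first 8 chunks.
ckpt1 = b"".join(bytes([i]) * CHUNK for i in range(10))
ckpt2 = ckpt1[:8 * CHUNK] + bytes([200]) * CHUNK + bytes([201]) * CHUNK

print(store.put(ckpt1))  # 655360 - full cost for the first checkpoint
print(store.put(ckpt2))  # 131072 - only the 2 changed chunks are stored
```

The second checkpoint shares 80% of its chunks with the first, so it consumes only 20% of its size in new storage - the same ratio the article quotes for successive checkpoints.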
Enterprise accounts are billed on deduplicated storage, not raw bytes written.
## Pricing
Bucket storage follows the Hub's standard tiered pricing:
| Volume | Public | Private |
|---|---|---|
| Base | $12/TB/month | $18/TB/month |
| 50 TB+ | $10/TB/month | $16/TB/month |
| 200 TB+ | $9/TB/month | $14/TB/month |
| 500 TB+ | $8/TB/month | $12/TB/month |
At the 500 TB+ tier, $8/TB/month for public storage undercuts AWS S3 Standard ($23/TB/month) by roughly 3x. Deduplication pushes effective costs even lower for workloads with high data overlap.
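The table above reduces to a small rate lookup. This sketch assumes the tier rate applies to the entire stored volume (the announcement does not say whether tiers are marginal); `monthly_cost` and the tier tables are illustrative names, not part of any Hugging Face API.

```python
# (min TB threshold, $/TB/month) - rates from the pricing table above.
PUBLIC_TIERS = [(500, 8), (200, 9), (50, 10), (0, 12)]
PRIVATE_TIERS = [(500, 12), (200, 14), (50, 16), (0, 18)]

def monthly_cost(tb: float, private: bool = False) -> float:
    """Monthly bill, assuming the matched tier rate covers the full volume."""
    tiers = PRIVATE_TIERS if private else PUBLIC_TIERS
    rate = next(r for threshold, r in tiers if tb >= threshold)
    return tb * rate

print(monthly_cost(600))        # 600 TB public at $8/TB  -> 4800.0
print(monthly_cost(100, True))  # 100 TB private at $16/TB -> 1600.0
```

Billing on deduplicated rather than raw bytes means the `tb` input here would be the post-dedup footprint, which is where checkpoint-heavy workloads see the real savings.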
## How It Works
Buckets live at `hf://buckets/username/bucket-name` and are browsable on the Hub like any other repo type. Permissions follow the existing Hugging Face access model: buckets can be public or private, under a user or org namespace.
The CLI is the primary interface for heavy usage:

```shell
# Create a private bucket
hf buckets create my-training-bucket --private

# Sync checkpoints to the bucket
hf buckets sync ./checkpoints hf://buckets/user/my-training-bucket/checkpoints

# Preview what would sync without executing
hf buckets sync ./checkpoints hf://buckets/user/my-training-bucket/checkpoints --dry-run
```
The Python SDK (`huggingface_hub` v1.5.0+) exposes `create_bucket`, `sync_bucket`, and `list_bucket_tree`. The fsspec integration means pandas, Polars, and Dask can read and write directly:

```python
import pandas as pd

df = pd.read_csv("hf://buckets/user/my-training-bucket/results.csv")
```
JavaScript support is available in `@huggingface/hub` v2.10.5+.
## Pre-warming and Multi-Cloud
Buckets support pre-warming - caching hot data in specific cloud regions before compute jobs start. Launch partners include AWS and GCP, with more regions planned. For distributed training pipelines that span clouds, this avoids the latency penalty of cross-region data transfers.
## What's Next
Hugging Face is building direct transfers from buckets to versioned repos. The intended workflow: train in a bucket, promote the final checkpoint to a model repo, commit processed shards to a dataset repo. Working storage and publishing stay separate but connected.
The feature was tested by Jasper, Arcee, IBM, and PixAI before launch.
Buckets fill a gap that has pushed ML teams toward S3, GCS, or ad-hoc NFS mounts for working storage while keeping Hugging Face only for distribution. If Xet deduplication delivers on the cost and speed claims in practice, buckets could consolidate both halves of that workflow on the Hub - which is exactly the platform play Hugging Face is making.
Sources: Introducing Storage Buckets on Hugging Face - Hugging Face Blog | Hugging Face Pricing
