HappyHorse-1.0
HappyHorse-1.0 is Alibaba's 15-billion-parameter video generation model that ranked #1 on Artificial Analysis, producing 720p-1080p clips with joint audio-video synthesis in a single forward pass.

HappyHorse-1.0 arrived quietly. On April 7, 2026, an anonymous model called "HappyHorse" appeared on Artificial Analysis's video generation leaderboard and immediately jumped to the top of both the text-to-video and image-to-video rankings. No blog post, no press release, no technical paper. Three days later, Alibaba confirmed it: the model comes from the company's newly formed ATH Innovation Unit, built by a team called the Taotian Future Life Lab and led by Zhang Di, the former VP at Kuaishou who architected Kling AI before rejoining Alibaba in late 2025.
TL;DR
- Ranked #1 on Artificial Analysis Video Arena for text-to-video (Elo 1,357) and image-to-video (Elo 1,402) without audio, as of April 2026
- 15B unified transformer, 4-10 second clips, 720p and 1080p output, joint audio-video generation in one pass
- Priced at $0.14/s (720p) and $0.28/s (1080p) via fal.ai; also available on Alibaba Cloud Bailian
What makes the debut unusual isn't just the performance - it's the architecture. Most video generation models run video and audio as separate pipelines and stitch them together in post. HappyHorse creates both in a single forward pass, using a unified 40-layer transformer that processes text, image, video, and audio tokens in the same attention sequence. That design choice is why the audio stays in sync without a separate alignment step.
A follow-up version, HappyHorse-1.1, added native multilingual lip-sync across seven languages and made 1080p the standard output tier rather than a premium option. It shipped shortly after the 1.0 launch as Alibaba moved to close the gap with Seedance 2.0 on the with-audio leaderboard track.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Alibaba ATH Innovation Unit (Taotian Future Life Lab) |
| Model Family | HappyHorse |
| Parameters | 15B |
| Architecture | Unified 40-layer self-attention transformer |
| Clip Duration | 4-10 seconds (fal.ai); up to 15 seconds (Alibaba Cloud Bailian) |
| Output Resolution | 720p (standard), 1080p (premium) |
| Aspect Ratios | 16:9, 9:16, 1:1, 4:3, 3:4 |
| Price (per second of output) | $0.14/s at 720p, $0.28/s at 1080p (fal.ai) |
| Release Date | April 27, 2026 |
| License | Closed source (proprietary weights) |
| API Availability | fal.ai, Alibaba Cloud Bailian |
Benchmark Performance
HappyHorse-1.0 debuted at the top of Artificial Analysis's video leaderboard and held its position through May 2026. The scores below reflect the April 2026 standing as recorded at public launch.
| Benchmark | HappyHorse-1.0 | Seedance 2.0 | Kling 3.0 Pro |
|---|---|---|---|
| AA Text-to-Video Elo (no audio) | 1,357 | 1,273 | 1,243 |
| AA Image-to-Video Elo (no audio) | 1,402 | 1,355 | ~1,250 (est.) |
| AA Text-to-Video Elo (with audio) | 1,215 | 1,220 | - |
Sources: Artificial Analysis Video Arena blind human preference voting, April 2026.
Figure: The Artificial Analysis Video Arena without-audio leaderboard at launch, April 2026. (Source: fal.ai)
The 84-point gap over Seedance 2.0 in text-to-video (no audio) is wide by leaderboard standards, but the with-audio track tells a more nuanced story. Seedance 2.0 edged HappyHorse-1.0 by 5 Elo points there, which is within measurement noise but worth noting for teams where audio quality is the primary criterion. Kling 3.0 Pro, covered in our Kling 3.0 review, doesn't appear in the with-audio Elo table because Kling's audio capabilities are handled differently.
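To make these gaps more concrete, an Elo difference can be converted into an expected head-to-head preference rate. The sketch below assumes the standard Elo formula with a 400-point scale; Artificial Analysis may compute or scale its arena ratings differently, so treat the percentages as rough intuition rather than published figures. The function name `expected_win_rate` is illustrative.

```python
def expected_win_rate(elo_gap: float, scale: float = 400.0) -> float:
    """Expected preference rate for the higher-rated model under the standard Elo formula.

    Assumption: a 400-point logistic scale, as in classic Elo. The leaderboard's
    actual rating method is not documented in this article.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


# Gaps computed from the benchmark table (April 2026 standings).
gaps = {
    "text-to-video, no audio (HappyHorse over Seedance 2.0)": 1357 - 1273,
    "image-to-video, no audio (HappyHorse over Seedance 2.0)": 1402 - 1355,
    "text-to-video, with audio (Seedance 2.0 over HappyHorse)": 1220 - 1215,
}

for label, gap in gaps.items():
    print(f"{label}: {gap} points -> {expected_win_rate(gap):.1%}")
```

Under that assumption, the 84-point lead corresponds to being preferred in roughly 62% of pairwise votes, while the 5-point with-audio gap is close to a coin flip at about 51%.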
Our video generation benchmarks leaderboard tracks how these standings evolve month to month.
Key Capabilities
Joint Audio-Video Generation
The architecture decision to concatenate text, image, video, and audio tokens into a single sequence has a practical effect: the model doesn't guess where to put footsteps or a car engine sound after the video is rendered. It determines the audio at the same time it determines what's on screen. The first and last four layers of the 40-layer transformer handle modality-specific encoding and decoding; the middle 32 layers share parameters across all four token types.
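The layer split described above can be written down as a small schematic. This is only a bookkeeping sketch of the reported layout (4 modality-specific layers, 32 shared, 4 modality-specific), not the model's implementation; `LayerSpec`, `layer_plan`, and `build_sequence` are hypothetical names, and the modality ordering in the sequence is an assumption.

```python
from dataclasses import dataclass

# Assumed ordering of modalities within the joint sequence.
MODALITIES = ("text", "image", "video", "audio")


@dataclass(frozen=True)
class LayerSpec:
    index: int
    shared: bool  # True when one set of weights serves every modality


def layer_plan(total_layers: int = 40, edge_layers: int = 4) -> list[LayerSpec]:
    """Mark the first and last `edge_layers` as modality-specific and the rest as shared."""
    return [
        LayerSpec(index=i, shared=edge_layers <= i < total_layers - edge_layers)
        for i in range(total_layers)
    ]


def build_sequence(tokens: dict[str, list[int]]) -> list[tuple[str, int]]:
    """Concatenate per-modality tokens into one sequence, tagging each token with its modality."""
    return [(m, t) for m in MODALITIES for t in tokens.get(m, [])]


plan = layer_plan()
shared_count = sum(spec.shared for spec in plan)
print(shared_count, len(plan) - shared_count)  # shared layers, modality-specific layers
```

Because every token sits in the same sequence, the shared middle layers can attend across modalities, which is the property the article credits for keeping audio and video in sync.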
Inference uses DMD-2 distillation with eight sampling steps and no classifier-free guidance, which is how the model gets to roughly 38 seconds for a 1080p clip on a single NVIDIA H100. That's fast for the resolution class, though direct comparison against Veo 3.1 (covered on our Veo 3.1 model page) depends on hardware setup and clip length.
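A rough way to see why eight steps without classifier-free guidance matters is to count network evaluations per clip. Classifier-free guidance normally needs both a conditional and an unconditional prediction at every step, so it doubles the work. The 50-step guided baseline below is a hypothetical comparison point, not a figure reported for any specific model.

```python
def forward_passes(steps: int, uses_cfg: bool) -> int:
    """Network evaluations per clip.

    With classifier-free guidance, each step evaluates the network twice
    (conditional and unconditional), often as one doubled batch.
    """
    return steps * (2 if uses_cfg else 1)


distilled = forward_passes(steps=8, uses_cfg=False)  # figures reported in this article
baseline = forward_passes(steps=50, uses_cfg=True)   # hypothetical undistilled sampler

print(distilled, baseline, baseline / distilled)
```

Wall-clock time will not scale exactly with this count, since encoding, decoding, and memory traffic also contribute.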
Four API Endpoints
The fal.ai integration exposes four distinct endpoints:
- Text-to-video: full scene generation from a text prompt
- Image-to-video: animate a static reference frame with motion guidance
- Reference-to-video: style-consistent generation across multiple clips using up to nine reference images (1.1 only)
- Video editing: natural language edits on existing footage
The image-to-video endpoint is where HappyHorse-1.0 posts its highest Artificial Analysis score (Elo 1,402), 47 points ahead of Seedance 2.0 in the no-audio track. The text-to-video lead over Seedance 2.0 is wider, at 84 points.
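The limits in the specifications table can be checked client-side before a request is sent. The sketch below is a minimal validator under stated assumptions: the `ClipRequest` class and its field names (`duration`, `resolution`, `aspect_ratio`) are hypothetical and do not reflect the actual fal.ai or Bailian request schema, which should be taken from the provider's documentation.

```python
from dataclasses import dataclass

# Limits taken from the specifications table (fal.ai figures).
ALLOWED_RESOLUTIONS = {"720p", "1080p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
MIN_SECONDS, MAX_SECONDS = 4, 10


@dataclass(frozen=True)
class ClipRequest:
    """Hypothetical request shape; field names are illustrative, not the real API schema."""

    prompt: str
    duration_s: int = 5
    resolution: str = "720p"
    aspect_ratio: str = "16:9"

    def validate(self) -> None:
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if not MIN_SECONDS <= self.duration_s <= MAX_SECONDS:
            raise ValueError(f"duration must be {MIN_SECONDS}-{MAX_SECONDS} seconds")
        if self.resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"unsupported resolution: {self.resolution}")
        if self.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
            raise ValueError(f"unsupported aspect ratio: {self.aspect_ratio}")

    def to_payload(self) -> dict:
        """Return a plain dict after validation, ready to serialize as JSON."""
        self.validate()
        return {
            "prompt": self.prompt,
            "duration": self.duration_s,
            "resolution": self.resolution,
            "aspect_ratio": self.aspect_ratio,
        }


request = ClipRequest(prompt="A horse galloping across a beach at sunrise", duration_s=8)
print(request.to_payload())
```

Validating locally avoids paying for a round trip that the service would reject, for example a 12-second request against the 10-second fal.ai cap.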
Multilingual Lip-Sync (1.1 Only)
HappyHorse-1.1 added native lip-sync across seven languages: English, Mandarin, Cantonese, Japanese, Korean, German, and French. The sync is generated in-pass, not added as a post-processing alignment step. Compared to Seedance 2.0's approach as analyzed in our Seedance 2.0 review, the single-pass generation means mouth movements and audio are determined together rather than fitted afterward.
Figure: The with-audio leaderboard track shows Seedance 2.0 slightly ahead once audio quality enters the evaluation, a gap HappyHorse-1.1 aims to close. (Source: fal.ai)
Pricing and Availability
HappyHorse-1.0 is available through two API platforms, with limited consumer access in the Qwen App:
fal.ai is the primary Western API access point, with pay-per-second billing and no minimum spend:
- 720p: $0.14/second
- 1080p: $0.28/second
- A 5-second 720p clip costs $0.70. A 10-second 1080p clip costs $2.80.
Alibaba Cloud Bailian (Alibaba's model-as-a-service platform) prices in RMB: ¥0.9/s at 720p and ¥1.6/s at 1080p, roughly $0.31/s at 1080p at current exchange rates - slightly more expensive than fal.ai in USD terms. Alibaba offered a 10% introductory discount at launch.
The Qwen App provides limited free access for consumer-facing users, with a daily quota for 720p clips.
Compared to Seedance 2.0's API pricing and Kling 3.0's commercial tiers, HappyHorse-1.0 sits in a similar range for 1080p output. The no-minimum billing structure makes it practical for developers running evaluation batches before committing to volume.
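For budgeting an evaluation batch, the per-second fal.ai rates translate directly into a small cost calculator. The sketch below uses `Decimal` to avoid floating-point rounding on currency; the function names and the batch mix are hypothetical, and the rates are the ones quoted in this article, which may change.

```python
from decimal import Decimal

# Per-second fal.ai rates quoted in this article, in USD.
RATE_PER_SECOND = {"720p": Decimal("0.14"), "1080p": Decimal("0.28")}


def clip_cost(seconds: int, resolution: str) -> Decimal:
    """Cost of one clip billed per second of generated output."""
    return RATE_PER_SECOND[resolution] * seconds


def batch_cost(clips: list[tuple[int, str]]) -> Decimal:
    """Total cost of a list of (seconds, resolution) clips."""
    return sum((clip_cost(s, r) for s, r in clips), Decimal("0"))


# Hypothetical evaluation batch: twenty 5-second 720p clips and five 10-second 1080p clips.
evaluation_batch = [(5, "720p")] * 20 + [(10, "1080p")] * 5

print(clip_cost(5, "720p"))          # 0.70
print(clip_cost(10, "1080p"))        # 2.80
print(batch_cost(evaluation_batch))  # 28.00
```

At these rates, a 25-clip evaluation run of that shape comes to $28.00, with the five 1080p clips costing as much as the twenty 720p ones.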
Enterprise pricing is available from Alibaba directly for high-volume video production use cases, especially in e-commerce contexts where Taotian integration gives HappyHorse a distribution advantage inside Alibaba's ecosystem.
Strengths and Weaknesses
Strengths
- Leads Artificial Analysis Video Arena on text-to-video and image-to-video (no audio) as of April 2026, by 84 and 47 Elo points over Seedance 2.0 respectively
- Joint single-pass audio-video generation avoids the sync artifacts common in post-processed audio tracks
- Four API modes, including video editing and reference-to-video, not just basic generation
- Competitive per-second pricing with no minimum commitment on fal.ai
- DMD-2 distillation keeps generation fast at eight sampling steps
Weaknesses
- Clips cap at 10 seconds on fal.ai (15s on Alibaba Cloud), shorter than Seedance 2.0's 15-second default
- Closed weights - no fine-tuning, no local deployment, no customization outside Alibaba's enterprise arrangements
- Seedance 2.0 leads on the with-audio Elo track, a gap HappyHorse-1.1 narrows but doesn't fully close
- Open-source claims made at announcement haven't been fulfilled as of mid-2026; weights haven't been published
- No official technical paper; architecture details rely on third-party reporting and developer reverse engineering
Related Coverage
- Video Generation Benchmarks Leaderboard - full monthly rankings across all video models
- Veo 3.1 model page - Google's competing video model with 4K output
- Seedance 2.0 Review - ByteDance's closest competitor on the Artificial Analysis leaderboard
- Kling 3.0 Review - the model Zhang Di built before joining Alibaba
FAQ
What is HappyHorse-1.0?
HappyHorse-1.0 is Alibaba's 15B-parameter video generation model that ranked #1 on Artificial Analysis in April 2026, built by the ATH Innovation Unit using a unified transformer that generates video and audio in a single pass.
Is HappyHorse-1.0 open source?
No. Despite early announcements suggesting Apache 2.0 release plans, no model weights have been published as of mid-2026. Access is through fal.ai and Alibaba Cloud Bailian APIs only.
How much does HappyHorse-1.0 cost?
On fal.ai: $0.14/s for 720p and $0.28/s for 1080p. A 10-second 1080p clip costs $2.80, with no subscription or minimum spend required.
What is the difference between HappyHorse-1.0 and 1.1?
HappyHorse-1.1 adds native multilingual lip-sync across seven languages, support for up to nine reference images, and positions 1080p as the standard output tier. The Elo scores are similar, with 1.1 showing slight gains on the with-audio leaderboard track.
Who built HappyHorse?
It was built by Alibaba's Taotian Future Life Lab inside the ATH Innovation Unit, led by Zhang Di - the former VP at Kuaishou who designed Kling AI before rejoining Alibaba in late 2025.
How long can HappyHorse clips be?
On fal.ai, clips run 4-10 seconds. Alibaba Cloud Bailian supports up to 15 seconds. Multi-shot sequences require external stitching.
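When a longer sequence has to be assembled from several clips, one simple approach is to split the target runtime into near-equal shots that each fit the per-clip limits. The sketch below assumes whole-second durations and the 4-10 second fal.ai range; `plan_shots` is a hypothetical helper, and the actual stitching would still be done in an external tool such as a video editor.

```python
import math


def plan_shots(total_seconds: int, min_len: int = 4, max_len: int = 10) -> list[int]:
    """Split a target runtime into the fewest near-equal shots within [min_len, max_len]."""
    if total_seconds < min_len:
        raise ValueError(f"target runtime must be at least {min_len} seconds")
    count = math.ceil(total_seconds / max_len)  # fewest clips that can cover the runtime
    base, extra = divmod(total_seconds, count)  # spread the remainder one second at a time
    if base < min_len:
        raise ValueError("runtime cannot be split within the given clip limits")
    return [base + 1] * extra + [base] * (count - extra)


print(plan_shots(25))  # [9, 8, 8]
print(plan_shots(11))  # [6, 5]
```

Using the fewest possible shots keeps the number of cut points, and therefore visible seams, to a minimum.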
Sources:
- HappyHorse-1.0 AI Goes Live on fal - fal.ai
- HappyHorse-1.0 official page on fal.ai
- Alibaba launches HappyHorse video model - DataNorth
- HappyHorse 1.1 - Alibaba video generation model 2026 - ExplainX
- Alibaba confirms HappyHorse belongs to its ATH unit - TechNode
- Happy Horse by Alibaba Cloud - full model overview - AI/ML API
- HappyHorse API on Alibaba Cloud Bailian - Apiyi
- Why Is HappyHorse-1.0 Suddenly #1 on Video Leaderboard? - WaveSpeed
✓ Last verified June 24, 2026
