HappyHorse-1.0 arrived quietly. On April 7, 2026, an anonymous model called "HappyHorse" appeared on Artificial Analysis's video generation leaderboard and immediately jumped to the top of both the text-to-video and image-to-video rankings. There was no blog post, no press release, and no technical paper. Three days later, Alibaba confirmed it: the model comes from the company's newly formed ATH Innovation Unit, built by a team called the Taotian Future Life Lab and led by Zhang Di, the former VP at Kuaishou who architected Kling AI before rejoining Alibaba in late 2025.

TL;DR

Ranked #1 on Artificial Analysis Video Arena for text-to-video (Elo 1,357) and image-to-video (Elo 1,402) without audio, as of April 2026
15B unified transformer, 4-10 second clips, 720p and 1080p output, joint audio-video generation in one pass
Priced at $0.14/s (720p) and $0.28/s (1080p) via fal.ai; also available on Alibaba Cloud Bailian

What makes the debut unusual isn't just the performance - it's the architecture. Most video generation models run video and audio as separate pipelines and stitch them together in post. HappyHorse creates both in a single forward pass, using a unified 40-layer transformer that processes text, image, video, and audio tokens in the same attention sequence. That design choice is why the audio stays in sync without a separate alignment step.

A follow-up version, HappyHorse-1.1, added native multilingual lip-sync across seven languages and made 1080p the standard output tier. It shipped shortly after the 1.0 launch as Alibaba moved to close the gap with Seedance 2.0 on the with-audio leaderboard track.

Key Specifications

Specification	Details
Provider	Alibaba ATH Innovation Unit (Taotian Future Life Lab)
Model Family	HappyHorse
Parameters	15B
Architecture	Unified 40-layer self-attention transformer
Max Clip Duration	4-10 seconds (fal.ai); up to 15s on Alibaba Cloud
Output Resolution	720p (standard), 1080p (premium)
Aspect Ratios	16:9, 9:16, 1:1, 4:3, 3:4
Price	$0.14/s at 720p, $0.28/s at 1080p (fal.ai)
Release Date	April 27, 2026
License	Closed source (proprietary weights)
API Availability	fal.ai, Alibaba Cloud Bailian

Benchmark Performance

HappyHorse-1.0 debuted at the top of Artificial Analysis's video leaderboard and held its position through May 2026. The scores below reflect the April 2026 standing as recorded at public launch.

Benchmark	HappyHorse-1.0	Seedance 2.0	Kling 3.0 Pro
AA Text-to-Video Elo (no audio)	1,357	1,273	1,243
AA Image-to-Video Elo (no audio)	1,402	1,355	~1,250 (est.)
AA Text-to-Video Elo (with audio)	1,215	1,220	-

Sources: Artificial Analysis Video Arena blind human preference voting, April 2026.

The Artificial Analysis Video Arena without-audio leaderboard at launch, April 2026. Source: fal.ai

The 84-point gap over Seedance 2.0 in text-to-video (no audio) is wide by leaderboard standards, but the with-audio track tells a more nuanced story. Seedance 2.0 edged HappyHorse-1.0 by 5 Elo points there, which is within measurement noise but worth noting for teams where audio quality is the primary criterion. Kling 3.0 Pro, covered in our Kling 3.0 review, doesn't appear in the with-audio Elo table because Kling's audio capabilities are handled differently.
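To put those gaps in perspective, an Elo difference can be converted into an expected head-to-head preference rate. The sketch below assumes the conventional Elo logistic curve with a 400-point scale; Artificial Analysis may fit its ratings differently, so treat the output as a rough reading rather than the leaderboard's own statistic. The helper name `expected_win_rate` is illustrative.

```python
def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise votes for A over B under the standard Elo curve.

    Assumption: a 400-point logistic scale, as in classic Elo.
    """
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / scale))


# Scores taken from the benchmark table (April 2026).
t2v = expected_win_rate(1357, 1273)         # HappyHorse-1.0 vs Seedance 2.0, no audio
i2v = expected_win_rate(1402, 1355)         # image-to-video, no audio
with_audio = expected_win_rate(1215, 1220)  # text-to-video, with audio

print(f"T2V: {t2v:.1%}  I2V: {i2v:.1%}  with audio: {with_audio:.1%}")
```

Under that assumption, the 84-point lead works out to roughly a 62% preference rate, the 47-point lead to about 57%, and the 5-point with-audio deficit to just over 49%, which is close to a coin flip.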

Our video generation benchmarks leaderboard tracks how these standings evolve month to month.

Key Capabilities

Joint Audio-Video Generation

Concatenating text, image, video, and audio tokens into a single sequence has a practical effect: the model doesn't guess where to put footsteps or a car engine sound after the video is rendered. It determines the audio at the same time it determines what's on screen. The first four and the last four layers of the 40-layer transformer handle modality-specific encoding and decoding; the middle 32 layers share parameters across all four token types.
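The layer split described here can be written out as a small bookkeeping sketch. It is not Alibaba's implementation (no official paper or code exists); the names `LayerSpec`, `build_layer_plan`, and `build_joint_sequence` are hypothetical, and the integer tokens are placeholders. It only shows the reported 4 + 32 + 4 layout and the idea of one shared token sequence.

```python
from dataclasses import dataclass

MODALITIES = ("text", "image", "video", "audio")
TOTAL_LAYERS = 40
EDGE_LAYERS = 4  # modality-specific layers at each end, per the reported design


@dataclass(frozen=True)
class LayerSpec:
    index: int
    shared: bool  # True: one set of weights serves all four modalities


def build_layer_plan(total: int = TOTAL_LAYERS, edge: int = EDGE_LAYERS) -> list[LayerSpec]:
    """Mark the first and last `edge` layers as modality-specific, the rest as shared."""
    return [LayerSpec(i, shared=edge <= i < total - edge) for i in range(total)]


def build_joint_sequence(tokens: dict[str, list[int]]) -> list[tuple[str, int]]:
    """Concatenate per-modality tokens into one sequence, tagging each with its modality."""
    return [(m, t) for m in MODALITIES for t in tokens.get(m, [])]


plan = build_layer_plan()
shared_count = sum(spec.shared for spec in plan)
sequence = build_joint_sequence({"text": [1, 2], "video": [7, 8, 9], "audio": [4]})
print(shared_count, len(plan) - shared_count, len(sequence))
```

Because every token sits in the same sequence, attention in the shared layers can relate an audio token to the video tokens it accompanies, which is the property the single-pass design relies on.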

Inference uses DMD-2 distillation with eight sampling steps and no classifier-free guidance, which is how the model gets to roughly 38 seconds for a 1080p clip on a single NVIDIA H100. That's fast for the resolution class, though direct comparison against Veo 3.1 (covered on our Veo 3.1 model page) depends on hardware setup and clip length.
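A quick way to see why eight steps without classifier-free guidance matters is to count denoiser forward passes. Classifier-free guidance needs both a conditional and an unconditional prediction at each step, so it doubles the per-step cost. The 50-step guided baseline below is a hypothetical comparison point, not a figure from Alibaba or fal.ai.

```python
def network_evaluations(steps: int, uses_cfg: bool) -> int:
    """Denoiser forward passes per clip.

    Classifier-free guidance requires a conditional and an unconditional
    prediction at every sampling step, so it costs two passes per step.
    """
    passes_per_step = 2 if uses_cfg else 1
    return steps * passes_per_step


distilled = network_evaluations(steps=8, uses_cfg=False)  # figures quoted in the text
baseline = network_evaluations(steps=50, uses_cfg=True)   # hypothetical undistilled sampler

print(distilled, baseline, baseline / distilled)
```

Against that assumed baseline, the distilled sampler runs 8 forward passes instead of 100. Wall-clock savings will not match that ratio exactly, since decoding and other fixed costs are not counted here.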

Four API Endpoints

The fal.ai integration exposes four distinct endpoints:

Text-to-video: full scene generation from a text prompt
Image-to-video: animate a static reference frame with motion guidance
Reference-to-video: style-consistent generation across multiple clips using up to nine reference images (1.1 only)
Video editing: natural language edits on existing footage

The image-to-video endpoint is where HappyHorse-1.0 posts its highest Artificial Analysis score (Elo 1,402), 47 points ahead of Seedance 2.0 in the no-audio track. That lead is narrower than the 84-point gap in text-to-video.
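The four modes and the limits quoted in the specification table can be captured in a small request builder that rejects invalid options before any paid call is made. This is a sketch only: the field names (`mode`, `prompt`, `duration`, `resolution`, `aspect_ratio`, `image_urls`) and the `build_request` helper are hypothetical and do not reflect fal.ai's documented schema, which should be checked before use.

```python
# Hypothetical request builder: field names are illustrative, not fal.ai's schema.
MODES = {"text-to-video", "image-to-video", "reference-to-video", "video-editing"}
RESOLUTIONS = {"720p", "1080p"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
MIN_SECONDS, MAX_SECONDS = 4, 10  # fal.ai clip range quoted in this article
MAX_REFERENCE_IMAGES = 9          # reference-to-video limit (1.1 only)


def build_request(mode, prompt, *, duration=5, resolution="720p",
                  aspect_ratio="16:9", image_urls=()):
    """Validate options against the published limits and return a payload dict."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        raise ValueError(f"duration must be {MIN_SECONDS}-{MAX_SECONDS} seconds")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {aspect_ratio}")

    images = list(image_urls)
    if mode == "image-to-video" and len(images) != 1:
        raise ValueError("image-to-video takes exactly one reference frame")
    if mode == "reference-to-video" and not 1 <= len(images) <= MAX_REFERENCE_IMAGES:
        raise ValueError(f"reference-to-video takes 1-{MAX_REFERENCE_IMAGES} images")
    # Source footage for video-editing is omitted from this sketch.

    payload = {
        "mode": mode,
        "prompt": prompt,
        "duration": duration,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
    }
    if images:
        payload["image_urls"] = images
    return payload


print(build_request("text-to-video", "a horse galloping along a beach at dawn"))
```

Validating locally is a design convenience rather than a requirement: it keeps evaluation batches from spending money on requests the service would reject anyway.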

Multilingual Lip-Sync (1.1 Only)

HappyHorse-1.1 added native lip-sync across seven languages: English, Mandarin, Cantonese, Japanese, Korean, German, and French. The sync is generated in-pass, not added as a post-processing alignment step. Compared to Seedance 2.0's approach as analyzed in our Seedance 2.0 review, the single-pass generation means mouth movements and audio are determined together rather than fitted afterward.

The Artificial Analysis with-audio leaderboard comparison for HappyHorse-1.0. The track shows Seedance 2.0 slightly ahead once audio quality enters the evaluation, a gap HappyHorse-1.1 aims to close. Source: fal.ai

Pricing and Availability

HappyHorse-1.0 launched with API access on two platforms, plus limited consumer access through the Qwen App:

fal.ai is the primary Western API access point, with pay-per-second billing and no minimum spend:

720p: $0.14/second
1080p: $0.28/second
A 5-second 720p clip costs $0.70. A 10-second 1080p clip costs $2.80.
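The per-second billing makes batch budgeting a one-line calculation. The sketch below uses the fal.ai rates quoted above; the helper names `clip_cost` and `batch_cost` are illustrative, and `Decimal` is used to avoid floating-point rounding in currency math.

```python
from decimal import Decimal

# fal.ai per-second rates quoted in this article (USD).
FAL_PRICE_PER_SECOND = {"720p": Decimal("0.14"), "1080p": Decimal("0.28")}


def clip_cost(seconds: int, resolution: str) -> Decimal:
    """Cost of one clip at the given length and resolution."""
    return FAL_PRICE_PER_SECOND[resolution] * seconds


def batch_cost(clips: int, seconds: int, resolution: str) -> Decimal:
    """Cost of a batch of identical clips, e.g. an evaluation run."""
    return clip_cost(seconds, resolution) * clips


print(clip_cost(5, "720p"))         # 0.70
print(clip_cost(10, "1080p"))       # 2.80
print(batch_cost(200, 5, "720p"))   # 140.00
```

A 200-clip evaluation batch of 5-second 720p clips would therefore come to $140 at the listed rate, before any discounts or failed generations.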

Alibaba Cloud Bailian (Alibaba's model-as-a-service platform) prices in RMB: ¥0.9/s at 720p and ¥1.6/s at 1080p, roughly $0.31/s at 1080p at current exchange rates - slightly more expensive than fal.ai in USD terms. Alibaba offered a 10% introductory discount at launch.

The Qwen App provides limited free access for consumer-facing users, with a daily quota for 720p clips.

Compared to Seedance 2.0's API pricing and Kling 3.0's commercial tiers, HappyHorse-1.0 sits in a similar range for 1080p output. The no-minimum billing structure makes it practical for developers running evaluation batches before committing to volume.

Enterprise pricing is available from Alibaba directly for high-volume video production use cases, especially in e-commerce contexts where Taotian integration gives HappyHorse a distribution advantage inside Alibaba's ecosystem.

Strengths and Weaknesses

Strengths

Leads Artificial Analysis Video Arena on text-to-video and image-to-video (no audio) by a meaningful margin as of April 2026
Joint single-pass audio-video generation avoids the sync artifacts common in post-processed audio tracks
Four API modes, including video editing and reference-to-video, not just basic generation
Competitive per-second pricing with no minimum commitment on fal.ai
DMD-2 distillation keeps generation fast at eight sampling steps

Weaknesses

Clips cap at 10 seconds on fal.ai (15s on Alibaba Cloud), shorter than Seedance 2.0's 15-second default
Closed weights - no fine-tuning, no local deployment, no customization outside Alibaba's enterprise arrangements
Seedance 2.0 leads on the with-audio Elo track, a gap HappyHorse-1.1 narrows but doesn't fully close
Open-source claims made at announcement haven't been fulfilled as of mid-2026; weights haven't been published
No official technical paper; architecture details rely on third-party reporting and developer reverse engineering

Related

Video Generation Benchmarks Leaderboard - full monthly rankings across all video models
Veo 3.1 model page - Google's competing video model with 4K output
Seedance 2.0 Review - ByteDance's closest competitor on the Artificial Analysis leaderboard
Kling 3.0 Review - the model Zhang Di built before joining Alibaba

FAQ

What is HappyHorse-1.0?

HappyHorse-1.0 is Alibaba's 15B-parameter video generation model that ranked #1 on Artificial Analysis in April 2026, built by the ATH Innovation Unit using a unified transformer that generates video and audio in a single pass.

Is HappyHorse-1.0 open source?

No. Despite early announcements suggesting Apache 2.0 release plans, no model weights have been published as of mid-2026. Access is through fal.ai and Alibaba Cloud Bailian APIs only.

How much does HappyHorse-1.0 cost?

On fal.ai: $0.14/s for 720p and $0.28/s for 1080p. A 10-second 1080p clip costs $2.80, with no subscription or minimum spend required.

What is the difference between HappyHorse-1.0 and 1.1?

HappyHorse-1.1 adds native multilingual lip-sync across seven languages, support for up to nine reference images, and positions 1080p as the standard output tier. The Elo scores are similar, with 1.1 showing slight gains on the with-audio leaderboard track.

Who built HappyHorse?

It was built by Alibaba's Taotian Future Life Lab inside the ATH Innovation Unit, led by Zhang Di - the former VP at Kuaishou who designed Kling AI before rejoining Alibaba in late 2025.

How long can HappyHorse clips be?

On fal.ai, clips run 4-10 seconds. Alibaba Cloud Bailian supports up to 15 seconds. Multi-shot sequences require external stitching.
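For a sequence longer than one clip, the target runtime has to be divided into shots that each fit the 4-10 second window before stitching them together in an external editor. The helper below is a hypothetical planning sketch that assumes whole-second durations and the fal.ai limits; it does not call any API or perform the stitching itself.

```python
import math

MIN_SECONDS, MAX_SECONDS = 4, 10  # fal.ai per-clip range quoted above


def plan_shots(total_seconds: int) -> list[int]:
    """Split a whole-second target runtime into the fewest clips of 4-10 seconds each."""
    if total_seconds < MIN_SECONDS:
        raise ValueError("target is shorter than the minimum clip length")
    count = math.ceil(total_seconds / MAX_SECONDS)
    base, extra = divmod(total_seconds, count)
    # The first `extra` shots carry one additional second so the lengths sum exactly.
    return [base + 1 if i < extra else base for i in range(count)]


print(plan_shots(25))  # [9, 8, 8]
```

Spreading the runtime evenly avoids a very short final shot, which a naive "fill to 10 seconds, then use the remainder" split could otherwise produce.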
