Name: Grok Imagine Video 1.5
Author: xAI

Overview

Grok Imagine Video 1.5 is xAI's production image-to-video and text-to-video model, released as generally available on June 16-17, 2026. Built on Aurora, xAI's autoregressive video engine trained on 110,000 NVIDIA GB200 GPUs, it produces 720p clips up to 15 seconds with synchronized audio in a single inference pass. As of its GA launch, it holds the #1 spot on the Artificial Analysis Image-to-Video Arena leaderboard with an Elo of 1,473 - a 52-point jump over version 1.0, which Artificial Analysis noted as the largest single-version gain recorded on that board.

TL;DR

Best-in-class image-to-video: #1 on the Artificial Analysis i2v Arena at GA launch (Elo 1,473), ahead of Veo 3.1, Kling 3.0, and Seedance 2.0
720p clips up to 15 seconds, native synchronized audio at no surcharge, generation in ~25 seconds via the Fast variant
API at $0.14/s for 720p (quoted by xAI as $4.20/min; 60 seconds at the per-second rate comes to $8.40) - 86% cheaper than Sora 2 Pro, 65% cheaper than Veo 3.1

The model went from zero to first place in the i2v category in roughly ten months. xAI launched Grok Imagine Video in August 2025, updated it to version 1.0 in February 2026 (10-second clips, 720p, improved audio), and shipped 1.5 as a preview on May 31, 2026, with GA following in mid-June. In the 30 days following the 1.0 release, users generated 1.245 billion videos, giving xAI a large pool of feedback for training the next iteration. Version 1.5 extends the maximum clip length from 10 to 15 seconds, improves physics and motion coherence, and adds a Fast variant - a speed-optimized build now live on grok.com and the iOS/Android apps.

The model gained mainstream attention when Elon Musk posted an AI-made Iliad trailer built with Grok Imagine 1.5 on June 4, 2026. The 40-second clip drew 18.4 million views on X, the kind of viral visibility no benchmark can replicate. Engagement alone doesn't validate the model's quality metrics, but the trailer put its capability for cinematic composition in front of a wide audience.

August 2025 - Grok Imagine Video launches as xAI's first video generation product.
January 28, 2026 - Grok Imagine API launches: text-to-video, image-to-video, and video editing at $0.05/second.
February 2026 - Version 1.0 released. Clips extend to 10 seconds, 720p output, improved audio. 1.245 billion videos produced in the first 30 days.
May 31, 2026 - Version 1.5 enters preview via the xAI API. Elo 1,404 on day one, +52 over v1.0.
June 16-17, 2026 - Version 1.5 goes GA. Fast variant rolls out on grok.com and mobile apps.

Key Specifications

Specification	Details
Provider	xAI
Model Family	Grok Imagine
Engine	Aurora (autoregressive, not diffusion-based)
Parameters	Not disclosed
Max Clip Length	15 seconds (video editing capped at 8.7s)
Frame Rate	24 fps
Resolutions	480p, 720p; 1080p for image-to-video only
Aspect Ratios	7 options: 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3
Audio	Native one-pass generation (dialogue, SFX, music)
API Price (480p)	$0.08/second
API Price (720p)	$0.14/second
API Price (1080p)	$0.25/second
API Model ID	grok-imagine-video-1.5
Release Date (GA)	June 16, 2026
Open Source	No
License	Proprietary

Benchmark Performance

The clearest quality signal comes from the Artificial Analysis Video Arena, which collects blind pairwise votes from users comparing videos produced from the same prompt. As of June 24, 2026, the image-to-video, with-audio leaderboard ranks the models as follows:

Model	Elo (i2v + audio)	Provider
Dreamina Seedance 2.0 720p	1,194	ByteDance Seed
HappyHorse-1.1	1,121	Alibaba-ATH
Grok Imagine Video 1.5 preview	1,111	xAI
Wan 2.7	1,090	Alibaba
Veo 3.1	1,087	Google
Grok Imagine Video 1.0	1,081	xAI
Kling 3.0 1080p (Pro)	1,072	Kuaishou

Note: these are live scores and shift as more votes accumulate. The Elo of 1,473 cited in xAI's announcement and the buildfastwithai review reflects the Arena leaderboard at or shortly after GA launch, when vote counts were lower and confidence intervals wider. The current score of 1,111 on the Artificial Analysis with-audio board reflects a larger vote pool. Version 1.5 still sits at #3 on that board, above Veo 3.1 and Kling 3.0.
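To put the Elo gaps in the table in perspective, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the standard Elo logistic formula with a 400-point scale; Artificial Analysis may fit its ratings with a different variant, so treat the numbers as an approximation rather than the Arena's own calculation. The function name is illustrative.

```python
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Probability that model A is preferred over model B (standard Elo model)."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


# Scores copied from the with-audio table (June 24, 2026 snapshot).
scores = {
    "Dreamina Seedance 2.0 720p": 1194,
    "Grok Imagine Video 1.5 preview": 1111,
    "Veo 3.1": 1087,
}

reference = "Grok Imagine Video 1.5 preview"
for name, elo in scores.items():
    if name == reference:
        continue
    rate = expected_win_rate(scores[reference], elo)
    print(f"{reference} vs {name}: {rate:.1%} expected preference")
```

Under this model, the 24-point lead over Veo 3.1 corresponds to an expected preference rate of roughly 53%, so adjacent rows in the table are closer than the rank order alone suggests.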

A separate data point: when the preview launched on May 31, the model entered the Arena at Elo 1,404 with a ±6 confidence interval - a genuine top-of-board position that held for several weeks before Seedance 2.0 and HappyHorse built up enough votes to push higher.

Figure: Feature and capability comparison of Grok Imagine Video 1.5 vs Kling 3.0, Sora 2 Pro, and Seedance 2.0. Source: buildfastwithai.com

Sora 2 is notably absent from the Artificial Analysis with-audio board. OpenAI discontinued the Sora consumer app on April 26, 2026, and the Sora 2 API remains on a deprecated track until September 24, 2026. That removes a key competitor from the current comparison.

Key Capabilities

Aurora Engine: Autoregressive vs. Diffusion

Most video generation models use diffusion: they denoise all frames from noise simultaneously, then patch them together. Aurora instead produces frames sequentially, conditioning each new frame on all prior ones. That sequential process is what produces stable camera movements and consistent subject positioning across a full 15-second clip - two areas where diffusion models tend to show artifacts past the 5-second mark.
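The control flow of sequential generation can be illustrated with a toy loop. This is not Aurora's implementation, which xAI has not disclosed: the "frames" here are short lists of numbers, and `constant_velocity_step` is a made-up predictor that only shows how each new frame is derived from the full history of earlier ones.

```python
from typing import Callable, List

Frame = List[float]  # toy stand-in for a video frame


def generate_autoregressive(
    first_frame: Frame,
    num_frames: int,
    next_frame: Callable[[List[Frame]], Frame],
) -> List[Frame]:
    """Build a clip one frame at a time, conditioning on all prior frames."""
    frames = [first_frame]
    while len(frames) < num_frames:
        frames.append(next_frame(frames))  # the predictor sees the whole history
    return frames


def constant_velocity_step(history: List[Frame]) -> Frame:
    """Toy predictor: continue the average motion observed so far."""
    last = history[-1]
    if len(history) < 2:
        return [x + 1.0 for x in last]  # no motion observed yet, assume +1 per frame
    first = history[0]
    steps = len(history) - 1
    return [l + (l - f) / steps for l, f in zip(last, first)]


clip = generate_autoregressive([0.0, 10.0], 4, constant_velocity_step)
print(clip)
```

Because every step reads the accumulated history, motion established early carries forward into later frames, which is the property the paragraph above attributes to the autoregressive design.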

The architecture also processes audio tokens alongside video tokens in the same forward pass. Sound effects, dialogue, and ambient music sync with visual events because both modalities share latent representations during generation rather than being stitched together post-hoc. This one-pass approach is a real engineering distinction from competing models: Veo 3.1 and Kling 3.0 require separate audio pipeline steps (or charge extra for it). Grok Imagine Video 1.5 includes audio at no surcharge at every tier.

Generation Modes

Grok Imagine Video 1.5 supports five modes: text-to-video, image-to-video, reference-to-video (up to seven reference images), edit-video, and extend-video. The image-to-video mode is where the model performs best in blind evaluations. Text-to-video is available via the consumer interface on grok.com but not through the API - developers who need t2v programmatically will need to route to a different model.

The extend-video mode chains clips from the final frame, enabling sequences beyond 15 seconds. Face consistency degrades across multiple extensions, which is a known limitation documented by independent reviewers.
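A rough way to plan a longer sequence is to count how many extension passes it needs. The sketch below assumes, hypothetically, that the base clip and each extension contribute up to 15 seconds of new footage; the actual per-extension length is not stated here. The warning threshold of 5 chains is taken from the reviewer-reported range of 5-6 chains listed under Weaknesses.

```python
import math

MAX_CLIP_SECONDS = 15        # maximum clip length from the specification table
DRIFT_WARNING_CHAINS = 5     # face drift is reported after roughly 5-6 extensions


def plan_extensions(target_seconds: float, segment_seconds: float = MAX_CLIP_SECONDS) -> dict:
    """Estimate the number of extend-video passes for a target duration.

    Assumes each segment (base clip or extension) adds up to `segment_seconds`.
    """
    if target_seconds <= 0 or segment_seconds <= 0:
        raise ValueError("durations must be positive")
    segments = math.ceil(target_seconds / segment_seconds)
    extensions = segments - 1  # the first segment is the base clip, not an extension
    return {
        "segments": segments,
        "extensions": extensions,
        "face_drift_risk": extensions >= DRIFT_WARNING_CHAINS,
    }


print(plan_extensions(40))   # a 40-second sequence
print(plan_extensions(90))   # a 90-second sequence
```

Under these assumptions a 40-second sequence needs two extensions, while a 90-second sequence needs five and crosses into the range where face consistency is reported to degrade.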

The Fast Variant

Video 1.5 Fast generates a 6-second 720p clip in roughly 25 seconds, down from 40+ seconds in version 1.0. It's the default in the consumer apps and designed for real-time workflows. The standard API variant takes longer but focuses on output consistency. Both use the same underlying Aurora engine; the Fast variant is a quality-speed tradeoff, not a separate model.

Figure: API per-second pricing across resolution tiers. Source: buildfastwithai.com

Pricing and Availability

At xAI's quoted $4.20/minute for 720p, Grok Imagine Video 1.5 costs 86% less than Sora 2 Pro and 65% less than Veo 3.1, with audio included at every tier.

The API uses per-second billing with no minimum:

480p: $0.08/second ($4.80/minute)
720p: $0.14/second ($8.40/minute; xAI quotes $4.20/minute in marketing, which matches the cost of 30 seconds of output - verify against your actual usage pattern)
1080p: $0.25/second ($15.00/minute); limited to image-to-video only

Audio is included at all tiers. The rate limit is 60 requests per minute. The API is available in us-east-1, eu-west-1, and us-west-2.
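The per-second rates listed above can be turned into a small cost estimator. This is a minimal sketch that uses only the published per-second prices and the 15-second clip limit; the function names are illustrative, and it ignores any taxes, discounts, or failed-request billing rules that the pricing page may define.

```python
from decimal import Decimal

# Per-second API prices from the pricing list.
PRICE_PER_SECOND = {
    "480p": Decimal("0.08"),
    "720p": Decimal("0.14"),
    "1080p": Decimal("0.25"),  # image-to-video only
}
MAX_CLIP_SECONDS = 15


def clip_cost(resolution: str, seconds) -> Decimal:
    """Cost in USD of a single clip at the given resolution and length."""
    if resolution not in PRICE_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution}")
    duration = Decimal(str(seconds))
    if not (0 < duration <= MAX_CLIP_SECONDS):
        raise ValueError("clip length must be between 0 and 15 seconds")
    return PRICE_PER_SECOND[resolution] * duration


def per_minute_rate(resolution: str) -> Decimal:
    """Cost in USD of 60 seconds of output, spread across multiple clips."""
    return PRICE_PER_SECOND[resolution] * 60


for res in PRICE_PER_SECOND:
    print(f"{res}: 15s clip ${clip_cost(res, 15):.2f}, per minute ${per_minute_rate(res):.2f}")
```

Using `Decimal` instead of floats keeps the dollar amounts exact, which matters when per-second charges are summed over many clips.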

Consumer access comes in four tiers. A free account gets 5 credits per day on grok.com (accessible via the Imagine tab). SuperGrok Lite ($10/month) provides 480p clips of up to 6 seconds. SuperGrok ($30/month) enables 720p and 15-second clips. X Premium+ ($40/month) includes platform-integrated access.

The API model ID is grok-imagine-video-1.5. An alias, grok-imagine-video-1.5-2026-05-30, is also supported. Access is through the xAI console at console.x.ai. The xAI Python SDK handles asynchronous polling automatically; REST API users need to poll every 5 seconds using the returned request_id.
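For REST users, the polling step might look like the sketch below. It deliberately does not call the xAI API: `fetch_status` is a caller-supplied function that would perform the real HTTP request, and the `"status"` key with its `"done"` and `"failed"` values is a placeholder schema, not the documented response format. Only the 5-second interval and the `request_id` come from the text above.

```python
import time
from typing import Any, Callable, Dict

POLL_INTERVAL_SECONDS = 5  # polling interval given for REST users


def wait_for_video(
    request_id: str,
    fetch_status: Callable[[str], Dict[str, Any]],
    timeout_seconds: float = 600,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a generation job until it finishes, fails, or times out.

    `fetch_status` and the response fields are assumptions for illustration;
    replace them with the real endpoint call and schema from the xAI docs.
    """
    waited = 0.0
    while True:
        result = fetch_status(request_id)
        status = result.get("status")
        if status == "done":
            return result
        if status == "failed":
            raise RuntimeError(f"generation failed for request {request_id}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"request {request_id} still pending after {waited:.0f}s")
        sleep(POLL_INTERVAL_SECONDS)
        waited += POLL_INTERVAL_SECONDS
```

Passing `sleep` as a parameter lets the loop be exercised in tests without real delays; in production the default `time.sleep` applies.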

For broader context on AI video generation costs, our video generation benchmarks leaderboard tracks pricing across the main providers.

Strengths and Weaknesses

Strengths

#1 or top-3 on Artificial Analysis i2v arena across most measurement windows
The only major video API to include native one-pass audio at no surcharge
Aurora's autoregressive architecture produces stable motion across full 15-second clips without the typical diffusion-model jerkiness
86% cheaper than Sora 2 Pro at 720p, with Sora 2 now on a deprecated track
1080p available for image-to-video (listed in pricing, though not emphasized in consumer docs)
Strong video extension capability for sequences beyond 15 seconds
Available on iOS, Android, and web in addition to API

Weaknesses

720p ceiling for standard use - Kling 3.0 and Veo 3.1 both reach 1080p in their core offerings
No text-to-video via API - only through the consumer interface
Face consistency degrades across multiple video extension passes (typically after 5-6 chains)
24 fps only - no 60 fps for gaming, sports, or smooth product demos
Aurora creates frames sequentially, so prompt structure matters more than in diffusion models: actions described early appear early, buried actions may be skipped
Free tier (5 credits/day) is restrictive for meaningful testing

Related

Video Generation Benchmarks Leaderboard - full rankings across all major video AI models
Veo 3.1 - Google DeepMind's competing model, top-ranked in several text-to-video categories
Grok 4 - xAI's flagship language model
AI Image Generation Leaderboard - for still image generation rankings

Sources: