LTX-2.3: 22B Open-Source Video and Audio Model

LTX-2.3 is a 22-billion-parameter open-source video generation model from Lightricks that produces native 4K video with synchronized audio in a single diffusion pass.

LTX-2.3 is Lightricks' second major release in the LTX-2 series - a 22-billion-parameter diffusion transformer that produces video and synchronized audio in a single forward pass. It's been available since March 5, 2026, under the LTX-2 Community License, with weights downloadable from HuggingFace and a commercial API via fal.ai.

TL;DR

  • First open-source model to produce audio and video together in one diffusion pass - no post-processing dubbing step
  • 22B parameters, up to 4K at 50 FPS, 20-second clips; runs on consumer GPUs with FP8 quantization
  • Ranked #1 among open-weight video models on the Artificial Analysis leaderboard (Elo 1,121), 10 points ahead of Wan 2.2 (Elo 1,111)

What makes this model distinct from every prior open-source video system is the architecture: a dual-stream asymmetric diffusion transformer, with roughly 14B parameters in the video stream and 5B in the audio stream, tied together by bidirectional cross-attention and shared timestep conditioning. That design is what allows audio to be frame-locked to the video rather than stapled on afterward.
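
To make that coupling concrete, here's a minimal PyTorch sketch of the general dual-stream pattern: two self-attending streams exchanging information through bidirectional cross-attention, both modulated by one shared timestep embedding. All layer sizes, names, and wiring below are illustrative assumptions, not Lightricks' actual implementation.

```python
# Sketch of a dual-stream block with bidirectional cross-attention and shared
# timestep conditioning. Illustrative only; dimensions and wiring are assumed.
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, video_dim=2048, audio_dim=1024, heads=16, t_dim=256):
        super().__init__()
        # Each stream self-attends over its own tokens.
        self.video_self = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(audio_dim, heads, batch_first=True)
        # Projections let the asymmetric streams exchange tokens.
        self.audio_to_video = nn.Linear(audio_dim, video_dim)
        self.video_to_audio = nn.Linear(video_dim, audio_dim)
        # Bidirectional cross-attention: video attends to audio and vice versa.
        self.video_cross = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.audio_cross = nn.MultiheadAttention(audio_dim, heads, batch_first=True)
        # One shared timestep embedding conditions both streams, keeping audio
        # and video on the same denoising schedule.
        self.t_video = nn.Linear(t_dim, video_dim)
        self.t_audio = nn.Linear(t_dim, audio_dim)

    def forward(self, video, audio, t_emb):
        # video: (B, Lv, video_dim), audio: (B, La, audio_dim), t_emb: (B, t_dim)
        video = video + self.t_video(t_emb).unsqueeze(1)
        audio = audio + self.t_audio(t_emb).unsqueeze(1)
        video = video + self.video_self(video, video, video)[0]
        audio = audio + self.audio_self(audio, audio, audio)[0]
        a_as_v = self.audio_to_video(audio)
        v_as_a = self.video_to_audio(video)
        video = video + self.video_cross(video, a_as_v, a_as_v)[0]
        audio = audio + self.audio_cross(audio, v_as_a, v_as_a)[0]
        return video, audio
```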

The LTX-2 base architecture was described in the arXiv paper LTX-2: Efficient Joint Audio-Visual Foundation Model (January 2026). LTX-2.3 is a point release with a rebuilt VAE, an upgraded HiFi-GAN vocoder, a 4x larger text connector, and native portrait (9:16) generation trained on actual vertical data.

Figure: LTX-2.3 open-source announcement banner. LTX-2.3 was released March 5, 2026 as fully open weights under the LTX-2 Community License. Source: studio.aifilms.ai

Key Specifications

| Specification | Details |
|---|---|
| Provider | Lightricks |
| Model Family | LTX-2 |
| Parameters | 22B (dual-stream: ~14B video, ~5B audio) |
| Architecture | Asymmetric Diffusion Transformer (DiT) |
| Max Video Length | 20 seconds (extendable via API endpoint) |
| Max Resolution | 3840x2160 (4K native) |
| Frame Rates | 24, 25, 48, 50 FPS |
| Portrait Mode | 1080x1920 native (9:16) |
| Release Date | March 5, 2026 |
| License | LTX-2 Community License (free commercial use under $10M ARR) |
| Weights | HuggingFace: Lightricks/LTX-2.3 |

Benchmark Performance

LTX-2.3 shipped without published VBench, FVD, or EvalCrafter scores. What we have is the Artificial Analysis Text-to-Video Arena Elo (March 2026), currently the best public head-to-head ranking for video generation models.

| Model | Elo | Open Source | Native Audio | Max Resolution |
|---|---|---|---|---|
| Dreamina Seedance 2.0 | 1,273 | No | Yes | 720p |
| Kling 3.0 Pro 1080p | 1,241 | No | No | 1080p |
| Runway Gen-4.5 | 1,227 | No | No | 4K |
| Sora 2 Pro | 1,195 | No | No | 1080p |
| LTX-2 Pro (earlier) | 1,133 | Yes | Yes | 4K |
| LTX-2.3 Fast | 1,121 | Yes | Yes | 4K |
| Wan 2.2 A14B | 1,111 | Yes | No | 1080p |
| Wan 2.1 14B | 1,020 | Yes | No | 1080p |

LTX-2.3 Fast sits at Elo 1,121, and together with the earlier LTX-2 Pro (1,133) the LTX-2 family holds the top open-weight slots on the board. It's 10 Elo points ahead of Wan 2.2 and 101 ahead of Wan 2.1. The gap to proprietary leaders like Seedance 2.0 (1,273) is real - Lightricks isn't claiming parity with the closed commercial systems on overall quality. What the numbers do show is that LTX-2.3 is the strongest open-weight option, and it runs significantly faster than Wan 2.2 on equivalent hardware (10-14x faster on an RTX 4090 per community benchmarks).

The speed advantage matters for iteration. Running 5-10 second draft clips locally on a single RTX 4090 takes 1-2 minutes with LTX-2.3 compared to 12-18 minutes with Wan 2.2.

Figure: LTX-2.3 visual quality comparison, fine detail and textures. The rebuilt VAE produces sharper fabric, hair, and reflective surface detail than prior open-source video models. Source: wavespeed.ai


Key Capabilities

Native Audio-Video Generation

The architecture's defining feature is unified audio-visual synthesis. Because audio is generated within the same diffusion pass as the video - not added in a second step - a door slam lands on exactly the right frame. The model produces foley, ambient and environmental sound, and music aligned to the video content. The vocoder was upgraded in 2.3 (HiFi-GAN) for cleaner stereo output at 24 kHz.

There's also a dedicated audio-to-video endpoint: provide an audio clip, and the model produces matching video. It doesn't have frame-accurate beat alignment yet, and audio quality for non-speech sounds is lower than for speech-like content. But no other open-source model offers native audio generation at all.

High Resolution and Portrait Support

4K generation (3840x2160) at up to 50 FPS puts LTX-2.3 at or above most commercial competitors on the resolution spec. The practical ceiling for most RTX 4090 setups is 1080p without quantization - 4K generation at fp16 requires approximately 44GB VRAM. FP8 quantization brings that down to around 24GB; GGUF quantizations (Q4_K_S) push it lower still, to under 12GB, with some quality tradeoff.
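
These VRAM figures follow from simple weight-size arithmetic. A back-of-the-envelope sketch (weights only; activations, attention buffers, and the VAE add several GB on top, and the ~4.5 bits/weight figure for Q4_K_S is an approximation):

```python
# Rough weight memory by dtype for a 22B-parameter model. Weights only;
# runtime overhead explains why practical requirements run higher.
PARAMS = 22e9
for name, bytes_per_param in [
    ("BF16/FP16", 2.0),
    ("FP8", 1.0),
    ("GGUF Q4_K_S (~4.5 bits/weight, approximate)", 0.5625),
]:
    gb = PARAMS * bytes_per_param / 1024**3
    print(f"{name}: ~{gb:.0f} GB for weights alone")
# BF16/FP16: ~41 GB, FP8: ~20 GB, Q4_K_S: ~12 GB
```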

Native portrait mode (1080x1920) is new in 2.3, trained on actual vertical data rather than cropped landscape footage. For TikTok, Reels, or YouTube Shorts workflows, this matters.

Seven API Endpoints

The official fal.ai integration exposes seven separate endpoints: text-to-video (standard and fast), image-to-video (standard and fast), audio-to-video, extend-video, and retake-video. The extend-video endpoint handles clips longer than 20 seconds by chaining generation steps. The LTX Desktop app - open-source, free, Windows/NVIDIA - provides a local generation workspace with timeline tools for multi-clip projects.
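
As a rough illustration of what a call looks like, here's a sketch using fal.ai's Python client (pip install fal-client). The endpoint ID and argument names below are assumptions for illustration; check the fal.ai model page for the actual schema.

```python
# Hedged sketch of a text-to-video request through fal.ai's Python client.
import fal_client

result = fal_client.subscribe(
    "fal-ai/ltx-2.3/text-to-video",  # hypothetical endpoint ID
    arguments={
        "prompt": "a door slams shut in an empty warehouse, dust drifting",
        "resolution": "1080p",
        "duration": 10,           # seconds; a single pass tops out at 20
        "generate_audio": True,   # audio is produced in the same diffusion pass
    },
)
print(result["video"]["url"])  # assumed response shape
```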

ComfyUI v0.16.1+ has day-0 native support via custom nodes (LTXVSeparateAVLatent, LTXVAudioVAEDecode). Recommended parameters: guidance scale 3.0-3.5, 20-30 steps for iteration, 40+ for final renders.

One constraint to plan around: width and height must be divisible by 32, and frame count must follow (N × 8) + 1. These are hard requirements in the architecture, not soft recommendations.
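
A small helper makes the constraints concrete. This is assumed utility code, not part of any official SDK:

```python
# Snap a requested size and length to LTX-2.3's hard constraints:
# width/height divisible by 32, frame count of the form (N * 8) + 1.
def snap_resolution(w: int, h: int) -> tuple[int, int]:
    return (w // 32) * 32, (h // 32) * 32

def snap_frame_count(frames: int) -> int:
    n = max(1, round((frames - 1) / 8))
    return n * 8 + 1

print(snap_resolution(960, 544))   # (960, 544) -- already valid
print(snap_resolution(854, 480))   # (832, 480) -- 854 snapped down to 832
print(snap_frame_count(240))       # 241, i.e. roughly 10 s at 24 FPS
```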


Pricing and Availability

Cloud API (fal.ai)

fal.ai is the primary API provider for LTX-2.3. Pricing is per second of generated video:

| Resolution | Standard | Fast |
|---|---|---|
| 1080p | $0.06/s | $0.04/s |
| 1440p | $0.12/s | $0.08/s |
| 4K | $0.24/s | $0.16/s |

A 10-second 4K clip at standard quality costs $2.40. Audio-to-video, extend-video, and retake-video are priced at $0.10/s regardless of resolution. There's no free tier on the API.
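
For budgeting, the per-second rates reduce to a one-line multiplication. A small sketch using the table above:

```python
# Cost estimator from the published fal.ai per-second rates.
RATES = {
    ("1080p", "standard"): 0.06, ("1080p", "fast"): 0.04,
    ("1440p", "standard"): 0.12, ("1440p", "fast"): 0.08,
    ("4k", "standard"): 0.24, ("4k", "fast"): 0.16,
}

def clip_cost(seconds: float, resolution: str, tier: str = "standard") -> float:
    return seconds * RATES[(resolution, tier)]

print(f"${clip_cost(10, '4k'):.2f}")            # $2.40 -- the worked example above
print(f"${clip_cost(20, '1080p', 'fast'):.2f}") # $0.80 for a max-length fast draft
```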

Official LTX API (docs.ltx.video)

Lightricks runs their own API with pricing effective April 1, 2026, ranging from $0.06/s (1080p fast) to $0.32/s (4K pro). Before April 1, rates were lower at $0.04/s for fast 1080p.

Self-Hosting

Weights are free on HuggingFace. Model variants:

  • ltx-2.3-22b-dev - Full BF16 (42GB on disk), for fine-tuning
  • ltx-2.3-22b-distilled - 8 steps, CFG=1, faster inference
  • ltx-2.3-22b-distilled-lora-384 - LoRA adapter
  • FP8 and GGUF quantizations available from community

Spatial and temporal upscalers (x1.5, x2) are available as separate models for post-generation upscaling.

License Note

Multiple third-party platforms describe LTX-2.3 as Apache 2.0; the official HuggingFace model card says LTX-2 Community License. That license is free for commercial use by companies under $10M annual revenue. Above that threshold, verify the applicable terms directly with Lightricks before building a production product.


Strengths and Weaknesses

Strengths

  • Only open-source model with native audio-video synthesis in a single pass
  • 4K resolution at up to 50 FPS; native portrait (9:16) mode trained on real vertical data
  • 10-14x faster inference than Wan 2.2 on RTX 4090 - practical for rapid iteration
  • Runs locally on consumer hardware; GGUF quantization works under 12GB VRAM
  • Free model weights; commercial use free for sub-$10M revenue companies
  • ComfyUI day-0 support; active community (5M+ downloads; EasyCache, NVFP8 contributions)
  • LoRA fine-tuning feasible in under 1 hour

Weaknesses

  • Image-to-video instability bugs and reproducible crashes documented in current release
  • Audio quality for non-speech content (music, complex foley) is lower than speech content
  • Lip-sync for talking heads is unreliable; no frame-accurate beat alignment in audio-to-video
  • Complex physics (water dynamics, crowd interaction, multi-subject coordination) lags behind Wan 2.2 for cinematic work
  • Full BF16 model requires 32GB+ VRAM; FP8 and GGUF quantizations involve quality tradeoffs
  • Diffusers integration incomplete (model_index.json missing); tensor mismatches with LTX-2.0 assets
  • License ambiguity between fal.ai terms and Lightricks terms for commercial users


FAQ

Does LTX-2.3 require a GPU to run?

Yes. Minimum 16GB VRAM for 720p generation with FP8 quantization; 44GB for 4K at fp16. CPU-only inference isn't practical for video generation.

Is LTX-2.3 truly free for commercial use?

Free for organizations with under $10M annual revenue under the LTX-2 Community License. Larger commercial use requires an enterprise license from Lightricks.

How does LTX-2.3 compare to Wan 2.1?

LTX-2.3 is faster (10-14x on RTX 4090), supports native audio, reaches 4K, and has portrait mode. Wan 2.1 has a larger community ecosystem and stronger cinematic camera motion. For most API-based applications, LTX-2.3 is the better choice today.

Can LTX-2.3 produce videos longer than 20 seconds?

Yes, via the extend-video endpoint which chains generation steps. The base 20-second limit applies to a single generation pass.

What is the lowest-VRAM way to run LTX-2.3?

The Q4_K_S GGUF variant, which community members have run successfully at 960x544 resolution. Expect approximately 5-8% softening in fine detail versus the BF16 baseline.


Sources

✓ Last verified March 31, 2026

About the author

AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.