LTX-2.3: 22B Open-Source Video and Audio Model
LTX-2.3 is a 22-billion-parameter open-source video generation model from Lightricks that produces native 4K video with synchronized audio in a single diffusion pass.

LTX-2.3 is Lightricks' second major release in the LTX-2 series - a 22-billion-parameter diffusion transformer that produces video and synchronized audio in a single forward pass. It's been available since March 5, 2026, under the LTX-2 Community License, with weights downloadable from HuggingFace and a commercial API via fal.ai.
TL;DR
- First open-source model to produce audio and video together in one diffusion pass - no post-processing dubbing step
- 22B parameters, up to 4K at 50 FPS, 20-second clips; runs on consumer GPUs with FP8 quantization
- Ranked #1 among open-weight video models on the Artificial Analysis leaderboard (Elo 1121), 10 points ahead of Wan 2.2 (Elo 1111)
What makes this model distinct from every prior open-source video system is the architecture: a dual-stream asymmetric diffusion transformer, with roughly 14B parameters in the video stream and 5B in the audio stream, tied together by bidirectional cross-attention and shared timestep conditioning. That design is what allows audio to be frame-locked to the video rather than stapled on afterward.
The LTX-2 base architecture was described in the arXiv paper LTX-2: Efficient Joint Audio-Visual Foundation Model (January 2026). LTX-2.3 is a point release with a rebuilt VAE, an upgraded HiFi-GAN vocoder, a 4x larger text connector, and native portrait (9:16) generation trained on actual vertical data.
LTX-2.3 was released March 5, 2026 as fully open-weights under the LTX-2 Community License.
Source: studio.aifilms.ai
Key Specifications
| Specification | Details |
|---|---|
| Provider | Lightricks |
| Model Family | LTX-2 |
| Parameters | 22B (dual-stream: ~14B video, ~5B audio) |
| Architecture | Asymmetric Diffusion Transformer (DiT) |
| Max Video Length | 20 seconds (extendable via API endpoint) |
| Max Resolution | 3840x2160 (4K native) |
| Frame Rates | 24, 25, 48, 50 FPS |
| Portrait Resolution | 1080x1920 native (9:16) |
| Release Date | March 5, 2026 |
| License | LTX-2 Community License (free commercial use under $10M ARR) |
| Weights | HuggingFace: Lightricks/LTX-2.3 |
Benchmark Performance
LTX-2.3 doesn't have published VBench, FVD, or EvalCrafter scores at release. What we do have is the Artificial Analysis Text-to-Video Arena Elo (March 2026), currently the best public head-to-head ranking for video generation models.
| Model | Elo | Open Source | Native Audio | Max Resolution |
|---|---|---|---|---|
| Dreamina Seedance 2.0 | 1,273 | No | Yes | 720p |
| Kling 3.0 Pro 1080p | 1,241 | No | No | 1080p |
| Runway Gen-4.5 | 1,227 | No | No | 4K |
| Sora 2 Pro | 1,195 | No | No | 1080p |
| LTX-2 Pro (earlier, API-only) | 1,133 | No | Yes | 4K |
| LTX-2.3 Fast | 1,121 | Yes | Yes | 4K |
| Wan 2.2 A14B | 1,111 | Yes | No | 1080p |
| Wan 2.1 14B | 1,020 | Yes | No | 1080p |
LTX-2.3 Fast sits at Elo 1121, ranking #1 among open-weight video models on the board. It's 10 Elo points ahead of Wan 2.2 and 101 points ahead of Wan 2.1. The Elo gap to proprietary leaders like Seedance 2.0 (1,273) is real - Lightricks isn't claiming parity with the closed commercial systems on overall quality. What the numbers do show is that LTX-2.3 is the strongest open-weight option, and it runs significantly faster than Wan 2.2 on equivalent hardware (10-14x faster on RTX 4090 per community benchmarks).
The speed advantage matters for iteration. Running 5-10 second draft clips locally on a single RTX 4090 takes 1-2 minutes with LTX-2.3 compared to 12-18 minutes with Wan 2.2.
LTX-2.3's rebuilt VAE produces sharper fabric, hair, and reflective surface detail compared to prior open-source video models.
Source: wavespeed.ai
Key Capabilities
Native Audio-Video Generation
The architecture's defining feature is unified audio-visual synthesis. Because audio is created within the same diffusion pass as the video - not added in a second step - a door slam lands on exactly the correct frame. The model creates foley, ambient sound, environmental audio, and music aligned to video content. The vocoder was upgraded in 2.3 (HiFi-GAN) for cleaner stereo output at 24 kHz.
There's also a dedicated audio-to-video endpoint: provide an audio clip, and the model produces matching video. It doesn't have frame-accurate beat alignment yet, and audio quality for non-speech sounds is lower than for speech-like content. But no other open-source model offers native audio generation at all.
High Resolution and Portrait Support
4K generation (3840x2160) at up to 50 FPS puts LTX-2.3 at or above most commercial competitors on the resolution spec. The practical ceiling for most RTX 4090 setups is 1080p without quantization - 4K generation at fp16 requires approximately 44GB VRAM. FP8 quantization brings that down to around 24GB; GGUF quantizations (Q4_K_S) push it lower still, to under 12GB, with some quality tradeoff.
Native portrait mode (1080x1920) is new in 2.3, trained on actual vertical data rather than cropped landscape footage. For TikTok, Reels, or YouTube Shorts workflows, this matters.
Seven API Endpoints
The official fal.ai integration exposes seven separate endpoints: text-to-video (standard and fast), image-to-video (standard and fast), audio-to-video, extend-video, and retake-video. The extend-video endpoint handles clips longer than 20 seconds by chaining generation steps. The LTX Desktop app - open-source, free, Windows/NVIDIA - provides a local generation workspace with timeline tools for multi-clip projects.
ComfyUI v0.16.1+ has day-0 native support via custom nodes (LTXVSeparateAVLatent, LTXVAudioVAEDecode). Recommended parameters: guidance scale 3.0-3.5, 20-30 steps for iteration, 40+ for final renders.
One constraint to plan around: width and height must be divisible by 32, and frame count must follow (N × 8) + 1. These are hard requirements in the architecture, not soft recommendations.
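Both constraints can be checked (or auto-corrected) before submitting a job. A minimal sketch; the helper names are ours, not part of any official SDK:

```python
def snap_dimensions(width: int, height: int) -> tuple[int, int]:
    """Round width and height down to the nearest multiple of 32."""
    return (width // 32) * 32, (height // 32) * 32

def snap_frame_count(frames: int) -> int:
    """Round a frame count down to the nearest valid (N * 8) + 1 value."""
    if frames < 9:
        return 9  # smallest valid count with N >= 1
    return ((frames - 1) // 8) * 8 + 1

# 1080 is not divisible by 32, so a 1920x1080 request snaps to 1920x1056
print(snap_dimensions(1920, 1080))  # (1920, 1056)
# 5 seconds at 25 FPS asks for 125 frames; the nearest valid count is 121
print(snap_frame_count(125))        # 121
```

Snapping down rather than up keeps the output within the requested bounds; round up instead if hitting an exact duration matters more than staying under it.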
Pricing and Availability
Cloud API (fal.ai)
fal.ai is the primary API provider for LTX-2.3. Pricing is per second of generated video:
| Resolution | Standard | Fast |
|---|---|---|
| 1080p | $0.06/s | $0.04/s |
| 1440p | $0.12/s | $0.08/s |
| 4K | $0.24/s | $0.16/s |
A 10-second 4K clip at standard quality costs $2.40. Audio-to-video, extend-video, and retake-video are priced at $0.10/s regardless of resolution. There's no free tier on the API.
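Because billing is strictly per second, cost estimation is a one-line multiply. A sketch using the rate card above; the table and function names are ours:

```python
# Rates in USD per second of generated video, copied from the fal.ai
# rate card quoted above.
RATES = {
    ("1080p", "standard"): 0.06, ("1080p", "fast"): 0.04,
    ("1440p", "standard"): 0.12, ("1440p", "fast"): 0.08,
    ("4k", "standard"): 0.24, ("4k", "fast"): 0.16,
}
FLAT_RATE = 0.10  # audio-to-video, extend-video, retake-video (any resolution)

def clip_cost(seconds: float, resolution: str = "4k", tier: str = "standard") -> float:
    """Estimated cost in USD for one generated clip."""
    return round(RATES[(resolution.lower(), tier)] * seconds, 2)

print(clip_cost(10))                  # 2.4  -- the worked example above
print(clip_cost(10, "1080p", "fast")) # 0.4
```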
Official LTX API (docs.ltx.video)
Lightricks runs its own API with pricing effective April 1, 2026, ranging from $0.06/s (1080p fast) to $0.32/s (4K pro). Before April 1, the fast 1080p rate was $0.04/s.
Self-Hosting
Weights are free on HuggingFace. Model variants:
- `ltx-2.3-22b-dev` - Full BF16 (42GB on disk), for fine-tuning
- `ltx-2.3-22b-distilled` - 8 steps, CFG=1, faster inference
- `ltx-2.3-22b-distilled-lora-384` - LoRA adapter
- FP8 and GGUF quantizations available from the community
Spatial and temporal upscalers (x1.5, x2) are available as separate models for post-generation upscaling.
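Picking a variant mostly comes down to available VRAM. A hedged sketch; the thresholds are approximations pulled from the figures quoted in this article, not official requirements:

```python
# VRAM thresholds (GB) approximated from this article: ~42GB for full
# BF16, ~24GB for FP8, and Q4_K_S GGUF reportedly running on 10GB cards.
VARIANTS = [
    (42, "ltx-2.3-22b-dev (full BF16)"),
    (24, "FP8 quantization"),
    (10, "GGUF Q4_K_S"),
]

def pick_variant(vram_gb: float) -> str:
    """Return the heaviest variant that plausibly fits in the given VRAM."""
    for min_vram, name in VARIANTS:
        if vram_gb >= min_vram:
            return name
    return "insufficient VRAM for local inference"

print(pick_variant(24))  # RTX 4090 -> FP8 quantization
print(pick_variant(10))  # RTX 3080 -> GGUF Q4_K_S
```

Note the thresholds are resolution-dependent in practice: the same FP8 build that handles 4K on a 24GB card has more headroom at 1080p.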
License Note
Multiple third-party platforms describe LTX-2.3 as Apache 2.0, but the official HuggingFace model card specifies the LTX-2 Community License: free for commercial use by companies under $10M annual revenue. Above that threshold, verify the applicable terms directly with Lightricks before building a production product.
Strengths and Weaknesses
Strengths
- Only open-source model with native audio-video synthesis in a single pass
- 4K resolution at up to 50 FPS; native portrait (9:16) mode trained on real vertical data
- 10-14x faster inference than Wan 2.2 on RTX 4090 - practical for rapid iteration
- Runs locally on consumer hardware; GGUF quantization works under 12GB VRAM
- Free model weights; commercial use free for sub-$10M revenue companies
- ComfyUI day-0 support; active community (5M+ downloads; EasyCache, NVFP8 contributions)
- LoRA fine-tuning feasible in under 1 hour
Weaknesses
- Image-to-video instability bugs and reproducible crashes documented in current release
- Audio quality for non-speech content (music, complex foley) is lower than speech content
- Lip-sync for talking heads is unreliable; no frame-accurate beat alignment in audio-to-video
- Complex physics (water dynamics, crowd interaction, multi-subject coordination) fall behind Wan 2.2 for cinematic work
- Full BF16 model requires 32GB+ VRAM; FP8 and GGUF quantizations involve quality tradeoffs
- Diffusers integration incomplete (`model_index.json` missing); tensor mismatches with LTX-2.0 assets
- License ambiguity between fal.ai terms and Lightricks terms for commercial users
Related Coverage
- Our Review: LTX-2.3 Review: Open-Source Video AI That Delivers - Elena Marchetti's 8.2/10 full review
- Competing model review: Seedance 2.0 Review - the current Elo leader in video generation
- Open-source model rankings: Open-Source LLM Leaderboard - overall open-weight model standings
- Image generation context: AI Image Generation Leaderboard - for comparison with image-only models
FAQ
Does LTX-2.3 require a GPU to run?
Yes. Minimum 16GB VRAM for 720p generation with FP8 quantization; 44GB for 4K at fp16. CPU-only inference isn't practical for video generation.
Is LTX-2.3 truly free for commercial use?
Free for organizations with under $10M annual revenue under the LTX-2 Community License. Larger commercial use requires an enterprise license from Lightricks.
How does LTX-2.3 compare to Wan 2.1?
LTX-2.3 is faster (10-14x on RTX 4090), supports native audio, reaches 4K, and has portrait mode. Wan 2.1 has a larger community ecosystem and stronger cinematic camera motion. For most API-based applications, LTX-2.3 is the better choice today.
Can LTX-2.3 produce videos longer than 20 seconds?
Yes, via the extend-video endpoint which chains generation steps. The base 20-second limit applies to a single generation pass.
What is the recommended quantization for an RTX 3080 (10GB VRAM)?
The Q4_K_S GGUF variant, which community members have run successfully at 960x544 resolution. Accept approximately 5-8% softening in fine detail versus the BF16 baseline.
Sources
- LTX-2.3 Model Card - HuggingFace
- LTX-2.3 Release Announcement - ltx.io
- LTX-2 arXiv paper: Efficient Joint Audio-Visual Foundation Model
- fal.ai LTX-2.3 API and Pricing
- LTX API Official Pricing - docs.ltx.video
- Artificial Analysis Text-to-Video Leaderboard
- LTX-2.3: What's New - WaveSpeedAI
- LTX-2.3 vs Wan 2.2 - WaveSpeedAI
- ComfyUI Day-0 LTX-2.3 Support
- Running LTX-2.3 Locally - DEV Community
- LTX-2 Open Source Announcement - GlobeNewswire
✓ Last verified March 31, 2026
