LTX-2.3: 22B Open-Source Video and Audio Model

LTX-2.3 is a 22-billion-parameter open-source video generation model from Lightricks that produces native 4K video with synchronized audio in a single diffusion pass.

LTX-2.3 is Lightricks' second major release in the LTX-2 series - a 22-billion-parameter diffusion transformer that produces video and synchronized audio in a single forward pass. It's been available since March 5, 2026, under the LTX-2 Community License, with weights downloadable from HuggingFace and a commercial API via fal.ai.

TL;DR

  • First open-source model to produce audio and video together in one diffusion pass - no post-processing dubbing step
  • 22B parameters, up to 4K at 50 FPS, 20-second clips; runs on consumer GPUs with FP8 quantization
  • Ranked #1 among open-weight video models on the Artificial Analysis leaderboard (Elo 1,121), 10 points ahead of Wan 2.2 (Elo 1,111)

What makes this model distinct from every prior open-source video system is the architecture: a dual-stream asymmetric diffusion transformer, with roughly 14B parameters in the video stream and 5B in the audio stream, tied together by bidirectional cross-attention and shared timestep conditioning. That design is what allows audio to be frame-locked to the video rather than stapled on afterward.
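
To make that coupling concrete, here's a minimal PyTorch sketch of the general dual-stream pattern: two self-attending streams exchanging information through bidirectional cross-attention, both modulated by one shared timestep embedding. All layer sizes, names, and wiring below are illustrative assumptions, not Lightricks' actual implementation.

```python
# Sketch of a dual-stream block with bidirectional cross-attention and shared
# timestep conditioning. Illustrative only; dimensions and wiring are assumed.
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, video_dim=2048, audio_dim=1024, heads=16, t_dim=256):
        super().__init__()
        # Each stream self-attends over its own tokens.
        self.video_self = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(audio_dim, heads, batch_first=True)
        # Projections let the asymmetric streams exchange tokens.
        self.audio_to_video = nn.Linear(audio_dim, video_dim)
        self.video_to_audio = nn.Linear(video_dim, audio_dim)
        # Bidirectional cross-attention: video attends to audio and vice versa.
        self.video_cross = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.audio_cross = nn.MultiheadAttention(audio_dim, heads, batch_first=True)
        # One shared timestep embedding conditions both streams, keeping audio
        # and video on the same denoising schedule.
        self.t_video = nn.Linear(t_dim, video_dim)
        self.t_audio = nn.Linear(t_dim, audio_dim)

    def forward(self, video, audio, t_emb):
        # video: (B, Lv, video_dim), audio: (B, La, audio_dim), t_emb: (B, t_dim)
        video = video + self.t_video(t_emb).unsqueeze(1)
        audio = audio + self.t_audio(t_emb).unsqueeze(1)
        video = video + self.video_self(video, video, video)[0]
        audio = audio + self.audio_self(audio, audio, audio)[0]
        a_as_v = self.audio_to_video(audio)
        v_as_a = self.video_to_audio(video)
        video = video + self.video_cross(video, a_as_v, a_as_v)[0]
        audio = audio + self.audio_cross(audio, v_as_a, v_as_a)[0]
        return video, audio
```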

The LTX-2 base architecture was described in the arXiv paper LTX-2: Efficient Joint Audio-Visual Foundation Model (January 2026). LTX-2.3 is a point release with a rebuilt VAE, an upgraded HiFi-GAN vocoder, a 4x larger text connector, and native portrait (9:16) generation trained on actual vertical data.

Figure: LTX-2.3 open-source announcement banner. LTX-2.3 was released March 5, 2026 as fully open weights under the LTX-2 Community License. Source: studio.aifilms.ai

Key Specifications

| Specification | Details |
|---|---|
| Provider | Lightricks |
| Model Family | LTX-2 |
| Parameters | 22B (dual-stream: ~14B video, ~5B audio) |
| Architecture | Asymmetric Diffusion Transformer (DiT) |
| Max Video Length | 20 seconds (extendable via API endpoint) |
| Max Resolution | 3840x2160 (4K native) |
| Frame Rates | 24, 25, 48, 50 FPS |
| Portrait Mode | 1080x1920 native (9:16) |
| Release Date | March 5, 2026 |
| License | LTX-2 Community License (free commercial use under $10M ARR) |
| Weights | HuggingFace: Lightricks/LTX-2.3 |

Benchmark Performance

LTX-2.3 shipped without published VBench, FVD, or EvalCrafter scores. What we have is the Artificial Analysis Text-to-Video Arena Elo (March 2026), currently the best public head-to-head ranking for video generation models.

| Model | Elo | Open Source | Native Audio | Max Resolution |
|---|---|---|---|---|
| Dreamina Seedance 2.0 | 1,273 | No | Yes | 720p |
| Kling 3.0 Pro 1080p | 1,241 | No | No | 1080p |
| Runway Gen-4.5 | 1,227 | No | No | 4K |
| Sora 2 Pro | 1,195 | No | No | 1080p |
| LTX-2 Pro (earlier) | 1,133 | Yes | Yes | 4K |
| LTX-2.3 Fast | 1,121 | Yes | Yes | 4K |
| Wan 2.2 A14B | 1,111 | Yes | No | 1080p |
| Wan 2.1 14B | 1,020 | Yes | No | 1080p |

LTX-2.3 Fast sits at Elo 1,121, and together with the earlier LTX-2 Pro (1,133) the LTX-2 family holds the top open-weight slots on the board. It's 10 Elo points ahead of Wan 2.2 and 101 ahead of Wan 2.1. The gap to proprietary leaders like Seedance 2.0 (1,273) is real - Lightricks isn't claiming parity with the closed commercial systems on overall quality. What the numbers do show is that LTX-2.3 is the strongest open-weight option, and it runs significantly faster than Wan 2.2 on equivalent hardware (10-14x faster on an RTX 4090 per community benchmarks).

The speed advantage matters for iteration. Running 5-10 second draft clips locally on a single RTX 4090 takes 1-2 minutes with LTX-2.3 compared to 12-18 minutes with Wan 2.2.

Figure: LTX-2.3 visual quality comparison, fine detail and textures. The rebuilt VAE produces sharper fabric, hair, and reflective surface detail than prior open-source video models. Source: wavespeed.ai


Key Capabilities

Native Audio-Video Generation

The architecture's defining feature is unified audio-visual synthesis. Because audio is generated within the same diffusion pass as the video - not added in a second step - a door slam lands on exactly the right frame. The model produces foley, ambient and environmental sound, and music aligned to the video content. The vocoder was upgraded in 2.3 (HiFi-GAN) for cleaner stereo output at 24 kHz.

There's also a dedicated audio-to-video endpoint: provide an audio clip, and the model produces matching video. It doesn't have frame-accurate beat alignment yet, and audio quality for non-speech sounds is lower than for speech-like content. But no other open-source model offers native audio generation at all.

High Resolution and Portrait Support

4K generation (3840x2160) at up to 50 FPS puts LTX-2.3 at or above most commercial competitors on the resolution spec. The practical ceiling for most RTX 4090 setups is 1080p without quantization - 4K generation at fp16 requires approximately 44GB VRAM. FP8 quantization brings that down to around 24GB; GGUF quantizations (Q4_K_S) push it lower still, to under 12GB, with some quality tradeoff.
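
These VRAM figures follow from simple weight-size arithmetic. A back-of-the-envelope sketch (weights only; activations, attention buffers, and the VAE add several GB on top, and the ~4.5 bits/weight figure for Q4_K_S is an approximation):

```python
# Rough weight memory by dtype for a 22B-parameter model. Weights only;
# runtime overhead explains why practical requirements run higher.
PARAMS = 22e9
for name, bytes_per_param in [
    ("BF16/FP16", 2.0),
    ("FP8", 1.0),
    ("GGUF Q4_K_S (~4.5 bits/weight, approximate)", 0.5625),
]:
    gb = PARAMS * bytes_per_param / 1024**3
    print(f"{name}: ~{gb:.0f} GB for weights alone")
# BF16/FP16: ~41 GB, FP8: ~20 GB, Q4_K_S: ~12 GB
```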

Native portrait mode (1080x1920) is new in 2.3, trained on actual vertical data rather than cropped landscape footage. For TikTok, Reels, or YouTube Shorts workflows, this matters.

Seven API Endpoints

The official fal.ai integration exposes seven separate endpoints: text-to-video (standard and fast), image-to-video (standard and fast), audio-to-video, extend-video, and retake-video. The extend-video endpoint handles clips longer than 20 seconds by chaining generation steps. The LTX Desktop app - open-source, free, Windows/NVIDIA - provides a local generation workspace with timeline tools for multi-clip projects.
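
As a rough illustration of what a call looks like, here's a sketch using fal.ai's Python client (pip install fal-client). The endpoint ID and argument names below are assumptions for illustration; check the fal.ai model page for the actual schema.

```python
# Hedged sketch of a text-to-video request through fal.ai's Python client.
import fal_client

result = fal_client.subscribe(
    "fal-ai/ltx-2.3/text-to-video",  # hypothetical endpoint ID
    arguments={
        "prompt": "a door slams shut in an empty warehouse, dust drifting",
        "resolution": "1080p",
        "duration": 10,           # seconds; a single pass tops out at 20
        "generate_audio": True,   # audio is produced in the same diffusion pass
    },
)
print(result["video"]["url"])  # assumed response shape
```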

ComfyUI v0.16.1+ has day-0 native support via custom nodes (LTXVSeparateAVLatent, LTXVAudioVAEDecode). Recommended parameters: guidance scale 3.0-3.5, 20-30 steps for iteration, 40+ for final renders.

One constraint to plan around: width and height must be divisible by 32, and frame count must follow (N × 8) + 1. These are hard requirements in the architecture, not soft recommendations.
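
A small helper makes the constraints concrete. This is assumed utility code, not part of any official SDK:

```python
# Snap a requested size and length to LTX-2.3's hard constraints:
# width/height divisible by 32, frame count of the form (N * 8) + 1.
def snap_resolution(w: int, h: int) -> tuple[int, int]:
    return (w // 32) * 32, (h // 32) * 32

def snap_frame_count(frames: int) -> int:
    n = max(1, round((frames - 1) / 8))
    return n * 8 + 1

print(snap_resolution(960, 544))   # (960, 544) -- already valid
print(snap_resolution(854, 480))   # (832, 480) -- 854 snapped down to 832
print(snap_frame_count(240))       # 241, i.e. roughly 10 s at 24 FPS
```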


Pricing and Availability

Cloud API (fal.ai)

fal.ai is the primary API provider for LTX-2.3. Pricing is per second of generated video:

| Resolution | Standard | Fast |
|---|---|---|
| 1080p | $0.06/s | $0.04/s |
| 1440p | $0.12/s | $0.08/s |
| 4K | $0.24/s | $0.16/s |

A 10-second 4K clip at standard quality costs $2.40. Audio-to-video, extend-video, and retake-video are priced at $0.10/s regardless of resolution. There's no free tier on the API.
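
For budgeting, the per-second rates reduce to a one-line multiplication. A small sketch using the table above:

```python
# Cost estimator from the published fal.ai per-second rates.
RATES = {
    ("1080p", "standard"): 0.06, ("1080p", "fast"): 0.04,
    ("1440p", "standard"): 0.12, ("1440p", "fast"): 0.08,
    ("4k", "standard"): 0.24, ("4k", "fast"): 0.16,
}

def clip_cost(seconds: float, resolution: str, tier: str = "standard") -> float:
    return seconds * RATES[(resolution, tier)]

print(f"${clip_cost(10, '4k'):.2f}")            # $2.40 -- the worked example above
print(f"${clip_cost(20, '1080p', 'fast'):.2f}") # $0.80 for a max-length fast draft
```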

Official LTX API (docs.ltx.video)

Lightricks runs their own API with pricing effective April 1, 2026, ranging from $0.06/s (1080p fast) to $0.32/s (4K pro). Before April 1, rates were lower at $0.04/s for fast 1080p.

Self-Hosting

Weights are free on HuggingFace. Model variants:

  • ltx-2.3-22b-dev - Full BF16 (42GB on disk), for fine-tuning
  • ltx-2.3-22b-distilled - 8 steps, CFG=1, faster inference
  • ltx-2.3-22b-distilled-lora-384 - LoRA adapter
  • FP8 and GGUF quantizations available from community

Spatial and temporal upscalers (x1.5, x2) are available as separate models for post-generation upscaling.

License Note

Multiple third-party platforms describe LTX-2.3 as Apache 2.0; the official HuggingFace model card says LTX-2 Community License. That license is free for commercial use by companies under $10M annual revenue. Above that threshold, verify the applicable terms directly with Lightricks before building a production product.


Strengths and Weaknesses

Strengths

  • Only open-source model with native audio-video synthesis in a single pass
  • 4K resolution at up to 50 FPS; native portrait (9:16) mode trained on real vertical data
  • 10-14x faster inference than Wan 2.2 on RTX 4090 - practical for rapid iteration
  • Runs locally on consumer hardware; GGUF quantization works under 12GB VRAM
  • Free model weights; commercial use free for sub-$10M revenue companies
  • ComfyUI day-0 support; active community (5M+ downloads; EasyCache, NVFP8 contributions)
  • LoRA fine-tuning feasible in under 1 hour

Weaknesses

  • Image-to-video instability bugs and reproducible crashes documented in current release
  • Audio quality for non-speech content (music, complex foley) is lower than speech content
  • Lip-sync for talking heads is unreliable; no frame-accurate beat alignment in audio-to-video
  • Complex physics (water dynamics, crowd interaction, multi-subject coordination) lags behind Wan 2.2 for cinematic work
  • Full BF16 model requires 32GB+ VRAM; FP8 and GGUF quantizations involve quality tradeoffs
  • Diffusers integration incomplete (model_index.json missing); tensor mismatches with LTX-2.0 assets
  • License ambiguity between fal.ai terms and Lightricks terms for commercial users


FAQ

Does LTX-2.3 require a GPU to run?

Yes. Minimum 16GB VRAM for 720p generation with FP8 quantization; 44GB for 4K at fp16. CPU-only inference isn't practical for video generation.

Is LTX-2.3 truly free for commercial use?

Free for organizations with under $10M annual revenue under the LTX-2 Community License. Larger commercial use requires an enterprise license from Lightricks.

How does LTX-2.3 compare to Wan 2.1?

LTX-2.3 is faster (10-14x on RTX 4090), supports native audio, reaches 4K, and has portrait mode. Wan 2.1 has a larger community ecosystem and stronger cinematic camera motion. For most API-based applications, LTX-2.3 is the better choice today.

Can LTX-2.3 produce videos longer than 20 seconds?

Yes, via the extend-video endpoint which chains generation steps. The base 20-second limit applies to a single generation pass.

What is the lowest-VRAM way to run LTX-2.3?

The Q4_K_S GGUF variant, which community members have run successfully at 960x544 resolution. Expect approximately 5-8% softening in fine detail versus the BF16 baseline.


Sources

✓ Last verified March 31, 2026

About the author

AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.