LeCun's JEPA World Model Plans 47x Faster on One GPU

LeWorldModel from Yann LeCun's group strips JEPA world models down to two loss terms, trains 15M parameters on a single GPU in hours, and plans roughly 47x faster than DINO-WM.

Yann LeCun has been insisting for years that the next step in AI isn't a larger language model but a world model that predicts what happens next inside a learned latent space. The Joint-Embedding Predictive Architecture - JEPA - is his bet. It has also been fragile: prior end-to-end variants need exponential moving averages, pretrained encoders, stop-gradient tricks, and a cocktail of loss terms to avoid representation collapse. A new paper from LeCun's group at Mila, NYU and Brown argues most of that scaffolding can go.

TL;DR

  • LeWorldModel (LeWM) - a 15M-parameter JEPA world model that trains end-to-end from raw pixels on a single GPU in a few hours, with just two loss terms (prediction + Gaussian regularizer)
  • Collapsed objective - prior end-to-end JEPA designs expose six loss hyperparameters; LeWM gets the same stability with one
  • Planning speed - roughly 1 second per plan versus ~47 seconds for a DINO-WM baseline, a ~47x reduction
  • Where it shines - beats PLDM and DINO-WM on Reacher and Push-T; loses on 3D OGBench-Cube where a DINOv2 foundation encoder still helps
  • Open release - code at github.com/lucas-maes/le-wm, checkpoints on HuggingFace, project page at le-wm.github.io

The paper is arXiv 2603.19312, submitted March 13, 2026 and revised March 24. Authors: Lucas Maes and Quentin Le Lidec (equal first), Damien Scieur, Yann LeCun, and Randall Balestriero.

What a JEPA Is, In One Paragraph

A JEPA takes two views of the world - a context and a target - encodes both into compact vectors, and trains a predictor to map the context embedding forward to the target embedding. That's it. There's no pixel-level reconstruction loss, no generative decoder. The trick is that a model with that architecture will happily cheat by mapping every observation to the same constant vector. Every loss term is zero. Representation has collapsed. Most of the literature is about preventing that collapse.
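A toy sketch of the objective makes the collapse failure mode concrete. The names below are illustrative, not the paper's code: a degenerate encoder that maps every input to the same constant vector drives the prediction loss to exactly zero while representing nothing.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_constant(x):
    # A degenerate "collapsed" encoder: every observation maps to the same vector.
    return np.zeros(8)

def predictor_identity(z):
    # With a collapsed encoder, even the identity predictor is perfect.
    return z

def jepa_prediction_loss(context, target, encode, predict):
    # Core JEPA objective: predict the target embedding from the context embedding.
    z_ctx, z_tgt = encode(context), encode(target)
    return float(np.mean((predict(z_ctx) - z_tgt) ** 2))

context, target = rng.normal(size=64), rng.normal(size=64)
loss = jepa_prediction_loss(context, target, encode_constant, predictor_identity)
print(loss)  # 0.0 - a "perfect" loss achieved without learning anything
```

The prediction loss alone cannot distinguish this shortcut from a useful representation, which is why every JEPA variant needs some anti-collapse mechanism.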

The prior state of the art, I-JEPA and its successors, handle collapse with a teacher-student setup (an exponential moving average of the student acts as target network), pretrained DINOv2 encoders, or a grab bag of auxiliary regularizers. Fine when you're Meta and have thousands of H100s. Less fine when you're a PhD student with a 4090.

What LeWorldModel Actually Does

LeWM reduces the training objective to two terms. The first is the obvious one: a prediction loss pushing the predictor's output toward the target encoder's embedding. The second is SIGReg - a Gaussian-distribution regularizer that forces the latent embeddings to look approximately standard normal. The authors argue, and show empirically, that this second term is sufficient to prevent collapse on its own. No EMA teacher. No pretrained encoder. No stop-gradient. No variance-invariance-covariance acrobatics.
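The paper does not spell out SIGReg's exact form here, so the sketch below uses a simple moment-matching stand-in (penalize batch mean away from 0 and batch variance away from 1) purely to show the mechanism: a Gaussian regularizer makes the constant-vector shortcut expensive.

```python
import numpy as np

def prediction_loss(z_pred, z_tgt):
    return float(np.mean((z_pred - z_tgt) ** 2))

def gaussian_regularizer(z_batch):
    # Stand-in for SIGReg (the paper's actual regularizer may differ):
    # penalize deviation of per-dimension batch statistics from N(0, 1).
    mean_term = np.mean(z_batch.mean(axis=0) ** 2)
    var_term = np.mean((z_batch.var(axis=0) - 1.0) ** 2)
    return float(mean_term + var_term)

def two_term_loss(z_pred, z_tgt, z_batch, reg_weight=1.0):
    # Two terms, one knob: prediction + Gaussian regularization.
    return prediction_loss(z_pred, z_tgt) + reg_weight * gaussian_regularizer(z_batch)

# A collapsed batch (all embeddings identical) is heavily penalized,
# while an approximately standard-normal batch is nearly free:
collapsed = np.zeros((32, 192))
healthy = np.random.default_rng(0).normal(size=(32, 192))
print(gaussian_regularizer(collapsed), gaussian_regularizer(healthy))
```

The point of the design is visible in the last line: collapse zeroes the prediction term but maximizes the regularizer, so the combined objective has no degenerate minimum.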

The scale of the simplification matters. The paper's main comparison is against the only existing end-to-end JEPA alternative, which requires tuning six loss hyperparameters before it'll train reliably. LeWM has one. For any team trying to reproduce a result, "one knob" versus "six knobs" is the difference between an afternoon and a month.

The encoder is a small vision transformer. The predictor is an MLP. Latents are 192-dimensional - the authors note this is roughly 200x fewer dimensions than DINO-WM, which inherits DINOv2's 768-dim embeddings. Total trainable parameters: ~15M. Training fits on one consumer or workstation GPU and completes in hours.

The Planning Numbers

Once the world model is trained, it can be used for model-predictive control: imagine a sequence of actions, roll the predictor forward in latent space, evaluate the imagined outcomes against a goal embedding, pick the best sequence. Planning is where the parameter count pays off. DINO-WM needs roughly 47 seconds per plan in the authors' setup; LeWM needs roughly 1 second. The abstract quotes up to 48x; the cleaner headline number on the project page is ~47x.
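The planning loop above can be sketched as random-shooting MPC. Everything here is a hypothetical stand-in (a random linear map plays the predictor; dimensions and candidate counts are illustrative, not the paper's settings), but the structure is the standard one: sample action sequences, roll each forward in latent space, score by distance to the goal embedding, keep the best.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, ACT, HORIZON, N_CANDIDATES = 192, 4, 5, 256

# Stand-in predictor: a fixed random linear map over [latent, action].
W = rng.normal(scale=0.05, size=(LATENT + ACT, LATENT))

def predict_next(z, a):
    # One latent rollout step: next latent from current latent + action.
    return np.concatenate([z, a]) @ W

def plan(z0, z_goal):
    # Random-shooting MPC: evaluate candidate action sequences entirely
    # in latent space, never touching pixels.
    best_cost, best_seq = np.inf, None
    for _ in range(N_CANDIDATES):
        actions = rng.normal(size=(HORIZON, ACT))
        z = z0
        for a in actions:
            z = predict_next(z, a)
        cost = float(np.sum((z - z_goal) ** 2))
        if cost < best_cost:
            best_cost, best_seq = cost, actions
    return best_seq, best_cost

z0, z_goal = rng.normal(size=LATENT), rng.normal(size=LATENT)
seq, cost = plan(z0, z_goal)
print(seq.shape)  # (5, 4): the best 5-step sequence of 4-d actions
```

Per-plan wall clock is dominated by the inner rollout, which is why shrinking the predictor from a DINOv2-scale transformer to a 15M-parameter MLP translates directly into planning speed.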

That gap isn't about the prediction loss. It's about how many floating-point operations a 15M-parameter MLP predictor burns per rollout compared to a transformer predictor sitting on top of DINOv2. The world model doesn't need to know about ImageNet categories; it needs to predict the next latent given an action. LeWM strips the model down to exactly that job.

[Image: humanoid robot] Model-predictive control in latent space is the practical payoff: a 15M-parameter MLP predictor rolling 1,000 candidate action sequences forward per second is the difference between offline analysis and a humanoid making decisions in real time.

Where It Wins and Where It Doesn't

The authors evaluate on four control environments and are honest about where the approach breaks.

| Environment | Task | LeWM vs. baselines |
|---|---|---|
| Two-Room | 2D navigation | Loses - authors attribute it to low intrinsic dimensionality hurting the Gaussian regularizer |
| Reacher | 2-joint arm control | Wins - beats PLDM and DINO-WM |
| Push-T | 2D block manipulation | Wins - beats DINO-WM even without proprioceptive inputs |
| OGBench-Cube | 3D pick-and-place | Loses - DINO-WM's foundation-model pretraining still helps on richer 3D scenes |

The Two-Room loss is the sharper result for a methods paper. Forcing a Gaussian distribution on a two-dimensional navigation task is basically over-parameterizing the latent space in the wrong direction. The regularizer pushes the embeddings to fill a higher-dimensional Gaussian ball than the task requires, and the predictor has to compensate. It's a clean illustration that "stable training" and "correct inductive bias" are different problems.

The OGBench-Cube loss is the one that matters practically. Pick-and-place in a cluttered 3D scene is exactly the kind of task where a generalist encoder trained on ImageNet-scale data earns its keep. LeWM, trained from scratch on the task's own pixels, has no such prior. You can read this two ways: either LeWM is an incomplete alternative to foundation-model world models, or it is a drop-in component that can be paired with a pretrained encoder when the task warrants. The paper leans toward the latter.

Probing and Violation-of-Expectation

The authors include two sanity checks that don't always show up in world-model papers. Lightweight linear probes trained on the learned embeddings can recover physical quantities - agent location, block position, joint angles - from the raw latent vectors. That's evidence the latent space is encoding the structure of the task, not just a shortcut that happens to minimize loss.
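A linear probe is just a least-squares readout from the latent vectors to a known physical quantity. The sketch below uses synthetic data (the "latents" are built to linearly encode a 2-d agent position, which is an assumption for illustration, not the paper's data) to show what the check measures: if a linear map recovers the state with high R², the structure is present in the embedding, not in the probe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 192-d "latents" that linearly encode a 2-d agent position.
n, d_latent, d_state = 500, 192, 2
true_state = rng.normal(size=(n, d_state))
mix = rng.normal(size=(d_state, d_latent))
latents = true_state @ mix + 0.01 * rng.normal(size=(n, d_latent))

# Linear probe: least-squares readout from latent to state, no nonlinearity.
W, *_ = np.linalg.lstsq(latents, true_state, rcond=None)
pred = latents @ W
r2 = 1 - np.sum((pred - true_state) ** 2) / np.sum((true_state - true_state.mean(0)) ** 2)
print(round(r2, 3))  # near 1.0 when the state is linearly recoverable
```

Keeping the probe linear is the point of the sanity check: a deep probe could decode almost anything, so it would say little about the representation itself.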

The second check is violation-of-expectation. The model is shown a sequence of frames including physically implausible transitions - an object teleporting, an object changing color mid-sequence - and asked to produce a surprise signal (high prediction loss on the target embedding). It does. That's a property you want in a world model and one that pixel-reconstruction models routinely fail because the pixel loss averages out fine-grained structure.
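Operationally, the surprise signal is just the per-frame prediction error in latent space. A minimal sketch (illustrative names and dimensions, assuming the predictor's output and the observed frame's embedding are already in hand):

```python
import numpy as np

def surprise(z_pred, z_observed):
    # Surprise = latent prediction error; it should spike on implausible frames.
    return float(np.mean((z_pred - z_observed) ** 2))

rng = np.random.default_rng(0)
z_pred = rng.normal(size=192)

plausible = z_pred + 0.05 * rng.normal(size=192)  # next frame roughly as predicted
teleport = rng.normal(size=192)                   # object "teleports": unrelated latent

print(surprise(z_pred, plausible) < surprise(z_pred, teleport))  # True
```

Because the error is measured in embedding space rather than pixel space, a teleporting object produces a large, localized spike instead of being averaged away by a reconstruction loss.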

Why Practitioners Should Care

Three reasons this paper lands differently than the typical "we improved a JEPA benchmark" result.

Reproducibility. A 15M-parameter model trained on one GPU in hours, with one loss hyperparameter, public code, and released checkpoints, is exactly the footprint an independent researcher can copy. Prior end-to-end JEPAs weren't in that bracket.

Planning cost. If you're doing any form of model-predictive control - robotics, game AI, agent planning, any loop where an agent imagines trajectories before acting - a 47x reduction in per-plan wall clock is the difference between "offline analysis" and "real-time control." That's not a gradual speedup.

Methodology. The SIGReg framing - "force the latent distribution to a standard Gaussian and stability follows" - is small, testable, and composable. If it holds up in replications, expect it to show up in other self-supervised losses before year-end.

What It Doesn't Fix

The paper does not claim to beat foundation-model world models on every task. It doesn't solve representation collapse in general (the Two-Room failure shows the regularizer isn't universally appropriate). It doesn't address long-horizon planning beyond the short rollouts used in the benchmarks. And it doesn't show robustness to distribution shift between training trajectories and planning-time conditions - something the control literature has been burning on for years.

What it does is clear away a pile of scaffolding that had been treated as necessary for end-to-end JEPA training. That scaffolding turns out to be optional. Whether the same argument holds for text, audio, or multimodal JEPAs is the obvious next question.
