arxiv:2603.19312

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Published on Mar 13
Abstract

LeWorldModel presents a stable end-to-end JEPA framework that trains efficiently from raw pixels using only two loss terms, remains competitive on control tasks, and learns a latent space that encodes meaningful physical structure.

AI-generated summary

Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.

Community

Here are the main results from LeWorldModel (LeWM), organized by key contribution:

1. Stable End-to-End Training with Minimal Hyperparameters

The Innovation: LeWM is the first JEPA (Joint-Embedding Predictive Architecture) world model that trains stably end-to-end from raw pixels using only two loss terms:

  • A next-embedding prediction loss (MSE)
  • SIGReg: A regularizer enforcing Gaussian-distributed latent embeddings via random projections and the Epps-Pulley normality test

Why this matters: Prior work requires complex multi-term losses (PLDM uses 7 terms), exponential moving averages, stop-gradient tricks, or frozen pre-trained encoders to prevent representation collapse. LeWM eliminates these heuristics while providing provable anti-collapse guarantees.
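The two-term objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the quadrature grid, number of random projections, and λ value are assumptions, and the exact Epps-Pulley weighting and normalization used by SIGReg may differ.

```python
import torch

def epps_pulley(x, n_points=17, t_max=3.0):
    # Compare the empirical characteristic function of a 1-D sample x
    # against that of N(0, 1) on a fixed quadrature grid (illustrative sizes).
    t = torch.linspace(-t_max, t_max, n_points, device=x.device)
    tx = x[:, None] * t[None, :]                   # (batch, n_points)
    ecf_re, ecf_im = torch.cos(tx).mean(0), torch.sin(tx).mean(0)
    gauss_cf = torch.exp(-0.5 * t**2)              # CF of N(0, 1) is real
    weight = torch.exp(-0.5 * t**2)                # Gaussian weighting
    err = (ecf_re - gauss_cf) ** 2 + ecf_im**2
    return (err * weight).sum()

def sigreg(z, n_proj=64):
    # z: (batch, dim) latents. Project onto random unit directions and
    # average the per-direction normality statistic; no standardization,
    # so the gradient pushes projections toward N(0, 1) directly.
    d = torch.randn(z.shape[1], n_proj, device=z.device)
    d = d / d.norm(dim=0, keepdim=True)
    proj = z @ d                                   # (batch, n_proj)
    return torch.stack([epps_pulley(proj[:, k]) for k in range(n_proj)]).mean()

def lewm_loss(z_pred, z_next, lam=0.05):
    # Two terms only: next-embedding MSE plus lambda * SIGReg.
    return torch.nn.functional.mse_loss(z_pred, z_next) + lam * sigreg(z_next)
```

A collapsed (constant) latent batch yields a much larger SIGReg value than a Gaussian one, which is the anti-collapse mechanism in miniature.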

Figure 1: Training Pipeline
Figure 1: LeWM trains encoder and predictor jointly. SIGReg projects latent embeddings onto random directions and applies normality tests to prevent collapse without stop-gradients or EMAs.

Key comparison (Figure 2, which also scores each method on task-agnosticism, pixel-based input, and collapse guarantees):

| Method | End-to-End | Tunable Loss Hyperparameters |
| --- | --- | --- |
| PLDM | ✓ | 6 (unstable) |
| DINO-WM | ✗ (frozen encoder) | Few |
| LeWM | ✓ | 1 (λ) |

2. Planning Performance and Efficiency

LeWM achieves 48× faster planning than foundation-model-based approaches while maintaining competitive performance:

Figure 3: Efficiency and Performance
Figure 3: Left: Planning time comparison. LeWM uses ~200× fewer tokens than DINO-WM, achieving speeds comparable to PLDM while being ~50× faster than DINO-WM. Center/Right: Under fixed compute budgets, LeWM outperforms DINO-WM on Push-T and OGBench-Cube.

Quantitative Results (Figure 6):

  • Push-T: 18% higher success rate than PLDM (the only other end-to-end method); outperforms DINO-WM even when DINO-WM uses additional proprioceptive inputs
  • Reacher: Competitive or better than all baselines
  • OGBench-Cube: Slightly below DINO-WM (likely due to 3D visual complexity), but far above PLDM
  • Two-Room: Underperforms baselines (noted limitation: SIGReg's Gaussian prior may be too strong for low-dimensional simple environments)

Figure 6: Planning Performance
Figure 6: Success rates across environments. LeWM consistently outperforms PLDM and matches or exceeds DINO-WM on most tasks, except the simple Two-Room navigation task.

Computational Efficiency:

  • 15M parameters (tiny ViT encoder + predictor)
  • Trains on a single GPU in a few hours (vs. large foundation-model baselines)
  • Planning completes in under one second
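The summary does not spell out the planner itself; latent world models of this kind are commonly used with sampling-based MPC. A minimal random-shooting sketch under that assumption (the `encoder`/`predictor` interfaces, horizon, and sample count are all illustrative, not the paper's settings):

```python
import torch

@torch.no_grad()
def plan(encoder, predictor, obs, goal_obs, horizon=10, n_samples=256, act_dim=2):
    # Random-shooting planner in latent space: roll candidate action
    # sequences through the predictor and keep the sequence whose final
    # embedding lands closest to the encoded goal.
    z0 = encoder(obs)                       # (1, dim)
    zg = encoder(goal_obs)                  # (1, dim)
    actions = torch.randn(n_samples, horizon, act_dim)
    z = z0.expand(n_samples, -1)
    for t in range(horizon):
        z = predictor(z, actions[:, t])     # next-embedding prediction
    cost = (z - zg).pow(2).sum(-1)          # distance to goal in latent space
    best = cost.argmin()
    return actions[best]                    # execute, then replan (MPC-style)
```

Because rollouts happen entirely in a compact latent space, each candidate costs one predictor call per step rather than a full image generation, which is where the planning-speed advantage comes from.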

3. Physical Understanding in Latent Space

LeWM's latent space captures meaningful physical structure without explicit supervision:

Probing Results (Table 1, Figure 7):
Linear and non-linear probes trained on frozen LeWM embeddings accurately predict physical quantities (agent position, block position/velocity, end-effector pose). LeWM consistently outperforms PLDM and approaches DINOv2 (trained on 124M images) on most metrics.
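A linear probing setup of this kind can be sketched with a ridge-regularized least-squares fit on frozen embeddings; the regularization strength and the R² readout here are assumptions for illustration, not the paper's exact protocol:

```python
import numpy as np

def linear_probe_r2(z_train, y_train, z_test, y_test):
    # Fit a ridge-regularized linear map from frozen embeddings z to a
    # physical quantity y (e.g. block position), report held-out R^2.
    Z = np.hstack([z_train, np.ones((len(z_train), 1))])   # bias column
    W = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(Z.shape[1]), Z.T @ y_train)
    Zt = np.hstack([z_test, np.ones((len(z_test), 1))])
    pred = Zt @ W
    ss_res = ((y_test - pred) ** 2).sum()
    ss_tot = ((y_test - y_test.mean(0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot
```

If a quantity like agent position is linearly decodable this way, it is encoded explicitly in the latent space rather than merely recoverable by a deep readout.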

Figure 7: Predictor Rollouts
Figure 7: Open-loop latent predictions decoded back to pixels. The model accurately predicts future states (agent/block motion), confirming the latent space preserves physical dynamics.

Figure 9: Latent Visualization
Figure 9: t-SNE visualization of Push-T embeddings. The latent space preserves spatial neighborhood structure—nearby points in the 2D workspace remain nearby in latent space.

Violation-of-Expectation (Figure 10):
LeWM reliably detects physically implausible events (object teleportation) but ignores visual perturbations (color changes), demonstrating genuine physical understanding rather than superficial visual matching.
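A surprise signal of this kind is typically the discrepancy between the predicted next embedding and the embedding of the observed frame; a minimal sketch, where the squared-error metric is an assumption:

```python
import numpy as np

def surprise(encoder, predictor, prev_obs, action, obs):
    # VoE-style surprise: how far the observed frame's embedding lies
    # from the model's prediction. Spikes on physically implausible
    # events; stays flat for appearance-only changes the latent ignores.
    z_pred = predictor(encoder(prev_obs), action)
    z_obs = encoder(obs)
    return float(np.sum((z_pred - z_obs) ** 2))
```

A color change leaves the surprise low only if the encoder discards appearance details while keeping dynamics-relevant state, which is exactly what the VoE result is probing.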

Figure 10: VoE Results
Figure 10: Surprise signals spike significantly when objects teleport (physical violations), but not when object colors change (visual perturbations), across all three environments (TwoRoom, PushT, OGBench Cube).

4. Training Stability and Ablations

Training Dynamics:
Unlike PLDM's noisy, non-monotonic seven-term objective, LeWM exhibits smooth, stable convergence (Appendix I). The SIGReg term drops sharply early in training then plateaus, indicating the latent distribution quickly matches the target Gaussian.

Robustness (Appendix G):

  • Single hyperparameter: Only λ (SIGReg weight) requires tuning; performance remains high across λ ∈ [0.01, 0.2]
  • Architecture agnostic: Works with both ViT and ResNet-18 encoders
  • Low variance: Success rate variance across seeds is lower than PLDM
  • Temporal straightening emerges naturally: Latent trajectories become increasingly straight over training (higher cosine similarity between consecutive velocity vectors than PLDM, despite having no explicit temporal smoothness loss)
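The straightening metric described in the last bullet, cosine similarity between consecutive latent velocity vectors, can be computed directly (a straightforward sketch of the stated quantity):

```python
import numpy as np

def straightness(traj):
    # traj: (T, dim) latent trajectory. Mean cosine similarity between
    # consecutive velocity vectors z_{t+1} - z_t; 1.0 means the latent
    # trajectory is perfectly straight.
    v = np.diff(traj, axis=0)
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)
    return float((v[1:] * v[:-1]).sum(axis=1).mean())
```

A straight-line trajectory scores 1.0, while a random walk scores near 0, so higher values over training indicate the emergent straightening the paper reports.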

Summary

LeWM demonstrates that stable, end-to-end world modeling from pixels is possible with a principled two-term objective. It eliminates the engineering complexity of previous JEPA methods (reducing tunable loss hyperparameters from six to one), achieves 48× faster planning than foundation-model approaches, and learns latent spaces that encode genuine physical structure validated by both probing and violation-of-expectation tests.


