LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
Abstract
LeWorldModel presents a stable end-to-end JEPA framework that trains efficiently from raw pixels using minimal loss terms while maintaining competitive performance in control tasks and encoding meaningful physical structures.
Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.
Community
Here are the main results from LeWorldModel (LeWM), organized by key contribution:
1. Stable End-to-End Training with Minimal Hyperparameters
The Innovation: LeWM is the first JEPA (Joint-Embedding Predictive Architecture) world model that trains stably end-to-end from raw pixels using only two loss terms:
- A next-embedding prediction loss (MSE)
- SIGReg: A regularizer enforcing Gaussian-distributed latent embeddings via random projections and the Epps-Pulley normality test
Why this matters: Prior work requires complex multi-term losses (PLDM uses 7 terms), exponential moving averages, stop-gradient tricks, or frozen pre-trained encoders to prevent representation collapse. LeWM eliminates these heuristics while providing provable anti-collapse guarantees.
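The two-term objective can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the Epps-Pulley statistic is approximated numerically via the empirical characteristic function, and the number of projections, frequency grid, and `lam=0.05` are illustrative choices.

```python
import numpy as np

def epps_pulley_stat(x, t_max=3.0, n_t=61):
    """Numerical stand-in for the Epps-Pulley normality statistic:
    integrated squared distance between the empirical characteristic
    function of x and the CF of N(0, 1), weighted by the N(0, 1) density."""
    t = np.linspace(-t_max, t_max, n_t)
    dt = t[1] - t[0]
    ecf = np.exp(1j * np.outer(t, x)).mean(axis=1)   # empirical CF
    cf = np.exp(-t**2 / 2)                           # CF of N(0, 1)
    w = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)       # N(0, 1) pdf as weight
    return len(x) * np.sum(np.abs(ecf - cf) ** 2 * w) * dt

def sigreg(z, n_proj=16, seed=0):
    """SIGReg-style regularizer: project embeddings z (n, d) onto random
    unit directions and penalize non-Gaussianity of each 1-D projection."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((z.shape[1], n_proj))
    dirs /= np.linalg.norm(dirs, axis=0)
    proj = z @ dirs
    proj = (proj - proj.mean(0)) / (proj.std(0) + 1e-8)  # standardize
    return float(np.mean([epps_pulley_stat(proj[:, k]) for k in range(n_proj)]))

def lewm_loss(z_pred, z_next, lam=0.05):
    """Two-term objective: next-embedding MSE plus lam * SIGReg."""
    return float(np.mean((z_pred - z_next) ** 2)) + lam * sigreg(z_next)
```

Note how a collapsed batch (all embeddings identical) standardizes to all-zero projections, whose characteristic function is constant 1 and therefore far from the Gaussian CF, so the regularizer penalizes collapse directly.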

Figure 1: LeWM trains encoder and predictor jointly. SIGReg projects latent embeddings onto random directions and applies normality tests to prevent collapse without stop-gradients or EMAs.
Key comparison (Figure 2):
| Method | End-to-End | Task-Agnostic | Pixel-Based | Collapse Guarantee | Hyperparameters |
|---|---|---|---|---|---|
| PLDM | ✓ | ✓ | ✓ | ✗ | 6 (unstable) |
| DINO-WM | ✗ (frozen encoder) | ✓ | ✓ | ✓ | Few |
| LeWM | ✓ | ✓ | ✓ | ✓ | 1 (λ) |
2. Planning Performance and Efficiency
LeWM achieves 48× faster planning than foundation-model-based approaches while maintaining competitive performance:

Figure 3: Left: Planning time comparison. LeWM uses ~200× fewer tokens than DINO-WM, achieving speeds comparable to PLDM while being ~50× faster than DINO-WM. Center/Right: Under fixed compute budgets, LeWM outperforms DINO-WM on Push-T and OGBench-Cube.
Quantitative Results (Figure 6):
- Push-T: 18% higher success rate than PLDM (the only other end-to-end method); outperforms DINO-WM even when DINO-WM uses additional proprioceptive inputs
- Reacher: Competitive or better than all baselines
- OGBench-Cube: Slightly below DINO-WM (likely due to 3D visual complexity), but far above PLDM
- Two-Room: Underperforms baselines (noted limitation: SIGReg's Gaussian prior may be too strong for low-dimensional simple environments)

Figure 6: Success rates across environments. LeWM consistently outperforms PLDM and matches or exceeds DINO-WM on most tasks, except the simple Two-Room navigation task.
Computational Efficiency:
- 15M parameters (tiny ViT encoder + predictor)
- Trains on a single GPU in a few hours, in contrast to large foundation-model backbones
- Planning completes in under one second
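Sub-second planning in a compact latent space can be sketched with a generic cross-entropy-method (CEM) planner. This is an illustrative sketch, not the paper's exact planner: `encode` and `predict` stand in for the trained encoder and predictor, and all population/iteration hyperparameters are assumptions.

```python
import numpy as np

def cem_plan(encode, predict, obs, goal, horizon=8, pop=64,
             n_elites=8, iters=5, act_dim=2, seed=0):
    """CEM planning in latent space: sample action sequences, roll them out
    with the latent predictor, refit a Gaussian to the elite sequences, and
    return the first action of the converged mean (MPC-style)."""
    rng = np.random.default_rng(seed)
    z0, zg = encode(obs), encode(goal)
    mu = np.zeros((horizon, act_dim))
    sig = np.ones((horizon, act_dim))
    for _ in range(iters):
        acts = mu + sig * rng.standard_normal((pop, horizon, act_dim))
        costs = np.empty(pop)
        for i in range(pop):
            z = z0
            for t in range(horizon):
                z = predict(z, acts[i, t])      # open-loop latent rollout
            costs[i] = np.linalg.norm(z - zg)   # terminal distance to goal
        elite = acts[np.argsort(costs)[:n_elites]]
        mu, sig = elite.mean(0), elite.std(0) + 1e-6
    return mu[0]
```

Because rollouts happen entirely in a small latent space rather than in pixel or token space, each candidate sequence costs only a few predictor calls, which is what makes the large speedups over token-heavy world models plausible.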
3. Physical Understanding in Latent Space
LeWM's latent space captures meaningful physical structure without explicit supervision:
Probing Results (Table 1, Figure 7):
Linear and non-linear probes trained on frozen LeWM embeddings accurately predict physical quantities (agent position, block position/velocity, end-effector pose). LeWM consistently outperforms PLDM and approaches DINOv2 (pre-trained on 142M images) on most metrics.
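A linear probe of this kind can be sketched as closed-form ridge regression on frozen embeddings; the embeddings themselves are never updated. The regularization strength `reg` is an illustrative choice, not from the paper.

```python
import numpy as np

def fit_linear_probe(Z, y, reg=1e-3):
    """Ridge-regression probe: map frozen embeddings Z (n, d) to a physical
    quantity y (n, k), e.g. agent position, via the normal equations."""
    Z1 = np.hstack([Z, np.ones((len(Z), 1))])          # append bias column
    A = Z1.T @ Z1 + reg * np.eye(Z1.shape[1])
    return np.linalg.solve(A, Z1.T @ y)                # weights, shape (d+1, k)

def probe_r2(Z, y, W):
    """Coefficient of determination of the probe's predictions."""
    pred = np.hstack([Z, np.ones((len(Z), 1))]) @ W
    ss_res = ((y - pred) ** 2).sum()
    ss_tot = ((y - y.mean(0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot
```

A high R² from such a probe indicates the quantity is linearly decodable from the latent space, which is the sense in which "physical structure" is measured here.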

Figure 7: Open-loop latent predictions decoded back to pixels. The model accurately predicts future states (agent/block motion), confirming the latent space preserves physical dynamics.

Figure 9: t-SNE visualization of Push-T embeddings. The latent space preserves spatial neighborhood structure—nearby points in the 2D workspace remain nearby in latent space.
Violation-of-Expectation (Figure 10):
LeWM reliably detects physically implausible events (object teleportation) but ignores visual perturbations (color changes), demonstrating genuine physical understanding rather than superficial visual matching.

Figure 10: Surprise signals spike significantly when objects teleport (physical violations), but not when object colors change (visual perturbations), across all three environments (TwoRoom, PushT, OGBench Cube).
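A surprise signal of this kind is, in essence, the per-step prediction error in latent space. A minimal sketch, assuming `encode` and `predict` are the trained encoder and predictor:

```python
import numpy as np

def surprise(encode, predict, frames, actions):
    """Per-step surprise: distance between the predicted next embedding and
    the embedding of the frame actually observed. Physically implausible
    transitions (e.g. teleportation) should produce a spike."""
    scores = []
    for t in range(len(frames) - 1):
        z_pred = predict(encode(frames[t]), actions[t])
        scores.append(np.linalg.norm(z_pred - encode(frames[t + 1])))
    return np.array(scores)
```

Because the comparison happens in embedding space rather than pixel space, perturbations the encoder learns to abstract away (such as color changes) need not register as surprising, while dynamics violations do.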
4. Training Stability and Ablations
Training Dynamics:
Unlike PLDM's noisy, non-monotonic seven-term objective, LeWM exhibits smooth, stable convergence (Appendix I). The SIGReg term drops sharply early in training then plateaus, indicating the latent distribution quickly matches the target Gaussian.
Robustness (Appendix G):
- Single hyperparameter: Only λ (SIGReg weight) requires tuning; performance remains high across λ ∈ [0.01, 0.2]
- Architecture agnostic: Works with both ViT and ResNet-18 encoders
- Low variance: Success rate variance across seeds is lower than PLDM
- Temporal straightening emerges naturally: Latent trajectories become increasingly straight over training (higher cosine similarity between consecutive velocity vectors than PLDM, despite having no explicit temporal smoothness loss)
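The straightening metric mentioned above can be sketched as the mean cosine similarity between consecutive latent velocity vectors; a value of 1.0 corresponds to a perfectly straight trajectory.

```python
import numpy as np

def straightness(traj):
    """Mean cosine similarity between consecutive velocity vectors of a
    latent trajectory traj (T, d); 1.0 means perfectly straight."""
    v = np.diff(traj, axis=0)                                 # velocities
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)  # unit vectors
    return float((v[:-1] * v[1:]).sum(axis=1).mean())
```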
Summary
LeWM demonstrates that stable, end-to-end world modeling from pixels is possible with a principled two-term objective. It eliminates the engineering complexity of previous JEPA methods (reducing tunable loss hyperparameters from six to one), achieves 48× faster planning than foundation-model approaches, and learns latent spaces that encode genuine physical structure validated by both probing and violation-of-expectation tests.