arxiv:2604.08995

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Published on Apr 10 · Submitted by taesiri on Apr 13
Abstract

Matrix-Game 3.0 enhances interactive video generation through memory-augmented diffusion models that achieve real-time 720p video synthesis with long-term temporal consistency.

AI-generated summary

With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time long-form video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long-horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
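To make the camera-aware memory retrieval idea concrete, here is a minimal sketch of one plausible mechanism: a memory bank that stores past frame latents keyed by camera pose and, at each generation step, retrieves the spatially nearest past frames as extra conditioning context. All names (`CameraMemoryBank`, `write`, `retrieve`) and the distance-based retrieval rule are illustrative assumptions, not the authors' actual implementation, which the abstract does not detail.

```python
import numpy as np

class CameraMemoryBank:
    """Hypothetical sketch: store frame latents keyed by camera position and
    retrieve the top-k entries nearest to the current camera pose."""

    def __init__(self, top_k=4):
        self.poses = []      # list of (3,) camera position vectors
        self.latents = []    # list of (d,) frame latents
        self.top_k = top_k

    def write(self, pose, latent):
        # Record a generated frame's latent together with its camera pose.
        self.poses.append(np.asarray(pose, dtype=np.float64))
        self.latents.append(np.asarray(latent, dtype=np.float64))

    def retrieve(self, query_pose):
        """Return latents of the top-k past frames closest to query_pose.

        In a full model these retrieved latents would be injected as extra
        context (e.g. via cross-attention) when denoising the next segment.
        """
        if not self.poses:
            return np.zeros((0,))
        q = np.asarray(query_pose, dtype=np.float64)
        dists = np.array([np.linalg.norm(p - q) for p in self.poses])
        idx = np.argsort(dists)[: self.top_k]
        return np.stack([self.latents[i] for i in idx])

# Toy usage: three remembered frames, retrieve the two nearest to the camera.
bank = CameraMemoryBank(top_k=2)
bank.write([0.0, 0.0, 0.0], np.full(8, 0.1))
bank.write([5.0, 0.0, 0.0], np.full(8, 0.5))  # far away, should be skipped
bank.write([0.3, 0.0, 0.0], np.full(8, 0.2))
context = bank.retrieve([0.1, 0.0, 0.0])
```

The key property this sketch illustrates is that retrieval is keyed on camera geometry rather than temporal recency, which is what lets the model re-use observations of a location when the camera revisits it much later in a long sequence.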

Community

Memory-augmented diffusion with camera-aware retrieval for minute-long coherence at 720p is a neat dream, but I'm curious how robust it stays when the camera motion is erratic or when viewpoint changes are rapid. Did you test that edge case, and if so, does the system degrade gracefully if you remove memory retrieval or blunt the cross-attention cues? The arxivlens breakdown helped me parse the method details, especially the multi-segment distillation and the error buffer, but I'd love to see a focused ablation on memory behavior under high-motion scenarios. A quick result there would really help judge deployment risk for real-world robotics and open-world sims.


Get this paper in your agent:

hf papers read 2604.08995
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
