Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels
Abstract
A feedforward model called Track4World enables efficient holistic 3D tracking of every pixel in a video by utilizing a global 3D scene representation and novel 3D correlation scheme for dense flow estimation.
Estimating the 3D trajectory of every pixel from a monocular video is crucial and promising for a comprehensive understanding of the 3D dynamics of videos. Recent monocular 3D tracking works demonstrate impressive performance, but are limited to either tracking sparse points on the first frame or a slow optimization-based framework for dense tracking. In this paper, we propose a feedforward model, called Track4World, enabling an efficient holistic 3D tracking of every pixel in the world-centric coordinate system. Built on the global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate the pixel-wise 2D and 3D dense flow between arbitrary frame pairs. The estimated scene flow, along with the reconstructed 3D geometry, enables subsequent efficient 3D tracking of every pixel of this video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.
Community
Track4World estimates dense 3D scene flow of every pixel between arbitrary frame pairs from a monocular video in a global feedforward manner, enabling efficient and dense 3D tracking of every pixel in the world-centric coordinate system.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TrajVG: 3D Trajectory-Coupled Visual Geometry Learning (2026)
- MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE (2026)
- 4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere (2026)
- Flow4R: Unifying 4D Reconstruction and Tracking with Scene Flow (2026)
- Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis (2026)
- CoWTracker: Tracking by Warping instead of Correlation (2026)
- 3AM: 3egment Anything with Geometric Consistency in Videos (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Models citing this paper 1
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper