MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors
Introduction
MTPano is a robust multi-task panoramic foundation model designed to overcome geometric distortion and the scarcity of high-resolution annotations in 360° vision. By leveraging powerful perspective dense-prediction priors, MTPano establishes a unified representation for spherical scene understanding.
Key Contributions
- Label-Free Training Pipeline: Circumvents data scarcity by projecting panoramas into distortion-free perspective patches, generating high-quality pseudo-labels with foundation models (InternImage-H, MoGe-2), and re-projecting them for patch-wise supervision.
- Panoramic Dual BridgeNet (PD-BridgeNet): A dual-stream architecture that disentangles rotation-invariant features (Semantic Segmentation, Depth) from rotation-variant features (Surface Normals).
- ERP Token Mixer: A latitude-adaptive mechanism that handles Equirectangular Projection (ERP) distortion by dynamically adjusting kernels based on pixel stretching.
- Truncated Gradient Flow: Facilitates synergistic cross-task interaction while strictly blocking conflicting gradients between feature branches to avoid negative transfer.
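To make the ERP Token Mixer idea concrete: an equirectangular image stretches pixels horizontally by roughly 1/cos(latitude), so a latitude-adaptive mechanism can widen its horizontal kernel toward the poles. The sketch below is illustrative only (the function names, base width, and clamping are assumptions, not MTPano's actual implementation):

```python
import numpy as np

def erp_horizontal_stretch(height):
    """Per-row horizontal stretch factor of an equirectangular (ERP) image.

    Row v maps to latitude phi in (-pi/2, pi/2); a pixel at latitude phi
    spans 1/cos(phi) times more horizontal angle than one at the equator.
    """
    v = (np.arange(height) + 0.5) / height  # normalized row centers
    phi = (v - 0.5) * np.pi                 # latitude per row
    return 1.0 / np.cos(phi)

def adaptive_kernel_width(height, base_width=3, max_width=15):
    """Illustrative: widen an odd kernel near the poles, clamped for stability."""
    w = np.clip(np.round(base_width * erp_horizontal_stretch(height)),
                base_width, max_width)
    return (w // 2 * 2 + 1).astype(int)     # keep widths odd

widths = adaptive_kernel_width(16)
print(widths)  # narrow at the equator (middle rows), widest near the poles
```

The clamp (`max_width`) is a practical guard: the stretch factor diverges at the poles, so any real mixer must cap its receptive field.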
Training Data
MTPano is trained on a large-scale composite dataset of over 408k images, combining real-world captures with high-fidelity synthetic scenes.
Model Versions Comparison
| Dataset | 140k Weights | 408k Weights |
|---|---|---|
| Structured3D | 16.6k | 16.6k |
| Sun360 | 34.3k | 34.3k |
| Matterport3D | 7.9k | 7.9k |
| DiT360 (Synthetic) | 82k | 182k |
| Hunyuan (Synthetic) | - | 100k |
| ZInD | - | 67.4k |
| Total Images | 140k | 408k |
Performance
MTPano achieves state-of-the-art performance across all tasks on both synthetic and real-world benchmarks, consistently outperforming previous single-task specialists and multi-task models.
Structured3D (Synthetic Benchmark)
| Task | Metric | MTPano |
|---|---|---|
| Semantic Segmentation | mIoU (↑) | 75.66 |
| Depth Estimation | AbsRel (↓) | 0.0248 |
| | RMSE (↓) | 0.0968 |
| | $\delta_1$ (↑) | 99.27 |
| Surface Normal | Mean Error (↓) | 3.85° |
| | Median Error (↓) | 0.01° |
| | $<11.5^\circ$ (↑) | 91.66 |
Stanford2D3D (Real-World Benchmark)
| Task | Metric | MTPano (Ours) |
|---|---|---|
| Semantic Segmentation | mIoU (↑) | 69.47 |
| Depth Estimation | AbsRel (↓) | 0.0675 |
| | RMSE (↓) | 0.4317 |
| | $\delta_1$ (↑) | 96.86 |
| Surface Normal | Mean Error (↓) | 9.71° |
| | Median Error (↓) | 0.93° |
| | $<11.5^\circ$ (↑) | 80.65 |
Note: On the real-world Stanford2D3D dataset, MTPano (a multi-task model) achieves performance highly competitive with single-task specialist foundation models while maintaining superior cross-task consistency.
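The depth numbers above use the standard monocular-depth metrics. For readers unfamiliar with them, here is a minimal NumPy sketch of their usual definitions (standard forms, not MTPano's evaluation code; masking of invalid pixels is omitted):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth metrics: AbsRel, RMSE, and delta_1 accuracy."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    abs_rel = np.mean(np.abs(pred - gt) / gt)   # mean relative error, lower is better
    rmse = np.sqrt(np.mean((pred - gt) ** 2))   # root mean squared error, lower is better
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25) * 100.0      # % of pixels within 1.25x, higher is better
    return abs_rel, rmse, delta1

gt = np.array([1.0, 2.0, 4.0])
pred = np.array([1.1, 2.0, 3.0])
print(depth_metrics(pred, gt))
```

Surface-normal quality is reported analogously: mean/median angular error in degrees (lower is better) and the percentage of pixels with error under 11.5° (higher is better).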
Implementation
Please refer to https://github.com/Evergreen0929/MTPano for the detailed implementation.
Citation
@article{zhang2026mtpano,
title={MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors},
author={Zhang, Jingdong and Zhan, Xiaohang and Zhang, Lingzhi and Wang, Yizhou and Yu, Zhengming and Wang, Jionghao and Wang, Wenping and Li, Xin},
journal={arXiv preprint},
year={2026}
}
Acknowledgement
This work is supported by researchers from Texas A&M University and Adobe. We thank the creators of DINOv3, InternImage, and MoGe for their foundational contributions to the field.
Base Model
MTPano is built on facebook/dinov3-vit7b16-pretrain-lvd1689m (DINOv3, ViT-7B/16).