🌐 MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors

Paper · Hugging Face · Project Page · GitHub

📜 Introduction

MTPano is a robust multi-task panoramic foundation model designed to overcome geometric distortion and the scarcity of high-resolution annotations in 360° vision. By leveraging powerful perspective dense-prediction priors, MTPano establishes a unified representation for spherical scene understanding.

Key Contributions

  • Label-Free Training Pipeline: Circumvents data scarcity by projecting panoramas into distortion-free perspective patches, generating high-quality pseudo-labels with foundation models (InternImage-H, MoGe-2), and re-projecting them for patch-wise supervision.
  • Panoramic Dual BridgeNet (PD-BridgeNet): A dual-stream architecture that disentangles rotation-invariant features (Semantic Segmentation, Depth) from rotation-variant features (Surface Normals).
  • ERP Token Mixer: A latitude-adaptive mechanism that handles Equirectangular Projection (ERP) distortion by dynamically adjusting kernels based on pixel stretching.
  • Truncated Gradient Flow: Facilitates synergistic cross-task interaction while strictly blocking conflicting gradients between feature branches to avoid negative transfer.
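To make the latitude-adaptive idea behind the ERP Token Mixer concrete: in an equirectangular image, a row at latitude φ is stretched horizontally by roughly 1/cos(φ) relative to the equator. The sketch below computes that per-row stretch and derives a kernel width from it. This is an illustrative approximation only, not the paper's implementation; `latitude_adaptive_kernel_widths` and its widening rule are hypothetical.

```python
import numpy as np

def erp_stretch_factors(height: int) -> np.ndarray:
    """Per-row horizontal stretch of an equirectangular (ERP) image.

    Row i maps to a latitude phi in (-pi/2, pi/2); pixels on that row
    are stretched horizontally by ~1/cos(phi) relative to the equator.
    """
    # Latitude of each row center, from +pi/2 (top) to -pi/2 (bottom).
    phi = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    # Clamp cos(phi) away from zero to avoid blow-up at the poles.
    return 1.0 / np.maximum(np.cos(phi), 1e-6)

def latitude_adaptive_kernel_widths(height: int, base_width: int = 3,
                                    max_width: int = 15) -> np.ndarray:
    """Hypothetical rule: widen a horizontal kernel with the row's stretch."""
    widths = np.round(base_width * erp_stretch_factors(height)).astype(int)
    widths = np.clip(widths, base_width, max_width)
    # Force widths odd so the kernels stay centered on their pixel.
    return widths | 1
```

For a 3-row image the rows sit at latitudes ±60° and 0°, so the equator keeps the base 3-wide kernel while the near-polar rows roughly double their receptive field, mirroring how ERP distortion grows away from the equator.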

📊 Training Data

MTPano is trained on a large-scale composite dataset of over 408k images, combining real-world captures with high-fidelity synthetic scenes.

Model Versions Comparison

| Dataset | 140k Weights | 408k Weights |
|---|---|---|
| Structured3D | 16.6k | 16.6k |
| Sun360 | 34.3k | 34.3k |
| Matterport3D | 7.9k | 7.9k |
| DiT360 (Synthetic) | 82k | 182k |
| Hunyuan (Synthetic) | - | 100k |
| ZInD | - | 67.4k |
| **Total Images** | **140k** | **408k** |

🚀 Performance

MTPano achieves state-of-the-art performance across all tasks on both synthetic and real-world benchmarks, consistently outperforming previous single-task specialists and multi-task models.

Structured3D (Synthetic Benchmark)

| Task | Metric | MTPano |
|---|---|---|
| Semantic Segmentation | mIoU (↑) | 75.66 |
| Depth Estimation | AbsRel (↓) | 0.0248 |
| | RMSE (↓) | 0.0968 |
| | $\delta_1$ (↑) | 99.27 |
| Surface Normal | Mean Error (↓) | 3.85° |
| | Median Error (↓) | 0.01° |
| | $<11.5^\circ$ (↑) | 91.66 |

Stanford2D3D (Real-World Benchmark)

| Task | Metric | MTPano (Ours) |
|---|---|---|
| Semantic Segmentation | mIoU (↑) | 69.47 |
| Depth Estimation | AbsRel (↓) | 0.0675 |
| | RMSE (↓) | 0.4317 |
| | $\delta_1$ (↑) | 96.86 |
| Surface Normal | Mean Error (↓) | 9.71° |
| | Median Error (↓) | 0.93° |
| | $<11.5^\circ$ (↑) | 80.65 |

Note: On the real-world Stanford2D3D dataset, MTPano (a multi-task model) achieves performance highly competitive with single-task specialist foundation models while maintaining superior cross-task consistency.

πŸ› οΈ Implementation

Please refer to https://github.com/Evergreen0929/MTPano for the detailed implementation.

🎓 Citation

@article{zhang2026mtpano,
  title={MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors},
  author={Zhang, Jingdong and Zhan, Xiaohang and Zhang, Lingzhi and Wang, Yizhou and Yu, Zhengming and Wang, Jionghao and Wang, Wenping and Li, Xin},
  journal={arXiv preprint},
  year={2026}
}

πŸ‘ Acknowledgement

This work is supported by researchers from Texas A&M University and Adobe. We thank the creators of DINOv3, InternImage, and MoGe for their foundational contributions to the field.
