MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors
Introduction
MTPano is a robust multi-task panoramic foundation model designed to overcome geometric distortion and the scarcity of high-resolution annotations in 360° vision. By leveraging powerful perspective dense-prediction priors, MTPano establishes a unified representation for spherical scene understanding.
Key Contributions
- Label-Free Training Pipeline: Circumvents data scarcity by projecting panoramas into distortion-free perspective patches, generating high-quality pseudo-labels with foundation models (InternImage-H, MoGe-2), and re-projecting them for patch-wise supervision.
- Panoramic Dual BridgeNet (PD-BridgeNet): A dual-stream architecture that disentangles rotation-invariant features (Semantic Segmentation, Depth) from rotation-variant features (Surface Normals).
- ERP Token Mixer: A latitude-adaptive mechanism that handles Equirectangular Projection (ERP) distortion by dynamically adjusting kernels based on pixel stretching.
- Truncated Gradient Flow: Facilitates synergistic cross-task interaction while strictly blocking conflicting gradients between feature branches to avoid negative transfer.
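To make the ERP Token Mixer idea concrete: an equirectangular image stretches pixels horizontally by roughly 1/cos(latitude), so a latitude-adaptive mechanism can widen its horizontal kernel toward the poles. The sketch below is illustrative only (the function names, base width, and clamping are assumptions, not MTPano's actual implementation):

```python
import numpy as np

def erp_horizontal_stretch(height):
    """Per-row horizontal stretch factor of an equirectangular (ERP) image.

    Row v maps to latitude phi in (-pi/2, pi/2); a pixel at latitude phi
    spans 1/cos(phi) times more horizontal angle than one at the equator.
    """
    v = (np.arange(height) + 0.5) / height  # normalized row centers
    phi = (v - 0.5) * np.pi                 # latitude per row
    return 1.0 / np.cos(phi)

def adaptive_kernel_width(height, base_width=3, max_width=15):
    """Illustrative: widen an odd kernel near the poles, clamped for stability."""
    w = np.clip(np.round(base_width * erp_horizontal_stretch(height)),
                base_width, max_width)
    return (w // 2 * 2 + 1).astype(int)     # keep widths odd

widths = adaptive_kernel_width(16)
print(widths)  # narrow at the equator (middle rows), widest near the poles
```

The clamp (`max_width`) is a practical guard: the stretch factor diverges at the poles, so any real mixer must cap its receptive field.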
Training Data
MTPano is trained on a large-scale composite dataset of over 408k images, combining real-world captures with high-fidelity synthetic scenes.
Model Versions Comparison
| Dataset | 140k Weights | 408k Weights |
|---|---|---|
| Structured3D | 16.6k | 16.6k |
| Sun360 | 34.3k | 34.3k |
| Matterport3D | 7.9k | 7.9k |
| DiT360 (Synthetic) | 82k | 182k |
| Hunyuan (Synthetic) | - | 100k |
| ZInD | - | 67.4k |
| Total Images | 140k | 408k |
Performance
MTPano achieves state-of-the-art performance across all tasks on both synthetic and real-world benchmarks, consistently outperforming previous single-task specialists and multi-task models.
Structured3D (Synthetic Benchmark)
| Task | Metric | MTPano |
|---|---|---|
| Semantic Segmentation | mIoU (↑) | 75.66 |
| Depth Estimation | AbsRel (↓) | 0.0248 |
| | RMSE (↓) | 0.0968 |
| | $\delta_1$ (↑) | 99.27 |
| Surface Normal | Mean Error (↓) | 3.85° |
| | Median Error (↓) | 0.01° |
| | $<11.5^\circ$ (↑) | 91.66 |
Stanford2D3D (Real-World Benchmark)
| Task | Metric | MTPano (Ours) |
|---|---|---|
| Semantic Segmentation | mIoU (↑) | 69.47 |
| Depth Estimation | AbsRel (↓) | 0.0675 |
| | RMSE (↓) | 0.4317 |
| | $\delta_1$ (↑) | 96.86 |
| Surface Normal | Mean Error (↓) | 9.71° |
| | Median Error (↓) | 0.93° |
| | $<11.5^\circ$ (↑) | 80.65 |
Note: On the real-world Stanford2D3D dataset, MTPano (a multi-task model) achieves performance highly competitive with single-task specialist foundation models while maintaining superior cross-task consistency.
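The depth numbers above use the standard monocular-depth metrics. For readers unfamiliar with them, here is a minimal NumPy sketch of their usual definitions (standard forms, not MTPano's evaluation code; masking of invalid pixels is omitted):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth metrics: AbsRel, RMSE, and delta_1 accuracy."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    abs_rel = np.mean(np.abs(pred - gt) / gt)   # mean relative error, lower is better
    rmse = np.sqrt(np.mean((pred - gt) ** 2))   # root mean squared error, lower is better
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25) * 100.0      # % of pixels within 1.25x, higher is better
    return abs_rel, rmse, delta1

gt = np.array([1.0, 2.0, 4.0])
pred = np.array([1.1, 2.0, 3.0])
print(depth_metrics(pred, gt))
```

Surface-normal quality is reported analogously: mean/median angular error in degrees (lower is better) and the percentage of pixels with error under 11.5° (higher is better).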
Implementation
Please refer to https://github.com/Evergreen0929/MTPano for the detailed implementation.
Citation
@article{zhang2026mtpano,
title={MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors},
author={Zhang, Jingdong and Zhan, Xiaohang and Zhang, Lingzhi and Wang, Yizhou and Yu, Zhengming and Wang, Jionghao and Wang, Wenping and Li, Xin},
journal={arXiv preprint},
year={2026}
}
Acknowledgement
This work is supported by researchers from Texas A&M University and Adobe. We thank the creators of DINOv3, InternImage, and MoGe for their foundational contributions to the field.
Base Model
MTPano is built on facebook/dinov3-vit7b16-pretrain-lvd1689m (DINOv3, ViT-7B/16).