Papers
arxiv:2603.04291

CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

Published on Mar 4
· Submitted by Lingen Li on Mar 5
Abstract

CubeComposer is a spatio-temporal autoregressive diffusion model that generates high-resolution 360° panoramic videos by decomposing them into cubemap representations and using efficient autoregressive synthesis techniques.

AI-generated summary

Generating high-quality 360° panoramic videos from perspective input is a crucial application for virtual reality (VR), where high resolution is especially important for an immersive experience. Existing methods are constrained by the computational limitations of vanilla diffusion models, supporting only ≤1K-resolution native generation and relying on suboptimal post-hoc super-resolution to increase resolution. We introduce CubeComposer, a novel spatio-temporal autoregressive diffusion model that natively generates 4K-resolution 360° videos. By decomposing videos into six-face cubemap representations, CubeComposer autoregressively synthesizes content in a well-planned spatio-temporal order, reducing memory demands while enabling high-resolution output. Specifically, to address the challenges of multi-dimensional autoregression, we propose: (1) a spatio-temporal autoregressive strategy that orchestrates 360° video generation across cube faces and time windows for coherent synthesis; (2) a cube-face context management mechanism equipped with a sparse context attention design to improve efficiency; and (3) continuity-aware techniques, including cube-aware positional encoding, padding, and blending, to eliminate boundary seams. Extensive experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality, supporting practical VR application scenarios. Project page: https://lg-li.github.io/project/cubecomposer
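To make the abstract's "well-planned spatio-temporal order" concrete, here is a minimal sketch of a face-wise autoregressive schedule: the panorama is split into six cubemap faces and consecutive time windows, and each step produces one (window, face) pair conditioned on everything generated so far. The face names and the specific ordering are illustrative assumptions, not the paper's actual schedule.

```python
from itertools import product

# Illustrative face labels for the six cubemap faces (assumed naming).
FACES = ["front", "right", "back", "left", "top", "bottom"]

def generation_schedule(num_windows):
    """Yield (step, window, face, context), where context lists the
    (window, face) pairs already generated and available as conditioning."""
    done = []
    for step, (w, face) in enumerate(product(range(num_windows), FACES)):
        yield step, w, face, list(done)
        done.append((w, face))

# Each step only materializes one face over one time window, so peak
# memory scales with a single face rather than the full panorama.
schedule = list(generation_schedule(num_windows=2))
```

Under this sketch, two time windows yield 12 generation steps, and the context grows monotonically, which is where the sparse context attention mentioned in (2) would prune conditioning for efficiency.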

Community

Paper submitter

TL;DR: CubeComposer generates 360° video from perspective videos in a cubemap face‑wise spatio‑temporal autoregressive manner. Each step generates one face over a temporal window, which greatly reduces peak memory and enables native 2K/3K/4K 360° video generation.
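A back-of-the-envelope calculation of the memory claim, under two assumptions not stated above: "4K 360°" means a 3840×1920 equirectangular frame, and each cubemap face is sized at one quarter of the equirectangular width (the common equirect-to-cubemap convention).

```python
# Assumed frame geometry (not from the paper): 4K equirectangular panorama.
equi_w, equi_h = 3840, 1920
face = equi_w // 4                 # 960x960 pixels per cube face
full_pixels = equi_w * equi_h      # pixels in one full panorama frame
face_pixels = face * face          # pixels generated per autoregressive step
ratio = full_pixels / face_pixels  # per-step pixel reduction vs. full frame
```

Per step the model would synthesize roughly 8× fewer pixels than a full-frame diffusion pass, which is consistent with the stated reduction in peak memory.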

Project page: https://lg-li.github.io/project/cubecomposer
GitHub repo: https://github.com/TencentARC/CubeComposer
Model repo: https://huggingface.co/TencentARC/CubeComposer


Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 1