OmniForcing: Unleashing Real-time Joint Audio-Visual Generation
Abstract
OmniForcing distills a dual-stream bidirectional diffusion model into a streaming autoregressive generator while addressing training instability and synchronization issues through asymmetric alignment and specialized token mechanisms.
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher. Project page: https://omniforcing.com
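The abstract does not specify how the modality-independent rolling KV-cache is organized; below is a minimal illustrative sketch, not OmniForcing's actual implementation. It shows the core idea only: each modality maintains its own fixed-size attention window that evicts old entries independently, so a dense video stream and a sparse audio stream can roll forward at different rates. The `RollingKVCache` class name, window sizes, and string placeholders for key/value entries are all assumptions for illustration.

```python
from collections import deque

class RollingKVCache:
    """Hypothetical per-modality rolling KV-cache (illustrative only).

    Each modality gets an independent fixed-size window, so eviction in
    the dense video stream never disturbs the sparse audio stream.
    """

    def __init__(self, window_sizes):
        # window_sizes: dict mapping modality name -> max cached steps
        self.caches = {m: deque(maxlen=n) for m, n in window_sizes.items()}

    def append(self, modality, kv):
        # deque(maxlen=...) evicts the oldest entry automatically
        # once this modality's window is full.
        self.caches[modality].append(kv)

    def context(self, modality):
        # Current attention context for one modality (oldest -> newest).
        return list(self.caches[modality])

# Toy usage: video tokens arrive every step, audio tokens far more sparsely.
cache = RollingKVCache({"video": 4, "audio": 2})
for step in range(6):
    cache.append("video", f"v{step}")
    if step % 3 == 0:  # sparse audio stream
        cache.append("audio", f"a{step}")

print(cache.context("video"))  # ['v2', 'v3', 'v4', 'v5']
print(cache.context("audio"))  # ['a0', 'a3']
```

The point of the per-modality windows is that eviction policy and window length can differ across streams, which is what makes the cache "modality-independent"; the real system would store key/value tensors rather than strings.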
Community
Hi everyone! Joint audio-visual generation models like LTX-2 and Veo 3 can produce stunning synchronized video and audio from text, but they require minutes of offline processing (e.g., 197s for a 5-second clip) due to bidirectional full-sequence attention — real-time or interactive use is simply out of reach. We present OmniForcing, the first framework to enable real-time streaming for general text-to-audio-visual (T2AV) generation, by distilling a heavy bidirectional dual-stream model into a causal autoregressive engine. OmniForcing achieves ~25 FPS on a single GPU with a first-chunk latency of only ~0.7s — a ~35× speedup — while preserving both visual and acoustic fidelity on par with the teacher across nearly all dimensions on JavisBench. Unlike prior streaming works that are limited to video-only, OmniForcing jointly streams synchronized audio and video, opening the door to truly interactive multi-modal generation. Project page with playable demos: https://omniforcing.com — code and weights coming in two weeks, https://github.com/OmniForcing/OmniForcing!
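As a quick sanity check on the numbers in the post, the ~35× figure is consistent if streaming time is roughly the clip duration (generation at ~25 FPS is about real time) plus the ~0.7 s first-chunk latency. The additive latency model below is my assumption, not a stated methodology:

```python
# Back-of-the-envelope check of the claimed ~35x speedup.
# Assumption: streaming wall-clock ~ clip duration + first-chunk latency.
offline_s = 197.0              # offline bidirectional generation, 5 s clip
clip_s = 5.0                   # clip duration; ~25 FPS is ~real time
first_chunk_latency_s = 0.7    # reported first-chunk latency
streaming_s = clip_s + first_chunk_latency_s
speedup = offline_s / streaming_s
print(round(speedup))          # -> 35
```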
