OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
Abstract
To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
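The temporal reordering proxy task described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `None`-free clip representation, and the partial-credit reward shape are all assumptions made for clarity.

```python
import random

def make_jigsaw_puzzle(clips, seed=None):
    """Shuffle temporally ordered audio-visual clips and record the
    permutation needed to restore chronological order.

    `clips` is a list of (video_clip, audio_clip) pairs in true temporal
    order; the model is asked to recover that order from the shuffle.
    """
    rng = random.Random(seed)
    order = list(range(len(clips)))
    rng.shuffle(order)
    shuffled = [clips[i] for i in order]
    # Ground truth: position of each original clip within the shuffled list.
    answer = [order.index(i) for i in range(len(clips))]
    return shuffled, answer

def reordering_reward(predicted, answer):
    """A simple verifiable reward for RL post-training: the fraction of
    clips placed at their correct chronological position (an illustrative
    choice; an exact-match 0/1 reward is another common option)."""
    correct = sum(p == a for p, a in zip(predicted, answer))
    return correct / len(answer)
```

Because the ground-truth ordering comes for free from the raw video, this reward needs no human annotation, which is what makes the proxy task scalable to unlabeled data.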
Community
We introduce OmniJigsaw, a self-supervised RL post-training framework for omni-modal models. The core idea is a temporal jigsaw proxy task: reconstruct chronology from shuffled audio–visual clips, with three modality-orchestration strategies (JMI / SMS / CMM) to encourage real cross-modal integration. We also analyze a bi-modal shortcut under full multimodal cues and show that clip-level modality masking (CMM) helps mitigate it. Strong gains across 15 video / audio / omni-modal reasoning benchmarks.
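The clip-level modality masking (CMM) idea can be illustrated with a short sketch: by dropping one modality per clip, no single stream suffices to solve the puzzle, which counteracts the bi-modal shortcut of leaning on whichever modality is easier. The masking probability, the per-clip either/or policy, and the `None` placeholder are illustrative assumptions, not the paper's exact design.

```python
import random

def clip_level_modality_mask(clips, video_keep_prob=0.5, seed=None):
    """For each (video, audio) clip, randomly keep exactly one modality.

    A dropped modality is replaced with None here as a stand-in; a real
    pipeline might instead use a learned mask token, black frames, or
    silence. Masking per clip (rather than per sample) forces the model
    to stitch together evidence across modalities over time.
    """
    rng = random.Random(seed)
    masked = []
    for video, audio in clips:
        if rng.random() < video_keep_prob:
            masked.append((video, None))   # keep video, mask audio
        else:
            masked.append((None, audio))   # keep audio, mask video
    return masked
```

In contrast, sample-level modality selection would apply one such choice to the whole sequence, which is the coarser strategy CMM is reported to outperform.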
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs (2026)
- MAPLE: Modality-Aware Post-training and Learning Ecosystem (2026)
- STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering (2026)
- OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning (2026)
- MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos (2026)
- Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs (2026)
- DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing (2026)