Abstract
NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising.
Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.
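The align-then-fuse idea described in the abstract can be illustrated with a toy NumPy sketch. This is not the NAVA implementation: the hidden width, block counts, random weights, single-head attention, and the use of cross-attention to inject context are all assumptions made for illustration, and names such as `align_block` and `fuse_block` are hypothetical. It only shows the transition from modality-specific projections with joint audio-video attention to shared weights conditioned on external context.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical hidden width, not taken from the paper


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Plain single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def make_proj():
    # Random q/k/v projections standing in for learned weights (assumption).
    return {n: rng.normal(scale=D ** -0.5, size=(D, D)) for n in ("q", "k", "v")}


def align_block(video, audio, w_video, w_audio):
    """Modality-aware stage: separate projections per modality,
    one joint attention over the combined audio and video tokens."""
    n_v = video.shape[0]
    q = np.concatenate([video @ w_video["q"], audio @ w_audio["q"]])
    k = np.concatenate([video @ w_video["k"], audio @ w_audio["k"]])
    v = np.concatenate([video @ w_video["v"], audio @ w_audio["v"]])
    out = attention(q, k, v)
    return video + out[:n_v], audio + out[n_v:]


def fuse_block(tokens, context, w_self, w_ctx):
    """Modality-shared stage: one weight set for all tokens, followed by
    cross-attention to the context (the conditioning mechanism is assumed)."""
    tokens = tokens + attention(
        tokens @ w_self["q"], tokens @ w_self["k"], tokens @ w_self["v"]
    )
    tokens = tokens + attention(
        tokens @ w_ctx["q"], context @ w_ctx["k"], context @ w_ctx["v"]
    )
    return tokens


def align_then_fuse(video, audio, context, n_align=2, n_fuse=2):
    n_v = video.shape[0]
    for _ in range(n_align):
        video, audio = align_block(video, audio, make_proj(), make_proj())
    tokens = np.concatenate([video, audio])
    for _ in range(n_fuse):
        tokens = fuse_block(tokens, context, make_proj(), make_proj())
    return tokens[:n_v], tokens[n_v:]


video = rng.normal(size=(8, D))    # 8 toy video tokens
audio = rng.normal(size=(12, D))   # 12 toy audio tokens
context = rng.normal(size=(5, D))  # 5 toy context tokens
video_out, audio_out = align_then_fuse(video, audio, context)
print(video_out.shape, audio_out.shape)
```

In this sketch the context tokens never enter the alignment blocks, which mirrors the abstract's separation of low-level synchronization from semantic conditioning; how the actual model schedules the two stages is not specified here.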
Community
NAVA is a 6.3B-parameter Native Audio-Visual Alignment framework for joint audio-video generation. To overcome the limitations of existing dual-tower and unified paradigms, NAVA employs an Align-then-Fuse MMDiT architecture that first establishes fine-grained audio-video correspondence before applying textual context. Furthermore, it introduces Timbre-in-Context Conditioning for highly controllable speech generation. Experiments show NAVA achieves superior A-V synchronization, robust video quality, and enhanced reference-timbre controllability on Verse-Bench and Seed-TTS.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation (2026)
- SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing (2026)
- Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation (2026)
- MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation (2026)
- Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling (2026)
- Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation (2026)
- StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration (2026)
Models citing this paper 2
robingg1/NAVA