SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving
Abstract
SoundWeaver accelerates text-to-audio diffusion generation by warm-starting from semantically similar cached audio and dynamically skipping function evaluations, achieving 1.8–3.0× latency reduction with minimal quality loss.
Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver achieves 1.8–3.0× latency reduction with a cache of only ~1K entries while preserving or improving perceptual quality.
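The Reference Selector and Skip Gater described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the similarity threshold, duration tolerance, linear skip ramp, and all function names are assumptions made for the example.

```python
import math

# Illustrative thresholds; the paper's actual gating values are not given here.
SIM_THRESHOLD = 0.85   # assumed semantic-similarity gate for a cache hit
DUR_TOLERANCE = 0.2    # assumed relative duration mismatch allowed
MAX_SKIP = 0.9         # assumed cap on the fraction of NFEs skipped

def cosine(a, b):
    """Cosine similarity between two prompt embeddings (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_reference(query_emb, query_dur, cache):
    """Reference Selector sketch: return the most semantically similar
    cached entry that also passes the duration-aware gate, or None."""
    best, best_sim = None, 0.0
    for entry in cache:
        # Duration-aware gate: reject candidates too far from the target length.
        if abs(entry["duration"] - query_dur) / max(query_dur, 1e-9) > DUR_TOLERANCE:
            continue
        sim = cosine(query_emb, entry["embedding"])
        if sim > best_sim:
            best, best_sim = entry, sim
    if best_sim < SIM_THRESHOLD:
        return None, best_sim  # cache miss: run full denoising from noise
    return best, best_sim

def skip_fraction(similarity):
    """Skip Gater sketch: map similarity above the gate to a fraction of
    NFEs to skip (a linear ramp is an assumption, not the paper's rule)."""
    ramp = max(0.0, (similarity - SIM_THRESHOLD) / (1.0 - SIM_THRESHOLD))
    return min(MAX_SKIP, ramp * MAX_SKIP)
```

On a hit, the server would initialize the sampler from the cached entry's partially noised latent and run only the remaining `(1 - skip_fraction)` share of steps; on a miss, it falls back to full generation and the Cache Manager may admit the new result.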
Community
Tired of multi-second waits for stunning AI audio? We introduce SoundWeaver, the first training-free, model-agnostic serving system that revolutionizes text-to-audio diffusion by semantically warm-starting from a tiny cache of similar audio clips! With just ~1K cached entries, it delivers a massive 1.8–3.0× latency reduction while actually improving perceptual quality! It is also the first text-to-audio paper to supplement quality analysis with a carefully crafted LLM-as-judge evaluation scheme (prompt available in the paper)!
Hello everyone!! Please have a read, and let me know if you wish to access the code. We see amazing results with little overhead, and it is very easy to integrate into your workflow. Warm-starting really deserves more exploration in the diffusion audio space.
Furthermore, we are the first paper to use LLM-as-judge for text-to-audio evaluation. I highly recommend using it as a supplementary metric alongside the usual CLAP, FD, KL, etc. Feel free to use our prompt!
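For readers wanting to try an LLM-as-judge metric alongside CLAP/FD/KL, a scoring harness could look like the sketch below. The prompt template, the caption-then-judge setup, and both helper names are hypothetical illustrations; the paper's actual prompt is the one to use.

```python
import re

# Hypothetical judge prompt: the paper provides its own, more carefully
# crafted prompt. This sketch assumes the audio is first described by a
# captioning model, so a text-only LLM can act as the judge.
JUDGE_TEMPLATE = (
    "You are rating how well a generated audio clip matches its text prompt.\n"
    "Text prompt: {prompt}\n"
    "Audio caption (from a captioning model): {caption}\n"
    "Rate semantic match and fidelity from 1 to 5. Reply with a single integer."
)

def build_judge_prompt(prompt, caption):
    """Fill the judge template for one (prompt, generated-audio) pair."""
    return JUDGE_TEMPLATE.format(prompt=prompt, caption=caption)

def parse_score(reply, lo=1, hi=5):
    """Extract the first integer from the judge's reply, clamped to the
    rating range; return None if the reply contains no number."""
    m = re.search(r"\d+", reply)
    if not m:
        return None
    return max(lo, min(hi, int(m.group())))
```

Averaging `parse_score` over a prompt set gives one supplementary number; it does not replace distributional metrics like FD or KL, which capture different failure modes.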
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CHAI: CacHe Attention Inference for text2video (2026)
- Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention (2026)
- Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation (2026)
- SemanticAudio: Audio Generation and Editing in Semantic Space (2026)
- SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models (2026)
- From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation (2026)
- TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment (2026)
