arxiv:2603.07865

SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving

Published on Mar 9 · Submitted by Ayush Barik on Mar 13

Abstract

SoundWeaver accelerates text-to-audio diffusion generation by caching semantically similar audio and dynamically skipping function evaluations, achieving significant latency reduction with minimal quality loss.

AI-generated summary

Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver achieves a 1.8–3.0× latency reduction with a cache of only ~1K entries while preserving or improving perceptual quality.
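To make the three components concrete, here is a minimal Python sketch of how retrieval, skip gating, and eviction could fit together. It is not the authors' implementation. The names (`CacheEntry`, `select_reference`, `steps_to_run`, `keep_best`), the cosine-similarity threshold, the duration-ratio gate, the linear mapping from similarity to skip fraction, and the quality score are all assumptions made for illustration; the abstract does not specify any of them, and temporal alignment and refinement are left out.

```python
import math
from dataclasses import dataclass


@dataclass
class CacheEntry:
    prompt_embedding: list  # hypothetical text embedding of the cached prompt
    duration_s: float       # cached clip length in seconds (assumed > 0)
    quality: float          # hypothetical quality score, higher is better


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_reference(query_emb, query_duration_s, cache,
                     min_sim=0.6, max_duration_ratio=1.5):
    """Return (entry, similarity) for the best gated match, or (None, 0.0).

    Assumed gating: drop candidates whose duration differs too much from the
    request, then keep the most similar one above the min_sim threshold.
    """
    best, best_sim = None, 0.0
    for entry in cache:
        longer = max(entry.duration_s, query_duration_s)
        shorter = min(entry.duration_s, query_duration_s)
        if shorter <= 0 or longer / shorter > max_duration_ratio:
            continue  # duration-aware gate
        sim = cosine(query_emb, entry.prompt_embedding)
        if sim >= min_sim and sim > best_sim:  # semantic gate
            best, best_sim = entry, sim
    return best, best_sim


def steps_to_run(total_nfes, similarity, min_sim=0.6, max_skip=0.6):
    """Map similarity to the number of NFEs still executed.

    Assumed rule: skip nothing below min_sim (which must be < 1), then scale
    the skipped fraction linearly up to max_skip at similarity 1.0.
    """
    if similarity < min_sim:
        return total_nfes
    skip_frac = max_skip * (similarity - min_sim) / (1.0 - min_sim)
    return total_nfes - round(total_nfes * skip_frac)


def keep_best(cache, capacity):
    """Assumed quality-aware eviction: retain the highest-quality entries."""
    return sorted(cache, key=lambda e: e.quality, reverse=True)[:capacity]
```

In a serving loop under these assumptions, a request with no gated match would run the full schedule, while a close match would start from a noised copy of the cached clip and run only the steps returned by `steps_to_run`.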

Community

Paper author Paper submitter

Tired of multi-second waits for AI-generated audio? We introduce SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by semantically warm-starting from a small cache of similar audio clips. With just ~1K cached entries, it delivers a 1.8–3.0× latency reduction while preserving or improving perceptual quality. It is also the first text-to-audio paper to supplement quality analysis with a carefully crafted LLM-as-judge evaluation scheme (the prompt is available in the paper)!

Paper author Paper submitter

Hello everyone! Please have a read, and let me know if you would like access to the code. We see strong results with little overhead, and the system is very easy to integrate into your workflow. Warm-starting deserves more exploration in the audio diffusion space.

Furthermore, we are the first paper to use LLM-as-judge for text-to-audio. I highly recommend it as a supplementary metric alongside the usual CLAP, FD, KL, etc. Feel free to use our prompt!

