SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving
Abstract
SoundWeaver accelerates text-to-audio diffusion generation by warm-starting from semantically similar cached audio and dynamically skipping function evaluations, achieving 1.8–3.0× latency reduction with minimal quality loss.
Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver achieves 1.8–3.0× latency reduction with a cache of only ~1K entries while preserving or improving perceptual quality.
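The Reference Selector and Skip Gater described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the similarity threshold, duration tolerance, linear skip ramp, and all function names are assumptions made for the example.

```python
import math

# Illustrative thresholds; the paper's actual gating values are not given here.
SIM_THRESHOLD = 0.85   # assumed semantic-similarity gate for a cache hit
DUR_TOLERANCE = 0.2    # assumed relative duration mismatch allowed
MAX_SKIP = 0.9         # assumed cap on the fraction of NFEs skipped

def cosine(a, b):
    """Cosine similarity between two prompt embeddings (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_reference(query_emb, query_dur, cache):
    """Reference Selector sketch: return the most semantically similar
    cached entry that also passes the duration-aware gate, or None."""
    best, best_sim = None, 0.0
    for entry in cache:
        # Duration-aware gate: reject candidates too far from the target length.
        if abs(entry["duration"] - query_dur) / max(query_dur, 1e-9) > DUR_TOLERANCE:
            continue
        sim = cosine(query_emb, entry["embedding"])
        if sim > best_sim:
            best, best_sim = entry, sim
    if best_sim < SIM_THRESHOLD:
        return None, best_sim  # cache miss: run full denoising from noise
    return best, best_sim

def skip_fraction(similarity):
    """Skip Gater sketch: map similarity above the gate to a fraction of
    NFEs to skip (a linear ramp is an assumption, not the paper's rule)."""
    ramp = max(0.0, (similarity - SIM_THRESHOLD) / (1.0 - SIM_THRESHOLD))
    return min(MAX_SKIP, ramp * MAX_SKIP)
```

On a hit, the server would initialize the sampler from the cached entry's partially noised latent and run only the remaining `(1 - skip_fraction)` share of steps; on a miss, it falls back to full generation and the Cache Manager may admit the new result.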
Community
Tired of multi-second waits for stunning AI audio? We introduce SoundWeaver, the first training-free, model-agnostic serving system that revolutionizes text-to-audio diffusion by semantically warm-starting from a tiny cache of similar audio clips! With just ~1K cached entries, it delivers a massive 1.8–3.0× latency reduction while actually improving perceptual quality! It is also the first text-to-audio paper to supplement quality analysis with a carefully crafted LLM-as-judge evaluation scheme (prompt available in the paper)!
Hello everyone!! Please have a read, and let me know if you wish to access the code. We see amazing results with little overhead, and it is very easy to integrate into your workflow. Warm-starting really deserves more exploration in the diffusion audio space.
Furthermore, we are the first paper to use LLM-as-judge for text-to-audio evaluation. I highly recommend using it as a supplementary metric alongside the usual CLAP, FD, KL, etc. Feel free to use our prompt!
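For readers wanting to try an LLM-as-judge metric alongside CLAP/FD/KL, a scoring harness could look like the sketch below. The prompt template, the caption-then-judge setup, and both helper names are hypothetical illustrations; the paper's actual prompt is the one to use.

```python
import re

# Hypothetical judge prompt: the paper provides its own, more carefully
# crafted prompt. This sketch assumes the audio is first described by a
# captioning model, so a text-only LLM can act as the judge.
JUDGE_TEMPLATE = (
    "You are rating how well a generated audio clip matches its text prompt.\n"
    "Text prompt: {prompt}\n"
    "Audio caption (from a captioning model): {caption}\n"
    "Rate semantic match and fidelity from 1 to 5. Reply with a single integer."
)

def build_judge_prompt(prompt, caption):
    """Fill the judge template for one (prompt, generated-audio) pair."""
    return JUDGE_TEMPLATE.format(prompt=prompt, caption=caption)

def parse_score(reply, lo=1, hi=5):
    """Extract the first integer from the judge's reply, clamped to the
    rating range; return None if the reply contains no number."""
    m = re.search(r"\d+", reply)
    if not m:
        return None
    return max(lo, min(hi, int(m.group())))
```

Averaging `parse_score` over a prompt set gives one supplementary number; it does not replace distributional metrics like FD or KL, which capture different failure modes.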
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CHAI: CacHe Attention Inference for text2video (2026)
- Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention (2026)
- Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation (2026)
- SemanticAudio: Audio Generation and Editing in Semantic Space (2026)
- SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models (2026)
- From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation (2026)
- TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment (2026)
