MOLM-Audio: SPDMark-Style Segment-Wise Audio Watermarking
LoRA-routing audio watermarks for three generators (HiFi-GAN, VibeVoice acoustic decoder, DiffWave). Following SPDMark (https://arxiv.org/abs/2512.12090), each audio clip is split into S=8 segments, and each segment carries an HMAC-derived M-bit message embedded via a parallel LoRA "basis dictionary." The verifier matches the recovered per-segment bits against the expected HMAC sequence using Hungarian assignment followed by a binomial hypothesis test.
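To make the per-segment message idea concrete, here is a minimal sketch of deriving one M-bit message per segment from a base key with HMAC-SHA256. It is not the repo's `sample_training_keys` code: the function name `segment_messages`, the use of the segment index as the HMAC input, and the choice of taking the top M bits of the digest are all assumptions made for illustration.

```python
import hashlib
import hmac


def segment_messages(key: bytes, num_segments: int = 8, bits_per_segment: int = 14):
    """Derive one M-bit message per segment index (illustrative, not the repo's scheme)."""
    messages = []
    for seg in range(num_segments):
        # Assumption: the HMAC input is the segment index as 4 big-endian bytes.
        digest = hmac.new(key, seg.to_bytes(4, "big"), hashlib.sha256).digest()
        # Assumption: keep the top M bits of the 256-bit digest.
        value = int.from_bytes(digest, "big") >> (256 - bits_per_segment)
        # Expand into an MSB-first list of 0/1 bits.
        bits = [(value >> (bits_per_segment - 1 - i)) & 1 for i in range(bits_per_segment)]
        messages.append(bits)
    return messages


key = bytes(range(16))  # example 128-bit base key
msgs = segment_messages(key)
for seg, bits in enumerate(msgs):
    print(seg, "".join(map(str, bits)))
```

Because the messages are a deterministic function of the key, a verifier holding the same key can regenerate the expected sequence without storing it.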
Repo layout
checkpoints/
diffwave_spdmark_spec/final/{lora_weights.pt, extractor.pt, diffwave_full.pt}
diffwave_v5_2step/final/{lora_weights.pt, extractor.pt, diffwave_full.pt}
hifigan_spdmark_spec/final/{lora_weights.pt, extractor.pt, hifigan_full.pt}
vibevoice_spdmark_spec/final/{lora_weights.pt, extractor.pt, model_full.pt}
vibevoice_14bit_v2/final/{lora_weights.pt, extractor.pt, model_full.pt}
code/ # training + inference + smoke-test scripts (clone or download)
eval/<run>_<regime>/results.json
README.md
lora_weights.pt plus the base model (from torchaudio or pretrained/) is sufficient
for inference. *_full.pt is provided for one-step loading where the base model
isn't available locally.
Training/test wavs live in MOLM-Audio/molm-audio-data under data/.
Setup
# 1. Clone or download this repo.
hf download MOLM-Audio/molm-audio --local-dir molm-audio
# 2. Drop the code/ contents into a Python env.
cd molm-audio/code
pip install -r requirements.txt
# 3. For DiffWave: also place the LJSpeech base checkpoint at
# pretrained/diffwave-ljspeech.pt (download from the upstream DiffWave repo).
# 4. For VibeVoice: extract decoder + diffusion head components once via:
python precompute_vibevoice_data.py extract \
--vibevoice_model microsoft/VibeVoice-1.5B \
--output_dir pretrained/vibevoice
# Then encode your audio into latents with `precompute_vibevoice_data.py encode`.
# 5. For HiFi-GAN: nothing extra; torchaudio downloads V3 LJSpeech weights on first run.
Checkpoints
Current runs (segments, spec-trained, with Hungarian verifier)
| Run | Backbone | Routing | Paths | Bits/seg | Train attack |
|---|---|---|---|---|---|
| hifigan_spdmark_spec | HiFi-GAN V3 (torchaudio LJSpeech) | 0,1,2,3,4,5,6 | 4 | 14 | nvlceqr |
| vibevoice_spdmark_spec | VibeVoice acoustic decoder | 0,1,2,3,4,5,6 (×2 slots) | 2 | 14 | nvlceqr |
| diffwave_spdmark_spec | DiffWave (LJSpeech) | 0,4,8,12,16,20,24 | 4 | 14 | nvlceqr |
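The Bits/seg column is consistent with each routed block (or slot) choosing one of `Paths` LoRA paths, which contributes log2(Paths) bits. That reading is an inference from the table, not something the repo states, and the helper name `bits_per_segment` is hypothetical. A quick arithmetic check under that assumption:

```python
def bits_per_segment(num_routed: int, num_paths: int, slots: int = 1) -> int:
    """Bits carried per segment, assuming one path choice per routed block and slot."""
    if num_paths < 2 or num_paths & (num_paths - 1):
        raise ValueError("num_paths is assumed to be a power of two")
    bits_per_choice = (num_paths - 1).bit_length()  # exact log2 for powers of two
    return num_routed * slots * bits_per_choice


configs = {
    "hifigan_spdmark_spec": bits_per_segment(7, 4),             # 7 blocks, 4 paths
    "vibevoice_spdmark_spec": bits_per_segment(7, 2, slots=2),  # 7 blocks, 2 slots, 2 paths
    "diffwave_spdmark_spec": bits_per_segment(7, 4),            # 7 layers, 4 paths
}
for run, bits in configs.items():
    print(f"{run}: {bits} bits/segment")
```

Under the same assumption, a 7-layer, 2-path configuration without extra slots would carry 7 bits.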
Legacy runs (no segments)
| Run | Backbone | Routing | Paths | Bits/seg | Train attack | Notes |
|---|---|---|---|---|---|---|
| vibevoice_14bit_v2 | VibeVoice acoustic decoder | 0,1,2,3,4,5,6 (×2 slots) | 2 | 14 | nvlceq | Pre-SPDMark-temporal-attacks training; eval JSON has no Hungarian verify block. |
| diffwave_v5_2step | DiffWave (LJSpeech) | 0,4,8,12,16,20,24 | 2 | 7 | nvlceq | 2-step diffusion with lambda_perc 0.1: louder watermark, faster inference. |
Spectral attack codes: n noise, v gain, l lowpass, c crop, e erase, q quantize, r resample. SPDMark temporal codes: d segment-drop, s segment-swap, i segment-insert.
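For scripting around the `--attack` / `--attacks` strings, the legend can be written as a lookup table. This is only a convenience sketch; the names `SPECTRAL`, `TEMPORAL`, and `decode_attacks` are hypothetical and do not come from the repo's code.

```python
# One-letter attack codes, copied from the legend.
SPECTRAL = {
    "n": "noise", "v": "gain", "l": "lowpass", "c": "crop",
    "e": "erase", "q": "quantize", "r": "resample",
}
TEMPORAL = {"d": "segment-drop", "s": "segment-swap", "i": "segment-insert"}


def decode_attacks(codes: str):
    """Expand a code string such as 'nvlceqr' into attack names."""
    table = {**SPECTRAL, **TEMPORAL}
    unknown = [c for c in codes if c not in table]
    if unknown:
        raise ValueError(f"unknown attack codes: {unknown}")
    return [table[c] for c in codes]


print(decode_attacks("nvlceqr"))
print(decode_attacks("dsi"))
```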
Training
All three trainers share the SPDMark plumbing in code/molm_audio_adapter.py
(HMAC keys via sample_training_keys, routing-mask construction, segment-map),
code/audio_augmentations.py (segment-aware aug with optional temporal
attacks), and code/audio_losses.py (SegmentCombinedAudioLoss with
valid_mask).
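As a rough picture of what a segment map and a `valid_mask` can look like, the sketch below splits a clip into S contiguous, near-equal segments and averages per-segment losses over the segments flagged valid. It is not the code in `molm_audio_adapter.py` or `audio_losses.py`: the helper names, the 22050 Hz sample rate (LJSpeech's native rate), and the masked-mean reduction are assumptions.

```python
def segment_bounds(num_samples: int, num_segments: int = 8):
    """Return (start, end) sample indices for contiguous, near-equal segments."""
    edges = [i * num_samples // num_segments for i in range(num_segments + 1)]
    return list(zip(edges[:-1], edges[1:]))


def masked_mean(per_segment_loss, valid_mask):
    """Average losses over valid segments only (hypothetical reduction)."""
    kept = [loss for loss, ok in zip(per_segment_loss, valid_mask) if ok]
    return sum(kept) / len(kept) if kept else 0.0


SAMPLE_RATE = 22050  # assumption: LJSpeech's native sample rate
bounds = segment_bounds(int(2.0 * SAMPLE_RATE))  # --audio_seconds 2.0, S=8
print(bounds[0], bounds[-1])

losses = [0.2, 0.4, 0.1, 0.3, 0.5, 0.2, 0.6, 0.1]
mask = [1, 1, 0, 1, 1, 1, 1, 1]  # third segment treated as invalid, e.g. after an attack
print(masked_mean(losses, mask))
```

With 44100 samples and 8 segments the split is not exact, so segment lengths alternate between 5512 and 5513 samples.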
The exact commands used to produce the spec-trained checkpoints in this repo:
HiFi-GAN segments
python code/training_molm_hifigan.py \
--dataset_path data/train_full \
--output_dir checkpoints/hifigan_spdmark_spec \
--exp_name MOLM_HiFiGAN_SPDMark_spec \
--routing_blocks 0,1,2,3,4,5,6 --num_paths 4 --lora_rank 64 \
--num_segments 8 --key_bits 128 \
--attack nvlceqr --aug_prob 0.25 \
--lambda_perc 0.5 --lambda_perc_warmup_steps 4500 \
--max_train_steps 75000 --train_batch_size 8 --gradient_accumulation_steps 4 \
--learning_rate 2e-4 --lora_init_std 0.07 \
--audio_seconds 2.0 \
--checkpointing_steps 500 --logging_steps 10 --use_wandb
DiffWave segments
python code/training_molm_audio.py \
--diffwave_checkpoint pretrained/diffwave-ljspeech.pt \
--dataset_path data/train_full \
--output_dir checkpoints/diffwave_spdmark_spec \
--exp_name DiffWave_SPDMark_spec \
--routing_layers 0,4,8,12,16,20,24 --num_paths 4 --lora_rank 64 \
--lora_alpha 24 --lora_init_std 0.01 \
--diffusion_steps 4 \
--num_segments 8 --key_bits 128 \
--attack nvlceqr --aug_prob 0.25 \
--adam_weight_decay 0.05 --lambda_lora_reg 0.01 \
--lambda_perc 0.3 --lambda_perc_warmup_steps 2000 \
--max_train_steps 40000 --train_batch_size 8 --gradient_accumulation_steps 2 \
--learning_rate 3e-4 --max_grad_norm 1.0 --audio_seconds 2.0 \
--checkpointing_steps 500 --logging_steps 10 --use_wandb
VibeVoice segments
python code/training_molm_vibevoice.py \
--components_dir pretrained/vibevoice \
--data_dir data/vibevoice_latents_train_full \
--output_dir checkpoints/vibevoice_spdmark_spec \
--exp_name vibevoice_spdmark_spec \
--routing_blocks 0,1,2,3,4,5,6 \
--num_paths 2 --lora_rank 16 --lora_alpha 8 \
--num_frames 8 --num_segments 8 --key_bits 128 \
--attack nvlceqr --aug_prob 0.3 \
--lambda_perc 0.5 --lambda_perc_warmup_steps 4500 --lambda_lora_reg 0.01 \
--max_train_steps 120000 --cosine_cycle 1 \
--train_batch_size 16 --gradient_accumulation_steps 2 \
--learning_rate 1e-4 \
--checkpointing_steps 500 --logging_steps 10 --use_wandb
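When comparing the three commands, it can help to look at the effective batch size, that is, `--train_batch_size` times `--gradient_accumulation_steps`. The sketch assumes single-device training and standard gradient-accumulation semantics, neither of which is confirmed by the scripts here; `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(train_batch_size: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update (assumes standard accumulation)."""
    return train_batch_size * grad_accum_steps * num_devices


# (train_batch_size, gradient_accumulation_steps) copied from the commands above.
commands = {
    "hifigan_spdmark_spec": (8, 4),
    "diffwave_spdmark_spec": (8, 2),
    "vibevoice_spdmark_spec": (16, 2),
}
effective = {run: effective_batch_size(bs, accum) for run, (bs, accum) in commands.items()}
for run, size in effective.items():
    print(f"{run}: {size} samples per optimizer update")
```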
Legacy training (no segments)
diffwave_v5_2step (2-step diffusion, very low lambda_perc):
python code/training_molm_audio.py \
--diffwave_checkpoint pretrained/diffwave-ljspeech.pt \
--dataset_path data/train_full \
--routing_layers 0,4,8,12,16,20,24 --num_paths 2 --lora_rank 64 --lora_alpha 64 \
--diffusion_steps 2 \
--max_train_steps 120000 --cosine_cycle 1 \
--train_batch_size 8 --gradient_accumulation_steps 4 \
--learning_rate 1e-4 \
--attack nvlceq --aug_prob 0.3 \
--lambda_perc 0.1 --lambda_perc_warmup_steps 4500 --lambda_lora_reg 0.01 \
--output_dir checkpoints/diffwave_v5_2step \
--checkpointing_steps 100 --logging_steps 10 --use_wandb \
--exp_name diffwave_v5_2step
vibevoice_14bit_v2:
python code/training_molm_vibevoice.py \
--components_dir pretrained/vibevoice \
--data_dir data/vibevoice_latents_train_full \
--routing_blocks 0,1,2,3,4,5,6 \
--num_paths 2 --lora_rank 16 --lora_alpha 8 \
--max_train_steps 120000 --cosine_cycle 1 \
--train_batch_size 16 --gradient_accumulation_steps 2 \
--learning_rate 1e-4 \
--attack nvlceq --aug_prob 0.3 \
--lambda_perc 0.5 --lambda_perc_warmup_steps 4500 --lambda_lora_reg 0.01 \
--output_dir checkpoints/vibevoice_14bit_v2 \
--exp_name vibevoice_14bit_v2 \
--checkpointing_steps 500 --logging_steps 10 --use_wandb
Key SPDMark training flags (all three trainers):
- --num_segments 8: S, segments per audio clip.
- --key_bits 128: base-key width fed into HMAC-SHA256 for per-segment message derivation.
- --attack <codes>: see the attack-code legend above. Append dsi to also train against segment-level temporal attacks (drop/swap/insert).
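The three temporal attacks can be pictured as edits to the ordered list of segments. The sketch below works on placeholder labels rather than audio, and the function names are hypothetical; the actual augmentation in `code/audio_augmentations.py` may pick positions, fillers, and padding differently.

```python
import random


def segment_drop(segments, rng):
    """Remove one randomly chosen segment."""
    out = list(segments)
    del out[rng.randrange(len(out))]
    return out


def segment_swap(segments, rng):
    """Exchange two distinct, randomly chosen segments."""
    out = list(segments)
    i, j = rng.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out


def segment_insert(segments, rng, filler):
    """Insert a foreign segment at a random position."""
    out = list(segments)
    out.insert(rng.randrange(len(out) + 1), filler)
    return out


rng = random.Random(0)
clip = [f"seg{i}" for i in range(8)]  # stand-ins for S=8 audio segments
dropped = segment_drop(clip, rng)
swapped = segment_swap(clip, rng)
inserted = segment_insert(clip, rng, "foreign")
print(dropped, swapped, inserted, sep="\n")
```

Each attack changes which message sits at which position, which is why the verifier matches segments by assignment rather than by index.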
Inference
All three generators share a common eval interface; --verify enables the
Hungarian + Binomial verifier (auto-sets --message_scheme hmac).
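A minimal sketch of the matching-plus-test idea follows. It is not the repo's verifier: an exhaustive search over permutations stands in for the Hungarian algorithm (for larger problems `scipy.optimize.linear_sum_assignment` would be the usual choice), the test statistic is assumed to be the number of matching bits under the best assignment, the null is assumed to be fair coin flips, and the roles of `--gamma_f` and `--gamma_v` are not modeled.

```python
import itertools
import math
import random


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def best_assignment(recovered, expected):
    """Min-cost one-to-one match of recovered segments to expected messages.

    Brute force over permutations; assumes len(recovered) <= len(expected).
    """
    assert len(recovered) <= len(expected)
    cost = [[hamming(r, e) for e in expected] for r in recovered]
    perms = itertools.permutations(range(len(expected)), len(recovered))
    best = min(perms, key=lambda p: sum(cost[i][j] for i, j in enumerate(p)))
    return best, sum(cost[i][j] for i, j in enumerate(best))


def binomial_tail(n, k, p=0.5):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


rng = random.Random(0)
expected = [[rng.randint(0, 1) for _ in range(14)] for _ in range(8)]

# Simulated recovery: segment 4 dropped, the rest reordered, two bit errors.
order = [3, 0, 7, 1, 5, 2, 6]
recovered = [list(expected[j]) for j in order]
recovered[0][0] ^= 1
recovered[2][5] ^= 1

assignment, total_cost = best_assignment(recovered, expected)
n_bits = len(recovered) * 14
matches = n_bits - total_cost
p_value = binomial_tail(n_bits, matches)
print(assignment, matches, n_bits, p_value)
```

Picking the best assignment inflates the match count even for unwatermarked audio, so a plain binomial p-value computed this way is optimistic; a real verifier has to account for that when setting its thresholds.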
HiFi-GAN
python code/generate_molm_hifigan.py \
--lora_weights checkpoints/hifigan_spdmark_spec/final/lora_weights.pt \
--extractor_weights checkpoints/hifigan_spdmark_spec/final/extractor.pt \
--routing_blocks 0,1,2,3,4,5,6 --num_paths 4 --lora_rank 64 \
--test_dir <path-to-wavs> --num_samples 20 \
--chunked_generation --num_chunks 8 \
--verify --gamma_f 0.01 --gamma_v 0.01 --attacks nvlceqr \
--output_dir eval_out/hifigan_spec_spectral --device cuda
DiffWave (current spdmark_spec)
python code/generate_molm_audio.py \
--diffwave_checkpoint pretrained/diffwave-ljspeech.pt \
--lora_weights checkpoints/diffwave_spdmark_spec/final/lora_weights.pt \
--extractor_weights checkpoints/diffwave_spdmark_spec/final/extractor.pt \
--routing_layers 0,4,8,12,16,20,24 --num_paths 4 --lora_rank 64 \
--diffusion_steps 4 \
--test_dir <path-to-wavs> --num_samples 20 \
--chunked_generation --num_chunks 8 \
--verify --gamma_f 0.01 --gamma_v 0.01 --attacks nvlceqr \
--output_dir eval_out/diffwave_spec_spectral --device cuda
DiffWave (no segments v5_2step)
python code/generate_molm_audio.py \
--diffwave_checkpoint pretrained/diffwave-ljspeech.pt \
--lora_weights checkpoints/diffwave_v5_2step/final/lora_weights.pt \
--extractor_weights checkpoints/diffwave_v5_2step/final/extractor.pt \
--routing_layers 0,4,8,12,16,20,24 --num_paths 2 --lora_rank 64 \
--diffusion_steps 2 \
--test_dir <path-to-wavs> --num_samples 20 \
--chunked_generation --num_chunks 8 \
--verify --gamma_f 0.01 --gamma_v 0.01 --attacks nvlceqr \
--output_dir eval_out/diffwave_v5_2step_spectral --device cuda
VibeVoice (current spdmark_spec)
python code/generate_molm_vibevoice.py \
--components_dir pretrained/vibevoice \
--lora_weights checkpoints/vibevoice_spdmark_spec/final/lora_weights.pt \
--extractor_weights checkpoints/vibevoice_spdmark_spec/final/extractor.pt \
--routing_blocks 0,1,2,3,4,5,6 --num_paths 2 --lora_rank 16 --lora_alpha 8 \
--num_frames 8 \
--data_dir <path-to-precomputed-latents> --num_samples 20 \
--chunked_generation --num_chunks 8 \
--verify --gamma_f 0.01 --gamma_v 0.01 --attacks nvlceqr \
--output_dir eval_out/vibevoice_spec_spectral --device cuda
VibeVoice (no segments 14bit_v2)
Same flags as spdmark_spec, but point --lora_weights and --extractor_weights at checkpoints/vibevoice_14bit_v2/final/.
Smoke tests
Quick sanity check of the SPDMark plumbing:
python code/smoke_test_hifigan.py # 9 checks incl. HMAC keys + segment-aware aug + verify
python code/smoke_test.py # DiffWave equivalents
python code/smoke_test_vibevoice.py # VibeVoice equivalents