MOLM-Audio: SPDMark-Style Segment-Wise Audio Watermarking

LoRA-routing audio watermarks for three generators (HiFi-GAN, the VibeVoice acoustic decoder, and DiffWave). Following SPDMark (https://arxiv.org/abs/2512.12090), each audio clip is split into S=8 segments, and each segment carries an HMAC-derived M-bit message embedded via a parallel LoRA "basis dictionary." The verifier matches the recovered per-segment bits against the expected HMAC sequence using Hungarian assignment, then applies a binomial hypothesis test.
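To make the "HMAC-derived M-bit message" idea concrete, here is a minimal sketch of one way to derive a distinct message per segment from a base key. The function name `derive_segment_messages`, the 4-byte big-endian segment index as HMAC input, and the truncation to the top M bits are all assumptions for illustration; the repo's own derivation (see `sample_training_keys` in code/molm_audio_adapter.py) may differ.

```python
import hashlib
import hmac


def derive_segment_messages(key: bytes, num_segments: int = 8, bits_per_segment: int = 14):
    """Return one M-bit message (list of 0/1) per segment index.

    Hypothetical scheme: message_s = top M bits of HMAC-SHA256(key, s).
    """
    if bits_per_segment > 256:
        raise ValueError("a single SHA-256 digest yields at most 256 bits")
    messages = []
    for seg in range(num_segments):
        # Assumed input encoding: the segment index as 4 big-endian bytes.
        digest = hmac.new(key, seg.to_bytes(4, "big"), hashlib.sha256).digest()
        value = int.from_bytes(digest, "big")
        # Keep the top M bits of the 256-bit digest, most significant first.
        top = value >> (256 - bits_per_segment)
        bits = [(top >> (bits_per_segment - 1 - i)) & 1 for i in range(bits_per_segment)]
        messages.append(bits)
    return messages


key = bytes(range(16))  # 128-bit demo key, matching --key_bits 128
msgs = derive_segment_messages(key)
print(len(msgs), len(msgs[0]))  # 8 14
```

Because the messages depend only on the key and the segment index, a verifier holding the key can regenerate the expected sequence without storing it.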

Repo layout

checkpoints/
  diffwave_spdmark_spec/final/{lora_weights.pt, extractor.pt, diffwave_full.pt}
  diffwave_v5_2step/final/{lora_weights.pt, extractor.pt, diffwave_full.pt}
  hifigan_spdmark_spec/final/{lora_weights.pt, extractor.pt, hifigan_full.pt}
  vibevoice_spdmark_spec/final/{lora_weights.pt, extractor.pt, model_full.pt}
  vibevoice_14bit_v2/final/{lora_weights.pt, extractor.pt, model_full.pt}
code/                  # training + inference + smoke-test scripts (clone or download)
eval/<run>_<regime>/results.json
README.md

For inference, lora_weights.pt plus the base model (fetched by torchaudio or placed under pretrained/) is sufficient. *_full.pt is provided for one-step loading when the base model isn't available locally.
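A small sketch of how the layout above maps a run name to its files, which can help check a download before running inference. The helper names `checkpoint_files` and `missing_files` are hypothetical and not part of the repo; the filenames come from the layout listing.

```python
from pathlib import Path

# Full-model filename per backbone, as listed in the repo layout.
FULL_MODEL_NAME = {
    "diffwave": "diffwave_full.pt",
    "hifigan": "hifigan_full.pt",
    "vibevoice": "model_full.pt",
}


def checkpoint_files(run: str, root: str = "checkpoints") -> dict:
    """Map a run name such as 'hifigan_spdmark_spec' to its expected files."""
    backbone = run.split("_", 1)[0]  # run names start with the backbone
    final = Path(root) / run / "final"
    return {
        "lora_weights": final / "lora_weights.pt",
        "extractor": final / "extractor.pt",
        "full_model": final / FULL_MODEL_NAME[backbone],
    }


def missing_files(run: str, root: str = "checkpoints") -> list:
    """Return the expected files that are not present on disk."""
    return [str(p) for p in checkpoint_files(run, root).values() if not p.exists()]


print(checkpoint_files("hifigan_spdmark_spec")["lora_weights"].as_posix())
# checkpoints/hifigan_spdmark_spec/final/lora_weights.pt
```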

Training/test wavs live in MOLM-Audio/molm-audio-data under data/.

Setup

# 1. Clone or download this repo.
hf download MOLM-Audio/molm-audio --local-dir molm-audio

# 2. Install the Python dependencies listed in code/requirements.txt.
cd molm-audio/code
pip install -r requirements.txt

# 3. For DiffWave: also place the LJSpeech base checkpoint at
#    pretrained/diffwave-ljspeech.pt (download from the upstream DiffWave repo).
# 4. For VibeVoice: extract decoder + diffusion head components once via:
python precompute_vibevoice_data.py extract \
    --vibevoice_model microsoft/VibeVoice-1.5B \
    --output_dir pretrained/vibevoice
# Then encode your audio into latents with `precompute_vibevoice_data.py encode`.
# 5. For HiFi-GAN: nothing extra — torchaudio downloads V3 LJSpeech weights on first run.

Checkpoints

Current runs (segments, spec-trained, with Hungarian verifier)

Run	Backbone	Routing	Paths	Bits/seg	Train attack
`hifigan_spdmark_spec`	HiFi-GAN V3 (torchaudio LJSpeech)	`0,1,2,3,4,5,6`	4	14	`nvlceqr`
`vibevoice_spdmark_spec`	VibeVoice acoustic decoder	`0,1,2,3,4,5,6` (×2 slots)	2	14	`nvlceqr`
`diffwave_spdmark_spec`	DiffWave (LJSpeech)	`0,4,8,12,16,20,24`	4	14	`nvlceqr`

Legacy runs (no segments)

Run	Backbone	Routing	Paths	Bits/seg	Train attack	Notes
`vibevoice_14bit_v2`	VibeVoice acoustic decoder	`0,1,2,3,4,5,6` (×2 slots)	2	14	`nvlceq`	Pre-SPDMark-temporal-attacks training; eval JSON has no Hungarian verify block.
`diffwave_v5_2step`	DiffWave (LJSpeech)	`0,4,8,12,16,20,24`	2	7	`nvlceq`	2-step diffusion, `lambda_perc 0.1` → louder watermark, faster inference.
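The bit counts in both tables are consistent with each routed site selecting one of `num_paths` LoRA paths, so that a site contributes log2(num_paths) bits. This relationship is inferred from the table values rather than stated by the authors; the sketch below simply checks it, with `message_bits` as a hypothetical helper name.

```python
import math

# (routed sites, paths per site) read off the two tables above.
RUNS = {
    "hifigan_spdmark_spec": (7, 4),
    "vibevoice_spdmark_spec": (7 * 2, 2),  # 7 blocks x 2 slots
    "diffwave_spdmark_spec": (7, 4),
    "vibevoice_14bit_v2": (7 * 2, 2),      # 7 blocks x 2 slots
    "diffwave_v5_2step": (7, 2),
}


def message_bits(num_sites: int, num_paths: int) -> int:
    """Bits carried when each site picks one of num_paths (assumed relation)."""
    bits_per_site = math.log2(num_paths)
    if not bits_per_site.is_integer():
        raise ValueError("num_paths must be a power of two")
    return num_sites * int(bits_per_site)


for run, (sites, paths) in RUNS.items():
    print(f"{run}: {message_bits(sites, paths)} bits")
```

Under this reading, the four 14-bit runs and the 7-bit `diffwave_v5_2step` run all match the table.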

Spectral attack codes: n noise, v gain, l lowpass, c crop, e erase, q quantize, r resample. SPDMark temporal codes: d segment-drop, s segment-swap, i segment-insert.
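The legend can be expressed as a lookup table, which makes strings such as `nvlceqr` easy to read at a glance. `parse_attacks` is a hypothetical helper written for this note, not a function from the repo.

```python
SPECTRAL = {
    "n": "noise", "v": "gain", "l": "lowpass", "c": "crop",
    "e": "erase", "q": "quantize", "r": "resample",
}
TEMPORAL = {"d": "segment-drop", "s": "segment-swap", "i": "segment-insert"}


def parse_attacks(codes: str) -> list:
    """Expand an attack-code string into attack names, in order."""
    legend = {**SPECTRAL, **TEMPORAL}
    unknown = sorted(set(codes) - set(legend))
    if unknown:
        raise ValueError(f"unknown attack codes: {unknown}")
    return [legend[c] for c in codes]


print(parse_attacks("nvlceqr"))
# ['noise', 'gain', 'lowpass', 'crop', 'erase', 'quantize', 'resample']
```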

Training

All three trainers share the SPDMark plumbing: code/molm_audio_adapter.py (HMAC keys via sample_training_keys, routing-mask construction, segment map), code/audio_augmentations.py (segment-aware augmentation with optional temporal attacks), and code/audio_losses.py (SegmentCombinedAudioLoss with valid_mask).

The exact commands used to produce the spec-trained checkpoints in this repo:

HiFi-GAN segments

python code/training_molm_hifigan.py \
    --dataset_path data/train_full \
    --output_dir checkpoints/hifigan_spdmark_spec \
    --exp_name MOLM_HiFiGAN_SPDMark_spec \
    --routing_blocks 0,1,2,3,4,5,6 --num_paths 4 --lora_rank 64 \
    --num_segments 8 --key_bits 128 \
    --attack nvlceqr --aug_prob 0.25 \
    --lambda_perc 0.5 --lambda_perc_warmup_steps 4500 \
    --max_train_steps 75000 --train_batch_size 8 --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 --lora_init_std 0.07 \
    --audio_seconds 2.0 \
    --checkpointing_steps 500 --logging_steps 10 --use_wandb

DiffWave segments

python code/training_molm_audio.py \
    --diffwave_checkpoint pretrained/diffwave-ljspeech.pt \
    --dataset_path data/train_full \
    --output_dir checkpoints/diffwave_spdmark_spec \
    --exp_name DiffWave_SPDMark_spec \
    --routing_layers 0,4,8,12,16,20,24 --num_paths 4 --lora_rank 64 \
    --lora_alpha 24 --lora_init_std 0.01 \
    --diffusion_steps 4 \
    --num_segments 8 --key_bits 128 \
    --attack nvlceqr --aug_prob 0.25 \
    --adam_weight_decay 0.05 --lambda_lora_reg 0.01 \
    --lambda_perc 0.3 --lambda_perc_warmup_steps 2000 \
    --max_train_steps 40000 --train_batch_size 8 --gradient_accumulation_steps 2 \
    --learning_rate 3e-4 --max_grad_norm 1.0 --audio_seconds 2.0 \
    --checkpointing_steps 500 --logging_steps 10 --use_wandb

VibeVoice segments

python code/training_molm_vibevoice.py \
    --components_dir pretrained/vibevoice \
    --data_dir data/vibevoice_latents_train_full \
    --output_dir checkpoints/vibevoice_spdmark_spec \
    --exp_name vibevoice_spdmark_spec \
    --routing_blocks 0,1,2,3,4,5,6 \
    --num_paths 2 --lora_rank 16 --lora_alpha 8 \
    --num_frames 8 --num_segments 8 --key_bits 128 \
    --attack nvlceqr --aug_prob 0.3 \
    --lambda_perc 0.5 --lambda_perc_warmup_steps 4500 --lambda_lora_reg 0.01 \
    --max_train_steps 120000 --cosine_cycle 1 \
    --train_batch_size 16 --gradient_accumulation_steps 2 \
    --learning_rate 1e-4 \
    --checkpointing_steps 500 --logging_steps 10 --use_wandb

Training without segments

diffwave_v5_2step (2-step diffusion, very low lambda_perc):

python code/training_molm_audio.py \
    --diffwave_checkpoint pretrained/diffwave-ljspeech.pt \
    --dataset_path data/train_full \
    --routing_layers 0,4,8,12,16,20,24 --num_paths 2 --lora_rank 64 --lora_alpha 64 \
    --diffusion_steps 2 \
    --max_train_steps 120000 --cosine_cycle 1 \
    --train_batch_size 8 --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --attack nvlceq --aug_prob 0.3 \
    --lambda_perc 0.1 --lambda_perc_warmup_steps 4500 --lambda_lora_reg 0.01 \
    --output_dir checkpoints/diffwave_v5_2step \
    --checkpointing_steps 100 --logging_steps 10 --use_wandb \
    --exp_name diffwave_v5_2step

vibevoice_14bit_v2:

python code/training_molm_vibevoice.py \
    --components_dir pretrained/vibevoice \
    --data_dir data/vibevoice_latents_train_full \
    --routing_blocks 0,1,2,3,4,5,6 \
    --num_paths 2 --lora_rank 16 --lora_alpha 8 \
    --max_train_steps 120000 --cosine_cycle 1 \
    --train_batch_size 16 --gradient_accumulation_steps 2 \
    --learning_rate 1e-4 \
    --attack nvlceq --aug_prob 0.3 \
    --lambda_perc 0.5 --lambda_perc_warmup_steps 4500 --lambda_lora_reg 0.01 \
    --output_dir checkpoints/vibevoice_14bit_v2 \
    --exp_name vibevoice_14bit_v2 \
    --checkpointing_steps 500 --logging_steps 10 --use_wandb

Key SPDMark training flags (all three trainers):

--num_segments 8 — S, segments per audio clip.
--key_bits 128 — base-key width fed into HMAC-SHA256 for per-segment message derivation.
--attack <codes> — see attack-code legend above. Append dsi to also train against segment-level temporal attacks (drop/swap/insert).
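As a rough illustration of what `--num_segments 8` implies for a 2-second training clip, the sketch below splits a clip into S contiguous spans. The helper name `segment_bounds`, the near-equal split, and the 22,050 Hz sample rate (the LJSpeech rate) are assumptions for illustration; the repo's own segment map may be built differently.

```python
def segment_bounds(num_samples: int, num_segments: int = 8) -> list:
    """Split [0, num_samples) into contiguous, near-equal (start, end) spans."""
    edges = [i * num_samples // num_segments for i in range(num_segments + 1)]
    return list(zip(edges[:-1], edges[1:]))


sample_rate = 22050                    # assumed: LJSpeech sample rate
clip_samples = int(2.0 * sample_rate)  # --audio_seconds 2.0
bounds = segment_bounds(clip_samples)

for idx, (start, end) in enumerate(bounds):
    print(f"segment {idx}: samples {start}..{end} ({(end - start) / sample_rate:.3f} s)")
```

With these assumptions each segment covers about 0.25 s of audio and carries its own message.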

Inference

All three generators share a common eval interface; --verify enables the Hungarian + Binomial verifier (auto-sets --message_scheme hmac).
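The following is a minimal, standard-library sketch of the general idea behind a Hungarian plus binomial verifier, not the repo's implementation. It pairs recovered segments with expected messages by minimum total Hamming distance (so reordered segments can still be matched), then asks how likely that many agreeing bits would be by chance. The names `hungarian`, `binom_tail`, `verify`, and the `alpha` threshold are hypothetical, and the repo's `--gamma_f` and `--gamma_v` thresholds are not modeled. The p-value also ignores the selection effect of choosing the best matching, so it is optimistic. In practice `scipy.optimize.linear_sum_assignment` can replace the hand-written assignment routine.

```python
from math import comb


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def hungarian(cost):
    """Min-cost assignment for an n x m matrix (n <= m); returns a column per row."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0] * (n + 1), [0] * (m + 1)   # row / column potentials
    p, way = [0] * (m + 1), [0] * (m + 1)  # p[j] = 1-based row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the augmenting path back to column 0
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match


def binom_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def verify(recovered, expected, alpha=0.01):
    cost = [[hamming(r, e) for e in expected] for r in recovered]
    match = hungarian(cost)
    n_bits = sum(len(r) for r in recovered)
    agree = n_bits - sum(cost[i][j] for i, j in enumerate(match))
    p_value = binom_tail(agree, n_bits)  # null: each bit agrees with prob. 0.5
    return match, agree, p_value, p_value < alpha


# Toy example: 4 segments x 8 bits, segments reordered and one bit flipped.
expected = [[1, 1, 0, 0, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0, 0, 0],
            [0, 0, 0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 0, 0, 1, 1]]
recovered = [[1, 0, 0, 0, 1, 1, 0, 0], expected[0], expected[3], expected[1]]
match, agree, p_value, detected = verify(recovered, expected)
print(match, agree, detected)  # [2, 0, 3, 1] 31 True
```

The toy example uses 4 segments of 8 bits for readability; the released checkpoints use 8 segments of 14 bits.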

HiFi-GAN

python code/generate_molm_hifigan.py \
    --lora_weights checkpoints/hifigan_spdmark_spec/final/lora_weights.pt \
    --extractor_weights checkpoints/hifigan_spdmark_spec/final/extractor.pt \
    --routing_blocks 0,1,2,3,4,5,6 --num_paths 4 --lora_rank 64 \
    --test_dir <path-to-wavs> --num_samples 20 \
    --chunked_generation --num_chunks 8 \
    --verify --gamma_f 0.01 --gamma_v 0.01 --attacks nvlceqr \
    --output_dir eval_out/hifigan_spec_spectral --device cuda

DiffWave (current `spdmark_spec`)

python code/generate_molm_audio.py \
    --diffwave_checkpoint pretrained/diffwave-ljspeech.pt \
    --lora_weights checkpoints/diffwave_spdmark_spec/final/lora_weights.pt \
    --extractor_weights checkpoints/diffwave_spdmark_spec/final/extractor.pt \
    --routing_layers 0,4,8,12,16,20,24 --num_paths 4 --lora_rank 64 \
    --diffusion_steps 4 \
    --test_dir <path-to-wavs> --num_samples 20 \
    --chunked_generation --num_chunks 8 \
    --verify --gamma_f 0.01 --gamma_v 0.01 --attacks nvlceqr \
    --output_dir eval_out/diffwave_spec_spectral --device cuda

DiffWave (no segments `v5_2step`)

python code/generate_molm_audio.py \
    --diffwave_checkpoint pretrained/diffwave-ljspeech.pt \
    --lora_weights checkpoints/diffwave_v5_2step/final/lora_weights.pt \
    --extractor_weights checkpoints/diffwave_v5_2step/final/extractor.pt \
    --routing_layers 0,4,8,12,16,20,24 --num_paths 2 --lora_rank 64 \
    --diffusion_steps 2 \
    --test_dir <path-to-wavs> --num_samples 20 \
    --chunked_generation --num_chunks 8 \
    --verify --gamma_f 0.01 --gamma_v 0.01 --attacks nvlceqr \
    --output_dir eval_out/diffwave_v5_2step_spectral --device cuda

VibeVoice (current `spdmark_spec`)

python code/generate_molm_vibevoice.py \
    --components_dir pretrained/vibevoice \
    --lora_weights checkpoints/vibevoice_spdmark_spec/final/lora_weights.pt \
    --extractor_weights checkpoints/vibevoice_spdmark_spec/final/extractor.pt \
    --routing_blocks 0,1,2,3,4,5,6 --num_paths 2 --lora_rank 16 --lora_alpha 8 \
    --num_frames 8 \
    --data_dir <path-to-precomputed-latents> --num_samples 20 \
    --chunked_generation --num_chunks 8 \
    --verify --gamma_f 0.01 --gamma_v 0.01 --attacks nvlceqr \
    --output_dir eval_out/vibevoice_spec_spectral --device cuda

VibeVoice (no segments `14bit_v2`)

Use the same flags as the spdmark_spec command above, but point --lora_weights and --extractor_weights at the files under checkpoints/vibevoice_14bit_v2/final/.
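One way to avoid copy-paste drift between the two VibeVoice evaluations is to build the command from the run name. This is only a sketch: `vibevoice_eval_cmd` is a hypothetical helper, and the output directory used in the example is an illustrative choice, not a path from the repo. The flags themselves are copied from the spdmark_spec command.

```python
import shlex


def vibevoice_eval_cmd(run: str, data_dir: str, output_dir: str) -> list:
    """Build the VibeVoice eval command for a given checkpoint run."""
    final = f"checkpoints/{run}/final"
    return [
        "python", "code/generate_molm_vibevoice.py",
        "--components_dir", "pretrained/vibevoice",
        "--lora_weights", f"{final}/lora_weights.pt",
        "--extractor_weights", f"{final}/extractor.pt",
        "--routing_blocks", "0,1,2,3,4,5,6",
        "--num_paths", "2", "--lora_rank", "16", "--lora_alpha", "8",
        "--num_frames", "8",
        "--data_dir", data_dir, "--num_samples", "20",
        "--chunked_generation", "--num_chunks", "8",
        "--verify", "--gamma_f", "0.01", "--gamma_v", "0.01",
        "--attacks", "nvlceqr",
        "--output_dir", output_dir, "--device", "cuda",
    ]


cmd = vibevoice_eval_cmd(
    "vibevoice_14bit_v2",
    data_dir="<path-to-precomputed-latents>",
    output_dir="eval_out/vibevoice_14bit_v2_spectral",  # illustrative name
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repo root once the latents path is filled in.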

Smoke tests

Quick sanity check of the SPDMark plumbing:

python code/smoke_test_hifigan.py    # 9 checks incl. HMAC keys + segment-aware aug + verify
python code/smoke_test.py            # DiffWave equivalents
python code/smoke_test_vibevoice.py  # VibeVoice equivalents

Paper

SPDMark: Selective Parameter Displacement for Robust Video Watermarking (arXiv 2512.12090, published Apr 1)