LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation
Reference checkpoint for the paper "LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation" (arXiv:2605.00777).
LASE is a 256-d speaker embedding that preserves speaker identity across Devanagari, Telugu, Tamil, and Latin scripts. It wraps a frozen microsoft/wavlm-base-plus backbone with a 2-layer projection MLP and a gradient-reversal language classifier (~170k trainable params).
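To make the architecture concrete, here is a minimal PyTorch sketch of a gradient-reversal layer feeding a language classifier on top of a 2-layer projection MLP. It is an illustration under stated assumptions, not the code from the LASE repository: the names `GradReverse` and `AdversarialHead`, the hidden width of 256, the ReLU activation, and the use of an already-pooled 768-d WavLM feature as input are all hypothetical, and the layer sizes are not tuned to reproduce the ~170k parameter count.

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The second return value is the (non-existent) gradient for `lam`.
        return -ctx.lam * grad_output, None


class AdversarialHead(nn.Module):
    """Hypothetical head: projection MLP plus a language classifier behind a GRL."""

    def __init__(self, in_dim=768, hidden_dim=256, embedding_dim=256, n_languages=4):
        super().__init__()
        # 2-layer projection MLP producing the speaker embedding.
        self.proj = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim),
        )
        # Language classifier; its gradient reaches `proj` with the sign flipped.
        self.lang_clf = nn.Linear(embedding_dim, n_languages)

    def forward(self, pooled, lam=1.0):
        emb = self.proj(pooled)
        logits = self.lang_clf(GradReverse.apply(emb, lam))
        return {"embedding": emb, "language_logits": logits}
```

Because the gradient is reversed, minimizing the language-classification loss trains the classifier to predict the language while pushing the projection to remove language cues from the embedding.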
| Encoder | Western voices gap | Indian voices gap |
|---|---|---|
| WavLM-base-plus-sv (off-the-shelf) | 0.082 | 0.006 |
| ECAPA-TDNN (off-the-shelf) | 0.105 | 0.058 |
| ECAPA + GRL (ablation) | 0.027 | 0.037 |
| LASE r1 (ours) | 0.013 | −0.000 |
Lower is better. gap = within-script median minus cross-script median for the same speaker. LASE r1's bootstrap 95% CI on gap straddles zero on both held-out corpora.
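The gap definition above can be sketched in a few lines of standard-library Python. This is an illustration, not the paper's evaluation script: it assumes the medians are taken over per-pair similarity scores for one speaker (the card does not say which score is used), and the percentile bootstrap with 1000 resamples, the function names, and the example numbers are all hypothetical.

```python
import random
from statistics import median


def script_gap(within_scores, cross_scores):
    """Within-script median minus cross-script median for one speaker."""
    return median(within_scores) - median(cross_scores)


def bootstrap_gap_ci(within_scores, cross_scores, n_boot=1000, seed=0):
    """Percentile-bootstrap 95% interval for the gap (assumed procedure)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        # Resample each score list with replacement, keeping its size.
        w = rng.choices(within_scores, k=len(within_scores))
        c = rng.choices(cross_scores, k=len(cross_scores))
        gaps.append(script_gap(w, c))
    gaps.sort()
    lo = gaps[int(0.025 * n_boot)]
    hi = gaps[min(n_boot - 1, int(0.975 * n_boot))]
    return lo, hi


# Made-up scores, for illustration only.
within = [0.90, 0.80, 0.85]
cross = [0.70, 0.75, 0.80]
gap = script_gap(within, cross)
ci_low, ci_high = bootstrap_gap_ci(within, cross)
```

An interval that contains zero, as reported for LASE r1, means the data do not distinguish within-script from cross-script similarity for the same speaker.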
```python
import torch
from huggingface_hub import hf_hub_download

# Clone github.com/praxelhq/lase first; the model code is imported from that repo.
from models.lase import LASE, LambdaSchedule, WavLMSpeakerEncoder

# Download the reference checkpoint from the Hugging Face Hub.
ckpt_path = hf_hub_download("Praxel/lase-r1", "last.pt")

backbone = WavLMSpeakerEncoder("microsoft/wavlm-base-plus", embedding_dim=256, freeze_backbone=True)
model = LASE(backbone, embedding_dim=256, n_languages=4,
             lambda_schedule=LambdaSchedule(200, 500, 0.1))
# map_location="cpu" lets the checkpoint load on machines without a GPU.
model.load_state_dict(torch.load(ckpt_path, map_location="cpu")["model"], strict=False)
model.eval()

# wav: (B, T) float32 at 16 kHz, ~2 seconds (supply your own audio tensor)
with torch.no_grad():
    embedding = model(wav)["embedding"]  # (B, 256)
```
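Once embeddings are available, a typical next step is to bring clips to the expected length and compare two embeddings. The helpers below are a hedged sketch: `fit_length` and `same_speaker_score` are hypothetical names, the exact 2.0 s target is an assumption (the card only says "~2 seconds"), and cosine similarity is assumed as the comparison score because the card does not specify one.

```python
import torch
import torch.nn.functional as F

SAMPLE_RATE = 16_000   # Hz, as stated in the usage snippet
CLIP_SECONDS = 2.0     # assumed target; the card says "~2 seconds"


def fit_length(wav, n_samples=int(SAMPLE_RATE * CLIP_SECONDS)):
    """Center-crop or right-zero-pad a (B, T) batch to exactly n_samples."""
    t = wav.shape[-1]
    if t >= n_samples:
        start = (t - n_samples) // 2
        return wav[..., start:start + n_samples]
    return F.pad(wav, (0, n_samples - t))


def same_speaker_score(emb_a, emb_b):
    """Cosine similarity between two (B, D) embedding batches, one score per row."""
    return F.cosine_similarity(emb_a, emb_b, dim=-1)
```

For example, `same_speaker_score(model(fit_length(wav_a))["embedding"], model(fit_length(wav_b))["embedding"])` would yield one score per pair; any decision threshold has to be chosen on your own data.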
- Backbone: microsoft/wavlm-base-plus (frozen)
- Datasets: Praxel/codeswitch-pairs-lase, Praxel/codeswitch-pairs-lase-heldout, Praxel/codeswitch-pairs-lase-indian
- License: MIT
```bibtex
@misc{lase2026,
  title={{LASE}: Language-Adversarial Speaker Encoding for {Indic} Cross-Script Identity Preservation},
  author={Menta, Venkata Pushpak Teja},
  year={2026},
  eprint={2605.00777},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
}
```