VocalParse-1.7B

VocalParse is a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Fine-tuned from Qwen3-ASR-1.7B, it transcribes singing audio into a structured autoregressive token sequence that jointly encodes lyrics, pitch, note values, and global tempo (BPM).

Singing Audio (16 kHz) → Whisper Encoder → Qwen LLM Decoder → AST Token Sequence

感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>

Usage

Installation

It is recommended to use uv for setup:

uv venv --python 3.10
source .venv/bin/activate
uv pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
uv pip install git+https://github.com/pymaster17/VocalParse.git

Quick Inference

from vocalparse import transcribe_one

text = transcribe_one(
    audio="path/to/song.wav",
    checkpoint="pymaster/VocalParse",
)
print(text)
# Example output: 感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> ... <BPM_89>

Model Details

| Property | Value |
| --- | --- |
| Base model | Qwen3-ASR-1.7B (Whisper encoder + Qwen LLM decoder) |
| Fine-tuning task | Automatic Singing Transcription (AST) |
| Training mode | CoT (asr_cot=true, bpm_position=last) |
| New vocabulary tokens | ~400 AST tokens (pitch, note value, BPM) |
| Input | Mono 16 kHz singing audio |
| Output | Interleaved lyric + pitch + note sequence with global BPM |

AST Token Vocabulary Extension

The base Qwen3-ASR vocabulary is extended with:

  • Pitch: 128 tokens (<P_0> – <P_127>) representing MIDI notes.
  • Note value: 12 tokens (e.g., <NOTE_4>, <NOTE_8>, <NOTE_DOT_8>).
  • Tempo: 256 tokens (<BPM_0> – <BPM_255>).
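
As a rough illustration of the ranges above, the sketch below enumerates the pitch and tempo tokens and converts a MIDI pitch index to a note name and frequency. The helper names (`midi_to_name`, `midi_to_hz`) are hypothetical and not part of the vocalparse package; the naming convention (MIDI 60 = C4) and A4 = 440 Hz tuning are common assumptions, not something the model card specifies. The 12 note-value tokens are only counted, since the card lists just a few of them.

```python
# Sketch of the pitch and tempo token ranges listed in the model card.
# Helper names are illustrative and not part of the vocalparse package.

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

pitch_tokens = [f"<P_{i}>" for i in range(128)]    # <P_0> .. <P_127>
bpm_tokens = [f"<BPM_{i}>" for i in range(256)]    # <BPM_0> .. <BPM_255>
NUM_NOTE_VALUE_TOKENS = 12                         # count stated in the model card

total_new_tokens = len(pitch_tokens) + NUM_NOTE_VALUE_TOKENS + len(bpm_tokens)


def midi_to_name(midi: int) -> str:
    """Scientific pitch name, assuming the common convention MIDI 60 = C4."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


def midi_to_hz(midi: int, a4_hz: float = 440.0) -> float:
    """Equal-temperament frequency, assuming A4 (MIDI 69) is tuned to a4_hz."""
    return a4_hz * 2.0 ** ((midi - 69) / 12)


print(total_new_tokens)                            # 396
print(midi_to_name(68), round(midi_to_hz(68), 2))  # G#4 415.3
```

The three ranges add up to 396 tokens, which is consistent with the "~400" figure in the Model Details table.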

Output Format

  • Standard interleaved format (bpm_position=last): 感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>

  • CoT format produced during generation (asr_cot=true): the model first outputs plain lyrics, then the full interleaved score, separated by <|file_sep|>: 感受到<|file_sep|>感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>
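
Both formats can be turned into structured data with a small parser. The following is a minimal sketch, not the package's own post-processing: `parse_output` and `NoteEvent` are hypothetical names, and the regular expressions assume that every note is written as lyric text followed by one `<P_n>` token and one `<NOTE_...>` token, as in the examples above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

FILE_SEP = "<|file_sep|>"
# One note event: lyric text, then a pitch token, then a note-value token.
EVENT_RE = re.compile(r"\s*(.*?)\s*<P_(\d+)>\s*<(NOTE_[A-Z0-9_]+)>", re.S)
BPM_RE = re.compile(r"<BPM_(\d+)>")


@dataclass
class NoteEvent:
    lyric: str        # character or word sung on this note
    pitch: int        # MIDI note number taken from <P_n>
    note_value: str   # raw label such as "NOTE_4" or "NOTE_DOT_8"


def parse_output(text: str) -> Tuple[Optional[str], List[NoteEvent], Optional[int]]:
    """Return (plain_lyrics, events, bpm) from a model output string."""
    plain = None
    if FILE_SEP in text:
        # CoT format: plain lyrics come first, the interleaved score second.
        plain, text = text.split(FILE_SEP, 1)
        plain = plain.strip()
    bpm_match = BPM_RE.search(text)
    bpm = int(bpm_match.group(1)) if bpm_match else None
    events = [
        NoteEvent(m.group(1), int(m.group(2)), m.group(3))
        for m in EVENT_RE.finditer(text)
    ]
    return plain, events, bpm


sample = "感受到<|file_sep|>感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> <BPM_89>"
plain, events, bpm = parse_output(sample)
print(plain, [(e.lyric, e.pitch, e.note_value) for e in events], bpm)
```

For input without `<|file_sep|>`, the first element of the returned tuple is `None` and the rest is parsed the same way.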

Evaluation Metrics

Metrics are computed with a two-stage Needleman-Wunsch alignment: lyrics are first aligned at the word level, and the (pitch, note value) pairs inside each matched word are then aligned with each other.

  • CER: Character error rate on lyrics (silence tokens excluded).
  • Pitch MAE: Mean absolute pitch error in MIDI semitones.
  • Note MAE: Mean absolute error in log₂ note-value space.
  • BPM MAE: Mean absolute tempo error.
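
The two-stage procedure can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's evaluation script: the alignment scores (match 0, mismatch -1, gap -1) are chosen so that the word-level error count equals the edit distance, each lyric unit is treated as a single character as in the Mandarin examples, and `note_log2` assumes that `<NOTE_4>` denotes a quarter note and that `DOT` multiplies the length by 1.5. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def needleman_wunsch(ref: Sequence, hyp: Sequence, match: int = 0,
                     mismatch: int = -1, gap: int = -1) -> List[Pair]:
    """Global alignment; returns (ref_index, hyp_index) pairs, None marks a gap."""
    n, m = len(ref), len(hyp)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == hyp[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return pairs[::-1]


def note_log2(label: str) -> float:
    """log2 of the note length in whole notes (ASSUMED label semantics)."""
    parts = label.split("_")            # "NOTE_DOT_8" -> ["NOTE", "DOT", "8"]
    length = 1.0 / int(parts[-1])
    if "DOT" in parts:
        length *= 1.5
    return math.log2(length)


def evaluate(ref, hyp):
    """ref and hyp are lists of (word, [(midi_pitch, note_label), ...])."""
    ref_words = [w for w, _ in ref]
    hyp_words = [w for w, _ in hyp]
    word_pairs = needleman_wunsch(ref_words, hyp_words)   # stage 1: lyrics
    pitch_err, note_err, word_errors = [], [], 0
    for i, j in word_pairs:
        if i is None or j is None or ref_words[i] != hyp_words[j]:
            word_errors += 1            # substitution, deletion or insertion
            continue
        ref_notes, hyp_notes = ref[i][1], hyp[j][1]
        for a, b in needleman_wunsch(ref_notes, hyp_notes):  # stage 2: pairs
            if a is None or b is None:
                continue
            pitch_err.append(abs(ref_notes[a][0] - hyp_notes[b][0]))
            note_err.append(abs(note_log2(ref_notes[a][1]) - note_log2(hyp_notes[b][1])))
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return {"cer": word_errors / max(len(ref_words), 1),
            "pitch_mae": mean(pitch_err), "note_mae": mean(note_err)}


ref = [("感", [(68, "NOTE_4")]), ("受", [(60, "NOTE_8")]), ("到", [(65, "NOTE_8")])]
hyp = [("感", [(67, "NOTE_4")]), ("受", [(60, "NOTE_4")]), ("道", [(65, "NOTE_8")])]
print(evaluate(ref, hyp))
```

In this toy case one of three characters is wrong, and the two matched words contribute one semitone of pitch error and one octave (a factor of two) of note-value error between them. BPM MAE is omitted because it is a plain absolute difference per clip.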

Limitations

  • Primarily trained on Mandarin Chinese singing.
  • Physical note durations are not predicted by this checkpoint.
  • Audio longer than 30 s should be split into shorter segments before inference.
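
One simple way to respect the 30 s limit is to cut the waveform into fixed-length windows before transcription. The sketch below only computes sample ranges; `segment_bounds` is a hypothetical helper, and fixed-length cuts are an assumption made for brevity. In practice it may be preferable to cut at silences so that notes are not split across segments.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000      # input rate stated in the model card
MAX_SECONDS = 30.0        # limit stated under Limitations


def segment_bounds(num_samples: int, sample_rate: int = SAMPLE_RATE,
                   max_seconds: float = MAX_SECONDS) -> List[Tuple[int, int]]:
    """Return [start, end) sample ranges, each at most max_seconds long."""
    max_len = int(max_seconds * sample_rate)
    return [(start, min(start + max_len, num_samples))
            for start in range(0, num_samples, max_len)]


# A 70 s clip yields three segments of 30 s, 30 s and 10 s.
bounds = segment_bounds(70 * SAMPLE_RATE)
print([(end - start) / SAMPLE_RATE for start, end in bounds])  # [30.0, 30.0, 10.0]
```

Each range can then be sliced out of the mono 16 kHz waveform, saved or passed on as its own clip, and transcribed separately.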

Citation

@article{vocalparse2026,
  title   = {VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models},
  author  = {Yukun Chen and Tianrui Wang and Zhaoxi Mu and Xinyu Yang and EngSiong Chng},
  journal = {arXiv preprint arXiv:2605.04613},
  year    = {2026},
  url     = {http://arxiv.org/abs/2605.04613}
}

License

This model is licensed under Apache 2.0.
