VocalParse-1.7B
VocalParse is a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Fine-tuned from Qwen3-ASR-1.7B, it transcribes singing audio into a structured autoregressive token sequence that jointly encodes lyrics, pitch, note values, and global tempo (BPM).
- Paper: VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
- Repository: github.com/pymaster17/VocalParse
Singing Audio (16 kHz) → Whisper Encoder → Qwen LLM Decoder → AST Token Sequence
ζ <P_68> <NOTE_4> ε <P_60> <NOTE_8> ε° <P_65> <NOTE_8> ... <BPM_89>
Usage
Installation
It is recommended to use uv for setup:
uv venv --python 3.10
source .venv/bin/activate
uv pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
uv pip install git+https://github.com/pymaster17/VocalParse.git
Quick Inference
from vocalparse import transcribe_one

text = transcribe_one(
    audio="path/to/song.wav",
    checkpoint="pymaster/VocalParse",
)
print(text)
# Example output: ζ <P_68> <NOTE_4> ε <P_60> <NOTE_8> ... <BPM_89>
Model Details
| Property | Value |
|---|---|
| Base model | Qwen3-ASR-1.7B (Whisper encoder + Qwen LLM decoder) |
| Fine-tuning task | Automatic Singing Transcription (AST) |
| Training mode | CoT (asr_cot=true, bpm_position=last) |
| New vocabulary tokens | ~400 AST tokens (pitch, note value, BPM) |
| Input | Mono 16 kHz singing audio |
| Output | Interleaved lyric + pitch + note sequence with global BPM |
AST Token Vocabulary Extension
The base Qwen3-ASR vocabulary is extended with:
- Pitch: 128 tokens (<P_0>–<P_127>) representing MIDI notes.
- Note value: 12 tokens (e.g., <NOTE_4>, <NOTE_8>, <NOTE_DOT_8>).
- Tempo: 256 tokens (<BPM_0>–<BPM_255>).
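Because the pitch tokens carry MIDI note numbers, they can be converted to note names and frequencies with standard MIDI arithmetic. The sketch below is illustrative only: the helper names are hypothetical and not part of the vocalparse package, and it assumes the common convention in which MIDI 60 is C4 and MIDI 69 (A4) is tuned to 440 Hz.

```python
import re

# Sharp-based spelling of the 12 pitch classes (an arbitrary choice; flats work too).
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
PITCH_TOKEN = re.compile(r"<P_(\d+)>")


def pitch_token_to_midi(token: str) -> int:
    """Extract the MIDI note number from a token such as '<P_68>'."""
    match = PITCH_TOKEN.fullmatch(token)
    if match is None:
        raise ValueError(f"not a pitch token: {token!r}")
    midi = int(match.group(1))
    if not 0 <= midi <= 127:
        raise ValueError(f"MIDI note out of range: {midi}")
    return midi


def midi_to_name(midi: int) -> str:
    """Return a note name, assuming MIDI 60 is written as C4."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


def midi_to_hz(midi: int, a4_hz: float = 440.0) -> float:
    """Equal-temperament frequency, assuming A4 (MIDI 69) = a4_hz."""
    return a4_hz * 2.0 ** ((midi - 69) / 12)


midi = pitch_token_to_midi("<P_68>")
print(midi, midi_to_name(midi), round(midi_to_hz(midi), 2))
```

Under these conventions, <P_68> corresponds to G#4 and <P_60> to C4.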
Output Format
Standard interleaved format (bpm_position=last):
ζ <P_68> <NOTE_4> ε <P_60> <NOTE_8> ε° <P_65> <NOTE_8> ... <BPM_89>
CoT format produced during generation (asr_cot=true): the model first outputs plain lyrics, then the full interleaved score, separated by <|file_sep|>:
ζεε°<|file_sep|>ζ <P_68> <NOTE_4> ε <P_60> <NOTE_8> ε° <P_65> <NOTE_8> ... <BPM_89>
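The output string can be turned into a structured score with a small parser. The following is a minimal sketch, not the repository's own post-processing: the names `Word` and `parse_output` are hypothetical, tokens are assumed to be whitespace-separated in the interleaved part, and any token that is not a pitch, note-value, or BPM token is treated as a lyric unit (a lyric unit may carry more than one pitch/note pair).

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

FILE_SEP = "<|file_sep|>"
PITCH_RE = re.compile(r"<P_(\d+)>")
NOTE_RE = re.compile(r"<(NOTE_\w+)>")
BPM_RE = re.compile(r"<BPM_(\d+)>")


@dataclass
class Word:
    lyric: str
    # (MIDI pitch, note-value token name) pairs sung on this lyric unit
    pairs: List[Tuple[int, str]] = field(default_factory=list)


def parse_output(text: str) -> Tuple[Optional[str], List[Word], Optional[int]]:
    """Split a model output into (plain lyrics, words, bpm)."""
    lyrics: Optional[str] = None
    score = text
    if FILE_SEP in text:  # CoT format: lyrics first, then the score
        lyrics, score = text.split(FILE_SEP, 1)
        lyrics = lyrics.strip()

    words: List[Word] = []
    bpm: Optional[int] = None
    pending_pitch: Optional[int] = None
    for tok in score.split():
        if m := BPM_RE.fullmatch(tok):
            bpm = int(m.group(1))
        elif m := PITCH_RE.fullmatch(tok):
            pending_pitch = int(m.group(1))
        elif m := NOTE_RE.fullmatch(tok):
            # Attach the pair only if a lyric unit and a pitch precede it.
            if words and pending_pitch is not None:
                words[-1].pairs.append((pending_pitch, m.group(1)))
            pending_pitch = None
        else:
            words.append(Word(lyric=tok))
    return lyrics, words, bpm


lyrics, words, bpm = parse_output("la <P_68> <NOTE_4> la <P_60> <NOTE_8> <BPM_89>")
print(lyrics, bpm, [(w.lyric, w.pairs) for w in words])
```

Malformed sequences (for example, a note-value token with no preceding pitch) are skipped silently here; a stricter parser might raise instead.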
Evaluation Metrics
Metrics are computed with two-stage Needleman-Wunsch alignment: word-level alignment for lyrics, then pair-level alignment inside each matched word for pitch and note.
- CER: Character error rate on lyrics (silence tokens excluded).
- Pitch MAE: Mean absolute pitch error in MIDI semitones.
- Note MAE: Mean absolute error in log₂ note-value space.
- BPM MAE: Mean absolute tempo error.
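To make the metric definitions concrete, here is a simplified sketch of a Needleman-Wunsch alignment with unit costs, a CER built on top of it, and the two MAE variants. It is not the project's evaluation code: the function names are hypothetical, the scoring scheme (match 0, substitution 1, gap 1) is an assumption, silence-token filtering and the second, pair-level alignment stage are omitted, and the mapping from note-value tokens to the numbers passed to `log2_mae` is not specified in this card.

```python
import math
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Optional[object], Optional[object]]


def nw_align(ref: Sequence, hyp: Sequence, gap: int = 1, sub: int = 1) -> List[Pair]:
    """Global alignment; None marks a gap on either side."""
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else sub)
            cost[i][j] = min(diag, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner, preferring diagonal moves.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            0 if ref[i - 1] == hyp[j - 1] else sub
        ):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            pairs.append((ref[i - 1], None))  # deletion
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # insertion
            j -= 1
    pairs.reverse()
    return pairs


def cer(ref: str, hyp: str) -> float:
    """Character error rate: (substitutions + deletions + insertions) / len(ref)."""
    errors = sum(1 for r, h in nw_align(ref, hyp) if r != h)
    return errors / max(len(ref), 1)


def mean_abs_error(pairs: Sequence[Tuple[float, float]]) -> float:
    """MAE over already-matched (reference, hypothesis) values, e.g. MIDI pitches or BPM."""
    return sum(abs(r - h) for r, h in pairs) / len(pairs)


def log2_mae(pairs: Sequence[Tuple[float, float]]) -> float:
    """MAE after mapping both (positive) values to log2 space."""
    return sum(abs(math.log2(r) - math.log2(h)) for r, h in pairs) / len(pairs)


print(cer("abcd", "abxd"), mean_abs_error([(68, 67), (60, 62)]), log2_mae([(4, 8), (8, 8)]))
```

Working in log₂ space means that confusing two note values that differ by a factor of two counts as an error of 1, regardless of the absolute note length.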
Limitations
- Primarily trained on Mandarin Chinese singing.
- Physical note durations are not predicted by this checkpoint.
- Audio longer than 30 s should be split into shorter segments before inference.
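One simple way to respect the 30-second limit is to cut the waveform into fixed-length chunks and transcribe each chunk separately. The sketch below only computes sample boundaries; `chunk_bounds` is a hypothetical helper, not part of the vocalparse package, and the card does not prescribe a segmentation strategy. Fixed cuts can split a syllable or note, so cutting at silences is likely preferable in practice.

```python
from typing import List, Tuple


def chunk_bounds(
    num_samples: int, sample_rate: int = 16000, max_seconds: float = 30.0
) -> List[Tuple[int, int]]:
    """Return [start, end) sample ranges, each at most max_seconds long."""
    if num_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("num_samples must be >= 0; sample_rate and max_seconds must be > 0")
    max_len = int(sample_rate * max_seconds)
    return [
        (start, min(start + max_len, num_samples))
        for start in range(0, num_samples, max_len)
    ]


# A 70-second recording at 16 kHz yields two full chunks and one 10-second remainder.
for start, end in chunk_bounds(16000 * 70):
    print(start, end, (end - start) / 16000)
```

Each range can then be used to slice a mono 16 kHz waveform array (`waveform[start:end]`) before passing the chunk to the model. Since BPM is emitted once per output, each chunk will receive its own tempo estimate.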
Citation
@article{vocalparse2026,
title = {VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models},
author = {Yukun Chen and Tianrui Wang and Zhaoxi Mu and Xinyu Yang and EngSiong Chng},
journal = {arXiv preprint arXiv:2605.04613},
year = {2026},
url = {http://arxiv.org/abs/2605.04613}
}
License
This model is licensed under Apache 2.0.