VocalParse-1.7B

VocalParse is a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Fine-tuned from Qwen3-ASR-1.7B, it transcribes singing audio into a structured autoregressive token sequence that jointly encodes lyrics, pitch, note values, and global tempo (BPM).

Paper: VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
Repository: github.com/pymaster17/VocalParse

Singing Audio (16kHz) → Whisper Encoder → Qwen LLM Decoder → AST Token Sequence

感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>

Usage

Installation

It is recommended to use uv for setup:

uv venv --python 3.10
source .venv/bin/activate
uv pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
uv pip install git+https://github.com/pymaster17/VocalParse.git

Quick Inference

from vocalparse import transcribe_one

text = transcribe_one(
    audio="path/to/song.wav",
    checkpoint="pymaster/VocalParse",
)
print(text)
# Example output: 感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> ... <BPM_89>
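
The returned string can be turned into structured notes with a small parser. The following is a minimal sketch and not part of the vocalparse package: `Note`, `parse_transcription`, and both regular expressions are hypothetical helpers that assume the token layout shown in the example output (a lyric unit followed by `<P_n>` and `<NOTE_x>`, with a single `<BPM_n>` token). Silence or other special tokens are not handled.

```python
import re
from typing import NamedTuple, Optional


class Note(NamedTuple):
    lyric: str
    pitch: int       # MIDI number taken from <P_n>
    note_value: str  # raw suffix of <NOTE_...>, e.g. "4" or "DOT_8"


# One lyric unit followed by its pitch token and note-value token.
NOTE_RE = re.compile(r"([^\s<>]+)\s*<P_(\d+)>\s*<NOTE_([A-Z0-9_]+)>")
BPM_RE = re.compile(r"<BPM_(\d+)>")


def parse_transcription(text: str) -> tuple[list[Note], Optional[int]]:
    """Return (notes, bpm); bpm is None when no <BPM_n> token is present."""
    notes = [Note(m[1], int(m[2]), m[3]) for m in NOTE_RE.finditer(text)]
    bpm = BPM_RE.search(text)
    return notes, int(bpm.group(1)) if bpm else None
```

Calling `parse_transcription(text)` on the output above would give one `Note` per lyric unit plus the global tempo as an integer.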

Model Details

| Property | Value |
|---|---|
| Base model | Qwen3-ASR-1.7B (Whisper encoder + Qwen LLM decoder) |
| Fine-tuning task | Automatic Singing Transcription (AST) |
| Training mode | CoT (`asr_cot=true`, `bpm_position=last`) |
| New vocabulary tokens | ~400 AST tokens (pitch, note value, BPM) |
| Input | Mono 16 kHz singing audio |
| Output | Interleaved lyric + pitch + note sequence with global BPM |

AST Token Vocabulary Extension

The base Qwen3-ASR vocabulary is extended with:

Pitch: 128 tokens (<P_0> – <P_127>) representing MIDI notes.
Note value: 12 tokens (e.g., <NOTE_4>, <NOTE_8>, <NOTE_DOT_8>).
Tempo: 256 tokens (<BPM_0> – <BPM_255>).
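
As a quick sanity check, the three groups add up to 128 + 12 + 256 = 396 tokens, which matches the "~400" figure under Model Details. The sketch below rebuilds the pitch and tempo token names from the ranges given above and adds two helpers for reading pitch tokens. `midi_to_name` and `midi_to_hz` are illustrative helpers, not part of the model's tooling, and they assume the common conventions MIDI 60 = C4 and A4 = 440 Hz. The 12 note-value tokens are only counted, since the full list is not given here.

```python
# Token names reconstructed from the ranges stated above.
PITCH_TOKENS = [f"<P_{i}>" for i in range(128)]
BPM_TOKENS = [f"<BPM_{i}>" for i in range(256)]
NUM_NOTE_VALUE_TOKENS = 12  # count only; the full token list is not reproduced

TOTAL_NEW_TOKENS = len(PITCH_TOKENS) + NUM_NOTE_VALUE_TOKENS + len(BPM_TOKENS)

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_name(midi: int) -> str:
    """Scientific pitch notation, assuming MIDI 60 = C4."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


def midi_to_hz(midi: int, a4: float = 440.0) -> float:
    """Equal-tempered frequency, assuming MIDI 69 = A4 = 440 Hz."""
    return a4 * 2 ** ((midi - 69) / 12)
```

Under these conventions, `<P_68>` from the example output corresponds to G#4 and `<P_60>` to C4.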

Output Format

Standard interleaved format (`bpm_position=last`):
感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>

CoT format produced during generation (`asr_cot=true`): the model first outputs plain lyrics, then the full interleaved score, separated by <|file_sep|>:
感受到<|file_sep|>感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>
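
If raw CoT output is available, the two parts can be separated on the `<|file_sep|>` marker. This is a minimal sketch; `split_cot` is a hypothetical helper, and whether `transcribe_one` returns the raw CoT string or only the score is not stated here, so the function also accepts input without a separator.

```python
FILE_SEP = "<|file_sep|>"


def split_cot(output: str) -> tuple[str, str]:
    """Return (plain_lyrics, interleaved_score).

    If the separator is absent, the whole string is treated as the score
    and the lyrics part is empty.
    """
    lyrics, sep, score = output.partition(FILE_SEP)
    if not sep:
        return "", output.strip()
    return lyrics.strip(), score.strip()
```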

Evaluation Metrics

Metrics are computed after a two-stage Needleman-Wunsch alignment: lyrics are first aligned at the word level, and then, inside each matched word, the (pitch, note value) pairs are aligned.

CER: Character error rate on lyrics (silence tokens excluded).
Pitch MAE: Mean absolute pitch error in MIDI semitones.
Note MAE: Mean absolute error in log₂ note-value space.
BPM MAE: Mean absolute tempo error in BPM.
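
The two-stage alignment and the first three metrics can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script used for the model: the scoring (0 for a match, -1 for a mismatch or gap, which makes the alignment an edit-distance alignment), the data layout, and the names `needleman_wunsch` and `evaluate` are all hypothetical. It also assumes each lyric unit is a single character and that note values are given as plain denominators (4 for `<NOTE_4>`, 8 for `<NOTE_8>`); dotted values and silence tokens are left out.

```python
import math


def needleman_wunsch(ref, hyp, mismatch=-1, gap=-1):
    """Global alignment; returns (ref_index, hyp_index) pairs, None marks a gap."""
    n, m = len(ref), len(hyp)

    def sub(i, j):
        return 0 if ref[i - 1] == hyp[j - 1] else mismatch

    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return pairs[::-1]


def _mean(values):
    return sum(values) / len(values) if values else float("nan")


def evaluate(ref_words, hyp_words):
    """Each word is (lyric, [(midi_pitch, note_denominator), ...])."""
    ref_lyrics = [w[0] for w in ref_words]
    hyp_lyrics = [w[0] for w in hyp_words]
    errors, pitch_err, note_err = 0, [], []
    # Stage 1: align lyrics.
    for i, j in needleman_wunsch(ref_lyrics, hyp_lyrics):
        if i is None or j is None or ref_lyrics[i] != hyp_lyrics[j]:
            errors += 1  # insertion, deletion, or substitution
            continue
        # Stage 2: align (pitch, note value) pairs inside the matched word.
        ref_notes, hyp_notes = ref_words[i][1], hyp_words[j][1]
        for a, b in needleman_wunsch(ref_notes, hyp_notes):
            if a is None or b is None:
                continue
            pitch_err.append(abs(ref_notes[a][0] - hyp_notes[b][0]))
            note_err.append(abs(math.log2(ref_notes[a][1]) - math.log2(hyp_notes[b][1])))
    return {
        "cer": errors / max(len(ref_words), 1),
        "pitch_mae": _mean(pitch_err),
        "note_mae": _mean(note_err),
    }
```

In log₂ note-value space, confusing a quarter note with an eighth note costs exactly 1, and with a sixteenth note 2, so the metric counts halvings or doublings of the note value. BPM MAE needs no alignment, since each segment has a single tempo value.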

Limitations

Primarily trained on Mandarin Chinese singing.
Physical note durations are not predicted by this checkpoint.
Audio longer than 30 s should be split into shorter segments before inference.
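
One simple way to stay under the 30-second limit is to cut the waveform into fixed-length windows. The helper below is a hypothetical sketch that only computes sample ranges for 16 kHz input; `chunk_bounds` is not part of vocalparse. Fixed windows can cut through a sustained note or a word, so splitting at silences is likely a better choice in practice, and each segment would yield its own BPM token.

```python
def chunk_bounds(num_samples: int,
                 sample_rate: int = 16000,
                 max_seconds: float = 30.0) -> list[tuple[int, int]]:
    """Return [start, end) sample ranges, each at most max_seconds long."""
    step = int(sample_rate * max_seconds)
    if step <= 0:
        raise ValueError("sample_rate * max_seconds must be positive")
    return [(start, min(start + step, num_samples))
            for start in range(0, num_samples, step)]
```

Each `(start, end)` range can then be used to slice the waveform, write the slice to a file, and pass that file to `transcribe_one`.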

Citation

@article{vocalparse2026,
  title   = {VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models},
  author  = {Yukun Chen and Tianrui Wang and Zhaoxi Mu and Xinyu Yang and EngSiong Chng},
  journal = {arXiv preprint arXiv:2605.04613},
  year    = {2026},
  url     = {http://arxiv.org/abs/2605.04613}
}

License

This model is licensed under Apache 2.0.
