Abstract
A large-scale piano MIDI dataset called PianoCoRe is introduced, featuring unified and refined open-source corpora with diverse performances and note-level alignments for music information retrieval applications.
Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. This work presents PianoCoRe, a large-scale piano MIDI dataset that unifies and refines major open-source piano corpora. The dataset contains 250,046 performances of 5,625 pieces written by 483 composers, totaling 21,763 hours of performed music. PianoCoRe is released in tiered subsets to support different applications, from large-scale analysis and pre-training (PianoCoRe-C and the deduplicated PianoCoRe-B) to expressive performance modeling with note-level score alignment (PianoCoRe-A/A*). The note-aligned subset, PianoCoRe-A, is the largest open-source collection of its kind to date, with 157,207 performances aligned to 1,591 scores. Beyond the dataset itself, this work contributes (1) a MIDI quality classifier for detecting corrupted and score-like transcriptions and (2) RAScoP, an alignment refinement pipeline that cleans temporal alignment errors and interpolates missing notes. The analysis shows that the refinement reduces temporal noise and eliminates tempo outliers. Moreover, an expressive performance rendering model trained on PianoCoRe is more robust to unseen pieces than models trained on raw or smaller datasets. PianoCoRe provides a ready-to-use foundation for the next generation of expressive piano performance research.
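To give a feel for the scale these headline figures imply, the short sketch below derives a few averages from the counts quoted in the abstract. Only the six input numbers come from the source; the function name `dataset_ratios` and the derived quantities are illustrative, and the results are plain means that say nothing about how performances are distributed across pieces or composers.

```python
# Headline counts quoted in the abstract.
PERFORMANCES = 250_046
PIECES = 5_625
COMPOSERS = 483
TOTAL_HOURS = 21_763
ALIGNED_PERFORMANCES = 157_207  # PianoCoRe-A
ALIGNED_SCORES = 1_591          # PianoCoRe-A


def dataset_ratios() -> dict:
    """Derive simple averages from the headline counts (illustrative only)."""
    return {
        "avg_minutes_per_performance": TOTAL_HOURS * 60 / PERFORMANCES,
        "avg_performances_per_piece": PERFORMANCES / PIECES,
        "avg_pieces_per_composer": PIECES / COMPOSERS,
        "aligned_share": ALIGNED_PERFORMANCES / PERFORMANCES,
        "avg_aligned_performances_per_score": ALIGNED_PERFORMANCES / ALIGNED_SCORES,
    }


if __name__ == "__main__":
    for name, value in dataset_ratios().items():
        print(f"{name}: {value:.2f}")
```

On these numbers, a performance lasts a little over 5 minutes on average, each piece has roughly 44 performances, and about 63% of all performances fall in the note-aligned subset.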
Community
PianoCoRe is a large-scale piano MIDI dataset of annotated and note-aligned scores and performances. It unifies multiple open-source corpora into a single, ready-to-use resource for piano performance modeling.
- 250,046 performances (21k+ hours) covering 5,625 pieces by 483 composers
- 157,207 performances with refined note-level alignments to scores
- Quality labels computed with a trained MIDI Quality Classifier
- Alignment refinement pipeline (RAScoP), integrated into the symupe library
- Dataset: https://huggingface.co/datasets/SyMuPe/PianoCoRe
- GitHub: https://github.com/ilya16/PianoCoRe
- TISMIR: https://doi.org/10.5334/tismir.333
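The post does not describe how RAScoP fills in notes that the aligner failed to match, so the following is only a generic illustration of the idea of interpolating missing notes, not the pipeline's actual algorithm or the symupe API. It assumes each score note has an onset in score time (for example, beats) and an optional matched performance onset in seconds, and it fills the gaps by linear interpolation between the nearest aligned neighbours. The function name `interpolate_missing_onsets` and the data layout are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional


def interpolate_missing_onsets(
    score_onsets: List[float],
    perf_onsets: List[Optional[float]],
) -> List[Optional[float]]:
    """Fill None entries in perf_onsets by linear interpolation over score time.

    score_onsets: note onsets in score time (e.g. beats), sorted ascending.
    perf_onsets:  matched performance onsets in seconds, None where unaligned.
    Gaps before the first or after the last aligned note stay None, because
    extrapolating there would require an extra tempo assumption.
    """
    if len(score_onsets) != len(perf_onsets):
        raise ValueError("score_onsets and perf_onsets must have equal length")

    # Anchors: indices of notes that already have a performance onset.
    anchors = [i for i, t in enumerate(perf_onsets) if t is not None]
    filled = list(perf_onsets)

    for i, t in enumerate(perf_onsets):
        if t is not None:
            continue
        pos = bisect_left(anchors, i)  # number of anchors to the left of i
        if pos == 0 or pos == len(anchors):
            continue  # no anchor on one side; leave the gap unfilled
        left, right = anchors[pos - 1], anchors[pos]
        span = score_onsets[right] - score_onsets[left]
        if span == 0:
            # Same score position (e.g. a chord): reuse the left anchor's time.
            filled[i] = perf_onsets[left]
            continue
        frac = (score_onsets[i] - score_onsets[left]) / span
        filled[i] = perf_onsets[left] + frac * (perf_onsets[right] - perf_onsets[left])
    return filled


if __name__ == "__main__":
    print(interpolate_missing_onsets([0, 1, 2, 3, 4], [0.0, None, 1.0, None, None]))
```

Linear interpolation amounts to assuming a locally constant tempo between two aligned notes. A real refinement pipeline would likely also need to handle note offsets and velocities, which this sketch leaves out.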
