This dataset comprises synthetic drum scores created with RhythmForm, which uses a Markov-chain model to decide the "most likely next bar". The data is intended for training a transformer model that converts images or scanned PDFs of drum scores into editable, digitised drum scores. The transformer is intended to output Symbolic Music Text (SMT), which is then converted to MusicXML programmatically.
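To illustrate the "most likely next bar" idea, here is a minimal sketch of a first-order Markov chain over bars. It is an assumption-laden toy, not RhythmForm's actual implementation: the function names and the bar labels ("groove", "fill", "crash") are hypothetical, and the real model may use a different order, a different bar representation, or sampling rather than always taking the most frequent successor.

```python
from collections import Counter, defaultdict


def train_bar_chain(scores):
    """Count how often each bar is directly followed by each other bar."""
    transitions = defaultdict(Counter)
    for bars in scores:
        for current, following in zip(bars, bars[1:]):
            transitions[current][following] += 1
    return transitions


def most_likely_next_bar(transitions, current):
    """Return the most frequent successor of `current`, or None if it has none."""
    followers = transitions.get(current)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical bar labels standing in for real bar content.
toy_scores = [
    ["groove", "groove", "groove", "fill"],
    ["groove", "groove", "fill", "crash"],
]
chain = train_bar_chain(toy_scores)
next_bar = most_likely_next_bar(chain, "groove")
```

In this toy corpus "groove" is followed by "groove" three times and by "fill" twice, so `next_bar` is "groove".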
- 150,000 scores of varying length are included as: PDF (1 per score); MusicXML (1 or 2* per score); PNG (1 per page per score); and SMT (1 per page per score) files.
- log files corresponding to 23 data synthesis runs are included
- a dataset.json file is included for training a transformer model on the dataset, as described in https://github.com/DrumScoreAI/RhythmForm
- vocabulary files are included: all_tokens_corpus.smt; full_tokenizer_vocab.json; markov_training_corpus.smt; merged_tokenizer_vocab.json.
- a pickled Markov-chain model is included: markov_model.pkl
- total number of files: 700,859

*Depending on the use_repeat_bars condition, either one MusicXML file (no repeat bars) or two MusicXML files (one with repeat bars and an equivalent one without) are generated.
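After downloading, the file counts above can be checked with a short script. The following is a sketch only: the function names are hypothetical, the directory layout is not assumed (the whole tree is walked), and the exact file extensions used for each format should be read from the output rather than taken from this example.

```python
from collections import Counter
from pathlib import Path

# Total stated in the dataset description.
EXPECTED_TOTAL = 700_859


def count_files_by_extension(root):
    """Walk `root` recursively and count regular files per lowercase extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


def matches_expected_total(counts, expected=EXPECTED_TOTAL):
    """Return True if the number of files found equals the stated total."""
    return sum(counts.values()) == expected


# Example usage (the path is a placeholder for the extracted dataset):
# counts = count_files_by_extension("path/to/dataset")
# for extension, n in counts.most_common():
#     print(extension, n)
# print("matches stated total:", matches_expected_total(counts))
```

A mismatch does not necessarily indicate a corrupted download; it may also mean that the archive was only partly extracted or that extra files (for example, from the operating system) are present in the directory.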