- Training datasets: MTT and MSD
- Included models: musicnn and vgg
- Python interface: music tagger
- Python interface: music feature extractor
- Command-line interface: music tagger
This repository contains a set of pre-trained deep convolutional neural networks for music audio tagging, trained with two different datasets: the MagnaTagATune dataset (`MTT`, ~19k training songs) and the Million Song Dataset (`MSD`, ~200k training songs).
The `MTT` dataset considers this 50-tag vocabulary: guitar, classical, slow, techno, strings, drums, electronic, rock, fast, piano, ambient, beat, violin, vocal, synth, female, indian, opera, male, singing, vocals, no vocals, harpsichord, loud, quiet, flute, woman, male vocal, no vocal, pop, soft, sitar, solo, man, classic, choir, voice, new age, dance, male voice, female vocal, beats, harp, cello, no voice, weird, country, metal, female voice, choral.
The `MSD` dataset considers this 50-tag vocabulary: rock, pop, alternative, indie, electronic, female vocalists, dance, 00s, alternative rock, jazz, beautiful, metal, chillout, male vocalists, classic rock, soul, indie rock, Mellow, electronica, 80s, folk, 90s, chill, instrumental, punk, oldies, blues, hard rock, ambient, acoustic, experimental, female vocalist, guitar, Hip-Hop, 70s, party, country, easy listening, sexy, catchy, funk, electro, heavy metal, Progressive rock, 60s, rnb, indie pop, sad, House, happy.
`musicnn` is a musically motivated convolutional neural network (CNN) trained for music audio tagging. It consists of a musically motivated CNN front-end, a densely connected mid-end, and a temporal-pooling back-end for the output layers. For more details about musicnn, check this jupyter notebook.
For the Million Song Dataset, we also include a larger model with more capacity: `MSD_musicnn_big`. This model is based on `musicnn`, but its dense-layered mid-end has 512 filters (instead of 64) and its temporal-pooling back-end has 500 units (instead of 200).
The following models are available: `'MTT_musicnn'`, `'MSD_musicnn'`, and `'MSD_musicnn_big'`.
Important! `'MSD_musicnn_big'` is only available if you install from source: `python setup.py install`.
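Any of these models can be selected through the `model` argument of the interfaces documented below. A minimal sketch, where the audio path is a placeholder to replace with any local file:

```python
from musicnn.tagger import top_tags

# './audio/song.mp3' is a placeholder - point it to any local audio file.
tags = top_tags('./audio/song.mp3', model='MSD_musicnn', topN=5)
print(tags)  # the 5 most likely tags from the MSD 50-tag vocabulary
```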
`vgg` is a computer vision baseline model that we trained for music audio tagging. This naive adaptation of a vision CNN to audio spectrograms stacks several 3x3 CNN layers with max pooling. For more details about our vgg implementation, check this jupyter notebook.
The following models are available: `'MTT_vgg'` and `'MSD_vgg'`.
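The `vgg` models are selected the same way. A minimal sketch with a placeholder path; note that, as discussed under `input_length` below, `vgg` models only accept the default 3-second input patches:

```python
from musicnn.tagger import top_tags

# vgg models require the default input_length of 3 seconds.
tags = top_tags('./audio/song.mp3', model='MTT_vgg', topN=3)
```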
```python
tags = musicnn.tagger.top_tags(file_name, model='MTT_musicnn', topN=3, input_length=3, input_overlap=None, print_tags=True, save_tags=False)
```
Predict the `topN` tags of the music clip in `file_name` with the selected `model`.
Input
- file_name: path to the music file to tag.
  - Data format: string.
  - Example: `'./audio/TRWJAZW128F42760DD_test.mp3'`
- model: select a music audio tagging model.
  - Data format: string.
  - Options: `'MTT_musicnn'`, `'MTT_vgg'`, `'MSD_musicnn'`, `'MSD_musicnn_big'` or `'MSD_vgg'`. `MTT` models are trained with the MagnaTagATune dataset. `MSD` models are trained with the Million Song Dataset. To know more about these models, check our musicnn / vgg examples, and the FAQs.
  - Important! `'MSD_musicnn_big'` is only available if you install from source: `python setup.py install`.
- topN: extract the N most likely tags according to the selected model.
  - Data format: integer.
  - Example: `3`
- input_length: length (in seconds) of the input spectrogram patches. Set it small for real-time applications.
  - Note: this is the length of the data that is going to be fed to the model. In other words, this parameter defines the temporal resolution of the taggram.
  - Recommended value: 3, because the models were trained with 3-second inputs.
  - Observation: the `vgg` models do not allow for different input lengths. For this reason, the `vgg` models' `input_length` needs to be set to 3. However, `musicnn` models allow for different input lengths: see this jupyter notebook.
  - Data format: floating point number.
  - Example: `3.1`
- input_overlap: amount of overlap (in seconds) of the input spectrogram patches.
  - Note: set it considering the `input_length`.
  - Data format: floating point number.
  - Example: `1.0`
- print_tags: set it to `True` to print the tags.
  - Note: even if you don't print the tags, they will still be returned by the `musicnn.tagger.top_tags()` function.
  - Data format: boolean.
  - Options: `False` (do not print the tags), `True` (print the tags).
- save_tags: path where to store/save the tags.
  - Data format: string.
  - Example: `'file_name.tags'`
Output
- tags: the `topN` most likely tags of the music clip in `file_name` considering the selected `model`.
  - Data format: list.
  - Example: `['synth', 'techno']`
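A usage sketch combining several of these arguments (the audio path is a placeholder and, per the observation above, non-default input lengths only make sense with `musicnn` models):

```python
from musicnn.tagger import top_tags

# A musicnn model with 1-second patches and 0.5 seconds of overlap,
# trading accuracy for a finer temporal resolution of the taggram.
tags = top_tags('./audio/song.mp3', model='MSD_musicnn',
                topN=10, input_length=1.0, input_overlap=0.5,
                print_tags=False, save_tags='song.tags')
```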
```python
taggram, tags, features = musicnn.extractor.extractor(file_name, model='MTT_musicnn', input_length=3, input_overlap=None, extract_features=True)
```
Extract the `taggram` (the temporal evolution of tags) and `features` (intermediate representations of the model) of the music clip in `file_name` with the selected `model`.
Input
- file_name: path to the music file to tag.
  - Data format: string.
  - Example: `'./audio/TRWJAZW128F42760DD_test.mp3'`
- model: select a music audio tagging model.
  - Data format: string.
  - Options: `'MTT_musicnn'`, `'MTT_vgg'`, `'MSD_musicnn'`, `'MSD_musicnn_big'` or `'MSD_vgg'`. `MTT` models are trained with the MagnaTagATune dataset. `MSD` models are trained with the Million Song Dataset. To know more about these models, check our musicnn / vgg examples, and the FAQs.
  - Important! `'MSD_musicnn_big'` is only available if you install from source: `python setup.py install`.
- input_length: length (in seconds) of the input spectrogram patches. Set it small for real-time applications.
  - Note: this is the length of the data that is going to be fed to the model. In other words, this parameter defines the temporal resolution of the taggram.
  - Recommended value: 3, because the models were trained with 3-second inputs.
  - Observation: the `vgg` models do not allow for different input lengths. For this reason, the `vgg` models' `input_length` needs to be set to 3. However, `musicnn` models allow for different input lengths: see this jupyter notebook.
  - Data format: floating point number.
  - Example: `3.1`
- input_overlap: amount of overlap (in seconds) of the input spectrogram patches.
  - Note: set it considering the `input_length`.
  - Data format: floating point number.
  - Example: `1.0`
- extract_features: set it to `True` to extract the intermediate representations of the model.
  - Data format: boolean.
  - Options: `False` (do not extract the features), `True` (extract the features).
Output
- taggram: expresses the temporal evolution of the tag likelihoods.
  - Data format: 2D np.ndarray (time, tags).
  - Example: see our basic / advanced examples.
- tags: list of tags corresponding to the tag indices of the taggram.
  - Data format: list.
  - Example: see our FAQs page for the complete tags list.
- features: if `extract_features = True`, it outputs a dictionary containing the activations of the different layers of the selected model.
  - Data format: dictionary.
  - Keys (musicnn models): `['timbral', 'temporal', 'cnn1', 'cnn2', 'cnn3', 'mean_pool', 'max_pool', 'penultimate']`
  - Keys (vgg models): `['pool1', 'pool2', 'pool3', 'pool4', 'pool5']`
  - Example: see our musicnn and vgg examples.
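A usage sketch tying these outputs together (the audio path is a placeholder; the printed shapes depend on the clip's duration and the chosen `input_length`):

```python
import numpy as np
from musicnn.extractor import extractor

taggram, tags, features = extractor('./audio/song.mp3',
                                    model='MTT_musicnn',
                                    input_length=3,
                                    extract_features=True)

print(taggram.shape)          # (time, tags): one row per 3-second patch
print(tags[:5])               # first tags of the model's vocabulary
print(list(features.keys()))  # layer names, e.g. 'timbral', ..., 'penultimate'

# Most likely tag over the whole clip: average the taggram over time.
print(tags[int(np.argmax(taggram.mean(axis=0)))])
```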
```
python -m musicnn.tagger file_name --model 'MTT_musicnn' --topN 3 --length 3 --overlap 3 --print --save file.tags
```
Predict the `topN` tags of the music clip in `file_name` with the selected `model`.
Arguments
- file_name: path to the music file to tag.
  - Data format: string.
  - Example: `python -m musicnn.tagger music.mp3`
- --model (-mod): select a music audio tagging model.
  - Data format: string.
  - Options: `MTT_musicnn`, `MTT_vgg`, `MSD_musicnn`, `MSD_musicnn_big` or `MSD_vgg`. `MTT` models are trained with the MagnaTagATune dataset. `MSD` models are trained with the Million Song Dataset. To know more about these models, check our musicnn / vgg examples, and the FAQs.
  - Important! `MSD_musicnn_big` is only available if you install from source: `python setup.py install`.
  - Default: `MTT_musicnn`
  - Example: `python -m musicnn.tagger music.mp3 --model MSD_musicnn`
- --topN (-n): extract the N most likely tags according to the selected model.
  - Data format: integer.
  - Default: 3
  - Example: `python -m musicnn.tagger music.mp3 --topN 10`
- --length (-len): length (in seconds) of the input spectrogram patches. Set it small for real-time applications.
  - Note: this is the length of the data that is going to be fed to the model. In other words, this parameter defines the temporal resolution of the taggram.
  - Recommended value: 3, because the models were trained with 3-second inputs.
  - Observation: the `vgg` models do not allow for different input lengths. For this reason, the `vgg` models' input length needs to be set to 3. However, `musicnn` models allow for different input lengths: see this jupyter notebook.
  - Data format: floating point number.
  - Default: 3.0
  - Example: `python -m musicnn.tagger music.mp3 -len 3.1`
- --overlap (-ov): amount of overlap (in seconds) of the input spectrogram patches.
  - Note: set it considering the input length.
  - Data format: floating point number.
  - Default: 3.0
  - Example: `python -m musicnn.tagger music.mp3 -ov 1.0`
- --print (-p): employ this flag to print the tags.
  - Data format: boolean.
  - Example: `python -m musicnn.tagger music.mp3 --print`
- --save (-s): path where to store/save the tags.
  - Data format: string.
  - Output data format: csv.
  - Example: `python -m musicnn.tagger music.mp3 --save out.tags`
Output
- tags: the `topN` most likely tags of the music clip in `file_name` considering the selected `model`.
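Putting it together, a full invocation with every argument set explicitly (file names are placeholders):

```
python -m musicnn.tagger music.mp3 --model MSD_musicnn --topN 5 --length 3 --overlap 1.0 --print --save music.tags
```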