This repository provides a basic approach for predicting the music genre of WAV files, using a deep convolutional network trained on the well-known GTZAN dataset.
A Flask application and a minimal Dash web application provide a simple prediction demo on jazz, reggae, and metal tracks. The prediction is performed in real time while the music is playing.
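As a rough illustration of what one real-time prediction step might look like, here is a minimal sketch; the model, the preprocessing of the current 3-second window, and the genre list are assumptions for illustration, not the apps' actual code:

```python
import numpy as np

# GTZAN genres in alphabetical order (assumed label ordering)
GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

def predict_genre(model, spectrogram_window):
    """Classify one 3-second spectrogram window with a trained Keras model."""
    x = spectrogram_window[np.newaxis, ..., np.newaxis]  # add batch and channel axes
    probs = model.predict(x, verbose=0)[0]
    return GENRES[int(np.argmax(probs))]
```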
References:
- Audio Deep Learning Made Simple: Sound Classification, Step-by-Step
- Tzanetakis, G., Essl, G., & Cook, P. (2001). Automatic Musical Genre Classification of Audio Signals. Proceedings of the International Society for Music Information Retrieval (ISMIR).
- Sturm, B. L. (2013). The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use. arXiv preprint.
Dataset
The dataset consists of 1000 audio tracks, each 30 seconds long. It contains 10 genres, each represented by 100 tracks. The tracks are all 22050 Hz, mono, 16-bit audio files in .wav format.
The genres are:
- blues
- classical
- country
- disco
- hiphop
- jazz
- metal
- pop
- reggae
- rock
This dataset is the most widely used benchmark in the field, but it is also known to suffer from many issues (sound quality, repetitions, mislabelling, etc.); see here. Despite this, it is a good starting point for testing deep learning techniques. See here for an extensive list of Music Information Retrieval datasets.
Overall approach
First, we have to keep in mind that sound can be represented as an image, thanks to signal-processing techniques such as the well-known Short-Time Fourier Transform. A natural way to learn from music is therefore to train a CNN on spectrogram images derived from the audio tracks.
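As an illustration, here is a minimal sketch of how such an image could be computed; it assumes the librosa library and mel-scaled spectrograms, which may differ from the exact preprocessing used in this repository:

```python
import librosa
import numpy as np

def wav_to_melspectrogram(path, sr=22050, n_mels=128):
    """Load a WAV file and turn it into a log-scaled mel spectrogram 'image'."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    # Decibel scaling gives the image a perceptually meaningful contrast
    return librosa.power_to_db(spec, ref=np.max)
```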
Our approach is based on a two-block convolutional model:
- two 2D convolutional layers (with 32 and 128 channels, respectively), each followed by a max-pooling layer
- a 20% dropout layer
- a global average pooling layer, which avoids the explosion of the number of parameters compared with a simple flatten layer
- a fully connected layer with 512 units
A deeper network was tested but did not show significantly better results.
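As a rough sketch, the architecture above could be written in Keras as follows; the kernel sizes, activations, optimizer, and input shape are assumptions, not the repository's exact settings:

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(128, 130, 1), n_genres=10):
    """Two-block CNN: conv/pool x2, dropout, global average pooling, dense."""
    # input_shape assumes 128 mel bands x ~130 frames for a 3-second excerpt
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.2),
        layers.GlobalAveragePooling2D(),
        layers.Dense(512, activation="relu"),
        layers.Dense(n_genres, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```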
Note that the model is trained on 3-second excerpts of music in order to keep the size of the spectrogram images under control.
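For instance, each 30-second track can be cut into ten non-overlapping 3-second excerpts before computing the spectrograms; the helper below is a hypothetical illustration, not the repository's code:

```python
def split_into_excerpts(y, sr=22050, seconds=3):
    """Cut a waveform into non-overlapping fixed-length excerpts."""
    n = seconds * sr
    return [y[i:i + n] for i in range(0, len(y) - n + 1, n)]
```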
Model performance
The model is trained on a randomly chosen sample of 80% of the data. The remaining 20% is used for the validation and test sets.
The training procedure was repeated five times, with random train, validation, and test sets each time, in order to estimate the model's accuracy. The final model performance is:
- Mean accuracy: 94.61% (+/- 0.40%)
- Mean loss: 0.2192 (+/- 0.0292)
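A sketch of this evaluation protocol is shown below; it assumes prepared feature and label arrays X and y and the build_model function sketched above, and the even validation/test split and training hyper-parameters are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# X: spectrogram windows, y: integer genre labels (assumed prepared beforehand)
accuracies = []
for seed in range(5):
    # 80% train; the remaining 20% is split evenly into validation and test
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed, stratify=y_rest)

    model = build_model()
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=30, batch_size=32, verbose=0)
    _, acc = model.evaluate(X_test, y_test, verbose=0)
    accuracies.append(acc)

print(f"Mean accuracy: {np.mean(accuracies):.2%} (+/- {np.std(accuracies):.2%})")
```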
Model error during training
The following figure shows the accuracy of the model during training. Slight overfitting seems to appear after about 20 epochs; adding regularization layers or augmenting the dataset should improve the results.
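Two possible remedies are sketched below; both the L2 weight decay and the waveform-level augmentation are hypothetical suggestions rather than the repository's code:

```python
import numpy as np
from tensorflow.keras import layers, regularizers

# 1) Regularization: L2 weight decay on a convolutional layer
conv = layers.Conv2D(32, (3, 3), activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))

# 2) Data augmentation: random gain and circular time shift on the raw waveform
def augment(y, rng=np.random.default_rng()):
    y = y * rng.uniform(0.8, 1.2)               # random gain
    return np.roll(y, rng.integers(0, len(y)))  # random circular shift
```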