
Deep Learning and Digital Signal Processing for Environmental Sound Classification


Introduction


Automatic environmental sound classification (ESC) based on the ESC-50 dataset (and its ESC-10 subset) built by Karol Piczak and described in the following article:

"Karol J. Piczak. 2015. "ESC: Dataset for Environmental Sound Classification." In Proceedings of the 23rd ACM international conference on Multimedia (MM '15). Association for Computing Machinery, New York, NY, USA, 1015–1018. https://doi.org/10.1145/2733373.2806390".

The ESC-50 dataset is available from Dr. Piczak's GitHub: https://github.com/karoldvl/ESC-50/ The following recent article is a descriptive survey of environmental sound classification (ESC), detailing datasets, preprocessing techniques, features, classifiers, and their reported accuracy.

Anam Bansal, Naresh Kumar Garg, "Environmental Sound Classification: A descriptive review of the literature," Intelligent Systems with Applications, Volume 16, 2022, 200115, ISSN 2667-3053, https://doi.org/10.1016/j.iswa.2022.200115.

Dr. Piczak maintains a table of the best published results in his GitHub repository, listing authors, publications, and methods. We reproduce the top of the table here, for supervised classification.

Title | Notes | Accuracy | Paper | Code
BEATs: Audio Pre-Training with Acoustic Tokenizers | Transformer model pretrained with acoustic tokenizers | 98.10% | chen2022 | 📜
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 | 📜
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 | 📜
AST: Audio Spectrogram Transformer | Pure attention model pretrained on AudioSet | 95.70% | gong2021 | 📜
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | A Transformer model pretrained w/ visual image supervision | 95.70% | zhao2022 | 📜

We develop our own pre-processing techniques to achieve the best possible accuracy, guided by Dr. Piczak's table and Bansal et al.
At that point, and before we start working on more advanced techniques:

  • we work with the ESC-10 sub-dataset.
  • we test mel-spectrograms and wavelet transforms.

We will train a Convolutional Neural Network (CNN) with grayscale spectrograms and scalograms, targeting an accuracy well above 90%.
Once tests with the most effective CNN implementation are complete, we will run predictions on various audio clips downloaded from YouTube and, if needed, update the CNN hyperparameters.

ESC-10: Types of sounds/noises


The ESC-10 dataset contains 400 five-second Ogg Vorbis audio clips (sampling frequency 44.1 kHz, 32-bit float) in 10 classes,
with 40 audio clips per class.
The 10 Sound/Noise classes are:

  • Class = 01-Dogbark, Label = 0
  • Class = 02-Rain, Label = 1
  • Class = 03-Seawaves, Label = 2
  • Class = 04-Babycry, Label = 3
  • Class = 05-Clocktick, Label = 4
  • Class = 06-Personsneeze, Label = 5
  • Class = 07-Helicopter, Label = 6
  • Class = 08-Chainsaw, Label = 7
  • Class = 09-Rooster, Label = 8
  • Class = 10-Firecrackling, Label = 9

Quick analysis of the type of sound/noise:

  • Dog barking, baby cry, person sneeze, and rooster involve non-linear vibration and resonance of the vocal (or nasal) tract and cords, a bit like speech, and are considered non-stationary.
  • Rain and sea waves are somewhat stationary; rain sounds a bit like white noise. They are pseudo-stationary, because other noises appear at times in various audio clips.
  • Helicopter, chainsaw: pseudo-stationary. If the engine RPM does not change within a time frame, the process is stationary, with harmonics linked to the engine RPM, the number of cylinders, and the number of rotor blades (helicopter).
  • Fire crackling: impulsive noise, but with pseudo-stationary background noise.
  • Clock tick: it depends. Impulsive every second (frequency = 1 Hz), but in some audio clips there are several "pulsations" within a one-second time frame, and the ticks have the signature of a non-linear mechanical vibration that radiates sound, with harmonics.

Methodology

  • In an effort to reduce the size of the problem and computation time, while retaining relevant information, we:
    • reduce audio sampling frequency from 44.1 kHz to 22.05 kHz.
    • reduce the length of audio clips to 1.25 s, based on signal power considerations. Many audio clips contain only brief occurrences of the same sound phenomenon (dog barking or baby crying, for example), and most of the signal is "silence".
  • Normalize the audio signal amplitude to 1 (0 dBFS).
  • Compute mel-spectrograms or wavelet transforms for the 10 classes. We empirically optimized the wavelet selection and the wavelet transform parameters.
  • Reduce the size of scalograms in the time domain (some details are lost).
  • Train a CNN on 256x256 grayscale mel-spectrograms or on two series of 128x128 grayscale scalograms: magnitude and phase. Train/test split: 80%/20%. A preprocessing sketch follows this list.
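
A minimal preprocessing sketch of these steps is shown below, assuming librosa and NumPy; the function name and the sliding-energy criterion used to pick the 1.25 s window are our assumptions, not the notebooks' exact implementation.

    import numpy as np
    import librosa

    def preprocess_clip(path, sr=22050, clip_dur=1.25):
        """Resample to 22.05 kHz, keep the highest-power 1.25 s window,
        and normalize the peak amplitude to 1 (0 dBFS). Illustrative sketch."""
        y, _ = librosa.load(path, sr=sr, mono=True)      # 44.1 kHz -> 22.05 kHz
        win = int(clip_dur * sr)                         # 1.25 s in samples
        if len(y) > win:
            energy = np.convolve(y ** 2, np.ones(win), mode="valid")  # sliding-window power
            start = int(np.argmax(energy))               # window with the least "silence"
            y = y[start:start + win]
        peak = np.max(np.abs(y))
        return y / peak if peak > 0 else y               # peak normalized to 0 dBFS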

We tested three methods:

  • Mel-spectrograms.
  • Complex Continuous Wavelet Transforms (complex CWT).
  • Fusion mel-spectrograms + complex CWT.

After an 80%/20% train/test split, we train a Convolutional Neural Network with hidden layers of 32, 64, 128, and 256 neurons. Parameters are detailed in the CNN section of the notebooks.
Note: although mel-spectrograms and wavelet transforms are shown in color, the CNN is trained with grayscale images.
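
A minimal Keras sketch of such a CNN is given below, interpreting the 32-64-128-256 figures as the sizes of four convolutional blocks; kernel size, pooling, dropout and the dense layer width are assumptions, not the tuned parameters from the notebooks.

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_cnn(input_shape=(256, 256, 1), n_classes=10):
        """CNN sketch with 32-64-128-256 convolutional blocks for grayscale images.
        Hyperparameters are illustrative; see the notebooks' CNN section."""
        model = keras.Sequential()
        model.add(keras.Input(shape=input_shape))
        for filters in (32, 64, 128, 256):
            model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
            model.add(layers.MaxPooling2D(2))
        model.add(layers.Flatten())
        model.add(layers.Dense(128, activation="relu"))
        model.add(layers.Dropout(0.3))
        model.add(layers.Dense(n_classes, activation="softmax"))
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model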

ESC-10 Results Synthesis

The best accuracies obtained with the three methods are summarized in the table below.

Method | Accuracy
256x256 Mel-spectrograms | 92.5 %
128x128 Complex CWT Scalograms, Magnitude + Phase | 94 %
128x128 Fusion Complex CWT + Mel-Spectrograms | 99 %

Details of the best result with the "Fusion" method:

Classification report

Confusion matrix



Jupyter Notebooks


All Jupyter notebooks share the same structure.
The classification method and the CNN model were updated; the older notebooks (Parts I, II, and III) are at the bottom.

This notebook is an improved version of Part III. We implement a two-stage classification process:

STAGE I: Pre-classification

We define two sound classes, A and B:

  • "Harmonic" sounds (A): Dog barking, Baby cry, Clock tick, Person sneeze, Helicopter, Chainsaw, Rooster.
  • "Non-harmonic" sounds (B): Rain, Sea waves, Fire crackling.

Methodology: Stage I

Results:

Classification report Confusion matrix

A 100% accuracy classification was achieved with mel-spectrograms restricted to 0-2000 Hz and a CNN model.
At the moment this stage is left as an exercise; we will propose a simpler method.
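
For reference, a band-limited mel-spectrogram of the kind used for this pre-classification could be computed as sketched below, assuming librosa; n_mels and hop_length are our choices, not the notebook's parameters.

    import numpy as np
    import librosa

    def melspec_low_band(y, sr=22050, n_mels=128, hop_length=256):
        """Mel-spectrogram in dB restricted to 0-2000 Hz, used to separate
        'harmonic' from 'non-harmonic' sounds. Parameter values are assumptions."""
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                           hop_length=hop_length, fmin=0.0, fmax=2000.0)
        return librosa.power_to_db(S, ref=np.max)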

STAGE II: Classification:

We apply two sets of complex continuous wavelets to sound classes A and B and run the whole classification problem with a multi-feature CNN: CWT magnitude and phase + mel-spectrograms.

Methodology Stage II

RESULTS:

Classification report Confusion matrix

The remaining confusion between "sea wave" and "rain" is solved by developing a transform of the CWT: the aT-CWT.
Preliminary results are presented in Part V. The aT-CWT transform is currently confidential.

    Part V: the aT-CWT transform

Discriminating "sea wave" and "rain" is a challenge given the quasi Gaussian nature of the sound in both cases.
We were able to solve it with a criteria replacing the unwrap CWT phase and we achieved 100% accuracy.
This criteria is an advanced Transform of the CWT that we called aT-CWT
The new aT-CWT transform:

  • can have the dimensions of the other features: Mel spectrograms, CWT magnitude and phase. In the present study: 128x128.
  • the time localization info of the CWT is lost. aT-CWT makes sense for:
    • stationary, pseudo-stationary sounds even for large period of time (here 1.25s) which is the case with the "no-harmonics" sounds in the present ESC-10 dataset.
    • any type of signals (including non stationary) in very short time frames. For example: speech, frame= 32 ms, fs= 16kHz (512 points).

Using the strategy decribed in this Notebook, and replacing the unwrap CWT phase with the new aT-CWT Transform in the "no-harmonics" subset, we were able to reach 100% accuracy.

'cgau5' CWT of an ESC-10 'Sea Wave' (116): Magnitude + Phase

'cgau5' CWT of an ESC-10 'Rain' (48): Magnitude + Phase

Sea wave aT-CWT transform
Units are hidden.
Rain aT-CWT transform
Units are hidden.

At the moment, the aT-CWT Transform is confidential.
The aT-CWT Transform may help with the difficult "cocktail party problem" and Speech (or Voice) Activity Detection.
At some point, the transform will be published and the ESC-10 notebook with 100% accuracy will be made public.

Classification report Confusion matrix

Older Notebooks



Initial tests with Mel-Spectrogram, complex CWT, and multi-feature Mel-spectrogram + complex CWT CNN models.

Reduction of audio clip length and optimization of mel-spectrogram parameters for the best discrimination of sound categories. We train the CNN with 256x256 grayscale images. Accuracy: ~92.5%.

Mel-spectrograms (dB)
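
A mel-spectrogram image of the kind used here could be produced as sketched below, assuming librosa and scikit-image; n_fft, hop_length, n_mels and the [0, 1] normalization are our assumptions, not the optimized notebook parameters.

    import numpy as np
    import librosa
    from skimage.transform import resize

    def melspec_image(y, sr=22050, size=(256, 256)):
        """Mel-spectrogram in dB resized to a 256x256 grayscale image in [0, 1].
        FFT and mel parameters are illustrative."""
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                           hop_length=128, n_mels=128)
        S_db = librosa.power_to_db(S, ref=np.max)
        img = resize(S_db, size, anti_aliasing=True)                 # 256x256 image
        return (img - img.min()) / (img.max() - img.min() + 1e-9)    # grayscale in [0, 1]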


Optimization of wavelet selection and parameters for best discrimination of sound classes.
Wavelet selection: the difficulty here is the selection of the right wavelet suited to the full range of noise types: pseudo-stationary, non-stationary, transient/impulsive.
Applying different wavelets to each type of sound significantly improves classification accuracy. We train the CNN with 2 128x128 grayscale images per audio clip: scalogram magnitude and phase. Accuracy ~ 94%.

Scalograms magnitude (dB)

Scalograms phase (rad)
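
A complex-CWT scalogram pair of this kind could be computed as sketched below, assuming PyWavelets (pywt) and the 'cgau5' wavelet shown in the figures; the scale range and the resizing to 128x128 are our assumptions.

    import numpy as np
    import pywt
    from skimage.transform import resize

    def cwt_scalograms(y, sr=22050, wavelet="cgau5", n_scales=128, size=(128, 128)):
        """Complex CWT returning two 128x128 grayscale images:
        magnitude (dB) and phase (rad). Scale range and resizing are assumptions."""
        scales = np.arange(1, n_scales + 1)
        coefs, _ = pywt.cwt(y, scales, wavelet, sampling_period=1.0 / sr)
        mag_db = 20.0 * np.log10(np.abs(coefs) + 1e-12)   # scalogram magnitude (dB)
        phase = np.angle(coefs)                           # scalogram phase (rad)
        return (resize(mag_db, size, anti_aliasing=True),
                resize(phase, size, anti_aliasing=True))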


Combining mel-spectrograms (Part I) with complex wavelet transforms (Part II) improves accuracy on features that are difficult to discriminate. We train the CNN with three 128x128 grayscale images per audio clip. Accuracy: ~99%.

Rooster: Scalogram Magnitude (dB), Phase (rad) + Mel-spectrogram (dB)
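
One simple way to feed the three 128x128 grayscale features to a single CNN is to stack them as channels, as sketched below; whether the notebooks stack channels or use a multi-input model is not detailed here, so this is only an assumption.

    import numpy as np

    def fusion_input(mel_img, cwt_mag_img, cwt_phase_img):
        """Stack mel-spectrogram, CWT magnitude and CWT phase (each 128x128,
        grayscale) into one (128, 128, 3) CNN input. Channel stacking is an
        assumption; a multi-branch CNN is another option."""
        return np.stack([mel_img, cwt_mag_img, cwt_phase_img], axis=-1)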




    Download 

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

The dataset consists of 5-second-long recordings organized into 50 semantic classes (with 40 examples per class) loosely arranged into 5 major categories. The 10 classes of the ESC-10 subset used in this project, with their numeric labels, are:

Class | Label
Dog bark | 0
Rain | 1
Sea waves | 2
Baby cry | 3
Clock tick | 4
Person sneeze | 5
Helicopter | 6
Chainsaw | 7
Rooster | 8
Fire crackling | 9

Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project. The dataset has been prearranged into 5 folds for comparable cross-validation, making sure that fragments from the same original source file are contained in a single fold.

A more thorough description of the dataset is available in the original paper with some supplementary materials on GitHub: ESC: Dataset for Environmental Sound Classification - paper replication data.

Repository content

  • audio/*.wav

    2000 audio recordings in WAV format (5 seconds, 44.1 kHz, mono) with the following naming convention:

    {FOLD}-{CLIP_ID}-{TAKE}-{TARGET}.wav

    • {FOLD} - index of the cross-validation fold,
    • {CLIP_ID} - ID of the original Freesound clip,
    • {TAKE} - letter disambiguating between different fragments from the same Freesound clip,
    • {TARGET} - class in numeric format [0, 49].
  • meta/esc50.csv

    CSV file with the following structure:

    filename, fold, target, category, esc10, src_file, take

    The esc10 column indicates if a given file belongs to the ESC-10 subset (10 selected classes, CC BY license); a selection sketch using this column follows this list.

  • meta/esc50-human.xlsx

    Additional data pertaining to the crowdsourcing experiment (human classification accuracy).
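
As an example of using the esc10 column, the ESC-10 subset can be selected from meta/esc50.csv with pandas; the compact 0-9 label mapping below is illustrative and may differ from the label numbering used earlier.

    import pandas as pd

    # Select the ESC-10 subset from the ESC-50 metadata.
    meta = pd.read_csv("meta/esc50.csv")
    esc10 = meta[meta["esc10"]].copy()

    # Remap the 50-class 'target' codes to compact 0-9 labels (illustrative mapping).
    esc10["label"] = esc10["target"].astype("category").cat.codes
    print(esc10[["filename", "fold", "category", "label"]].head())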

Package Requirements

To run the code and reproduce the results:

  1. Download Anaconda for Python 3: https://www.anaconda.com/products/individual
  2. Install Jupyter Lab: conda install -c conda-forge jupyterlab
  3. Install Jupyter Notebook: conda install -c conda-forge notebook
  4. Create the prepared conda environment: conda env create -f stephane_dedieu_sound_classification.yml, then activate it: conda activate stephane_dedieu_sound_classification
  5. Alternative to 4: pip install -r requirements.txt. If librosa does not install, you can try: conda install -c numba numba, then conda install -c conda-forge librosa
  6. Run the notebook: jupyter notebook

Ensure you have the following Python packages installed:

  • pandas
  • matplotlib
  • numpy
  • scikit-learn
  • keras
  • pydot
  • tensorflow
  • librosa
  • glob2
  • notebook
  • seaborn
  • scikit-image

You can install these packages using pip:

pip install numpy pandas matplotlib seaborn scikit-learn tensorflow librosa

Using conda you can replicate the environment in stephane_dedieu_sound_classification.yml:
conda env create -n ENVNAME --file stephane_dedieu_sound_classification.yml

License

The dataset is available under the terms of the Creative Commons Attribution Non-Commercial license.
A smaller subset (clips tagged as ESC-10) is distributed under CC BY (Attribution).
Attributions for each clip are available in the LICENSE file.

Citing

Download paper in PDF format

If you find this dataset useful in an academic setting please cite:

K. J. Piczak. ESC: Dataset for Environmental Sound Classification. Proceedings of the 23rd Annual ACM Conference on Multimedia, Brisbane, Australia, 2015.

[DOI: http://dx.doi.org/10.1145/2733373.2806390]

@inproceedings{piczak2015dataset,
  title = {{ESC}: {Dataset} for {Environmental Sound Classification}},
  author = {Piczak, Karol J.},
  booktitle = {Proceedings of the 23rd {Annual ACM Conference} on {Multimedia}},
  date = {2015-10-13},
  url = {http://dl.acm.org/citation.cfm?doid=2733373.2806390},
  doi = {10.1145/2733373.2806390},
  location = {{Brisbane, Australia}},
  isbn = {978-1-4503-3459-4},
  publisher = {{ACM Press}},
  pages = {1015--1018}
}

The ESC-10 subset is licensed as a Creative Commons Attribution 3.0 Unported
(https://creativecommons.org/licenses/by/3.0/) dataset.

Licensing/attribution details for individual audio clips are available in the LICENSE file.