
Deep Learning and Digital Signal Processing for Environmental Sound Classification


Introduction


Automatic environmental sound classification (ESC) based on the ESC-50 dataset (and its ESC-10 subset) built by Karol Piczak and described in the following article:

"Karol J. Piczak. 2015. "ESC: Dataset for Environmental Sound Classification." In Proceedings of the 23rd ACM international conference on Multimedia (MM '15). Association for Computing Machinery, New York, NY, USA, 1015–1018. https://doi.org/10.1145/2733373.2806390".

The ESC-50 dataset is available from Dr. Piczak's GitHub: https://github.com/karoldvl/ESC-50/ The following recent article is a descriptive survey of environmental sound classification (ESC), detailing datasets, preprocessing techniques, features, classifiers, and their reported accuracy.

Anam Bansal, Naresh Kumar Garg, "Environmental Sound Classification: A descriptive review of the literature," Intelligent Systems with Applications, Volume 16, 2022, 200115, ISSN 2667-3053, https://doi.org/10.1016/j.iswa.2022.200115.

Dr. Piczak maintains a table of the best published results in his GitHub repository, listing authors, publications, and methods. We reproduce the top of the table here, for supervised classification.

Title | Notes | Accuracy | Paper | Code
BEATs: Audio Pre-Training with Acoustic Tokenizers | Transformer model pretrained with acoustic tokenizers | 98.10% | chen2022 | 📜
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 | 📜
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 | 📜
AST: Audio Spectrogram Transformer | Pure attention model pretrained on AudioSet | 95.70% | gong2021 | 📜
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | A Transformer model pretrained w/ visual image supervision | 95.70% | zhao2022 | 📜

We develop our own pre-processing techniques to achieve the best possible accuracy, guided by Dr. Piczak's table and Bansal et al.
At that point, and before we start working on more advanced techniques:

  • we work with the ESC-10 sub-dataset.
  • we test mel-spectrograms and wavelet transforms.

We will train a Convolutional Neural Network (CNN) with grayscale spectrograms and scalograms, targeting an accuracy well above 90%.
Once tests with the most effective CNN implementation are complete, we will run predictions on various audio clips downloaded from YouTube and, if needed, update the CNN hyperparameters.

ESC-10: Types of sounds/noises


The ESC-10 dataset contains 400 five-second Ogg Vorbis audio clips (sampling frequency 44.1 kHz, 32-bit float) in 10 classes,
with 40 audio clips per class.
The 10 Sound/Noise classes are:

  • Class = 01-Dogbark, Label = 0
  • Class = 02-Rain, Label = 1
  • Class = 03-Seawaves, Label = 2
  • Class = 04-Babycry, Label = 3
  • Class = 05-Clocktick, Label = 4
  • Class = 06-Personsneeze, Label = 5
  • Class = 07-Helicopter, Label = 6
  • Class = 08-Chainsaw, Label = 7
  • Class = 09-Rooster, Label = 8
  • Class = 10-Firecrackling, Label = 9

Quick analysis of the type of sound/noise:

  • Dog barking, baby cry, person sneeze, and rooster involve non-linear vibration and resonance of the vocal (or nasal) tract and cords, a bit like speech, and are considered non-stationary.
  • Rain and sea waves are somewhat stationary; rain sounds a bit like white noise. They are pseudo-stationary, because other noises appear at times in various audio clips.
  • Helicopter, chainsaw: pseudo-stationary. If the engine RPM does not change within a time frame, the process is stationary, with harmonics linked to the engine RPM, the number of cylinders, and the number of rotor blades (helicopter).
  • Fire crackling: impulsive noise, but with pseudo-stationary background noise.
  • Clock tick: it depends. Impulsive every second (frequency = 1 Hz), but in some audio clips there are several "pulsations" within a one-second time frame, and the ticks have the signature of a non-linear mechanical vibration that radiates sound, with harmonics.

Methodology

  • In an effort to reduce the size of the problem and computation time, while retaining relevant information, we:
    • reduce audio sampling frequency from 44.1 kHz to 22.05 kHz.
    • reduce the length of audio clips to 1.25 s, based on signal power considerations. Many audio clips contain only brief occurrences of the same sound phenomenon (dog barking or baby crying, for example), and most of the signal is "silence".
  • Normalize the audio signal amplitude to 1 (0 dBFS).
  • Compute mel-spectrograms or wavelet transforms for the 10 classes. We empirically optimized the wavelet selection and the wavelet transform parameters.
  • Reduce the size of scalograms in the time domain (some details are lost).
  • Train a CNN on 256x256 grayscale mel-spectrograms or on two series of 128x128 grayscale scalograms: magnitude and phase. Train/test split: 80%/20%. A preprocessing sketch follows this list.
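
A minimal preprocessing sketch of these steps is shown below, assuming librosa and NumPy; the function name and the sliding-energy criterion used to pick the 1.25 s window are our assumptions, not the notebooks' exact implementation.

    import numpy as np
    import librosa

    def preprocess_clip(path, sr=22050, clip_dur=1.25):
        """Resample to 22.05 kHz, keep the highest-power 1.25 s window,
        and normalize the peak amplitude to 1 (0 dBFS). Illustrative sketch."""
        y, _ = librosa.load(path, sr=sr, mono=True)      # 44.1 kHz -> 22.05 kHz
        win = int(clip_dur * sr)                         # 1.25 s in samples
        if len(y) > win:
            energy = np.convolve(y ** 2, np.ones(win), mode="valid")  # sliding-window power
            start = int(np.argmax(energy))               # window with the least "silence"
            y = y[start:start + win]
        peak = np.max(np.abs(y))
        return y / peak if peak > 0 else y               # peak normalized to 0 dBFS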

We tested three methods:

  • Mel-spectrograms.
  • Complex Continuous Wavelet Transforms (complex CWT).
  • Fusion mel-spectrograms + complex CWT.

After an 80%/20% train/test split, we train a Convolutional Neural Network with hidden layers of 32, 64, 128, and 256 neurons. Parameters are detailed in the CNN section of the notebooks.
Note: although mel-spectrograms and wavelet transforms are shown in color, the CNN is trained with grayscale images.
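
A minimal Keras sketch of such a CNN is given below, interpreting the 32-64-128-256 figures as the sizes of four convolutional blocks; kernel size, pooling, dropout and the dense layer width are assumptions, not the tuned parameters from the notebooks.

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_cnn(input_shape=(256, 256, 1), n_classes=10):
        """CNN sketch with 32-64-128-256 convolutional blocks for grayscale images.
        Hyperparameters are illustrative; see the notebooks' CNN section."""
        model = keras.Sequential()
        model.add(keras.Input(shape=input_shape))
        for filters in (32, 64, 128, 256):
            model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
            model.add(layers.MaxPooling2D(2))
        model.add(layers.Flatten())
        model.add(layers.Dense(128, activation="relu"))
        model.add(layers.Dropout(0.3))
        model.add(layers.Dense(n_classes, activation="softmax"))
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model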

ESC-10 Results Synthesis

The best accuracies obtained with the three methods are summarized in the table below.

Method | Accuracy
256x256 Mel-spectrograms | 92.5 %
128x128 Complex CWT Scalograms, Magnitude + Phase | 94 %
128x128 Fusion Complex CWT + Mel-Spectrograms | 99 %

Details of the best result with the "Fusion" method:

Classification report

Confusion matrix



Jupyter Notebooks


All Jupyter notebooks share the same structure.
The classification method and the CNN model were updated; the older notebooks (Parts I, II, and III) are at the bottom.

This notebook is an improved version of Part III. We implement a two-stage classification process:

STAGE I: Pre-classification

We define two sound classes, A and B:

  • "Harmonic" sounds (A): Dog barking, Baby cry, Clock tick, Person sneeze, Helicopter, Chainsaw, Rooster.
  • "Non-harmonic" sounds (B): Rain, Sea waves, Fire crackling.

Methodology: Stage I

Results:

Classification report Confusion matrix

A 100% accuracy classification was achieved with mel-spectrograms restricted to 0-2000 Hz and a CNN model.
At the moment this stage is left as an exercise; we will propose a simpler method.
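
For reference, a band-limited mel-spectrogram of the kind used for this pre-classification could be computed as sketched below, assuming librosa; n_mels and hop_length are our choices, not the notebook's parameters.

    import numpy as np
    import librosa

    def melspec_low_band(y, sr=22050, n_mels=128, hop_length=256):
        """Mel-spectrogram in dB restricted to 0-2000 Hz, used to separate
        'harmonic' from 'non-harmonic' sounds. Parameter values are assumptions."""
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                           hop_length=hop_length, fmin=0.0, fmax=2000.0)
        return librosa.power_to_db(S, ref=np.max)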

STAGE II: Classification:

We apply two sets of complex continuous wavelets to sound classes A and B and run the whole classification problem with a multi-feature CNN: CWT magnitude and phase + mel-spectrograms.

Methodology Stage II

RESULTS:

Classification report Confusion matrix

The remaining confusion between "sea wave" and "rain" is solved by developing a transform of the CWT: the aT-CWT.
Preliminary results are presented in Part V. The aT-CWT transform is currently confidential.

    Part V: the aT-CWT transform

Discriminating "sea wave" and "rain" is a challenge given the quasi Gaussian nature of the sound in both cases.
We were able to solve it with a criteria replacing the unwrap CWT phase and we achieved 100% accuracy.
This criteria is an advanced Transform of the CWT that we called aT-CWT
The new aT-CWT transform:

  • can have the dimensions of the other features: Mel spectrograms, CWT magnitude and phase. In the present study: 128x128.
  • the time localization info of the CWT is lost. aT-CWT makes sense for:
    • stationary, pseudo-stationary sounds even for large period of time (here 1.25s) which is the case with the "no-harmonics" sounds in the present ESC-10 dataset.
    • any type of signals (including non stationary) in very short time frames. For example: speech, frame= 32 ms, fs= 16kHz (512 points).

Using the strategy decribed in this Notebook, and replacing the unwrap CWT phase with the new aT-CWT Transform in the "no-harmonics" subset, we were able to reach 100% accuracy.

'cgau5' CWT of an ESC-10 'Sea Wave' (116): Magnitude + Phase

'cgau5' CWT of an ESC-10 'Rain' (48): Magnitude + Phase

Sea wave aT-CWT transform
Units are hidden.
Rain aT-CWT transform
Units are hidden.

At the moment, the aT-CWT Transform is confidential.
The aT-CWT Transform may help with the difficult "cocktail party problem" and Speech (or Voice) Activity Detection.
At some point, the transform will be published and the ESC-10 notebook with 100% accuracy will be made public.

Classification report Confusion matrix

Older Notebooks



Initial tests with Mel-Spectrogram, complex CWT, and multi-feature Mel-spectrogram + complex CWT CNN models.

Reduction of audio clip length and optimization of mel-spectrogram parameters for the best discrimination of sound categories. We train the CNN with 256x256 grayscale images. Accuracy: ~92.5%.

Mel-spectrograms (dB)
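
A mel-spectrogram image of the kind used here could be produced as sketched below, assuming librosa and scikit-image; n_fft, hop_length, n_mels and the [0, 1] normalization are our assumptions, not the optimized notebook parameters.

    import numpy as np
    import librosa
    from skimage.transform import resize

    def melspec_image(y, sr=22050, size=(256, 256)):
        """Mel-spectrogram in dB resized to a 256x256 grayscale image in [0, 1].
        FFT and mel parameters are illustrative."""
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                           hop_length=128, n_mels=128)
        S_db = librosa.power_to_db(S, ref=np.max)
        img = resize(S_db, size, anti_aliasing=True)                 # 256x256 image
        return (img - img.min()) / (img.max() - img.min() + 1e-9)    # grayscale in [0, 1]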


Optimization of wavelet selection and parameters for best discrimination of sound classes.
Wavelet selection: the difficulty here is the selection of the right wavelet suited to the full range of noise types: pseudo-stationary, non-stationary, transient/impulsive.
Applying different wavelets to each type of sound significantly improves classification accuracy. We train the CNN with 2 128x128 grayscale images per audio clip: scalogram magnitude and phase. Accuracy ~ 94%.

Scalograms magnitude (dB)

Scalograms phase (rad)
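
A complex-CWT scalogram pair of this kind could be computed as sketched below, assuming PyWavelets (pywt) and the 'cgau5' wavelet shown in the figures; the scale range and the resizing to 128x128 are our assumptions.

    import numpy as np
    import pywt
    from skimage.transform import resize

    def cwt_scalograms(y, sr=22050, wavelet="cgau5", n_scales=128, size=(128, 128)):
        """Complex CWT returning two 128x128 grayscale images:
        magnitude (dB) and phase (rad). Scale range and resizing are assumptions."""
        scales = np.arange(1, n_scales + 1)
        coefs, _ = pywt.cwt(y, scales, wavelet, sampling_period=1.0 / sr)
        mag_db = 20.0 * np.log10(np.abs(coefs) + 1e-12)   # scalogram magnitude (dB)
        phase = np.angle(coefs)                           # scalogram phase (rad)
        return (resize(mag_db, size, anti_aliasing=True),
                resize(phase, size, anti_aliasing=True))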


Combining mel-spectrograms (Part I) with complex wavelet transforms (Part II) improves accuracy on features that are difficult to discriminate. We train the CNN with three 128x128 grayscale images per audio clip. Accuracy: ~99%.

Rooster: Scalogram Magnitude (dB), Phase (rad) + Mel-spectrogram (dB)
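
One simple way to feed the three 128x128 grayscale features to a single CNN is to stack them as channels, as sketched below; whether the notebooks stack channels or use a multi-input model is not detailed here, so this is only an assumption.

    import numpy as np

    def fusion_input(mel_img, cwt_mag_img, cwt_phase_img):
        """Stack mel-spectrogram, CWT magnitude and CWT phase (each 128x128,
        grayscale) into one (128, 128, 3) CNN input. Channel stacking is an
        assumption; a multi-branch CNN is another option."""
        return np.stack([mel_img, cwt_mag_img, cwt_phase_img], axis=-1)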




    Download 

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

The dataset consists of 5-second-long recordings organized into 50 semantic classes (with 40 examples per class) loosely arranged into 5 major categories. The 10 classes of the ESC-10 subset used in this project, with their numeric labels, are:

Class | Label
Dog bark | 0
Rain | 1
Sea waves | 2
Baby cry | 3
Clock tick | 4
Person sneeze | 5
Helicopter | 6
Chainsaw | 7
Rooster | 8
Fire crackling | 9

Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project. The dataset has been prearranged into 5 folds for comparable cross-validation, making sure that fragments from the same original source file are contained in a single fold.

A more thorough description of the dataset is available in the original paper with some supplementary materials on GitHub: ESC: Dataset for Environmental Sound Classification - paper replication data.

Repository content

  • audio/*.wav

    2000 audio recordings in WAV format (5 seconds, 44.1 kHz, mono) with the following naming convention:

    {FOLD}-{CLIP_ID}-{TAKE}-{TARGET}.wav

    • {FOLD} - index of the cross-validation fold,
    • {CLIP_ID} - ID of the original Freesound clip,
    • {TAKE} - letter disambiguating between different fragments from the same Freesound clip,
    • {TARGET} - class in numeric format [0, 49].
  • meta/esc50.csv

    CSV file with the following structure:

    filename, fold, target, category, esc10, src_file, take

    The esc10 column indicates if a given file belongs to the ESC-10 subset (10 selected classes, CC BY license); a selection sketch using this column follows this list.

  • meta/esc50-human.xlsx

    Additional data pertaining to the crowdsourcing experiment (human classification accuracy).
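
As an example of using the esc10 column, the ESC-10 subset can be selected from meta/esc50.csv with pandas; the compact 0-9 label mapping below is illustrative and may differ from the label numbering used earlier.

    import pandas as pd

    # Select the ESC-10 subset from the ESC-50 metadata.
    meta = pd.read_csv("meta/esc50.csv")
    esc10 = meta[meta["esc10"]].copy()

    # Remap the 50-class 'target' codes to compact 0-9 labels (illustrative mapping).
    esc10["label"] = esc10["target"].astype("category").cat.codes
    print(esc10[["filename", "fold", "category", "label"]].head())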

Package Requirements

To run the code and reproduce the results:

  1. Download Anaconda for Python 3: https://www.anaconda.com/products/individual
  2. Install Jupyter Lab: conda install -c conda-forge jupyterlab
  3. Install Jupyter Notebook: conda install -c conda-forge notebook
  4. Create the prepared conda environment: conda env create -f stephane_dedieu_sound_classification.yml, then activate it: conda activate stephane_dedieu_sound_classification
  5. Alternative to 4: pip install -r requirements.txt. If librosa does not install, you can try: conda install -c numba numba, then conda install -c conda-forge librosa
  6. Run the notebook: jupyter notebook

Ensure you have the following Python packages installed:

  • pandas
  • matplotlib
  • numpy
  • scikit-learn
  • keras
  • pydot
  • tensorflow
  • librosa
  • glob2
  • notebook
  • seaborn
  • scikit-image

You can install these packages using pip:

pip install numpy pandas matplotlib seaborn scikit-learn tensorflow librosa

Using conda you can replicate the environment in stephane_dedieu_sound_classification.yml:
conda env create -n ENVNAME --file stephane_dedieu_sound_classification.yml

License

The dataset is available under the terms of the Creative Commons Attribution Non-Commercial license.
A smaller subset (clips tagged as ESC-10) is distributed under CC BY (Attribution).
Attributions for each clip are available in the LICENSE file.

Citing

Download paper in PDF format

If you find this dataset useful in an academic setting please cite:

K. J. Piczak. ESC: Dataset for Environmental Sound Classification. Proceedings of the 23rd Annual ACM Conference on Multimedia, Brisbane, Australia, 2015.

[DOI: http://dx.doi.org/10.1145/2733373.2806390]

@inproceedings{piczak2015dataset,
  title = {{ESC}: {Dataset} for {Environmental Sound Classification}},
  author = {Piczak, Karol J.},
  booktitle = {Proceedings of the 23rd {Annual ACM Conference} on {Multimedia}},
  date = {2015-10-13},
  url = {http://dl.acm.org/citation.cfm?doid=2733373.2806390},
  doi = {10.1145/2733373.2806390},
  location = {{Brisbane, Australia}},
  isbn = {978-1-4503-3459-4},
  publisher = {{ACM Press}},
  pages = {1015--1018}
}

The ESC-10 subset is licensed as a Creative Commons Attribution 3.0 Unported
(https://creativecommons.org/licenses/by/3.0/) dataset.

Licensing/attribution details for individual audio clips are available in the LICENSE file.