A project focused on the super-resolution of audio signals, i.e. improving the sound quality of a digital recording, be it a vocal recording or music.
The datasets used will be VCTK and a music dataset, possibly MagnaTagATune or the Million Song Dataset.
- Audio Super-Resolution Using Neural Nets
- Time-frequency Networks For Audio Super-Resolution
- Bandwidth extension on raw audio via generative adversarial networks
- Phase-aware music super-resolution using generative adversarial networks
- A two-stage U-Net for high-fidelity denoising of historical recordings
- Realistic Gramophone Noise Synthesis Using A Diffusion Model
- BEHM-GAN: Bandwidth Extension of Historical Music using Generative Adversarial Networks
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
- Imaging Time-Series to Improve Classification and Imputation
- Data augmentation approaches for improving animal audio classification
- Audiomentations - Python library for audio data augmentation
- Deep Learning for Audio Signal Processing
- Adversarial Training for Speech Super-Resolution
- MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation
- On the evaluation of generative models in music
- INCO-GAN: Variable-Length Music Generation Method Based on Inception Model-Based Conditional GAN
- WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution
- Self-Attention for Audio Super-Resolution
- On Filter Generalization for Music Bandwidth Extension Using Deep Neural Networks
- NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling
- Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulation
- Learning Continuous Representation of Audio for Arbitrary Scale Super Resolution
- An investigation of pre-upsampling generative modelling and Generative Adversarial Networks in audio super resolution
- TUNet: A Block-online Bandwidth Extension Model based on Transformers and Self-supervised Pretraining
- Vision-Infused Deep Audio Inpainting
- VoiceFixer: Toward General Speech Restoration with Neural Vocoder
- Sound field reconstruction in rooms: inpainting meets super-resolution
- High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram
- Enabling Real-time On-chip Audio Super Resolution for Bone Conduction Microphones
- Speech bandwidth expansion based on Deep Neural Networks
- Speech Audio Super-Resolution For Speech Recognition
- WaveNet: A Generative Model for Raw Audio
- Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
- Multi-Scale Inception Based Super-Resolution Using Deep Learning Approach
- Use of recurrence plots for identification and extraction of patterns in humpback whale song recordings
- A Brief Introduction to Nonlinear Time Series Analysis and Recurrence Plots
- Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift
- Fast Fourier Convolution
- Competitive Multi-scale Convolution
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning
- Multi-Scale Convolutional Neural Networks for Time Series Classification
- Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
- Article on normalizing the root mean square error
- Understanding the Inception architecture
- Ordering of batch normalization and dropout
- Using Deep-Learning to Reconstruct High-Resolution Audio
- Separate Music Tracks with Deep Learning
- A Simple Guide to the Versions of the Inception Network
- Ablation studies video
- What is an ablation study?
- Steps to start training your custom Tensorflow model in AWS SageMaker
- Loss suddenly increasing
- NaN loss
- Gradient clipping in Keras
- How to use Learning Curves to Diagnose Machine Learning Model Performance
- Validation Error less than training error
- How to successfully add large data sets to Google Drive
- Signal-to-noise ratio
- Validation loss when using Dropout
- My validation loss is lower than my training loss, should I get rid of regularization?
- Dropout rate guidance for hidden layers in a convolution neural network
- Your validation loss is lower than the training loss? This is why.
- Dropout makes performance worse
- How many concurrent requests does a single Flask process receive?
- Python: How to create a zip archive from multiple files
- Java - How to download a zip file from a URL
- Improving sound quality of music
- Insanely experimental and slightly unrealistic idea: use Deezer's source-separation model to split a song into stems and apply transfer learning to upsample each stem separately (train one model on low-res/high-res guitar tracks, one on bass tracks, one on drums and another on vocal tracks); a sketch follows this list
- Voice-over-IP applications
- Improving speech recognition
- Remastering audio from old movies
- Reconstructing old phonograph recordings (This one might be a bit far-fetched, since a lot of data is needed. It might actually be impossible.)
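A minimal sketch of the separation step of the stem-upsampling idea above, using Deezer's open-source Spleeter library (the pretrained 4-stem model splits a song into vocals, drums, bass and other; the per-stem upsampling models are not shown and the file names are illustrative):

```python
# Split a song into stems with Spleeter's pretrained 4-stem model; each stem
# could then be fed to its own super-resolution model.
from spleeter.separator import Separator

separator = Separator('spleeter:4stems')
# writes vocals.wav, drums.wav, bass.wav and other.wav under output/song/
separator.separate_to_file('song.mp3', 'output/')
```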
The Python module librosa can transform an audio signal into a spectrogram and, via the Griffin-Lim algorithm, approximately reconstruct an audio signal from a magnitude spectrogram.
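A minimal sketch of that round trip (the n_fft/hop_length values and file name are illustrative):

```python
# Audio -> magnitude spectrogram -> audio; Griffin-Lim iteratively estimates
# the phase that the magnitude-only spectrogram discards.
import librosa
import numpy as np

y, sr = librosa.load('recording.wav', sr=None)           # keep original sample rate
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))  # magnitude spectrogram
y_rec = librosa.griffinlim(S, n_iter=32, n_fft=1024, hop_length=256)
```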
- What is convolution? This is the easiest way to understand
- Introducing Convolutions
- Valerio Velardo's Audio Signal Processing for Machine Learning Playlist
- Steve Brunton's Fourier Analysis Playlist
- Sampling, Aliasing & Nyquist Theorem
- Interpolation
- Wave phase
- What is phase in audio?
- Difference between logged-power spectrum and power spectrum
- Linear and logarithmic scales
- Decibel conversion
- Why do we modulate signals?
- Modulation vs. convolution
- Discrete Fourier Transform explained with example
- DC offset
- The FFT algorithm - simple step by step explanation
- What is a good signal-to-noise ratio?
- Log-uniform distribution
- How to Read AI (Audio) Research Papers Like a Rockstar
  - Skimming
    - read the abstract
    - read introduction + conclusion
    - check out figures and tables
    - ignore details
  - Reading the details
    - What's the state-of-the-art?
    - What are the tools/techniques used?
    - How did the authors evaluate their solution?
    - What are the results of the experiments?
  - Question everything
    - Do I understand everything the authors say?
    - Is the paper sound?
    - What would I have done differently?
    - Read referenced research papers
    - Internalise the math
    - Re-think the problem
    - Explain the paper to friends/colleagues
  - Check the code
    - Check the author's implementation
    - Re-implement the proposed solution
    - Run experiments
- How to Conduct Literature Review Effectively
  - Select resource → Read resource → Take notes (topic, approach, results, contributions, limitations/weaknesses) → Keep track of references → Summarise literature review findings
- Introduction
  - Problem statement (what, why, how)
- Theory
  - Neural nets/CNNs
  - Image super-resolution
  - Time series super-resolution
  - Autoencoders and U-Nets (state-of-the-art) - audio super-resolution literature review
  - Time-series analysis - basic concepts (frequency spectrum, Fourier transform, spectrograms, sample rate)
- My contributions
  - Multiscale convolutions, or
  - Encoding the time-domain representation as images (GAFs, MTFs, recurrence plots or some other time-series imaging technique); a sketch follows this outline
- Experiments and results evaluation (metrics, tables etc.) → dataset details (exploratory data analysis: histograms, spectrograms etc.)
- Spring/Angular project description
- Conclusion
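A minimal sketch of the time-series imaging idea from the outline, assuming the pyts library (any GAF/MTF/recurrence-plot implementation would work; the data here is a random placeholder):

```python
# Turn 1D audio chunks into 2D Gramian Angular Field images that a 2D CNN
# could consume.
import numpy as np
from pyts.image import GramianAngularField

chunks = np.random.randn(8, 4800)                  # placeholder audio chunks
gaf = GramianAngularField(image_size=128, method='summation')
images = gaf.fit_transform(chunks)                 # shape (8, 128, 128)
```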
A quick implementation of "Audio Super-Resolution Using Neural Nets" (Kuleshov, Enam, Ermon, 2017)
- Write the data generator scripts (obtaining the low-res/high-res pairs of audio clips)
- Write the training/testing scripts
- read more about why the loss becomes NaN during training
- usually it is caused by an exploding or a vanishing gradient (in my case, I accidentally used NRMSE as the loss function instead of only as a metric; the value was so small that Keras displayed the loss as "nan")
- relevant StackOverflow post: https://stackoverflow.com/questions/37232782/nan-loss-when-training-regression-network
- create plots with the training and validation loss
- train 100 epochs
- adjust the data split to use all of the generated data from the first 100 tracks of VCTK for training/validation/testing
- Finish testing/prediction script:
- Downsample test audio track
- Feed chunks of 256 samples of the audio to the model
- Display spectrogram of the output
- Save the low-res/high-res/super-res numpy arrays as audio and compare them
- Display the 5-number summary / boxplot of the generated dataset
- Create line plot of a couple of random samples (both in time-domain and in frequency-domain)
- Create histogram of a single sample
- Check what size a chunk could have (transform the numpy array chunk of size n to WAV and listen to the result)
- Implement checkpoint restoration
- Generate the dataset again with a chunk size of 4800 samples (100 ms), an overlap of 2400 samples (50 ms) and a downsampling factor of 4 (the paper mentions a similar patch size of 6000 samples used in 400 training epochs); a sketch of the pair generation follows this list
- Use 4 processes to speed up the data generation
- Add 3 more upsampling blocks (a single subpixel layer has no trainable parameters and is not enough to upsample a 1200-sample tensor to 4800 samples)
- Reconstruct the low-resolution signal without applying interpolation (e.g. a 100-sample chunk downsampled by a factor of 4 becomes a 25-sample chunk, which is fed to the network as-is)
- Get evaluation metrics (NRMSE comparisons for the high-res/super-res signals and the high-res/interpolated signals)
- Display some scatterplots
- Compare the 5e-4 learning rate model with the 100-epoch 1e-4 model and the 200-epoch 1e-4 model
- Increase the number of filters in the Conv1D layers from 32 to 64 and train
- Set the filter no. configuration to 64-128-256-256-256-128-64, with the last 2 upsampling blocks retaining the 64-64 arrangement
- Add BatchNorm to the downsampling blocks
- Perform an ablation study to analyze how much each residual block contributes to the noise found in the model output
- Replace the last two upsampling blocks with 1D Inception modules as an experiment
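A minimal sketch of the low-res/high-res pair generation described in the list above (48 kHz VCTK audio is assumed, matching the 4800-sample/100 ms chunks; the function name and file handling are illustrative):

```python
# Cut a track into overlapping high-res chunks and pair each with the
# corresponding low-res chunk, kept at CHUNK // FACTOR samples (no interpolation).
import librosa

SR, CHUNK, OVERLAP, FACTOR = 48000, 4800, 2400, 4

def make_pairs(path):
    hr, _ = librosa.load(path, sr=SR)                              # high-res signal
    lr = librosa.resample(hr, orig_sr=SR, target_sr=SR // FACTOR)  # low-res signal
    pairs = []
    for start in range(0, len(hr) - CHUNK + 1, CHUNK - OVERLAP):
        hr_chunk = hr[start:start + CHUNK]
        lr_chunk = lr[start // FACTOR:(start + CHUNK) // FACTOR]
        pairs.append((lr_chunk, hr_chunk))
    return pairs
```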
The values below are averages over all test samples, computed with tf.keras.Model.evaluate(...).
| Model | MSE | SNR | NRMSE |
|---|---|---|---|
| 3-layer | 67026 | 15.31 | 0.1779 |
| 4-layer | 68838 | 15.45 | 0.1667 |
| 5-layer | 64744 | 15.46 | 0.1689 |
| Model | MSE | SNR | NRMSE |
|---|---|---|---|
| 3-layer (Inception modules, with dropout) | 170675 | 12.88 | 0.4275 |
| 4-layer (Inception modules, with dropout) | 81340 | 14.16 | 0.2289 |
| 5-layer (Inception modules, with dropout) | 75183 | 14.28 | 0.2341 |
- remove the dropout layers from the Inception blocks and notice the effect
| Model | MSE | SNR | NRMSE |
|---|---|---|---|
| 5-layer (Inception modules with dropout) | 75183 | 14.28 | 0.2341 |
| 5-layer (Inception modules without dropout) | 42914 | 16.62 | 0.1116 |
- remove the dropout layers from the last two upsampling blocks of the normal 5-layer model and notice the effect
| Model | MSE | SNR | NRMSE |
|---|---|---|---|
| 5-layer (Inception modules with dropout) | 75183 | 14.28 | 0.2341 |
| 5-layer (Inception modules without dropout) | 42914 | 16.62 | 0.1116 |
| 5-layer (no Inception modules, no dropout in the last 2 blocks) | 45324 | 16.57 | 0.1125 |
- remove the dropout layers from the last two upsampling blocks of the normal 3-layer and 4-layer models and notice the effect
| Configuration | 3-block model (MSE / SNR / NRMSE) | 4-block model (MSE / SNR / NRMSE) | 5-block model (MSE / SNR / NRMSE) |
|---|---|---|---|
| Upsampling blocks with dropout | 67026 / 22.78 / 0.1779 | 66944 / 22.92 / 0.1737 | 64744 / 23.07 / 0.1689 |
| Upsampling blocks without dropout | 51023 / 25.43 / 0.1162 | 47582 / 25.74 / 0.1125 | 45324 / 25.73 / 0.1125 |
| With Inception modules, with dropout | 170675 / 15.23 / 0.4275 | 81340 / 19.79 / 0.2289 | 75183 / 20.27 / 0.2341 |
| With Inception modules, without dropout | 49434 / 25.64 / 0.1143 | 55401 / 25.41 / 0.1194 | 42914 / 25.87 / 0.1116 |
The linear and cubic spline baselines were computed as the mean signal-to-noise ratio over all test samples. The SNR for the model with dropout-free Inception modules was computed with a batch size of 1, since tf.keras.Model.evaluate averages the metrics over each batch, and a larger batch size would skew the per-sample mean.
| Method | Mean SNR (dB) |
|---|---|
| Linear interpolation | 19.67 |
| Cubic spline interpolation | 26.80 |
| 5-layer model with Inception blocks containing no dropout | 27.57 |
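A hedged sketch of the per-sample SNR computation described above (`model` and `test_pairs` are placeholders; the document only specifies that a batch size of 1 avoids the per-batch averaging skew):

```python
# SNR in dB between a high-res reference and a reconstruction, averaged over
# the test set one sample at a time.
import numpy as np

def snr_db(reference, estimate):
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

# snrs = [snr_db(hr, model.predict(lr[None, ..., None])[0]) for lr, hr in test_pairs]
# print(np.mean(snrs))
```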