
Audio Super-Resolution

A project focused on the super-resolution of audio signals, i.e. improving the sound quality of a digital recording, be it a vocal recording or music.

Dataset

The datasets used will be VCTK and a music dataset, possibly MagnaTagATune or the Million Song Dataset.

Main references

MSc research project

Papers

Data augmentation

Data transformation

Other resources

Data

Related papers that could be used as references

Audio signals

Image super-resolution

Time-series

Generative models

Deep Learning

Blogs

Datasets

Other

Possible uses

  • Improving sound quality of music
    • highly experimental and somewhat unrealistic idea: use Deezer's source separation model (Spleeter) to split a song into its stems and apply transfer learning to upsample each stem separately (train one model on low-res/high-res guitar tracks, one on bass tracks, one on drums and another on vocal tracks)
  • Voice-over-IP applications
  • Improving speech recognition
  • Remastering audio from old movies
  • Reconstructing old phonograph recordings (this might be far-fetched, since a large amount of data would be needed; it may turn out to be infeasible)

Converting a spectrogram to audio (might be useful later)

The Python module librosa can transform an audio signal into a spectrogram, and can invert a magnitude spectrogram back to audio by estimating the missing phase with the Griffin-Lim algorithm.
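A minimal sketch of the round trip (the file name, FFT size and hop length are arbitrary illustrative choices, not project settings):

```python
import numpy as np
import librosa
import soundfile as sf

# Load a recording ("example.wav" is a placeholder path).
y, sr = librosa.load("example.wav", sr=None)

# Magnitude spectrogram via the STFT; the phase is discarded.
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Griffin-Lim iteratively estimates a phase that is consistent
# with the magnitudes, then inverts the STFT back to a waveform.
y_hat = librosa.griffinlim(S, n_iter=32, hop_length=256, n_fft=1024)

sf.write("reconstructed.wav", y_hat, sr)
```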

Math and signal processing links

Research advice

  • How to Read AI (Audio) Research Papers Like a Rockstar

    1. Skimming

      • read the abstract
      • read introduction + conclusion
      • check out figures and tables
      • ignore details
    2. Reading the details

      • What's the state-of-the-art?
      • What are the tools/techniques used?
      • How did the authors evaluate their solution?
      • What are the results of the experiments?
    3. Question everything

      • Do I understand everything the authors say?
      • Is the paper sound?
      • What would I have done differently?
      • Read referenced research papers
      • Internalise the math
      • Re-think the problem
      • Explain the paper to friends/colleagues
    4. Check the code

      • Check the author's implementation
      • Re-implement the proposed solution
      • Run experiments
  • How to Conduct Literature Review Effectively

    Select resource → Read resource → Take notes (topic, approach, results, contributions, limitations/weaknesses) → Keep track of reference → Summarise literature review findings

  • How to Select AI (Audio) Papers Effectively

  • How to Summarize a Research Article

Outline idea

  • Introduction

    • Problem statement (what, why, how)
  • Theory

    • Neural nets/CNNs
    • Image super-resolution
    • Time series super-resolution
    • Autoencoders and U-Nets (the state of the art) - audio super-resolution literature review
    • Time-series analysis - basic concepts (frequency spectrum, Fourier transform, spectrograms, sample rate)
  • My contributions

    • Multiscale convolutions, or
    • Encoding the time-domain representation as images (GAFs, MTFs, recurrence plots or some other time-series imaging technique; see the sketch after this outline)
  • Experiments and results evaluation (metrics, tables etc.) → dataset details (exploratory data analysis: histograms, spectrograms etc.)

  • Spring/Angular project description

  • Conclusion
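For the time-series imaging idea above, a minimal sketch using the pyts library (the chunk length and all parameters are illustrative assumptions):

```python
import numpy as np
from pyts.image import GramianAngularField, MarkovTransitionField

chunk = np.random.randn(1, 256)  # stand-in for one 256-sample audio chunk

# Gramian angular field: rescales the series to [-1, 1] and encodes
# pairwise angular relationships as a 2-D image.
gaf_img = GramianAngularField(method="summation").fit_transform(chunk)[0]

# Markov transition field: quantizes the series into bins and encodes
# the transition probabilities between bins.
mtf_img = MarkovTransitionField(n_bins=8).fit_transform(chunk)[0]

print(gaf_img.shape, mtf_img.shape)  # (256, 256) (256, 256)
```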

First stage

[Diagram: data generation, training and testing in stage 1]

  • Write the data generator scripts (obtaining the low-res/high-res pairs of audio clips; see the sketch after this list)
  • Write the training/testing scripts
  • Read more about why the loss becomes NaN during training
  • Create plots of the training and validation loss
  • Train for 100 epochs
  • Adjust the data split so that all of the data generated from the first 100 VCTK tracks is used for training/validation/testing
  • Finish testing/prediction script:
    • Downsample test audio track
    • Feed chunks of 256 samples of the audio to the model
    • Display spectrogram of the output
    • Save the low-res/high-res/super-res numpy arrays as audio and compare them
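A minimal sketch of such a generator (the function and parameter names are hypothetical; the real scripts may low-pass filter before decimating and store the pairs differently):

```python
import numpy as np
import librosa

def make_pairs(path, patch_len=256, factor=4):
    """Cut a track into fixed-length patches and pair each high-res
    patch with a naively decimated low-res version."""
    y, _ = librosa.load(path, sr=None)
    pairs = []
    for start in range(0, len(y) - patch_len + 1, patch_len):
        hi = y[start:start + patch_len]
        # Keep every `factor`-th sample; a production pipeline would
        # low-pass filter first to avoid aliasing artifacts.
        lo = hi[::factor]
        pairs.append((lo, hi))
    return pairs
```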

Second stage

[Diagram: data generation, training and testing in stage 2]

  • Display the 5-number summary / boxplot of the generated dataset
  • Create line plot of a couple of random samples (both in time-domain and in frequency-domain)
  • Create histogram of a single sample
  • Check what size a chunk could have (transform the numpy array chunk of size n to WAV and listen to the result)
  • Implement checkpoint restoration
  • Generate the dataset again with a chunk size of 4800 samples (100ms), an overlap of 2400 samples (50ms) and a downsampling factor of 4 (the reason for this patch size is that the paper mentions a similar size of 6000 samples used in 400 training epochs)
    • Use 4 processes to speed up the data generation
  • Add 3 more upsampling blocks (a single subpixel layer has no trainable parameters and is not enough to upsample a 1200-sample tensor to 4800 samples; see the sketch after this list)
  • Reconstruct the low-resolution signal without applying interpolation (e.g. a 100-sample chunk downsampled by a factor of 4 becomes a 25-sample chunk, which is fed to the network as-is, without interpolation)
  • Get evaluation metrics (NRMSE comparisons for the high-res/super-res signals and the high-res/interpolated signals)
  • Display some scatterplots
  • Compare the 5e-4 learning rate model with the 100-epoch 1e-4 model and the 200-epoch 1e-4 model
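A TF2 sketch of the 1D subpixel shuffle mentioned above (the channel-grouping convention is an assumption; the project's implementation may differ). It only rearranges channels into the time axis and learns nothing itself, which is why additional parameterized upsampling blocks are needed:

```python
import tensorflow as tf

def subpixel_1d(x, r):
    """Rearrange a (batch, length, channels) tensor into
    (batch, length * r, channels // r) by interleaving groups of
    channels along the time axis. No trainable parameters."""
    length, channels = x.shape[1], x.shape[2]
    x = tf.reshape(x, [-1, length, channels // r, r])
    x = tf.transpose(x, [0, 1, 3, 2])  # move the r phases next to the time axis
    return tf.reshape(x, [-1, length * r, channels // r])

# Example: a (1, 1200, 32) tensor becomes (1, 4800, 8) for r = 4.
print(subpixel_1d(tf.zeros([1, 1200, 32]), r=4).shape)
```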

Third stage

  • Increase the number of filters in the Conv1D layers from 32 to 64 and train

Fourth stage

  • Set the filter count configuration to 64-128-256-256-256-128-64, with the last 2 upsampling blocks retaining the 64-64 arrangement

Fifth stage

  • Add BatchNorm to the downsampling blocks

Sixth stage

  • Perform an ablation study to analyze how much each residual block contributes to the noise found in the model output

Seventh stage

  • Replace the last two upsampling blocks with 1D Inception modules as an experiment
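A minimal Keras sketch of what a 1D Inception module can look like (the branch widths and kernel sizes are assumptions; the project's modules, including where dropout sits, may differ):

```python
from tensorflow.keras import layers

def inception_module_1d(x, filters):
    """Parallel Conv1D branches with different kernel sizes plus a
    pooling branch, concatenated along the channel axis."""
    b1 = layers.Conv1D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv1D(filters, 5, padding="same", activation="relu")(x)
    bp = layers.MaxPooling1D(3, strides=1, padding="same")(x)
    bp = layers.Conv1D(filters, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])
```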

The values below are averages of the per-sample metrics over the test set, computed with Keras's model.evaluate(...).
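The SNR and NRMSE columns can be computed as follows (a sketch; NRMSE normalization conventions vary, so the project's exact definition may differ):

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB of an estimate against a reference."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def nrmse(reference, estimate):
    """RMSE normalized by the range of the reference signal (one
    common convention; others divide by the mean or the std)."""
    rmse = np.sqrt(np.mean((reference - estimate) ** 2))
    return rmse / (reference.max() - reference.min())
```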

Evaluation metrics:

Model           MSE     SNR     NRMSE
3-layer model   67026   15.31   0.1779
4-layer model   68838   15.45   0.1667
5-layer model   64744   15.46   0.1689
Evaluation metrics (with Inception blocks):

Model           MSE      SNR     NRMSE
3-layer model   170675   12.88   0.4275
4-layer model   81340    14.16   0.2289
5-layer model   75183    14.28   0.2341

Eighth stage

  • Remove the dropout layers from the Inception blocks and observe the effect

Evaluation metrics:

Model                                          MSE     SNR     NRMSE
5-layer (Inception modules, with dropout)      75183   14.28   0.2341
5-layer (Inception modules, without dropout)   42914   16.62   0.1116

Ninth stage

  • Remove the dropout layers from the last two upsampling blocks of the normal 5-layer model and observe the effect

Evaluation metrics:

Model                                                          MSE     SNR     NRMSE
5-layer (Inception modules, with dropout)                      75183   14.28   0.2341
5-layer (Inception modules, without dropout)                   42914   16.62   0.1116
5-layer (no Inception modules, no dropout in last 2 blocks)    45324   16.57   0.1125

Tenth stage

  • Remove the dropout layers from the last two upsampling blocks of the normal 3-layer and 4-layer models and observe the effect

Configuration                         3-block model            4-block model            5-block model
                                      MSE     SNR    NRMSE     MSE     SNR    NRMSE     MSE     SNR    NRMSE
Upsampling blocks with dropout        67026   22.78  0.1779    66944   22.92  0.1737    64744   23.07  0.1689
Upsampling blocks without dropout     51023   25.43  0.1162    47582   25.74  0.1125    45324   25.73  0.1125
Inception modules, with dropout       170675  15.23  0.4275    81340   19.79  0.2289    75183   20.27  0.2341
Inception modules, without dropout    49434   25.64  0.1143    55401   25.41  0.1194    42914   25.87  0.1116

Baseline comparisons

The linear and cubic spline baselines were computed as the mean signal-to-noise ratio over all test samples. To compute the SNR of the model with Inception modules and no dropout, a batch size of 1 was used: Keras's model.evaluate averages the metric over batches, so a larger batch size would average per-batch values and skew the result (see the sketch after the table below).


Method                                             Mean SNR
Linear interpolation                               19.67
Cubic spline interpolation                         26.80
5-layer model, Inception blocks without dropout    27.57
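A sketch of the batch-size-1 procedure described above (model, x_test and y_test are hypothetical names):

```python
import numpy as np

def mean_snr_db(model, x_test, y_test):
    """Average per-sample SNR in dB, evaluating one sample at a time
    so that each test sample contributes exactly one SNR value."""
    snrs = []
    for x, y in zip(x_test, y_test):
        pred = model.predict(x[np.newaxis, ...], verbose=0)[0]
        noise = y - pred
        snrs.append(10.0 * np.log10(np.sum(y ** 2) / np.sum(noise ** 2)))
    return float(np.mean(snrs))
```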
