Lightweight end-to-end sound recognition pipeline using Temporal Convolutional Networks (CNN-TCN architecture) with SpecAugment data augmentation. The implementation uses the UrbanSound8K dataset, available here: https://urbansounddataset.weebly.com/urbansound8k.html
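
For context, SpecAugment augments a spectrogram by masking random bands of frequency bins and time frames. The snippet below is a minimal NumPy sketch of that masking step (time warping omitted); the function name `spec_augment`, the mask widths, and the `(n_mels, n_frames)` layout are illustrative assumptions and not taken from the notebook.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, num_time_masks=2,
                 max_freq_width=8, max_time_width=16):
    """Apply SpecAugment-style frequency and time masking to a
    (n_mels, n_frames) log-mel spectrogram. Returns a masked copy."""
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    fill_value = spec.mean()

    # Frequency masking: overwrite random horizontal bands of mel bins.
    for _ in range(num_freq_masks):
        width = np.random.randint(0, max_freq_width + 1)
        start = np.random.randint(0, max(1, n_mels - width))
        spec[start:start + width, :] = fill_value

    # Time masking: overwrite random vertical bands of frames.
    for _ in range(num_time_masks):
        width = np.random.randint(0, max_time_width + 1)
        start = np.random.randint(0, max(1, n_frames - width))
        spec[:, start:start + width] = fill_value

    return spec

# Example: augment a dummy 64-mel x 128-frame spectrogram.
dummy = np.random.randn(64, 128).astype(np.float32)
augmented = spec_augment(dummy)
```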
You can also try the Jupyter notebook directly in Google Colab without needing your own GPU: https://colab.research.google.com/drive/1wflMtZzj3wXiXYrH7Y3vR1m4fdf5TJyD
Disclaimer: This was part of a small university project of mine last summer. It was only a 2 hrs/week project whose initial goal was to practice reading papers in the domain of audio engineering and then write a short paper of your own (max. 4 pages) based on them. I couldn't resist overshooting that target a bit by doing some AI work and implementing it.
It's not perfect and you should not treat it as a "real" research paper, but maybe somebody is interested in the code and architecture nevertheless, so I'm uploading it here. ;)