deep-speechgen: RNN for acoustic speech generation

This project was an early attempt to generate human speech with a recurrent neural network (RNN), dating back to before WaveNet existed and to a time when I had no experience with deep learning or speech processing at all. The project report can be found here.

In hindsight, the project was probably a bit too ambitious, but I still learned an awful lot.

Technical Details

Model

I use a mixture density network (MDN) as the basic architecture, where the underlying neural network is a stack of long short-term memory (LSTM) layers. The approach is inspired by the work of Graves (2013), who applied similar techniques to generate handwriting.
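The repository's own code is not reproduced in this README, but a minimal sketch of such a mixture density RNN might look as follows. It is written in PyTorch for brevity; the hidden size, number of layers, mixture count, and the choice of diagonal Gaussians are illustrative assumptions, not the exact configuration used in this project:

```python
import math
import torch
import torch.nn as nn

class MDNRNN(nn.Module):
    """LSTM stack with a mixture density output layer (diagonal Gaussians)."""

    def __init__(self, feat_dim=40, hidden_size=256, num_layers=2, n_mix=5):
        super().__init__()
        self.feat_dim, self.n_mix = feat_dim, n_mix
        self.lstm = nn.LSTM(feat_dim, hidden_size,
                            num_layers=num_layers, batch_first=True)
        # Per frame: n_mix weight logits, plus a mean and log-std
        # vector of size feat_dim for each mixture component.
        self.head = nn.Linear(hidden_size, n_mix * (1 + 2 * feat_dim))

    def forward(self, x):                # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)
        out = self.head(h)
        logits, mu, log_sigma = torch.split(
            out,
            [self.n_mix, self.n_mix * self.feat_dim, self.n_mix * self.feat_dim],
            dim=-1)
        shape = (*x.shape[:2], self.n_mix, self.feat_dim)
        return logits, mu.view(shape), log_sigma.view(shape)

def mdn_loss(logits, mu, log_sigma, target):
    """Negative log-likelihood of the next frame under the predicted mixture."""
    t = target.unsqueeze(-2)             # (batch, time, 1, feat_dim)
    log_prob = -0.5 * (((t - mu) / log_sigma.exp()) ** 2
                       + 2 * log_sigma + math.log(2 * math.pi)).sum(-1)
    log_w = torch.log_softmax(logits, dim=-1)
    return -torch.logsumexp(log_w + log_prob, dim=-1).mean()

# Teacher-forced training step: predict frame t+1 from frames up to t.
model = MDNRNN()
frames = torch.randn(8, 100, 40)         # dummy batch of mcp feature sequences
logits, mu, log_sigma = model(frames[:, :-1])
loss = mdn_loss(logits, mu, log_sigma, frames[:, 1:])
loss.backward()
```

At generation time, one would instead sample a component from the predicted mixture weights at each step, draw the next frame from that Gaussian, and feed it back as input, as in Graves's handwriting model.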

Experimental Setup

4.5 hours of English speech from the Simple4All Tundra Corpus were used as training data. The audio files were downsampled from 44.1 kHz to 16 kHz, and 40 mel-cepstral coefficients (mcp) were extracted at a frame rate of 80 fps with a window size of 0.025 s. In a first experiment, these features were used to generate novel mcp vectors, from which a spectrogram can be produced. The approach was later extended to generate speech waveforms. AhoCoder was used to encode and decode the speech signal. For more details, see the report.
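To make the numbers above concrete: at 16 kHz, a frame rate of 80 fps corresponds to a hop of 200 samples, and the 0.025 s window to 400 samples. A rough sketch of a comparable extraction using librosa's MFCCs follows; note that these are not identical to AhoCoder's mel-cepstral analysis, and the filename is hypothetical:

```python
import librosa

# Load and resample to 16 kHz.
wav, sr = librosa.load("utterance.wav", sr=16000)

# 80 fps      -> hop of 16000 / 80    = 200 samples;
# 0.025 s win -> 16000 * 0.025        = 400 samples per frame.
mcp = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=40,
                           n_fft=400, hop_length=200)
print(mcp.shape)  # (40, n_frames), i.e. 40 coefficients at ~80 fps
```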
