A repository with comprehensive instructions for using the Festvox toolkit for generating emotional speech from text. This was done as a part of a course project for Speech Recognition and Understanding (ECE557/CSE5SRU) at IIIT Delhi during Winter 2020.
Dataset | No. of Speakers | Emotions | No. of utterances | No. of unique prompts | Duration | Language | Comments | Pros | Cons |
---|---|---|---|---|---|---|---|---|---|
TESS | 2 (2 female) | 7 (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral) | 2800 | 200 | ~2 hours | English | | | |
EmoV-DB | 5 (3 male, 2 female) | 5 (neutral, amused, angry, sleepy, disgust) | 6914 (1568, 1315, 1293, 1720, 1018) | 1150 | ~7 hours | English, French (1 male speaker) | | | |
The HTS Toolkit is the usual first step for HMM-based speech synthesis. Much of the work we came across that used HMM techniques to generate speech referred to HTS for its implementation (this paper, this detailed lecture and this beginner's guide were extremely helpful).
- Even with the help of the HTS documentation, setting up and using HTS is not a cakewalk (which led us to write this README for a more structured approach), and the vast number of parameters to set is extremely overwhelming for a beginner.
- When we attempted to write the models from scratch, we found that most of the techniques described in the papers above build incrementally on several other works, which made them hard to trace and thus hard to implement.
The next step was to try the Festvox Toolkit. We tried it on the TESS Dataset as detailed above.
- Even though we were able to set up the HMM Toolkit, every utterance in the TESS Dataset repeats the same base phrase, "Say the word", followed by a unique word.
- After "Say the word", the model found it difficult to utter the next word.
- The models capture different emotions and levels of expressiveness to some degree, but fall short on vocabulary, so the next step would be to train on a larger emotional corpus with a richer vocabulary, such as EmoV-DB.
The steps followed are documented in the following flowchart -
The EmoV-DB dataset was formatted as described in this section. Further details about training from scratch are given here.
The Festvox project is part of the work at Carnegie Mellon University's speech group aimed at advancing the state of speech synthesis.
We will be using Festvox to train our HMM models and build voices.
- Docker
- Audio Files: The audio files to be used for training.
- File with utterances: A file that lists the audio files and their transcripts. The schema is described below.
A preconfigured Docker image was created by mjansche for the Text-to-Speech tutorial at SLTU 2016. We will be training our HMM models using this Docker image.
The Docker Image can be pulled by
docker pull mjansche/tts-tutorial-sltu2016
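Once the image has been pulled, a container can be started along the following lines. This is only a sketch: the function name `run_tts_container`, the container name `festvox-tts`, the `/data` mount point and the assumption that `/bin/bash` is available in the image are ours, not part of the original tutorial. The container is deliberately not started with `--rm`, so that anything built inside it survives an exit.

```shell
# Hypothetical helper: start an interactive container from the tutorial image.
# $1 is an absolute host directory holding recordings and transcripts; it is
# mounted at /data inside the container (an assumed mount point).
run_tts_container() {
    local data_dir="${1:-$PWD/data}"
    docker run -it \
        --name festvox-tts \
        -v "${data_dir}:/data" \
        mjansche/tts-tutorial-sltu2016 \
        /bin/bash
}
```

For example, `run_tts_container /home/me/recordings` would make the host files visible under `/data` inside the container.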
After pulling the Docker image, we need to set up flite, an open-source, small, fast run-time text-to-speech engine.
To set up flite, run the Docker image and, once in the directory /usr/local/src, run the following commands:
git clone https://github.com/festvox/flite.git
cd flite
./configure
make
The training requires PCM-encoded, 16-bit, mono wav audio files with a sampling rate of 16 kHz. Please use ffmpeg to convert the recorded audio files to the correct format by running the following:
ffmpeg -i input.mp3 -acodec pcm_s16le -ac 1 -ar 16000 output.wav
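For a whole directory of recordings, the same conversion can be wrapped in a loop. The sketch below assumes every regular file in the source directory is an audio file that ffmpeg can read; `convert_all` and its two arguments are hypothetical names chosen for illustration.

```shell
# Convert every file in a source directory to 16 kHz, 16-bit, mono PCM wav.
# $1: directory with the original recordings, $2: output directory.
convert_all() {
    local src_dir="$1" out_dir="$2" f base
    mkdir -p "$out_dir"
    for f in "$src_dir"/*; do
        [ -f "$f" ] || continue            # skip sub-directories and empty globs
        base=$(basename "$f")
        # -y overwrites an existing output file without prompting
        ffmpeg -y -loglevel error -i "$f" \
            -acodec pcm_s16le -ac 1 -ar 16000 \
            "$out_dir/${base%.*}.wav"
    done
}
```

Calling `convert_all raw_audio wav_16k` would leave one `.wav` file per input in `wav_16k`, named after the input's base name.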
For training you need to make a file named txt.done.data with the base filenames of all the utterances and the text of each utterance. e.g.
( audio_0001 "a whole joy was reaping." )
( audio_0002 "but they've gone south." )
( audio_0003 "you should fetch azure mike." )
Caution: There is a space after the opening round brace, before the closing round brace, and between the file name and the utterance. The utterance must be in double quotes.
The first step in training the HMM is to prepare the directory. After running the Docker image:
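If the transcripts are already available in tabular form, the file can be generated rather than typed by hand. The sketch below assumes a tab-separated input with the base filename in the first column and the transcript in the second, and that transcripts contain no double quotes; the function name `make_txt_done_data` is hypothetical.

```shell
# Print txt.done.data lines from a TSV of "<base filename><TAB><transcript>".
# Lines with fewer than two columns are skipped.
make_txt_done_data() {
    awk -F '\t' 'NF >= 2 {
        # one space after "(", between name and text, and before ")"
        printf "( %s \"%s\" )\n", $1, $2
    }' "$1"
}
```

A typical call would be `make_txt_done_data transcripts.tsv > txt.done.data`, where `transcripts.tsv` is a placeholder for your own file.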
cd /usr/local/src/festvox/src/clustergen
mkdir cmu_us_ss
cd cmu_us_ss
$FESTVOXDIR/src/clustergen/setup_cg cmu us ss
Instead of "cmu" and "ss" you can pick any names you want, but please keep "us" so that Festival knows to use the US English pronunciation dictionary. For indic voices, use "indic" instead of "us".
Assuming that you have already prepared the audio files and the list of utterances,
cp -p WHATEVER/txt.done.data etc/
cp -p WHATEVER/wav/*.wav recording/
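Before going further, it can help to confirm that every entry in the prompt file is well formed and has a matching recording. The following is a minimal sketch of such a check; `check_prompts` is a hypothetical helper, not part of Festvox, and the pattern it uses only encodes the spacing and quoting rules described above.

```shell
# Verify each line looks like: ( <name> "<text>" )
# and that <wav_dir>/<name>.wav exists. Returns non-zero on any problem.
check_prompts() {
    local prompts="$1" wav_dir="$2" status=0 line name
    while IFS= read -r line || [ -n "$line" ]; do
        if ! printf '%s\n' "$line" | grep -Eq '^\( [^ ]+ ".*" \)$'; then
            echo "malformed line: $line" >&2
            status=1
            continue
        fi
        name=$(printf '%s\n' "$line" | awk '{print $2}')   # second field is the base filename
        if [ ! -f "$wav_dir/$name.wav" ]; then
            echo "missing audio: $wav_dir/$name.wav" >&2
            status=1
        fi
    done < "$prompts"
    return "$status"
}
```

From the voice directory, this could be run as `check_prompts etc/txt.done.data recording`.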
Since the recordings might not be as good as they could be, you can power normalize them.
./bin/get_wavs recording/*.wav
Synthesis builds (especially labeling) also work best when there is only a limited amount of leading and trailing silence. We can prune the silence with
./bin/prune_silence wav/*.wav
Note: If you do not require these preprocessing stages, you can put your wav files directly into wav/
For building voices, you can use an automated script that will do the feature extraction, build the models and generate some test examples.
./bin/build_cg_rfs_voice
Firstly, build the prompts and label the data.
./bin/do_build build_prompts etc/txt.done.data
./bin/do_build label etc/txt.done.data
./bin/do_clustergen parallel build_utts etc/txt.done.data
./bin/do_clustergen generate_statename
./bin/do_clustergen generate_filters
Then do feature extraction
./bin/do_clustergen parallel f0_v_sptk
./bin/do_clustergen parallel mcep_sptk
./bin/do_clustergen parallel combine_coeffs_v
Build the models
./bin/traintest etc/txt.done.data
./bin/do_clustergen parallel cluster etc/txt.done.data.train
./bin/do_clustergen dur etc/txt.done.data.train
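Because each of the steps above depends on the output of the previous one, it can be convenient to run them from a list and stop at the first failure. The helper below is only a sketch under that assumption; `run_steps` is a hypothetical name, and it simply executes whatever commands it is given, one per line.

```shell
# Read one command per line from stdin, echo it, run it, and stop at the
# first command that exits with a non-zero status.
run_steps() {
    local step
    while IFS= read -r step || [ -n "$step" ]; do
        [ -n "$step" ] || continue                  # ignore blank lines
        echo ">>> $step"
        # </dev/null keeps the step from consuming the remaining command list
        sh -c "$step" </dev/null || { echo "FAILED: $step" >&2; return 1; }
    done
}
```

For instance, the commands above could be saved to a file such as `steps.txt` (a placeholder name) and run from the voice directory with `run_steps < steps.txt`.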
We will use flite to generate audio from the trained model.
rm -rf flite
$FLITEDIR/tools/setup_flite
./bin/build_flite cg
cd flite
make
flite requires a .flitevox object to build the voices. Create the .flitevox object by
./flite_cmu_us_${NAME} -voicedump output.flitevox
Then audio can be easily generated for any utterance by
./flite_cmu_us_${NAME} "<sentence to utter>" output.wav
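To synthesize many sentences in one go, the command above can be placed in a loop. The sketch below assumes a plain-text file with one sentence per line; `synthesize_all`, the `utt_NNNN.wav` naming scheme and the argument order are our own choices for illustration, while the voice binary is invoked exactly as shown above.

```shell
# $1: path to the built voice binary (e.g. ./flite_cmu_us_${NAME})
# $2: text file with one sentence per line
# $3: output directory for the generated wav files
synthesize_all() {
    local voice_bin="$1" sentences="$2" out_dir="$3" n=0 sentence
    mkdir -p "$out_dir"
    while IFS= read -r sentence || [ -n "$sentence" ]; do
        [ -n "$sentence" ] || continue              # skip empty lines
        n=$((n + 1))
        "$voice_bin" "$sentence" "$out_dir/$(printf 'utt_%04d' "$n").wav" </dev/null
    done < "$sentences"
}
```

Output files are numbered in the order the sentences appear, skipping blank lines.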
We also make our system demonstration publicly available within the hmm_wrapper directory. Further details are provided in the README of the directory.
We also make the trained models for the different emotions available here.
These models can be used for further fine-tuning or for running the system provided in the hmm_wrapper directory.
Festvox : Festvox project developed by Carnegie Mellon University.
Docker : Festvox configured docker image.
Building Data : The format for utterance file.
Training : Steps to train the HMM Model.
Automated Script : Description of the automated script.
If you find any of the approaches or code in this repository useful, please consider citing this repository:
@software{pranav_jain_2020_3876162,
author = {Pranav Jain and
Srija Anand and
Eshita and
Shruti Singh and
Aditya Chetan and
Brihi Joshi and
Pulkit Madaan},
title = {{An exploration into HMM-based methods for
Emotional Text-to-Speech}},
month = jun,
year = 2020,
publisher = {Zenodo},
version = {v1.0.0},
doi = {10.5281/zenodo.3876162},
url = {https://doi.org/10.5281/zenodo.3876162}
}
For any errors or help in running the project, please open an issue or write to any of the project members -
- Pranav Jain (pranav16255 [at] iiitd [dot] ac [dot] in)
- Srija Anand (srija17199 [at] iiitd [dot] ac [dot] in)
- Eshita (eshita17149 [at] iiitd [dot] ac [dot] in)
- Shruti Singh (shruti17211 [at] iiitd [dot] ac [dot] in)
- Pulkit Madaan (pulkit16257 [at] iiitd [dot] ac [dot] in)
- Aditya Chetan (aditya16217 [at] iiitd [dot] ac [dot] in)
- Brihi Joshi (brihi16142 [at] iiitd [dot] ac [dot] in)