
Prompting Controlled Emotional TTS

Released by @Flux9665 on 10 Jun · bb4755b

In this release you can condition your TTS model on emotional prompts during training and transfer the emotion of any prompt to the synthesized speech during inference.

Demo samples are available at https://anondemos.github.io/Prompting/
A demo space is available at https://huggingface.co/spaces/Thommy96/promptingtoucan

Using pretrained models:
You can use the pretrained models for inference by providing an instance of the sentence embedding extractor, a speaker id, and a prompt (see run_sent_emb_test_suite.py).
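
A minimal sketch of what that looks like in code. The extractor class, constructor arguments, and setter names below are assumptions for illustration; run_sent_emb_test_suite.py shows the actual calls.

```python
import torch

# Hypothetical imports standing in for the repository's actual modules.
from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface  # path assumed
from Preprocessing.SentenceEmbeddingExtractor import SentenceEmbeddingExtractor  # name assumed

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) The sentence embedding extractor turns a textual prompt into a fixed-size vector.
extractor = SentenceEmbeddingExtractor()  # hypothetical class
prompt_embedding = extractor.encode(["I am so happy to see you again!"])  # signature assumed

# 2) Condition the TTS interface on that vector plus a speaker id.
tts = ToucanTTSInterface(device=device)  # constructor args assumed
tts.set_sentence_embedding(prompt_embedding)  # hypothetical setter
tts.set_speaker_id(3)  # hypothetical setter

# 3) Synthesize: the emotion carried by the prompt is transferred to the audio.
tts.read_to_file(text_list=["This is a test sentence."], file_location="happy_sample.wav")
```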

Training your own model:
You will need to extract a number of prompts and their sentence embeddings for every emotion category that you want to include during training (see e.g. extract_yelp_sent_embs.py).
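
The extraction step presumably looks something like the sketch below; the choice of sentence-transformers as the encoder and the on-disk format are assumptions here, extract_yelp_sent_embs.py shows how it is actually done for the Yelp prompts.

```python
import torch
from sentence_transformers import SentenceTransformer

# Hypothetical prompt lists per emotion category you want to cover in training.
prompts_per_emotion = {
    "joy":     ["I am thrilled, this is wonderful!", "What a fantastic day."],
    "anger":   ["This is absolutely unacceptable.", "I am furious about this."],
    "sadness": ["I miss them so much.", "Nothing feels right anymore."],
}

encoder = SentenceTransformer("all-mpnet-base-v2")  # encoder choice assumed

# One embedding tensor per prompt, grouped by emotion category.
sent_embs = {
    emotion: [torch.tensor(encoder.encode(p)) for p in prompts]
    for emotion, prompts in prompts_per_emotion.items()
}

torch.save(sent_embs, "sent_embs_per_emotion.pt")  # file name and format assumed
```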
Then, in your training pipeline:

- Load these sentence embeddings and pass them to the train loop (as sketched below).
- Provide the dimensionality of the embeddings when instantiating the TTS model and set static_speaker_embedding=True (see TrainingInterfaces/TrainingPipelines/ToucanTTS_Sent_Finetuning.py).
- Adapt the size of the speaker embedding table in the TTS model to the number of speakers in the datasets you train on.
- Finally, check that the datasets you use are covered by the functions that extract emotion and speaker id from the file path (Utility/utils.py).
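
A sketch of how these pieces could fit together, loosely modeled on the finetuning pipeline named above; all constructor and function arguments here are assumptions, the real ones are in ToucanTTS_Sent_Finetuning.py.

```python
import torch

# Hypothetical imports standing in for the repository's actual modules.
from TrainingInterfaces.ToucanTTS import ToucanTTS  # path assumed
from TrainingInterfaces.toucantts_train_loop import train_loop  # path assumed

# Load the embeddings produced in the extraction step.
sent_embs = torch.load("sent_embs_per_emotion.pt")
sent_embed_dim = next(iter(sent_embs.values()))[0].shape[-1]  # e.g. 768

model = ToucanTTS(
    sent_embed_dim=sent_embed_dim,   # argument name assumed
    static_speaker_embedding=True,   # required for this setup
    n_speakers=10,                   # size the speaker embedding table to your datasets
)

my_datasets = []  # placeholder: your prepared TTS datasets

train_loop(  # signature assumed
    net=model,
    datasets=my_datasets,
    sent_embs=sent_embs,             # the loop pairs each training sample with an embedding
    device=torch.device("cuda"),
)
```

The train loop can only pair each sample with the right embedding if it can tell which emotion and speaker a file belongs to, which is why the file-path parsing functions in Utility/utils.py need to know about your datasets.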