* Adding inferencing notebook
* added multispeaker explanation and usecase and renamed the file
* Adding training tutorial
* fixed dummy paths
* fixed review comments
* fixed metadata extension

Co-authored-by: Eren Gölge <erogol@hotmail.com>
Showing 2 changed files with 726 additions and 0 deletions.
{
"cells": [
{
"cell_type": "markdown",
"id": "45ea3ef5",
"metadata": {
"tags": []
},
"source": [
"# Easy Inferencing with 🐸 TTS ⚡\n",
"\n",
"#### Do you want to quickly synthesize speech using a Coqui 🐸 TTS model?\n",
"\n",
"💡: Grab a pre-trained model and use it to synthesize speech using any speaker voice, including yours! ⚡\n",
"\n",
"🐸 TTS comes with a list of pretrained models and speaker voices. You can even start a local demo server that you can open in your favorite web browser and 🗣️.\n",
"\n",
"In this notebook, we will: \n",
"```\n",
"1. List available pre-trained 🐸 TTS models\n",
"2. Run a 🐸 TTS model\n",
"3. Listen to the synthesized wave 📣\n",
"4. Run a multispeaker 🐸 TTS model \n",
"```\n",
"So, let's jump right in!\n"
]
},
{
"cell_type": "markdown",
"id": "a1e5c2a5-46eb-42fd-b550-2a052546857e",
"metadata": {},
"source": [
"## Install 🐸 TTS ⬇️"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa2aec77",
"metadata": {},
"outputs": [],
"source": [
"! pip install -U pip\n",
"! pip install TTS"
]
},
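{
"cell_type": "markdown",
"id": "install-check-md",
"metadata": {},
"source": [
"*(Optional sanity check, not part of the original tutorial steps.)* The cell below only confirms that the `TTS` package was installed and prints its version, using nothing but the Python standard library."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "install-check-code",
"metadata": {},
"outputs": [],
"source": [
"# Optional: confirm the install worked and show which TTS release we got.\n",
"# Uses only the standard library, so it is independent of the TTS version.\n",
"import importlib.metadata\n",
"\n",
"print(\"Coqui TTS version:\", importlib.metadata.version(\"TTS\"))"
]
},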
{
"cell_type": "markdown",
"id": "8c07a273",
"metadata": {},
"source": [
"## ✅ List available pre-trained 🐸 TTS models\n",
"\n",
"Coqui 🐸TTS comes with a list of pretrained models for different model types (e.g. TTS, vocoder), languages, training datasets, and architectures. \n",
"\n",
"You can either use your own model or one of the released models under 🐸TTS.\n",
"\n",
"Use `tts --list_models` to find out the available models.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "608d203f",
"metadata": {},
"outputs": [],
"source": [
"! tts --list_models"
]
},
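{
"cell_type": "markdown",
"id": "list-models-api-md",
"metadata": {},
"source": [
"If you prefer to stay in Python, newer 🐸TTS releases also expose the model catalogue programmatically. The cell below is a minimal sketch that assumes your installed release ships the `TTS.api.TTS` class with a `list_models()` helper; if it does not, simply use `tts --list_models` as above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "list-models-api-code",
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch, assuming the installed TTS release provides the Python API\n",
"# (TTS.api.TTS) with a list_models() helper; otherwise use the CLI above.\n",
"from TTS.api import TTS\n",
"\n",
"for model_name in TTS().list_models():\n",
"    print(model_name)"
]
},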
{
"cell_type": "markdown",
"id": "ed9dd7ab",
"metadata": {},
"source": [
"## ✅ Run a 🐸 TTS model\n",
"\n",
"#### **First things first**: Using a release model and default vocoder:\n",
"\n",
"You can simply copy the full model name from the list above and use it:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc9e4608-16ec-4dcd-bd6b-bd10d62286f8",
"metadata": {},
"outputs": [],
"source": [
"!tts --text \"hello world\" \\\n",
"--model_name \"tts_models/en/ljspeech/glow-tts\" \\\n",
"--out_path output.wav\n"
]
},
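{
"cell_type": "markdown",
"id": "run-model-api-md",
"metadata": {},
"source": [
"The same synthesis can also be done from Python. The cell below is a minimal sketch, assuming a 🐸TTS release that provides the `TTS.api.TTS` class; `output-api.wav` is just an illustrative file name chosen here so the CLI output above is not overwritten."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "run-model-api-code",
"metadata": {},
"outputs": [],
"source": [
"# Minimal Python-API sketch (assumes TTS.api.TTS is available in this release).\n",
"from TTS.api import TTS\n",
"\n",
"# Loading a model by name downloads it on first use, just like the CLI does.\n",
"tts = TTS(model_name=\"tts_models/en/ljspeech/glow-tts\")\n",
"tts.tts_to_file(text=\"hello world\", file_path=\"output-api.wav\")"
]
},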
{
"cell_type": "markdown",
"id": "0ca2cb14-1aba-400e-a219-8ce44d9410be",
"metadata": {},
"source": [
"## 📣 Listen to the synthesized wave 📣"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5fe63ef4-9284-4461-9dda-1ca7483a8f9b",
"metadata": {},
"outputs": [],
"source": [
"import IPython\n",
"IPython.display.Audio(\"output.wav\")"
]
},
{
"cell_type": "markdown",
"id": "5e67d178-1ebe-49c7-9a47-0593251bdb96",
"metadata": {},
"source": [
"### **Second things second**:\n",
"\n",
"🔶 A TTS model can be trained either on a single speaker voice or on multispeaker voices. This training choice is directly reflected in the inference ability and in the speaker voices that are available to synthesize speech. \n",
"\n",
"🔶 If you want to run a multispeaker model from the released models list, you can first check the speaker IDs using the `--list_speaker_idxs` flag and then use one of these speaker voices to synthesize speech."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "87b18839-f750-4a61-bbb0-c964acaecab2",
"metadata": {},
"outputs": [],
"source": [
"# list the possible speaker IDs.\n",
"!tts --model_name \"tts_models/en/vctk/vits\" \\\n",
"--list_speaker_idxs \n"
]
},
{
"cell_type": "markdown",
"id": "c4365a9d-f922-4b14-88b0-d2b22a245b2e",
"metadata": {},
"source": [
"## 💬 Synthesize speech using speaker ID 💬"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "52be0403-d13e-4d9b-99c2-c10b85154063",
"metadata": {},
"outputs": [],
"source": [
"!tts --text \"Trying out specific speaker voice\" \\\n",
"--out_path spkr-out.wav --model_name \"tts_models/en/vctk/vits\" \\\n",
"--speaker_idx \"p341\""
]
},
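{
"cell_type": "markdown",
"id": "speaker-id-api-md",
"metadata": {},
"source": [
"For reference, here is a sketch of the same multispeaker synthesis with the Python API. It assumes `TTS.api.TTS` is available and that its `tts_to_file()` accepts a `speaker` argument, as in recent releases; `spkr-out-api.wav` is an illustrative file name."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "speaker-id-api-code",
"metadata": {},
"outputs": [],
"source": [
"# Python-API sketch of the multispeaker synthesis above (assumes TTS.api.TTS\n",
"# and a tts_to_file() that accepts a `speaker` argument).\n",
"from TTS.api import TTS\n",
"\n",
"tts = TTS(model_name=\"tts_models/en/vctk/vits\")\n",
"tts.tts_to_file(\n",
"    text=\"Trying out specific speaker voice\",\n",
"    speaker=\"p341\",\n",
"    file_path=\"spkr-out-api.wav\",\n",
")"
]
},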
{
"cell_type": "markdown",
"id": "894a560a-f9c8-48ce-aaa6-afdf516c01f6",
"metadata": {},
"source": [
"## 📣 Listen to the synthesized speaker specific wave 📣"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed485b0a-dfd5-4a7e-a571-ebf74bdfc41d",
"metadata": {},
"outputs": [],
"source": [
"import IPython\n",
"IPython.display.Audio(\"spkr-out.wav\")"
]
},
{
"cell_type": "markdown",
"id": "84636a38-097e-4dad-933b-0aeaee650e92",
"metadata": {},
"source": [
"🔶 If you want to use an external speaker to synthesize speech, you need to supply the `--speaker_wav` flag along with an external speaker encoder path and config file, as follows:"
]
},
{
"cell_type": "markdown",
"id": "cbdb15fa-123a-4282-a127-87b50dc70365",
"metadata": {},
"source": [
"First we need to get the speaker encoder model, its config and a reference `speaker_wav`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e54f1b13-560c-4fed-bafd-e38ec9712359",
"metadata": {},
"outputs": [],
"source": [
"!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json\n",
"!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar\n",
"!wget https://github.com/coqui-ai/TTS/raw/speaker_encoder_model/tests/data/ljspeech/wavs/LJ001-0001.wav"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6dac1912-5054-4a68-8357-6d20fd99cb10",
"metadata": {},
"outputs": [],
"source": [
"!tts --model_name tts_models/multilingual/multi-dataset/your_tts \\\n",
"--encoder_path model_se.pth.tar \\\n",
"--encoder_config config_se.json \\\n",
"--speaker_wav LJ001-0001.wav \\\n",
"--text \"Are we not allowed to dim the lights so people can see that a bit better?\" \\\n",
"--out_path spkr-out.wav \\\n",
"--language_idx \"en\""
]
},
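{
"cell_type": "markdown",
"id": "voice-clone-api-md",
"metadata": {},
"source": [
"The same voice-cloning run can be sketched with the Python API as well. This assumes `tts_to_file()` accepts `speaker_wav` and `language` arguments (as in recent releases) and that the YourTTS release model resolves its own speaker encoder; `cloned-out.wav` is an illustrative output name."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "voice-clone-api-code",
"metadata": {},
"outputs": [],
"source": [
"# Python-API sketch of voice cloning with YourTTS (assumes tts_to_file() accepts\n",
"# `speaker_wav` and `language`, and that the release model bundles its encoder).\n",
"from TTS.api import TTS\n",
"\n",
"tts = TTS(model_name=\"tts_models/multilingual/multi-dataset/your_tts\")\n",
"tts.tts_to_file(\n",
"    text=\"Are we not allowed to dim the lights so people can see that a bit better?\",\n",
"    speaker_wav=\"LJ001-0001.wav\",\n",
"    language=\"en\",\n",
"    file_path=\"cloned-out.wav\",\n",
")"
]
},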
{
"cell_type": "markdown",
"id": "92ddce58-8aca-4f69-84c3-645ae1b12e7d",
"metadata": {},
"source": [
"## 📣 Listen to the synthesized speaker specific wave 📣"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc889adc-9c71-4232-8e85-bfc8f76476f4",
"metadata": {},
"outputs": [],
"source": [
"import IPython\n",
"IPython.display.Audio(\"spkr-out.wav\")"
]
},
{
"cell_type": "markdown",
"id": "29101d01-0b01-4153-a216-5dae415a5dd6",
"metadata": {},
"source": [
"## 🎉 Congratulations! 🎉 You now know how to use a TTS model to synthesize speech! \n",
"Follow up with the next tutorials to learn more advanced material."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}