From 0e69367e197ad89ebbc1956345c9c901f23deb18 Mon Sep 17 00:00:00 2001 From: Aya Jafari Date: Fri, 20 May 2022 12:45:01 +0300 Subject: [PATCH 1/6] Adding inferencing notebook --- notebooks/use-pretrained-TTS.ipynb | 198 +++++++++++++++++++++++++++++ 1 file changed, 198 insertions(+) create mode 100644 notebooks/use-pretrained-TTS.ipynb diff --git a/notebooks/use-pretrained-TTS.ipynb b/notebooks/use-pretrained-TTS.ipynb new file mode 100644 index 0000000000..c7bd8bed7c --- /dev/null +++ b/notebooks/use-pretrained-TTS.ipynb @@ -0,0 +1,198 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "45ea3ef5", + "metadata": { + "tags": [] + }, + "source": [ + "# Easy inferencing with 🐸 TTS ⚡\n", + "\n", + "You want to quicly synthesize speech using Coqui (🐸) TTS model?\n", + "\n", + "💡: Grab a pre-trained model and use it to synthesize speech using any speaker voice, including yours! ⚡\n", + "\n", + "🐸 TTS comes with a list of pretrained models and speaker voices. You can even start a local demo server that you can open it on your favorite web browser and 🗣️ .\n", + "\n", + "\n", + "In this notebook, we will:\n", + "\n", + "1. Download a pre-trained TTS english model.\n", + "\n", + "\n", + "So, let's jump right in!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa2aec77", + "metadata": {}, + "outputs": [], + "source": [ + "## Install Coqui STT\n", + "! pip install -U pip\n", + "! pip install TTS" + ] + }, + { + "cell_type": "markdown", + "id": "8c07a273", + "metadata": {}, + "source": [ + "## ✅ List available pre-trained 🐸 TTS models\n", + "\n", + "Coqui 🐸TTS comes with a list of pretrained models for different model types (ex: TTS, vocoder), languages, datasets used for training and architectures. \n", + "\n", + "You can either use your own model or the release models under 🐸TTS.\n", + "\n", + "Use `tts --list_models` to find out the availble models.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "608d203f", + "metadata": {}, + "outputs": [], + "source": [ + "! tts --list_models" + ] + }, + { + "cell_type": "markdown", + "id": "ed9dd7ab", + "metadata": {}, + "source": [ + "## ✅ Run a TTS model\n", + "\n", + "### **First things first**: Using a release model and default vocoder:\n", + "\n", + "#### You can simply copy the full model name from the list above and use it \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cc9e4608-16ec-4dcd-bd6b-bd10d62286f8", + "metadata": {}, + "outputs": [], + "source": [ + "!tts --text \"hello world\" \\\n", + "--model_name \"tts_models/en/ljspeech/glow-tts\" \\\n", + "--out_path output.wav\n" + ] + }, + { + "cell_type": "markdown", + "id": "0ca2cb14-1aba-400e-a219-8ce44d9410be", + "metadata": {}, + "source": [ + "## 📣 Listen to the synthesized wave 📣" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5fe63ef4-9284-4461-9dda-1ca7483a8f9b", + "metadata": {}, + "outputs": [], + "source": [ + "import IPython\n", + "IPython.display.Audio(\"output.wav\")" + ] + }, + { + "cell_type": "markdown", + "id": "5e67d178-1ebe-49c7-9a47-0593251bdb96", + "metadata": {}, + "source": [ + "### **Second things second**:\n", + "\n", + "#### If you want to run a multispeaker model from the released models list, you can first check the speaker ids using `--list_speaker_idx` flag and use this speaker voice to synthesize speech." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "87b18839-f750-4a61-bbb0-c964acaecab2", + "metadata": {}, + "outputs": [], + "source": [ + "# list the possible speaker IDs.\n", + "!tts --model_name \"tts_models/en/vctk/vits\" \\\n", + "--list_speaker_idxs \n" + ] + }, + { + "cell_type": "markdown", + "id": "c4365a9d-f922-4b14-88b0-d2b22a245b2e", + "metadata": {}, + "source": [ + "## 💬 Synthesize speech using speaker ID 💬" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "52be0403-d13e-4d9b-99c2-c10b85154063", + "metadata": {}, + "outputs": [], + "source": [ + "!tts --text \"Trying out specific speaker voice\"\\\n", + "--out_path spkr-out.wav --model_name \"tts_models/en/vctk/vits\" \\\n", + "--speaker_idx \"p341\"" + ] + }, + { + "cell_type": "markdown", + "id": "894a560a-f9c8-48ce-aaa6-afdf516c01f6", + "metadata": {}, + "source": [ + "## 📣 Listen to the synthesized speaker specific wave 📣" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ed485b0a-dfd5-4a7e-a571-ebf74bdfc41d", + "metadata": {}, + "outputs": [], + "source": [ + "import IPython\n", + "IPython.display.Audio(\"spkr-out.wav\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9116400-aff7-4a04-810f-7f89e66d2950", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From d06a7307df83886c2d55c8fe6562c53682aed63e Mon Sep 17 00:00:00 2001 From: Aya Jafari Date: Tue, 24 May 2022 16:47:11 +0300 Subject: [PATCH 2/6] added multispeaker explanation and usecase and renamed the file --- notebooks/Tutorial_1_use-pretrained-TTS.ipynb | 297 ++++++++++++++++++ notebooks/use-pretrained-TTS.ipynb | 198 ------------ 2 files changed, 297 insertions(+), 198 deletions(-) create mode 100644 notebooks/Tutorial_1_use-pretrained-TTS.ipynb delete mode 100644 notebooks/use-pretrained-TTS.ipynb diff --git a/notebooks/Tutorial_1_use-pretrained-TTS.ipynb b/notebooks/Tutorial_1_use-pretrained-TTS.ipynb new file mode 100644 index 0000000000..238f272784 --- /dev/null +++ b/notebooks/Tutorial_1_use-pretrained-TTS.ipynb @@ -0,0 +1,297 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "45ea3ef5", + "metadata": { + "tags": [] + }, + "source": [ + "# Easy Inferencing with 🐸 TTS ⚡\n", + "\n", + "### You want to quicly synthesize speech using Coqui 🐸 TTS model?\n", + "\n", + "### 💡: Grab a pre-trained model and use it to synthesize speech using any speaker voice, including yours! ⚡\n", + "\n", + "#### 🐸 TTS comes with a list of pretrained models and speaker voices. You can even start a local demo server that you can open it on your favorite web browser and 🗣️ .\n", + "\n", + "#### In this notebook, we will: \n", + "```\n", + "1. List available pre-trained 🐸 TTS models\n", + "2. Run a 🐸 TTS model\n", + "3. Listen to the synthesized wave 📣\n", + "4. 
Run multispeaker 🐸 TTS model \n", + "```\n", + "#### So, let's jump right in!\n" + ] + }, + { + "cell_type": "markdown", + "id": "a1e5c2a5-46eb-42fd-b550-2a052546857e", + "metadata": {}, + "source": [ + "## Install 🐸 TTS ⬇️" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa2aec77", + "metadata": {}, + "outputs": [], + "source": [ + "! pip install -U pip\n", + "! pip install TTS" + ] + }, + { + "cell_type": "markdown", + "id": "8c07a273", + "metadata": {}, + "source": [ + "## ✅ List available pre-trained 🐸 TTS models\n", + "\n", + "#### Coqui 🐸TTS comes with a list of pretrained models for different model types (ex: TTS, vocoder), languages, datasets used for training and architectures. \n", + "\n", + "#### You can either use your own model or the release models under 🐸TTS.\n", + "\n", + "#### Use `tts --list_models` to find out the availble models.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "608d203f", + "metadata": {}, + "outputs": [], + "source": [ + "! tts --list_models" + ] + }, + { + "cell_type": "markdown", + "id": "ed9dd7ab", + "metadata": {}, + "source": [ + "## ✅ Run a 🐸 TTS model\n", + "\n", + "### **First things first**: Using a release model and default vocoder:\n", + "\n", + "#### You can simply copy the full model name from the list above and use it \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cc9e4608-16ec-4dcd-bd6b-bd10d62286f8", + "metadata": {}, + "outputs": [], + "source": [ + "!tts --text \"hello world\" \\\n", + "--model_name \"tts_models/en/ljspeech/glow-tts\" \\\n", + "--out_path output.wav\n" + ] + }, + { + "cell_type": "markdown", + "id": "0ca2cb14-1aba-400e-a219-8ce44d9410be", + "metadata": {}, + "source": [ + "## 📣 Listen to the synthesized wave 📣" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5fe63ef4-9284-4461-9dda-1ca7483a8f9b", + "metadata": {}, + "outputs": [], + "source": [ + "import IPython\n", + "IPython.display.Audio(\"output.wav\")" + ] + }, + { + "cell_type": "markdown", + "id": "5e67d178-1ebe-49c7-9a47-0593251bdb96", + "metadata": {}, + "source": [ + "### **Second things second**:\n", + "\n", + "#### 🔶 A TTS model can be either trained on a single speaker voice or multispeaker voices. This training choice is directly reflected on the inference ability and the available speaker voices that can be used to synthesize speech. \n", + "\n", + "#### 🔶 If you want to run a multispeaker model from the released models list, you can first check the speaker ids using `--list_speaker_idx` flag and use this speaker voice to synthesize speech." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "87b18839-f750-4a61-bbb0-c964acaecab2", + "metadata": {}, + "outputs": [], + "source": [ + "# list the possible speaker IDs.\n", + "!tts --model_name \"tts_models/en/vctk/vits\" \\\n", + "--list_speaker_idxs \n" + ] + }, + { + "cell_type": "markdown", + "id": "c4365a9d-f922-4b14-88b0-d2b22a245b2e", + "metadata": {}, + "source": [ + "## 💬 Synthesize speech using speaker ID 💬" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "52be0403-d13e-4d9b-99c2-c10b85154063", + "metadata": {}, + "outputs": [], + "source": [ + "!tts --text \"Trying out specific speaker voice\"\\\n", + "--out_path spkr-out.wav --model_name \"tts_models/en/vctk/vits\" \\\n", + "--speaker_idx \"p341\"" + ] + }, + { + "cell_type": "markdown", + "id": "894a560a-f9c8-48ce-aaa6-afdf516c01f6", + "metadata": {}, + "source": [ + "## 📣 Listen to the synthesized speaker specific wave 📣" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "ed485b0a-dfd5-4a7e-a571-ebf74bdfc41d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import IPython\n", + "IPython.display.Audio(\"spkr-out.wav\")" + ] + }, + { + "cell_type": "markdown", + "id": "84636a38-097e-4dad-933b-0aeaee650e92", + "metadata": {}, + "source": [ + "#### 🔶 If you want to use an external speaker to synthesize speech, you need to supply `--speaker_wave` flag along with an external speaker encoder path and config file, as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6dac1912-5054-4a68-8357-6d20fd99cb10", + "metadata": {}, + "outputs": [], + "source": [ + "!tts --model_name tts_models/multilingual/multi-dataset/your_tts \\\n", + "--encoder_path \"path/to/speaker/encoder/model_se.pth.tar\" \\\n", + "--encoder_config \"path/to/speaker/encoder/config_se.json\" \\\n", + "--speaker_wav \"path/to/speaker/wave/file.wav\" \\\n", + "--text \"Are we not allowed to dim the lights so people can see that a bit better?\"\\\n", + "--out_path spkr-out.wav" + ] + }, + { + "cell_type": "markdown", + "id": "92ddce58-8aca-4f69-84c3-645ae1b12e7d", + "metadata": {}, + "source": [ + "## 📣 Listen to the synthesized speaker specific wave 📣" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "cc889adc-9c71-4232-8e85-bfc8f76476f4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import IPython\n", + "IPython.display.Audio(\"spkr-out.wav\")" + ] + }, + { + "cell_type": "markdown", + "id": "29101d01-0b01-4153-a216-5dae415a5dd6", + "metadata": {}, + "source": [ + "## 🎉 Congratulations! 🎉 You now know how to use a TTS model to synthesize speech! \n", + "### Follow up with the next tutorials to learn more adnavced material." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3723d62f-85e5-4f7f-8a61-4a3409477f34", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/use-pretrained-TTS.ipynb b/notebooks/use-pretrained-TTS.ipynb deleted file mode 100644 index c7bd8bed7c..0000000000 --- a/notebooks/use-pretrained-TTS.ipynb +++ /dev/null @@ -1,198 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "45ea3ef5", - "metadata": { - "tags": [] - }, - "source": [ - "# Easy inferencing with 🐸 TTS ⚡\n", - "\n", - "You want to quicly synthesize speech using Coqui (🐸) TTS model?\n", - "\n", - "💡: Grab a pre-trained model and use it to synthesize speech using any speaker voice, including yours! ⚡\n", - "\n", - "🐸 TTS comes with a list of pretrained models and speaker voices. You can even start a local demo server that you can open it on your favorite web browser and 🗣️ .\n", - "\n", - "\n", - "In this notebook, we will:\n", - "\n", - "1. Download a pre-trained TTS english model.\n", - "\n", - "\n", - "So, let's jump right in!\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fa2aec77", - "metadata": {}, - "outputs": [], - "source": [ - "## Install Coqui STT\n", - "! pip install -U pip\n", - "! pip install TTS" - ] - }, - { - "cell_type": "markdown", - "id": "8c07a273", - "metadata": {}, - "source": [ - "## ✅ List available pre-trained 🐸 TTS models\n", - "\n", - "Coqui 🐸TTS comes with a list of pretrained models for different model types (ex: TTS, vocoder), languages, datasets used for training and architectures. \n", - "\n", - "You can either use your own model or the release models under 🐸TTS.\n", - "\n", - "Use `tts --list_models` to find out the availble models.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "608d203f", - "metadata": {}, - "outputs": [], - "source": [ - "! 
tts --list_models" - ] - }, - { - "cell_type": "markdown", - "id": "ed9dd7ab", - "metadata": {}, - "source": [ - "## ✅ Run a TTS model\n", - "\n", - "### **First things first**: Using a release model and default vocoder:\n", - "\n", - "#### You can simply copy the full model name from the list above and use it \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cc9e4608-16ec-4dcd-bd6b-bd10d62286f8", - "metadata": {}, - "outputs": [], - "source": [ - "!tts --text \"hello world\" \\\n", - "--model_name \"tts_models/en/ljspeech/glow-tts\" \\\n", - "--out_path output.wav\n" - ] - }, - { - "cell_type": "markdown", - "id": "0ca2cb14-1aba-400e-a219-8ce44d9410be", - "metadata": {}, - "source": [ - "## 📣 Listen to the synthesized wave 📣" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5fe63ef4-9284-4461-9dda-1ca7483a8f9b", - "metadata": {}, - "outputs": [], - "source": [ - "import IPython\n", - "IPython.display.Audio(\"output.wav\")" - ] - }, - { - "cell_type": "markdown", - "id": "5e67d178-1ebe-49c7-9a47-0593251bdb96", - "metadata": {}, - "source": [ - "### **Second things second**:\n", - "\n", - "#### If you want to run a multispeaker model from the released models list, you can first check the speaker ids using `--list_speaker_idx` flag and use this speaker voice to synthesize speech." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "87b18839-f750-4a61-bbb0-c964acaecab2", - "metadata": {}, - "outputs": [], - "source": [ - "# list the possible speaker IDs.\n", - "!tts --model_name \"tts_models/en/vctk/vits\" \\\n", - "--list_speaker_idxs \n" - ] - }, - { - "cell_type": "markdown", - "id": "c4365a9d-f922-4b14-88b0-d2b22a245b2e", - "metadata": {}, - "source": [ - "## 💬 Synthesize speech using speaker ID 💬" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "52be0403-d13e-4d9b-99c2-c10b85154063", - "metadata": {}, - "outputs": [], - "source": [ - "!tts --text \"Trying out specific speaker voice\"\\\n", - "--out_path spkr-out.wav --model_name \"tts_models/en/vctk/vits\" \\\n", - "--speaker_idx \"p341\"" - ] - }, - { - "cell_type": "markdown", - "id": "894a560a-f9c8-48ce-aaa6-afdf516c01f6", - "metadata": {}, - "source": [ - "## 📣 Listen to the synthesized speaker specific wave 📣" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ed485b0a-dfd5-4a7e-a571-ebf74bdfc41d", - "metadata": {}, - "outputs": [], - "source": [ - "import IPython\n", - "IPython.display.Audio(\"spkr-out.wav\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c9116400-aff7-4a04-810f-7f89e66d2950", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} From 441222a99a5e9a12826e99452b0487e389d4f177 Mon Sep 17 00:00:00 2001 From: Aya Jafari Date: Fri, 27 May 2022 16:36:12 +0300 Subject: [PATCH 3/6] Adding training tutorial --- ...utorial_2_train_your_first_TTS_model.ipynb | 438 ++++++++++++++++++ 1 file changed, 438 insertions(+) create mode 100644 notebooks/Tutorial_2_train_your_first_TTS_model.ipynb diff --git a/notebooks/Tutorial_2_train_your_first_TTS_model.ipynb 
b/notebooks/Tutorial_2_train_your_first_TTS_model.ipynb new file mode 100644 index 0000000000..860c61508e --- /dev/null +++ b/notebooks/Tutorial_2_train_your_first_TTS_model.ipynb @@ -0,0 +1,438 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "f79d99ef", + "metadata": {}, + "source": [ + "# Train your first 🐸 TTS model 💫\n", + "\n", + "### 👋 Hello and welcome to Coqui (🐸) TTS\n", + "\n", + "The goal of this notebook is to show you a **typical workflow** for **training** and **testing** a TTS model with 🐸.\n", + "\n", + "Let's train a very small model on a very small amount of data so we can iterate quickly.\n", + "\n", + "In this notebook, we will:\n", + "\n", + "1. Download data and format it for 🐸 TTS.\n", + "2. Configure the training and testing runs.\n", + "3. Train a new model.\n", + "4. Test the model and display its performance.\n", + "\n", + "So, let's jump right in!\n", + "\n", + "*PS - If you just want a working, off-the-shelf model, check out the [🐸 Model Zoo](https://www.coqui.ai/models)*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa2aec78", + "metadata": {}, + "outputs": [], + "source": [ + "## Install Coqui TTS\n", + "! pip install -U pip\n", + "! pip install TTS" + ] + }, + { + "cell_type": "markdown", + "id": "be5fe49c", + "metadata": {}, + "source": [ + "## ✅ Data Preparation\n", + "\n", + "### **First things first**: we need some data.\n", + "\n", + "We're training a Text-to-Speech model, so we need some _text_ and we need some _speech_. Specificially, we want _transcribed speech_. The speech must be divided into audio clips and each clip needs transcription. \n", + "\n", + "If you have a single audio file and you need to **split** it into clips. It is also important to use a lossless audio file format to prevent compression artifacts. We recommend using **wav** file format.\n", + "\n", + "The data format we will be adopting for this tutorial is taken from widely-used the **LJSpeech** dataset, where **waves** are collected under a folder:\n", + "\n", + "\n", + "/wavs
\n", + "  | - audio1.wav
\n", + "  | - udio2.wav
\n", + "  | - audio3.wav
\n", + " ...
\n", + "
\n", + "\n", + "and a **metdata.txt** file will have the audioname in parallel to the transcript, delimeted by `|`: \n", + " \n", + "\n", + "# metadata.txt
\n", + "audio1|This is my sentence.
\n", + "audio2|This is maybe my sentence.
\n", + "audio3|This is certainly my sentence.
\n", + "audio4|Let this be your sentence.
\n", + "...\n", + "
\n", + "\n", + "In the end, we should have the following **folder structure**:\n", + "\n", + "\n", + "/MyTTSDataset
\n", + " |
\n", + " | -> metadata.txt
\n", + " | -> /wavs
\n", + "  | -> audio1.wav
\n", + "  | -> audio2.wav
\n", + "  | ...
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "69501a10-3b53-4e75-ae66-90221d6f2271", + "metadata": {}, + "source": [ + "🐸TTS already provides tooling for the _LJSpeech_. if you use the same format, you can start training your models right away.
\n", + "\n", + "After you collect and format your dataset, you need to check two things. Whether you need a **_formatter_** and a **_text_cleaner_**.
The **_formatter_** loads the text file (created above) as a list and the **_text_cleaner_** performs a sequence of text normalization operations that converts the raw text into the spoken representation (e.g. converting numbers to text, acronyms, and symbols to the spoken format).\n", + "\n", + "If you use a different dataset format then the LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own **_formatter_** and **_text_cleaner_**." + ] + }, + { + "cell_type": "markdown", + "id": "e7f226c8-4e55-48fa-937b-8415d539b17c", + "metadata": {}, + "source": [ + "## ⏳️ Loading your dataset\n", + "Load one of the dataset supported by 🐸TTS.\n", + "\n", + "For this tutorial we will be using LJSpeech dataset.\n", + "We will start by defining dataset config and setting LJSpeech as our target dataset and define its path.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3cb0191-b8fc-4158-bd26-8423c2a8ba66", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# BaseDatasetConfig: defines name, formatter and path of the dataset.\n", + "from TTS.tts.configs.shared_configs import BaseDatasetConfig\n", + "\n", + "output_path = \"tts_train_dir\"\n", + "if not os.path.exists(output_path):\n", + " os.makedirs(output_path)\n", + "\n", + "dataset_config = BaseDatasetConfig(\n", + " name=\"ljspeech\", meta_file_train=\"metadata.csv\", path=os.path.join(output_path, \"LJSpeech-1.1/\")\n", + ")\n", + "# You need to download LJSpeech inside output_path\n" + ] + }, + { + "cell_type": "markdown", + "id": "ae82fd75", + "metadata": {}, + "source": [ + "## ✅ Train a new model\n", + "\n", + "Let's kick off a training run 🚀🚀🚀.\n", + "\n", + "Deciding on the model architecture you'd want to use is based on your needs and available resources. Each model architecture has it's pros and cons that define the run-time efficiency and the voice quality.\n", + "We have many recipes under `TTS/recipes/` that provide a good starting point. For this tutorial, we will be using `GlowTTS`." + ] + }, + { + "cell_type": "markdown", + "id": "f5876e46-2aee-4bcf-b6b3-9e3c535c553f", + "metadata": {}, + "source": [ + "We will begin by initializing the model training configuration." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5483ca28-39d6-49f8-a18e-4fb53c50ad84", + "metadata": {}, + "outputs": [], + "source": [ + "# GlowTTSConfig: all model related values for training, validating and testing.\n", + "from TTS.tts.configs.glow_tts_config import GlowTTSConfig\n", + "config = GlowTTSConfig(\n", + " batch_size=32,\n", + " eval_batch_size=16,\n", + " num_loader_workers=4,\n", + " num_eval_loader_workers=4,\n", + " run_eval=True,\n", + " test_delay_epochs=-1,\n", + " epochs=100,\n", + " text_cleaner=\"phoneme_cleaners\",\n", + " use_phonemes=True,\n", + " phoneme_language=\"en-us\",\n", + " phoneme_cache_path=os.path.join(output_path, \"phoneme_cache\"),\n", + " print_step=25,\n", + " print_eval=False,\n", + " mixed_precision=True,\n", + " output_path=output_path,\n", + " datasets=[dataset_config],\n", + " save_step=1000,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "b93ed377-80b7-447b-bd92-106bffa777ee", + "metadata": {}, + "source": [ + "Next we will initialize the audio processor which is used for feature extraction and audio I/O." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b1b12f61-f851-4565-84dd-7640947e04ab", + "metadata": {}, + "outputs": [], + "source": [ + "from TTS.utils.audio import AudioProcessor\n", + "ap = AudioProcessor.init_from_config(config)" + ] + }, + { + "cell_type": "markdown", + "id": "1d461683-b05e-403f-815f-8007bda08c38", + "metadata": {}, + "source": [ + "Next we will initialize the tokenizer which is used to convert text to sequences of token IDs. If characters are not defined in the config, default characters are passed to the config." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "014879b7-f18d-44c0-b24a-e10f8002113a", + "metadata": {}, + "outputs": [], + "source": [ + "from TTS.tts.utils.text.tokenizer import TTSTokenizer\n", + "tokenizer, config = TTSTokenizer.init_from_config(config)" + ] + }, + { + "cell_type": "markdown", + "id": "df3016e1-9e99-4c4f-94e3-fa89231fd978", + "metadata": {}, + "source": [ + "Next we will load data samples. Each sample is a list of ```[text, audio_file_path, speaker_name]```. You can define your custom sample loader returning the list of samples." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cadd6ada-c8eb-4f79-b8fe-6d72850af5a7", + "metadata": {}, + "outputs": [], + "source": [ + "from TTS.tts.datasets import load_tts_samples\n", + "train_samples, eval_samples = load_tts_samples(\n", + " dataset_config,\n", + " eval_split=True,\n", + " eval_split_max_size=config.eval_split_max_size,\n", + " eval_split_size=config.eval_split_size,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "db8b451e-1fe1-4aa3-b69e-ab22b925bd19", + "metadata": {}, + "source": [ + "Now we're ready to initialize the model.\n", + "\n", + "Models take a config object and a speaker manager as input. Config defines the details of the model like the number of layers, the size of the embedding, etc. Speaker manager is used by multi-speaker models." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ac2ffe3e-ad0c-443e-800c-9b076ee811b4", + "metadata": {}, + "outputs": [], + "source": [ + "from TTS.tts.models.glow_tts import GlowTTS\n", + "model = GlowTTS(config, ap, tokenizer, speaker_manager=None)" + ] + }, + { + "cell_type": "markdown", + "id": "e2832c56-889d-49a6-95b6-eb231892ecc6", + "metadata": {}, + "source": [ + "Trainer provides a generic API to train all the 🐸TTS models with all its perks like mixed-precision training, distributed training, etc." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0f609945-4fe0-4d0d-b95e-11d7bfb63ebe", + "metadata": {}, + "outputs": [], + "source": [ + "from trainer import Trainer, TrainerArgs\n", + "trainer = Trainer(\n", + " TrainerArgs(), config, output_path, model=model, train_samples=train_samples, eval_samples=eval_samples\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "5b320831-dd83-429b-bb6a-473f9d49d321", + "metadata": {}, + "source": [ + "### AND... 3,2,1... START TRAINING 🚀🚀🚀" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d4c07f99-3d1d-4bea-801e-9f33bbff0e9f", + "metadata": {}, + "outputs": [], + "source": [ + "trainer.fit()" + ] + }, + { + "cell_type": "markdown", + "id": "4cff0c40-2734-40a6-a905-e945a9fb3e98", + "metadata": {}, + "source": [ + "#### 🚀 Run the Tensorboard. 🚀\n", + "On the notebook and Tensorboard, you can monitor the progress of your model. Also Tensorboard provides certain figures and sample outputs." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5a85cd3b-1646-40ad-a6c2-49323e08eeec", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install tensorboard\n", + "!tensorboard --logdir=tts_train_dir" + ] + }, + { + "cell_type": "markdown", + "id": "9f6dc959", + "metadata": {}, + "source": [ + "## ✅ Test the model\n", + "\n", + "We made it! 🙌\n", + "\n", + "Let's kick off the testing run, which displays performance metrics.\n", + "\n", + "We're committing the cardinal sin of ML 😈 (aka - testing on our training data) so you don't want to deploy this model into production. In this notebook we're focusing on the workflow itself, so it's forgivable 😇\n", + "\n", + "You can see from the test output that our tiny model has overfit to the data, and basically memorized this one sentence.\n", + "\n", + "When you start training your own models, make sure your testing data doesn't include your training data 😅" + ] + }, + { + "cell_type": "markdown", + "id": "99fada7a-592f-4a09-9369-e6f3d82de3a0", + "metadata": {}, + "source": [ + "Let's get the latest saved checkpoint. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6dd47ed5-da8e-4bf9-b524-d686630d6961", + "metadata": {}, + "outputs": [], + "source": [ + "import glob, os\n", + "output_path = \"tts_train_dir\"\n", + "ckpts = sorted([f for f in glob.glob(output_path+\"/*/*.pth\")])\n", + "configs = sorted([f for f in glob.glob(output_path+\"/*/*.json\")])\n", + "os.environ['test_ckpt'] = ckpts[-1]\n", + "os.environ['test_config'] = configs[-1]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dd42bc7a", + "metadata": {}, + "outputs": [], + "source": [ + " !tts --text \"Text for TTS\" \\\n", + " --model_path $test_ckpt \\\n", + " --config_path $test_config \\\n", + " --out_path out.wav" + ] + }, + { + "cell_type": "markdown", + "id": "81cbcb3f-d952-469b-a0d8-8941cd7af670", + "metadata": {}, + "source": [ + "## 📣 Listen to the synthesized wave 📣" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0000bd6-6763-4a10-a74d-911dd08ebcff", + "metadata": {}, + "outputs": [], + "source": [ + "import IPython\n", + "IPython.display.Audio(\"out.wav\")" + ] + }, + { + "cell_type": "markdown", + "id": "13914401-cad1-494a-b701-474e52829138", + "metadata": {}, + "source": [ + "## 🎉 Congratulations! 🎉 You now have trained your first TTS model! \n", + "Follow up with the next tutorials to learn more adnavced material." 
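If you prefer to test the new checkpoint from Python instead of the `tts` CLI used above, a rough equivalent is sketched below. It picks up the newest checkpoint and config under `tts_train_dir`, the same way the cell above does; the `Synthesizer` arguments shown here are an assumption that may need adjusting to the 🐸TTS version you installed.

```python
# Sketch: synthesize with the freshly trained checkpoint from Python.
# Assumes the training run above left at least one *.pth and one *.json under tts_train_dir.
import glob

from TTS.utils.synthesizer import Synthesizer

ckpt = sorted(glob.glob("tts_train_dir/*/*.pth"))[-1]
config = sorted(glob.glob("tts_train_dir/*/*.json"))[-1]

synthesizer = Synthesizer(tts_checkpoint=ckpt, tts_config_path=config)
wav = synthesizer.tts("Text for TTS")          # returns the raw waveform samples
synthesizer.save_wav(wav, "out-python.wav")    # write them to a wav file
```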
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "950d9fc6-896f-4a2c-86fd-8fd1fcbbb3f7", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 878a95d7936ab7c2a780e12addbde792cd29f450 Mon Sep 17 00:00:00 2001 From: Aya Jafari Date: Mon, 30 May 2022 16:14:04 +0300 Subject: [PATCH 4/6] fixed dummy paths --- notebooks/Tutorial_1_use-pretrained-TTS.ipynb | 111 +++++++----------- 1 file changed, 43 insertions(+), 68 deletions(-) diff --git a/notebooks/Tutorial_1_use-pretrained-TTS.ipynb b/notebooks/Tutorial_1_use-pretrained-TTS.ipynb index 238f272784..87d04c499d 100644 --- a/notebooks/Tutorial_1_use-pretrained-TTS.ipynb +++ b/notebooks/Tutorial_1_use-pretrained-TTS.ipynb @@ -9,20 +9,20 @@ "source": [ "# Easy Inferencing with 🐸 TTS ⚡\n", "\n", - "### You want to quicly synthesize speech using Coqui 🐸 TTS model?\n", + "#### You want to quicly synthesize speech using Coqui 🐸 TTS model?\n", "\n", - "### 💡: Grab a pre-trained model and use it to synthesize speech using any speaker voice, including yours! ⚡\n", + "💡: Grab a pre-trained model and use it to synthesize speech using any speaker voice, including yours! ⚡\n", "\n", - "#### 🐸 TTS comes with a list of pretrained models and speaker voices. You can even start a local demo server that you can open it on your favorite web browser and 🗣️ .\n", + "🐸 TTS comes with a list of pretrained models and speaker voices. You can even start a local demo server that you can open it on your favorite web browser and 🗣️ .\n", "\n", - "#### In this notebook, we will: \n", + "In this notebook, we will: \n", "```\n", "1. List available pre-trained 🐸 TTS models\n", "2. Run a 🐸 TTS model\n", "3. Listen to the synthesized wave 📣\n", "4. Run multispeaker 🐸 TTS model \n", "```\n", - "#### So, let's jump right in!\n" + "So, let's jump right in!\n" ] }, { @@ -51,11 +51,11 @@ "source": [ "## ✅ List available pre-trained 🐸 TTS models\n", "\n", - "#### Coqui 🐸TTS comes with a list of pretrained models for different model types (ex: TTS, vocoder), languages, datasets used for training and architectures. \n", + "Coqui 🐸TTS comes with a list of pretrained models for different model types (ex: TTS, vocoder), languages, datasets used for training and architectures. \n", "\n", - "#### You can either use your own model or the release models under 🐸TTS.\n", + "You can either use your own model or the release models under 🐸TTS.\n", "\n", - "#### Use `tts --list_models` to find out the availble models.\n", + "Use `tts --list_models` to find out the availble models.\n", "\n" ] }, @@ -76,9 +76,9 @@ "source": [ "## ✅ Run a 🐸 TTS model\n", "\n", - "### **First things first**: Using a release model and default vocoder:\n", + "#### **First things first**: Using a release model and default vocoder:\n", "\n", - "#### You can simply copy the full model name from the list above and use it \n" + "You can simply copy the full model name from the list above and use it \n" ] }, { @@ -119,9 +119,9 @@ "source": [ "### **Second things second**:\n", "\n", - "#### 🔶 A TTS model can be either trained on a single speaker voice or multispeaker voices. 
This training choice is directly reflected on the inference ability and the available speaker voices that can be used to synthesize speech. \n", + "🔶 A TTS model can be either trained on a single speaker voice or multispeaker voices. This training choice is directly reflected on the inference ability and the available speaker voices that can be used to synthesize speech. \n", "\n", - "#### 🔶 If you want to run a multispeaker model from the released models list, you can first check the speaker ids using `--list_speaker_idx` flag and use this speaker voice to synthesize speech." + "🔶 If you want to run a multispeaker model from the released models list, you can first check the speaker ids using `--list_speaker_idx` flag and use this speaker voice to synthesize speech." ] }, { @@ -166,29 +166,10 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "id": "ed485b0a-dfd5-4a7e-a571-ebf74bdfc41d", "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " " - ], - "text/plain": [ - "" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "import IPython\n", "IPython.display.Audio(\"spkr-out.wav\")" @@ -199,7 +180,27 @@ "id": "84636a38-097e-4dad-933b-0aeaee650e92", "metadata": {}, "source": [ - "#### 🔶 If you want to use an external speaker to synthesize speech, you need to supply `--speaker_wave` flag along with an external speaker encoder path and config file, as follows:" + "🔶 If you want to use an external speaker to synthesize speech, you need to supply `--speaker_wav` flag along with an external speaker encoder path and config file, as follows:" + ] + }, + { + "cell_type": "markdown", + "id": "cbdb15fa-123a-4282-a127-87b50dc70365", + "metadata": {}, + "source": [ + "First we need to get the speaker encoder model, its config and a referece `speaker_wav`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e54f1b13-560c-4fed-bafd-e38ec9712359", + "metadata": {}, + "outputs": [], + "source": [ + "!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json\n", + "!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar\n", + "!wget https://github.com/coqui-ai/TTS/raw/speaker_encoder_model/tests/data/ljspeech/wavs/LJ001-0001.wav" ] }, { @@ -210,11 +211,12 @@ "outputs": [], "source": [ "!tts --model_name tts_models/multilingual/multi-dataset/your_tts \\\n", - "--encoder_path \"path/to/speaker/encoder/model_se.pth.tar\" \\\n", - "--encoder_config \"path/to/speaker/encoder/config_se.json\" \\\n", - "--speaker_wav \"path/to/speaker/wave/file.wav\" \\\n", + "--encoder_path model_se.pth.tar \\\n", + "--encoder_config config_se.json \\\n", + "--speaker_wav LJ001-0001.wav \\\n", "--text \"Are we not allowed to dim the lights so people can see that a bit better?\"\\\n", - "--out_path spkr-out.wav" + "--out_path spkr-out.wav \\\n", + "--language_idx \"en\"" ] }, { @@ -227,29 +229,10 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": null, "id": "cc889adc-9c71-4232-8e85-bfc8f76476f4", "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " " - ], - "text/plain": [ - "" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "import IPython\n", "IPython.display.Audio(\"spkr-out.wav\")" @@ -261,16 +244,8 @@ "metadata": {}, "source": [ "## 🎉 Congratulations! 
🎉 You now know how to use a TTS model to synthesize speech! \n", - "### Follow up with the next tutorials to learn more adnavced material." + "Follow up with the next tutorials to learn more adnavced material." ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3723d62f-85e5-4f7f-8a61-4a3409477f34", - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { From f729fc0a4c7f16801242e75abd22617603e60dea Mon Sep 17 00:00:00 2001 From: Aya Jafari Date: Mon, 30 May 2022 16:14:35 +0300 Subject: [PATCH 5/6] fixed review comments --- ...utorial_2_train_your_first_TTS_model.ipynb | 46 +++++++++++++------ 1 file changed, 31 insertions(+), 15 deletions(-) diff --git a/notebooks/Tutorial_2_train_your_first_TTS_model.ipynb b/notebooks/Tutorial_2_train_your_first_TTS_model.ipynb index 860c61508e..9e4f8ad04c 100644 --- a/notebooks/Tutorial_2_train_your_first_TTS_model.ipynb +++ b/notebooks/Tutorial_2_train_your_first_TTS_model.ipynb @@ -20,9 +20,7 @@ "3. Train a new model.\n", "4. Test the model and display its performance.\n", "\n", - "So, let's jump right in!\n", - "\n", - "*PS - If you just want a working, off-the-shelf model, check out the [🐸 Model Zoo](https://www.coqui.ai/models)*" + "So, let's jump right in!\n" ] }, { @@ -46,21 +44,21 @@ "\n", "### **First things first**: we need some data.\n", "\n", - "We're training a Text-to-Speech model, so we need some _text_ and we need some _speech_. Specificially, we want _transcribed speech_. The speech must be divided into audio clips and each clip needs transcription. \n", + "We're training a Text-to-Speech model, so we need some _text_ and we need some _speech_. Specificially, we want _transcribed speech_. The speech must be divided into audio clips and each clip needs transcription. More details about data requirements such as recording characteristics, background noise abd vocabulary coverage can be found in the [🐸TTS documentation](https://tts.readthedocs.io/en/latest/formatting_your_dataset.html).\n", "\n", "If you have a single audio file and you need to **split** it into clips. It is also important to use a lossless audio file format to prevent compression artifacts. We recommend using **wav** file format.\n", "\n", - "The data format we will be adopting for this tutorial is taken from widely-used the **LJSpeech** dataset, where **waves** are collected under a folder:\n", + "The data format we will be adopting for this tutorial is taken from the widely-used **LJSpeech** dataset, where **waves** are collected under a folder:\n", "\n", "\n", "/wavs
\n", "  | - audio1.wav
\n", - "  | - udio2.wav
\n", + "  | - audio2.wav
\n", "  | - audio3.wav
\n", " ...
\n", "
\n", "\n", - "and a **metdata.txt** file will have the audioname in parallel to the transcript, delimeted by `|`: \n", + "and a **metadata.csv** file will have the audio file name in parallel to the transcript, delimited by `|`: \n", " \n", "\n", "# metadata.txt
\n", @@ -104,13 +102,12 @@ "## ⏳️ Loading your dataset\n", "Load one of the dataset supported by 🐸TTS.\n", "\n", - "For this tutorial we will be using LJSpeech dataset.\n", "We will start by defining dataset config and setting LJSpeech as our target dataset and define its path.\n" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 33, "id": "b3cb0191-b8fc-4158-bd26-8423c2a8ba66", "metadata": {}, "outputs": [], @@ -123,11 +120,32 @@ "output_path = \"tts_train_dir\"\n", "if not os.path.exists(output_path):\n", " os.makedirs(output_path)\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ae6b7019-3685-4b48-8917-c152e288d7e3", + "metadata": {}, + "outputs": [], + "source": [ + "# Download and extract LJSpeech dataset.\n", "\n", + "!wget -O $output_path/LJSpeech-1.1.tar.bz2 https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 \n", + "!tar -xf $output_path/LJSpeech-1.1.tar.bz2 -C $output_path" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "76cd3ab5-6387-45f1-b488-24734cc1beb5", + "metadata": {}, + "outputs": [], + "source": [ "dataset_config = BaseDatasetConfig(\n", " name=\"ljspeech\", meta_file_train=\"metadata.csv\", path=os.path.join(output_path, \"LJSpeech-1.1/\")\n", - ")\n", - "# You need to download LJSpeech inside output_path\n" + ")" ] }, { @@ -359,9 +377,7 @@ "import glob, os\n", "output_path = \"tts_train_dir\"\n", "ckpts = sorted([f for f in glob.glob(output_path+\"/*/*.pth\")])\n", - "configs = sorted([f for f in glob.glob(output_path+\"/*/*.json\")])\n", - "os.environ['test_ckpt'] = ckpts[-1]\n", - "os.environ['test_config'] = configs[-1]" + "configs = sorted([f for f in glob.glob(output_path+\"/*/*.json\")])" ] }, { @@ -402,7 +418,7 @@ "metadata": {}, "source": [ "## 🎉 Congratulations! 🎉 You now have trained your first TTS model! \n", - "Follow up with the next tutorials to learn more adnavced material." + "Follow up with the next tutorials to learn more advanced material." ] }, { From cfec154ce3fbb18d135fc5a5e61646e02f2214c8 Mon Sep 17 00:00:00 2001 From: Aya Jafari Date: Tue, 31 May 2022 12:51:59 +0300 Subject: [PATCH 6/6] fixed metadata extension --- notebooks/Tutorial_2_train_your_first_TTS_model.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/Tutorial_2_train_your_first_TTS_model.ipynb b/notebooks/Tutorial_2_train_your_first_TTS_model.ipynb index 9e4f8ad04c..7f324bec55 100644 --- a/notebooks/Tutorial_2_train_your_first_TTS_model.ipynb +++ b/notebooks/Tutorial_2_train_your_first_TTS_model.ipynb @@ -61,7 +61,7 @@ "and a **metadata.csv** file will have the audio file name in parallel to the transcript, delimited by `|`: \n", " \n", "\n", - "# metadata.txt
\n", + "# metadata.csv
\n", "audio1|This is my sentence.
\n", "audio2|This is maybe my sentence.
\n", "audio3|This is certainly my sentence.
\n",