Fix typos #3368

Merged (6 commits) on Dec 6, 2023
2 changes: 1 addition & 1 deletion docs/source/configuration.md
@@ -56,4 +56,4 @@ ModelConfig()

In the example above, ```ModelConfig()``` is the final configuration that the model receives and it has all the fields necessary for the model.

-We host pre-defined model configurations under ```TTS/<model_class>/configs/```.Although we recommend a unified config class, you can decompose it as you like as for your custom models as long as all the fields for the trainer, model, and inference APIs are provided.
+We host pre-defined model configurations under ```TTS/<model_class>/configs/```. Although we recommend a unified config class, you can decompose it as you like as for your custom models as long as all the fields for the trainer, model, and inference APIs are provided.
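
As a concrete example of such a unified configuration, one of the pre-defined config classes can be imported, adjusted and serialized directly; a minimal sketch, using `GlowTTSConfig` as an assumed example:

```python
# A minimal sketch, assuming GlowTTSConfig as one of the pre-defined configs.
from TTS.tts.configs.glow_tts_config import GlowTTSConfig

config = GlowTTSConfig(
    batch_size=32,             # trainer field
    num_loader_workers=4,      # trainer field
    run_eval=True,             # trainer field
    phoneme_language="en-us",  # model / inference field
)

# Configs are serializable, so they can be stored next to checkpoints.
print(config.to_json())
```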
4 changes: 2 additions & 2 deletions docs/source/finetuning.md
@@ -21,7 +21,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
Fine-tuning comes to the rescue in this case. You can take one of our pre-trained models and fine-tune it on your own
speech dataset and achieve reasonable results with only a couple of hours of data.

-However, note that, fine-tuning does not ensure great results. The model performance is still depends on the
+However, note that, fine-tuning does not ensure great results. The model performance still depends on the
{ref}`dataset quality <what_makes_a_good_dataset>` and the hyper-parameters you choose for fine-tuning. Therefore,
it still takes a bit of tinkering.

@@ -41,7 +41,7 @@ them and fine-tune it for your own dataset. This will help you in two main ways:
tts --list_models
```

-The command above lists the the models in a naming format as ```<model_type>/<language>/<dataset>/<model_name>```.
+The command above lists the models in a naming format as ```<model_type>/<language>/<dataset>/<model_name>```.

Or you can manually check the `.model.json` file in the project directory.
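
A name from that listing can then be used directly with the Python API to download and run the corresponding model; a minimal sketch, assuming `tts_models/en/ljspeech/tacotron2-DDC` is one of the listed entries:

```python
# A minimal sketch; the model name below is an example entry following the
# <model_type>/<language>/<dataset>/<model_name> scheme.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="This voice will be fine-tuned on my dataset.", file_path="sample.wav")
```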

6 changes: 3 additions & 3 deletions docs/source/formatting_your_dataset.md
@@ -7,7 +7,7 @@ If you have a single audio file and you need to split it into clips, there are d

It is also important to use a lossless audio file format to prevent compression artifacts. We recommend using `wav` file format.

-Let's assume you created the audio clips and their transcription. You can collect all your clips under a folder. Let's call this folder `wavs`.
+Let's assume you created the audio clips and their transcription. You can collect all your clips in a folder. Let's call this folder `wavs`.

```
/wavs
@@ -17,7 +17,7 @@ Let's assume you created the audio clips and their transcription. You can collec
...
```

-You can either create separate transcription files for each clip or create a text file that maps each audio clip to its transcription. In this file, each column must be delimitered by a special character separating the audio file name, the transcription and the normalized transcription. And make sure that the delimiter is not used in the transcription text.
+You can either create separate transcription files for each clip or create a text file that maps each audio clip to its transcription. In this file, each column must be delimited by a special character separating the audio file name, the transcription and the normalized transcription. And make sure that the delimiter is not used in the transcription text.

We recommend the following format delimited by `|`. In the following example, `audio1`, `audio2` refer to files `audio1.wav`, `audio2.wav` etc.

@@ -55,7 +55,7 @@ For more info about dataset qualities and properties check our [post](https://gi

After you collect and format your dataset, you need to check two things. Whether you need a `formatter` and a `text_cleaner`. The `formatter` loads the text file (created above) as a list and the `text_cleaner` performs a sequence of text normalization operations that converts the raw text into the spoken representation (e.g. converting numbers to text, acronyms, and symbols to the spoken format).

-If you use a different dataset format then the LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own `formatter`.
+If you use a different dataset format than the LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own `formatter`.

If your dataset is in a new language or it needs special normalization steps, then you need a new `text_cleaner`.
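
To make the `formatter` idea concrete, here is a rough sketch of a custom formatter for the pipe-delimited layout described above; the returned field names (`text`, `audio_file`, `speaker_name`) are assumptions to be checked against the built-in formatters:

```python
import os

def my_formatter(root_path, meta_file, **kwargs):
    """Rough sketch: parse a pipe-delimited metadata file into sample dicts."""
    items = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
        for line in f:
            clip_id, _text, normalized_text = line.strip().split("|")
            items.append(
                {
                    "text": normalized_text,  # what the model is trained on
                    "audio_file": os.path.join(root_path, "wavs", clip_id + ".wav"),
                    "speaker_name": "my_speaker",  # single-speaker placeholder
                }
            )
    return items
```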

4 changes: 2 additions & 2 deletions docs/source/implementing_a_new_language_frontend.md
@@ -2,11 +2,11 @@

- Language frontends are located under `TTS.tts.utils.text`
- Each special language has a separate folder.
-- Each folder containst all the utilities for processing the text input.
+- Each folder contains all the utilities for processing the text input.
- `TTS.tts.utils.text.phonemizers` contains the main phonemizer for a language. This is the class that uses the utilities
from the previous step and used to convert the text to phonemes or graphemes for the model.
- After you implement your phonemizer, you need to add it to the `TTS/tts/utils/text/phonemizers/__init__.py` to be able to
map the language code in the model config - `config.phoneme_language` - to the phonemizer class and initiate the phonemizer automatically.
- You should also add tests to `tests/text_tests` if you want to make a PR.

-We suggest you to check the available implementations as reference. Good luck!
+We suggest you to check the available implementations as reference. Good luck!
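
As a rough illustration of what a new frontend boils down to, the toy grapheme-to-phoneme class below maps characters to phoneme symbols; the real implementations derive from the shared base class under `TTS.tts.utils.text.phonemizers`, so the interface shown here is only an assumed outline:

```python
# Toy grapheme-to-phoneme frontend; the vowel mapping is purely illustrative.
_G2P = {"a": "ɑ", "e": "ɛ", "i": "i", "o": "o", "u": "u"}

class MyLanguagePhonemizer:
    language = "xx"  # assumed code to be mapped from config.phoneme_language

    def phonemize(self, text: str, separator: str = "|") -> str:
        phonemes = [_G2P.get(char, char) for char in text.lower()]
        return separator.join(phonemes)

if __name__ == "__main__":
    print(MyLanguagePhonemizer().phonemize("aeiou"))  # ɑ|ɛ|i|o|u
```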
4 changes: 2 additions & 2 deletions docs/source/implementing_a_new_model.md
@@ -145,7 +145,7 @@ class MyModel(BaseTTS):
Args:
ap (AudioProcessor): audio processor used at training.
batch (Dict): Model inputs used at the previous training step.
-outputs (Dict): Model outputs generated at the previoud training step.
+outputs (Dict): Model outputs generated at the previous training step.

Returns:
Tuple[Dict, np.ndarray]: training plots and output waveform.
@@ -183,7 +183,7 @@ class MyModel(BaseTTS):
...

def get_optimizer(self) -> Union["Optimizer", List["Optimizer"]]:
"""Setup an return optimizer or optimizers."""
"""Setup a return optimizer or optimizers."""
pass

def get_lr(self) -> Union[float, List[float]]:
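
To illustrate the `get_optimizer` hook above, here is a self-contained sketch of the pattern (build and return the optimizer from the model's own parameters), using a plain `torch.nn.Module` stand-in rather than a real 🐸TTS model:

```python
import torch

class TinyModel(torch.nn.Module):
    """Stand-in model showing the get_optimizer() pattern."""

    def __init__(self, lr: float = 1e-3):
        super().__init__()
        self.lr = lr  # a real model would read this from its config
        self.layer = torch.nn.Linear(80, 80)

    def get_optimizer(self) -> torch.optim.Optimizer:
        # Return a single optimizer; GAN-style models may return a list instead.
        return torch.optim.Adam(self.parameters(), lr=self.lr)

optimizer = TinyModel().get_optimizer()
```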
6 changes: 3 additions & 3 deletions docs/source/marytts.md
@@ -2,13 +2,13 @@

## What is Mary-TTS?

-[Mary (Modular Architecture for Research in sYynthesis) Text-to-Speech](http://mary.dfki.de/) is an open-source (GNU LGPL license), multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of [DFKI’s](http://www.dfki.de/web) Language Technology Lab and the [Institute of Phonetics](http://www.coli.uni-saarland.de/groups/WB/Phonetics/) at Saarland University, Germany. It is now maintained by the Multimodal Speech Processing Group in the [Cluster of Excellence MMCI](https://www.mmci.uni-saarland.de/) and DFKI.
+[Mary (Modular Architecture for Research in sYnthesis) Text-to-Speech](http://mary.dfki.de/) is an open-source (GNU LGPL license), multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of [DFKI’s](http://www.dfki.de/web) Language Technology Lab and the [Institute of Phonetics](http://www.coli.uni-saarland.de/groups/WB/Phonetics/) at Saarland University, Germany. It is now maintained by the Multimodal Speech Processing Group in the [Cluster of Excellence MMCI](https://www.mmci.uni-saarland.de/) and DFKI.
MaryTTS has been around for a very! long time. Version 3.0 even dates back to 2006, long before Deep Learning was a broadly known term and the last official release was version 5.2 in 2016.
You can check out this OpenVoice-Tech page to learn more: https://openvoice-tech.net/index.php/MaryTTS

## Why Mary-TTS compatibility is relevant

-Due to it's open-source nature, relatively high quality voices and fast synthetization speed Mary-TTS was a popular choice in the past and many tools implemented API support over the years like screen-readers (NVDA + SpeechHub), smart-home HUBs (openHAB, Home Assistant) or voice assistants (Rhasspy, Mycroft, SEPIA). A compatibility layer for Coqui-TTS will ensure that these tools can use Coqui as a drop-in replacement and get even better voices right away.
+Due to its open-source nature, relatively high quality voices and fast synthetization speed Mary-TTS was a popular choice in the past and many tools implemented API support over the years like screen-readers (NVDA + SpeechHub), smart-home HUBs (openHAB, Home Assistant) or voice assistants (Rhasspy, Mycroft, SEPIA). A compatibility layer for Coqui-TTS will ensure that these tools can use Coqui as a drop-in replacement and get even better voices right away.

## API and code examples

@@ -40,4 +40,4 @@ You can enter the same URLs in your browser and check-out the results there as w
### How it works and limitations

A classic Mary-TTS server would usually show all installed locales and voices via the corresponding endpoints and accept the parameters `LOCALE` and `VOICE` for processing. For Coqui-TTS we usually start the server with one specific locale and model and thus cannot return all available options. Instead we return the active locale and use the model name as "voice". Since we only have one active model and always want to return a WAV-file, we currently ignore all other processing parameters except `INPUT_TEXT`. Since the gender is not defined for models in Coqui-TTS we always return `u` (undefined).
-We think that this is an acceptable compromise, since users are often only interested in one specific voice anyways, but the API might get extended in the future to support multiple languages and voices at the same time.
+We think that this is an acceptable compromise, since users are often only interested in one specific voice anyways, but the API might get extended in the future to support multiple languages and voices at the same time.
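
For a concrete picture of the compatibility layer, a client call might look like the sketch below; the host, port (5002) and `/process` path are assumed defaults and should be checked against the running server:

```python
# Sketch of a Mary-TTS-style request; host, port and path are assumptions.
import requests

response = requests.get(
    "http://localhost:5002/process",
    params={"INPUT_TEXT": "Hello from the Mary-TTS compatibility layer."},
)
response.raise_for_status()

with open("output.wav", "wb") as f:
    f.write(response.content)  # the server always answers with a WAV file
```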