[Feature request] Text for synthesis needs to be normalized for languages with diacritics #63

Closed
JanX2 opened this issue Jul 24, 2024 · 2 comments · Fixed by #85
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments


JanX2 commented Jul 24, 2024

Text for synthesis needs to be normalized for languages with diacritics, or synthesis will be incorrect under certain circumstances.

Characters with diacritics, such as the German umlauts (äöü), can usually be represented in Unicode text in at least two ways: precomposed (a single code point: ä) or decomposed (a base code point followed by a combining mark: a + ¨). Some text sources, for example a string read from a text file and piped into the tts command via xargs, may deliver decomposed text without converting it to the precomposed form. This is a problem because the models I tested (e.g. "thorsten/tacotron2-DDC") only synthesize an umlaut in its precomposed form. Otherwise they simply ignore the combining diacritic and synthesize the base letter.
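To make the two representations concrete, here is a minimal Python sketch using only the standard library. The helper name `to_nfc` is made up for illustration; U+00E4 is the precomposed ä and U+0308 is the combining diaeresis.

```python
import unicodedata

# Two ways to write the same visible character "ä".
precomposed = "\u00e4"   # single code point: LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # base letter "a" followed by COMBINING DIAERESIS


def to_nfc(text: str) -> str:
    """Hypothetical helper: compose base letters and combining marks (NFC)."""
    return unicodedata.normalize("NFC", text)


# The strings render identically but are different code point sequences.
print(precomposed == decomposed)           # False
print(len(precomposed), len(decomposed))   # 1 2

# After NFC normalization the decomposed form matches the precomposed one.
print(to_nfc(decomposed) == precomposed)   # True
```

A model whose character set only contains the precomposed code point would see the decomposed input as a plain "a" plus an unknown symbol, which matches the behavior described above.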

I’m not a Python dev. A hacky way of fixing this would be to modify "synthesize.py":

import unicodedata
…
args = parser.parse_args()
args.text = unicodedata.normalize('NFC', args.text)

Alternatively we could find some other way to make sure that the models are always supplied tokens that they can synthesize.

The text conversion could be optional via a command line argument.
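One way the optional conversion could look is sketched below. This is only an illustration: the `--unicode_normalization` flag and the `build_parser` and `parse_text` helpers are hypothetical and are not taken from the project's actual command-line interface.

```python
import argparse
import unicodedata


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the options needed for this sketch."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--text", type=str, required=True)
    # Assumed flag name: lets the user pick a normalization form or opt out.
    parser.add_argument(
        "--unicode_normalization",
        choices=["NFC", "NFKC", "none"],
        default="NFC",
    )
    return parser


def parse_text(argv: list[str]) -> str:
    """Parse the arguments and return the (optionally normalized) text."""
    args = build_parser().parse_args(argv)
    if args.unicode_normalization != "none":
        args.text = unicodedata.normalize(args.unicode_normalization, args.text)
    return args.text


# Decomposed input is composed by default and left alone with "none".
print(parse_text(["--text", "a\u0308"]) == "\u00e4")
print(parse_text(["--text", "a\u0308", "--unicode_normalization", "none"]) == "a\u0308")
```

Defaulting to NFC keeps the common case working, while the opt-out covers models that might have been trained on decomposed text.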

@eginhard
Member

Thank you for the suggestion. Yes, unicodedata.normalize() would be the correct approach for this and it should be added to the different cleaners in https://github.com/idiap/coqui-ai-TTS/blob/dev/TTS/tts/utils/text/cleaners.py, which handle other text preprocessing as well. A PR for this would be welcome!
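As a rough sketch of what normalization inside a cleaner might look like: the functions below (`normalize_unicode`, `example_cleaner`) are hypothetical stand-ins written for this illustration and are not the actual functions in cleaners.py, whose names and steps may differ.

```python
import re
import unicodedata

# Matches any run of whitespace so it can be collapsed to a single space.
_whitespace_re = re.compile(r"\s+")


def normalize_unicode(text: str) -> str:
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)


def example_cleaner(text: str) -> str:
    """Hypothetical cleaner: normalize first, then apply other preprocessing."""
    text = normalize_unicode(text)
    text = text.lower()
    text = _whitespace_re.sub(" ", text).strip()
    return text


# Decomposed umlauts in the input come out precomposed.
print(example_cleaner("Ma\u0308dchen  ho\u0308ren") == "m\u00e4dchen h\u00f6ren")
```

Running the normalization as the first step means every later step in the cleaner only ever sees the precomposed form.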

@eginhard added the enhancement (New feature or request) and good first issue (Good for newcomers) labels on Jul 24, 2024
@neurlang

pygoruut can normalize for languages which need it.
