[Feature request] Text for synthesis needs to be normalized for languages with diacritics #63

JanX2 · 2024-07-24T06:26:27Z

Text for synthesis needs to be normalized for languages with diacritics, or synthesis will be incorrect under certain circumstances.

For languages with diacritics, such as German with its umlauts (äöü), Unicode often offers at least two ways to represent a character: precomposed (a single code point: ä) and decomposed (a base code point followed by a combining mark: a + ¨). Some text sources, such as a string read from a text file and piped into the tts command via xargs, may supply the decomposed form without converting it to the precomposed one. This is a problem because the models I tested (e.g. "thorsten/tacotron2-DDC") only synthesize an umlaut from the precomposed form. Otherwise they ignore the combining diacritic and synthesize just the base letter.
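To make the difference concrete, here is a small sketch using only the Python standard library. It shows that the two forms of ä compare as different strings and that NFC normalization maps the decomposed form onto the precomposed one. The variable names are illustrative only.

```python
import unicodedata

# Precomposed: LATIN SMALL LETTER A WITH DIAERESIS as one code point
precomposed = "\u00e4"
# Decomposed: LATIN SMALL LETTER A followed by COMBINING DIAERESIS
decomposed = "a\u0308"

# Both render as "ä", but they are different strings
same_before = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# NFC composes base letter + combining mark into the single code point
normalized = unicodedata.normalize("NFC", decomposed)
same_after = normalized == precomposed

print(same_before, lengths, same_after)
```

A model whose vocabulary only contains the precomposed code point would see the decomposed input as a plain "a" plus a symbol it does not know.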

I’m not a Python dev. A hacky way of fixing this would be to modify "synthesize.py":

import unicodedata
…
args = parser.parse_args()
# Compose decomposed sequences (a + ¨) into precomposed code points (ä)
args.text = unicodedata.normalize('NFC', args.text)

Alternatively we could find some other way to make sure that the models are always supplied tokens that they can synthesize.

The text conversion could be optional via a command line argument.
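A minimal sketch of what such an opt-out could look like with argparse. The flag name `--no_normalize` and the helper `parse_text_args` are hypothetical and not part of the actual tts command line; only `--text` is taken from the snippet above.

```python
import argparse
import unicodedata


def parse_text_args(argv=None):
    """Parse arguments and NFC-normalize the text unless opted out."""
    parser = argparse.ArgumentParser(description="Normalization flag sketch")
    parser.add_argument("--text", required=True, help="Text to synthesize")
    # Hypothetical flag for illustration; not an existing tts option
    parser.add_argument(
        "--no_normalize",
        action="store_true",
        help="Skip Unicode NFC normalization of the input text",
    )
    args = parser.parse_args(argv)
    if not args.no_normalize:
        args.text = unicodedata.normalize("NFC", args.text)
    return args
```

Normalizing by default and offering an opt-out keeps the common case correct while still letting users pass text through unchanged.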

eginhard · 2024-07-24T07:45:45Z

Thank you for the suggestion. Yes, unicodedata.normalize() would be the correct approach for this and it should be added to the different cleaners in https://github.com/idiap/coqui-ai-TTS/blob/dev/TTS/tts/utils/text/cleaners.py, which handle other text preprocessing as well. A PR for this would be welcome!
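As a rough illustration of that idea, a cleaner could apply normalization as its first step, before any other preprocessing. The function names `normalize_unicode` and `example_cleaner` are assumptions made for this sketch and may not match what cleaners.py actually contains.

```python
import re
import unicodedata

_whitespace_re = re.compile(r"\s+")


def normalize_unicode(text: str) -> str:
    """Compose base letters and combining marks into precomposed form (NFC)."""
    return unicodedata.normalize("NFC", text)


def example_cleaner(text: str) -> str:
    """Hypothetical cleaner: normalize, lowercase, collapse whitespace."""
    text = normalize_unicode(text)
    text = text.lower()
    text = _whitespace_re.sub(" ", text).strip()
    return text
```

Running normalization first means every later step, and the tokenizer after it, sees a single consistent representation of each character.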

neurlang · 2024-09-30T16:27:50Z

pygoruut can normalize for languages which need it.

eginhard added the "enhancement" and "good first issue" labels · Jul 24, 2024

shavit mentioned this issue Sep 28, 2024

Add normalizer type C to text cleaners #85

Merged

eginhard closed this as completed in #85 Oct 2, 2024
