Other languages? #71

Closed
friki67 opened this issue Nov 28, 2023 · 30 comments · Fixed by #72 or #121

@friki67

friki67 commented Nov 28, 2023

Hello. Just discovered this.

Is there a way to set language? Maybe changing speaker?

I'd like to read a Spanish epub.

@aedocw
Owner

aedocw commented Nov 28, 2023

I think the easiest approach would be to just change the model (line 267). I default to:
model_name = "tts_models/en/vctk/vits"
because in my testing that was the one that sounded best. I don't speak Spanish so I did not experiment with other languages, as I would not really be able to tell if they sound reasonable.
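
For anyone trying other languages: Coqui model ids follow a type/language/dataset/model pattern, so the language is the second path segment. Here is a minimal Python sketch that pulls an id apart. The `ModelName` tuple and `parse_model_name` helper are hypothetical names used only for illustration; they are not part of epub2tts or Coqui TTS.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    kind: str      # e.g. "tts_models"
    language: str  # e.g. "en", "es", "multilingual"
    dataset: str   # e.g. "vctk", "css10"
    model: str     # e.g. "vits", "tacotron2-DDC"


def parse_model_name(name: str) -> ModelName:
    """Split a Coqui-style model id into its four slash-separated parts."""
    parts = name.split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts separated by '/', got {len(parts)}: {name!r}")
    return ModelName(*parts)


print(parse_model_name("tts_models/en/vctk/vits").language)   # en
print(parse_model_name("tts_models/es/css10/vits").language)  # es
```

Swapping the default for a model whose language segment matches the book is the change being suggested here.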

Let me know if you are up for trying this yourself, or if you'd like me to help out a little.

@friki67
Author

friki67 commented Nov 30, 2023

Thank you.

I'm getting this error using "tts_models/es/css10/vits" or "tts_models/es/mai/tacotron2-DDC", the only two models that tts lists for Spanish.

> Model's license - bsd-3-clause
> Check https://opensource.org/licenses for more info.
> Using model: vits
> Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:0
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:None
| > fft_size:1024
| > power:None
| > preemphasis:0.0
| > griffin_lim_iters:None
| > signal_norm:None
| > symmetric_norm:None
| > mel_fmin:0
| > mel_fmax:None
| > pitch_fmin:None
| > pitch_fmax:None
| > spec_gain:20.0
| > stft_pad_mode:reflect
| > max_norm:1.0
| > clip_norm:True
| > do_trim_silence:False
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:None
| > base:10
| > hop_length:256
| > win_length:1024
> initialization of speaker-embedding layers.
> initialization of language-embedding layers.
Reading 0
Traceback (most recent call last):
File "/home/ubuntu/.local/bin/epub2tts", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/.local/lib/python3.10/site-packages/epub2tts.py", line 373, in main
tts.tts_to_file(text = chapters_to_read[i], speaker = speaker_used, file_path = outputwav)
File "/home/ubuntu/.local/lib/python3.10/site-packages/TTS/api.py", line 391, in tts_to_file
self._check_arguments(speaker=speaker, language=language, speaker_wav=speaker_wav, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/TTS/api.py", line 240, in _check_arguments
raise ValueError("Model is not multi-speaker but `speaker` is provided.")
ValueError: Model is not multi-speaker but `speaker` is provided.

Then I went to the indicated line 347 and removed the speaker parameter, and it worked (more or less; I got some jumps and things like that).
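
Rather than deleting the argument by hand, one option is to pass `speaker` only when the loaded model is multi-speaker. The sketch below is a standalone illustration, not the code in epub2tts. `build_tts_kwargs` is a hypothetical helper, and the `is_multi_speaker` flag is assumed to come from the loaded TTS object (the traceback shows Coqui checking exactly this condition).

```python
def build_tts_kwargs(text, file_path, speaker=None, is_multi_speaker=False):
    """Return keyword arguments for a tts_to_file-style call.

    `speaker` is included only for multi-speaker models, which avoids the
    "Model is not multi-speaker but `speaker` is provided" ValueError.
    """
    kwargs = {"text": text, "file_path": file_path}
    if is_multi_speaker and speaker:
        kwargs["speaker"] = speaker
    return kwargs


# Single-speaker model: the speaker argument is dropped.
print(build_tts_kwargs("Hola", "out.wav", speaker="p335", is_multi_speaker=False))
# Multi-speaker model: the speaker argument is kept.
print(build_tts_kwargs("Hello", "out.wav", speaker="p335", is_multi_speaker=True))
```

The resulting dict could then be unpacked into the call, for example `tts.tts_to_file(**kwargs)`.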

Now, could you help me please to use the XTTS thing? Where to set the language to Spanish (es) and where to get the sample.wav for my language?

As you see I don't really know what I'm doing, but I'm trying to!

EDIT: In line 344 I changed language="en" to language="es". Now the only open question is where to get the sample.wav file, or how to generate it for Spanish.

@aedocw aedocw self-assigned this Nov 30, 2023
@aedocw aedocw added the enhancement New feature or request label Nov 30, 2023
@aedocw aedocw linked a pull request Nov 30, 2023 that will close this issue
@aedocw
Owner

aedocw commented Nov 30, 2023

I just merged an improvement that will make things easier. For instance, to use the VITS Spanish model, you would call it this way:
epub2tts my-book.epub --model tts_models/es/css10/vits

To use XTTS with Spanish, nothing new or different is required. You just need to supply a voice sample WAV file (30 seconds or less) that you would like XTTS to clone. For instance, I tested with a bit of Spanish text and it sounded good:
epub2tts spanish-test.txt --xtts seth.wav

To get a sample for cloning, if you use Chrome this is by far the easiest approach. First install this plugin:
https://chrome.google.com/webstore/detail/voicemod-recorder/fbadphddacjeiiolifedihnmhmannloo

Then find an audiobook sample on Amazon that you really like. Listen to the sample and capture 10-30 seconds of the person speaking. Then convert that to a WAV file ("ffmpeg -i voicemod-file.mp3 sample.wav") and you should be good to go! Let me know if this works out for you, or if you have any other questions!
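
If the capture runs longer than 30 seconds, the conversion step can also trim it. The sketch below only builds the ffmpeg command; it uses ffmpeg's standard `-t` option to cap the output duration. `ffmpeg_wav_command` is a hypothetical helper name, and the 30-second default simply mirrors the limit mentioned above.

```python
import shlex


def ffmpeg_wav_command(source, dest="sample.wav", max_seconds=30):
    """Build an ffmpeg command that converts `source` to WAV and keeps
    at most `max_seconds` seconds of audio (-t limits output duration)."""
    return ["ffmpeg", "-i", source, "-t", str(max_seconds), dest]


cmd = ffmpeg_wav_command("voicemod-file.mp3")
print(shlex.join(cmd))  # ffmpeg -i voicemod-file.mp3 -t 30 sample.wav
# To actually run it: subprocess.run(cmd, check=True)
```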

@aedocw
Owner

aedocw commented Nov 30, 2023

Oh, I forgot to mention regarding XTTS: a GPU is effectively required. It might run on CPU alone, but I don't think it will, and even if it did I suspect it would be so slow as to be unusable.

@friki67
Author

friki67 commented Dec 1, 2023

Thank you very much!!!!

@friki67
Author

friki67 commented Dec 4, 2023

Hello again.
Using XTTS with Spanish voice samples gives me a strange result: it sounds like a German speaker reading a Spanish text. I've tried with different samples and texts.

I've searched the code and changed the occurrences of language="en" to language="es" (lines 146 and 262), but the result remains the same.

What can I do to fix the model to Spanish?

The output of the command:

epub2tts cosa.txt --xtts voz1/voz1-1.wav,voz1/voz1-2.wav,voz1/voz1-3.wav
Namespace(sourcefile='cosa.txt', engine='tts', xtts='voz1/voz1-1.wav,voz1/voz1-2.wav,voz1/voz1-3.wav', openai='zzz', model='tts_models/en/vctk/vits', speaker='p335', scan=False, start=1, end=999, minratio=88, skiplinks=False, bitrate='69k', debug=False)
Corre el año 1834 y Madrid, una pequeña ciudad que trata de abrirse paso más allá de las murallas que la rodean, sufre una terrible epidemia de cólera. Pero la peste no es lo único que aterroriza a sus habitantes: en los arrabales aparecen cadáveres desmem
Saving to cosa-voz.m4b
Total characters: 384
Loading model: /home/ubuntu/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2
Computing speaker latents...
Reading from 0 to 1
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:46<00:00, 46.82s/it]
100.00% spoken so far.
Elapsed: 0 minutes, ETA: 0 minutes
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1474.14it/s]
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil      56. 70.100 / 56. 70.100
libavcodec     58.134.100 / 58.134.100
libavformat    58. 76.100 / 58. 76.100
libavdevice    58. 13.100 / 58. 13.100
libavfilter     7.110.100 /  7.110.100
libswscale      5.  9.100 /  5.  9.100
libswresample   3.  9.100 /  3.  9.100
libpostproc    55.  9.100 / 55.  9.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'cosa-voz.m4a':
Metadata:
major_brand     : M4A
minor_version   : 512
compatible_brands: M4A isomiso2
encoder         : Lavf58.76.100
Duration: 00:00:35.68, start: 0.000000, bitrate: 65 kb/s
Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 24000 Hz, mono, fltp, 64 kb/s (default)
Metadata:
handler_name    : SoundHandler
vendor_id       : [0][0][0][0]
Input #1, ffmetadata, from 'FFMETADATAFILE':
Metadata:
ARTIST          : Unknown
ALBUM           : cosa
Duration: 00:00:34.69, start: 0.000000, bitrate: 0 kb/s
Chapters:
Chapter #1:0: start 0.000000, end 34.693000
Metadata:
title           : Part 1
File 'cosa-voz.m4b' already exists. Overwrite? [y/N] y
Output #0, ipod, to 'cosa-voz.m4b':
Metadata:
ARTIST          : Unknown
ALBUM           : cosa
encoder         : Lavf58.76.100
Chapters:
Chapter #0:0: start 0.000000, end 34.693000
Metadata:
title           : Part 1
Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 24000 Hz, mono, fltp, 64 kb/s (default)
Metadata:
handler_name    : SoundHandler
vendor_id       : [0][0][0][0]
Stream mapping:
Stream #0:0 -> #0:0 (copy)
Press [q] to stop, [?] for help
size=     286kB time=00:00:35.66 bitrate=  65.7kbits/s speed=5.91e+03x
video:0kB audio:281kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 1.669647%
cosa-voz.m4b complete

@aedocw
Owner

aedocw commented Dec 4, 2023

I'm not sure there's anything to be done, but I'll ask on the TTS discord. From what I've seen discussed by other people, the XTTS v2 model is multilingual and should pick up speaking characteristics from the samples provided.

The changes you made would likely have no impact on the speech synthesis itself, since that setting is only used to decide how to segment a given text into discrete sentences. It could at least help it segment sentences better, though.

@aedocw aedocw reopened this Dec 4, 2023
@aedocw
Owner

aedocw commented Dec 4, 2023

You could try this, from someone on Coqui TTS Discord who said they were not getting expected results in Spanish.
https://discord.com/channels/1037326658807533628/1062887209352581151/1178794254831718440

pip install --upgrade TTS
"After downloading the v2.0.2 model in a folder and now it works for me."

https://huggingface.co/coqui/XTTS-v2/raw/v2.0.2/config.json
https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/model.pth?download=true
https://huggingface.co/coqui/XTTS-v2/raw/v2.0.2/vocab.json

If you do this, you can try putting that downloaded model in a new folder, commenting out line 56, and adding a line:
self.xtts_model = "(full path to downloaded model)"

@lordraiden

I am quite interested in this. If you do find out how to make it work for Spanish, I would appreciate it if you could summarize all the changes and steps required.

@friki67
Author

friki67 commented Dec 12, 2023

> I am quite interested in this. If you do find out how to make it work for Spanish, I would appreciate it if you could summarize all the changes and steps required.

Hello. I've finally found time to try this again.
I installed the latest version from GitHub.
I downloaded the model to /home/ubuntu/work/model-2.0.2, modified line 56, ran epub2tts cosa.txt --xtts voz1/voz1-1.wav,voz1/voz1-2.wav,voz1/voz1-3.wav and got:

Namespace(sourcefile='cosa.txt', engine='tts', xtts='voz1/voz1-1.wav,voz1/voz1-2.wav,voz1/voz1-3.wav', openai=None, model='tts_models/en/vctk/vits', speaker='p335', scan=False, start=1, end=999, minratio=88, skiplinks=False, bitrate='69k', debug=False)
Saving to cosa-voz.m4b
Total characters: 385
Loading model: /home/ubuntu/work/model-2.0.2/tts_models--multilingual--multi-dataset--xtts_v2
> tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
> Using model: xtts
Traceback (most recent call last):
File "/home/ubuntu/.local/bin/epub2tts", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/.local/lib/python3.10/site-packages/epub2tts.py", line 419, in main
mybook.read_book(voice_samples=args.xtts, engine=args.engine, openai=args.openai, model_name=args.model, speaker=args.speaker, bitrate=args.bitrate)
File "/home/ubuntu/.local/lib/python3.10/site-packages/epub2tts.py", line 259, in read_book
config.load_json(model_json)
File "/home/ubuntu/.local/lib/python3.10/site-packages/coqpit/coqpit.py", line 726, in load_json
with open(file_name, "r", encoding="utf8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/work/model-2.0.2/tts_models--multilingual--multi-dataset--xtts_v2/config.json'

So I moved the model to /home/ubuntu/work/model-2.0.2/tts_models--multilingual--multi-dataset--xtts_v2 and ran it again. It worked, but I'm getting the same 'germanized' reading of the text.
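
The FileNotFoundError above comes down to the loader looking for config.json inside a subfolder that did not exist yet. Here is a small Python sketch for checking a model folder before running. `missing_model_files` is a hypothetical helper, and the three file names are taken from the download links earlier in the thread; other releases may need different files.

```python
import tempfile
from pathlib import Path

# File names taken from the three download links earlier in the thread.
REQUIRED_FILES = ("config.json", "model.pth", "vocab.json")


def missing_model_files(model_dir):
    """Return the required files that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Demo on a throwaway folder that contains only config.json.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "config.json").write_text("{}")
    demo_missing = missing_model_files(demo_dir)

print(demo_missing)  # ['model.pth', 'vocab.json']
```

Pointing it at the folder printed after "Loading model:" shows whether the files sit where the tool expects them.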

epub2tts cosa.txt --xtts voz4/voz4-1.wav,voz4/voz4-2.wav,voz4/voz4-3.wav outputs

Namespace(sourcefile='cosa.txt', engine='tts', xtts='voz4/voz4-1.wav,voz4/voz4-2.wav,voz4/voz4-3.wav', openai=None, model='tts_models/en/vctk/vits', speaker='p335', scan=False, start=1, end=999, minratio=88, skiplinks=False, bitrate='69k', debug=False)
Saving to cosa-voz.m4b
Total characters: 385
Loading model: /home/ubuntu/work/model-2.0.2/tts_models--multilingual--multi-dataset--xtts_v2
> tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
> Using model: xtts
Computing speaker latents...
Reading from 1 to 1
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:32<00:00, 32.76s/it]
100.00% spoken so far.
Elapsed: 0 minutes, ETA: 0 minutes
Replacing silences longer than one second with one second of silence (1 files)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1463.64it/s]
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil      56. 70.100 / 56. 70.100
libavcodec     58.134.100 / 58.134.100
libavformat    58. 76.100 / 58. 76.100
libavdevice    58. 13.100 / 58. 13.100
libavfilter     7.110.100 /  7.110.100
libswscale      5.  9.100 /  5.  9.100
libswresample   3.  9.100 /  3.  9.100
libpostproc    55.  9.100 / 55.  9.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'cosa-voz.m4a':
Metadata:
major_brand     : M4A
minor_version   : 512
compatible_brands: M4A isomiso2
encoder         : Lavf58.76.100
Duration: 00:00:30.15, start: 0.000000, bitrate: 68 kb/s
Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 24000 Hz, mono, fltp, 67 kb/s (default)
Metadata:
handler_name    : SoundHandler
vendor_id       : [0][0][0][0]
Input #1, ffmetadata, from 'FFMETADATAFILE':
Metadata:
ARTIST          : Unknown
ALBUM           : cosa
Duration: 00:00:29.20, start: 0.000000, bitrate: 0 kb/s
Chapters:
Chapter #1:0: start 0.000000, end 29.200000
Metadata:
title           : Part 1
File 'cosa-voz.m4b' already exists. Overwrite? [y/N] Y
Output #0, ipod, to 'cosa-voz.m4b':
Metadata:
ARTIST          : Unknown
ALBUM           : cosa
encoder         : Lavf58.76.100
Chapters:
Chapter #0:0: start 0.000000, end 29.200000
Metadata:
title           : Part 1
Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 24000 Hz, mono, fltp, 67 kb/s (default)
Metadata:
handler_name    : SoundHandler
vendor_id       : [0][0][0][0]
Stream mapping:
Stream #0:0 -> #0:0 (copy)
Press [q] to stop, [?] for help
size=     252kB time=00:00:30.12 bitrate=  68.5kbits/s speed=6.17e+03x
video:0kB audio:248kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 1.692174%
cosa-voz.m4b complete

The text I'm using

Corre el año 1834 y Madrid, una pequeña ciudad que trata de abrirse paso más allá de las murallas que la rodean, sufre una terrible epidemia de cólera. Pero la peste no es lo único que aterroriza a sus habitantes: en los arrabales aparecen cadáveres desmembrados de niñas que nadie reclama. Todos los rumores apuntan a la Bestia, un ser a quien nadie ha visto pero al que todos temen

It happens no matter which voice I use. One example:
https://github.com/aedocw/epub2tts/assets/60692795/66159809-93c7-4e10-80a1-6854e780c86d

I'm running this in an LXD (LXC) container. Please tell me if I can help get this working.

EDIT1: I've changed line 178 from "en" to "es" and got a good result! Now it sounds Spanish.
I've found some annoyances, though. Our "ñ" (phonetic /ñ/ or /ɲ/) is pronounced as "n" (phonetic /n/), which is a completely different phoneme. I've also found that our "diéresis" (umlaut, diaeresis) is not recognized: we write "ü" in words like "cigüeña" to indicate that the "u" must be pronounced, whereas without it, as in "gueto" (ghetto), the "u" is not pronounced.

EDIT2: I think most orthographic or spelling exceptions are not recognized. Changing line 56 back (using the 2.0 model) gives the same result. Sounds like "j" (which in Spanish is like a hard "h" in "he" or "have") are read like a "j" in English, etc.

EDIT3: In the demo space for XTTS, https://huggingface.co/spaces/coqui/xtts, the "ñ" and the "ü" in words like "pequeña, cigüeña, niño, avergüenza, lingüistico, ordeña, jota, ajeno, jaleo" are pronounced well, so there must be some parameter to fix it? Using the default English voice, I got this perfect pronunciation example:

output.mp4

@aedocw
Owner

aedocw commented Dec 12, 2023

This PR on Coqui TTS will be merged soon; it will include language and speaker samples from what used to be their paid studio product (and those voices sounded great!).

As soon as that merges, I'll add a --language option that should be a big help for this.
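
For readers following along, this is roughly what such an option looks like at the command-line level. It is a sketch only, not the code that was later merged: the parser name, help strings, and the "en" default are assumptions, while the `--xtts` and `--language` flag names come from this thread.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="epub2tts-sketch")
    parser.add_argument("sourcefile")
    parser.add_argument("--xtts", default=None,
                        help="comma-separated voice sample WAV files")
    parser.add_argument("--language", default="en",
                        help="language code handed to the TTS engine, e.g. es or it")
    return parser


args = build_parser().parse_args(
    ["cosa.txt", "--xtts", "voz1/voz1-1.wav,voz1/voz1-2.wav", "--language", "es"]
)
samples = args.xtts.split(",") if args.xtts else []
print(args.language, samples)  # es ['voz1/voz1-1.wav', 'voz1/voz1-2.wav']
```

The parsed language code would then have to be passed on to whichever synthesis call is in use, instead of a hard-coded "en".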

@danielw97

Not sure how closely you're watching the Coqui repo, but just to let you know, it looks like that pull request was merged a short time ago and is available in v0.22.0.

@friki67
Author

friki67 commented Dec 13, 2023

Hello again. Thank you for your time and patience.
I upgraded and tried again, but the same thing happened. The text is read with a Spanish accent, but without Spanish phonemes (ñ, j, etc.).
Then I tried this:

tts --text "pequeña, cigüeña, niño, avergüenza, lingüistico, ordeña, jota, ajeno, jaleo" \
--speaker_wav /home/ubuntu/work/voz1/voz1-1.wav \
--language_idx es \
--use_cuda true \
--out_path speech.wav \
--model_path /home/ubuntu/work/model-2.0.2/tts_models--multilingual--multi-dataset--xtts_v2 \
--config_path /home/ubuntu/work/model-2.0.2/tts_models--multilingual--multi-dataset--xtts_v2/config.json

And it worked fine!

I can see that read_chunk_xtts contains a call to model.inference_stream, and that is as far as I can follow. I can't see why this is not working for other languages. Maybe there is better language detection or handling in the model.synthesize function or something similar.

I opened a discussion in Coqui TTS: coqui-ai/TTS#3426.

@danielw97

I'm no expert, but aedocw may be planning changes to support a --language flag as well as the newly merged studio speakers, and that flag will probably help quite a bit with this.

@friki67
Author

friki67 commented Dec 15, 2023

@aedocw The last update fixed all the annoyances. Thank you very much!

@DaitiDay

Any help setting the language to Italian? I've tried setting --language 'it' but it's still an English voice reading an Italian book ._.

@danielw97

aedocw will probably chime in here, but I don't believe the language flag has been implemented in epub2tts yet. Going by a previous comment, I believe it is on the roadmap.

@DaitiDay

Oh, I read about it in the README.md, that's why I used it. If I don't use that flag, is there any other way or guide I can follow to set the desired language?

@danielw97

It seems this commit slipped past me, and it looks as though this flag has indeed been implemented.
Others may have more experience, since the primary language I use this tool with is English, but I would suggest making sure you are using an Italian TTS voice, whether XTTS or something else.
XTTS v2 is supposed to be able to switch between languages, although I'm not sure how accurate it is myself.

@aedocw
Owner

aedocw commented Dec 18, 2023

I think I can get something up today to use the language flag with xtts language_idx. It just gets a little bit messy since there are two places within coqui-TTS to specify language - it comes back to whether or not you're using XTTS basically.

I'll be able to poke at this late this afternoon my time (I'm in pacific TZ) :)

@DaitiDay

DaitiDay commented Dec 18, 2023

I don't want to take time away from you if it is something you do not intend to implement in the near future. I'm fine with a temporary solution. Even better if said solution makes me think a bit about the code, since I'm in the process of learning Python :)

PS: I was not using XTTS (just launching epub2tts epub.epub). Should I switch to it? In that case I'm quite confused by the sample thing: where can I get those?

@aedocw
Owner

aedocw commented Dec 18, 2023

No not at all, this is work I definitely planned to implement, I'm glad you are asking about it!

XTTSv2 has MUCH more human sounding voices. I added one sample to the repo (sample-shadow-coquiXTTS.m4b), but I really need to put more samples in a permanently accessible place. For instance here are samples of all Coqui's studio voices using XTTS: https://drive.google.com/drive/folders/1roXMrd7peX-zApvyogqfsjyi9nPNKNF_?usp=sharing

XTTS is a model that allows you to relatively easily clone voices. With some fine tuning, you can take 8-10 minutes of a recorded voice and get a speech model that sounds VERY VERY much like that person.

A really important thing to know, though, is that you need a GPU to use XTTS voices. Without a GPU I think it's technically possible, but so slow as to be useless. (The work happening with StyleTTS2 is probably going to allow for very life-like voices without requiring a GPU, but that is probably a few months away from being ready to use here.)

@DaitiDay

DaitiDay commented Dec 18, 2023

So, if I understand correctly, if I get 3 audio samples of around 30 seconds each with a voice of my choice, then pass them to epub2tts using --xtts <sample-1.wav>,<sample-2.wav>,<sample-3.wav> --language 'it' book.epub (once the --language flag has been implemented, obviously), I should get the epub read by the voice I chose? FOR REAL REAL?
PS: I have a 3070, is it ok?

Update: I've tried the previous command and it works fine, the audio is super clear. Only downside, the voice keeps spelling out "dot" quite often. Is there a way to correct that?

Edit 2: Supposing the script is using the GPU, is it normal for the conversion to take around 15-16 hours for an 800-ish page book?

@aedocw
Owner

aedocw commented Dec 18, 2023

Yes, it will work like that for reals! A 3070 should be just fine (as it seems you've discovered).

Regarding the voice spelling out "dot", can you paste a sample of text that leads to that? I'm always finding characters or things that confuse the synthesizer, and I add them to a section that replaces that text with a comma (or deletes it entirely).

Using the GPU, I think 15-16 hours for an 800 page book is probably correct. For me I think it's reading around real-time, so a book that is 8 hours spoken takes around 8 hours.

@DaitiDay

DaitiDay commented Dec 18, 2023

Ok perfect. Thanks a lot for the clarifications.
This is a sample of text where the voice spells out "dot":

«Seguitemi!» gridò Eragon. Levò Brisingr sopra la testa perché tutti la vedessero.

Both dots are spelled out.
In this part the quotation brackets "«" and "»" are also spelled out (I can't understand what is being said; it sounds like "ie" and "eu" or something like that, not a real word).

Edit: I've modified the epub, changing the quotation brackets to the usual ", and the problem is gone. Now I'm doing the same with the dot (maybe the character is not the "normal" dot).

@aedocw
Owner

aedocw commented Dec 18, 2023

Hmm, this could get complicated with other languages. I had some very harsh text reformatting going on that would try to match everything to unicode, and remove pretty much any special character. With that code, here's what would happen:
«Seguitemi!» gridò Eragon. Levò Brisingr sopra la testa perché tutti la vedessero.
becomes:
Seguitemi! grido Eragon. Levo Brisingr sopra la testa perche tutti la vedessero.

This basically assumes you're trying to speak English, so gridò and perché would probably be mispronounced.
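
The before/after pair above can be reproduced with the standard library alone. This sketch shows one way to get that effect and is not necessarily what the project code did. `fold_to_ascii` is a hypothetical helper that decomposes accented letters and then drops every non-ASCII character.

```python
import unicodedata


def fold_to_ascii(text):
    """Decompose accented letters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


original = "«Seguitemi!» gridò Eragon. Levò Brisingr sopra la testa perché tutti la vedessero."
print(fold_to_ascii(original))
# Seguitemi! grido Eragon. Levo Brisingr sopra la testa perche tutti la vedessero.
```

The same folding turns "cigüeña" into "ciguena", which is why this kind of cleanup loses information for languages other than English.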

At the very least characters like "«" and "»" can safely be removed since the text-to-speech doesn't do anything special with a phrase in quotes (as far as I can tell). It's also possible that the period in your text is in a different character set so it's confusing the TTS. I'll see if I can confirm that, and maybe at least translate the period.
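
A gentler cleanup, sketched under the assumption that only the guillemets need to go, plus a helper for checking which characters in a passage are not plain ASCII (useful for confirming whether a period really is U+002E). Both function names are hypothetical.

```python
import unicodedata

QUOTE_MARKS = "«»"


def strip_quote_marks(text):
    """Remove guillemets while leaving accented letters untouched."""
    return text.translate({ord(ch): None for ch in QUOTE_MARKS})


def describe_non_ascii(text):
    """Map each distinct non-ASCII character to its Unicode name."""
    return {ch: unicodedata.name(ch, "UNKNOWN")
            for ch in sorted(set(text)) if ord(ch) > 127}


sample = "«Seguitemi!» gridò Eragon."
print(strip_quote_marks(sample))
for ch, name in describe_non_ascii(sample).items():
    print(f"U+{ord(ch):04X} {name}")
```

If the periods in a book came from another block (for example an ideographic full stop), they would show up in the second listing; an ordinary "." would not.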

@DaitiDay

DaitiDay commented Dec 18, 2023

Actually, both gridò and perché are pronounced correctly, so that's not a problem. If you're able to add the quotation brackets to the reformatting method, that would be super. As for the period, I tried changing every period to the '.' on my keyboard, but the result does not change. I don't know how, but if you can solve that then the TTS would be actually perfect.

Edit: I forgot to mention, not all periods are spelled out, but I can't find anything in common between the spelled period vs the non spelled ones.

Edit 2: After some tests, I've noticed something that may be useful: when the period is spelled out, the result can be either:

  1. The period is just spelled as an additional word
  2. The period is spelled out, followed by other indistinguishable sounds (usually 2-3 vowels). Then the word before the period, the period (with the weird sound), and the word after the period are pronounced as if they were a single word, ruining the comprehensibility of the text

Edit 3: Last update for today: after listening to the audio a million times, it sounds like the voice is sometimes saying "punto tondo", which means "round point"/"round dot" in Italian. That may mean the actual symbol is not being recognized as a proper period. The weird sound reported in Edit 2 is still there in some other circumstances. Furthermore, I noticed that changing the samples changes the result: some periods that were spelled out are treated normally after changing samples, but at the same time other periods get spelled out. Is there any advice on how the samples should be chosen?

@aedocw
Owner

aedocw commented Dec 19, 2023

The branch "more-text-cleanup" has what seems to work for me. It does seem like the text-to-speech wants to pronounce periods as "punto". I tried replacing each one with the Unicode character for full stop (chr(0x002E)), but that had no impact. Replacing periods with commas seems to work, though, and still causes the TTS to take a beat before speaking the next part (vs. just removing the period, which makes all sentences run together).
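
A minimal sketch of the period-to-comma idea, not the code in that branch. The regular expression is an assumption about what counts as a sentence-ending period: one followed by whitespace or the end of the text, and not part of an ellipsis.

```python
import re

# A lone period followed by whitespace or end of text; "..." and "3.14" are left alone.
SENTENCE_PERIOD = re.compile(r"(?<!\.)\.(?=\s|$)")


def periods_to_commas(text):
    """Replace sentence-ending periods with commas so the TTS still pauses."""
    return SENTENCE_PERIOD.sub(",", text)


print(periods_to_commas("«Seguitemi!» gridò Eragon. Levò Brisingr sopra la testa."))
# «Seguitemi!» gridò Eragon, Levò Brisingr sopra la testa,
```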

If you could try this branch and let me know how it sounds to you, I would appreciate it. Thanks!

@aedocw
Owner

aedocw commented Dec 19, 2023

Oh BTW I tested with the following command:
python ~/repos/epub2tts/epub2tts.py it-test.txt --language it --xtts /home/doc/voices/adam-1.wav --debug

@DaitiDay

DaitiDay commented Dec 19, 2023

I'll try it tomorrow morning as soon as I have some spare time and update here. In the meantime, thank you man, I really appreciate the work.

Update: Sorry for the late update, but I did some testing following your directions.

  1. The new branch removes the problem with the spelled-out periods BUT in many cases breaks the rhythm of the sentence too much, making it difficult to understand
  2. The solution for the quotation brackets works perfectly
  3. As I said, I've edited the epub several times to get the best possible result (using the main branch to avoid the period-to-comma substitution), and this is what I found:
    • Changing every period . into comma+period ,. allows the voice to read the entire sentence with normal pronunciation.
    • In some cases the period is still spelled out, but it is very rare (in a 40-minute audio it only happened about ten times).
    • More frequently, the pause between two paragraphs is cut out (still not a problem, the audio is still understandable, but it's kinda annoying in some cases)
    • Sometimes longer phrases have a weird inflection mid-sentence, which I attribute to the character limit. I'd suggest increasing the character limit to obtain better comprehensibility of the text, if this does not create any other problem ofc. The current limit is 213 (looking at the output of the command); listening to the audio output, I'd suggest a number around 225-230-ish.
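
To make the character-limit point concrete, here is a sketch of splitting text into pieces no longer than a given limit. How epub2tts actually chunks text is not shown in this thread; this version uses the standard `textwrap` module and breaks only on whitespace, and `split_into_chunks` is a hypothetical name.

```python
import textwrap


def split_into_chunks(text, limit=230):
    """Split text on whitespace into pieces of at most `limit` characters."""
    return textwrap.wrap(text, width=limit,
                         break_long_words=False, break_on_hyphens=False)


text = (
    "Corre el año 1834 y Madrid, una pequeña ciudad que trata de abrirse paso "
    "más allá de las murallas que la rodean, sufre una terrible epidemia de cólera. "
    "Pero la peste no es lo único que aterroriza a sus habitantes: en los arrabales "
    "aparecen cadáveres desmembrados de niñas que nadie reclama."
)

for limit in (213, 230):
    chunks = split_into_chunks(text, limit)
    print(limit, [len(chunk) for chunk in chunks])
```

A larger limit moves the break points, which is where the odd mid-sentence inflection would be expected to shift.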

This said, even just addressing the first 2 points makes the audio perfectly comprehensible with some minor annoyance tbh.
The command I used was python ~/path-to-main-branch/epub2tts.py --language it --xtts ~/path-to-voice-sample.wav --start 3 --end 4 ~/path-to-epub.epub

Let me know if you need something else.

Edit: One note: I did the substitution (. to ,.) in 2 steps: first I substituted all periods with comma+period, THEN I substituted all ,.,.,. back to ... in order to avoid weird artifacts during the audio generation phase.
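
The two-step edit described here is easy to script instead of doing it by hand in the epub. A minimal sketch follows; the function name is hypothetical.

```python
def comma_period_workaround(text):
    """Apply the two-step substitution described above."""
    step1 = text.replace(".", ",.")        # every period becomes comma+period
    return step1.replace(",.,.,.", "...")  # restore what used to be ellipses


print(comma_period_workaround("Ciao. Poi... basta."))  # Ciao,. Poi... basta,.
```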

@aedocw aedocw linked a pull request Dec 23, 2023 that will close this issue