Process with a VTT or SRT in realtime or not #140

Open
ROBERT-MCDOWELL opened this issue Oct 4, 2024 · 6 comments

Comments

@ROBERT-MCDOWELL

It would be fantastic if RealtimeTTS could take a VTT or SRT file (or another subtitle format) as input and have the engine respect the start time of each segment. That would give a direct audio translation, either played in real time or recorded to an audio file (AAC, WAV, or MP3, for example).
Or is this already possible?

@KoljaB
Owner

KoljaB commented Oct 4, 2024

https://github.com/KoljaB/TurnVoice/blob/main/turnvoice%2Fcore%2Fsynthesis.py#L272

This does something very similar.
I think the idea of processing VTT and SRT is great, but it's hard to do in real time. It might be more of an add-on project.

@ROBERT-MCDOWELL
Author

ROBERT-MCDOWELL commented Oct 4, 2024

Well, even if it's not real time, it would already help a lot ;). I'm working on it now, but my biggest issue is getting a dummy audio device working, as my computer doesn't have a sound card...
How would you use synthesis.py in the VTT/SRT context?

@KoljaB
Owner

KoljaB commented Oct 4, 2024

I'd parse the file to get the duration of each segment and pass it as the desired_duration parameter to the synthesize_duration method, so the text is spoken in the correct amount of time. Fill the parts where nothing is spoken with silence and you're good, I guess.
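
The parsing step described above could be sketched roughly as follows. This is only an illustration, not TurnVoice code: `Cue`, `parse_cues`, and `plan` are hypothetical names, and the sketch assumes well-formed SRT or VTT timing lines separated from the next cue by a blank line. It does not call synthesize_duration itself; it only produces the per-segment duration and the silence gap that would be needed.

```python
import re
from dataclasses import dataclass

# Matches SRT ("00:00:01,000") and VTT ("00:00:01.000" or "00:01.000") timing lines.
TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})[.,](\d{3})\s*-->\s*"
    r"(?:(\d+):)?(\d{2}):(\d{2})[.,](\d{3})"
)


@dataclass
class Cue:
    start: float  # seconds from the start of the file
    end: float
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def _seconds(h, m, s, ms) -> float:
    # The hours field is optional in VTT, so it may be None.
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_cues(content: str) -> list[Cue]:
    """Extract timed cues from SRT or VTT text (hypothetical helper)."""
    cues = []
    # Cues are separated by blank lines; blocks without a timing line
    # (such as the "WEBVTT" header) are skipped.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                g = match.groups()
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append(Cue(_seconds(*g[:4]), _seconds(*g[4:]), text))
                break
    return cues


def plan(cues):
    """Yield (silence_before, desired_duration, text) for each cue.

    desired_duration is the value one would hand to the synthesis step;
    silence_before is the gap to fill where nothing is spoken.
    """
    cursor = 0.0
    for cue in cues:
        yield max(0.0, cue.start - cursor), cue.duration, cue.text
        cursor = cue.end
```

Iterating over `plan(parse_cues(text))` gives, for each segment, how much silence to write first and how long the spoken audio should last. Overlapping cues are treated as having no gap, which is a simplification.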

@KoljaB
Owner

KoljaB commented Oct 4, 2024

It's hard to make this real time. The final duration of the synthesized audio is unknown beforehand (especially with neural TTS engines, whose output is nondeterministic), so we test-synthesize here, measure the duration of the result, and apply a speed correction factor afterwards. In other words, we stretch the audio in place. But we need the full audio generated to do this, which is far from real time.
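
The arithmetic behind such a speed correction can be sketched as below. This is an assumption-laden illustration, not the TurnVoice implementation: the function name `correction` and the clamping bounds `min_factor` and `max_factor` are hypothetical, added only so that extreme stretching does not make the speech unintelligible. The actual time-stretching of the audio is left out.

```python
def correction(measured: float, desired: float,
               min_factor: float = 0.75, max_factor: float = 1.5):
    """Return (factor, slack) for one segment.

    measured: duration in seconds of the test synthesis.
    desired:  duration in seconds of the subtitle slot.
    factor:   playback-rate multiplier (>1 speeds up, <1 slows down),
              clamped to [min_factor, max_factor] (hypothetical bounds).
    slack:    seconds left in the slot after stretching; positive means
              silence to pad, negative means the audio still overruns.
    """
    if measured <= 0 or desired <= 0:
        raise ValueError("durations must be positive")
    factor = measured / desired
    factor = max(min_factor, min(max_factor, factor))
    final_duration = measured / factor
    return factor, desired - final_duration
```

Because `measured` is only known once the whole segment has been synthesized, the factor cannot be computed while audio is still streaming, which is the reason this approach is not real time.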

@ROBERT-MCDOWELL
Author

ROBERT-MCDOWELL commented Oct 4, 2024

Oh my! Sorry, I just realized the link you sent points to another repo. TurnVoice is already a very good start indeed!
About real time: indeed, only processing in pre-generated chunks can do the trick. It won't be truly real time, but more like 1 to 3 seconds of latency. Anyhow, even in an in-person meeting with a human translator there is always some latency ;).

@ROBERT-MCDOWELL
Author

ROBERT-MCDOWELL commented Oct 5, 2024

@KoljaB I opened a new discussion on the TurnVoice repo about VTT/SRT import, as I think it's the better repo for this: an option to import SRT/VTT instead of video/audio, which bypasses STT and translation and keeps TTS as the only step.
