A complete end-to-end deep learning system to generate high-quality, human-like English speech for Korean dramas. (WIP)
Check Projects.
These are the various steps I came up with:
Step 0: Preprocessing subtitles to get sentences
The project relies heavily on subtitles for the dubbing procedure to work, i.e., the subs should match the intended audio in the video file. If they don't, use the `shift` parameter of the `DeepdubSentence` constructor. These sentences (stored in `sentence_df`) are used to create audio segments in step 1.
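For illustration only, here is a minimal sketch of what this preprocessing could look like, assuming `pysrt` for subtitle parsing and `pandas` for the dataframe; the actual `DeepdubSentence` implementation may differ in details such as column names and sentence merging.

```python
# Sketch of step 0 (not the actual DeepdubSentence implementation): parse the .srt,
# optionally shift the timestamps to align with the audio, and build sentence_df.
import pandas as pd
import pysrt

def build_sentence_df(srt_path, shift_seconds=0.0):
    subs = pysrt.open(srt_path)
    if shift_seconds:
        # Same idea as the `shift` parameter: fix subs that are offset from the audio.
        subs.shift(milliseconds=int(shift_seconds * 1000))
    rows = []
    for sub in subs:
        rows.append({
            "start": sub.start.ordinal / 1000.0,  # seconds
            "end": sub.end.ordinal / 1000.0,
            "sentence": sub.text.replace("\n", " ").strip(),
        })
    return pd.DataFrame(rows)

sentence_df = build_sentence_df("episode01.en.srt", shift_seconds=0.5)
```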
Step 1: Generating audio segments
The `sentence_df` can then be used to create audio segments, giving an accurate-enough mapping of sentences to spoken audio. We also create segments that do not contain any spoken sentence/dialog (as per the preprocessed subtitles), and write each segment to `<hash>.wav`, where `<hash>` is a hash of the segment's start and end timestamps from `sentence_df`. All of these file names are written to `audio_segments_list.txt`, which is later used to concatenate the generated audio back together with the audio segments that don't contain any spoken dialog. The `audio_df` dataframe stores information about all audio segments, including the exact start and stop timestamps.
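Continuing that sketch, the segmentation could look roughly as follows, assuming `pydub` for slicing and an MD5 of the timestamps as the `<hash>`; the repo's actual hashing scheme and list format may differ.

```python
# Sketch of step 1: cut the episode's audio into per-sentence segments plus the
# non-speech gaps between them, named <hash>.wav, and record the order in
# audio_segments_list.txt so everything can be concatenated back later.
import hashlib
from pydub import AudioSegment

def segment_hash(start, end):
    # Assumed scheme: hash of the start/end timestamps of the segment.
    return hashlib.md5(f"{start:.3f}-{end:.3f}".encode()).hexdigest()

audio = AudioSegment.from_file("episode01.wav")
segments = []  # (hash, start, end, has_speech) rows, i.e. a simplified audio_df
cursor = 0.0
for _, row in sentence_df.iterrows():
    if row["start"] > cursor:  # gap with no dialog
        segments.append((segment_hash(cursor, row["start"]), cursor, row["start"], False))
    segments.append((segment_hash(row["start"], row["end"]), row["start"], row["end"], True))
    cursor = row["end"]
if cursor < audio.duration_seconds:  # trailing audio after the last subtitle
    segments.append((segment_hash(cursor, audio.duration_seconds), cursor, audio.duration_seconds, False))

with open("audio_segments_list.txt", "w") as f:
    for h, start, end, has_speech in segments:
        audio[int(start * 1000):int(end * 1000)].export(f"{h}.wav", format="wav")
        f.write(f"{h}.wav\n")  # order matters for the final concatenation
```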
Step 2: Source separation/separating accompaniments and vocals.
The background sound effects/accompaniments act as noise for step 3 (and possibly step 4). This is solved with a source-separation technique (using Spleeter), splitting each original audio segment containing speech into `<hash>_vocals.wav` and `<hash>_accompaniments.wav`. This step is performed only for the audio segments containing known speech (i.e., based on `sentence_df`), completely retaining the background sound effects for audio segments that don't contain any speech.
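Spleeter ships a Python API for this. A sketch of feeding the speech segments through the 2-stems model is below; note that Spleeter names its stems `vocals` and `accompaniment`, and the `filename_format` used here to approximate the `<hash>_vocals.wav` naming is an assumption, not necessarily what the repo does.

```python
# Sketch of step 2: run Spleeter's 2-stems separation on the speech segments only.
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")  # splits into vocals + accompaniment
for h, start, end, has_speech in segments:
    if not has_speech:
        continue  # non-speech segments keep their original audio untouched
    separator.separate_to_file(
        f"{h}.wav",
        ".",  # write next to the segment files
        filename_format="{filename}_{instrument}.{codec}",
    )
```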
Step 3: Clustering audio segments for speaker diarization
We don't know who spoke a particular audio segment just from the subtitles. We need to label the audio segments so that each segment can be dubbed in that particular speaker's voice. For this I apply clustering to the speaker embeddings of the audio segments, producing the labels.
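As a sketch of how this could be done with Deep Speaker embeddings and scikit-learn clustering, see below; the import paths, the checkpoint file name, and the choice of agglomerative clustering with a fixed number of speakers are all assumptions rather than a description of the repo's exact code.

```python
# Sketch of step 3: embed each separated vocal track with Deep Speaker, then cluster
# the embeddings so every speech segment gets a speaker label.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Import paths assume deep_speaker is vendored as a package; they may differ.
from deep_speaker.audio import read_mfcc
from deep_speaker.batcher import sample_from_mfcc
from deep_speaker.constants import SAMPLE_RATE, NUM_FRAMES
from deep_speaker.conv_models import DeepSpeakerModel

model = DeepSpeakerModel()
model.m.load_weights(
    "pretrained_models/ResCNN_triplet_training_checkpoint_265.h5", by_name=True
)

speech_hashes = [h for h, _, _, has_speech in segments if has_speech]
embeddings = []
for h in speech_hashes:
    mfcc = sample_from_mfcc(read_mfcc(f"{h}_vocals.wav", SAMPLE_RATE), NUM_FRAMES)
    embeddings.append(model.m.predict(np.expand_dims(mfcc, axis=0))[0])

# The number of speakers is guessed up front here; a real pipeline might estimate it.
labels = AgglomerativeClustering(n_clusters=6).fit_predict(np.stack(embeddings))
speaker_of = dict(zip(speech_hashes, labels))
```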
Step 4: Voice adaptation and TTS
From the previous step we know which audio segments are spoken by which speaker. We can use those speech segments for voice adaptation to that particular speaker, generating speech (`<hash>_gen.wav`) using a TTS (Text-To-Speech) model and the preprocessed subs (sentences).
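This section doesn't pin down the TTS model, so purely to illustrate how the speaker labels, reference clips, and sentences fit together, the sketch below uses a voice-cloning TTS (Coqui TTS's YourTTS) as a stand-in; the project's actual TTS and adaptation procedure may well be different.

```python
# Sketch of step 4: synthesize each sentence in the voice of its clustered speaker.
# Coqui TTS / YourTTS is only an example voice-cloning model, not the project's choice.
from collections import defaultdict
from TTS.api import TTS

# Group the separated vocal clips by speaker label to use as adaptation references.
references = defaultdict(list)
for h, label in speaker_of.items():
    references[label].append(f"{h}_vocals.wav")

tts = TTS("tts_models/multilingual/multi-dataset/your_tts")
for _, row in sentence_df.iterrows():
    h = segment_hash(row["start"], row["end"])
    tts.tts_to_file(
        text=row["sentence"],
        speaker_wav=references[speaker_of[h]][0],  # one reference clip of that speaker
        language="en",
        file_path=f"{h}_gen.wav",
    )
```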
Step 5: Accompaniments Overlay and Concatenation of audio segments.
The generated speech (`<hash>_gen.wav`) is overlaid with the accompaniments (`<hash>_accompaniments.wav`) to produce the final audio for that segment. This ensures that the speech is in the intended language while the sound effects are preserved. Finally, we use `audio_segments_list.txt` to concatenate the audio segments back together and produce the final output audio.
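Continuing the earlier sketches, `pydub` can handle both the overlay and the final concatenation; the exact file the mixed segment is written back to is an assumption here.

```python
# Sketch of step 5: mix each generated sentence with its accompaniment track, then
# stitch every segment back together in the order stored in audio_segments_list.txt.
from pydub import AudioSegment

for h in speaker_of:  # only speech segments have generated speech + accompaniments
    speech = AudioSegment.from_wav(f"{h}_gen.wav")
    # Spleeter's stem is "accompaniment"; the repo calls this <hash>_accompaniments.wav.
    accompaniment = AudioSegment.from_wav(f"{h}_accompaniment.wav")
    # Keep the background sound effects underneath the newly generated speech.
    accompaniment.overlay(speech).export(f"{h}.wav", format="wav")

final = AudioSegment.empty()
with open("audio_segments_list.txt") as f:
    for line in f:
        final += AudioSegment.from_wav(line.strip())
final.export("dubbed_output.wav", format="wav")
```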
Look into the issues. You can begin with issues tagged good first issue,
or if you want to suggest something else, open a new issue.
- This project uses Spleeter for source separation.
@article{spleeter2020,
doi = {10.21105/joss.02154},
url = {https://doi.org/10.21105/joss.02154},
year = {2020},
publisher = {The Open Journal},
volume = {5},
number = {50},
pages = {2154},
author = {Romain Hennequin and Anis Khlif and Felix Voituret and Manuel Moussallam},
title = {Spleeter: a fast and efficient music source separation tool with pre-trained models},
journal = {Journal of Open Source Software},
note = {Deezer Research}
}
Install dependencies for Spleeter:
```
conda install -c conda-forge ffmpeg libsndfile
pip install spleeter
```
- This project also uses Deep Speaker for speaker identification. Install its requirements with:
```
pip install -r deep_speaker/requirements.txt
```
Download the pretrained model weights from here or from here and put them in the `./pretrained_models` folder of the current directory.