
Deepdubpy

A complete end-to-end Deep Learning system to generate high-quality, human-like English speech for Korean drama. (WIP)

Status

Check the Projects tab.

What am I doing here?

The pipeline I came up with consists of several steps, outlined below.

Step 0: Preprocessing subtitles to get sentences

The pipeline relies heavily on subtitles for the dubbing procedure to work, i.e., the subs should match the intended audio in the video file. If they don't, use the shift parameter of the DeepdubSentence constructor. These sentences (stored in sentence_df) are used to create audio segments in Step 1.
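A minimal sketch of what this step produces, assuming pysrt for parsing; the actual DeepdubSentence class, file names, and column names may differ:

import pandas as pd
import pysrt

# Parse the subtitle file and shift it if it is out of sync with the audio,
# analogous to the shift parameter of DeepdubSentence (file name is hypothetical).
subs = pysrt.open("episode01.srt")
subs.shift(seconds=-1.5)

# Build sentence_df: one row per subtitle sentence with start/end in seconds.
sentence_df = pd.DataFrame(
    [{"start": s.start.ordinal / 1000,
      "end": s.end.ordinal / 1000,
      "sentence": s.text_without_tags}
     for s in subs]
)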

Step 1: Generating audio segments

The sentence_df is then used to create audio segments, giving a reasonably accurate mapping of sentences to spoken audio. We also create segments that do not contain any spoken sentence/dialogue (as per the preprocessed subtitles), writing each segment to <hash>.wav (a hash of the start and end timestamps of the sentence from sentence_df). All of these file names are written to audio_segments_list.txt so that the generated audio and the non-speech segments can be concatenated back later. The audio_df dataframe stores information about every audio segment, including its exact start and stop timestamps.
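An illustrative sketch of the segmentation using pydub, naming each segment by a hash of its timestamps; the non-speech gap segments are handled the same way and are omitted here for brevity. File and column names are assumptions:

import hashlib
from pydub import AudioSegment

audio = AudioSegment.from_file("episode01.wav")   # full episode audio (hypothetical name)
segment_names = []

for _, row in sentence_df.iterrows():
    start_ms, end_ms = int(row["start"] * 1000), int(row["end"] * 1000)
    seg_hash = hashlib.md5(f"{start_ms}-{end_ms}".encode()).hexdigest()
    audio[start_ms:end_ms].export(f"{seg_hash}.wav", format="wav")   # write <hash>.wav
    segment_names.append(f"{seg_hash}.wav")

# Keep the ordered list of segment files so everything can be concatenated back in Step 5.
with open("audio_segments_list.txt", "w") as f:
    f.write("\n".join(segment_names))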

Step 2: Source separation (splitting accompaniments and vocals)

The background sound effects/accompaniments would act as noise in Step 3 (and possibly Step 4). This is solved with a source separation technique (using Spleeter), splitting the original audio containing speech into <hash>_vocals.wav and <hash>_accompaniments.wav. This step is performed only for audio segments containing known speech (i.e., based on sentence_df), completely retaining the background sound effects for segments that don't contain any speech.
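Spleeter's 2-stem model does the vocals/accompaniment split; here is a sketch with its Python API. Spleeter names its outputs vocals.wav and accompaniment.wav inside a per-file folder, so renaming to the <hash>_*.wav convention above is a separate, assumed step:

from spleeter.separator import Separator

separator = Separator("spleeter:2stems")              # vocals + accompaniment model
separator.separate_to_file("ab12cd.wav", "output/")   # writes output/ab12cd/vocals.wav
                                                      # and output/ab12cd/accompaniment.wav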

Step 3: Clustering audio segments for speaker diarization

We don't know, from subtitles alone, who spoke a particular audio segment. We need to label each audio segment so that we can dub it in that particular speaker's voice. For this I apply clustering to speaker embeddings of the audio segments, producing speaker labels.
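A sketch of the idea, assuming Deep Speaker's documented API for the embeddings and scikit-learn's agglomerative clustering for the labels; the clustering algorithm, distance threshold, checkpoint path, and audio_df columns here are assumptions, not necessarily what this repo uses:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from deep_speaker.audio import read_mfcc
from deep_speaker.batcher import sample_from_mfcc
from deep_speaker.constants import SAMPLE_RATE, NUM_FRAMES
from deep_speaker.conv_models import DeepSpeakerModel

# Load the pretrained Deep Speaker model (see the install notes below).
model = DeepSpeakerModel()
model.m.load_weights("pretrained_models/ResCNN_triplet_training_checkpoint_265.h5",
                     by_name=True)

def embed(wav_path):
    # Speaker embedding for one vocal segment.
    mfcc = sample_from_mfcc(read_mfcc(wav_path, SAMPLE_RATE), NUM_FRAMES)
    return model.m.predict(np.expand_dims(mfcc, axis=0))[0]

embeddings = np.stack([embed(f"{h}_vocals.wav") for h in audio_df["hash"]])

# Cluster embeddings into speakers; the distance threshold needs tuning per show.
labels = AgglomerativeClustering(n_clusters=None,
                                 distance_threshold=1.0).fit_predict(embeddings)
audio_df["speaker"] = labels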

Step 4: Voice Reproduction

From the previous step we know which audio segment is spoken by which speaker. We can use those speech segments of a particular speaker for voice adaptation, generating speech (<hash>_gen.wav) using a TTS (Text-To-Speech) model and the preprocessed subtitle sentences.
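The project is still WIP and the description above doesn't pin a specific TTS model; the sketch below assumes Coqui TTS's YourTTS for zero-shot voice adaptation from a reference clip of the clustered speaker, purely as an illustration:

from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="I told you not to go there alone.",   # example sentence from sentence_df
    speaker_wav="ab12cd_vocals.wav",            # reference audio of the target speaker
    language="en",
    file_path="ab12cd_gen.wav",                 # generated English speech
)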

Step 5: Accompaniments Overlay and Concatenation of audio segments.

The generated speech (<hash>.wav) is overlaid with the accompaniments (<hash>_accompaniments.wav) to get <hash>_gen.wav. This ensures that the speech is in the intended language while the background sound effects are preserved. Finally, we use audio_segments_list.txt to concatenate the audio segments back together and produce the final output audio.
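A sketch of the overlay and concatenation with pydub; the file naming mirrors the description above and is otherwise an assumption:

from pydub import AudioSegment

# Mix the generated speech with the segment's accompaniment track.
speech = AudioSegment.from_wav("ab12cd.wav")
accompaniment = AudioSegment.from_wav("ab12cd_accompaniments.wav")
speech.overlay(accompaniment).export("ab12cd_gen.wav", format="wav")

# Stitch every segment back together in the original order; in practice the
# entries for speech segments would point at the dubbed (overlaid) files.
final = AudioSegment.empty()
with open("audio_segments_list.txt") as f:
    for name in f.read().splitlines():
        final += AudioSegment.from_wav(name)
final.export("dubbed_episode.wav", format="wav")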

Want to Contribute?

Look into the issues. You can begin with issues tagged good first issue, or if you want to suggest something else, open a new issue.


  1. This project uses Spleeter for source separation.
@article{spleeter2020,
  doi = {10.21105/joss.02154},
  url = {https://doi.org/10.21105/joss.02154},
  year = {2020},
  publisher = {The Open Journal},
  volume = {5},
  number = {50},
  pages = {2154},
  author = {Romain Hennequin and Anis Khlif and Felix Voituret and Manuel Moussallam},
  title = {Spleeter: a fast and efficient music source separation tool with pre-trained models},
  journal = {Journal of Open Source Software},
  note = {Deezer Research}
}

Install dependencies for Spleeter:

conda install -c conda-forge ffmpeg libsndfile
pip install spleeter
  1. This project also uses Deep Speaker for speaker identification. Install its requirements with:
pip install -r deep_speaker/requirements.txt

Download the pretrained model weights from here or from here and put them in the ./pretrained_models folder of the current directory.
