AdaSpeech 1 - PyTorch Implementation

This is an unofficial PyTorch implementation of Microsoft's text-to-speech system AdaSpeech: Adaptive Text to Speech for Custom Voice.

This project is based on ming024's implementation of FastSpeech2. Feel free to use or modify the code.

Quickstart

Dependencies

Linux
Python 3.7+
PyTorch 1.10.1 or higher and CUDA

a. Create a conda virtual environment and activate it.

conda create -n adaspeech1 python=3.7
conda activate adaspeech1

b. Install PyTorch and torchvision following the official instructions.

c. Clone this repository.

git clone https://github.com/yw0nam/Adaspeech/
cd Adaspeech

d. Install the requirements.

pip install -r requirements.txt
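
After installation, it can help to confirm that the interpreter and PyTorch match the dependencies listed above. The following is a minimal sketch, not part of this repository; the function name `check_environment` and the keys of the report are illustrative assumptions.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 7)):
    """Return a small report on the Python and PyTorch setup (hypothetical helper)."""
    report = {"python_ok": sys.version_info >= min_python}
    # find_spec looks the package up without importing it.
    report["torch_installed"] = importlib.util.find_spec("torch") is not None
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as False, training will most likely fall back to the CPU or fail, depending on how the training script selects devices.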

Training

Datasets

The supported datasets are:

LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
KSS: a single-speaker Korean dataset of 12,853 audio clips. It consists of audio files recorded by a professional female voice actress and their aligned text extracted from b
Kokoro: a single-speaker Japanese dataset. It contains 43,253 short audio clips of a single speaker reading 14 novels. The audio clips were split and the transcripts were aligned automatically by Kokoro-Align.

You can also train on other datasets, including multi-speaker datasets, but you need to generate the TextGrid alignments yourself.

We take LJSpeech as an example hereafter.

Preprocessing

First, run

python3 prepare_align.py config/LJSpeech/preprocess.yaml

to create the .lab and .wav files in the raw_data folder.
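
Since forced alignment needs a transcript for every audio clip, a quick check that each .wav file has a matching .lab file can catch problems early. This is a minimal sketch under the assumption that the two files share a base name and sit in the same directory; `find_unpaired_wavs` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def find_unpaired_wavs(raw_dir):
    """Return the .wav files under raw_dir that have no sibling .lab file."""
    missing = []
    for wav in sorted(Path(raw_dir).rglob("*.wav")):
        # Assumption: the transcript shares the base name, e.g. clip.wav / clip.lab.
        if not wav.with_suffix(".lab").exists():
            missing.append(wav)
    return missing


if __name__ == "__main__":
    unpaired = find_unpaired_wavs("raw_data/LJSpeech")
    print(f"{len(unpaired)} .wav file(s) without a .lab transcript")
```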

As described in the paper, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments of the supported datasets are provided here. (Note: these are the same files provided by ming024.)

Alignments for the other datasets are available here.

Unzip the files into preprocessed_data/LJSpeech/TextGrid/.
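
Before preprocessing, you may want to confirm that the unzipped alignments cover the audio clips. The sketch below compares base names between the two folders; it assumes alignment files use the `.TextGrid` extension that MFA writes, and `missing_alignments` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def missing_alignments(raw_dir, textgrid_dir):
    """Return base names of .wav files that have no .TextGrid alignment."""
    wav_stems = {p.stem for p in Path(raw_dir).rglob("*.wav")}
    aligned_stems = {p.stem for p in Path(textgrid_dir).rglob("*.TextGrid")}
    return sorted(wav_stems - aligned_stems)


if __name__ == "__main__":
    missing = missing_alignments(
        "raw_data/LJSpeech", "preprocessed_data/LJSpeech/TextGrid"
    )
    print(f"{len(missing)} utterance(s) without an alignment")
```

A few unaligned utterances are not necessarily a problem, but an empty TextGrid folder usually means the archive was extracted to the wrong location.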

Then run the preprocessing script:

python3 preprocess.py config/LJSpeech/preprocess.yaml

Alternatively, you can align the corpus yourself. Download the official MFA package and run

./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech

or

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech

to align the corpus and then run the preprocessing script.

python3 preprocess.py config/LJSpeech/preprocess.yaml

Training

Train the model with the following command.

CUDA_VISIBLE_DEVICES=0,1 python3 train_pl.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

Inference

A pretrained model is available here. You can run inference with

python3 inference.py

I will release inference code using the reference audio and given text soon.

TensorBoard

Use

tensorboard --logdir output/log/LJSpeech/version_0/

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.
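
Repeated training runs typically produce additional version_N folders next to version_0. The following sketch picks the highest-numbered one so it can be passed to --logdir; it assumes the folders follow the version_N naming seen in the command above, and `latest_version_dir` is a hypothetical helper, not part of this repository.

```python
import re
from pathlib import Path


def latest_version_dir(log_root):
    """Return the version_N subfolder with the largest N, or None if there is none."""
    root = Path(log_root)
    if not root.is_dir():
        return None
    pattern = re.compile(r"version_(\d+)$")
    versions = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            versions.append((int(match.group(1)), child))
    if not versions:
        return None
    # Compare numerically so that version_10 ranks above version_2.
    return max(versions, key=lambda item: item[0])[1]


if __name__ == "__main__":
    print(latest_version_dir("output/log/LJSpeech"))
```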

Note

Conditional layer normalization is implemented in this repository, but adapting pre-trained models to new datasets is not yet implemented.
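
For readers unfamiliar with the idea, conditional layer normalization replaces the fixed scale and bias of layer normalization with vectors predicted from a speaker embedding by two linear layers. The following is a minimal sketch of that idea and not this repository's actual module; the class name, argument names, tensor shapes, and the initialization that makes it start as a plain layer norm are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalLayerNorm(nn.Module):
    """Layer norm whose scale and bias are predicted from a speaker embedding."""

    def __init__(self, hidden_dim, speaker_dim, eps=1e-5):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.eps = eps
        # Two linear layers map the speaker embedding to scale and bias vectors.
        self.to_scale = nn.Linear(speaker_dim, hidden_dim)
        self.to_bias = nn.Linear(speaker_dim, hidden_dim)
        # Assumed initialization: start as a plain layer norm (scale 1, bias 0).
        nn.init.zeros_(self.to_scale.weight)
        nn.init.ones_(self.to_scale.bias)
        nn.init.zeros_(self.to_bias.weight)
        nn.init.zeros_(self.to_bias.bias)

    def forward(self, x, speaker_emb):
        # x: (batch, time, hidden_dim); speaker_emb: (batch, speaker_dim)
        normed = F.layer_norm(x, (self.hidden_dim,), eps=self.eps)
        scale = self.to_scale(speaker_emb).unsqueeze(1)  # (batch, 1, hidden_dim)
        bias = self.to_bias(speaker_emb).unsqueeze(1)    # (batch, 1, hidden_dim)
        return scale * normed + bias


if __name__ == "__main__":
    cln = ConditionalLayerNorm(hidden_dim=256, speaker_dim=64)
    out = cln(torch.randn(2, 50, 256), torch.randn(2, 64))
    print(out.shape)
```

Because only the two small linear layers depend on the speaker, adaptation to a new voice can in principle be limited to these parameters and the speaker embedding, which is the motivation given in the AdaSpeech paper.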

References

AdaSpeech: Adaptive Text to Speech for Custom Voice, Mingjian Chen, et al.
ming024's FastSpeech2 implementation
