This task requires models to convert input text sequences (descriptions or narrations produced by a single speaker) into synthesized audio clips that share the timbre of the provided speaker. We adopt Mel Cepstral Distortion (MCD) and Mel Spectral Distortion (MSD) for objective evaluation.
| Attention type | Backbone models |
|---|---|
| Noncausal Self | FastSpeech 2, Transformer-TTS |
| Causal Self | Transformer-TTS |
| Causal Cross | Transformer-TTS |
We use the LJSpeech dataset, whose audio clips are sampled at 22,050 Hz. At this relatively high sample rate, the average sequence length of the processed audio clips is 559.
We use non-autoregressive FastSpeech 2 and autoregressive Transformer-TTS as backbone networks.
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
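The preprocessing below additionally needs the g2p_en package for the --use-g2p flag (an assumed extra dependency, inferred from the flag rather than stated in the original recipe):
pip install g2p_en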
Download LJSpeech, create splits and generate audio manifests with
AUDIO_DATA_ROOT=<path>
AUDIO_MANIFEST_ROOT=<path>
NUMEXPR_MAX_THREADS=20 python -m examples.speech_synthesis.preprocessing.get_ljspeech_audio_manifest \
--output-data-root ${AUDIO_DATA_ROOT} \
--output-manifest-root ${AUDIO_MANIFEST_ROOT}
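After this step, AUDIO_MANIFEST_ROOT should contain one audio manifest per split, e.g. train.audio.tsv, dev.audio.tsv and test.audio.tsv (names inferred from the training and evaluation commands below):
ls ${AUDIO_MANIFEST_ROOT}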
Because FastSpeech 2 requires duration prediction, we provide two tools for duration extraction:
If using g2pE to compute durations, download the precomputed g2pE durations and set TEXT_GRID_ZIP_PATH to the path of ljspeech_mfa.zip.
AUDIO_MANIFEST_ROOT=<path>
FEATURE_MANIFEST_ROOT=<path>
TEXT_GRID_ZIP_PATH=<path>
NUMEXPR_MAX_THREADS=20 python -m examples.speech_synthesis.preprocessing.get_feature_manifest \
--audio-manifest-root ${AUDIO_MANIFEST_ROOT} \
--output-root ${FEATURE_MANIFEST_ROOT} \
--ipa-vocab --use-g2p --add-fastspeech-targets \
--textgrid-zip ${TEXT_GRID_ZIP_PATH}
If using units to compute durations, download the precomputed unit durations and set ID_TO_UNIT_TSV to the path of ljspeech_hubert.tsv.
AUDIO_MANIFEST_ROOT=<path>
FEATURE_MANIFEST_ROOT=<path>
ID_TO_UNIT_TSV=<path>
NUMEXPR_MAX_THREADS=20 python -m examples.speech_synthesis.preprocessing.get_feature_manifest \
--audio-manifest-root ${AUDIO_MANIFEST_ROOT} \
--output-root ${FEATURE_MANIFEST_ROOT} \
--ipa-vocab --use-g2p --add-fastspeech-targets \
--id-to-units-tsv ${ID_TO_UNIT_TSV}
You can also generate the durations yourself with different tools or models: use the Montreal Forced Aligner to obtain g2pE durations, or HuBERT to obtain unit durations.
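As an illustration, here is a minimal sketch of the MFA route (an assumed workflow, not part of the original recipe; it presumes MFA 2.x and a corpus directory in which each LJSpeech .wav is paired with a .lab transcript):
# Assumed workflow; ${LJSPEECH_CORPUS_DIR} and ${TEXTGRID_OUTPUT_DIR} are placeholder paths.
mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa
mfa align ${LJSPEECH_CORPUS_DIR} english_us_arpa english_us_arpa ${TEXTGRID_OUTPUT_DIR}
# Zip the resulting TextGrids so they can be passed via --textgrid-zip.
cd ${TEXTGRID_OUTPUT_DIR} && zip -r ljspeech_mfa.zip .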
NOTE: In our paper, the FastSpeech 2 results are produced with g2pE durations.
If duration targets are not needed (e.g. when training only Transformer-TTS), the feature manifest can be generated without the FastSpeech-specific flags:
python -m examples.speech_synthesis.preprocessing.get_feature_manifest \
--audio-manifest-root ${AUDIO_MANIFEST_ROOT} \
--output-root ${FEATURE_MANIFEST_ROOT} \
--ipa-vocab --use-g2p
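Either way, FEATURE_MANIFEST_ROOT should now contain a config.yaml plus the train/dev/test feature manifests consumed by the training commands below (file roles inferred from those commands):
ls ${FEATURE_MANIFEST_ROOT}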
We use a single 80GB A100 GPU to train both the FastSpeech 2 and Transformer-TTS models.
fairseq-train ${FEATURE_MANIFEST_ROOT} --save-dir ${SAVE_DIR} \
--config-yaml config.yaml --train-subset train --valid-subset dev \
--num-workers 4 --max-sentences 6 --max-update 200000 \
--task text_to_speech --criterion tacotron2 --arch tts_transformer \
--clip-norm 5.0 --n-frames-per-step 4 --bce-pos-weight 5.0 \
--dropout 0.1 --attention-dropout 0.1 --activation-dropout 0.1 \
--encoder-normalize-before --decoder-normalize-before \
--optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--seed 1 --update-freq 8 --eval-inference --best-checkpoint-metric mcd_loss
where SAVE_DIR
is the checkpoint root path. We set --update-freq 8
to simulate 8 GPUs with 1 GPU. You may want to
update it accordingly when using more than 1 GPU.
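For example, to keep the effective batch size (--max-sentences × --update-freq × number of GPUs) unchanged:
--update-freq 4  # 2 GPUs
--update-freq 2  # 4 GPUs
--update-freq 1  # 8 GPUs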
fairseq-train ${FEATURE_MANIFEST_ROOT} --save-dir ${SAVE_DIR} \
--config-yaml config.yaml --train-subset train --valid-subset dev \
--num-workers 4 --max-sentences 6 --max-update 200000 \
--task text_to_speech --criterion fastspeech2 --arch fastspeech2 \
--clip-norm 5.0 --n-frames-per-step 1 \
--dropout 0.1 --attention-dropout 0.1 \
--optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--seed 1 --update-freq 8 --eval-inference --best-checkpoint-metric mcd_loss
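Optionally, either training run can be monitored with fairseq's standard TensorBoard flag (not part of the original recipe):
fairseq-train ... --tensorboard-logdir ${SAVE_DIR}/tb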
Average the last 5 checkpoints, then generate spectrograms and waveforms for the test split using the default Griffin-Lim vocoder:
SPLIT=test
CHECKPOINT_NAME=avg_last_5
CHECKPOINT_PATH=${SAVE_DIR}/checkpoint_${CHECKPOINT_NAME}.pt
python scripts/average_checkpoints.py --inputs ${SAVE_DIR} \
--num-epoch-checkpoints 5 \
--output ${CHECKPOINT_PATH}
NOTE: you can also just use the best checkpoint, which is the setting in our paper. In this case, set CHECKPOINT_PATH=${SAVE_DIR}/checkpoint_best.pt.
Then generate the waveforms under EVAL_OUTPUT_ROOT:
EVAL_OUTPUT_ROOT=$SAVE_DIR/avg
python -m examples.speech_synthesis.generate_waveform ${FEATURE_MANIFEST_ROOT} \
--config-yaml config.yaml --gen-subset ${SPLIT} --task text_to_speech \
--path ${CHECKPOINT_PATH} --max-tokens 50000 --spec-bwd-max-iter 32 \
--dump-waveforms --dump-target --results-path $EVAL_OUTPUT_ROOT
We only use the MCD/MSD metrics. You can also use other automatic metrics by following the guidance in the original fairseq documentation.
First generate the evaluation file:
python -m examples.speech_synthesis.evaluation.get_eval_manifest \
--generation-root ${EVAL_OUTPUT_ROOT} \
--audio-manifest ${AUDIO_MANIFEST_ROOT}/${SPLIT}.audio.tsv \
--output-path ${EVAL_OUTPUT_ROOT}/eval.tsv \
--vocoder griffin_lim --audio-format wav \
--use-resynthesized-target
python -m examples.speech_synthesis.evaluation.eval_sp \
${EVAL_OUTPUT_ROOT}/eval.tsv --mcd --msd
The numbers in the dist_per_syn_frm column are the final results.
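For reference, MCD is conventionally computed per frame from the reference mel-cepstral coefficients $c_{t,k}$ and the synthesized ones $\hat{c}_{t,k}$ (standard definition; the exact constants used by eval_sp may differ slightly):

$$\mathrm{MCD}_t = \frac{10}{\ln 10}\sqrt{2\sum_{k=1}^{K}\left(c_{t,k}-\hat{c}_{t,k}\right)^2}$$

As the column name suggests, dist_per_syn_frm reports this distortion normalized by the number of synthesized frames.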