In our recent paper, we propose Llama-VITS, which enhances TTS synthesis with semantic awareness extracted from a large-scale language model. This repository is the PyTorch implementation of Llama-VITS. Please visit our demo or demo GitHub for audio samples.
Model with Weights:
- Llama-VITS
- BERT-VITS
- ORI-VITS
Evaluation Metrics:
- ESMOS
- UTMOS
- MCD
- ASR (CER, WER)
Datasets:
- full LJSpeech
- 1-hour LJSpeech
- EmoV_DB_bea_sem
- Clone this repository
  ```
  git clone git@github.com:xincanfeng/vitsGPT.git
  ```
- Install requirements.
  ```
  cd vitsGPT
  pip install pdm
  pdm install
  ```
- You may need to install espeak first:
  ```
  sudo apt-get update
  sudo apt-get install espeak
  ```
- Download datasets
- Download and extract the LJSpeech dataset from its official page, then rename it or create soft links (using absolute paths) to your data for easier access:
  ```
  ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/DUMMY1
  ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/vits/DUMMY1
  ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/vits/ori_vits/DUMMY1
  ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/vits/emo_vits/DUMMY1
  ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/vits/sem_vits/DUMMY1
  ```
- Download and extract our EmoV_DB_bea_sem dataset from here, then rename it or create soft links to the dataset folder:
  ```
  ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/DUMMY5
  ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/vits/DUMMY5
  ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/vits/ori_vits/DUMMY5
  ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/vits/emo_vits/DUMMY5
  ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/vits/sem_vits/DUMMY5
  ```
- You can also download EmoV_DB and filter it yourself by referring to our preprocess_EmoV_DB_bea_filter.py. We also provide code for other necessary data preprocessing in the datasets folder.
Note that the `gt_test_wav` folder includes all test audios, which we have processed to the same sampling rate as the audios generated by the `{method}_VITS` methods. You can also process them yourself if you use other datasets.
- We do not provide the 1-hour LJSpeech dataset explicitly: once the full LJSpeech is downloaded, you can train on 1-hour LJSpeech by directly using its corresponding filtered filelist in our `filelists` folder, or you can randomly filter it yourself from the full LJSpeech (a sketch of one way to do this follows below).
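If you prefer to build the 1-hour subset yourself, here is a minimal sketch of one way to randomly filter a full LJSpeech filelist down to roughly one hour of audio; the `wav_path|text` line format is the usual VITS filelist convention, and the output filename is a hypothetical name, not a file shipped with this repository.

```python
# Minimal sketch (not an official script): randomly pick utterances from the
# full LJSpeech filelist until ~1 hour of audio is collected.
import random
import soundfile as sf

random.seed(0)
src_filelist = "filelists/ljs_audio_text_train_filelist.txt"      # full LJSpeech, "wav_path|text" per line
dst_filelist = "filelists/ljs_1h_audio_text_train_filelist.txt"   # hypothetical output name

with open(src_filelist, encoding="utf-8") as f:
    lines = f.readlines()
random.shuffle(lines)

kept, total_sec = [], 0.0
for line in lines:
    wav_path = line.split("|")[0]
    total_sec += sf.info(wav_path).duration   # audio length in seconds
    kept.append(line)
    if total_sec >= 3600:                     # stop once ~1 hour is collected
        break

with open(dst_filelist, "w", encoding="utf-8") as f:
    f.writelines(kept)
```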
- Download filelists, which contain the semantic embeddings extracted from Llama and various BERT models. The `vitsGPT/vits/filelists` folder contains the exact training information for every dataset, along with the corresponding semantic embeddings used in our experiments.
- Create more soft links to facilitate access to common configurations among the different `{method}_VITS` methods.
  - Create soft links to `vitsGPT/vits/configs` for each `{method}_VITS` method:
    ```
    ln -s vitsGPT/vits/configs vitsGPT/vits/ori_vits/
    ln -s vitsGPT/vits/configs vitsGPT/vits/emo_vits/
    ln -s vitsGPT/vits/configs vitsGPT/vits/sem_vits/
    ```
  - Create soft links to `filelists/` for each `{method}_VITS` method:
    ```
    ln -s vitsGPT/vits/filelists vitsGPT/vits/ori_vits/
    ln -s vitsGPT/vits/filelists vitsGPT/vits/emo_vits/
    ln -s vitsGPT/vits/filelists vitsGPT/vits/sem_vits/
    ```
- Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
Please refer to preprocess_own_data.sh for configurations on different datasets.
```
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace

# Preprocessing (g2p) for your own datasets.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt
```
Note that we have provided preprocessed phonemes for LJSpeech, 1-hour LJSpeech, and EmoV_DB_bea_sem in `filelists`, named `{dataset}_audio_text_{train/val/test/all}_filelist.txt.cleaned`.
Note that we have provided all extracted semantic embeddings from Llama and various BERT models in `filelists`, named `{dataset}_audio_{token}_{dimension}.pt`.
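To get a feel for what a provided embedding file contains, you can load it with torch and inspect it. The filename below and the assumption that the file stores a dict mapping utterance keys to tensors are illustrative only and may not match the actual layout:

```python
# Quick inspection of a provided .pt embedding file (layout assumed, not guaranteed).
import torch

emb = torch.load("filelists/ljs_audio_ave_4096.pt", map_location="cpu")  # hypothetical name
print(type(emb))
if isinstance(emb, dict):
    key, value = next(iter(emb.items()))
    print(key, tuple(value.shape))  # e.g. a 4096-d vector per utterance for a global token
```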
If you want to process your own data, we also provide the code to extract semantic embeddings from Llama or various BERT models, as described below.
- Use the Llama implementation in our repository, which includes code to extract the semantic embeddings from the final hidden layer. You can always refer to the Llama repository if you have further related questions.
- First, in the `vitsGPT/llama` directory, run:
  ```
  cd vitsGPT/llama
  pip install -e .
  ```
- Then, download the Llama weights and tokenizer from the Meta website and accept their License.
- Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download. (Prerequisites: make sure you have `wget` and `md5sum` installed. Then run the script: `./download.sh`.)
  - Make sure to grant execution permissions to the download.sh script.
  - During this process, you will be prompted to enter the URL from the email.
  - Do not use the "Copy Link" option; make sure to manually copy the link from the email.
  - Keep in mind that the links expire after 24 hours and a certain number of downloads. If you start seeing errors such as `403: Forbidden`, you can always re-request a link.
- Once the models you want have been downloaded, you can run the models locally. Below is one example command:
  ```
  torchrun --nproc_per_node 1 example_chat_completion.py \
      --ckpt_dir llama-2-7b-chat/ \
      --tokenizer_path tokenizer.model \
      --max_seq_len 512 --max_batch_size 6
  ```
- You can refer to inference.sh for more examples of how we run Llama inference. Use the inference_ave.sh, inference_last.sh, inference_pca.sh, inference_mat_phone.sh, inference_mat_text.sh, inference_sentence.sh, and inference_word.sh scripts to extract the corresponding semantic embeddings used in our paper.
  As you can see from the `inference_{token}.sh` scripts, `example_{llama-model}_{method}_{token}.py` in the `llama/examples/{dataset}_examples` folder tells Llama how to extract the different semantic embeddings, which input transcripts to read, and where to write the output. So remember to check the corresponding `example_{llama-model}_{method}_{token}.py` file and configure the variables `input_file`, `output_file`, and `audiopath` for the data you want to process (a general sketch of the idea follows below).
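For orientation, the sketch below shows the general idea of turning final-hidden-layer states into a global ("ave"-style) vector or a sequential ("mat"-style) matrix. It uses the Hugging Face transformers API rather than the Meta llama code bundled in this repository, and the model name, utterance key, and output filename are assumptions, so treat it as an illustration of the idea rather than the repository's actual extraction script.

```python
# Illustrative sketch only (not example_{llama-model}_{method}_{token}.py):
# extract final-hidden-layer states from a causal LM and pool them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; any causal LM works for the illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

text = "Printing, in the only sense with which we are at present concerned."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden = outputs.hidden_states[-1][0]   # (seq_len, hidden_dim): final-layer states
global_token = last_hidden.mean(dim=0)       # "ave"-style global token: one vector per sentence
sequential_token = last_hidden               # "mat"-style sequential token: one matrix per sentence

# Save under the utterance key; both the key and the filename are hypothetical.
torch.save({"LJ001-0001": global_token}, "ljs_audio_ave_4096_example.pt")
```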
You can configure get_embedding.sh to extract BERT embeddings. When configuring, don't forget to set the correct `filelist_dir` in the corresponding `get_embedding_{token}.py` files.
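For reference, a minimal sketch of BERT-style extraction is shown below. It is an illustration rather than the repository's get_embedding_{token}.py; the model name, the `wav_path|text` filelist format, and the output filename are assumptions.

```python
# Sketch: extract a [CLS]-based global embedding per transcript with BERT.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed BERT variant
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

embeddings = {}
with open("filelists/ljs_audio_text_test_filelist.txt", encoding="utf-8") as f:  # "wav_path|text"
    for line in f:
        wav_path, text = line.rstrip("\n").split("|")[:2]
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
        embeddings[wav_path] = hidden[0]                   # [CLS] vector used as a global token

torch.save(embeddings, "ljs_audio_bert_cls_768_example.pt")  # hypothetical output name
```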
You can train the VITS model w/ or w/o semantic tokens using the scripts below.
Note that we also provide some of our pretrained models.
```
cd ori_vits
python train.py -c configs/ljs_base.json -m ljs_base
```
Please refer to train.sh for specific configurations of different datasets.
```
cd emo_vits
python emo_train.py -c configs/ljs_sem_ave.json -m ljs_emo_add_ave
```
Please refer to emo_train.sh for specific configurations of different datasets and global tokens.
```
cd sem_vits
python sem_train.py -c configs/ljs_sem_mat_text.json -m ljs_sem_mat_text
```
Please refer to sem_train.sh for specific configurations of different datasets and sequential tokens.
(In case you are interested in the naming details, "mat" in the sequential tokens' file names means "matrix": compared to a global token, which is mathematically represented by a single vector, a sequential token is represented by a matrix for each sentence transcript.)
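To make the distinction concrete, here is a tiny sketch of the shapes involved; the 4096 dimension matches Llama-2-7B's hidden size, and the sequence length is arbitrary.

```python
# Global vs. sequential semantic tokens, shown only by shape (illustrative values).
import torch

global_token = torch.randn(4096)          # one vector per sentence transcript (e.g. "ave")
sequential_token = torch.randn(37, 4096)  # one row per text unit of the same sentence (e.g. "mat_text")
print(global_token.shape, sequential_token.shape)
```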
See inference.ipynb for an easy example of how to run inference on any text.
Configure the model weights (with or without extracted semantic tokens) in the files below, according to the specific model. Then you can run inference on the test data transcripts, which generates a folder named after the checkpoint, e.g., `G_100000`, containing a folder named `source_model_test_wav` that saves all the generated audios in the corresponding checkpoint directory. Specifically,
Use infer_test.ipynb for inference without semantic tokens on the test data transcripts.
Use emo_infer_test.ipynb for inference with global semantic tokens on the test data transcripts.
Use sem_infer_test.ipynb for inference with sequential semantic tokens on the test data transcripts.
Note that, in the `source_model_test_wav` folder, the saved audio samples are named in generation order rather than by their corresponding transcript keys, for convenience (a sketch of mapping them back is given below).
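If you want to relate the generation-order filenames back to transcript keys yourself, a rough sketch follows. The generated-audio directory, the 0-based naming, and the filelist format are assumptions; the actual eval_1_make_kaldi_style_files.py used in the evaluation section below does this renaming and also produces Kaldi-style scp files.

```python
# Sketch: map audios saved in generation order back to their transcript keys.
import os
import shutil

test_filelist = "filelists/ljs_audio_text_test_filelist.txt"       # "wav_path|text", in generation order
gen_dir = "logs/ljs_sem_mat_text/G_100000/source_model_test_wav"   # assumed location

with open(test_filelist, encoding="utf-8") as f:
    keys = [os.path.splitext(os.path.basename(line.split("|")[0]))[0] for line in f]

for i, key in enumerate(keys):
    src = os.path.join(gen_dir, f"{i}.wav")     # generation-order name (assumed scheme)
    dst = os.path.join(gen_dir, f"{key}.wav")   # transcript-key name
    if os.path.exists(src):
        shutil.copyfile(src, dst)
```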
- Clone and install ESPnet according to its repository.
- Copy and configure eval.sh into `espnet/egs2/libritts/tts1/eval.sh`.
- Install whisper for calculating ASR (CER, WER):
  ```
  pip install git+https://github.com/openai/whisper.git
  ```
- Use run_eval_ljs.sh and run_eval_emovdb.sh for evaluation on LJSpeech and EmoV_DB (or their subsets), respectively.
As you can learn from `run_eval_{dataset}.sh`, not only eval.sh is used: eval_1_make_kaldi_style_files.py and other scripts in eval_datasets are also used to process and evaluate the inferenced audio. Specifically,
1. Run `eval_1_make_kaldi_style_files.py` to rename the generated audio samples in the `source_model_test_wav` folder according to their transcript keys, and to generate the related scp files.
   ```
   python3 vits/eval_datasets/eval_{dataset}/eval_1_make_kaldi_style_files.py ${method} ${model} ${step}
   ```
2. Run `eval_2_unify_and_eval.sh` to downsample both the model-generated audios and the ground-truth audios so that they have the same sampling rate.
   ```
   . vits/eval_datasets/eval_{dataset}/eval_2_unify_and_eval.sh ${method} ${model} ${step}
   ```
3. Run `eval.sh` to evaluate MCD, ASR, and F0 using the ESPnet framework. (You can also run this step after step 4. A standalone sketch of the ASR scoring idea is given after this list.)
   ```
   CUDA_VISIBLE_DEVICES=0 . espnet/egs2/libritts/tts1/eval.sh ${method} ${model} ${step}
   ```
   Because this step may take some time, it is recommended to run it in the background:
   ```
   CUDA_VISIBLE_DEVICES=0 nohup espnet/egs2/libritts/tts1/eval.sh ${method} ${model} ${step} > eval.log 2>&1 &
   ```
4. Run `eval_3_mos.py` to evaluate UTMOS using the SpeechMOS framework.
   ```
   CUDA_VISIBLE_DEVICES=0 python3 vits/eval_datasets/eval_{dataset}/eval_3_mos.py ${method} ${model} ${step}
   ```
5. We made randomly paired examples to collect the ESMOS score using AMT. You can refer to human_evaluation to see how we prepared this evaluation.
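As a standalone illustration of the ASR-based CER/WER idea (this is not the eval.sh that runs inside the ESPnet recipe), the sketch below transcribes audios with whisper and scores them with jiwer, an extra dependency not listed by this repository; the paths and the filelist format are assumptions.

```python
# Sketch: whisper transcription + jiwer scoring for WER/CER (illustrative only).
import whisper
from jiwer import cer, wer

model = whisper.load_model("base")

refs, hyps = [], []
with open("filelists/ljs_audio_text_test_filelist.txt", encoding="utf-8") as f:  # "wav_path|text"
    for line in f:
        wav_path, text = line.rstrip("\n").split("|")[:2]
        result = model.transcribe(wav_path)   # point this at the generated wavs to score a model
        refs.append(text.lower())
        hyps.append(result["text"].strip().lower())

# For meaningful scores you would also normalize punctuation before scoring.
print("WER:", wer(refs, hyps))
print("CER:", cer(refs, hyps))
```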
If our work is useful to you, please cite our paper: "Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness". paper
@misc{feng2024llamavits,
title={Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness},
author={Xincan Feng and Akifumi Yoshimoto},
year={2024},
eprint={2404.06714},
archivePrefix={arXiv},
primaryClass={cs.CL}
}