In our recent paper, we propose Llama-VITS, which enhances TTS synthesis with semantic awareness extracted from a large-scale language model. This repository is the PyTorch implementation of Llama-VITS. Please visit our demo or demo GitHub for audio samples.
Model with Weights:
- Llama-VITS
- BERT-VITS
- ORI-VITS
Evaluation Metrics:
- ESMOS
- UTMOS
- MCD
- ASR (CER, WER)
Datasets:
- full LJSpeech
- 1-hour LJSpeech
- EmoV_DB_bea_sem
- Clone this repository
  ```
  git clone git@github.com:xincanfeng/vitsGPT.git
  ```
- Install requirements.
  ```
  cd vitsGPT
  pip install pdm
  pdm install
  ```
- You may need to install espeak first:
  ```
  sudo apt-get update
  sudo apt-get install espeak
  ```
- Download datasets
- Download and extract the LJSpeech dataset from its official page, then rename it or create soft links (using absolute paths) to your data for easier access:
  ```
  ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/DUMMY1
  ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/vits/DUMMY1
  ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/vits/ori_vits/DUMMY1
  ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/vits/emo_vits/DUMMY1
  ln -s /path/to/LJSpeech-1.1/wavs vitsGPT/vits/sem_vits/DUMMY1
  ```
- Download and extract our EmoV_DB_bea_sem dataset from here, then rename it or create soft links to the dataset folder:
  ```
  ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/DUMMY5
  ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/vits/DUMMY5
  ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/vits/ori_vits/DUMMY5
  ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/vits/emo_vits/DUMMY5
  ln -s /path/to/EmoV_DB_bea_sem/wavs_filtered vitsGPT/vits/sem_vits/DUMMY5
  ```
- You can also download EmoV_DB and filter it yourself by referring to our preprocess_EmoV_DB_bea_filter.py. We also provide code for other necessary data preprocessing in the datasets folder.
Note that the `gt_test_wav` folder includes all test audios, which we have processed to the same sampling rate as the audios generated by the `{method}_VITS` methods. You can also process them yourself if you use other datasets.
- We do not provide the 1-hour LJSpeech dataset explicitly: once the full LJSpeech is downloaded, you can train on 1-hour LJSpeech by directly using its corresponding filtered filelist in our `filelists` folder, or you can randomly filter it yourself from the full LJSpeech (a sketch of one way to do this follows below).
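If you prefer to build the 1-hour subset yourself, here is a minimal sketch of one way to randomly filter a full LJSpeech filelist down to roughly one hour of audio; the `wav_path|text` line format is the usual VITS filelist convention, and the output filename is a hypothetical name, not a file shipped with this repository.

```python
# Minimal sketch (not an official script): randomly pick utterances from the
# full LJSpeech filelist until ~1 hour of audio is collected.
import random
import soundfile as sf

random.seed(0)
src_filelist = "filelists/ljs_audio_text_train_filelist.txt"      # full LJSpeech, "wav_path|text" per line
dst_filelist = "filelists/ljs_1h_audio_text_train_filelist.txt"   # hypothetical output name

with open(src_filelist, encoding="utf-8") as f:
    lines = f.readlines()
random.shuffle(lines)

kept, total_sec = [], 0.0
for line in lines:
    wav_path = line.split("|")[0]
    total_sec += sf.info(wav_path).duration   # audio length in seconds
    kept.append(line)
    if total_sec >= 3600:                     # stop once ~1 hour is collected
        break

with open(dst_filelist, "w", encoding="utf-8") as f:
    f.writelines(kept)
```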
- Download filelists, which contain the semantic embeddings extracted from Llama and various BERT models. The `vitsGPT/vits/filelists` folder contains the exact training information for every dataset, along with the corresponding semantic embeddings used in our experiments.
- Create more soft links to facilitate access to common configurations among the different `{method}_VITS` methods.
  - Create soft links to `vitsGPT/vits/configs` for each `{method}_VITS` method:
    ```
    ln -s vitsGPT/vits/configs vitsGPT/vits/ori_vits/
    ln -s vitsGPT/vits/configs vitsGPT/vits/emo_vits/
    ln -s vitsGPT/vits/configs vitsGPT/vits/sem_vits/
    ```
  - Create soft links to `filelists/` for each `{method}_VITS` method:
    ```
    ln -s vitsGPT/vits/filelists vitsGPT/vits/ori_vits/
    ln -s vitsGPT/vits/filelists vitsGPT/vits/emo_vits/
    ln -s vitsGPT/vits/filelists vitsGPT/vits/sem_vits/
    ```
- Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
Please refer to preprocess_own_data.sh for configurations on different datasets.
```
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace

# Preprocessing (g2p) for your own datasets.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt
```
Note that we have provided preprocessed phonemes for LJSpeech, 1-hour LJSpeech, and EmoV_DB_bea_sem in `filelists`, named `{dataset}_audio_text_{train/val/test/all}_filelist.txt.cleaned`.
Note that we have provided all extracted semantic embeddings from Llama and various BERT models in `filelists`, named `{dataset}_audio_{token}_{dimension}.pt`.
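To get a feel for what a provided embedding file contains, you can load it with torch and inspect it. The filename below and the assumption that the file stores a dict mapping utterance keys to tensors are illustrative only and may not match the actual layout:

```python
# Quick inspection of a provided .pt embedding file (layout assumed, not guaranteed).
import torch

emb = torch.load("filelists/ljs_audio_ave_4096.pt", map_location="cpu")  # hypothetical name
print(type(emb))
if isinstance(emb, dict):
    key, value = next(iter(emb.items()))
    print(key, tuple(value.shape))  # e.g. a 4096-d vector per utterance for a global token
```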
If you want to process your own data, we also provide the code to extract semantic embeddings from Llama or various BERT models, as described below.
- Use the Llama implementation in our repository, which includes code to extract the semantic embeddings from the final hidden layer. You can always refer to the Llama repository if you have further related questions.
- First, in the `vitsGPT/llama` directory, run:
  ```
  cd vitsGPT/llama
  pip install -e .
  ```
- Then, download the Llama weights and tokenizer from the Meta website and accept their License.
- Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download. (Prerequisites: make sure you have `wget` and `md5sum` installed. Then run the script: `./download.sh`.)
  - Make sure to grant execution permissions to the download.sh script.
  - During this process, you will be prompted to enter the URL from the email.
  - Do not use the "Copy Link" option; make sure to manually copy the link from the email.
  - Keep in mind that the links expire after 24 hours and a certain number of downloads. If you start seeing errors such as `403: Forbidden`, you can always re-request a link.
- Once the models you want have been downloaded, you can run the models locally. Below is one example command:
  ```
  torchrun --nproc_per_node 1 example_chat_completion.py \
      --ckpt_dir llama-2-7b-chat/ \
      --tokenizer_path tokenizer.model \
      --max_seq_len 512 --max_batch_size 6
  ```
- You can refer to inference.sh for more examples of how we run Llama inference. Use the inference_ave.sh, inference_last.sh, inference_pca.sh, inference_mat_phone.sh, inference_mat_text.sh, inference_sentence.sh, and inference_word.sh scripts to extract the corresponding semantic embeddings used in our paper.
  As you can see from the `inference_{token}.sh` scripts, `example_{llama-model}_{method}_{token}.py` in the `llama/examples/{dataset}_examples` folder tells Llama how to extract the different semantic embeddings, which input transcripts to read, and where to write the output. So remember to check the corresponding `example_{llama-model}_{method}_{token}.py` file and configure the variables `input_file`, `output_file`, and `audiopath` for the data you want to process (a general sketch of the idea follows below).
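For orientation, the sketch below shows the general idea of turning final-hidden-layer states into a global ("ave"-style) vector or a sequential ("mat"-style) matrix. It uses the Hugging Face transformers API rather than the Meta llama code bundled in this repository, and the model name, utterance key, and output filename are assumptions, so treat it as an illustration of the idea rather than the repository's actual extraction script.

```python
# Illustrative sketch only (not example_{llama-model}_{method}_{token}.py):
# extract final-hidden-layer states from a causal LM and pool them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; any causal LM works for the illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

text = "Printing, in the only sense with which we are at present concerned."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden = outputs.hidden_states[-1][0]   # (seq_len, hidden_dim): final-layer states
global_token = last_hidden.mean(dim=0)       # "ave"-style global token: one vector per sentence
sequential_token = last_hidden               # "mat"-style sequential token: one matrix per sentence

# Save under the utterance key; both the key and the filename are hypothetical.
torch.save({"LJ001-0001": global_token}, "ljs_audio_ave_4096_example.pt")
```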
You can configure get_embedding.sh to extract BERT embeddings. When configuring, don't forget to set the correct `filelist_dir` in the corresponding `get_embedding_{token}.py` files.
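For reference, a minimal sketch of BERT-style extraction is shown below. It is an illustration rather than the repository's get_embedding_{token}.py; the model name, the `wav_path|text` filelist format, and the output filename are assumptions.

```python
# Sketch: extract a [CLS]-based global embedding per transcript with BERT.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed BERT variant
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

embeddings = {}
with open("filelists/ljs_audio_text_test_filelist.txt", encoding="utf-8") as f:  # "wav_path|text"
    for line in f:
        wav_path, text = line.rstrip("\n").split("|")[:2]
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
        embeddings[wav_path] = hidden[0]                   # [CLS] vector used as a global token

torch.save(embeddings, "ljs_audio_bert_cls_768_example.pt")  # hypothetical output name
```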
You can train the VITS model w/ or w/o semantic tokens using the scripts below.
Note that we also provide some of our pretrained models.
```
cd ori_vits
python train.py -c configs/ljs_base.json -m ljs_base
```
Please refer to train.sh for specific configurations of different datasets.
```
cd emo_vits
python emo_train.py -c configs/ljs_sem_ave.json -m ljs_emo_add_ave
```
Please refer to emo_train.sh for specific configurations of different datasets and global tokens.
```
cd sem_vits
python sem_train.py -c configs/ljs_sem_mat_text.json -m ljs_sem_mat_text
```
Please refer to sem_train.sh for specific configurations of different datasets and sequential tokens.
(In case you are interested in the naming details, "mat" in the sequential tokens' file names means "matrix": compared to a global token, which is mathematically represented by a single vector, a sequential token is represented by a matrix for each sentence transcript.)
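To make the distinction concrete, here is a tiny sketch of the shapes involved; the 4096 dimension matches Llama-2-7B's hidden size, and the sequence length is arbitrary.

```python
# Global vs. sequential semantic tokens, shown only by shape (illustrative values).
import torch

global_token = torch.randn(4096)          # one vector per sentence transcript (e.g. "ave")
sequential_token = torch.randn(37, 4096)  # one row per text unit of the same sentence (e.g. "mat_text")
print(global_token.shape, sequential_token.shape)
```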
See inference.ipynb for an easy example of how to run inference on any text.
Configure the model weights (with or without extracted semantic tokens) in the files below, according to the specific model. Then you can run inference on the test data transcripts, which generates a folder named after the checkpoint, e.g., `G_100000`, containing a folder named `source_model_test_wav` that saves all the generated audios in the corresponding checkpoint directory. Specifically,
Use infer_test.ipynb for inference without semantic tokens on the test data transcripts.
Use emo_infer_test.ipynb for inference with global semantic tokens on the test data transcripts.
Use sem_infer_test.ipynb for inference with sequential semantic tokens on the test data transcripts.
Note that, in the `source_model_test_wav` folder, the saved audio samples are named in generation order rather than by their corresponding transcript keys, for convenience (a sketch of mapping them back is given below).
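If you want to relate the generation-order filenames back to transcript keys yourself, a rough sketch follows. The generated-audio directory, the 0-based naming, and the filelist format are assumptions; the actual eval_1_make_kaldi_style_files.py used in the evaluation section below does this renaming and also produces Kaldi-style scp files.

```python
# Sketch: map audios saved in generation order back to their transcript keys.
import os
import shutil

test_filelist = "filelists/ljs_audio_text_test_filelist.txt"       # "wav_path|text", in generation order
gen_dir = "logs/ljs_sem_mat_text/G_100000/source_model_test_wav"   # assumed location

with open(test_filelist, encoding="utf-8") as f:
    keys = [os.path.splitext(os.path.basename(line.split("|")[0]))[0] for line in f]

for i, key in enumerate(keys):
    src = os.path.join(gen_dir, f"{i}.wav")     # generation-order name (assumed scheme)
    dst = os.path.join(gen_dir, f"{key}.wav")   # transcript-key name
    if os.path.exists(src):
        shutil.copyfile(src, dst)
```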
- Clone and install ESPnet according to its repository.
- Copy and configure eval.sh into `espnet/egs2/libritts/tts1/eval.sh`.
- Install whisper for calculating ASR (CER, WER):
  ```
  pip install git+https://github.com/openai/whisper.git
  ```
- Use run_eval_ljs.sh and run_eval_emovdb.sh for evaluation on LJSpeech and EmoV_DB (or their subsets), respectively.
As you can learn from `run_eval_{dataset}.sh`, not only eval.sh is used: eval_1_make_kaldi_style_files.py and other scripts in eval_datasets are also used to process and evaluate the inferenced audio. Specifically,
1. Run `eval_1_make_kaldi_style_files.py` to rename the generated audio samples in the `source_model_test_wav` folder according to their transcript keys, and to generate the related scp files.
   ```
   python3 vits/eval_datasets/eval_{dataset}/eval_1_make_kaldi_style_files.py ${method} ${model} ${step}
   ```
2. Run `eval_2_unify_and_eval.sh` to downsample both the model-generated audios and the ground-truth audios so that they have the same sampling rate.
   ```
   . vits/eval_datasets/eval_{dataset}/eval_2_unify_and_eval.sh ${method} ${model} ${step}
   ```
3. Run `eval.sh` to evaluate MCD, ASR, and F0 using the ESPnet framework. (You can also run this step after step 4. A standalone sketch of the ASR scoring idea is given after this list.)
   ```
   CUDA_VISIBLE_DEVICES=0 . espnet/egs2/libritts/tts1/eval.sh ${method} ${model} ${step}
   ```
   Because this step may take some time, it is recommended to run it in the background:
   ```
   CUDA_VISIBLE_DEVICES=0 nohup espnet/egs2/libritts/tts1/eval.sh ${method} ${model} ${step} > eval.log 2>&1 &
   ```
4. Run `eval_3_mos.py` to evaluate UTMOS using the SpeechMOS framework.
   ```
   CUDA_VISIBLE_DEVICES=0 python3 vits/eval_datasets/eval_{dataset}/eval_3_mos.py ${method} ${model} ${step}
   ```
5. We made randomly paired examples to collect the ESMOS score using AMT. You can refer to human_evaluation to see how we prepared this evaluation.
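As a standalone illustration of the ASR-based CER/WER idea (this is not the eval.sh that runs inside the ESPnet recipe), the sketch below transcribes audios with whisper and scores them with jiwer, an extra dependency not listed by this repository; the paths and the filelist format are assumptions.

```python
# Sketch: whisper transcription + jiwer scoring for WER/CER (illustrative only).
import whisper
from jiwer import cer, wer

model = whisper.load_model("base")

refs, hyps = [], []
with open("filelists/ljs_audio_text_test_filelist.txt", encoding="utf-8") as f:  # "wav_path|text"
    for line in f:
        wav_path, text = line.rstrip("\n").split("|")[:2]
        result = model.transcribe(wav_path)   # point this at the generated wavs to score a model
        refs.append(text.lower())
        hyps.append(result["text"].strip().lower())

# For meaningful scores you would also normalize punctuation before scoring.
print("WER:", wer(refs, hyps))
print("CER:", cer(refs, hyps))
```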
If our work is useful to you, please cite our paper: "Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness". paper
@misc{feng2024llamavits,
title={Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness},
author={Xincan Feng and Akifumi Yoshimoto},
year={2024},
eprint={2404.06714},
archivePrefix={arXiv},
primaryClass={cs.CL}
}