VQMIVC Forked Repo, Adapted to Other Datasets. CHANGES HAVE NOT BEEN COMMITTED YET; DO NOT USE, AS IT WON'T WORK.
VQMIVC performs one-shot/any-to-any voice conversion, i.e., conversion across arbitrary speakers with only a single target-speaker utterance for reference. Vector quantization with contrastive predictive coding (VQCPC) is used for content encoding, and mutual information (MI) is introduced as the correlation metric during training to achieve proper disentanglement of content, speaker, and pitch representations by reducing their inter-dependencies in an unsupervised manner.
Python 3.6 is used. Optionally install apex to speed up training; the other requirements are listed in 'requirements.txt':
pip install -r requirements.txt
Don't install Parallel WaveGAN from its GitHub repository; instead:
pip install parallel_wavegan
Download the checkpoints from the VQMIVC pre-trained models.
Then, run a conversion:
python convert_example.py -s {source-wav} -r {reference-wav} -c {converted-wavs-save-path} -m {model-path}
For example:
python convert.py
The converted wavs are saved in the 'converted' directory.
- Step1. Data preparation & preprocessing.
- Put the dataset under directory: 'Dataset/'
- Training/testing speakers split & feature (mel+lf0) extraction:
Here, a new script, pre.py, was added to replace preprocess.py. Because of the dataset size, the NumPy arrays could not all be loaded into RAM, so lines 141 to 145, which compute the mean and std of the mel spectrograms used to normalize the data, were modified to work with only a portion of the wavs. The way wavs are globbed from the dataset was also changed; you may still need to adapt the glob logic to your own dataset.
python pre.py
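The subset-based mean/std computation can be sketched as follows. This is a memory-friendly illustration using Welford's online algorithm over only the first `max_utts` utterances; the function name and the exact strategy in pre.py are assumptions, not the repo's actual code:

```python
def mel_stats(mel_frames, max_utts=500):
    """Estimate corpus-level mean/std of mel values from a subset of
    utterances, without holding every array in RAM at once.

    `mel_frames` is an iterable of per-utterance value lists.
    Welford's online algorithm keeps only running statistics.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for mel in mel_frames[:max_utts]:  # only a portion of the wavs
        for x in mel:
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)  # running sum of squared deviations
    std = (m2 / count) ** 0.5 if count else 0.0
    return mean, std
```

The returned statistics are then used to normalize every mel spectrogram, so a representative subset is usually good enough.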
Step2. model training:
python train.py use_CSMI=True use_CPMI=True use_PSMI=True
Training was adapted to fine-tune from the VCTK checkpoint, so download the checkpoint from the original paper and change the checkpoint path in config/convert.yaml. Full paths are also used during training, so you will need to change the paths in config/train.yaml as well.
- Two problems were found while trying to use convert.py, so the following changes were made:
- Line 70 of the original code, @hydra.main(config_path="config/convert.yaml"), was changed to @hydra.main(config_path="config", config_name='convert').
- One of the packages makes the code lose track of its own path, so full paths are used instead of relative paths.
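The path problem can be worked around by resolving every configured path to an absolute one before it is used. A minimal sketch, assuming the current working directory is still correct at startup (the helper name is illustrative, not from the repo):

```python
import os

def to_abs(path, base=None):
    """Resolve `path` against `base` (default: the startup working
    directory), so later changes of directory cannot break relative
    references to data, configs, or checkpoints."""
    base = base or os.getcwd()
    if os.path.isabs(path):
        return path
    return os.path.normpath(os.path.join(base, path))
```

Calling this once on each path read from the YAML configs has the same effect as hand-editing them to full paths.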
The vocoder used was HiFi-GAN, with adjustments and a pt_br checkpoint, from the following repository: https://github.com/freds0/hifi-gan. The script used to call the vocoder was inference_e2e.py. Changes made to the inference() function in inference_e2e.py:
- Changed os.listdir to glob
- Added a permute(1, 0) and unsqueeze(0) to match the model's expected input shape.
- Used the string .split() method instead of os.path.splitext.
If this code is used in your research, please star the repo and cite the paper:
@inproceedings{wang21n_interspeech,
author={Disong Wang and Liqun Deng and Yu Ting Yeung and Xiao Chen and Xunying Liu and Helen Meng},
title={{VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={1344--1348},
doi={10.21437/Interspeech.2021-283}
}