This repository provides an official PyTorch implementation for StarGAN-VC.
StarGAN-VC is a nonparallel many-to-many voice conversion (VC) method using star generative adversarial networks (StarGAN). The current version performs VC by first modifying the mel-spectrogram of input speech of an arbitrary speaker in accordance with a target speaker index, and then generating a waveform using a speaker-independent neural vocoder (HiFi-GAN or Parallel WaveGAN) from the modified mel-spectrogram.
Audio samples are available here.
-
Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in Proc. 2018 IEEE Workshop on Spoken Language Technology (SLT 2018), pp. 266-273, Dec. 2018. [Paper]
-
Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "Nonparallel Voice Conversion With Augmented Classifier Star Generative Adversarial Networks" IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2982-2995, 2020. [Paper]
- See
requirements.txt
.
- Setup your training and test sets. The data structure should look like:
/path/to/dataset/training
├── spk_1
│ ├── utt1.wav
│ ...
├── spk_2
│ ├── utt1.wav
│ ...
└── spk_N
├── utt1.wav
...
/path/to/dataset/test
├── spk_1
│ ├── utt1.wav
│ ...
├── spk_2
│ ├── utt1.wav
│ ...
└── spk_N
├── utt1.wav
...
- Place a copy of the directory
parallel_wavegan
from https://github.com/kan-bayashi/ParallelWaveGAN inpwg/
. - HiFi-GAN models trained on several databases can be found here. Once these are downloaded, place them in
pwg/egs/
. Please contact me if you have any problems downloading. - Optionally, Parallel WaveGAN can be used instead for waveform generation. The trained models are available here. Once these are downloaded, place them in
pwg/egs/
.
To run all stages for model training, execute:
./recipes/run_train.sh [-g gpu] [-a arch_type] [-l loss_type] [-s stage] [-e exp_name]
-
Options:
-g: GPU device (default: -1) # -1 indicates CPU -a: Generator architecture type ("conv" or "rnn") # conv: 1D fully convolutional network (default) # rnn: Bidirectional long short-term memory network -l: Loss type ("cgan", "wgan", or "lsgan") # cgan: Cross-entropy GAN # wgan: Wasserstein GAN with the gradient penalty loss (default) # lsgan: Least squares GAN -s: Stage to start (0 or 1) # Stages 0 and 1 correspond to feature extraction and model training, respectively. -e: Experiment name (default: "conv_wgan_exp1") # This name will be used at test time to specify which trained model to load.
-
Examples:
# To run the training from scratch with the default settings: ./recipes/run_train.sh # To skip the feature extraction stage: ./recipes/run_train.sh -s 1 # To set the gpu device to, say, 0: ./recipes/run_train.sh -g 0 # To use a generator with a recurrent architecture: ./recipes/run_train.sh -a rnn -e rnn_wgan_exp1 # To use the cross-entropy adversarial loss: ./recipes/run_train.sh -l cgan -e conv_cgan_exp1 # To use the least-squares adversarial loss: ./recipes/run_train.sh -l lsgan -e conv_lsgan_exp1
See other scripts in recipes
for examples of training on different datasets.
To monitor the training process, use tensorboard:
tensorboard [--logdir log_path]
To perform conversion, execute:
./recipes/run_test.sh [-g gpu] [-e exp_name] [-c checkpoint] [-v vocoder_type]
-
Options:
-g: GPU device (default: -1) # -1 indicates CPU -e: Experiment name (e.g., "conv_wgan_exp1") -c: Model checkpoint to load (default: 0) # 0 indicates the newest model -v: Vocoder type ("hfg" or "pwg") # hfg: HifiGAN (default) # pwg: Parallel WaveGAN
-
Examples:
# To perform conversion with the default settings: ./recipes/run_test.sh -g 0 -e conv_wgan_exp1 # To use Parallel WaveGAN as an alternative for waveform generation: ./recipes/run_test.sh -g 0 -e conv_wgan_exp1 -v pwg
If you find this work useful for your research, please cite our papers.
@INPROCEEDINGS{Kameoka2018SLT_StarGAN-VC,
author={Hirokazu Kameoka and Takuhiro Kaneko and Kou Tanaka and Nobukatsu Hojo},
booktitle={Proc. 2018 IEEE Spoken Language Technology Workshop (SLT)},
title={StarGAN-VC: Non-parallel Many-to-Many Voice Conversion Using Star Generative Adversarial Networks},
year={2018},
pages={266--273}}
@Article{Kameoka2020IEEETrans_StarGAN-VC,
author={Hirokazu Kameoka and Takuhiro Kaneko and Kou Tanaka and Nobukatsu Hojo},
title={Nonparallel Voice Conversion With Augmented Classifier Star Generative Adversarial Networks},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
volume={28},
pages={2982--2995},
year={2020}}
Hirokazu Kameoka (@kamepong)
E-mail: kame.hirokazu@gmail.com