Parallel-data voice conversion based on the pix2pix architecture.
Non-conditional GAN system (neither the generator nor the discriminator is conditioned) based on the pix2pix architecture. The aim is to reconstruct the speech of a source speaker in the voice of a target speaker. The models are not conditioned on the source spectrogram because the audio pairs are misaligned in a non-linear way (for example, the source and target speakers speak at different speeds), so a meaningful pixel-to-pixel mapping between them cannot be learned.
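To make the distinction concrete, here is a minimal PyTorch sketch of the unconditional setup; the stand-in networks below are placeholders for illustration, not the repo's actual models:

```python
import torch
import torch.nn as nn

# Stand-in networks for illustration only; the real models are
# pix2pix-style (U-Net generator, PatchGAN discriminator).
G = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # generator stand-in
D = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # discriminator stand-in

source = torch.randn(4, 1, 256, 256)   # source-speaker Mel spectrograms
fake = G(source)                        # same speech, target-speaker voice

# A conditional pix2pix discriminator would score the (source, candidate)
# pair, which presumes pixel-level alignment between the two spectrograms:
#   score = D_pair(torch.cat([source, fake], dim=1))   # 2-channel input
# Because the pairs are misaligned, D here only judges whether the
# candidate looks like a real target-speaker spectrogram:
score = D(fake)
```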
We trained and tested the system on the Voice Conversion Challenge 2018 data. For a (source, target) pair of audio samples (two different speakers uttering the same speech) we compute their Mel spectrograms, so that each one becomes a single-channel 256x256 image. These images are the inputs to both the generator and the discriminator.
*Example pair: Source | Target (spectrogram images with audio links).*
Note how the data is misaligned: the speakers have different cadences, and sometimes there is a pause in one of the samples but not in the other. Click on an image to download the audio.
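Roughly, the preprocessing looks like the sketch below; the sampling rate, Mel parameters, and normalization here are assumptions for illustration, not necessarily the repo's exact values:

```python
import librosa
import numpy as np

def mel_image(path, sr=22050, n_mels=256, n_frames=256):
    """Turn an audio file into a single-channel 256x256 log-Mel image."""
    y, _ = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    S = librosa.power_to_db(S, ref=np.max)   # log-Mel magnitudes, in dB
    # Pad or crop the time axis to a fixed 256 frames.
    if S.shape[1] < n_frames:
        S = np.pad(S, ((0, 0), (0, n_frames - S.shape[1])),
                   constant_values=S.min())
    S = S[:, :n_frames]
    # Scale to [-1, 1], the usual input range for pix2pix-style models.
    S = 2 * (S - S.min()) / (S.max() - S.min()) - 1
    return S[np.newaxis]                     # shape (1, 256, 256)
```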
The architecture and training hyperparameters are the same as in the original paper, except that we replaced the batch normalization layers with instance normalization layers in both the generator and the discriminator, as suggested here. We also use mean squared error as the adversarial loss (a least-squares GAN), as suggested here.
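A minimal PyTorch sketch of those two changes follows; the layer sizes and names are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn

def down_block(in_ch, out_ch):
    # Encoder block: Conv -> InstanceNorm -> LeakyReLU.
    # pix2pix uses nn.BatchNorm2d here; we swap in instance normalization.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.LeakyReLU(0.2),
    )

# Least-squares adversarial loss (MSE) in place of binary cross-entropy:
adv_loss = nn.MSELoss()
d_fake = down_block(1, 1)(torch.randn(4, 1, 256, 256))  # toy patch scores
g_loss = adv_loss(d_fake, torch.ones_like(d_fake))      # generator target: "real"
```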
*Results (three examples): Source | Target | Fake spectrograms.*