This is a PyTorch implementation of speaker embeddings trained with GE2E loss. The original paper on GE2E loss can be found here: Generalized End-to-End Loss for Speaker Verification
import torch
import torchaudio
wav2mel = torch.jit.load("wav2mel.pt")
dvector = torch.jit.load("dvector.pt").eval()
wav_tensor, sample_rate = torchaudio.load("example.wav")
mel_tensor = wav2mel(wav_tensor, sample_rate) # shape: (frames, mel_dim)
emb_tensor = dvector.embed_utterance(mel_tensor) # shape: (emb_dim)
You can also embed multiple utterances of a speaker at once:
emb_tensor = dvector.embed_utterances([mel_tensor_1, mel_tensor_2]) # shape: (emb_dim)
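Once you have two embeddings, a common way to compare them is cosine similarity. The sketch below assumes that scoring method (this README does not state which one the repository's scripts use) and works on plain Python lists, for example the result of `emb_tensor.tolist()`. The function name `cosine_similarity` is hypothetical and not part of this repository.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

A higher score suggests the two utterances are more likely to come from the same speaker; where to draw the line depends on the model and data.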
There are 2 modules in this example:
- wav2mel.pt is the preprocessing module, which is composed of 2 modules:
  - sox_effects.pt is used to normalize volume, remove silence, resample audio to 16 kHz, 16 bits, and remix all channels to a single channel
  - log_melspectrogram.pt is used to transform waveforms to log mel spectrograms
- dvector.pt is the speaker encoder
Since all the modules are compiled with TorchScript, you can simply load them and use them anywhere without depending on the source code in this repository.
You can download them from the Releases page.
You can evaluate the performance of the model with equal error rate.
For example, download the official test splits (veri_test.txt and veri_test2.txt) from The VoxCeleb1 Dataset and run the following command:
python equal_error_rate.py VoxCeleb1/test VoxCeleb1/test/veri_test.txt -w wav2mel.pt -c dvector.pt
So far, the released checkpoint has only been trained on VoxCeleb1 without any data augmentation. Its performance on the official test splits of VoxCeleb1 is as follows:
Test Split | Equal Error Rate | Threshold |
---|---|---|
veri_test.txt | 12.0% | 0.222 |
veri_test2.txt | 11.9% | 0.223 |
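To make the two columns concrete: the equal error rate is the operating point where the false acceptance rate equals the false rejection rate, and the threshold is the score at which that happens. The following is a minimal sketch of that definition using a simple threshold sweep. It is not the code in equal_error_rate.py; the function name and the toy trials are made up for illustration.

```python
from typing import Sequence, Tuple


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (eer, threshold) for trial scores; label 1 = same speaker, 0 = different."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative trial")

    best_gap, best_eer, best_threshold = float("inf"), 1.0, 0.0
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score reaches the threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2, threshold
    return best_eer, best_threshold


# Toy trials, not real results.
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
eer, threshold = equal_error_rate(scores, labels)
```

With only a handful of trials the two error rates may never match exactly, so the sketch reports their average at the closest point. Implementations that interpolate the ROC curve can give slightly different numbers.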
To use the script provided here, you have to organize your raw data in this way:
- all utterances from a speaker should be put under a directory (speaker directory)
- all speaker directories should be put under a directory (root directory)
- speaker directory can have subdirectories and utterances can be placed under subdirectories
You can also extract utterances from multiple root directories at once, e.g.
python preprocess.py VoxCeleb1/dev LibriSpeech/train-clean-360 -o preprocessed
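The directory layout described above can be walked with a few lines of standard-library Python. This is only a sketch of the layout, not the logic in preprocess.py: the function name `collect_utterances` and the accepted file extensions are assumptions, and speakers with the same directory name in different roots are merged here.

```python
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Sequence


def collect_utterances(
    root_dirs: Sequence[str], extensions: Sequence[str] = (".wav", ".flac")
) -> Dict[str, List[Path]]:
    """Map each speaker directory name to all audio files found beneath it."""
    speakers: Dict[str, List[Path]] = defaultdict(list)
    for root in map(Path, root_dirs):
        for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
            # rglob descends into subdirectories of the speaker directory.
            for path in sorted(speaker_dir.rglob("*")):
                if path.is_file() and path.suffix.lower() in extensions:
                    speakers[speaker_dir.name].append(path)
    return dict(speakers)


# Build a throwaway tree with two speakers to show the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "root"
    (root / "spk1" / "session1").mkdir(parents=True)
    (root / "spk2").mkdir(parents=True)
    (root / "spk1" / "a.wav").touch()
    (root / "spk1" / "session1" / "b.wav").touch()
    (root / "spk2" / "c.flac").touch()
    (root / "spk2" / "notes.txt").touch()  # ignored: not an audio extension
    found = collect_utterances([str(root)])
    counts = {speaker: len(paths) for speaker, paths in found.items()}
```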
If you need to modify some audio preprocessing hyperparameters, edit data/wav2mel.py directly.
After preprocessing, 3 preprocessing modules will be saved in the output directory:
- wav2mel.pt
- sox_effects.pt
- log_melspectrogram.pt
The first module, wav2mel.pt, is composed of the second and third modules. These modules were compiled with TorchScript and can be used anywhere to preprocess audio data.
You have to specify where to store checkpoints and logs, e.g.
python train.py preprocessed <model_dir>
During training, logs will be put under <model_dir>/logs and checkpoints will be placed under <model_dir>/checkpoints.
For more details, check the usage with python train.py -h.
By default I'm using a 3-layer LSTM with attentive pooling as the speaker encoder, but you can use speaker encoders with different architectures.
For more information, please take a look at modules/dvector.py.
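For readers unfamiliar with attentive pooling, the idea is to score every frame output, turn the scores into weights with a softmax over time, and take the weighted sum of the frames. The sketch below shows one generic formulation on plain Python lists so that it runs without PyTorch. It is an illustration only: the function name, the single weight vector, and the omission of a bias term and any final normalization are assumptions, and the actual implementation lives in modules/dvector.py.

```python
import math
from typing import List, Sequence


def attentive_pooling(
    frames: Sequence[Sequence[float]], attn_weights: Sequence[float]
) -> List[float]:
    """Pool per-frame vectors into one vector using softmax attention over frames."""
    # One scalar score per frame: dot product with a (normally learned) weight vector.
    scores = [sum(w * x for w, x in zip(attn_weights, frame)) for frame in frames]
    # Softmax over the time axis, shifted by the max for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the frame vectors, one output value per feature dimension.
    dim = len(frames[0])
    return [sum(a * frame[d] for a, frame in zip(alphas, frames)) for d in range(dim)]


# With a zero weight vector every frame gets equal attention,
# so the pooling reduces to a plain mean over frames.
pooled = attentive_pooling([[1.0, 2.0], [3.0, 4.0]], attn_weights=[0.0, 0.0])
```

In a trained model the weight vector is learned, so frames that carry more speaker information can receive larger weights than silence or noise.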
You can visualize speaker embeddings using a trained d-vector model. Note that you have to structure the speaker directories in the same way as for preprocessing, e.g.
python visualize.py LibriSpeech/dev-clean -w wav2mel.pt -c dvector.pt -o tsne.jpg
The following plot shows the t-SNE dimensionality reduction of the embeddings of some utterances from LibriSpeech.