An open-source speaker embedding extractor for various tasks.
2015, Heigold et al., Google, End-to-End Text-Dependent Speaker Verification
Application: Text Dependent
Feature: 40 dim Filterbank
Neural Net Architecture: Maxout-DNN(4-layer), embedding layer before softmax
Loss Function: Cross entropy Loss
Normalization: L2 Norm
Dataset Size: 646 speakers
Baseline: i-vector
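The entry above trains frame-level embeddings with cross entropy and scores them after L2 normalization. A minimal numpy sketch of that scoring step (utterance averaging, L2 norm, cosine score); the function names are illustrative, not from the paper:

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # L2 Norm as listed above: scale each embedding to unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def utterance_dvector(frame_embeddings):
    # Average frame-level embeddings (taken before softmax), then L2-normalize.
    # frame_embeddings: (num_frames, dim)
    return l2_normalize(frame_embeddings.mean(axis=0))

def verification_score(enroll_dvectors, test_dvector):
    # Enrollment model = mean of per-utterance d-vectors; score = cosine similarity.
    model = l2_normalize(np.mean(enroll_dvectors, axis=0))
    return float(np.dot(model, test_dvector))
```

At evaluation time the cosine score is compared against a threshold to accept or reject the trial.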
2018, Wan et al., Google, Generalized End-to-End Loss for Speaker Verification
Application: Text Dependent ("OK Google") and Text Independent
Feature: 40 dim Filterbank
Neural Net Architecture: 3 layer LSTM
Loss Function: GE2E (Generalized end-to-end loss)
Normalization: L2 Norm
Window Size: 1.6 second overlap 50%, element wise average
Dataset Size: 1000 speakers, avg. 6.3 enrollment and 7.2 evaluation utterances per speaker
Baseline: TE2E (Tuple based end-to-end loss)
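The GE2E loss scores every utterance embedding against every speaker centroid, excluding the utterance itself from its own centroid. A minimal numpy sketch of the softmax variant, assuming a batch of N speakers with M utterances each and learnable scale/offset fixed here at w=10, b=-5 for illustration:

```python
import numpy as np

def ge2e_loss(embeddings, w=10.0, b=-5.0):
    # embeddings: (N_speakers, M_utts, dim)
    N, M, _ = embeddings.shape
    centroids = embeddings.mean(axis=1)  # (N, dim)
    loss = 0.0
    for j in range(N):
        for i in range(M):
            e = embeddings[j, i]
            sims = np.empty(N)
            for k in range(N):
                if k == j:
                    # Exclude e_ji from its own centroid to avoid a trivial solution.
                    c = (embeddings[j].sum(axis=0) - e) / (M - 1)
                else:
                    c = centroids[k]
                cos = np.dot(e, c) / (np.linalg.norm(e) * np.linalg.norm(c) + 1e-8)
                sims[k] = w * cos + b
            # Softmax variant: pull e_ji toward its own centroid, push from all others.
            loss += -sims[j] + np.log(np.sum(np.exp(sims)))
    return loss / (N * M)
```

For well-separated speaker clusters the loss approaches zero; if all speakers collapse to the same embedding it approaches log N.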
2017, Li et al., Baidu, Deep Speaker: An End-to-End Neural Speaker Embedding System
Application: Text Dependent and Independent
Feature: 64 dim Filterbank
Neural Net Architecture: ResNet CNN and GRU
Pre-training: Yes (softmax pre-training)
Loss Function: Triplet loss with cosine distance metric
Feature Normalization: Zero mean unit variance
Dataset: Mandarin and English (not public)
Dataset Size: 250,000 speakers
Baseline: DNN i-vector system
GPUs: 16 K40 GPUs
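The triplet loss with a cosine distance metric listed above can be sketched in a few lines of numpy; the margin value here is illustrative, not taken from the paper:

```python
import numpy as np

def cosine_triplet_loss(anchor, positive, negative, margin=0.1):
    # Triplet loss on cosine similarity: require the anchor to be more similar
    # to the positive (same speaker) than to the negative (different speaker)
    # by at least `margin`.
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return max(0.0, margin + cos(anchor, negative) - cos(anchor, positive))
```

In training, triplets are mined from a batch; the softmax pre-training listed above gives the network a good initialization before switching to this loss.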
2017, Snyder et al., JHU, Deep Neural Network Embeddings for Text-Independent Speaker Verification
Feature: 20 dim MFCC
Neural Net Architecture: n-layer LSTM
Loss Function: Cross entropy Loss
Eval Set: SRE2018 Dev Set
Dataset: 1 TB held out from SRE2018
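Several entries above train the embedding network with cross entropy over speaker labels. A minimal numpy sketch of that loss for a single example, with the usual max-subtraction for numerical stability:

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    # Cross entropy over speaker classes: -log softmax(logits)[label].
    z = logits - logits.max()          # stabilize before exponentiating
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]
```

With uniform logits over N speakers the loss is exactly log N, which is a useful sanity check that a model has started learning.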
Dependencies:
kaldi_io
numpy
PyTorch