
Any recommendations on implementing TDNN using voxceleb_trainer? #87

Closed
ShaneRun opened this issue Dec 5, 2020 · 7 comments

ShaneRun commented Dec 5, 2020

@joonson
Thank you so much for your contribution to this open-source work; it has helped me a lot.
As is well known, TDNN is commonly implemented in Kaldi.
Do you think it would be doable to implement it in PyTorch based on this trainer?
If so, could you give me some recommendations on how to work it out?
Many thanks in advance.

joonson (Collaborator) commented Dec 7, 2020

TDNNs can be represented as 1-d convolutions with dilation.
Here is my implementation of x-vectors. I have not tested it, but you can try it out: XV.py.zip

You could also try a more complex TDNN architecture at #86
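For illustration, the "TDNN as dilated 1-d convolution" idea can be sketched roughly as below. This is a minimal, untested sketch in PyTorch, not the attached XV.py; the layer contexts follow the commonly cited x-vector recipe, and all class and variable names here are made up:

```python
import torch
import torch.nn as nn

class TDNNSketch(nn.Module):
    """Each TDNN layer is a Conv1d whose temporal context is set via
    kernel_size and dilation; e.g. a context of {t-2, t, t+2} is
    kernel_size=3 with dilation=2."""

    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(n_mels, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Statistics pooling: concatenate mean and std over the time axis,
        # then project down to a fixed-size utterance embedding.
        self.embedding = nn.Linear(2 * 1500, embed_dim)

    def forward(self, x):
        # x: (batch, n_mels, frames)
        h = self.frame_layers(x)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.embedding(stats)

model = TDNNSketch()
emb = model(torch.randn(2, 40, 200))
print(emb.shape)  # torch.Size([2, 512])
```

The dilations widen the receptive field over time without extra parameters, which is exactly how the frame-level layers of an x-vector system accumulate temporal context.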

ShaneRun (Author) commented Dec 8, 2020

@joonson
Thank you for your implementation!
I will test it out tomorrow and let you know the result.

Shane-pe commented Dec 9, 2020

@joonson
Is it possible to add i-vector to your trainer?
In that case the trainer would be more powerful and more general: we could call it all-in-one, since i-vector, x-vector (TDNN), and r-vector (ResNet) would all be included.
I also believe this trainer could see wider and wider use if more people adopt it as a generalized trainer for supervised learning.
This is my suggestion for an all-in-one trainer; please consider it, thanks!

ShaneRun (Author) commented Dec 9, 2020

@joonson
This is the training command:
python ./trainSpeakerNet.py --model XV --log_input True --encoder_type SAP --trainfunc angleproto --save_path exps/exp1 --nPerSpeaker 2 --batch_size 400

These are the test scores using the XV.py you provided, trained for 20 epochs:
IT 1, TEER/TAcc 4.25, TLOSS 4.927236
IT 2, TEER/TAcc 7.30, TLOSS 4.539513
IT 3, TEER/TAcc 8.90, TLOSS 4.386307
IT 4, TEER/TAcc 10.19, TLOSS 4.274195
IT 5, TEER/TAcc 11.21, TLOSS 4.189059
IT 6, TEER/TAcc 12.07, TLOSS 4.128410
IT 7, TEER/TAcc 12.83, TLOSS 4.074492
IT 8, TEER/TAcc 13.54, TLOSS 4.023191
IT 9, TEER/TAcc 14.13, TLOSS 3.980177
IT 10, VEER 11.6066
IT 10, TEER/TAcc 14.62, TLOSS 3.947957
IT 11, TEER/TAcc 15.08, TLOSS 3.914889
IT 12, TEER/TAcc 15.39, TLOSS 3.891497
IT 13, TEER/TAcc 15.73, TLOSS 3.867836
IT 14, TEER/TAcc 16.02, TLOSS 3.849706
IT 15, TEER/TAcc 16.27, TLOSS 3.833727
IT 16, TEER/TAcc 16.56, TLOSS 3.809935
IT 17, TEER/TAcc 16.79, TLOSS 3.795477
IT 18, TEER/TAcc 17.06, TLOSS 3.779800
IT 19, TEER/TAcc 17.30, TLOSS 3.765016
IT 20, VEER 9.6288
IT 20, TEER/TAcc 17.55, TLOSS 3.748320

For comparison, I also attached results from an earlier run with the same configuration (the printed information differs slightly because it was not the latest trunk version), using the training command:
python ./trainSpeakerNet.py --model ResNetSE34L --log_input True --encoder_type SAP --trainfunc softmaxproto --save_path exps/exp2 --nPerSpeaker 2 --batch_size 400
IT 1, LR 0.001000, TEER/TAcc 1.41, TLOSS 11.942019
IT 2, LR 0.001000, TEER/TAcc 7.55, TLOSS 9.519187
IT 3, LR 0.001000, TEER/TAcc 16.41, TLOSS 8.233636
IT 4, LR 0.001000, TEER/TAcc 25.15, TLOSS 7.317389
IT 5, LR 0.001000, TEER/TAcc 32.96, TLOSS 6.619958
IT 6, LR 0.001000, TEER/TAcc 39.75, TLOSS 6.066701
IT 7, LR 0.001000, TEER/TAcc 45.47, TLOSS 5.614951
IT 8, LR 0.001000, TEER/TAcc 50.47, TLOSS 5.230480
IT 9, LR 0.001000, TEER/TAcc 54.69, TLOSS 4.906192
IT 10, LR 0.001000, TEER/TAcc 58.26, TLOSS 4.636681, VEER 6.5695
IT 11, LR 0.000950, TEER/TAcc 61.40, TLOSS 4.403981
IT 12, LR 0.000950, TEER/TAcc 63.75, TLOSS 4.215954
IT 13, LR 0.000950, TEER/TAcc 65.83, TLOSS 4.055617
IT 14, LR 0.000950, TEER/TAcc 67.58, TLOSS 3.913849
IT 15, LR 0.000950, TEER/TAcc 69.32, TLOSS 3.782728
IT 16, LR 0.000950, TEER/TAcc 70.68, TLOSS 3.672339
IT 17, LR 0.000950, TEER/TAcc 72.06, TLOSS 3.564854
IT 18, LR 0.000950, TEER/TAcc 73.17, TLOSS 3.477508
IT 19, LR 0.000950, TEER/TAcc 74.26, TLOSS 3.384924
IT 20, LR 0.000950, TEER/TAcc 75.30, TLOSS 3.300793, VEER 5.6151

Compared with the ResNetSE34L model, the XV model shows the following characteristics:
(1) about 3x longer training time per epoch
(2) significantly lower TAcc
(3) noticeably higher VEER (9.63 vs 5.62 at epoch 20)
Does this look reasonable based on your experience with the XV model you provided?
Thank you so much!

forwiat commented Dec 29, 2020

Hi @ShaneRun, I saw your experiments based on ResNetSE34L, ResNetSE34V2, and XV. I am running ResNetSE34V2 and wonder how low the EER can go. Could you share your experimental results and some of the configurations? It would help me a lot; looking forward to your feedback.
Thank you so much!

ShaneRun (Author) commented Jan 1, 2021

@forwiat
I have only run ResNetSE34L.
I used the recommended settings for the trainer described in the README, such as:
--model ResNetSE34L --n_mels 40 --log_input True --encoder_type SAP --trainfunc softmaxproto

ShaneRun (Author) commented Jan 1, 2021

@forwiat
Moreover, XV was just a test and I don't plan to take it further.
The configuration was:
python ./trainSpeakerNet.py --model XV --log_input True --encoder_type SAP --trainfunc angleproto --save_path exps/exp1 --nPerSpeaker 2 --batch_size 400

@joonson joonson closed this as completed Aug 17, 2021