Replies: 2 comments 1 reply
-
I have tried larger weights, but the result is worse.
-
These results seem good overall, especially for a relatively short wakeword like "hey luna" with just three syllables. It's possible that by continuing to tweak hyperparameters you can incrementally improve performance. However, in my own experiments I also notice that different metrics often diverge late in training. In particular, it is very difficult to keep the validation false positive rate low during extended training, as even small models will start to overfit on the training data. I find that often the best way to reduce false positives is to identify what types of words and phrases trigger false activations during normal usage, and then manually add those phrases to the config file when training.
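As a sketch of what that looks like in practice, here is a config fragment listing observed trigger phrases as explicit negatives. The key names (`target_phrase`, `custom_negative_phrases`) are assumed from openWakeWord's example training config, and the phrases are made-up illustrations; verify both against your version before use.

```yaml
# Hypothetical training-config fragment: phrases observed to cause
# false activations, added as explicit adversarial negatives.
target_phrase:
  - "hey luna"
custom_negative_phrases:
  - "hey lunch"
  - "hay tuna"
  - "a luna"
```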
-
I want to train a model for the wakeword "hey luna". If I set the audio augmentation parameters to {AddColoredNoise:[10:30], AddBackgroundNoise:[0:25], RIR[ave]}, the training metrics look fantastic. I use 200,000 positive samples and the ACAV100M_2000_hrs_16bit dataset for training. The gray line uses a weight of 50 and the orange line a weight of 150. On the training set the model's false positive rate seems to decrease, but on the validation set the FPR increases.
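The thread does not say which parameter the weights 50 and 150 are; if they are class weights in the loss (an assumption), their effect can be sketched with a weighted binary cross-entropy, where `negative_weight` is an illustrative knob that penalizes false activations on negative clips more heavily. This is a conceptual sketch, not openWakeWord's actual loss code.

```python
import numpy as np

def weighted_bce(y_true, y_pred, negative_weight=50.0, eps=1e-7):
    """Binary cross-entropy where errors on negative (non-wakeword)
    examples are penalized `negative_weight` times more heavily."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_example = -(y_true * np.log(y_pred)
                    + negative_weight * (1 - y_true) * np.log(1 - y_pred))
    return per_example.mean()

y_true = np.array([1.0, 0.0, 0.0])   # one positive, two negatives
y_pred = np.array([0.9, 0.2, 0.1])   # model scores; 0.2 is a near-miss
# A larger negative_weight amplifies the penalty for the 0.2 score,
# pushing training toward fewer false activations (at some recall cost).
print(weighted_bce(y_true, y_pred, negative_weight=50.0))
print(weighted_bce(y_true, y_pred, negative_weight=150.0))
```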

If I reduce these parameters to {AddColoredNoise:[20:30], AddBackgroundNoise:[20:30], RIR[peak]}, the model trains to a satisfactory false positive rate (1 per hour) and recall (90%) within 100,000 steps.
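For anyone following along, ranges like AddBackgroundNoise:[20:30] express a signal-to-noise ratio interval in dB: noise is mixed into each clip at a random SNR drawn from that range, so [20:30] adds much milder noise than [0:25]. A minimal NumPy sketch of that mixing step (a reimplementation of the idea, not the augmentation library's code):

```python
import numpy as np

def mix_at_snr(signal, noise, min_snr_db, max_snr_db, rng=None):
    """Scale `noise` so the mixed clip has an SNR drawn uniformly
    from [min_snr_db, max_snr_db] dB, then add it to `signal`."""
    if rng is None:
        rng = np.random.default_rng()
    snr_db = rng.uniform(min_snr_db, max_snr_db)
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # P_signal / P_noise = 10^(snr_db / 10)  =>  required noise power:
    target_noise_power = sig_power / (10 ** (snr_db / 10))
    return signal + noise * np.sqrt(target_noise_power / noise_power)

rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s at 16 kHz
background = rng.standard_normal(16000)
augmented = mix_at_snr(clip, background, 20, 30, rng=rng)  # mild noise
```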
Also, I notice that the phrase "hi luna" performs worse in the model than "hey luna"; the reason is unknown.