Knowledge distillation #95
Hello again! I was just reading through the papers linked in the knowledge distillation section of the README and going through your code, and a question came to mind: is the current knowledge distillation implementation here meant to re-train a given model so that it approximates a model previously trained on a much larger dataset?
For example, we have multiple models trained on glint360k, a very large dataset, and we want to train on the smaller CASIA dataset while keeping the resulting embeddings close to those previously obtained with glint360k. This is what I understand from your implementation: DataDistiller just saves (img, label, embedding) for every image using the original model, and the loss function distiller_loss_cosine then normalizes both the target and the predicted embedding and minimizes the difference between them, roughly as in the sketch below.
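To check my reading, here is how I picture that loss. This is just my own sketch in TensorFlow, not the repo's code; the function name and exact signature are my assumptions:

```python
import tensorflow as tf

def distiller_loss_cosine_sketch(true_emb, pred_emb):
    """Hypothetical sketch: 1 - cosine similarity between the L2-normalized
    teacher (saved) embedding and the student (predicted) embedding."""
    true_emb = tf.nn.l2_normalize(true_emb, axis=-1)
    pred_emb = tf.nn.l2_normalize(pred_emb, axis=-1)
    cosine = tf.reduce_sum(true_emb * pred_emb, axis=-1)
    return 1.0 - cosine  # per-sample loss; the training loop averages it
```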
From the papers linked, this seems like a simplification. Ideally, we would have a target distribution of matching scores from good-quality images, and we would aim for the distribution of our scores (not single embedding comparisons) to match it; a rough illustration of what I mean follows below. I know it's probably just an approximation, and that computing all the embeddings for a very large dataset would take forever, but this is ideally how it should be done, right?
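For contrast, this is roughly how I picture the distribution-matching idea. Again, this is only an illustrative sketch under my own assumptions (it matches batch-wise pairwise score distributions with a KL term, whereas the papers presumably work over a full set of good-quality pairs):

```python
import tensorflow as tf

def score_distribution_loss_sketch(teacher_emb, student_emb, temperature=0.05):
    """Hypothetical sketch: match the student's distribution of pairwise
    matching scores to the teacher's, instead of single embeddings."""
    t = tf.nn.l2_normalize(teacher_emb, axis=-1)
    s = tf.nn.l2_normalize(student_emb, axis=-1)
    # Pairwise cosine "matching scores" within the batch, sharpened by a temperature.
    t_scores = tf.matmul(t, t, transpose_b=True) / temperature
    s_scores = tf.matmul(s, s, transpose_b=True) / temperature
    # Treat each row of scores as a distribution and minimize KL(teacher || student).
    t_prob = tf.nn.softmax(t_scores, axis=-1)
    s_log_prob = tf.nn.log_softmax(s_scores, axis=-1)
    kl = tf.reduce_sum(t_prob * (tf.math.log(t_prob + 1e-12) - s_log_prob), axis=-1)
    return tf.reduce_mean(kl)
```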
Sorry for the very long message; as I've said before, I've been spending a lot of time trying to understand everything in this repository, and I feel it's better to ask than to keep going around in circles :)