Knowledge distillation #95
Hello again! I was just reading through the papers linked in the knowledge distillation section of the README and going through your code, and a question came to mind: is the current knowledge distillation implementation here meant to re-train a given model so that it approximates a model previously trained on a much larger dataset?
For example, we have multiple models trained on glint360k, a very large dataset, and we want to train on the smaller CASIA dataset while keeping the resulting embeddings close to those previously obtained with glint360k. This is what I understand from your implementation: DataDistiller just saves (img, label, embedding) for every image using the original model, and the loss function distiller_loss_cosine then normalizes both the target and the predicted embedding and minimizes the difference between them, roughly as in the sketch below.
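To check my reading, here is how I picture that loss. This is just my own sketch in TensorFlow, not the repo's code; the function name and exact signature are my assumptions:

```python
import tensorflow as tf

def distiller_loss_cosine_sketch(true_emb, pred_emb):
    """Hypothetical sketch: 1 - cosine similarity between the L2-normalized
    teacher (saved) embedding and the student (predicted) embedding."""
    true_emb = tf.nn.l2_normalize(true_emb, axis=-1)
    pred_emb = tf.nn.l2_normalize(pred_emb, axis=-1)
    cosine = tf.reduce_sum(true_emb * pred_emb, axis=-1)
    return 1.0 - cosine  # per-sample loss; the training loop averages it
```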
From the papers linked, this seems like a simplification. Ideally, we would have a target distribution of matching scores from good-quality images, and we would aim for the distribution of our scores (not single embedding comparisons) to match it; a rough illustration of what I mean follows below. I know it's probably just an approximation, and that computing all the embeddings for a very large dataset would take forever, but this is ideally how it should be done, right?
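For contrast, this is roughly how I picture the distribution-matching idea. Again, this is only an illustrative sketch under my own assumptions (it matches batch-wise pairwise score distributions with a KL term, whereas the papers presumably work over a full set of good-quality pairs):

```python
import tensorflow as tf

def score_distribution_loss_sketch(teacher_emb, student_emb, temperature=0.05):
    """Hypothetical sketch: match the student's distribution of pairwise
    matching scores to the teacher's, instead of single embeddings."""
    t = tf.nn.l2_normalize(teacher_emb, axis=-1)
    s = tf.nn.l2_normalize(student_emb, axis=-1)
    # Pairwise cosine "matching scores" within the batch, sharpened by a temperature.
    t_scores = tf.matmul(t, t, transpose_b=True) / temperature
    s_scores = tf.matmul(s, s, transpose_b=True) / temperature
    # Treat each row of scores as a distribution and minimize KL(teacher || student).
    t_prob = tf.nn.softmax(t_scores, axis=-1)
    s_log_prob = tf.nn.log_softmax(s_scores, axis=-1)
    kl = tf.reduce_sum(t_prob * (tf.math.log(t_prob + 1e-12) - s_log_prob), axis=-1)
    return tf.reduce_mean(kl)
```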
Sorry for the very long message; as I've said before, I've been spending a lot of time trying to understand everything in this repository, and I feel it's better to ask than to keep going around in circles :)