- Now, I'm fixing all the issues and refining the code, so it is easier than before to understand how each KD method works.
- All algorithms have been re-implemented, but they still need to be checked further with hyperparameter tuning.
- The algorithms that have experimental results below have been confirmed.
- This repository will be an upgraded version of my previous benchmark repository (link).
Knowledge defined by the neural response of the hidden layer or the output layer of the network
- Soft-logit : The first knowledge distillation method for deep neural networks. Knowledge is defined by the softened logits. Because it is easy to handle, many applied methods have been proposed on top of it, such as semi-supervised learning, defending against adversarial attacks, and so on (see the loss sketch at the end of this group).
- Deep Mutual Learning (DML) : Train the teacher and the student network simultaneously, so that the student follows not only the teacher's training results but also its training procedure.
- Factor Transfer (FT) : Encode a teacher network's feature map, and transfer the knowledge by mimicking it.
- Jangho Kim et al. "Paraphrasing Complex Network: Network Compression via Factor Transfer", Advances in Neural Information Processing Systems (NeurIPS) 2018 (in progress)
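As a rough illustration of this response-based group, below is a minimal NumPy sketch of the softened-logit KD loss described in the Soft-logit entry; the temperature value and function names are illustrative assumptions, not this repository's implementation.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; a higher T produces a softer distribution.
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def soft_logits_kd_loss(student_logits, teacher_logits, T=4.0):
    # KL divergence between softened teacher and student predictions,
    # scaled by T^2 as in the soft-target formulation.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-8) - np.log(p_s + 1e-8)), axis=1)
    return (T ** 2) * kl.mean()

# Toy check with random logits (batch of 8, 10 classes).
rng = np.random.default_rng(0)
print(soft_logits_kd_loss(rng.normal(size=(8, 10)), rng.normal(size=(8, 10))))
```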
Increase the quantity of knowledge by sensing several points of the teacher network

- FitNet : To increase the amount of transferred information, knowledge is defined at multiple connection points between the networks, and the feature maps are compared by L2 distance.
- Attention transfer (AT) : Knowledge is defined by the attention map, which is the L2-norm of each spatial feature point (see the sketch at the end of this group).
- Activation boundary (AB) : To soften the teacher network's constraint, they propose a new metric function inspired by the hinge loss, which is usually used for SVMs.
- VID : Knowledge is defined by a variational lower bound, which is maximized to increase the mutual information between the teacher and the student network.
- Ahn et al. "Variational Information Distillation for Knowledge Transfer", CVPR 2019 (in progress)
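To make the attention-transfer entry above concrete, here is a minimal NumPy sketch of an activation-based attention map (channel-wise sum of squares, L2-normalized per sample) and the matching loss; the shapes and names are assumptions for illustration, not the repository's code.

```python
import numpy as np

def attention_map(feature, eps=1e-8):
    # feature: (N, H, W, C). Sum the squared activations over channels,
    # flatten spatially, and L2-normalize each sample's map.
    a = np.sum(feature ** 2, axis=-1).reshape(feature.shape[0], -1)
    return a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)

def at_loss(student_feat, teacher_feat):
    # L2 distance between normalized attention maps
    # (spatial sizes must match; channel counts may differ).
    diff = attention_map(student_feat) - attention_map(teacher_feat)
    return np.mean(np.sum(diff ** 2, axis=1))

rng = np.random.default_rng(0)
fs = rng.normal(size=(4, 8, 8, 16))  # student feature map
ft = rng.normal(size=(4, 8, 8, 64))  # teacher feature map, same spatial size
print(at_loss(fs, ft))
```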
Knowledge defined by the shared representation between two feature maps

- Flow of Solution Procedure (FSP) : To soften the teacher network's constraint, they define knowledge as the relation between two feature maps (see the sketch at the end of this group).
- KD using Singular Value Decomposition (KD-SVD) : To extract the major information in the feature maps, they use singular value decomposition.
- Seung Hyun Lee et al. "Self-supervised Knowledge Distillation Using Singular Value Decomposition", ECCV 2018 [the original project link]
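As a sketch of the FSP idea above, the following NumPy snippet builds the FSP (Gram-style) matrix between two feature maps of one network and compares the teacher's and student's matrices with an L2 loss; the shapes, channel sizes, and names are illustrative assumptions.

```python
import numpy as np

def fsp_matrix(feat_a, feat_b):
    # feat_a: (N, H, W, C1), feat_b: (N, H, W, C2), two layers of one network.
    # The FSP matrix is the spatially averaged inner product between channels.
    n, h, w, c1 = feat_a.shape
    c2 = feat_b.shape[-1]
    a = feat_a.reshape(n, h * w, c1)
    b = feat_b.reshape(n, h * w, c2)
    return np.einsum('nsc,nsd->ncd', a, b) / (h * w)  # (N, C1, C2)

def fsp_loss(student_pair, teacher_pair):
    # L2 distance between the student's and teacher's FSP matrices.
    gs = fsp_matrix(*student_pair)
    gt = fsp_matrix(*teacher_pair)
    return np.mean(np.sum((gs - gt) ** 2, axis=(1, 2)))

rng = np.random.default_rng(0)
s_pair = (rng.normal(size=(2, 8, 8, 16)), rng.normal(size=(2, 8, 8, 16)))
t_pair = (rng.normal(size=(2, 8, 8, 16)), rng.normal(size=(2, 8, 8, 16)))
print(fsp_loss(s_pair, t_pair))
```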
Knowledge defined by intra-data relation

- Relational Knowledge Distillation (RKD) : They propose knowledge that contains not only per-sample feature information but also intra-data relation information (see the sketch after this list).
- Multi-head Graph Distillation (MHGD) : They propose a distillation module built with a multi-head attention network. Each attention head extracts a relation between feature maps, which contains knowledge about the embedding procedure.
- Comprehensive Overhaul (CO) : They distill pre-activation feature information, using a margin ReLU on the teacher features and a partial L2 distance.
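To illustrate the intra-data relation idea in the RKD entry, here is a minimal NumPy sketch of RKD's distance-wise term (mean-normalized pairwise distances matched with a Huber penalty); the function names and batch shapes are assumptions, not the repository's code.

```python
import numpy as np

def pairwise_distances(embeddings, eps=1e-12):
    # Euclidean distance matrix between all pairs in the batch,
    # normalized by the mean of the non-zero distances.
    sq = np.sum(embeddings ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * embeddings @ embeddings.T
    d = np.sqrt(np.maximum(d2, 0.0))
    mean = d[d > 0].mean() + eps
    return d / mean

def huber(x, delta=1.0):
    # Smooth L1 penalty applied to the difference of distance matrices.
    absx = np.abs(x)
    return np.where(absx <= delta, 0.5 * x ** 2, delta * (absx - 0.5 * delta))

def rkd_distance_loss(student_emb, teacher_emb):
    return huber(pairwise_distances(student_emb)
                 - pairwise_distances(teacher_emb)).mean()

rng = np.random.default_rng(0)
print(rkd_distance_loss(rng.normal(size=(16, 64)), rng.normal(size=(16, 128))))
```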
- I use WResNet-40-4 and WResNet-16-4 as the teacher and the student network, respectively.
- All the algorithms are trained with the same base configuration, which is described in "train_w_distillation.py", and only each algorithm's hyper-parameters are tuned. I tried only a few settings to reach acceptable performance, which means that my experimental results are probably not optimal.
- Although some of the algorithms use soft-logits in parallel in their papers, I used only each paper's proposed knowledge distillation algorithm to make a fair comparison (see the sketch after this list).
- Initialization-based methods give far higher performance at the starting point but poor performance at the final point due to overfitting. Therefore, initialized students must be combined with a regularization algorithm, such as Soft-logits.
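As a rough picture of how each run combines losses (not the actual "train_w_distillation.py"), the sketch below adds a single method's distillation term to the student's cross-entropy loss; the `kd_weight` and function names are hypothetical.

```python
import numpy as np

def cross_entropy(student_logits, labels):
    # Standard softmax cross-entropy on hard labels.
    z = student_logits - student_logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

def total_loss(student_logits, labels, kd_term, kd_weight=1.0):
    # One KD term at a time (no parallel soft-logits), matching the note above
    # that only each method's proposed distillation loss is used.
    return cross_entropy(student_logits, labels) + kd_weight * kd_term

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 100))
labels = rng.integers(0, 100, size=8)
print(total_loss(logits, labels, kd_term=0.3, kd_weight=1.0))
```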
Methods | Full Dataset (Accuracy) | 50% Dataset (Last Accuracy) | 25% Dataset (Last Accuracy) | 10% Dataset (Last Accuracy) |
---|---|---|---|---|
Teacher | 78.59 | - | - | - |
Student | 76.25 | - | - | - |
Soft_logits | 76.57 | - | - | - |
FitNet | 75.78 | - | - | - |
AT | 78.14 | - | - | - |
FSP | 76.08 | - | - | - |
DML | - | - | - | - |
KD_SVD | - | - | - | - |
FT | 77.30 | - | - | - |
AB | 76.52 | - | - | - |
RKD | 77.69 | - | - | - |
VID | - | - | - | - |
MHGD | - | - | - | - |
CO | 78.54 | - | - | - |
- Check all the algorithms.
- Do the remaining experiments.