Knowledge_distillation_via_TF2.0

  • I am currently fixing all the issues and refining the code, so it will be easier than before to understand how each KD method works.
  • The algorithms have been re-implemented, but they still need further verification with hyperparameter tuning.
    • The algorithms that already have experimental results below have been confirmed.
  • This repository will be an upgraded version of my previous benchmark repository (link).

Implemented Knowledge Distillation Methods

Knowledge defined by the neural response of the hidden layers or the output layer of the network.
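
The most basic form of this response-based knowledge is the soft-logits loss of Hinton et al. The snippet below is a minimal sketch for illustration, not the repository's exact implementation; the temperature value is an assumption.

```python
import tensorflow as tf

def soft_logits_loss(student_logits, teacher_logits, temperature=4.0):
    """Distillation loss on temperature-softened output distributions."""
    t_prob = tf.nn.softmax(teacher_logits / temperature, axis=-1)
    s_log_prob = tf.nn.log_softmax(student_logits / temperature, axis=-1)
    # Cross-entropy against the teacher's softened distribution; the T^2 factor
    # keeps gradient magnitudes comparable to the hard-label loss.
    kd = -tf.reduce_sum(t_prob * s_log_prob, axis=-1)
    return tf.reduce_mean(kd) * temperature ** 2
```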

Experimental Results

  • I use WResNet-40-4 and WResNet-16-4 as the teacher and the student network, respectively.
  • All algorithms are trained with the common configuration described in "train_w_distillation.py", and only each algorithm's own hyperparameters are tuned. I tried only a few runs to reach acceptable performance, so my experimental results are probably not optimal.
  • Although some of the algorithms use soft logits in parallel in their papers, I used only each proposed knowledge distillation term to make a fair comparison (a rough sketch of such a combination is shown after this list).
  • Initialization-based methods give far higher performance at the start of training but poorer performance at the end due to overfitting. Therefore, initialized students should be trained with a regularization term such as soft logits.
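
For reference, here is a minimal sketch of what combining a feature-based term with soft logits could look like, using an attention-transfer-style loss together with the soft_logits_loss sketched above. The attention_map helper, the feature tensors, and the weights alpha/beta are illustrative assumptions, not the repository's tuned settings.

```python
import tensorflow as tf

def attention_map(feature):
    """Collapse a feature map [B, H, W, C] to an L2-normalized spatial attention vector."""
    att = tf.reduce_mean(tf.square(feature), axis=-1)    # [B, H, W]
    att = tf.reshape(att, [tf.shape(att)[0], -1])        # [B, H*W]
    return tf.math.l2_normalize(att, axis=-1)

def combined_loss(labels, student_logits, teacher_logits,
                  student_feat, teacher_feat, alpha=0.1, beta=1e3):
    """Hard-label CE + attention-transfer term + soft-logits regularization."""
    ce = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, student_logits, from_logits=True))
    at = tf.reduce_mean(
        tf.reduce_sum(
            tf.square(attention_map(student_feat) - attention_map(teacher_feat)),
            axis=-1))
    kd = soft_logits_loss(student_logits, teacher_logits)  # from the sketch above
    return ce + beta * at + alpha * kd
```

In such a combination, the soft-logits term plays the regularization role mentioned above for initialization-based students.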

Training/Validation accuracy

| Methods     | Full Dataset (Last Accuracy) | 50% Dataset (Last Accuracy) | 25% Dataset (Last Accuracy) | 10% Dataset (Last Accuracy) |
|-------------|------------------------------|-----------------------------|-----------------------------|-----------------------------|
| Teacher     | 78.59 | - | - | - |
| Student     | 76.25 | - | - | - |
| Soft_logits | 76.57 | - | - | - |
| FitNet      | 75.78 | - | - | - |
| AT          | 78.14 | - | - | - |
| FSP         | 76.08 | - | - | - |
| DML         | -     | - | - | - |
| KD_SVD      | -     | - | - | - |
| FT          | 77.30 | - | - | - |
| AB          | 76.52 | - | - | - |
| RKD         | 77.69 | - | - | - |
| VID         | -     | - | - | - |
| MHGD        | -     | - | - | - |
| CO          | 78.54 | - | - | - |

Plan to do

  • Check all the algorithms.
  • Run the remaining experiments.