Implementations of multiple methods for transferring knowledge between Neural Networks, with utilities to save, plot and compare the results.
- Built with Python 3.8 and TensorFlow 2.2.
- Implements Knowledge Distillation (KD), Probabilistic Knowledge Transfer (PKT) and KD + PKT.
- Includes an experimental implementation of student Selective Learning.
- A USTE (Unified Specialized Teachers Ensemble) can be used as the teacher input, which is useful for transferring knowledge efficiently to a smaller network. For more information on how to create and train a USTE, see: NN Train.
- My diploma thesis: Lightweight Deep Learning For Embedded Intelligence
- PRLetters Journal Publication: Improving knowledge distillation using unified ensembles of specialized teachers
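For orientation, here is a minimal sketch of the classic KD objective that the `--temperature` and `--kd_lambda_supervised` flags below refer to. It illustrates the standard formulation (soft targets plus a supervised term); the exact loss wiring in this repository may differ.

```python
import tensorflow as tf

def kd_loss(y_true, teacher_logits, student_logits,
            temperature=2.0, lambda_supervised=0.1):
    """Sketch of the standard Knowledge Distillation objective.

    Soft term: cross-entropy against the teacher's temperature-softened
    predictions. Supervised term: cross-entropy against the hard labels,
    scaled by lambda_supervised (cf. the -t and -kdl flags).
    """
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student = tf.nn.softmax(student_logits / temperature)

    # Distillation term: match the teacher's softened distribution.
    distillation = tf.keras.losses.categorical_crossentropy(soft_teacher, soft_student)

    # Supervised term: ordinary cross-entropy against the ground-truth labels.
    supervised = tf.keras.losses.categorical_crossentropy(
        y_true, tf.nn.softmax(student_logits))

    return distillation + lambda_supervised * supervised
```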
usage: knowledge_transfer.py [-h]
[-m {distillation,pkt,pkt+distillation} [{distillation,pkt,pkt+distillation} ...]]
[-sl] [-w START_WEIGHTS] [-t TEMPERATURE]
[-kdl KD_LAMBDA_SUPERVISED]
[-pktl PKT_LAMBDA_SUPERVISED] [-k NEIGHBORS]
[-kdw KD_IMPORTANCE_WEIGHT]
[-pktw PKT_IMPORTANCE_WEIGHT] [-ufm]
[-s {all,best,none}] [-or]
[-res RESULTS_NAME_PREFIX] [-out OUT_FOLDER]
[-o {adam,rmsprop,sgd,adagrad,adadelta,adamax}]
[-lr LEARNING_RATE] [-lrp LEARNING_RATE_PATIENCE]
[-lrd LEARNING_RATE_DECAY]
[-lrm LEARNING_RATE_MIN]
[-esp EARLY_STOPPING_PATIENCE] [-cn CLIP_NORM]
[-cv CLIP_VALUE] [-b1 BETA1] [-b2 BETA2]
[-rho RHO] [-mm MOMENTUM] [-d DECAY]
[-bs BATCH_SIZE] [-ebs EVALUATION_BATCH_SIZE]
[-e EPOCHS] [-v VERBOSITY] [--debug]
teacher student
{cifar10,cifar100,svhn_cropped,fashion_mnist,mnist}
Transfer the knowledge between two Neural Networks, using different methods
and compare the results.
positional arguments:
teacher Path to a trained teacher network.
student Path to a student network to be used.
{cifar10,cifar100,svhn_cropped,fashion_mnist,mnist}
The name of the dataset to be used.
optional arguments:
-h, --help show this help message and exit
-m {distillation,pkt,pkt+distillation} [{distillation,pkt,pkt+distillation} ...],
--method {distillation,pkt,pkt+distillation} [{distillation,pkt,pkt+distillation} ...]
The KT method(s) to be used.
(default ['distillation', 'pkt', 'pkt+distillation']).
-sl, --selective_learning
Whether the models should be designed for the KT with
Selective Learning framework (default False).
-w START_WEIGHTS, --start_weights START_WEIGHTS
Filepath containing existing weights to initialize the
model.
-t TEMPERATURE, --temperature TEMPERATURE
The temperature for the distillation (default 2).
-kdl KD_LAMBDA_SUPERVISED, --kd_lambda_supervised KD_LAMBDA_SUPERVISED
The lambda value for the KD supervised term (default
0.1).
-pktl PKT_LAMBDA_SUPERVISED, --pkt_lambda_supervised PKT_LAMBDA_SUPERVISED
The lambda value for the PKT supervised term (default
0.0001).
-k NEIGHBORS, --neighbors NEIGHBORS
The number of neighbors for the PKT method evaluation
(default 5).
-kdw KD_IMPORTANCE_WEIGHT, --kd_importance_weight KD_IMPORTANCE_WEIGHT
The importance weight for the KD loss, if method is
PKT plus KD (default 1).
-pktw PKT_IMPORTANCE_WEIGHT, --pkt_importance_weight PKT_IMPORTANCE_WEIGHT
The importance weight for the PKT loss, if method is
PKT plus KD (default 1).
-ufm, --use_final_model
Whether the final model should be used for saving and
results evaluation and not the best one achieved
through the training procedure (default False).
-s {all,best,none}, --save_students {all,best,none}
The save mode for the final student networks. (default
best).
-or, --omit_results Whether the KT comparison results should not be saved
(default False).
-res RESULTS_NAME_PREFIX, --results_name_prefix RESULTS_NAME_PREFIX
The prefix for the results filenames (default ).
-out OUT_FOLDER, --out_folder OUT_FOLDER
Path to the folder where the outputs will be stored
(default out).
-o {adam,rmsprop,sgd,adagrad,adadelta,adamax}, --optimizer {adam,rmsprop,sgd,adagrad,adadelta,adamax}
The optimizer to be used. (default adam).
-lr LEARNING_RATE, --learning_rate LEARNING_RATE
The learning rate for the optimizer (default 0.001).
-lrp LEARNING_RATE_PATIENCE, --learning_rate_patience LEARNING_RATE_PATIENCE
The number of epochs to wait before decaying the
learning rate (default 8).
-lrd LEARNING_RATE_DECAY, --learning_rate_decay LEARNING_RATE_DECAY
The learning rate decay factor. If 0 is given, then
the learning rate will remain the same during the
training process. (default 0.1).
-lrm LEARNING_RATE_MIN, --learning_rate_min LEARNING_RATE_MIN
The minimum learning rate which can be reached
(default 1e-08).
-esp EARLY_STOPPING_PATIENCE, --early_stopping_patience EARLY_STOPPING_PATIENCE
The number of epochs to wait before early stopping. If
0 is given, early stopping will not be applied.
(default 15).
-cn CLIP_NORM, --clip_norm CLIP_NORM
The clip norm for the optimizer (default None).
-cv CLIP_VALUE, --clip_value CLIP_VALUE
The clip value for the optimizer (default None).
-b1 BETA1, --beta1 BETA1
The beta 1 for the optimizer (default 0.9).
-b2 BETA2, --beta2 BETA2
The beta 2 for the optimizer (default 0.999).
-rho RHO The rho for the optimizer (default 0.9).
-mm MOMENTUM, --momentum MOMENTUM
The momentum for the optimizer (default 0.0).
-d DECAY, --decay DECAY
The decay for the optimizer (default 1e-06).
-bs BATCH_SIZE, --batch_size BATCH_SIZE
The batch size for the optimization (default 64).
-ebs EVALUATION_BATCH_SIZE, --evaluation_batch_size EVALUATION_BATCH_SIZE
The batch size for the evaluation (default 128).
-e EPOCHS, --epochs EPOCHS
The number of epochs to train the network (default
125).
-v VERBOSITY, --verbosity VERBOSITY
The verbosity for the optimization procedure (default
1).
--debug Whether debug mode should be enabled (default False).
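The PKT-related flags above (`-pktl`, `-pktw`, `-k`) refer to the Probabilistic Knowledge Transfer objective, which matches the pairwise similarity distributions induced by the teacher's and the student's feature spaces. Below is a minimal sketch of the batch-wise cosine-kernel formulation from the PKT paper; the repository's implementation may differ in details.

```python
import tensorflow as tf

def cosine_similarities(features, eps=1e-7):
    # Row-normalize, then compute the pairwise cosine-similarity matrix.
    features = features / (tf.norm(features, axis=1, keepdims=True) + eps)
    return tf.matmul(features, features, transpose_b=True)

def pkt_loss(teacher_features, student_features, eps=1e-7):
    # Shift cosine similarities from [-1, 1] into [0, 1].
    t_sim = (cosine_similarities(teacher_features) + 1.0) / 2.0
    s_sim = (cosine_similarities(student_features) + 1.0) / 2.0

    # Turn each row into a conditional probability distribution over the batch.
    t_prob = t_sim / tf.reduce_sum(t_sim, axis=1, keepdims=True)
    s_prob = s_sim / tf.reduce_sum(s_sim, axis=1, keepdims=True)

    # KL divergence between the teacher and student distributions.
    return tf.reduce_mean(t_prob * tf.math.log((t_prob + eps) / (s_prob + eps)))
```

A full example invocation, running all three methods on CIFAR-10 with the Selective Learning framework enabled: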
knowledge_transfer.py \
'teacher.h5' \
'student.h5' \
cifar10 \
--method distillation pkt pkt+distillation \
--results_name_prefix SL_strong \
--temperature 6 \
--kd_lambda_supervised 0.3 \
--kd_importance_weight 1 \
--learning_rate 1e-4 \
--optimizer adam \
--epochs 150 \
--learning_rate_decay 0.5 \
--learning_rate_min 1e-8 \
--early_stopping_patience 0 \
--selective_learning \
--out_folder 'Knowledge-Transfer/out/cifar10/baseline_v2'
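The script logs its progress and finishes with a comparison of the transferred students against the teacher: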
2020-02-04 19:17:42,967 [MainThread ] [INFO ]
---------------------------------------------------------------------------------------------
2020-02-04 19:17:42,969 [MainThread ] [INFO ] Loading dataset...
2020-02-04 19:18:03,476 [MainThread ] [INFO ] Preprocessing data...
2020-02-04 19:18:03,862 [MainThread ] [INFO ] Preparing selective_learning KT framework...
2020-02-04 19:18:04,620 [MainThread ] [INFO ] Getting teacher's predictions...
2020-02-04 19:18:10,947 [MainThread ] [INFO ] Starting KT method(s)...
2020-02-04 19:18:10,948 [MainThread ] [INFO ] Performing Knowledge Distillation...
2020-02-04 20:18:04,402 [MainThread ] [INFO ] Performing Probabilistic Knowledge Transfer...
2020-02-04 21:14:39,540 [MainThread ] [INFO ] Performing PKT plus Distillation...
2020-02-04 22:27:24,137 [MainThread ] [INFO ] Evaluating results...
2020-02-04 22:27:24,155 [MainThread ] [INFO ] Evaluating Knowledge Distillation...
2020-02-04 22:27:31,018 [MainThread ] [INFO ] Evaluating Probabilistic Knowledge Transfer...
2020-02-04 22:27:43,570 [MainThread ] [INFO ] Evaluating PKT plus Distillation...
2020-02-04 22:27:47,472 [MainThread ] [INFO ] Evaluating Teacher...
2020-02-04 22:27:58,568 [MainThread ] [INFO ] Final results:
Parameters:
Teacher params: 925182
Student params: 106922
Ratio: T/S=8.653 S/T=0.1156
Knowledge Distillation:
loss: 0.04038
categorical_accuracy: 0.8419
categorical_crossentropy: 0.7592
Probabilistic Knowledge Transfer:
loss: 0.1777
categorical_accuracy: 0.7925
categorical_crossentropy: 5.226
PKT plus Distillation:
loss: 0.08202
categorical_accuracy: 0.8379
categorical_crossentropy: 1.585
Teacher:
loss: 0.3615
categorical_accuracy: 0.8475
categorical_crossentropy: 0.465
2020-02-04 22:27:58,569 [MainThread ] [INFO ] Saving student network(s)...
2020-02-04 22:28:06,425 [MainThread ] [INFO ] Student network has been saved as Knowledge-Transfer/out/cifar10/baseline/SL_strong_Knowledge Distillation_model.h5.
2020-02-04 22:28:06,449 [MainThread ] [INFO ] Student network has been saved as Knowledge-Transfer/out/cifar10/baseline/SL_strong_Probabilistic Knowledge Transfer_model.h5.
2020-02-04 22:28:11,739 [MainThread ] [INFO ] Student network has been saved as Knowledge-Transfer/out/cifar10/baseline/SL_strong_PKT plus Distillation_model.h5.
2020-02-04 22:28:11,741 [MainThread ] [INFO ] Saving results...
2020-02-04 22:28:11,766 [MainThread ] [INFO ] Finished!
Some useful files (unless otherwise specified) are saved in the chosen out folder destination. These include:
- the final student(s) weights
- the student model(s) (in keras format)
- all the methods' results in a pickle file, which also contains all the training histories