This repository contains the code for semi-supervised clustering developed for the Master's thesis "Automatic analysis of images from camera-traps" by Michal Nazarczuk, Imperial College London.
The algorithm is inspired by the DCEC method (Deep Clustering with Convolutional Autoencoders). The main change adds a "labelling" loss (cross-entropy between the labelled examples and their predictions) as an additional loss component.
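The way the three loss components combine can be illustrated with a minimal, framework-free sketch. It assumes a mean-squared reconstruction error and a KL-divergence clustering loss as in DCEC, and it assumes that the soft cluster assignment `q` of a labelled sample serves as its prediction. The function names are hypothetical and are not taken from the repository; the default weights are the values reported as working well for `--gamma` and `--gamma_lab`.

```python
import math


def mse(x, x_rec):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)


def kl_divergence(p, q):
    """KL(P || Q) between target distribution p and soft assignment q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(q, label):
    """Cross-entropy between a one-hot label and predicted probabilities q."""
    return -math.log(q[label])


def total_loss(x, x_rec, p, q, label=None, gamma=0.1, gamma_lab=0.01):
    """Reconstruction (weight 1) + gamma * clustering + gamma_lab * labelling.

    The labelling term is added only for samples that carry a label.
    """
    loss = mse(x, x_rec) + gamma * kl_divergence(p, q)
    if label is not None:
        loss += gamma_lab * cross_entropy(q, label)
    return loss


# Hypothetical single sample: perfect reconstruction, q equal to its target p.
unlabelled = total_loss([0.2, 0.8], [0.2, 0.8], [0.5, 0.5], [0.5, 0.5])
labelled = total_loss([0.2, 0.8], [0.2, 0.8], [0.5, 0.5], [0.5, 0.5], label=0)
```

In this toy case only the labelling term is non-zero, so the labelled sample contributes `gamma_lab * ln(2)` while the unlabelled one contributes nothing.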
The following libraries must be installed to run the code:
- PyTorch
- NumPy
- scikit-learn
- TensorboardX
The code was written and tested on Python 3.4.1.
Clone the repository to a local folder:
git clone https://github.com/michaal94/Semisupervised-Clustering
To test the basic version of the semi-supervised clustering, run the script with the Python distribution in which you installed the libraries (Anaconda, virtualenv, etc.). In general, type:
cd Semisupervised-Clustering
python3 semi_supervised.py
The example runs a sample clustering on the MNIST-train dataset.
The algorithm offers plenty of options for adjustment:
- Mode choice: full training or pretraining only. Use `--mode train_full` or `--mode pretrain`.
  For full training you can specify whether to run the pretraining phase with `--pretrain True`, or to use a saved network with `--pretrain False` and `--pretrained net ("path" or idx)`, giving the path or index (see the catalog structure) of the pretrained network.
- Dataset choice:
  - MNIST: train, test or full
  - Custom dataset: use the following directory structure (typical for PyTorch):
    - data_directory (the cluster folders need to correspond to the real clusters only for the reported statistics)
      - cluster_1
        - image_1
        - image_2
        - ...
      - cluster_2
        - image_1
        - image_2
        - ...
      - ...
    - data_directory_l (data used as labelled; the current version of the algorithm needs at least one example in each class)
      - cluster_1
        - image_1
        - image_2
        - ...
      - cluster_2
        - image_1
        - image_2
        - ...
      - ...

  Select the dataset with `--dataset MNIST-train`, `--dataset MNIST-test`, `--dataset MNIST-full` or `--dataset custom`. With the last one, also pass the path with `--dataset_path 'path to your dataset'` and the transformation you want for the images with `--custom_img_size [height, width, depth]`.
- Different network architectures:
  - CAE 3: the convolutional autoencoder used in DCEC, `--net_architecture CAE_3`
  - CAE 3 BN: version with batch normalisation layers, `--net_architecture CAE_3bn`
  - CAE 4 (BN): convolutional autoencoder with 4 convolutional blocks, `--net_architecture CAE_4` and `--net_architecture CAE_4bn`
  - CAE 5 (BN): convolutional autoencoder with 5 convolutional blocks, `--net_architecture CAE_5` and `--net_architecture CAE_5bn` (used for 128x128 photos)

  The following options may be used to change the model:
  - LeakyReLU or ReLU usage: `--leaky True/False` (True provided better results)
  - Negative slope for LeakyReLU: `--neg_slope value` (values around 0.01 were used)
  - Sigmoid and tanh activations at the end of the encoder and decoder: `--activations True/False` (False provided better results)
  - Bias in layers: `--bias True/False`
- Optimiser and scheduler settings (Adam optimiser):
  - Learning rate: `--rate value` (0.001 is a reasonable value for Adam)
  - Learning rate for the pretraining phase: `--rate_pretrain value` (0.001 can be used as well)
  - Weight decay: `--weight value` (0 was used)
  - Weight decay for the pretraining phase: `--weight_pretrain value`
  - Scheduler step (number of iterations until the rate is changed): `--sched_step value`
  - Scheduler step for the pretraining phase: `--sched_step_pretrain value`
  - Scheduler gamma (multiplier of the learning rate): `--sched_gamma value`
  - Scheduler gamma for the pretraining phase: `--sched_gamma_pretrain value`
- Algorithm-specific parameters:
  - Clustering loss weight (the reconstruction loss has a fixed weight of 1): `--gamma value` (0.1 provided good results)
  - Labelling loss weight: `--gamma_lab value` (0.01 provided good results)
  - Update interval for the target distribution (number of batches between updates): `--update_interval value` (choose it so that the distribution is updated every 1000-2000 photos)
  - Label check interval: `--label_upd_interval value` (it is suggested to keep the update at every iteration)
  - Stopping criterion tolerance: `--tol value` (depends on the dataset: 0.01 was used for small datasets, 0.001 for bigger ones such as MNIST)
  - Target number of clusters: `--num_clusters value`
- Other options:
  - Batch size: `--batch_size value` (depends on your device, but remember that too large a batch may hurt convergence)
  - Epochs if the stopping criterion is not met: `--epochs value`
  - Epochs of pretraining: `--epochs_pretrain value` (300 epochs were used: 200 with a learning rate of 0.001 and 100 with a rate 10 times smaller, i.e. `--sched_step_pretrain 200`, `--sched_gamma_pretrain 0.1`)
  - Report printing frequency (in batches): `--printing_frequency value`
  - Tensorboard export: `--tensorboard True/False`
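
The pretraining schedule quoted in the options above (200 epochs at 0.001, then 100 epochs at a rate 10 times smaller) can be checked with a small sketch. It assumes a plain step-decay rule in which the rate is multiplied by the scheduler gamma once every scheduler step, and that the step is counted in epochs, as the quoted example suggests. The helper name is hypothetical and not part of the repository.

```python
def step_decay_rate(base_rate, epoch, sched_step, sched_gamma):
    """Learning rate at a 0-based epoch under an assumed step-decay rule."""
    return base_rate * sched_gamma ** (epoch // sched_step)


# Values from the pretraining example: --rate_pretrain 0.001,
# --sched_step_pretrain 200, --sched_gamma_pretrain 0.1, --epochs_pretrain 300
rates = [step_decay_rate(0.001, epoch, 200, 0.1) for epoch in range(300)]

epochs_at_full_rate = sum(1 for r in rates if r == 0.001)
epochs_at_reduced_rate = len(rates) - epochs_at_full_rate
```

With these settings the first 200 epochs use the base rate and the remaining 100 use the reduced one.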
The code creates the following catalog structure when reporting the statistics:
- Reports
  - `(net_architecture_name)_(index).txt`
- Nets (copies of weights)
  - `(net_architecture_name)_(index).pt`
  - `(net_architecture_name)_(index)_pretrained.txt`
- Runs
  - `(net_architecture_name)_(index)` <- directory containing the tensorboard event file
The files are indexed automatically so that they are not accidentally overwritten.
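
One simple way to obtain such non-overwriting indices is to pick the first index whose file does not exist yet. The sketch below is only an illustration of that idea: the function name, the starting index of 1 and the unpadded index format are assumptions, not the repository's actual implementation.

```python
import os
import tempfile


def next_free_index(directory, net_name, suffix=".txt"):
    """Return the smallest index >= 1 with no existing file of that name."""
    index = 1
    while os.path.exists(
        os.path.join(directory, "{}_{}{}".format(net_name, index, suffix))
    ):
        index += 1
    return index


# Demonstration in a temporary directory standing in for "Reports".
with tempfile.TemporaryDirectory() as reports:
    first = next_free_index(reports, "CAE_3")
    # Create the report file so that the index is now taken.
    open(os.path.join(reports, "CAE_3_{}.txt".format(first)), "w").close()
    second = next_free_index(reports, "CAE_3")
```

A second run with the same architecture name then receives the next index instead of reusing the first one.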
The code was mainly used to cluster images from camera-trap events. However, some additional benchmarks were run on the MNIST datasets. The following table gathers some results (with 2% of the data labelled):
Set | NMI | Acc |
---|---|---|
MNIST-full | 95.13 | 98.22% |
MNIST-test | 89.59 | 95.29% |
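
This README does not state how the accuracy is computed. A common convention for clustering is to score the best one-to-one mapping between cluster indices and true classes, and the sketch below illustrates that convention as an assumption. It tries every mapping by brute force, which is only practical for a small number of clusters; an assignment solver such as the Hungarian algorithm is the usual choice for larger problems. The function name is hypothetical.

```python
from collections import Counter
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Accuracy under the best one-to-one mapping of clusters to classes."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    # Contingency table: how often cluster c coincides with class k.
    counts = Counter(zip(cluster_ids, true_labels))
    best_hits = 0
    for assigned in permutations(classes, len(clusters)):
        hits = sum(counts[(c, k)] for c, k in zip(clusters, assigned))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


# Toy example: cluster ids are arbitrary, one sample of class 2 is misplaced.
accuracy = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 0])
```

In the toy example five of the six samples agree with the best mapping, so the accuracy is 5/6.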
In addition, t-SNE plots of the plain and the clustered MNIST-full dataset are shown:
Full set before clustering:
After clustering: