Semisupervised Clustering

This repository contains the code for semi-supervised clustering developed for the Master's thesis "Automatic analysis of images from camera-traps" by Michal Nazarczuk from Imperial College London.

The algorithm is inspired by the DCEC method (Deep Clustering with Convolutional Autoencoders). The main change is an additional "labelling" loss component: the cross-entropy between the labels of the labelled examples and their predictions.
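
As a rough illustration of the labelling loss, the sketch below computes the mean cross-entropy between known labels and predicted cluster probabilities in plain Python. The function name `labelling_loss`, the `eps` guard and the sample values are assumptions made for this example; the repository computes its loss with PyTorch and may differ in detail.

```python
import math

def labelling_loss(probs_batch, labels, eps=1e-12):
    """Mean cross-entropy over the labelled examples.

    probs_batch: one probability vector over clusters per labelled example.
    labels: the known cluster index of each labelled example.
    eps: small constant guarding against log(0) (an assumption of this sketch).
    """
    total = 0.0
    for probs, label in zip(probs_batch, labels):
        total += -math.log(probs[label] + eps)
    return total / len(labels)

# Two labelled examples, three clusters (hypothetical probabilities).
probs_batch = [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]]
labels = [0, 2]
print(f"labelling loss: {labelling_loss(probs_batch, labels):.4f}")
```

The loss falls towards zero as the probability assigned to the correct cluster approaches 1, which is what pulls labelled examples towards their known clusters.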

Prerequisites

The following libraries must be installed to run the code:

PyTorch
NumPy
scikit-learn
TensorboardX

The code was written and tested with Python 3.4.1.

Installation and usage

Installation

Clone the repository into a local folder:

git clone https://github.com/michaal94/Semisupervised-Clustering

Use of the algorithm

To test the basic version of the semi-supervised clustering, run the script with the Python distribution in which you installed the libraries (Anaconda, virtualenv, etc.). In general, type:

cd Semisupervised-Clustering
python3 semi_supervised.py

The example runs a sample clustering on the MNIST-train dataset.

Options

The algorithm offers plenty of options for adjustment:

Mode choice: full training or pretraining only. Use --mode train_full or --mode pretrain.

For full training you can specify whether to run the pretraining phase (--pretrain True) or to use a saved network (--pretrain False together with --pretrained net ("path" or idx), giving the path or the index of the pretrained network; see Catalog structure).
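
The two modes can also be launched from another Python script. The sketch below only assembles and prints the command lines; `build_command` is a hypothetical helper written for this example, and it uses only the --mode and --pretrain flags described above.

```python
import shlex

SCRIPT = "semi_supervised.py"

def build_command(mode, pretrain=None):
    """Assemble the argument list for one run of the script."""
    args = ["python3", SCRIPT, "--mode", mode]
    if pretrain is not None:
        # The script expects the literal strings True / False.
        args += ["--pretrain", str(pretrain)]
    return args

pretrain_only = build_command("pretrain")
full_run = build_command("train_full", pretrain=True)

print(shlex.join(pretrain_only))
print(shlex.join(full_run))
```

Either list can be passed to `subprocess.run` from inside the cloned repository to start the run.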

Dataset choice:

MNIST - train, test, full

Custom dataset - use the following data structure (typical for PyTorch):

-data_directory (the folders need to match the true clusters only for computing statistics)
    -cluster_1
        -image_1
        -image_2
        -...
    -cluster_2
        -image_1
        -image_2
        -...
    -...
-data_directory_l (data used as labelled; the current version of the algorithm needs at least one example for each class)
    -cluster_1
        -image_1
        -image_2
        -...
    -cluster_2
        -image_1
        -image_2
        -...
    -...

Use one of the following: --dataset MNIST-train, --dataset MNIST-test, --dataset MNIST-full or --dataset custom. With the last one, also give the path with --dataset_path 'path to your dataset' and the image transformation you want with --custom_img_size [height, width, depth].
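
Before running on a custom dataset, it can help to verify that every class has at least one labelled example. The following is a minimal sketch of such a check, not part of the repository: the helper names and the `.png` file names are assumptions, and the demonstration builds a throwaway layout in a temporary directory.

```python
import tempfile
from pathlib import Path

def class_counts(root):
    """Map each class sub-directory of `root` to its number of files."""
    root = Path(root)
    return {
        d.name: sum(1 for f in d.iterdir() if f.is_file())
        for d in sorted(root.iterdir()) if d.is_dir()
    }

def classes_without_labels(data_dir, labelled_dir):
    """Return the classes in `data_dir` that have no labelled example."""
    labelled = class_counts(labelled_dir)
    return [name for name in class_counts(data_dir) if labelled.get(name, 0) == 0]

# Build a tiny layout: two clusters, but only cluster_1 has a labelled image.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp, "data_directory")
    data_l = Path(tmp, "data_directory_l")
    for cluster in ("cluster_1", "cluster_2"):
        (data / cluster).mkdir(parents=True)
        (data / cluster / "image_1.png").touch()
    (data_l / "cluster_1").mkdir(parents=True)
    (data_l / "cluster_1" / "image_1.png").touch()
    missing = classes_without_labels(data, data_l)

print("classes without labelled examples:", missing)
```

An empty list means the labelled directory satisfies the one-example-per-class requirement.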

Different network architectures:
- CAE 3 - convolutional autoencoder used in DCEC --net_architecture CAE_3
- CAE 3 BN - version with Batch Normalisation layers --net_architecture CAE_3bn
- CAE 4 (BN) - convolutional autoencoder with 4 convolutional blocks --net_architecture CAE_4 and --net_architecture CAE_4bn
- CAE 5 (BN) - convolutional autoencoder with 5 convolutional blocks --net_architecture CAE_5 and --net_architecture CAE_5bn (used for 128x128 photos)
The following options may be used for model changes:
- LeakyReLU or ReLU usage: --leaky True/False (True provided better results)
- Negative slope for Leaky ReLU: --neg_slope value (Values around 0.01 were used)
- Use of sigmoid and tanh activations at the end of encoder and decoder: --activations True/False (False provided better results)
- Use of bias in layers: --bias True/False
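
For reference, the effect of --leaky and --neg_slope can be seen in the standard definitions of the two activations. The plain-Python functions below are illustrative only; the networks themselves use the PyTorch layers.

```python
def relu(x):
    """Standard ReLU: negative inputs are clamped to zero."""
    return max(0.0, x)

def leaky_relu(x, neg_slope=0.01):
    """Leaky ReLU: negative inputs are scaled by `neg_slope` instead of zeroed."""
    return x if x >= 0 else neg_slope * x

for value in (-2.0, 0.0, 3.0):
    print(value, relu(value), leaky_relu(value))
```

With a small negative slope, units that receive negative inputs still pass a gradient, which is the usual motivation for preferring Leaky ReLU.
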
Optimiser and scheduler settings (Adam optimiser):
- Learning rate: --rate value (0.001 is a reasonable value for Adam)
- Learning rate for pretraining phase: --rate_pretrain value (0.001 can be used as well)
- Weight decay: --weight value (0 was used)
- Weight decay for pretraining phase: --weight_pretrain value
- Scheduler step (how many iterations till the rate is changed): --sched_step value
- Scheduler step for pretraining phase: --sched_step_pretrain value
- Scheduler gamma (multiplier of learning rate): --sched_gamma value
- Scheduler gamma for pretraining phase: --sched_gamma_pretrain value
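
The scheduler step and gamma together describe a step decay of the learning rate. The sketch below assumes the rule applied by PyTorch's StepLR (the rate is multiplied by gamma once every `step_size` steps) and counts steps in epochs; whether the repository steps its scheduler per epoch or per iteration is not stated here, so treat that as an assumption.

```python
def step_decay(initial_rate, step_size, gamma, epoch):
    """Learning rate in force during `epoch` (0-based) under step decay."""
    return initial_rate * gamma ** (epoch // step_size)

# Pretraining values quoted in this README: rate 0.001, step 200, gamma 0.1.
for epoch in (0, 199, 200, 299):
    print(epoch, step_decay(0.001, 200, 0.1, epoch))
```

With these values the rate stays at 0.001 for the first 200 epochs and is ten times smaller for the remaining ones.
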
Algorithm specific parameters:
- Clustering loss weight (the reconstruction loss has a fixed weight of 1): --gamma value (0.1 provided good results)
- Labelling loss weight: --gamma_lab value (0.01 provided good results)
- Update interval for the target distribution (number of batches between updates): --update_interval value (choose it so that the distribution is updated every 1000-2000 photos)
- Label check interval: --label_upd_interval value (updating at each iteration is suggested)
- Stopping criterion tolerance: --tol value (depends on the dataset: 0.01 was used for small datasets, 0.001 for bigger ones, e.g. MNIST)
- Target number of clusters: --num_clusters value
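
How these parameters interact can be sketched in a few lines. The weighted sum follows the weights listed above. The stopping rule assumes the DCEC-style criterion (stop once the fraction of samples whose cluster assignment changed falls below the tolerance), which may not match the repository exactly. All function names and the 1500-image default are hypothetical.

```python
import math

def total_loss(reconstruction, clustering, labelling, gamma=0.1, gamma_lab=0.01):
    """Weighted sum of the loss components; reconstruction has fixed weight 1."""
    return reconstruction + gamma * clustering + gamma_lab * labelling

def update_interval(batch_size, images_between_updates=1500):
    """Batches between target-distribution updates for a desired image count."""
    return max(1, math.ceil(images_between_updates / batch_size))

def changed_fraction(previous, current):
    """Fraction of samples whose cluster assignment changed between checks."""
    changed = sum(1 for a, b in zip(previous, current) if a != b)
    return changed / len(current)

def should_stop(previous, current, tol):
    """Assumed DCEC-style criterion: stop when few assignments still change."""
    return changed_fraction(previous, current) < tol

print(total_loss(0.5, 0.2, 1.0))
print(update_interval(batch_size=256))
print(should_stop([0, 1, 1, 2], [0, 1, 2, 2], tol=0.01))
```

For a batch size of 256, an interval of 6 batches updates the target distribution roughly every 1500 images, inside the suggested 1000-2000 range.
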
Other options:
- Batch size: --batch_size value (depends on your device, but note that too large a batch may hurt convergence)
- Epochs if the stopping criterion is not met: --epochs value
- Epochs of pretraining: --epochs_pretrain value (300 epochs were used: 200 with a learning rate of 0.001 and 100 with a rate 10 times smaller, i.e. --sched_step_pretrain 200, --sched_gamma_pretrain 0.1)
- Report printing frequency (in batches): --printing_frequency value
- Tensorboard export: --tensorboard True/False

Catalog structure

The code creates the following catalog structure when reporting the statistics:

-Reports
    -(net_architecture_name)_(index).txt
-Nets (copies of weights)
    -(net_architecture_name)_(index).pt
    -(net_architecture_name)_(index)_pretrained.txt
-Runs
    -(net_architecture_name)_(index)  <- directory containing tensorboard event file

Files are indexed automatically so that they are not accidentally overwritten.
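
One simple way to obtain such non-overwriting names is to probe for the first unused index. The sketch below is an assumption about how this could work, not the repository's code: `next_free_path`, the starting index of 1 and the unpadded index format are all hypothetical.

```python
import tempfile
from pathlib import Path

def next_free_path(directory, name, suffix):
    """Return directory/name_index+suffix for the first index not yet on disk."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    index = 1
    while (directory / f"{name}_{index}{suffix}").exists():
        index += 1
    return directory / f"{name}_{index}{suffix}"

# Demonstrate inside a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    reports = Path(tmp, "Reports")
    first = next_free_path(reports, "CAE_3", ".txt")
    first.touch()
    second = next_free_path(reports, "CAE_3", ".txt")
    names = (first.name, second.name)

print(names)
```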

Performance

The code was mainly used to cluster images from camera-trap events. However, some additional benchmarks were run on the MNIST datasets. The following table gathers some results (with 2% of the data labelled):

Set	NMI	Acc
MNIST-full	95.13	98.22%
MNIST-test	89.59	95.29%
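
For readers unfamiliar with the two metrics, the sketch below shows common definitions in plain Python: NMI normalised by the mean of the two entropies, and clustering accuracy as the best accuracy over one-to-one mappings of cluster ids to classes. These are assumptions about the metric variants; the reported numbers may have been computed differently (for example with scikit-learn). The brute-force search over mappings only suits a handful of clusters and assumes no more clusters than classes.

```python
import math
from collections import Counter
from itertools import permutations

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(true, pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(true)
    joint = Counter(zip(true, pred))
    count_true, count_pred = Counter(true), Counter(pred)
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    mean_entropy = (entropy(true) + entropy(pred)) / 2
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0

def clustering_accuracy(true, pred):
    """Best accuracy over all one-to-one cluster-to-class mappings."""
    classes, clusters = sorted(set(true)), sorted(set(pred))
    best = 0
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        correct = sum(mapping[p] == t for t, p in zip(true, pred))
        best = max(best, correct)
    return best / len(true)

true = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 2, 0, 0, 0]  # cluster ids are arbitrary; one sample is misplaced
print(f"NMI: {nmi(true, pred):.4f}  Acc: {clustering_accuracy(true, pred):.4f}")
```

For ten or more clusters, as in MNIST, the mapping is normally found with the Hungarian algorithm (for instance `scipy.optimize.linear_sum_assignment`) rather than by enumerating permutations.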

In addition, t-SNE plots of the MNIST-full dataset before and after clustering are shown:

Full set before clustering:

After clustering:
