Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images
by Mehdi Cherti, Jenia Jitsev [arXiv:2106.00116]
Short version of the paper accepted at Medical Imaging Meets NeurIPS 2021 Workshop
Longer version of the paper accepted at IEEE International Joint Conference on Neural Networks 2022
In this repository, we provide the code for reproducing the experiments on large-scale pre-training and transfer learning for the paper "Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images" (arXiv:2106.00116).
We provide instructions on how to download the different datasets used in the paper. We provide the pre-trained models, the instructions to fine-tune a pre-trained model on one of the datasets considered in the paper, as well as new datasets.
├── LICENSE <- MIT License
├── README.md <- Main doc README on reproducing the experiments
├── requirements.txt <- The requirements file for reproducing the experiments environment, e.g.
│ generated with `pip freeze > requirements.txt`
├── setup.py <- makes `transfer learning` package installable so it can be imported
├── pretrain.py <- code for pre-training
├── finetune.py <- code for fine-tuning
├── transfer_learning <- Source code
│ ├── dataloaders <- Dataset loaders
│ ├── datasets <- Datasets
│ ├── finetuning <- utilities used for Bit-HyperRule
│ ├── lr_scheduler <- Learning rate schedulers
│ ├── models <- Neural network architecture definitions
│ ├── optim <- Optimizers
├── datasets <- Folder where datasets are stored
├── pretrained_models <- Folder where pre-trained models are stored
Steps to install the package:
pip install -r requirements.txt
python setup.py develop
The folder datasets
will be used to store the datasets. In each subfolder of datasets
, there will be one dataset.
Following are in the instructions to download each dataset considered in the paper.
- Fill the form and download the dataset from https://stanfordmlgroup.github.io/competitions/chexpert/, and extract the archive
- Put the folder
CheXpert-v1.0
indatasets
- Follow the instructions at https://physionet.org/content/mimic-cxr-jpg/2.0.0/ (section Files) and extract the archive
- Put the folder
mimic-cxr-jpg
indatasets
- Download all the files at https://nihcc.app.box.com/v/ChestXray-NIHCC, and extract all the archives in
images/
- Create a folder
NIH-ChestXRay-14
insidedatasets
, and all the files and the folderimages
inNIH-ChestXRay-14
- Download the complete dataset from https://bimcv.cipf.es/bimcv-projects/padchest/ after filling the form and extract the archive
- Unzip all the zip files
0.zip
,1.zip
,...,55.zip
insideBIMCV-PadChest-FULL
- Put the folder
BIMCV-PadChest-FULL
indatasets
and rename it toPadChest
- Download the dataset from https://bit.ly/3xBF9XZ and extract the archive
- Put the folder
oxford-102-flowers
indatasets
If you would like to re-create the dataset from the original version, follow these steps:
- Download the dataset from https://s3.amazonaws.com/fast-ai-imageclas/oxford-102-flowers.tgz and extract the archive
cat valid.txt train.txt > mod_test.txt
cat test.txt > mod_train.txt
wget https://raw.githubusercontent.com/SLAMPAI/large-scale-pretraining-transfer/master/scripts/datasets/flowers_to_image_folder.py;python flowers_to_image_folder.py
, this will create a foldermod_train
and a foldermod_test
- Move the folder to
oxford-102-flowers
todatasets
- Download the dataset from https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet.tgz and extract the archive
- Put the folder
oxford-iiit-pet
indatasets
- Download the dataset from https://www.kaggle.com/andyczhao/covidx-cxr2/version/3 and extract the archive inside a new folder
COVIDx-CXR2
- Put the folder
COVIDx-CXR2
indatasets
- Download the dataset from https://www.kaggle.com/kmader/pulmonary-chest-xray-abnormalities/version/1 and extract the archive inside a new folder
Tuberculosis_dataset
- Put the folder
Tuberculosis_dataset
indatasets
Note that you need Horovod to execute pre-training experiments. Check https://horovod.readthedocs.io/en/stable/running_include.html or https://horovod.readthedocs.io/en/stable/mpi.html to see how to run Horovod, depending on your setup.
For instance, here is how you can run pre-training with Horovod, using 4 GPUs:
horovodrun -np 4 python pretrain.py --config-file configs/chexpert_mimic_nih_padchest_bit50x1.yaml
This will run pre-training for a ResNet-50x1 BiT model, on the concatenation of CheXpert, MIMIC-CXR, NIH Chest-Xray and PadChest.
You can check the other config files in configs/
for other pre-training experiments, and run them in the same manner.
We provide models with pre-trained weights different network sizes (ResNet-50x1, ResNet-152x4) and on various source datasets of different type and size. All models are available at https://bit.ly/34MYsBc.
Each model has its own folder, named following the template <DATASET>_<MODEL>
,
e.g., chexpert_mimic_nih_padchest_bit152x4
is a ResNet152x4 pre-trained on
the concatenation of CheXpert, MIMIC-CXR, NIH Chest-Xray and PadChest.
You can use the script scripts/download_model.sh
to download a pre-trained model,
by providing its name.
For instance, to download chexpert_mimic_nih_padchest_bit152x4
, you can use:
bash scripts/download_model.sh chexpert_mimic_nih_padchest_bit152x4
python finetune.py --pretrain-config-file configs/imagenet1k_bit50x1.yaml --finetune-config-file configs/finetune/cifar10.yaml --logdir cifar10_finetuning
This will fine-tune an R50x1 (pre-trained on ImageNet-1k) on CIFAR-10.
The file configs/finetune/cifar10.yaml
contains the hyper-parmeters used in fine-tuning.
Inside the log directory cifar10_finetuning
, you will find a log file and a tensorboard file
that you can use to visualize the learning curve with different metrics.
python finetune.py --pretrain-config-file configs/chexpert_mimic_nih_padchest_bit50x1.yaml --finetune-config-file configs/finetune/tuberculosis_full.yaml --logdir tuberculosis_finetuning
This will fine-tune on an R50x1 (pre-trained on the concatenation of CheXpert, MIMIC-CXR, NIH Chest-Xray and PadChest) on the Tuberculosis dataset.
The file configs/finetune/tuberculosis_full.yaml
contains the hyper-parameters used in fine-tuning.
Inside the log directory tuberculosis_finetuning
, you will find a log file and a tensorboard file
that you can use to visualize the learning curve with different metrics.
You can also find a fine-tuning example with Tuberculosis in the Colab Notebook
You can also fine-tune one of the pre-trained models on a new dataset. You might need a different data loader depending on your dataset structure. The easiest would be to use an image folder compatible with TorchVision's ImageFolder, where each subfolder of the image folder contains the images belonging to one of the classes.
Following are the steps to fine-tune on a dataset with an image folder structure.
cp configs/finetune/template_image_folder.yaml configs/finetune/your_new_dataset.yaml
- change
train_dir
by the training directory - change
val_dir
by the val or test directory - change
nb_classes
by the number of classes - Train, using for instance
python finetune.py --pretrain-config-file configs/chexpert_mimic_nih_padchest_bit50x1.yaml --finetune-config-file configs/finetune/your_new_dataset.yaml --logdir your_new_dataset_finetuning
We provide all the results as a set of CSV files in the folder results
.
You can use the notebook in notebooks/plots.ipynb
to regenerate the figures from the paper.
If you find this work helpful, please cite our paper:
@article{cherti2021effect,
title={Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images},
author={Cherti, Mehdi and Jitsev, Jenia},
journal={arXiv preprint arXiv:2106.00116},
year={2021}
}
- Thanks to BiT authors https://github.com/google-research/big_transfer. We used BiT architecture, training procedures and BiT-HyperRule from their code.
- Thanks to TorchXrayVision authors https://github.com/mlmed/torchxrayvision. We used their dataset classes for medical data (CheXpert, MIMIC-CXR, NIH Chest-Xray, PadChest).
- Thanks to Horovod authors https://github.com/horovod/horovod, which was used for distributed training. The skeleton of pre-training code is based on https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_imagenet_resnet50.py from Horovod official examples.
- Thanks to https://github.com/rwightman/pytorch-image-models. The structure of the code was inspired from it.
- Thanks to creators and maintainers of openly available X-Ray medical imaging datasets (CheXpert, MIMIC-CXR, NIH Chest-Xray, PadChest, COVIDx, Tuberculosis) that enabled our research
- The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputers JUWELS, JUWELS Booster at Jülich Supercomputing Centre (JSC). We also acknowledge computing resources from the Helmholtz Data Federation and further computing time provided on supercomputer JUSUF in frame of offer for epidemiology research on COVID-19 by JSC.