NOTE: Please see our follow-up work in CVPR 2022, which further extends this repository.
Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology, LMRL Workshop, NeurIPS 2021.
[Workshop]
[arXiv]
Richard J. Chen, Rahul G. Krishnan
@article{chen2022self,
title={Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology},
author={Chen, Richard J and Krishnan, Rahul G},
journal={Learning Meaningful Representations of Life, NeurIPS 2021},
year={2021}
}
Summary / Main Findings:
- In a head-to-head comparison of SimCLR and DINO, DINO learns more effective pretrained representations for histopathology, likely because it 1) does not need negative samples (histopathology has a lot of potential class imbalance), and 2) captures better inductive biases about the part-whole hierarchies of how cells are spatially organized in tissue.
- ImageNet features do lag behind SSL methods (in terms of data efficiency), but are better than you might think on patch- and slide-level tasks. Transfer learning with ImageNet features (from a ResNet-50 truncated after the 3rd residual block) gives very decent performance using the CLAM package.
- SSL may help mitigate domain shift from site-specific H&E staining protocols. With vanilla data augmentations, 2D UMAP scatter plots show that the global structure of morphological subtypes (within each class) is better preserved by SSL features than by ImageNet features.
- Self-supervised ViTs are able to localize cells quite well without any supervision. Our results show that ViTs localize visual concepts in histopathology when their attention heads are introspected.
- 06/06/2022: Please see our follow-up work in CVPR 2022, which further extends this repository.
- 03/04/2022: Reproducible and largely-working codebase that I'm satisfied with and have heavily tested.
We use Git LFS to version-control large files in this repository (e.g., images, embeddings, checkpoints). After installing Git LFS, pull these large files by running:
git lfs pull
SimCLR and DINO models were trained for 100 epochs using the vanilla training recipes from their respective papers. These models were developed on 2,055,742 patches (256 x 256 resolution at 20X magnification) extracted from diagnostic slides in the TCGA-BRCA dataset, and evaluated via K-NN on patch-level datasets in histopathology.
Note: Results should be interpreted with respect to the size of the dataset and the duration of training. Ideally, longer training with larger batch sizes would demonstrate larger gains in SSL performance.
Arch | SSL Method | Dataset | Epochs | Dim | K-NN | Download |
---|---|---|---|---|---|---|
ResNet-50 | Transfer | ImageNet | N/A | 1024 | 0.935 | N/A |
ResNet-50 | SimCLR | TCGA-BRCA | 100 | 2048 | 0.938 | Backbone |
ViT-S/16 | DINO | TCGA-BRCA | 100 | 384 | 0.941 | Backbone |
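
For readers unfamiliar with K-NN evaluation of frozen embeddings, the idea can be sketched roughly as follows. This is a minimal illustration and not the evaluation code from this repository. The cosine-similarity metric, the majority vote, `k=3`, the accuracy metric, the helper names (`l2_normalize`, `knn_predict`, `knn_accuracy`), and the toy 2-D "embeddings" with their `tumor`/`normal` labels are all assumptions made for the example. In practice, the embeddings would be the 384-, 1024-, or 2048-dimensional features listed in the table, and a library such as scikit-learn would be a more practical choice than pure Python.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def knn_predict(train_feats, train_labels, query, k=3):
    """Label `query` by majority vote among its k most similar training embeddings."""
    q = l2_normalize(query)
    sims = []
    for feat, label in zip(train_feats, train_labels):
        f = l2_normalize(feat)
        sims.append((sum(a * b for a, b in zip(q, f)), label))
    # Sort by cosine similarity, highest first, and keep the top k neighbours.
    sims.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in sims[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=3):
    """Fraction of test embeddings whose K-NN prediction matches the true label."""
    correct = sum(
        knn_predict(train_feats, train_labels, feat, k) == label
        for feat, label in zip(test_feats, test_labels)
    )
    return correct / len(test_labels)


# Hypothetical toy data standing in for frozen patch embeddings and their labels.
train_feats = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
train_labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
test_feats = [[1.0, 0.0], [0.1, 0.9]]
test_labels = ["tumor", "normal"]

acc = knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=3)
print(f"K-NN accuracy on toy data: {acc:.3f}")
```

Because the backbone stays frozen and K-NN has no trainable parameters, this kind of evaluation probes the quality of the pretrained representation itself rather than the capacity of a classifier trained on top of it.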