This repository contains the code for the following manuscript:
Maximum Mean Discrepancy Kernels for Predictive and Prognostic Modeling of Whole Slide Images, submitted to IEEE International Symposium on Biomedical Imaging (ISBI) 2023 for review.
How similar are two images? In computational pathology, where Whole Slide Images (WSIs) of digitally scanned tissue samples from patients can be multi-gigapixel in size, determining the degree of similarity between two WSIs is a challenging task with a number of practical applications. In this work, we explore a novel strategy based on kernelized Maximum Mean Discrepancy (MMD) analysis for determining pairwise similarity between WSIs. The proposed approach works by calculating the MMD between two WSIs using kernels over deep features of image patches. This allows an entire dataset of WSIs to be represented as a kernel matrix for WSI-level clustering, weakly-supervised prediction of TP-53 mutation status in breast cancer patients from their routine WSIs, and survival analysis with state-of-the-art prediction performance. We believe that this work will open up further avenues for application of WSI-level kernels for predictive and prognostic tasks in computational pathology.
scipy 1.7.3
numpy 1.21.6
matplotlib 3.2.2
geomloss 0.2.4
pandas 1.3.5
torch 1.12.1+cu113
sksurv 0.17.2
lifelines 0.27.4
sklearn 1.0.2
seaborn 0.11.2
tqdm 4.64.1
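To confirm that an environment matches the versions listed above, a small check along the following lines may help. It is only a sketch: the function name `check_versions` is hypothetical, and it assumes Python 3.8+ for `importlib.metadata`. Note that `sksurv` and `sklearn` are import names; their distribution names on PyPI are `scikit-survival` and `scikit-learn`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions from the list above, keyed by distribution name
# (sksurv -> scikit-survival, sklearn -> scikit-learn).
PINNED = {
    "scipy": "1.7.3",
    "numpy": "1.21.6",
    "matplotlib": "3.2.2",
    "geomloss": "0.2.4",
    "pandas": "1.3.5",
    "torch": "1.12.1+cu113",
    "scikit-survival": "0.17.2",
    "lifelines": "0.27.4",
    "scikit-learn": "1.0.2",
    "seaborn": "0.11.2",
    "tqdm": "4.64.1",
}


def check_versions(pinned=PINNED):
    """Return {name: (wanted, installed, matches)} for each pinned package.

    `installed` is None when the package is not installed at all.
    """
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_versions().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:16s} wanted={wanted:14s} installed={installed} [{status}]")
```

A mismatch does not necessarily mean the code will fail; the listed versions are simply the ones the repository states.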
Download the FFPE whole slide images for breast carcinoma (TCGA-BRCA) from the GDC portal (https://portal.gdc.cancer.gov/).
Download the corresponding gene point mutation and Disease Specific Survival data from cBioPortal (https://www.cbioportal.org/).
Use the code under code_data_processing to perform:
- Slide selection: select high quality WSIs from the original dataset
- Tile extraction: extract 512x512 tiles from the large WSI at a spatial resolution of 0.50 microns-per-pixel
- Patches capturing less than 40% informative tissue are discarded
- Stain normalization
- Feature extraction: extract a feature vector for each tile using ShuffleNet pretrained on ImageNet
Details can be found in the paper and code_data_processing.
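The tile extraction and filtering steps above can be sketched roughly as follows. This is an illustration only, not the code in code_data_processing: the tissue test used here (a pixel counts as tissue when its mean RGB intensity is below `BACKGROUND_THRESHOLD`) and the threshold value of 220 are assumptions, and the image is assumed to be already loaded as an RGB array at 0.50 microns-per-pixel. Only the 512x512 tile size and the 40% cutoff come from the list above.

```python
import numpy as np

TILE_SIZE = 512              # tile edge in pixels (from the step list)
MIN_TISSUE_FRACTION = 0.40   # tiles with less tissue are discarded (from the step list)
BACKGROUND_THRESHOLD = 220   # hypothetical: brighter pixels are treated as background


def tissue_fraction(tile):
    """Fraction of pixels in an RGB uint8 tile that are darker than the background threshold."""
    gray = tile.astype(np.float32).mean(axis=2)
    return float((gray < BACKGROUND_THRESHOLD).mean())


def keep_tile(tile):
    """True when the tile contains at least MIN_TISSUE_FRACTION tissue."""
    return tissue_fraction(tile) >= MIN_TISSUE_FRACTION


def iter_tiles(image):
    """Yield (row, col, tile) for non-overlapping full-size tiles of an H x W x 3 array."""
    height, width = image.shape[:2]
    for row in range(0, height - TILE_SIZE + 1, TILE_SIZE):
        for col in range(0, width - TILE_SIZE + 1, TILE_SIZE):
            yield row, col, image[row:row + TILE_SIZE, col:col + TILE_SIZE]


def select_tiles(image):
    """Return the (row, col) origins of tiles that pass the tissue filter."""
    return [(row, col) for row, col, tile in iter_tiles(image) if keep_tile(tile)]
```

The retained tiles would then go through stain normalization and feature extraction, which are not shown here.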
Step 3. MMD Kernel generation for 1024-dimensional feature representations of 652 and 1052 TCGA-BRCA slides
Use the code under MMD_distance_matrix_generator to generate an MMD distance matrix for each slide set.
Details can be found in the paper and MMD_distance_matrix_generator.
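As a rough illustration of what a pairwise MMD distance matrix looks like, the sketch below treats each slide as a bag of tile feature vectors and computes the biased squared-MMD estimate, mean k(x, x') + mean k(y, y') - 2 mean k(x, y), for every pair of slides. It is a plain NumPy stand-in, not the code in MMD_distance_matrix_generator (the dependency list includes geomloss, which the repository may use instead). The Gaussian kernel, the bandwidth `sigma`, and all function names are assumptions made for the example.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel matrix between the rows of a (n x d) and b (m x d)."""
    sq_dist = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))


def mmd2(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two bags of features."""
    k_xx = gaussian_kernel(x, x, sigma).mean()
    k_yy = gaussian_kernel(y, y, sigma).mean()
    k_xy = gaussian_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy


def mmd_distance_matrix(bags, sigma=1.0):
    """Symmetric matrix of pairwise squared-MMD values for a list of feature bags."""
    n = len(bags)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = max(mmd2(bags[i], bags[j], sigma), 0.0)
    return dist


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three toy "slides", each a bag of 1024-dimensional tile features.
    bags = [rng.normal(loc=shift, size=(50, 1024)) for shift in (0.0, 0.0, 0.5)]
    print(mmd_distance_matrix(bags, sigma=32.0))
```

Each slide may contribute a different number of tiles; the estimate only needs the feature dimension to match across bags.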
Use the code under TP53_prediction to train a Support Vector Machine (SVM) with a predefined kernel (generated from the 652-slide distance matrix from Step 3) to predict TP-53 point mutation status for breast cancer patients.
Details can be found in the paper and TP53_prediction.
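A minimal sketch of an SVM with a predefined kernel, using scikit-learn's `SVC(kernel="precomputed")`, is shown below. It is not the code in TP53_prediction: the conversion from distance to kernel via exp(-gamma * D), the values of `gamma` and `C`, and the helper names are all assumptions made for illustration. The point it shows is the indexing: training uses the train-by-train block of the kernel matrix, and prediction uses the test-by-train block.

```python
import numpy as np
from sklearn.svm import SVC


def distance_to_kernel(dist, gamma=1.0):
    """Turn a pairwise distance matrix into a similarity kernel (assumed transform)."""
    return np.exp(-gamma * dist)


def fit_predict_precomputed(kernel, labels, train_idx, test_idx, C=1.0):
    """Train an SVM on the train block of the kernel and score the test samples.

    Returns decision values; positive values favour the class stored in classes_[1].
    """
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(kernel[np.ix_(train_idx, train_idx)], labels[train_idx])
    return clf.decision_function(kernel[np.ix_(test_idx, train_idx)])


if __name__ == "__main__":
    # Toy data: eight samples in two well-separated groups.
    x = np.array([0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3])
    labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    dist = (x[:, None] - x[None, :]) ** 2
    kernel = distance_to_kernel(dist, gamma=0.5)
    scores = fit_predict_precomputed(
        kernel, labels, np.array([0, 1, 2, 4, 5, 6]), np.array([3, 7])
    )
    print(scores)
```

In practice `gamma` and `C` would be tuned by cross-validation on the training split only.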
Use the code under survival_analysis to train a Support Vector Machine (SVM) with a predefined kernel (generated from the 1052-slide distance matrix from Step 3) to predict survival for breast cancer patients.
Details can be found in the paper and survival_analysis.
Some intermediate data are provided in the data folder.
* Joint first authorship