By means of fluorescent antibodies it is possible to observe the amount of nascent RNA within the nucleus of a cell, and thus estimate its Transcription Rate (TR). But what about the other molecules, proteins and organelles inside the cell nucleus? Is it possible to estimate the TR using only the shape and distribution of these subnuclear components? Using multichannel images of single cell nuclei (obtained through the Multiplexed Protein Maps (MPM) protocol [1]) and Convolutional Neural Networks (CNNs), we show that this is possible.
By applying pre-processing and data augmentation techniques, we reduce the information carried by the absolute pixel intensities and by the intensity correlations between channels. This forces the CNN to focus mainly on the information provided by the location, size and distribution of elements within the cell nucleus.
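As an illustration of this idea, the sketch below applies a random per-channel intensity rescaling, which makes absolute intensities and cross-channel intensity correlations unreliable cues while leaving spatial structure intact. The function name and the `max_scale` value are illustrative assumptions, not the exact augmentations used in this work:

```python
import random

def channel_intensity_jitter(image, max_scale=0.3, rng=None):
    """Rescale each channel of an H x W x C image (nested lists here,
    tensors in the real pipeline) by an independent random factor.
    `max_scale=0.3` is an illustrative choice, not the thesis value."""
    rng = rng or random.Random()
    c = len(image[0][0])
    # One random scale per channel, drawn independently.
    scales = [1.0 + rng.uniform(-max_scale, max_scale) for _ in range(c)]
    return [[[px[i] * scales[i] for i in range(c)] for px in row]
            for row in image]
```

Because every channel is rescaled independently on every draw, the network cannot use "channel A is always brighter than channel B" as a shortcut and must rely on where things are rather than how bright they are.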
For this task, different architectures were tried, from a simple CNN (with only 160k parameters) to more complex architectures such as ResNet50V2 or Xception (with more than 20M parameters).
Furthermore, through the interpretability methods Integrated Gradients (IG) [2] and VarGrad (VG) [3], we obtain score maps showing the pixels that the CNN considered relevant when predicting the TR for each input cell nucleus image. The analysis of these score maps reveals that, as the TR changes, the CNN focuses on different proteins and areas of the nucleus. This shows that interpretability methods can help us understand how a CNN makes its predictions and learn from it, which has the potential to guide new discoveries in the field of biology.
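For intuition, IG attributes a prediction by integrating the model's gradients along a straight path from a baseline to the input (VarGrad then takes the variance of such attribution maps over noisy copies of the input). A minimal pure-Python sketch, using a toy model with a known gradient instead of the CNN:

```python
def integrated_gradients(grad_f, x, baseline, steps=200):
    """Approximate IG_i = (x_i - b_i) * integral of df/dx_i along the
    straight path from `baseline` to `x` (Sundararajan et al., 2017)."""
    n = len(x)
    total = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule along the path
        point = [baseline[i] + alpha * (x[i] - baseline[i]) for i in range(n)]
        g = grad_f(point)
        for i in range(n):
            total[i] += g[i]
    return [(x[i] - baseline[i]) * total[i] / steps for i in range(n)]

# Toy model f(x) = x0^2 + 3*x1, with its analytic gradient.
grad_f = lambda p: [2 * p[0], 3.0]
attr = integrated_gradients(grad_f, x=[2.0, 1.0], baseline=[0.0, 0.0])
# Completeness axiom: attributions sum to f(x) - f(baseline).
```

In the thesis setting, `grad_f` is the CNN's gradient with respect to the input image, the baseline is typically a blank image, and the per-pixel attributions form the score map.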
You can find the complete explanation and development of this work in `Manuscript/Thesis_Andres_Becker.pdf`
Important
There is a bug in the TensorFlow function `tf.image.central_crop` that prevents passing a tensor as the `central_fraction` argument, which is needed for this work. This bug was fixed in TensorFlow version 2.5. Therefore, you can either use TF 2.5 or manually replace the library file `image_ops_impl.py` on your local machine with this.
Reference: https://github.com/tensorflow/tensorflow/pull/45613/files.
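For reference, the operation the patched function enables can be sketched in pure Python. `central_crop` below is a toy stand-in for `tf.image.central_crop` (taking a crop fraction that is only known at run time), not its actual implementation:

```python
def central_crop(image, fraction):
    """Keep the central `fraction` of an image's height and width.
    `image` is a nested list of rows; TF operates on tensors instead."""
    h, w = len(image), len(image[0])
    ch, cw = max(1, round(h * fraction)), max(1, round(w * fraction))
    top, left = (h - ch) // 2, (w - cw) // 2
    return [row[left:left + cw] for row in image[top:top + ch]]

img = [[r * 4 + c for c in range(4)] for r in range(4)]
center = central_crop(img, 0.5)  # the 2x2 central block
```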
To get a local copy up and running follow these simple steps.
A running installation of Anaconda. If you haven't installed Anaconda yet, you can follow this tutorial:
Anaconda Installation
- Clone the repo
```
git clone https://github.com/andresbecker/master_thesis.git
```
- Install the environment
You can do it either by loading the `YML` file
```
conda env create -f conda_environment.yml
```
or step by step:
- Create and activate the environment
```
conda create -n mpm_inter_env python=3.8
conda activate mpm_inter_env
```
- Install the needed packages
```
conda install tensorflow=2.5 tensorboard tensorflow-datasets numpy
conda install matplotlib seaborn
conda install jupyterlab
# To build TensorFlow Datasets
pip install -q tfds-nightly
```
This implementation is divided into 4 main steps:
- Raw data preprocessing (transformation from text files to multichannel images of single cell nuclei).
This can be done in two different ways: 1) interactively and 2) non-interactively.
  - Interactively (execute notebook manually)
- Activate the environment and open jupyter-lab
```
conda activate mpm_inter_env
jupyter-lab
```
- Run the raw data preprocessing notebook
Using the Jupyter navigator, open the notebook `workspace/notebooks/MPPData_into_images_no_split.ipynb` and replace the variable `PARAMETERS_FILE` with the absolute path and name of the file containing your input parameters. You can find the parameter file used for this work here.
You can look at a dummy example (and parameters) of the raw data preprocessing in this notebook.
Also, you can find an explanation of the preprocessing input parameters in appendix A.1 of `Manuscript/Thesis_Andres_Becker.pdf`.
  - Non-interactively (execute notebook in the background)
For this you must use the script `workspace/scripts/Run_Jupyter_Notebook_from_Terminal.sh`:
```
cd /workspace/scripts
./Run_Jupyter_Notebook_from_Terminal.sh -i ../notebooks/MPPData_into_images_no_split.ipynb -p ./Data_Preprocessing/Parameters/MppData_to_imgs_no_split.json -e mpm_inter_env
```
This script creates a copy of the specified notebook, loads the specified conda environment, uses the specified parameters file, runs the copy of the notebook in the background and saves it as another Jupyter notebook. After the execution is done, the script renames and saves the executed notebook using the name of the input parameters file, plus the date and time when the execution started, in a directory called `NB_output` located in the same directory as the input notebook (e.g. `workspace/notebooks/NB_output/MppData_to_imgs_no_split_040121_1002.ipynb`).
This approach is very useful when you need to run your notebooks on a server where you don't have access to the graphical interface, or when the job needs to be executed by a workload manager like SLURM.
To use this approach it is very important that the notebook `workspace/notebooks/MPPData_into_images_no_split.ipynb` remains unchanged (keep it as a template), especially the line where the input parameter file is specified (`PARAMETERS_FILE = 'dont_touch_me-input_parameters_file'`). You can find the SLURM file used for the raw data preprocessing here.
- TensorFlow dataset (TFDS) creation
- Go to the directory containing the Python scripts that create the TFDSs
```
cd workspace/tf_datasets_scripts
```
- Specify the parameters for the dataset (like perturbations, wells, output channel, etc.)
```
vi ./MPP_DS_normal_DMSO_z_score/Parameters/my_tf_dataset_parameters.json
```
- Build the TFDS using the script `Create_tf_dataset.sh`
```
./Create_tf_dataset.sh -o /path_to_store_the_TFDS/tensorflow_datasets -n MPP_DS_normal_DMSO_z_score -p ./MPP_DS_normal_DMSO_z_score/Parameters/my_tf_dataset_parameters.json -e mpm_inter_env
```
You can find the parameter file used for this work here. Also, you can build a dummy TFDS (and parameters) by executing the following:
```
cd /path_to_this_repo/workspace/tf_datasets_scripts
./Create_tf_dataset.sh -o /data/Master_Thesis_data/datasets/tensorflow_datasets -n MPP_DS_normal_DMSO_z_score_dummy -p ./MPP_DS_normal_DMSO_z_score_dummy/Parameters/tf_dataset_parameters_dummy.json -e mpm_inter_env
```
Finally, you can find an explanation of the input parameters to build a TFDS in appendix A.2 of `Manuscript/Thesis_Andres_Becker.pdf`.
- Model training
This can be done in two different ways: 1) interactively and 2) non-interactively.
  - Interactively (execute notebook manually)
- Activate the environment and open jupyter-lab
```
conda activate mpm_inter_env
jupyter-lab
```
- Run the model training notebook
Using the Jupyter navigator, open the notebook `workspace/notebooks/Model_training_class.ipynb` and replace the variable `PARAMETERS_FILE` with the absolute path and name of the file containing your input parameters. You can find the parameters files used for this work here.
You can look at a dummy model training example (and parameters) in this notebook.
Also, you can find an explanation of the model training input parameters in appendix A.3 of `Manuscript/Thesis_Andres_Becker.pdf`.
  - Non-interactively (execute notebook in the background)
For this you must again use the script `workspace/scripts/Run_Jupyter_Notebook_from_Terminal.sh`:
```
cd /workspace/scripts
./Run_Jupyter_Notebook_from_Terminal.sh -i ../notebooks/Model_training_class.ipynb -p ./Model_training/Thesis_final_results/Parameters/BL/Final_BL_1.json -e mpm_inter_env
```
This script creates a copy of the specified notebook, loads the specified conda environment, uses the specified parameters file, runs the copy of the notebook in the background and saves it as another Jupyter notebook. After the execution is done, the script renames and saves the executed notebook using the name of the input parameters file, plus the date and time when the execution started, in a directory called `NB_output` located in the same directory as the input notebook (e.g. `workspace/notebooks/NB_output/Final_BL_1_040121_1002.ipynb`).
This approach is very useful when you need to run your notebooks on a server where you don't have access to the graphical interface, or when the job needs to be executed by a workload manager like SLURM.
To use this approach it is very important that the notebook `workspace/notebooks/Model_training_class.ipynb` remains unchanged (keep it as a template), especially the line where the input parameter file is specified (`PARAMETERS_FILE = 'dont_touch_me-input_parameters_file'`).
- Model interpretation: score maps creation
- Go to the directory containing the Python scripts for the interpretability methods
```
cd workspace/Interpretability/Python_scripts
```
- Specify the parameters for the interpretability methods (IG number of steps, output dir, etc.)
```
vi ./Parameters/my_parameters_file.json
```
- Create the score maps, either:
  - Using the Python script `get_VarGradIG_from_TFDS_V2.py` directly
```
conda activate mpm_inter_env
python get_VarGradIG_from_TFDS_V2.py -i ./Parameters/my_parameters_file.json
```
  - Or through the bash script `workspace/Interpretability/Python_scripts/Run_pyhton_script.sh`
```
./Run_pyhton_script.sh -e mpm_inter_env -s ./get_VarGradIG_from_TFDS_V2.py -p ./Parameters/my_parameters_file.json
```
You can find the parameter file used for this work here. Also, you can create dummy score maps using the bash script `Run_pyhton_script.sh`:
```
./Run_pyhton_script.sh -e mpm_inter_env -s ./get_VarGradIG_from_TFDS_V2.py -p ./Parameters/Simple_CNN_dummy.json
```
You can find an explanation of the input parameters for the interpretability methods in appendix A.4 of `Manuscript/Thesis_Andres_Becker.pdf`.
- Model interpretation: score maps analysis
The analysis of the score maps is done in 3 parts:
  - The channel importance analysis, which is done in the Jupyter notebook `workspace/Interpretability/VarGrad_channel_importance.ipynb`
  - The analysis of the similarity between score map channels and cell image channels, which is done in the Jupyter notebook `workspace/Interpretability/Channel_correlation/VarGrad_channel_similarity_mae.ipynb`
  - The analysis and clustering of score map channels, based on the similarity between each channel and the remaining score map channels. This was not included in the final work, so there is a lot of room for improvement here. However, you can still take a look at the Jupyter notebooks `workspace/Interpretability/Channel_correlation/VarGrad_channel_similarity_mae.ipynb` and `workspace/Interpretability/Channel_correlation/VarGrad_channel_similarity_mae.ipynb`
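As a rough sketch of the channel-importance idea, one can rank score map channels by their mean absolute attribution; the notebook's exact aggregation may differ from this illustrative version:

```python
def channel_importance(score_map):
    """Rank the channels of an H x W x C score map (nested lists) by
    mean absolute attribution, highest first. Returns the channel
    ranking and the per-channel means."""
    h, w = len(score_map), len(score_map[0])
    c = len(score_map[0][0])
    means = [
        sum(abs(score_map[r][col][i]) for r in range(h) for col in range(w)) / (h * w)
        for i in range(c)
    ]
    order = sorted(range(c), key=lambda i: means[i], reverse=True)
    return order, means
```

Applied to a VarGrad score map, a channel (i.e. a protein stain) that consistently receives large attributions ranks high, which is how one can read off which proteins the CNN relies on at a given TR.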
[1] G. Gut, M. D. Herrmann, and L. Pelkmans. “Multiplexed protein maps link subcellular organization to cellular states”. In: Science 361.6401 (2018). issn: 0036-8075. eprint: https://science.sciencemag.org/content/361/6401/eaar7042.full.pdf
[2] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic Attribution for Deep Networks. 2017. arXiv: 1703.01365 [cs.LG].
[3] J. Adebayo, J. Gilmer, I. Goodfellow, and B. Kim. Local Explanation Methods for Deep Neural Networks Lack Sensitivity to Parameter Values. 2018. arXiv: 1810.03307 [cs.CV].
- We warmly thank Scott Berry from the Pelkmans Lab (University of Zurich) for providing the data for this work.
Andres Becker - LinkedIn - andres.becker@tum.de
Project Link: https://github.com/andresbecker/master_thesis