Predicting transcription rate from multiplexed protein maps using deep learning


Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. References
  5. Acknowledgements
  6. Contact

About The Project

By means of fluorescent antibodies it is possible to observe the amount of nascent RNA within the nucleus of a cell, and thus to estimate its Transcription Rate (TR). But what about the other molecules, proteins and organelles inside the cell nucleus? Is it possible to estimate the TR using only the shape and distribution of these subnuclear components? Using multichannel images of single-cell nuclei (obtained through the Multiplexed Protein Maps (MPM) protocol [1]) and Convolutional Neural Networks (CNNs), we show that it is. By applying pre-processing and data augmentation techniques, we reduce the information carried by raw pixel intensities and by the correlations between channels. This forces the CNN to focus mainly on the information provided by the location, size and distribution of elements within the cell nucleus. For this task we tried different architectures, from a simple CNN (with only 160k parameters) to more complex architectures such as ResNet50V2 or Xception (with more than 20M parameters). Furthermore, through the interpretability methods Integrated Gradients (IG) [2] and VarGrad (VG) [3], we obtained score maps showing which pixels the CNN considered relevant when predicting the TR for each input nucleus image. The analysis of these score maps reveals how, as the TR changes, the CNN focuses on different proteins and areas of the nucleus. This shows that interpretability methods can help us understand how a CNN makes its predictions and learn from it, which has the potential to guide new discoveries in the field of biology.

You can find the complete explanation and development of this work in Manuscript/Thesis_Andres_Becker.pdf.

Built With

Important
There is a bug in the TensorFlow function tf.image.central_crop that does not allow passing a tensor as the central_fraction argument, which this work requires. The bug was fixed in TensorFlow 2.5. Therefore, you can either use TF 2.5 (or later) or manually replace the library file image_ops_impl.py on your local machine with the patched version (see the reference below).
Reference: https://github.com/tensorflow/tensorflow/pull/45613/files.
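If you prefer to fail fast instead of hitting the bug at crop time, a small guard can check the running TensorFlow version up front. This helper is a sketch for illustration, not code from the repo; at runtime you would pass it tf.__version__:

```python
def has_central_crop_fix(version: str) -> bool:
    """Return True when this TensorFlow version accepts a tensor for the
    central_fraction argument of tf.image.central_crop.
    The fix landed in TF 2.5 (tensorflow/tensorflow PR #45613)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (2, 5)
```

At startup one could then do `assert has_central_crop_fix(tf.__version__)` before building the input pipeline.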

Getting Started

To get a local copy up and running follow these simple steps.

Prerequisites

A running installation of Anaconda. If you haven't installed Anaconda yet, you can follow this tutorial:
Anaconda Installation

Installation

  1. Clone the repo
    git clone https://github.com/andresbecker/master_thesis.git
  2. Install the environment
    You can do it either by loading the YML file
    conda env create -f conda_environment.yml
    or step by step
    1. Create and activate the environment
      conda create -n mpm_inter_env python=3.8
      conda activate mpm_inter_env
    2. Install the needed packages
      conda install tensorflow=2.5 tensorboard tensorflow-datasets numpy
      conda install matplotlib seaborn
      conda install jupyterlab
      # To build TensorFlow Datasets
      pip install -q tfds-nightly
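After installing, a quick sanity check that the key packages resolved can save a debugging round later. This snippet is illustrative, not part of the repo; the package list comes from the conda/pip commands above (note that tensorflow-datasets is imported as tensorflow_datasets):

```python
from importlib.util import find_spec

# Import names of the packages installed above.
REQUIRED = ["tensorflow", "tensorboard", "tensorflow_datasets",
            "numpy", "matplotlib", "seaborn"]

def missing_packages(names):
    """Return the names that cannot be resolved in the current environment."""
    return [name for name in names if find_spec(name) is None]
```

Running `missing_packages(REQUIRED)` inside the activated mpm_inter_env should return an empty list.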

Usage

This implementation is divided into five main steps

  1. Raw data preprocessing (transformation from text files to multichannel images of single-cell nuclei).
    This can be done in two ways: 1) interactively, or 2) non-interactively.

    1. Interactively (execute notebook manually)

      1. Activate the environment and open jupyter-lab
        conda activate mpm_inter_env
        jupyter-lab
      2. Run the raw data preprocessing notebook
        Using the Jupyter navigator, open the notebook workspace/notebooks/MPPData_into_images_no_split.ipynb and replace the variable PARAMETERS_FILE with the absolute path and name of the file containing your input parameters. You can find the parameter file used for this work here.
        You can look at a dummy example (and parameters) of the raw data preprocessing in this notebook.
        Also, you can find an explanation of the preprocessing input parameters on appendix A.1 of Manuscript/Thesis_Andres_Becker.pdf.
    2. Non-interactively (execute notebook in the background)
      For this you must use the script workspace/scripts/Run_Jupyter_Notebook_from_Terminal.sh

      cd /workspace/scripts
      ./Run_Jupyter_Notebook_from_Terminal.sh -i ../notebooks/MPPData_into_images_no_split.ipynb -p ./Data_Preprocessing/Parameters/MppData_to_imgs_no_split.json -e mpm_inter_env

      This script creates a copy of the specified notebook, loads the specified conda environment, uses the specified parameters file, runs the copy of the notebook in the background, and saves it as another Jupyter notebook. After the execution is done, the script renames and saves the executed notebook using the name of the input parameters file plus the date and time when the execution started, in a directory called NB_output located in the same directory as the input notebook (e.g. workspace/notebooks/NB_output/MppData_to_imgs_no_split_040121_1002.ipynb).
      This approach is very useful when you need to run your notebooks on a server without access to a graphical interface, or when the job needs to be executed by a workload manager like SLURM.
      For this approach it is very important that the notebook workspace/notebooks/MPPData_into_images_no_split.ipynb remains unchanged (keep it as a template), especially the line where the input parameter file is specified (PARAMETERS_FILE = 'dont_touch_me-input_parameters_file'). You can find the SLURM file used for the raw data preprocessing here.
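The naming convention the script uses for its output notebook can be sketched as follows (a rough Python reimplementation for illustration only; the DDMMYY_HHMM timestamp format is inferred from the example above):

```python
from pathlib import Path

def executed_notebook_path(notebook: str, params_file: str,
                           timestamp: str) -> str:
    """Build the path where the executed copy of a notebook is saved:
    <notebook_dir>/NB_output/<parameters_file_stem>_<timestamp>.ipynb"""
    nb = Path(notebook)
    out_name = f"{Path(params_file).stem}_{timestamp}.ipynb"
    return str(nb.parent / "NB_output" / out_name)
```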

  2. TensorFlow dataset (TFDS) creation

    1. Go to the directory where the python scripts to create the TFDSs are
      cd workspace/tf_datasets_scripts
    2. Specify the parameters for the dataset (like perturbations, wells, output channel, etc)
      vi ./MPP_DS_normal_DMSO_z_score/Parameters/my_tf_dataset_parameters.json
    3. Build the TFDS using the script Create_tf_dataset.sh
      ./Create_tf_dataset.sh -o /path_to_store_the_TFDS/tensorflow_datasets -n MPP_DS_normal_DMSO_z_score -p ./MPP_DS_normal_DMSO_z_score/Parameters/my_tf_dataset_parameters.json -e mpm_inter_env

    You can find the parameter file used for this work here. Also, you can build a dummy TFDS (and parameters) by executing the following

    cd /path_to_this_repo/workspace/tf_datasets_scripts
    ./Create_tf_dataset.sh -o /data/Master_Thesis_data/datasets/tensorflow_datasets -n MPP_DS_normal_DMSO_z_score_dummy -p ./MPP_DS_normal_DMSO_z_score_dummy/Parameters/tf_dataset_parameters_dummy.json -e mpm_inter_env

    Finally, you can find an explanation of the input parameters to build a TFDS on appendix A.2 of Manuscript/Thesis_Andres_Becker.pdf.
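The parameter files are plain JSON, so they can be inspected or generated programmatically. The key names below are made up for illustration; the authoritative list is in appendix A.2 of the thesis:

```python
import json

def parse_tfds_params(text: str) -> dict:
    """Parse a TFDS-building parameter file given as a JSON string."""
    return json.loads(text)

# Hypothetical example -- these key names are illustrative only.
example = """{
    "perturbations": ["normal", "DMSO"],
    "wells": ["I09", "J10"],
    "output_channel": "00_EU"
}"""
```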

  3. Model training
    This can be done in two ways: 1) interactively, or 2) non-interactively.

    1. Interactively (execute notebook manually)

      1. Activate the environment and open jupyter-lab
        conda activate mpm_inter_env
        jupyter-lab
      2. Run the model training notebook
        Using the Jupyter navigator, open the notebook workspace/notebooks/Model_training_class.ipynb and replace the variable PARAMETERS_FILE with the absolute path and name of the file containing your input parameters. You can find the parameters files used for this work here.
        You can look at a dummy model training example (and parameters) in this notebook.
        Also, you can find an explanation of the model training input parameters on appendix A.3 of Manuscript/Thesis_Andres_Becker.pdf.
    2. Non-interactively (execute notebook in the background)
      For this you must again use the script workspace/scripts/Run_Jupyter_Notebook_from_Terminal.sh

      cd /workspace/scripts
      ./Run_Jupyter_Notebook_from_Terminal.sh -i ../notebooks/Model_training_class.ipynb -p ./Model_training/Thesis_final_results/Parameters/BL/Final_BL_1.json -e mpm_inter_env

      This script creates a copy of the specified notebook, loads the specified conda environment, uses the specified parameters file, runs the copy of the notebook in the background, and saves it as another Jupyter notebook. After the execution is done, the script renames and saves the executed notebook using the name of the input parameters file plus the date and time when the execution started, in a directory called NB_output located in the same directory as the input notebook (e.g. workspace/notebooks/NB_output/Final_BL_1_040121_1002.ipynb).
      This approach is very useful when you need to run your notebooks on a server without access to a graphical interface, or when the job needs to be executed by a workload manager like SLURM.
      For this approach it is very important that the notebook workspace/notebooks/Model_training_class.ipynb remains unchanged (keep it as a template), especially the line where the input parameter file is specified (PARAMETERS_FILE = 'dont_touch_me-input_parameters_file').

  4. Model interpretation. Score maps creation

    1. Go to the directory where the python scripts for interpretability methods are
      cd workspace/Interpretability/Python_scripts
    2. Specify the parameters for the interpretability methods (IG number of steps, output dir, etc.)
      vi ./Parameters/my_parameters_file.json
    3. Create the score maps
      1. Using the python script get_VarGradIG_from_TFDS_V2.py directly
        conda activate mpm_inter_env
        python get_VarGradIG_from_TFDS_V2.py -i ./Parameters/my_parameters_file.json
      2. Through the bash script workspace/Interpretability/Python_scripts/Run_pyhton_script.sh
        ./Run_pyhton_script.sh -e mpm_inter_env -s ./get_VarGradIG_from_TFDS_V2.py -p ./Parameters/my_parameters_file.json

    You can find the parameter file used for this work here. Also, you can create dummy score maps using the Bash script Run_pyhton_script.sh

    ./Run_pyhton_script.sh -e mpm_inter_env -s ./get_VarGradIG_from_TFDS_V2.py -p ./Parameters/Simple_CNN_dummy.json

    You can find an explanation of the input parameters for the interpretability methods on appendix A.4 of Manuscript/Thesis_Andres_Becker.pdf.
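For orientation, the core idea of Integrated Gradients [2] fits in a few lines: average the gradient along the straight path from a baseline to the input, then scale by the input difference. Below is a toy one-dimensional sketch with a midpoint Riemann sum; the repo's get_VarGradIG_from_TFDS_V2.py performs the analogous computation on image tensors:

```python
def integrated_gradients_1d(grad_f, x, baseline=0.0, steps=50):
    """Approximate IG(x) = (x - baseline) * integral_0^1
    f'(baseline + t * (x - baseline)) dt with a midpoint Riemann sum."""
    total = 0.0
    for k in range(1, steps + 1):
        t = (k - 0.5) / steps  # midpoint of the k-th subinterval
        total += grad_f(baseline + t * (x - baseline))
    return (x - baseline) * total / steps
```

For f(x) = x² with baseline 0, the attribution recovers f(x) - f(0), illustrating the completeness axiom of IG.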

  5. Model interpretation. Score maps analysis
    The analysis of the score maps is done in three parts:

    1. The channel importance analysis, which is done in the Jupyter notebook workspace/Interpretability/VarGrad_channel_importance.ipynb

    2. The analysis of the similarity between score map channels and cell image channels, which is done in the Jupyter notebook workspace/Interpretability/Channel_correlation/VarGrad_channel_similarity_mae.ipynb

    3. The analysis and clustering of score map channels, based on the similarity between each channel and the remaining score map channels. This was not included in the final work, so there is plenty of room for improvement here. However, you can still take a look at the Jupyter notebook workspace/Interpretability/Channel_correlation/VarGrad_channel_similarity_mae.ipynb
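As a rough illustration of what a channel importance analysis computes, one simple summary is the mean absolute attribution per channel of a score map. The function (and the example channel names) below are ours; the exact statistic used in VarGrad_channel_importance.ipynb may differ:

```python
def rank_channels(score_map: dict) -> list:
    """Rank channels by mean absolute attribution, most important first.
    score_map maps a channel name to a flat list of pixel scores."""
    mean_abs = {ch: sum(abs(v) for v in vals) / len(vals)
                for ch, vals in score_map.items()}
    return sorted(mean_abs, key=mean_abs.get, reverse=True)
```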

References

[1] G. Gut, M. D. Herrmann, and L. Pelkmans. “Multiplexed protein maps link subcellular organization to cellular states”. In: Science 361.6401 (2018). issn: 0036-8075. eprint: https://science.sciencemag.org/content/361/6401/eaar7042.full.pdf

[2] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic Attribution for Deep Networks. 2017. arXiv: 1703.01365 [cs.LG].

[3] J. Adebayo, J. Gilmer, I. Goodfellow, and B. Kim. Local Explanation Methods for Deep Neural Networks Lack Sensitivity to Parameter Values. 2018. arXiv: 1810.03307 [cs.CV].

Acknowledgements

  • We warmly thank Scott Berry from Pelkmans Lab (at the University of Zurich) for providing the data for this work.

Contact

Andres Becker - LinkedIn - andres.becker@tum.de

Project Link: https://github.com/andresbecker/master_thesis
