Predicting transcription rate from multiplexed protein maps using deep learning


Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. References
  5. Acknowledgements
  6. Contact

About The Project

By means of fluorescent antibodies it is possible to observe the amount of nascent RNA within the nucleus of a cell, and thus to estimate its Transcription Rate (TR). But what about the other molecules, proteins and organelles inside the cell nucleus? Is it possible to estimate the TR using only the shape and distribution of these subnuclear components? Using multichannel images of single-cell nuclei (obtained through the Multiplexed Protein Maps (MPM) protocol [1]) and Convolutional Neural Networks (CNNs), we show that it is. By applying pre-processing and data augmentation techniques, we reduce the information carried by raw pixel intensities and by the correlations between channels. This forces the CNN to focus mainly on the information provided by the location, size and distribution of elements within the cell nucleus. For this task we tried different architectures, from a simple CNN (with only 160k parameters) to more complex architectures such as ResNet50V2 or Xception (with more than 20M parameters). Furthermore, through the interpretability methods Integrated Gradients (IG) [2] and VarGrad (VG) [3], we obtained score maps showing which pixels the CNN considered relevant when predicting the TR for each input nucleus image. The analysis of these score maps reveals how, as the TR changes, the CNN focuses on different proteins and areas of the nucleus. This shows that interpretability methods can help us understand how a CNN makes its predictions and learn from it, which has the potential to guide new discoveries in the field of biology.

You can find the complete explanation and development of this work in Manuscript/Thesis_Andres_Becker.pdf.

Built With

Important
There is a bug in the TensorFlow function tf.image.central_crop that does not allow passing a tensor as the central_fraction argument, which this work requires. The bug was fixed in TensorFlow 2.5. Therefore, you can either use TF 2.5 (or later) or manually replace the library file image_ops_impl.py on your local machine with the patched version (see the reference below).
Reference: https://github.com/tensorflow/tensorflow/pull/45613/files.
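If you prefer to fail fast instead of hitting the bug at crop time, a small guard can check the running TensorFlow version up front. This helper is a sketch for illustration, not code from the repo; at runtime you would pass it tf.__version__:

```python
def has_central_crop_fix(version: str) -> bool:
    """Return True when this TensorFlow version accepts a tensor for the
    central_fraction argument of tf.image.central_crop.
    The fix landed in TF 2.5 (tensorflow/tensorflow PR #45613)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (2, 5)
```

At startup one could then do `assert has_central_crop_fix(tf.__version__)` before building the input pipeline.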

Getting Started

To get a local copy up and running follow these simple steps.

Prerequisites

A running installation of Anaconda. If you haven't installed Anaconda yet, you can follow this tutorial:
Anaconda Installation

Installation

  1. Clone the repo
    git clone https://github.com/andresbecker/master_thesis.git
  2. Install the environment
    You can do it either by loading the YML file
    conda env create -f conda_environment.yml
    or step by step
    1. Create and activate the environment
      conda create -n mpm_inter_env python=3.8
      conda activate mpm_inter_env
    2. Install the needed packages
      conda install tensorflow=2.5 tensorboard tensorflow-datasets numpy
      conda install matplotlib seaborn
      conda install jupyterlab
      # To build TensorFlow Datasets
      pip install -q tfds-nightly
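After installing, a quick sanity check that the key packages resolved can save a debugging round later. This snippet is illustrative, not part of the repo; the package list comes from the conda/pip commands above (note that tensorflow-datasets is imported as tensorflow_datasets):

```python
from importlib.util import find_spec

# Import names of the packages installed above.
REQUIRED = ["tensorflow", "tensorboard", "tensorflow_datasets",
            "numpy", "matplotlib", "seaborn"]

def missing_packages(names):
    """Return the names that cannot be resolved in the current environment."""
    return [name for name in names if find_spec(name) is None]
```

Running `missing_packages(REQUIRED)` inside the activated mpm_inter_env should return an empty list.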

Usage

This implementation is divided into five main steps

  1. Raw data preprocessing (transformation from text files to multichannel images of single-cell nuclei).
    This can be done in two ways: 1) interactively, or 2) non-interactively.

    1. Interactively (execute notebook manually)

      1. Activate the environment and open jupyter-lab
        conda activate mpm_inter_env
        jupyter-lab
      2. Run the raw data preprocessing notebook
        Using the Jupyter navigator, open the notebook workspace/notebooks/MPPData_into_images_no_split.ipynb and replace the variable PARAMETERS_FILE with the absolute path and name of the file containing your input parameters. You can find the parameter file used for this work here.
        You can look at a dummy example (and parameters) of the raw data preprocessing in this notebook.
        Also, you can find an explanation of the preprocessing input parameters on appendix A.1 of Manuscript/Thesis_Andres_Becker.pdf.
    2. Non-interactively (execute notebook in the background)
      For this you must use the script workspace/scripts/Run_Jupyter_Notebook_from_Terminal.sh

      cd /workspace/scripts
      ./Run_Jupyter_Notebook_from_Terminal.sh -i ../notebooks/MPPData_into_images_no_split.ipynb -p ./Data_Preprocessing/Parameters/MppData_to_imgs_no_split.json -e mpm_inter_env

      This script creates a copy of the specified notebook, loads the specified conda environment, uses the specified parameters file, runs the copy of the notebook in the background, and saves it as another Jupyter notebook. After the execution is done, the script renames and saves the executed notebook using the name of the input parameters file plus the date and time when the execution started, in a directory called NB_output located in the same directory as the input notebook (e.g. workspace/notebooks/NB_output/MppData_to_imgs_no_split_040121_1002.ipynb).
      This approach is very useful when you need to run your notebooks on a server without access to a graphical interface, or when the job needs to be executed by a workload manager like SLURM.
      For this approach it is very important that the notebook workspace/notebooks/MPPData_into_images_no_split.ipynb remains unchanged (keep it as a template), especially the line where the input parameter file is specified (PARAMETERS_FILE = 'dont_touch_me-input_parameters_file'). You can find the SLURM file used for the raw data preprocessing here.
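The naming convention the script uses for its output notebook can be sketched as follows (a rough Python reimplementation for illustration only; the DDMMYY_HHMM timestamp format is inferred from the example above):

```python
from pathlib import Path

def executed_notebook_path(notebook: str, params_file: str,
                           timestamp: str) -> str:
    """Build the path where the executed copy of a notebook is saved:
    <notebook_dir>/NB_output/<parameters_file_stem>_<timestamp>.ipynb"""
    nb = Path(notebook)
    out_name = f"{Path(params_file).stem}_{timestamp}.ipynb"
    return str(nb.parent / "NB_output" / out_name)
```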

  2. TensorFlow dataset (TFDS) creation

    1. Go to the directory where the python scripts to create the TFDSs are
      cd workspace/tf_datasets_scripts
    2. Specify the parameters for the dataset (like perturbations, wells, output channel, etc)
      vi ./MPP_DS_normal_DMSO_z_score/Parameters/my_tf_dataset_parameters.json
    3. Build the TFDS using the script Create_tf_dataset.sh
      ./Create_tf_dataset.sh -o /path_to_store_the_TFDS/tensorflow_datasets -n MPP_DS_normal_DMSO_z_score -p ./MPP_DS_normal_DMSO_z_score/Parameters/my_tf_dataset_parameters.json -e mpm_inter_env

    You can find the parameter file used for this work here. Also, you can build a dummy TFDS (and parameters) by executing the following

    cd /path_to_this_repo/workspace/tf_datasets_scripts
    ./Create_tf_dataset.sh -o /data/Master_Thesis_data/datasets/tensorflow_datasets -n MPP_DS_normal_DMSO_z_score_dummy -p ./MPP_DS_normal_DMSO_z_score_dummy/Parameters/tf_dataset_parameters_dummy.json -e mpm_inter_env

    Finally, you can find an explanation of the input parameters to build a TFDS on appendix A.2 of Manuscript/Thesis_Andres_Becker.pdf.
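The parameter files are plain JSON, so they can be inspected or generated programmatically. The key names below are made up for illustration; the authoritative list is in appendix A.2 of the thesis:

```python
import json

def parse_tfds_params(text: str) -> dict:
    """Parse a TFDS-building parameter file given as a JSON string."""
    return json.loads(text)

# Hypothetical example -- these key names are illustrative only.
example = """{
    "perturbations": ["normal", "DMSO"],
    "wells": ["I09", "J10"],
    "output_channel": "00_EU"
}"""
```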

  3. Model training
    This can be done in two ways: 1) interactively, or 2) non-interactively.

    1. Interactively (execute notebook manually)

      1. Activate the environment and open jupyter-lab
        conda activate mpm_inter_env
        jupyter-lab
      2. Run the model training notebook
        Using the Jupyter navigator, open the notebook workspace/notebooks/Model_training_class.ipynb and replace the variable PARAMETERS_FILE with the absolute path and name of the file containing your input parameters. You can find the parameters files used for this work here.
        You can look at a dummy model training example (and parameters) in this notebook.
        Also, you can find an explanation of the model training input parameters on appendix A.3 of Manuscript/Thesis_Andres_Becker.pdf.
    2. Non-interactively (execute notebook in the background)
      For this you must again use the script workspace/scripts/Run_Jupyter_Notebook_from_Terminal.sh

      cd /workspace/scripts
      ./Run_Jupyter_Notebook_from_Terminal.sh -i ../notebooks/Model_training_class.ipynb -p ./Model_training/Thesis_final_results/Parameters/BL/Final_BL_1.json -e mpm_inter_env

      This script creates a copy of the specified notebook, loads the specified conda environment, uses the specified parameters file, runs the copy of the notebook in the background, and saves it as another Jupyter notebook. After the execution is done, the script renames and saves the executed notebook using the name of the input parameters file plus the date and time when the execution started, in a directory called NB_output located in the same directory as the input notebook (e.g. workspace/notebooks/NB_output/Final_BL_1_040121_1002.ipynb).
      This approach is very useful when you need to run your notebooks on a server without access to a graphical interface, or when the job needs to be executed by a workload manager like SLURM.
      For this approach it is very important that the notebook workspace/notebooks/Model_training_class.ipynb remains unchanged (keep it as a template), especially the line where the input parameter file is specified (PARAMETERS_FILE = 'dont_touch_me-input_parameters_file').

  4. Model interpretation. Score maps creation

    1. Go to the directory where the python scripts for interpretability methods are
      cd workspace/Interpretability/Python_scripts
    2. Specify the parameters for the interpretability methods (IG number of steps, output dir, etc.)
      vi ./Parameters/my_parameters_file.json
    3. Create the score maps
      1. Using the python script get_VarGradIG_from_TFDS_V2.py directly
        conda activate mpm_inter_env
        python get_VarGradIG_from_TFDS_V2.py -i ./Parameters/my_parameters_file.json
      2. Through the bash script workspace/Interpretability/Python_scripts/Run_pyhton_script.sh
        ./Run_pyhton_script.sh -e mpm_inter_env -s ./get_VarGradIG_from_TFDS_V2.py -p ./Parameters/my_parameters_file.json

    You can find the parameter file used for this work here. Also, you can create dummy score maps using the Bash script Run_pyhton_script.sh

    ./Run_pyhton_script.sh -e mpm_inter_env -s ./get_VarGradIG_from_TFDS_V2.py -p ./Parameters/Simple_CNN_dummy.json

    You can find an explanation of the input parameters for the interpretability methods on appendix A.4 of Manuscript/Thesis_Andres_Becker.pdf.
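For orientation, the core idea of Integrated Gradients [2] fits in a few lines: average the gradient along the straight path from a baseline to the input, then scale by the input difference. Below is a toy one-dimensional sketch with a midpoint Riemann sum; the repo's get_VarGradIG_from_TFDS_V2.py performs the analogous computation on image tensors:

```python
def integrated_gradients_1d(grad_f, x, baseline=0.0, steps=50):
    """Approximate IG(x) = (x - baseline) * integral_0^1
    f'(baseline + t * (x - baseline)) dt with a midpoint Riemann sum."""
    total = 0.0
    for k in range(1, steps + 1):
        t = (k - 0.5) / steps  # midpoint of the k-th subinterval
        total += grad_f(baseline + t * (x - baseline))
    return (x - baseline) * total / steps
```

For f(x) = x² with baseline 0, the attribution recovers f(x) - f(0), illustrating the completeness axiom of IG.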

  5. Model interpretation. Score maps analysis
    The analysis of the score maps is done in three parts:

    1. The channel importance analysis, which is done in the Jupyter notebook workspace/Interpretability/VarGrad_channel_importance.ipynb

    2. The analysis of the similarity between score map channels and cell image channels, which is done in the Jupyter notebook workspace/Interpretability/Channel_correlation/VarGrad_channel_similarity_mae.ipynb

    3. The analysis and clustering of score map channels, based on the similarity between each channel and the remaining score map channels. This was not included in the final work, so there is plenty of room for improvement here. However, you can still take a look at the Jupyter notebook workspace/Interpretability/Channel_correlation/VarGrad_channel_similarity_mae.ipynb
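As a rough illustration of what a channel importance analysis computes, one simple summary is the mean absolute attribution per channel of a score map. The function (and the example channel names) below are ours; the exact statistic used in VarGrad_channel_importance.ipynb may differ:

```python
def rank_channels(score_map: dict) -> list:
    """Rank channels by mean absolute attribution, most important first.
    score_map maps a channel name to a flat list of pixel scores."""
    mean_abs = {ch: sum(abs(v) for v in vals) / len(vals)
                for ch, vals in score_map.items()}
    return sorted(mean_abs, key=mean_abs.get, reverse=True)
```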

References

[1] G. Gut, M. D. Herrmann, and L. Pelkmans. “Multiplexed protein maps link subcellular organization to cellular states”. In: Science 361.6401 (2018). issn: 0036-8075. eprint: https://science.sciencemag.org/content/361/6401/eaar7042.full.pdf

[2] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic Attribution for Deep Networks. 2017. arXiv: 1703.01365 [cs.LG].

[3] J. Adebayo, J. Gilmer, I. Goodfellow, and B. Kim. Local Explanation Methods for Deep Neural Networks Lack Sensitivity to Parameter Values. 2018. arXiv: 1810.03307 [cs.CV].

Acknowledgements

  • We warmly thank Scott Berry from Pelkmans Lab (at the University of Zurich) for providing the data for this work.

Contact

Andres Becker - LinkedIn - andres.becker@tum.de

Project Link: https://github.com/andresbecker/master_thesis
