paccmann_sarscov2

Pipeline to reproduce the results of the paper Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2 (Machine Learning: Science and Technology, 2021). In that paper, we propose a de novo generative model for protein-driven molecular design and bundle it with molecular retrosynthesis models to automate all steps before the actual synthesis of a drug candidate.

Graphical abstract

Description

In the repo we provide a conda environment and instructions to reproduce the pipeline described in the manuscript:

  1. Train a multimodal protein-compound interaction classifier, also known as the affinity predictor (source code)
  2. Train a toxicity predictor (source code)
  3. Train a generative model for encoded proteins, also known as the ProteinVAE (source code)
  4. Train a generative model for molecules, also known as the SELFIESVAE (source code)
  5. Train PaccMann^RL on SARS-CoV-2 using the pretrained models from above (source code)

NOTE: In the linked repositories, there are often multiple examples for training. For the use case of paccmann_sarscov2, relevant examples are named affinity or encoded_proteins.

Requirements

  • conda>=3.7
  • The following data from this Box link.
    See the respective README.md files for details on the data sources.
  • The git repos linked in the previous section

Setup

Install the environment

Create a conda environment:

conda env create -f conda.yml

Activate the environment:

conda activate paccmann_sarscov2

NOTE: On Ubuntu, you may also need to run the following to obtain a functional RDKit distribution:

sudo apt-get install libxrender1
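
To see whether the extra package is needed at all, a quick import test can help. The sketch below assumes that a missing libXrender surfaces as an import error in `rdkit.Chem.Draw`, which is the usual symptom but not something this README states. The function name `check_rdkit` and the `PYTHON` override are hypothetical helpers introduced for illustration.

```shell
# check_rdkit: try to import RDKit's drawing module with the active interpreter.
# Assumption: a missing libXrender shows up as an ImportError on this import.
# Set PYTHON to use a different interpreter (defaults to "python").
check_rdkit() {
    if "${PYTHON:-python}" -c 'from rdkit.Chem import Draw'; then
        echo "RDKit drawing import succeeded"
    else
        echo "RDKit drawing import failed; see the error above" >&2
        return 1
    fi
}
```

Run `check_rdkit` inside the activated paccmann_sarscov2 environment. If it fails with an error mentioning libXrender, installing libxrender1 as shown above is the likely fix.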

Download data and pretrained models

Download the data as described in the Requirements section. From now on, we assume that it is stored in the root of the repository in a folder called data, with this structure:

data
├── pretraining
│   ├── ProteinVAE
│   ├── SELFIESVAE
│   ├── affinity_predictor
│   ├── language_models
│   └── toxicity_predictor
└── training

This is around 6GB of data, required for pretraining multiple models. The full pipeline is also computationally intensive, and running all steps on a desktop or laptop might not be straightforward.

For these reasons we also provide pretrained models (ca. 700MB) for download.

Once the download of the pretrained models is completed, the directory structure looks like this:

models
├── ProteinVAE
├── SELFIESVAE
├── Tox21
└── affinity
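
Before starting any training, it can be worth confirming that the folders match the two layouts shown above. The following is a minimal sketch; `check_layout` is a hypothetical helper, not part of this repository, and it only checks that the directories exist, not that their contents are complete.

```shell
# check_layout ROOT DIR...: report which expected sub-directories of ROOT are missing.
# Returns 0 if all directories exist, non-zero otherwise.
check_layout() {
    root=$1
    shift
    missing=0
    for dir in "$@"; do
        if [ ! -d "$root/$dir" ]; then
            echo "missing: $root/$dir" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example calls, to be run from the repository root:
#   check_layout data pretraining/ProteinVAE pretraining/SELFIESVAE \
#       pretraining/affinity_predictor pretraining/language_models \
#       pretraining/toxicity_predictor training
#   check_layout models ProteinVAE SELFIESVAE Tox21 affinity
```

If you only use the pretrained models, the second call together with `check_layout data training` should be enough.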

NOTE: no worries, the data and models folders are in the .gitignore.

PaccMann^RL on SARS-CoV-2

To train the conditional generator with the pretrained models, you only need the data under data/training/ (8MB).

Clone the repo

To get the training script simply type this:

mkdir -p code && cd code && \
  git clone --branch sarscov2 https://github.com/PaccMann/paccmann_generator && \
  cd ..

The branch is given to ensure a version working with the provided conda environment.

NOTE: no worries, the code folder is in the .gitignore.

Running training

Start the training with:

(paccmann_sarscov2) $ python ./code/paccmann_generator/examples/affinity/train_conditional_generator.py \
    ./models/SELFIESVAE \
    ./models/ProteinVAE \
    ./models/affinity \
    ./data/training/merged_sequence_encoding/uniprot_covid-19.csv \
    ./code/paccmann_generator/examples/affinity/conditional_generator.json \
    paccmann_sarscov2 \
    35 \
    ./data/training/unbiased_predictions \
    --tox21_path ./models/Tox21

This creates a biased_models folder containing the conditional generator, biased towards all provided proteins from covid-19.uniprot.org except one that is held out (ACE2_HUMAN in this example). The biased generator produces compounds whose distribution is shifted compared to the unbiased predictions. Ideally, the model generalizes to ACE2_HUMAN, and the biased compounds have overall higher affinity to ACE2_HUMAN according to the affinity predictor. See the pdf files in biased_models/paccmann_sarscov2_35/results to observe the effect at different stages of training.

NOTE: no worries, the biased_models folder is in the .gitignore.
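
To find the result PDFs mentioned above while training is running, a small helper like the one below may be convenient. It is only a sketch: `list_result_pdfs` is a hypothetical function, the default path is the one named in this section, and nothing is assumed about how the PDF files are named, so they are simply listed in alphabetical order.

```shell
# list_result_pdfs [DIR]: print all PDF files below DIR, sorted by path.
# DIR defaults to the results folder of the example run in this README.
list_result_pdfs() {
    results_dir=${1:-biased_models/paccmann_sarscov2_35/results}
    if [ ! -d "$results_dir" ]; then
        echo "no results directory at $results_dir (has training started?)" >&2
        return 1
    fi
    find "$results_dir" -type f -name '*.pdf' | sort
}
```

Calling `list_result_pdfs` without arguments from the repository root lists the plots of the example run. Pass another results folder as the first argument if you used a different model name or protein index.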

Pretraining pipeline

We also provide instructions and scripts to reproduce the full pretraining pipeline. Keep in mind that we discourage running it on a desktop or laptop.

Calling any of the scripts with the -h or --help flag will provide you with some information on the arguments.

NOTE: in the following, we assume a folder models has been created in the root of the repository.

Clone the repos

To get the scripts for each component, create a code folder and clone the repos:

mkdir -p code && cd code && \
  git clone --branch sarscov2 https://github.com/PaccMann/paccmann_predictor && \
  git clone --branch 0.0.2 https://github.com/PaccMann/toxsmi && \
  git clone --branch sarscov2 https://github.com/PaccMann/paccmann_omics && \
  git clone --branch sarscov2 https://github.com/PaccMann/paccmann_chemistry && \
  git clone --branch sarscov2 https://github.com/PaccMann/paccmann_generator && \
  cd ..

The branch is given to ensure a version working with the provided conda environment.
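
Since the environment depends on these exact versions, it may help to confirm that each clone is on the expected branch or tag. The sketch below uses only standard git commands; `current_ref` and `check_ref` are hypothetical helper names, and whether 0.0.2 is a branch or a tag in toxsmi is not stated here, so both cases are handled.

```shell
# current_ref REPO: print the checked-out branch name, or the exact tag
# if HEAD is detached (as happens after cloning with --branch <tag>).
current_ref() {
    git -C "$1" symbolic-ref -q --short HEAD 2>/dev/null ||
        git -C "$1" describe --tags --exact-match 2>/dev/null
}

# check_ref REPO EXPECTED: compare the checked-out ref with the expected one.
check_ref() {
    repo=$1
    expected=$2
    actual=$(current_ref "$repo") || actual="(unknown)"
    if [ "$actual" = "$expected" ]; then
        echo "ok: $repo at $expected"
    else
        echo "MISMATCH: $repo at $actual, expected $expected" >&2
        return 1
    fi
}

# Example calls, to be run from the repository root:
#   check_ref code/paccmann_predictor sarscov2
#   check_ref code/toxsmi 0.0.2
```

The remaining repos (paccmann_omics, paccmann_chemistry, paccmann_generator) can be checked the same way against sarscov2.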

affinity predictor

(paccmann_sarscov2) $ python ./code/paccmann_predictor/examples/affinity/train_affinity.py \
    ./data/pretraining/affinity_predictor/filtered_train_binding_data.csv \
    ./data/pretraining/affinity_predictor/filtered_val_binding_data.csv \
    ./data/pretraining/affinity_predictor/sequences.smi \
    ./data/pretraining/affinity_predictor/filtered_ligands.smi \
    ./data/pretraining/language_models/smiles_language_chembl_gdsc_ccle_tox21_zinc_organdb_bindingdb.pkl \
    ./data/pretraining/language_models/protein_language_bindingdb.pkl \
    ./models/ \
    ./code/paccmann_predictor/examples/affinity/affinity.json \
    affinity

toxicity predictor

(paccmann_sarscov2) $ python ./code/toxsmi/scripts/train_tox.py \
    ./data/pretraining/toxicity_predictor/tox21_train.csv \
    ./data/pretraining/toxicity_predictor/tox21_test.csv \
    ./data/pretraining/toxicity_predictor/tox21.smi \
    ./data/pretraining/language_models/smiles_language_tox21.pkl \
    ./models/ \
    ./code/toxsmi/params/mca.json \
    Tox21 \
    --embedding_path ./data/pretraining/toxicity_predictor/smiles_vae_embeddings.pkl

protein VAE

(paccmann_sarscov2) $ python ./code/paccmann_omics/examples/encoded_proteins/train_protein_encoding_vae.py \
    ./data/pretraining/proteinVAE/tape_encoded/train_representation.csv \
    ./data/pretraining/proteinVAE/tape_encoded/val_representation.csv \
    ./models/ \
    ./code/paccmann_omics/examples/encoded_proteins/protein_encoding_vae_params.json \
    ProteinVAE

SELFIES VAE

(paccmann_sarscov2) $ python ./code/paccmann_chemistry/examples/train_vae.py \
    ./data/pretraining/SELFIESVAE/train_chembl_22_clean_1576904_sorted_std_final.smi \
    ./data/pretraining/SELFIESVAE/test_chembl_22_clean_1576904_sorted_std_final.smi \
    ./data/pretraining/language_models/selfies_language.pkl \
    ./models/ \
    ./code/paccmann_chemistry/examples/example_params.json \
    SELFIESVAE

References

If you use paccmann_sarscov2 in your projects, please cite the following:

@article{born2021datadriven,
  author = {Born, Jannis and Manica, Matteo and Cadow, Joris and Markert, Greta and Mill, Nil Adell and Filipavicius, Modestas and Janakarajan, Nikita and Cardinale, Antonio and Laino, Teodoro and {Rodr{\'{i}}guez Mart{\'{i}}nez}, Mar{\'{i}}a},
  doi = {10.1088/2632-2153/abe808},
  issn = {2632-2153},
  journal = {Machine Learning: Science and Technology},
  number = {2},
  pages = {025024},
  title = {{Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2}},
  url = {https://iopscience.iop.org/article/10.1088/2632-2153/abe808},
  volume = {2},
  year = {2021}
}