Pipeline to reproduce the results of the paper Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2 (Machine Learning: Science and Technology, 2021). In that paper, we propose a de novo molecular generative model for protein-driven molecular design and bundle it with molecular retrosynthesis models to automate all steps before the actual synthesis of a drug candidate.
This repository provides a conda environment and instructions to reproduce the pipeline described in the manuscript:
- Train a multimodal protein-compound interaction classifier, also known as the affinity predictor (source code)
- Train a toxicity predictor (source code)
- Train a generative model for encoded proteins, also known as the ProteinVAE (source code)
- Train a generative model for molecules, also known as the SELFIESVAE (source code)
- Train PaccMann^RL on SARS-CoV-2 using the pretrained models from above (source code)
NOTE: The linked repositories often contain multiple training examples. For the paccmann_sarscov2 use case, the relevant examples are named affinity or encoded_proteins.
- conda>=3.7
- The following data from this Box link. View the respective README.md files on data sources.
- The git repos linked in the previous section
Create a conda environment:
conda env create -f conda.yml
Activate the environment:
conda activate paccmann_sarscov2
NOTE: On Ubuntu, you may now need to run the following to obtain a functional RDKit distribution:
sudo apt-get install libxrender1
Download the data as reported in the requirements section.
From now on, we assume that the data is stored in the root of the repository in a folder called data, following this structure:
data
├── pretraining
│ ├── ProteinVAE
│ ├── SELFIESVAE
│ ├── affinity_predictor
│ ├── language_models
│ └── toxicity_predictor
└── training
This is around 6GB of data, required for pretraining multiple models. The full pipeline is also computationally intensive, and running all the steps on a desktop or laptop might not be straightforward.
For these reasons we also provide pretrained models (ca. 700MB) for download.
Once the download of the pretrained models is completed, the directory structure looks like this:
models
├── ProteinVAE
├── SELFIESVAE
├── Tox21
└── affinity
NOTE: no worries, the data and models folders are in the .gitignore.
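As a quick sanity check after downloading, the two directory trees above can be compared against what is actually on disk. The following is a minimal sketch, not part of the repository: the helper name missing_folders and the constant EXPECTED_FOLDERS are hypothetical, and the folder names are simply copied from the trees shown above.

```python
from pathlib import Path

# Folder names copied from the two directory trees shown above.
EXPECTED_FOLDERS = {
    "data": [
        "pretraining/ProteinVAE",
        "pretraining/SELFIESVAE",
        "pretraining/affinity_predictor",
        "pretraining/language_models",
        "pretraining/toxicity_predictor",
        "training",
    ],
    "models": ["ProteinVAE", "SELFIESVAE", "Tox21", "affinity"],
}


def missing_folders(root="."):
    """Return the expected folders that do not exist under root (hypothetical helper)."""
    root = Path(root)
    missing = []
    for top, subfolders in EXPECTED_FOLDERS.items():
        for sub in subfolders:
            if not (root / top / sub).is_dir():
                missing.append(f"{top}/{sub}")
    return missing


if __name__ == "__main__":
    # Run from the root of the repository.
    for folder in missing_folders():
        print(f"missing: {folder}")
```

Depending on which download was chosen, some entries are expected to be reported as missing, for example the data/pretraining folders when only the pretrained models were fetched.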
If you use the pretrained models to train the conditional generator, you only need the data under data/training/ (8MB).
To get the training script simply type this:
mkdir code && cd code && \
git clone --branch sarscov2 https://github.com/PaccMann/paccmann_generator && \
cd ..
The branch is given to ensure a version working with the provided conda environment.
NOTE: no worries, the code folder is in the .gitignore.
Running the training is as easy as running:
(paccmann_sarscov2) $ python ./code/paccmann_generator/examples/affinity/train_conditional_generator.py \
./models/SELFIESVAE \
./models/ProteinVAE \
./models/affinity \
./data/training/merged_sequence_encoding/uniprot_covid-19.csv \
./code/paccmann_generator/examples/affinity/conditional_generator.json \
paccmann_sarscov2 \
35 \
./data/training/unbiased_predictions \
--tox21_path ./models/Tox21
This will create a biased_models folder containing the conditional generators, biased for all provided proteins from covid-19.uniprot.org except one (ACE2_HUMAN in this example). The biased generator generates compounds with a shifted distribution compared to the unbiased predictions. Ideally, the model generalizes to the held-out ACE2_HUMAN and the biased compounds have overall higher affinity (to ACE2_HUMAN) according to the affinity predictor. See the pdf files in biased_models/paccmann_sarscov2_35/results to observe the effect at different stages of training.
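To browse those reports without clicking through the folder, a few lines of Python are enough. This is a minimal sketch under the assumption that the results folder is laid out exactly as named above; the helper list_result_pdfs is hypothetical, and nothing is assumed about how the individual pdf files are named.

```python
from pathlib import Path


def list_result_pdfs(root=".", run_name="paccmann_sarscov2_35"):
    """Return the pdf files of one training run, sorted by file name (hypothetical helper)."""
    results = Path(root) / "biased_models" / run_name / "results"
    if not results.is_dir():
        # Training has not been run yet, or it was run from another folder.
        return []
    return sorted(results.glob("*.pdf"))


if __name__ == "__main__":
    for pdf in list_result_pdfs():
        print(pdf)
```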
NOTE: no worries, the biased_models folder is in the .gitignore.
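Because the training command takes many positional arguments, it can help to assemble it programmatically and check the inputs before launching a long run. The sketch below mirrors the command shown earlier; the function names and the parameter names model_name and numeric_arg are hypothetical labels chosen for illustration, not the argument names of train_conditional_generator.py (run the script with -h for those). It also assumes that every path in the command, including data/training/unbiased_predictions, already exists before training.

```python
import subprocess
import sys
from pathlib import Path


def required_inputs(root="."):
    """Paths used by the training command, in the order they appear in it."""
    root = Path(root)
    example = root / "code" / "paccmann_generator" / "examples" / "affinity"
    return [
        example / "train_conditional_generator.py",
        root / "models" / "SELFIESVAE",
        root / "models" / "ProteinVAE",
        root / "models" / "affinity",
        root / "data" / "training" / "merged_sequence_encoding" / "uniprot_covid-19.csv",
        example / "conditional_generator.json",
        root / "data" / "training" / "unbiased_predictions",
        root / "models" / "Tox21",
    ]


def missing_inputs(root="."):
    """Return the required paths that do not exist (assumed to be needed up front)."""
    return [path for path in required_inputs(root) if not path.exists()]


def build_command(root=".", model_name="paccmann_sarscov2", numeric_arg=35):
    """Assemble the same argument list as the shell command shown earlier."""
    script, selfies, protein, affinity, csv, params, unbiased, tox21 = (
        str(path) for path in required_inputs(root)
    )
    return [
        sys.executable, script, selfies, protein, affinity, csv, params,
        model_name, str(numeric_arg), unbiased, "--tox21_path", tox21,
    ]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        sys.exit("missing inputs:\n" + "\n".join(str(path) for path in missing))
    # Run inside the activated paccmann_sarscov2 environment.
    subprocess.run(build_command(), check=True)
```

Using sys.executable keeps the run inside whichever interpreter started the wrapper, so it should be launched from the activated conda environment.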
We also provide instructions and scripts to reproduce the full pretraining pipeline. Keep in mind that we discourage running it on a desktop or laptop.
Calling any of the scripts with the -h or --help flag will provide you with some information on the arguments.
NOTE: in the following, we assume a folder models has been created in the root of the repository.
To get the scripts for each of the components, create a code folder and clone the repos. Simply type this:
mkdir code && cd code && \
git clone --branch sarscov2 https://github.com/PaccMann/paccmann_predictor && \
git clone --branch 0.0.2 https://github.com/PaccMann/toxsmi && \
git clone --branch sarscov2 https://github.com/PaccMann/paccmann_omics && \
git clone --branch sarscov2 https://github.com/PaccMann/paccmann_chemistry && \
git clone --branch sarscov2 https://github.com/PaccMann/paccmann_generator && \
cd ..
The branch is given to ensure a version working with the provided conda environment.
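Before starting any long-running job, it may be worth confirming that the clone step left every training script in place. The following is a minimal sketch; the helper name missing_scripts is hypothetical, and the script locations are copied from the commands in this README rather than discovered from the repositories.

```python
from pathlib import Path

# Repository name -> training script path inside it, as used in this README.
TRAINING_SCRIPTS = {
    "paccmann_predictor": "examples/affinity/train_affinity.py",
    "toxsmi": "scripts/train_tox.py",
    "paccmann_omics": "examples/encoded_proteins/train_protein_encoding_vae.py",
    "paccmann_chemistry": "examples/train_vae.py",
    "paccmann_generator": "examples/affinity/train_conditional_generator.py",
}


def missing_scripts(root="."):
    """Return the repositories whose training script is not found under code/."""
    code_dir = Path(root) / "code"
    return [
        repo
        for repo, script in TRAINING_SCRIPTS.items()
        if not (code_dir / repo / script).is_file()
    ]


if __name__ == "__main__":
    for repo in missing_scripts():
        print(f"training script not found for: {repo}")
```

A repository reported here usually means the clone failed or a different branch was checked out.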
(paccmann_sarscov2) $ python ./code/paccmann_predictor/examples/affinity/train_affinity.py \
./data/pretraining/affinity_predictor/filtered_train_binding_data.csv \
./data/pretraining/affinity_predictor/filtered_val_binding_data.csv \
./data/pretraining/affinity_predictor/sequences.smi \
./data/pretraining/affinity_predictor/filtered_ligands.smi \
./data/pretraining/language_models/smiles_language_chembl_gdsc_ccle_tox21_zinc_organdb_bindingdb.pkl \
./data/pretraining/language_models/protein_language_bindingdb.pkl \
./models/ \
./code/paccmann_predictor/examples/affinity/affinity.json \
affinity
(paccmann_sarscov2) $ python ./code/toxsmi/scripts/train_tox.py \
./data/pretraining/toxicity_predictor/tox21_train.csv \
./data/pretraining/toxicity_predictor/tox21_test.csv \
./data/pretraining/toxicity_predictor/tox21.smi \
./data/pretraining/language_models/smiles_language_tox21.pkl \
./models/ \
./code/toxsmi/params/mca.json \
Tox21 \
--embedding_path ./data/pretraining/toxicity_predictor/smiles_vae_embeddings.pkl
(paccmann_sarscov2) $ python ./code/paccmann_omics/examples/encoded_proteins/train_protein_encoding_vae.py \
./data/pretraining/ProteinVAE/tape_encoded/train_representation.csv \
./data/pretraining/ProteinVAE/tape_encoded/val_representation.csv \
./models/ \
./code/paccmann_omics/examples/encoded_proteins/protein_encoding_vae_params.json \
ProteinVAE
(paccmann_sarscov2) $ python ./code/paccmann_chemistry/examples/train_vae.py \
./data/pretraining/SELFIESVAE/train_chembl_22_clean_1576904_sorted_std_final.smi \
./data/pretraining/SELFIESVAE/test_chembl_22_clean_1576904_sorted_std_final.smi \
./data/pretraining/language_models/selfies_language.pkl \
./models/ \
./code/paccmann_chemistry/examples/example_params.json \
SELFIESVAE
If you use paccmann_sarscov2 in your projects, please cite the following:
@article{born2021datadriven,
author = {Born, Jannis and Manica, Matteo and Cadow, Joris and Markert, Greta and Mill, Nil Adell and Filipavicius, Modestas and Janakarajan, Nikita and Cardinale, Antonio and Laino, Teodoro and {Rodr{\'{i}}guez Mart{\'{i}}nez}, Mar{\'{i}}a},
doi = {10.1088/2632-2153/abe808},
issn = {2632-2153},
journal = {Machine Learning: Science and Technology},
number = {2},
pages = {025024},
title = {{Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2}},
url = {https://iopscience.iop.org/article/10.1088/2632-2153/abe808},
volume = {2},
year = {2021}
}