Authors: Shengchao Liu, Weili Nie, Chengpeng Wang, Jiarui Lu, Zhuoran Qiao, Ling Liu, Jian Tang*, Chaowei Xiao*, Anima Anandkumar*
* jointly supervised
[Paper] [Project Page] [ArXiv] [Datasets on Hugging Face] [Checkpoints on Hugging Face]
First install conda:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Then create a virtual environment and install the packages:
conda create -n MoleculeSTM python=3.7
conda activate MoleculeSTM
conda install -y -c rdkit rdkit=2020.09.1.0
conda install -y -c conda-forge -c pytorch pytorch=1.9.1
conda install -y -c pyg -c conda-forge pyg==2.0.3
pip install requests
pip install tqdm
pip install matplotlib
pip install spacy
pip install Levenshtein
# for SciBert
conda install -y boto3
pip install transformers
# for MoleculeNet
pip install ogb==1.2.0
# install pysmilesutils
python -m pip install git+https://github.com/MolecularAI/pysmilesutils.git
pip install deepspeed
# install megatron
# pip install megatron-lm==1.1.5
git clone https://github.com/MolecularAI/MolBART.git --branch megatron-molbart-with-zinc
cd MolBART/megatron_molbart/Megatron-LM-v1.1.5-3D_parallelism
pip install .
cd ../../..
# install apex
# wget https://github.com/NVIDIA/apex/archive/refs/tags/22.03.zip
# unzip 22.03.zip
git clone https://github.com/chao1224/apex.git
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..
We also provide a Docker setup in Dockerfile.
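After installation, a quick check can confirm that the packages are visible to the interpreter. The snippet below is a minimal sketch that uses only the standard library. The import names in EXPECTED_MODULES are assumptions derived from the install commands above (for example, pyg is assumed to import as torch_geometric and pytorch as torch), so adjust any that differ in your environment.

```python
import importlib.util

# Import names assumed to correspond to the packages installed above.
# They are a best guess, not taken from the repository itself.
EXPECTED_MODULES = [
    "rdkit", "torch", "torch_geometric", "requests", "tqdm", "matplotlib",
    "spacy", "Levenshtein", "boto3", "transformers", "ogb",
    "pysmilesutils", "deepspeed", "megatron", "apex",
]


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be located."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Treat a broken or half-installed package as missing.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(EXPECTED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All expected modules were found.")
```

This only checks that each module can be located; it does not verify versions or that CUDA extensions were built.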
We provide the raw dataset (after preprocessing) at this Hugging Face link. Alternatively, you can download it with the following Python script:
from huggingface_hub import HfApi, snapshot_download
api = HfApi()
snapshot_download(repo_id="chao1224/MoleculeSTM", repo_type="dataset", local_dir='.')
Then you can move all the downloaded datasets under the ./data folder.
Useful resources:
- For molecular structure information (SMILES, 2D molecular graph, etc.), we can download it from PubChem in SDF format here.
- For textual data, we may first refer to this PubChem RDF tutorial.
  The RDF data on the PubChem FTP site is arranged in such a way that you only need to download the type of information in which you are interested, thus allowing you to avoid downloading parts of PubChem data you will not use. For example, if you are just interested in computed chemical properties, you only need to download PubChemRDF data in the compound descriptor directory.
  The link is here.
- Guidance on using the RDF and REST API can be found here.
As confirmed with the PubChem group, performing research on these data does not violate their license; however, PubChem does not hold the license for the textual data, so the license of every structure-text pair in PubChemSTM would need to be evaluated individually. This is a substantial workload and has hindered the release of PubChemSTM. However, we have tried our best to upload the structure part of the PubChemSTM data to Hugging Face, and we also provide all the details needed to generate PubChemSTM as follows:
- Go to the preprocessing/PubChemSTM folder.
- python step_01_description_extraction.py. This step extracts and merges all the textual descriptions into a single JSON file. We ran this on May 30th, 2022. The APIs keep updating, so you may get a slightly different version if you run this script yourself.
- bash step_02.sh. This will download all the SDF files, with SMILES, 2D graph, and computed molecular properties. This may take hours.
- python step_03_filter_out_SDF.py. This will filter all the molecules with textual descriptions and save them in the SDF file. This may take <2 hours.
- python step_04_merge_SDF.py. This will gather all the molecules into a single SDF file.
- python step_05_sample_extraction.py. This will generate the CID2SMILES.csv file.
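The steps above can be chained in a small driver so that a failure in one step stops the rest. This is a minimal sketch, not part of the repository: the function name run_pipeline and the dry_run flag are hypothetical, and it assumes it is launched from the repository root so that preprocessing/PubChemSTM resolves correctly.

```python
import subprocess

# Commands in the order listed above (step 2 is a shell script).
STEPS = (
    ("python", "step_01_description_extraction.py"),
    ("bash", "step_02.sh"),
    ("python", "step_03_filter_out_SDF.py"),
    ("python", "step_04_merge_SDF.py"),
    ("python", "step_05_sample_extraction.py"),
)


def run_pipeline(steps=STEPS, workdir="preprocessing/PubChemSTM", dry_run=True):
    """Run each step in order; with dry_run=True, only list the commands."""
    planned = []
    for command in steps:
        planned.append(" ".join(command))
        if dry_run:
            continue
        # check=True raises CalledProcessError, which halts later steps.
        subprocess.run(list(command), cwd=workdir, check=True)
    return planned


if __name__ == "__main__":
    for line in run_pipeline(dry_run=True):
        print(line)
```

Call run_pipeline(dry_run=False) to actually execute the steps; given the download in step 2, expect this to run for hours.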
We have included these datasets in the Hugging Face link. We briefly list the details below:
- DrugBank_data for zero-shot structure-text retrieval
- ZINC250K_data for space alignment (step 1 in editing)
- Editing_data for zero-shot text-guided editing (step 2 in editing)
- single_multi_property_SMILES.txt for single-objective, multi-objective, binding-affinity-based, and drug relevance editing
- neighbor2drug for neighborhood searching for patent drug molecules
- ChEMBL_data for binding editing
- MoleculeNet_data for molecular property prediction
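Once the datasets are in place, a short check can report which of the entries listed above are still missing. This is a sketch under assumptions: the helper name missing_entries is hypothetical, and because the exact nesting of these entries is not stated here, the search looks for each name at any depth under the data folder.

```python
from pathlib import Path

# Names taken from the list above; where each one sits inside ./data
# is an assumption, so the search below is recursive.
EXPECTED_ENTRIES = [
    "DrugBank_data",
    "ZINC250K_data",
    "Editing_data",
    "single_multi_property_SMILES.txt",
    "neighbor2drug",
    "ChEMBL_data",
    "MoleculeNet_data",
]


def missing_entries(data_root="data", expected=EXPECTED_ENTRIES):
    """Return the expected names not found anywhere under data_root."""
    root = Path(data_root)
    if not root.is_dir():
        return list(expected)
    return [name for name in expected if next(root.rglob(name), None) is None]


if __name__ == "__main__":
    print("Missing:", missing_entries("data"))
```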
This can be done by simply calling the following for SciBERT:
from transformers import AutoModel, AutoTokenizer
SciBERT_tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', cache_dir=pretrained_SciBERT_folder)
SciBERT_model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased', cache_dir=pretrained_SciBERT_folder).to(device)
Run download_MegaMolBART.sh (credit to RetMol). The output structure looks like:
├── bart_vocab.txt
└── checkpoints
    ├── iter_0134000
    │   ├── mp_rank_00
    │   │   └── model_optim_rng.pt
    │   ├── mp_rank_00_model_states.pt
    │   ├── zero_pp_rank_0_mp_rank_00optim_states.pt
    │   ├── zero_pp_rank_1_mp_rank_00optim_states.pt
    │   ├── zero_pp_rank_2_mp_rank_00optim_states.pt
    │   ├── zero_pp_rank_3_mp_rank_00optim_states.pt
    │   ├── zero_pp_rank_4_mp_rank_00optim_states.pt
    │   ├── zero_pp_rank_5_mp_rank_00optim_states.pt
    │   ├── zero_pp_rank_6_mp_rank_00optim_states.pt
    │   └── zero_pp_rank_7_mp_rank_00optim_states.pt
    └── latest_checkpointed_iteration.txt
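To confirm that the download produced this layout, a check along the following lines may help. It is only a sketch: the helper name missing_checkpoint_files is hypothetical, the root folder is whatever directory the download script wrote into, and only a representative subset of the files in the tree is checked.

```python
from pathlib import Path

# Relative paths read off the tree above (a subset, not every shard).
REQUIRED_FILES = [
    "bart_vocab.txt",
    "checkpoints/latest_checkpointed_iteration.txt",
    "checkpoints/iter_0134000/mp_rank_00/model_optim_rng.pt",
    "checkpoints/iter_0134000/mp_rank_00_model_states.pt",
]


def missing_checkpoint_files(root, required=REQUIRED_FILES):
    """Return the required relative paths that are not files under root."""
    base = Path(root)
    return [rel for rel in required if not (base / rel).is_file()]


if __name__ == "__main__":
    # "." is a placeholder; point this at the actual download folder.
    print("Missing:", missing_checkpoint_files("."))
```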
For GraphMVP, check this repo and the checkpoints at this Google Drive link.
pretrained_GraphMVP/
├── GraphMVP_C
│   └── model.pth
└── GraphMVP_G
    └── model.pth
For KV-PLM, check this repo and the checkpoints at this Google Drive link.
We provide two sets of demo checkpoints at this Hugging Face link. Alternatively, you can use the following Python script:
from huggingface_hub import HfApi, snapshot_download
api = HfApi()
snapshot_download(repo_id="chao1224/MoleculeSTM", repo_type="model", cache_dir='.')
For the optimal results reported in the paper, please use the following script:
from huggingface_hub import HfApi, snapshot_download
api = HfApi()
snapshot_download(repo_id="chao1224/MoleculeSTM", repo_type="model", local_dir='.', allow_patterns="*MoleculeSTM*")
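The allow_patterns argument restricts the download to files whose path matches a glob-style pattern. The sketch below approximates that filtering with Python's fnmatch to show the effect of "*MoleculeSTM*". The file paths are hypothetical and chosen for illustration only, and the exact matching rules inside huggingface_hub may differ in detail.

```python
from fnmatch import fnmatchcase


def select_files(paths, pattern):
    """Keep the paths whose full relative path matches the glob pattern."""
    return [path for path in paths if fnmatchcase(path, pattern)]


# Hypothetical file paths; the real repository layout may differ.
example_paths = [
    "demo/demo_checkpoints_SMILES/molecule_model.pth",
    "pretrained_MoleculeSTM/example_run/text_model.pth",
]

# "*" also matches "/", so the pattern applies to the whole path.
selected = select_files(example_paths, "*MoleculeSTM*")

if __name__ == "__main__":
    print(selected)
```

Under this reading, only paths that contain "MoleculeSTM" somewhere are kept, so the demo checkpoints would be skipped.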
We further provide the optimal checkpoints for each downstream task under the scripts folder (README file).
All the running scripts and demos can be found under the scripts folder and demos folder, respectively.
MoleculeSTM-SMILES
python pretrain.py \
--verbose --batch_size=8 \
--molecule_type=SMILES
MoleculeSTM-Graph
python pretrain.py \
--verbose --batch_size=8 \
--molecule_type=Graph
For DrugBank-Description
MoleculeSTM-SMILES
python downstream_01_retrieval_Description_Pharmacodynamics.py \
--task=molecule_description_removed_PubChem \
--molecule_type=SMILES \
--input_model_dir=../data/demo/demo_checkpoints_SMILES
MoleculeSTM-Graph
python downstream_01_retrieval_Description_Pharmacodynamics.py \
--task=molecule_description_removed_PubChem \
--molecule_type=Graph \
--input_model_dir=../data/demo/demo_checkpoints_Graph
For DrugBank-Pharmacodynamics
MoleculeSTM-SMILES
python downstream_01_retrieval_Description_Pharmacodynamics.py \
--task=molecule_pharmacodynamics_removed_PubChem \
--molecule_type=SMILES \
--input_model_dir=../data/demo/demo_checkpoints_SMILES
MoleculeSTM-Graph
python downstream_01_retrieval_Description_Pharmacodynamics.py \
--task=molecule_pharmacodynamics_removed_PubChem \
--molecule_type=Graph \
--input_model_dir=../data/demo/demo_checkpoints_Graph
For DrugBank-ATC
MoleculeSTM-SMILES
python downstream_01_retrieval_ATC.py \
--molecule_type=SMILES \
--input_model_dir=../data/demo/demo_checkpoints_SMILES
MoleculeSTM-Graph
python downstream_01_retrieval_ATC.py \
--molecule_type=Graph \
--input_model_dir=../data/demo/demo_checkpoints_Graph
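The six retrieval runs above differ only in the task and the molecule type, so they can be generated in a loop. This is a minimal sketch: the function name retrieval_commands and the checkpoint_root parameter are hypothetical, while the script names, flags, and paths are copied from the commands above.

```python
from itertools import product

MOLECULE_TYPES = ["SMILES", "Graph"]
DESCRIPTION_TASKS = [
    "molecule_description_removed_PubChem",
    "molecule_pharmacodynamics_removed_PubChem",
]


def retrieval_commands(checkpoint_root="../data/demo"):
    """Build the argument lists for all zero-shot retrieval runs."""
    commands = []
    for task, mol_type in product(DESCRIPTION_TASKS, MOLECULE_TYPES):
        commands.append([
            "python",
            "downstream_01_retrieval_Description_Pharmacodynamics.py",
            f"--task={task}",
            f"--molecule_type={mol_type}",
            f"--input_model_dir={checkpoint_root}/demo_checkpoints_{mol_type}",
        ])
    for mol_type in MOLECULE_TYPES:
        commands.append([
            "python",
            "downstream_01_retrieval_ATC.py",
            f"--molecule_type={mol_type}",
            f"--input_model_dir={checkpoint_root}/demo_checkpoints_{mol_type}",
        ])
    return commands


if __name__ == "__main__":
    for command in retrieval_commands():
        print(" ".join(command))
```

Each returned list can be passed to subprocess.run(command, check=True) from the folder that contains the scripts.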
The list of description IDs can be found in MoleculeSTM/downstream_molecule_edit_utils.py.
MoleculeSTM-SMILES
python downstream_02_molecule_edit_step_01_MoleculeSTM_Space_Alignment.py \
--MoleculeSTM_molecule_type=SMILES \
--MoleculeSTM_model_dir=../data/demo/demo_checkpoints_SMILES
python downstream_02_molecule_edit_step_02_MoleculeSTM_Latent_Optimization.py \
--MoleculeSTM_molecule_type=SMILES \
--MoleculeSTM_model_dir=../data/demo/demo_checkpoints_SMILES \
--language_edit_model_dir=../data/demo/demo_checkpoints_SMILES \
--input_description_id=101
MoleculeSTM-Graph
python downstream_02_molecule_edit_step_01_MoleculeSTM_Space_Alignment.py \
--MoleculeSTM_molecule_type=Graph \
--MoleculeSTM_model_dir=../data/demo/demo_checkpoints_Graph
python downstream_02_molecule_edit_step_02_MoleculeSTM_Latent_Optimization.py \
--MoleculeSTM_molecule_type=Graph \
--MoleculeSTM_model_dir=../data/demo/demo_checkpoints_Graph \
--language_edit_model_dir=../data/demo/demo_checkpoints_Graph \
--input_description_id=101
MoleculeSTM-SMILES
python downstream_03_property_prediction.py \
--dataset=bace --molecule_type=SMILES
MoleculeSTM-Graph
python downstream_03_property_prediction.py \
--dataset=bace --molecule_type=Graph
Please check the demos folder. This may require you to download the dataset and checkpoints first:
- raw dataset (after preprocessing) at this Hugging Face link.
- checkpoints at this Hugging Face link.
Feel free to cite this work if you find it useful to you!
@article{liu2023moleculestm,
author={Liu, Shengchao and Nie, Weili and Wang, Chengpeng and Lu, Jiarui and Qiao, Zhuoran and Liu, Ling and Tang, Jian and Xiao, Chaowei and Anandkumar, Anima},
title={Multi-modal molecule structure--text model for text-based retrieval and editing},
journal={Nature Machine Intelligence},
year={2023},
month={Dec},
day={01},
volume={5},
number={12},
pages={1447-1457},
issn={2522-5839},
doi={10.1038/s42256-023-00759-6},
url={https://doi.org/10.1038/s42256-023-00759-6}
}