Meta-learning Adaptive Deep Kernel Gaussian Processes for Molecular Property Prediction (ADKF-IFT, ICLR 2023)
This is the official PyTorch implementation of Adaptive Deep Kernel Fitting with Implicit Function Theorem (ADKF-IFT), proposed in the paper Meta-learning Adaptive Deep Kernel Gaussian Processes for Molecular Property Prediction (published at ICLR 2023). Please read our paper [arXiv, OpenReview] for a detailed description of the proposed ADKF-IFT method.
We implement ADKF-IFT (called ADKT in this repository), DKL, DKT and CNP on FS-Mol and MoleculeNet. We adapt the official code of PAR to FS-Mol. We also provide code for performing regression on FS-Mol for all models suitable for regression. These can be found in the fs_mol folder.
All raw result data, plots, and notebooks for producing the result plots on the FS-Mol benchmark in the paper can be found in the visualize_results folder. Our ADKF-IFT model checkpoints for both classification and regression can be downloaded from figshare.
The code to run the MoleculeNet experiment with ADKF-IFT can be found in the MoleculeNet folder. Please follow the instructions in the README.md file there to set up and run those experiments.
In addition, the code for reproducing the four representative out-of-domain molecular design experiments (for prediction and Bayesian optimization) can be found in the bayes_opt folder.
If you find our paper, code, and/or raw result data useful for your research, please consider citing our paper:
@inproceedings{chen2023metalearning,
title={Meta-learning Adaptive Deep Kernel Gaussian Processes for Molecular Property Prediction},
author={Wenlin Chen and Austin Tripp and Jos{\'e} Miguel Hern{\'a}ndez-Lobato},
booktitle={International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=KXRSh0sdVTP}
}
This codebase is built upon forks of the FS-Mol and PAR repositories; the README file, license, etc. are copied and modified from those repositories.
The following commands can be used to set up the repo:
# Clone the PAR submodule
git submodule update --init --recursive
# Create and activate conda environment
conda env create -f environment.yml
conda activate adkf-ift-fsmol
# Download and extract dataset
wget -O fs-mol-dataset.tar https://figshare.com/ndownloader/files/31345321
tar -xf fs-mol-dataset.tar # creates directory ./fs-mol
rm fs-mol-dataset.tar # delete the tar file to save space
mv fs-mol fs-mol-dataset # rename the folder for better clarity
# Download and extract pre-trained model weights
wget -O adkf-ift-weights.zip https://figshare.com/ndownloader/files/39203102
unzip adkf-ift-weights.zip # will create 2 .pt files
Meta-training for classification:
dataset_dir="./fs-mol-dataset" # change as necessary
python fs_mol/adaptive_dkt_train.py "$dataset_dir"
Meta-training for regression:
dataset_dir="./fs-mol-dataset" # change as necessary
python fs_mol/adaptive_dkt_train.py "$dataset_dir" --use-numeric-labels
Meta-testing:
# If you trained a model yourself look for a checkpoint file like:
# "./outputs/FSMol_ADKTModel_gnn+ecfp+fc_{YYYY-MM_DD_HH-MM-SS}/best_validation.pt"
# Otherwise, just use the pretrained model below:
model_checkpoint="./adkf-ift-classification.pt" # change as needed
python fs_mol/adaptive_dkt_test.py "$model_checkpoint" "$dataset_dir"
Meta-testing results for classification can be collected by running:
eval_id="YYYY-MM_DD_HH-MM-SS" # change as needed (requires running meta-testing first)
python fs_mol/plotting/collect_eval_runs.py ADKT "./outputs/FSMol_Eval_ADKTModel_${eval_id}" # change as needed
Meta-testing results for regression can be collected by running:
eval_id="YYYY-MM_DD_HH-MM-SS" # change as needed (requires running meta-testing first)
python fs_mol/plotting/collect_eval_runs.py ADKTNUMERIC "./outputs/FSMol_Eval_ADKTModel_${eval_id}" --metric r2 # change as needed
Results can then be visualized using the notebooks in the visualize_results folder.
Below is the original README file from the FS-Mol repository.
This repository contains data and code for FS-Mol: A Few-Shot Learning Dataset of Molecules.
- Clone or download this repository
- Install dependencies:
  cd FS-Mol
  conda env create -f environment.yml
  conda activate fsmol
The code for the Molecule Attention Transformer baseline is added as a submodule of this repository. Hence, in order to be able to run MAT, one has to clone our repository via git clone --recurse-submodules. Alternatively, one can first clone our repository normally, and then set up submodules via git submodule update --init. If the MAT submodule is not set up, all the other parts of our repository should continue to work.
The dataset is available as a download, FS-Mol Data, split into train, valid and test folders. Additionally, the file datasets/fsmol-0.1.json specifies which tasks are to be used, as a default list of tasks for each data fold. We note that the complete dataset contains many more tasks. To use all available training tasks, pass the training script argument --task_list_file datasets/entire_train_set.json. The task lists will be used to version FS-Mol in future iterations as more data becomes available via ChEMBL.
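For example, the following call trains on every available training task (shown here with the GNN-MAML training script; the other training scripts accept the same flag):
python fs_mol/maml_train.py /path/to/data --task_list_file datasets/entire_train_set.json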
Tasks are stored as individual compressed JSONLines files, with each line corresponding to a single datapoint for the task. Each datapoint is stored as a JSON dictionary with a fixed structure:
{
  "SMILES": "SMILES_STRING",
  "Property": "ACTIVITY BOOL LABEL",
  "Assay_ID": "CHEMBL ID",
  "RegressionProperty": "ACTIVITY VALUE",
  "LogRegressionProperty": "LOG ACTIVITY VALUE",
  "Relation": "ASSUMED RELATION OF MEASURED VALUE TO TRUE VALUE",
  "AssayType": "TYPE OF ASSAY",
  "fingerprints": [...],
  "descriptors": [...],
  "graph": {
    "adjacency_lists": [
      [... SINGLE BONDS AS PAIRS ...],
      [... DOUBLE BONDS AS PAIRS ...],
      [... TRIPLE BONDS AS PAIRS ...]
    ],
    "node_types": [...ATOM TYPES...],
    "node_features": [...NODE FEATURES...]
  }
}
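For illustration, a raw task file can be inspected with just the Python standard library. This is a minimal sketch: the file name CHEMBL1001.jsonl.gz is a hypothetical example, and gzip compression of the JSONLines files is an assumption; the FSMolDataset class described below is the recommended interface.
import gzip
import json

# Hypothetical path to a single task file (one compressed JSONLines file per task)
task_file = "./fs-mol-dataset/train/CHEMBL1001.jsonl.gz"

with gzip.open(task_file, "rt") as f:
    for line in f:
        datapoint = json.loads(line)  # one JSON dictionary per line, structured as above
        print(datapoint["SMILES"], datapoint["Property"])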
The fs_mol.data.FSMolDataset class provides programmatic access in Python to the train/valid/test tasks of the few-shot dataset. An instance is created from the data directory by FSMolDataset.from_directory(/path/to/dataset). More details and examples of how to use FSMolDataset are available in fs_mol/notebooks/dataset.ipynb.
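A minimal usage sketch follows; the DataFold import, the get_task_reading_iterable call, and the task.name/task.samples attributes are assumptions based on our reading of the fs_mol.data module, and the notebook above is the authoritative reference.
from fs_mol.data import DataFold, FSMolDataset  # assumes the repo root is on PYTHONPATH

dataset = FSMolDataset.from_directory("./fs-mol-dataset")
# Iterate over the training-fold tasks; each task bundles all datapoints of one assay
for task in dataset.get_task_reading_iterable(DataFold.TRAIN):
    print(task.name, len(task.samples))
    break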
We have provided an implementation of the FS-Mol evaluation methodology in fs_mol.utils.eval_utils.eval_model(). This is a framework-agnostic Python method, and we demonstrate in detail how to use it for evaluating a new model in notebooks/evaluation.ipynb.
Note that our baseline test scripts (fs_mol/baseline_test.py, fs_mol/maml_test.py, fs_mol/mat_test.py, fs_mol/multitask_test.py and fs_mol/protonet_test.py) use this method as well and can serve as examples of how to integrate per-task fine-tuning in TensorFlow (maml_test.py), fine-tuning in PyTorch (mat_test.py), and single-task training for scikit-learn models (baseline_test.py).
These scripts also support the --task_list_file parameter to choose different sets of test tasks, as required.
We provide implementations for three key few-shot learning methods: Multitask learning, Model-Agnostic Meta-Learning, and Prototypical Networks, as well as evaluation of the single-task baselines and the Molecule Attention Transformer (MAT) (paper, code).
All results and associated plots are found in the baselines/ directory.
These baseline methods can be run on the FS-Mol dataset as follows:
Our kNN and RF baselines are obtained by permitting grid-search over an industry-standard parameter set, detailed in the script baseline_test.py.
The baseline single-task evaluation can be run as follows, with a choice of kNN or randomForest model:
python fs_mol/baseline_test.py /path/to/data --model {kNN, randomForest}
The Molecule Attention Transformer (MAT) (paper, code) can be evaluated as:
python fs_mol/mat_test.py /path/to/pretrained-mat /path/to/data
The GNN-MAML model consists of a GNN operating on the molecular graph representations of the dataset; training is implemented in maml_train.py. The current defaults were used to train the final versions of GNN-MAML available here. Training is run as:
python fs_mol/maml_train.py /path/to/data
Evaluation is run as:
python fs_mol/maml_test.py /path/to/data --trained_model /path/to/gnn-maml-checkpoint
The GNN-MT model consists of a GNN operating on the molecular graph representations of the dataset; training is implemented in multitask_train.py. This method has similarities to the approach taken for the task-only pre-training within Hu et al. 2019. Training is run as:
python fs_mol/multitask_train.py /path/to/data
Evaluation is run as:
python fs_mol/multitask_test.py /path/to/gnn-mt-checkpoint /path/to/data
The prototypical networks method (Snell et al. 2017) extracts representations of support set datapoints and uses these to classify positive and negative examples. Here we use the Mahalanobis distance as the metric for query-point distance to class prototypes. Training is run as:
python fs_mol/protonet_train.py /path/to/data
Evaluation is run as:
python fs_mol/protonet_test.py /path/to/pn-checkpoint /path/to/data
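As a self-contained illustration of the Mahalanobis scoring rule described above (not the repository's implementation, which operates on learned GNN embeddings), the sketch below computes class prototypes and Mahalanobis distances for toy feature vectors; all names in it are hypothetical:
import numpy as np

def mahalanobis_prototype_scores(support_x, support_y, query_x, eps=1e-3):
    # Score query points by negative squared Mahalanobis distance to the
    # prototype of each class, estimated from the support set
    scores = []
    for label in (0, 1):
        class_x = support_x[support_y == label]
        prototype = class_x.mean(axis=0)
        # Regularized class covariance; eps keeps it invertible for small support sets
        cov = np.cov(class_x, rowvar=False) + eps * np.eye(support_x.shape[1])
        diff = query_x - prototype
        # Squared Mahalanobis distance of each query point to the prototype
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        scores.append(-d2)
    return np.stack(scores, axis=1)  # higher score = closer to that class prototype

# Toy usage: 8 support points and 3 query points in 4 dimensions
rng = np.random.default_rng(0)
support_x = rng.normal(size=(8, 4))
support_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
query_x = rng.normal(size=(3, 4))
print(mahalanobis_prototype_scores(support_x, support_y, query_x).argmax(axis=1))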
We provide pre-trained models for GNN-MAML, GNN-MT and PN; these can be downloaded from the links to figshare.
| Model Name | Description | Checkpoint File |
|---|---|---|
| GNN-MAML | Support set size 16. 8-layer GNN. Edge MLP message passing. | MAML-Support16_best_validation.pkl |
| GNN-MT | 10-layer GNN. PNA message passing. | multitask_best_model.pt |
| PN | 10-layer GNN, PNA message passing. ECFP+GNN, Mahalanobis distance metric. | PN-Support64_best_validation.pt |
Flexible definition of few-shot models and single-task models is demonstrated in the range of train and test scripts in fs_mol.
We give a detailed example of how to use the abstract class AbstractTorchFSMolModel in notebooks/integrating_torch_models.ipynb to integrate a new general PyTorch model, and note that the evaluation procedure described above is demonstrated on sklearn models in fs_mol/baseline_test.py and on a TensorFlow-based GNN model in fs_mol/maml_test.py.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.