LaMBO: Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders
Bayesian optimization (BayesOpt) is a gold standard for query-efficient continuous optimization. However, its adoption for drug and antibody sequence design has been hindered by the discrete, high-dimensional nature of the decision variables. We develop a new approach (LaMBO) which jointly trains a denoising autoencoder with a discriminative multi-task Gaussian process head, allowing gradient-based optimization of multi-objective acquisition functions in the latent space of the autoencoder. These acquisition functions allow LaMBO to balance the explore-exploit tradeoff over multiple design rounds, and to balance objective tradeoffs by optimizing sequences at many different points on the Pareto frontier. We evaluate LaMBO on a small-molecule task based on the ZINC dataset and introduce a new large-molecule task targeting fluorescent proteins. In our experiments LaMBO outperforms genetic optimizers and does not require a large pretraining corpus, demonstrating that BayesOpt is practical and effective for biological sequence design.
BayesOpt can be used to maximize the simulated folding stability (-dG) and solvent-accessible surface area (SASA) of red-spectrum fluorescent proteins. Higher is better for both objectives. The starting proteins are shown as colored circles, with corresponding optimized offspring shown as crosses. Stability correlates with protein function (e.g. how long the protein can fluoresce) while SASA is a proxy for fluorescent intensity.
On all three tasks (described in Section 5.1 of the paper), LaMBO outperforms genetic algorithm baselines, specifically NSGA-2 and a model-based genetic optimizer with the same surrogate architecture (MTGP + NEHVI + GA). Performance is quantified by the hypervolume bounded by the optimized Pareto frontier. The midpoint, lower, and upper bounds of each curve depict the 50%, 20%, and 80% quantiles, estimated from 10 trials. See Section 5.2 in the paper for more discussion.
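For two objectives that are both maximized, the hypervolume metric is the area dominated by the Pareto frontier and bounded below by a reference point. The following is a minimal sketch in plain Python, not the repository's implementation. The function names `pareto_front` and `hypervolume_2d` and the choice of reference point are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def pareto_front(points: Iterable[Point]) -> List[Point]:
    """Return the non-dominated points when both objectives are maximized.

    The result is sorted by the first objective in descending order.
    """
    front: List[Point] = []
    best_y = float("-inf")
    # Visit points from largest to smallest first objective; a point is
    # non-dominated only if its second objective beats everything seen so far.
    for x, y in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            front.append((x, y))
            best_y = y
    return front


def hypervolume_2d(points: Iterable[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded below by the reference point."""
    # Points that do not strictly improve on the reference contribute nothing.
    inside = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    volume = 0.0
    prev_y = ref[1]
    # Sweep the front: each point adds a horizontal strip of new area.
    for x, y in pareto_front(inside):
        volume += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return volume


if __name__ == "__main__":
    candidates = [(3, 1), (2, 2), (1, 3), (1, 1)]
    print(pareto_front(candidates))
    print(hypervolume_2d(candidates, ref=(0, 0)))
```

A larger hypervolume means the frontier has moved further from the reference point in at least one objective, which is why it serves as a single scalar summary of multi-objective progress.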
An open-source contribution identified subtle bugs that substantially hurt the performance of all methods on some tasks. The fix has been merged, so the current master commit now produces better results than originally reported. To reproduce the original curves in the paper, check out the following commit:
git checkout 431b052
FoldX is available under a free academic license.
After creating an account you will be emailed a link to download the FoldX executable and supporting assets.
Copy the contents of the downloaded archive to ~/foldx.
You may also need to rename the FoldX executable (e.g. mv -v ~/foldx/foldx_20221231 ~/foldx/foldx).
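Before running any FoldX-dependent task, it may help to confirm that the layout described above is in place. The sketch below checks for an executable named `foldx` inside `~/foldx` and hints at a rename if only a versioned binary is present. The helper name `check_foldx_install` is hypothetical and is not part of this repository.

```python
import os
from pathlib import Path
from typing import List, Optional


def check_foldx_install(root: Optional[str] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    # Default to ~/foldx, the location used in the instructions above.
    root_path = Path(root) if root is not None else Path.home() / "foldx"
    problems: List[str] = []
    if not root_path.is_dir():
        problems.append(f"{root_path} is not a directory")
        return problems

    exe = root_path / "foldx"
    if not exe.is_file():
        # A versioned binary such as foldx_20221231 may still need renaming.
        candidates = sorted(p.name for p in root_path.glob("foldx_*"))
        hint = f" (found {candidates}; consider renaming one)" if candidates else ""
        problems.append(f"{exe} not found{hint}")
    elif not os.access(exe, os.X_OK):
        problems.append(f"{exe} is not executable")
    return problems


if __name__ == "__main__":
    issues = check_foldx_install()
    print("FoldX layout looks fine" if not issues else "\n".join(issues))
```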
RDKit is easiest to install if you're using Conda as your package manager (shown below).
TDC is required to run the DRD3 docking task. See the linked README for installation instructions.
git clone https://github.com/samuelstanton/lambo && cd lambo
conda create --name lambo-env python=3.8 -y && conda activate lambo-env
conda install -c conda-forge rdkit -y
conda install -c conda-forge pytdc pdbfixer openbabel -y
pip install -r requirements.txt --upgrade
pip install -e .
This project uses Weights & Biases for logging. The experimental data used to produce the plots in our paper is available here.
See ./notebooks/plot_pareto_front for a demonstration of how to reproduce Figure 1.
See ./notebooks/plot_hypervolume for a demonstration of how to reproduce Figures 3 and 4.
See ./notebooks/rfp_preprocessing.ipynb for a demonstration of how to download PDB files from the RCSB Protein Data Bank and prepare them for use with FoldX.
See ./notebooks/foldx_demo.ipynb for a demonstration of how to use our Python bindings for FoldX, given a starting sequence with known structure.
This project uses Hydra for configuration when running from the command line.
We recommend running NSGA-2 first to test your installation:
python scripts/black_box_opt.py optimizer=mf_genetic optimizer/algorithm=nsga2 task=regex tokenizer=protein
For the model-based genetic baseline, run:
python scripts/black_box_opt.py optimizer=mb_genetic optimizer/algorithm=soga optimizer.encoder_obj=mll task=regex tokenizer=protein surrogate=multi_task_exact_gp acquisition=nehvi
For the full LaMBO algorithm, run:
python scripts/black_box_opt.py optimizer=lambo optimizer.encoder_obj=mlm task=regex tokenizer=protein surrogate=multi_task_exact_gp acquisition=nehvi
To evaluate on the multi-objective RFP (large-molecule) or ZINC (small-molecule) tasks, use task=proxy_rfp tokenizer=protein and task=chem tokenizer=selfies, respectively.
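The task and tokenizer pairings mentioned above can be assembled into full command lines programmatically, which is convenient when launching several runs. The sketch below only builds and prints the commands, reusing the overrides from the LaMBO example. `build_command` is a hypothetical helper and is not part of this repository.

```python
import shlex
from typing import Dict, List, Optional


def build_command(
    optimizer: str,
    task: str,
    tokenizer: str,
    overrides: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble an argv list of Hydra-style key=value overrides."""
    argv = [
        "python",
        "scripts/black_box_opt.py",
        f"optimizer={optimizer}",
        f"task={task}",
        f"tokenizer={tokenizer}",
    ]
    # Extra overrides are appended in insertion order.
    argv += [f"{key}={value}" for key, value in (overrides or {}).items()]
    return argv


# Overrides copied from the full LaMBO command in this README.
LAMBO_OVERRIDES = {
    "optimizer.encoder_obj": "mlm",
    "surrogate": "multi_task_exact_gp",
    "acquisition": "nehvi",
}

# (task, tokenizer) pairings named in this README.
TASKS = [("regex", "protein"), ("proxy_rfp", "protein"), ("chem", "selfies")]

if __name__ == "__main__":
    for task, tokenizer in TASKS:
        print(shlex.join(build_command("lambo", task, tokenizer, LAMBO_OVERRIDES)))
```

Each printed line can be pasted into a shell, or the argv list can be passed to `subprocess.run` from the repository root.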
To evaluate on the single-objective ZINC task used in papers like Tripp et al. (2020), run:
python scripts/black_box_opt.py optimizer=lambo optimizer.encoder_obj=lanmt task=chem_lsbo tokenizer=selfies surrogate=single_task_svgp acquisition=ei encoder=lanmt_cnn surrogate.holdout_ratio=0.1 surrogate.bs=256 surrogate.eval_bs=256 optimizer.resampling_weight=0.5 optimizer.window_size=8
Below we list significant configuration options.
See the config files in ./hydra_config for all configurable parameters.
Note that any config field can be overridden from the command line, but some combinations of options are not supported.
Acquisition options:
nehvi (default, multi-objective)
ehvi (multi-objective)
ei (single-objective)
greedy (single and multi-objective)
Encoder options:
mlm_cnn (default, substitutions only)
mlm_transformer (substitutions only)
lanmt_cnn (substitutions, insertions, deletions)
lanmt_transformer (substitutions, insertions, deletions)
Optimizer options:
lambo (default)
mb_genetic (genetic baseline with model-based compound screening)
mf_genetic (model-free genetic baseline)
Algorithm options:
soga (default, single-objective)
nsga2 (multi-objective)
Surrogate options:
multi_task_exact_gp (default, DKL MTGP regression)
single_task_svgp (DKL SVGP regression)
single_task_exact_gp (DKL GP regression)
string_kernel_exact_gp (not recommended, SSK GP regression)
deep_ensemble (MLE regression)
Task options:
regex (default, maximize counts of 3 bigrams)
regex_easy (maximize counts of 2 tokens)
chem (ZINC small molecules, maximize LogP and QED)
chem_lsbo (ZINC small molecules, maximize penalized LogP)
tdc_docking (ZINC small molecules, minimize DRD3 docking affinity and synthetic accessibility)
proxy_rfp (FPBase large molecules, maximize stability and SASA)
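To give a feel for what a toy objective like "maximize counts of 3 bigrams" could look like, the sketch below scores a sequence by counting occurrences of each motif, yielding one objective per motif. This is one plausible reading of the task description, not the repository's implementation. The motifs "AC", "DE", and "FG" and the helper name `motif_counts` are hypothetical choices for illustration only.

```python
from typing import Sequence, Tuple

# Hypothetical motifs chosen for illustration; the real task may use others.
EXAMPLE_MOTIFS = ("AC", "DE", "FG")


def motif_counts(sequence: str, motifs: Sequence[str] = EXAMPLE_MOTIFS) -> Tuple[int, ...]:
    """Return one objective per motif: its non-overlapping occurrence count."""
    return tuple(sequence.count(motif) for motif in motifs)


if __name__ == "__main__":
    # Higher is better for every objective, so the motifs compete for positions
    # in a fixed-length sequence and a Pareto frontier of tradeoffs emerges.
    print(motif_counts("ACDEAC"))
```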
protein
(default, amino acid vocab for large molecules)selfies
(ZINC-derived SELFIES vocab for small molecules)smiles
(not recommended, ZINC-derived SMILES vocab for small molecules)
pytest tests
This project currently has very limited test coverage.
If you use any part of this code for your own work, please cite
@article{stanton2022accelerating,
title={Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders},
author={Stanton, Samuel and Maddox, Wesley and Gruver, Nate and Maffettone, Phillip and Delaney, Emily and Greenside, Peyton and Wilson, Andrew Gordon},
journal={arXiv preprint arXiv:2203.12742},
year={2022}
}