hmdlab/lambo5utr

LaMBO-DNABERT

Overview

LaMBO-DNABERT is a sequence optimization model that combines LaMBO and DNABERT. Starting from the original LaMBO architecture, we incorporated the pre-trained DNA language model as a denoising autoencoder to leverage its learned genomic information. Our model demonstrated improved performance in optimizing human 5' UTR sequences across up to four computational objectives: AGC content, mean ribosome load (MRL), in vitro stability, and reduced G4 tendency.

Installation

Environment

We developed our model in the following environment. We strongly recommend using the same Python and CUDA versions, because our pipeline integrates multiple models with potentially conflicting dependencies.

  • Linux architecture: x86_64
  • Ubuntu: 20.04.6 LTS, focal
  • CUDA: 11.3, V11.3.58
  • Python: 3.8.10
  • R: 4.1.2 (only for G4 prediction)
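
Because the pipeline is sensitive to version mismatches, it may help to compare your machine against the list above before installing anything. The following is a minimal sketch, not part of the repository: the helper name `check_environment` is hypothetical, and it only checks the Python minor version, the CPU architecture, and whether `nvcc` and `Rscript` are found on the PATH. It does not verify the exact CUDA or R versions.

```python
import platform
import shutil
import sys

# Values taken from the environment list above.
EXPECTED_PYTHON = (3, 8)
EXPECTED_MACHINE = "x86_64"


def check_environment():
    """Return a dict mapping each check name to True or False.

    Hypothetical helper for illustration; it is not shipped with lambo5utr.
    """
    return {
        "python_3_8": sys.version_info[:2] == EXPECTED_PYTHON,
        "x86_64": platform.machine() == EXPECTED_MACHINE,
        # Presence on PATH only; the version itself is not inspected.
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "rscript_on_path": shutil.which("Rscript") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        status = "ok" if ok else "differs from the documented environment"
        print(f"{name}: {status}")
```

A failed check does not necessarily mean the pipeline will break; it only flags a difference from the environment in which the model was developed.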

Installation of LaMBO-DNABERT

Clone this repository to the local environment.

git clone https://github.com/hmdlab/lambo5utr.git

Install the required Python packages and the lambo5utr package as shown below.

pip3 install -r requirements.txt
pip3 install -e .

Installation of other models

The following models are used in our pipeline. Optimus5Prime is included at ./misc/retrained_main_MRL_model.hdf5 as a model retrained with the same model and data provided in the original work. The other models need to be downloaded and installed separately.

  • DNABERT: Pre-trained BERT with human genome used as the encoder and decoder of LaMBO
  • Optimus5Prime: For predicting MRL
  • Nullrecurrent: For predicting in vitro degradation
  • DeepG4: For predicting G4 score
  1. Download the DNABERT 3mer model
     Download the DNABERT 3mer model from here. Specify your local path in ./hydra_config/black_box_opt.yaml as shown below.
pretrained_path: /path/to/your/DNABERT/3-new-12w-0
  2. Installation of Nullrecurrent
     The original instructions can be found here.
    • Clone KaggleOpenVaccine with git-lfs. Available under the MIT license.
sudo apt-get install git-lfs
git clone https://github.com/eternagame/KaggleOpenVaccine.git
    • Install EternaFold from here. Freely available for non-commercial use upon registration.
    • Clone Arnie from here and link it to EternaFold. See detailed instructions here.
git clone https://github.com/DasLab/arnie.git
    • Set KOV_PATH to the KaggleOpenVaccine directory so that the script can find the model parameters. We also recommend adding this line to .bash_profile.
export KOV_PATH="/path/to/KaggleOpenVaccine"
  3. Installation of DeepG4
     Build the R environment from renv.lock. The DeepG4 package will be included in the environment.
cd run_scripts
R
> renv::restore()
> q()
  4. Quick test
     Use the following script to test your installation. This test took ~20 min with 2 GPUs in our environment.
source run_test.sh
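
An unset or mistyped KOV_PATH is an easy mistake in the steps above, so a small check before running the quick test may save some time. The following is a minimal sketch, not part of the repository: the helper name `check_kov_path` is hypothetical, and it only confirms that the variable is set and points to an existing directory. It does not verify that the KaggleOpenVaccine model parameters are actually present there.

```python
import os
from pathlib import Path


def check_kov_path(environ=None):
    """Return (ok, message) describing the state of KOV_PATH.

    Hypothetical helper for illustration. `environ` defaults to the
    process environment and can be replaced by a plain dict for testing.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("KOV_PATH")
    if not value:
        return False, "KOV_PATH is not set"
    if not Path(value).is_dir():
        return False, f"KOV_PATH does not point to a directory: {value}"
    return True, f"KOV_PATH -> {value}"


if __name__ == "__main__":
    ok, message = check_kov_path()
    print(message)
```

If the check fails in a new shell but passes in the one where you ran `export`, the variable was probably not added to .bash_profile.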

Running LaMBO-DNABERT

Use the following script for running LaMBO-DNABERT.
The entire training took ~12h with 2 GPUs in our environment.

source run_script_dnabert.sh

Analyzing and ranking generated sequences

This section includes two notebooks, notebooks/compare_ranking.ipynb and sequence_selection.ipynb, for analyzing and ranking generated sequences. You can use those sequences to try this section without actually running the model.
compare_ranking.ipynb compares four candidate selection methods: simply defined Pareto frontiers, Pareto ranking, MIPS scores, and the R-method. sequence_selection.ipynb describes how we eventually selected sequences for wet-lab experiments. The selection step requires USEARCH to be pre-installed as a command-line tool in your environment.
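
For readers unfamiliar with the first of these methods, a Pareto frontier is the set of candidates that no other candidate beats on every objective at once. The following is a minimal, generic sketch of that idea and is not the code used in the notebooks. It assumes every objective is to be minimized, which may differ from how the notebooks orient each score, and the names `dominates` and `pareto_front` as well as the example scores are hypothetical.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives are minimized by assumption)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Made-up two-objective scores for four candidate sequences.
scores = [(0.2, 0.9), (0.4, 0.4), (0.5, 0.5), (0.9, 0.1)]
print(pareto_front(scores))  # prints [0, 1, 3]
```

Here the third candidate is dropped because the second is better on both objectives. This pairwise comparison is quadratic in the number of candidates, which is usually acceptable for a ranking step but may need a faster method for very large sets.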

Citation

If you use this code for your research, please cite:

Citation information will be provided.
