Modeling A-Domain Specificity using Protein Language Models

abhinadduri/MASPR

MASPR

Code for the paper "Modeling A-Domain Specificity using Protein Language Models".

This repository is under active development. Major TODOs include letting users generate their own training data for new A-domains and adding batch processing at inference time.

Installing Dependencies

MASPR requires conda. Once conda is installed, create an environment and install the required packages:

conda create --name maspr 
conda activate maspr
conda install pip
pip install -r requirements.txt

Benchmarking a MASPR model

To benchmark a MASPR model, you will first need to download the ESM embeddings for the training data (or generate them yourself). You can download these embeddings here.

To reproduce the numbers in the paper:

python train_maspr.py --task ttsplit

To reproduce the generalization benchmark (train on bacteria and test on fungi):

python train_maspr.py --task bacfung

To reproduce the zero-shot learning benchmark (leave-one-substrate-out cross-validation):

python train_maspr.py --task substrate
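
To illustrate what leave-one-substrate-out cross-validation means, the sketch below holds out every example of one substrate at a time and trains on the rest. It is not taken from train_maspr.py: the function name `leave_one_substrate_out` and the (sequence, substrate) pair layout are assumptions made for this example.

```python
from collections import defaultdict
from typing import Iterator, List, Tuple

# Hypothetical record layout: (A-domain sequence, substrate label).
Example = Tuple[str, str]


def leave_one_substrate_out(
    examples: List[Example],
) -> Iterator[Tuple[str, List[Example], List[Example]]]:
    """Yield (held_out_substrate, train, test) once per distinct substrate."""
    by_substrate = defaultdict(list)
    for example in examples:
        by_substrate[example[1]].append(example)

    for held_out in sorted(by_substrate):
        # The held-out substrate never appears in the training fold,
        # so predicting it requires generalizing to an unseen label.
        train = [ex for ex in examples if ex[1] != held_out]
        test = by_substrate[held_out]
        yield held_out, train, test


if __name__ == "__main__":
    toy = [("SEQ_A", "ala"), ("SEQ_B", "gly"), ("SEQ_C", "ala"), ("SEQ_D", "ser")]
    for substrate, train, test in leave_one_substrate_out(toy):
        print(substrate, len(train), len(test))
```

Each fold tests only on a substrate the model has not seen during training, which is why this benchmark measures zero-shot performance.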

To train a MASPR model using all the data:

python train_maspr.py --task train --model_path <MODEL_PATH>

Making predictions with MASPR

To predict the specificity for all A-domains in a given gene sequence:

python process_adomain.py -i <GENE_SEQUENCE>

Gene sequences are typically obtained from a source such as MIBiG by selecting a gene and clicking "Copy AA Sequence".
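
A copied sequence can carry a FASTA header, line breaks, or a trailing stop symbol. The sketch below shows one way to normalize it before use. The helper `clean_protein_sequence` is hypothetical and not part of MASPR, and the accepted alphabet (the 20 standard residues plus X) is an assumption.

```python
import re

# The 20 standard amino acids plus X for an unknown residue (assumed alphabet).
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")


def clean_protein_sequence(raw: str) -> str:
    """Drop FASTA header lines and whitespace, uppercase, and validate residues."""
    lines = [ln for ln in raw.splitlines() if not ln.lstrip().startswith(">")]
    # Remove all whitespace, normalize case, and strip trailing stop symbols.
    seq = re.sub(r"\s+", "", "".join(lines)).upper().rstrip("*")
    if not seq:
        raise ValueError("sequence is empty")
    invalid = sorted(set(seq) - VALID_RESIDUES)
    if invalid:
        raise ValueError(f"unexpected characters: {''.join(invalid)}")
    return seq


if __name__ == "__main__":
    print(clean_protein_sequence(">gene1\nmkt ay\nIAK*\n"))  # prints MKTAYIAK
```

The cleaned string could then be supplied as the `-i` argument, assuming process_adomain.py accepts the raw sequence on the command line as the `<GENE_SEQUENCE>` placeholder suggests.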

Zero-shot prediction with MASPR

MASPR can consider novel substrates during inference even if they were not in its training data. To enable this feature, add your desired substrates to the sub_to_smiles dictionary in substrate_smiles.py.
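
As a rough sketch of what adding a substrate involves, the example below extends a name-to-SMILES mapping while guarding against empty strings and accidental overwrites. The helper `add_substrates` is hypothetical, the dictionary here is only a stand-in for the one in substrate_smiles.py, and the key names ("gly", "b-ala") are assumptions; the real file may use a different naming convention. In practice, editing the dictionary literal in substrate_smiles.py directly achieves the same result.

```python
from typing import Dict


def add_substrates(
    sub_to_smiles: Dict[str, str], new_entries: Dict[str, str]
) -> Dict[str, str]:
    """Return a copy of sub_to_smiles extended with new_entries."""
    merged = dict(sub_to_smiles)
    for name, smiles in new_entries.items():
        smiles = smiles.strip()
        if not smiles:
            raise ValueError(f"empty SMILES string for substrate {name!r}")
        # Refuse to silently replace an existing substrate with a different structure.
        if name in merged and merged[name] != smiles:
            raise ValueError(f"substrate {name!r} already maps to {merged[name]!r}")
        merged[name] = smiles
    return merged


if __name__ == "__main__":
    # Stand-in for the dictionary in substrate_smiles.py (keys are assumed).
    sub_to_smiles = {"gly": "NCC(=O)O"}  # glycine
    extended = add_substrates(sub_to_smiles, {"b-ala": "NCCC(=O)O"})  # beta-alanine
    print(sorted(extended))
```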
