PHIStruct is a phage-host interaction prediction tool that uses structure-aware protein embeddings to represent the receptor-binding proteins (RBPs) of phages. By incorporating structure information, it presents improvements over using sequence-only protein embeddings and feature-engineered sequence properties, especially for phages with RBPs that have low sequence similarity to those of known phages.
Preprint: https://doi.org/10.1101/2024.08.24.609479
If you find our work useful, please consider citing:
@article {PHIStruct,
author = {Gonzales, Mark Edward M. and Ureta, Jennifer C. and Shrestha, Anish M.S.},
title = {PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings},
elocation-id = {2024.08.24.609479},
year = {2024},
doi = {10.1101/2024.08.24.609479},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/08/24/2024.08.24.609479},
eprint = {https://www.biorxiv.org/content/early/2024/08/24/2024.08.24.609479.full.pdf},
journal = {bioRxiv}
}
- News
- Run on Google Colab
- Installation & Usage
- Description
- Dataset of Predicted Structures of Receptor-Binding Proteins
- Reproducing Our Results
- Authors
- 06 Nov 2024 - We presented our work at the 2024 Australian Bioinformatics and Computational Biology Society (ABACBS) National Conference in Sydney. Poster here.
You can readily run PHIStruct on Google Colab, without the need to install anything on your own computer: http://phistruct.bioinfodlsu.com
Operating System: Windows (using WSL), Linux, or macOS
Clone the repository:
git clone https://github.com/bioinfodlsu/PHIStruct
cd PHIStruct
Create a virtual environment with the dependencies installed via Conda (we recommend using Miniconda):
conda env create -f environment.yaml
Activate this environment by running:
conda activate PHIStruct
Depending on your operating system, run the correct installation command (refer to the last column of the table below) to install and configure the remaining dependencies (you only need to do this once, that is, at installation):
OS/Build | Command for Checking OS/Build | Installation Command |
---|---|---|
Linux AVX2 Build | `cat /proc/cpuinfo \| grep avx2` | `bash init.sh avx2` |
Linux SSE2 Build | `cat /proc/cpuinfo \| grep sse2` | `bash init.sh sse2` |
Linux ARM64 Build | `dpkg --print-architecture` or `uname -m` | `bash init.sh arm64` |
macOS | N/A | `bash init.sh osx` |
Note: Running the init.sh script may take a few minutes since it involves downloading a model (SaProt, around 5 GB) from Hugging Face.
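The checks in the table above can also be scripted. The following is a minimal sketch and not part of PHIStruct: the helper names `suggest_build` and `read_cpu_flags` are hypothetical, and preferring `avx2` over `sse2` when a CPU reports both flags is an assumption rather than documented behavior.

```python
import platform
from pathlib import Path


def suggest_build(system, machine, cpu_flags):
    """Return a candidate argument for init.sh, or None if nothing matches.

    Hypothetical helper; preferring avx2 over sse2 is an assumption.
    """
    if system == "Darwin":
        return "osx"
    if system == "Linux":
        if machine.lower() in {"aarch64", "arm64"}:
            return "arm64"
        if "avx2" in cpu_flags:
            return "avx2"
        if "sse2" in cpu_flags:
            return "sse2"
    return None


def read_cpu_flags(cpuinfo_path="/proc/cpuinfo"):
    """Collect the CPU feature flags listed in /proc/cpuinfo (Linux only)."""
    path = Path(cpuinfo_path)
    if not path.exists():
        return set()
    flags = set()
    for line in path.read_text().splitlines():
        key, _, value = line.partition(":")
        # x86 kernels label the line "flags"; ARM kernels use "Features".
        if key.strip() in {"flags", "Features"}:
            flags.update(value.split())
    return flags


if __name__ == "__main__":
    build = suggest_build(platform.system(), platform.machine(), read_cpu_flags())
    print(f"bash init.sh {build}" if build else "No matching build found")
```

The script only prints a suggested command; it is still worth confirming the result against the table before running `init.sh`.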
python3 phistruct.py --input <input_dir> --model <model_joblib> --output <results_dir>
- Replace <input_dir> with the path to the directory storing the PDB files describing the structures of the receptor-binding proteins. Sample PDB files are provided here.
- Replace <model_joblib> with the path to the trained model (recognized format: joblib or compressed joblib, framework: scikit-learn). Download our trained model from this link. No need to uncompress, but doing so will speed up loading the model albeit at the cost of additional storage requirements. Refer to this guide for the list of accepted compressed formats.
- Replace <results_dir> with the path to the directory to which the results of running PHIStruct will be written. The results of running PHIStruct on the sample PDB files are provided here.
The results for each protein are written to a CSV file (without a header row). Each row contains two comma-separated values: a host genus and the corresponding prediction score (class probability). The rows are sorted in order of decreasing prediction score. Hence, the first row pertains to the top-ranked prediction.
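As an illustration of the output layout described above, here is a minimal sketch for reading one results file with the Python standard library. The function name `read_predictions`, the sample rows, and the 0.5 cutoff are hypothetical and only mimic the described format; they are not actual PHIStruct output.

```python
import csv
import io


def read_predictions(csv_file):
    """Parse a headerless results file into (host_genus, score) tuples."""
    return [(row[0], float(row[1])) for row in csv.reader(csv_file) if row]


# Made-up rows that only mimic the described layout; not real PHIStruct output.
sample = io.StringIO("Klebsiella,0.81\nEscherichia,0.12\nAcinetobacter,0.07\n")
predictions = read_predictions(sample)

# Rows are already sorted by decreasing score, so the first row is the top hit.
top_genus, top_score = predictions[0]
print(f"Top-ranked host genus: {top_genus} (score {top_score:.2f})")

# Keep only predictions at or above a chosen (hypothetical) score cutoff.
confident = [(genus, score) for genus, score in predictions if score >= 0.5]
```

To read a real file, pass an open file object instead of the in-memory sample, for example `with open(path, newline="") as f: read_predictions(f)`, where `path` points to one of the CSV files in <results_dir>.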
Under the hood, this script first converts each protein into a structure-aware protein embedding using SaProt and then passes the embedding to a multilayer perceptron trained on all the entries in our dataset with host among the ESKAPEE genera (link). If your machine has a GPU, it will automatically be used to accelerate the protein embedding generation step.
python3 train.py --input <training_dataset>
- Replace <training_dataset> with the path to the training dataset. A sample can be downloaded here.
The training dataset should be formatted as a CSV file (without a header row) where each row corresponds to a training sample. The first column is for the protein IDs, the second column is for the host genera, and the next 1,280 columns are for the components of the SaProt embeddings.
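Before training, it can help to confirm that a dataset matches this layout. The sketch below is one way to do so with the standard library; `parse_training_rows` is a hypothetical helper (not part of `train.py`), and the sample row uses a made-up protein ID with an all-zero embedding.

```python
import csv
import io

EMBEDDING_DIM = 1280  # number of SaProt embedding components per row


def parse_training_rows(csv_file):
    """Yield (protein_id, host_genus, embedding) from a headerless CSV.

    Raises ValueError if a row does not have 2 + EMBEDDING_DIM columns.
    """
    for line_no, row in enumerate(csv.reader(csv_file), start=1):
        if len(row) != 2 + EMBEDDING_DIM:
            raise ValueError(
                f"Row {line_no}: expected {2 + EMBEDDING_DIM} columns, got {len(row)}"
            )
        protein_id, host_genus = row[0], row[1]
        embedding = [float(value) for value in row[2:]]
        yield protein_id, host_genus, embedding


# Synthetic row for illustration only: a made-up ID and an all-zero embedding.
fake_row = ",".join(["PROTEIN_1", "Escherichia"] + ["0.0"] * EMBEDDING_DIM)
rows = list(parse_training_rows(io.StringIO(fake_row + "\n")))
print(len(rows), len(rows[0][2]))
```

For an actual dataset, replace the in-memory sample with `open(<training_dataset>, newline="")`.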
This script will output a gzip-compressed, serialized version of the trained model with filename phistruct_trained.joblib.gz.
↑ Return to Table of Contents.
Motivation: Recent computational approaches for predicting phage-host interaction have explored the use of sequence-only protein language models to produce embeddings of phage proteins without manual feature engineering. However, these embeddings do not directly capture protein structure information and structure-informed signals related to host specificity.
Method: We present PHIStruct, a multilayer perceptron that takes in structure-aware embeddings of receptor-binding proteins, generated via the structure-aware protein language model SaProt, and then predicts the host from among the ESKAPEE genera.
Results: Compared against recent tools, PHIStruct exhibits the best balance of precision and recall, with the highest and most stable F1 score across a wide range of confidence thresholds and sequence similarity settings. The margin in performance is most pronounced when the sequence similarity between the training and test sets drops below 40%, wherein, at a relatively high-confidence threshold of above 50%, PHIStruct presents a 7% to 9% increase in class-averaged F1 over machine learning tools that do not directly incorporate structure information, as well as a 5% to 6% increase over BLASTp.
We also release a dataset of protein structures, computationally predicted via ColabFold, of 19,081 non-redundant (i.e., with duplicates removed) receptor-binding proteins from 8,525 phages across 238 host genera. We identified these receptor-binding proteins based on GenBank annotations. For phage sequences without GenBank annotations, we employed a pipeline that uses the viral protein library PHROG and the machine learning model PhageRBPdetect.
The experiments folder contains the files and scripts for reproducing our results. Note that additional (large) files have to be downloaded (or generated) following the instructions in the Jupyter notebooks.
Directory | Description |
---|---|
data | Contains the data (including the FASTA files and embeddings) |
preprocessing | Contains text files related to the preprocessing of host information and the identification of annotated receptor-binding proteins |
rbp_prediction | Contains the trained model PhageRBPdetect (in JSON format), used for the computational prediction of receptor-binding proteins. Downloaded from this repository (under the MIT License) |
temp | Contains intermediate output files during preprocessing, exploratory data analysis, and performance evaluation |
Script | Description |
---|---|
ClassificationUtil.py | Contains the utility functions for constructing the training and test sets, building the phage-host interaction prediction model, and evaluating its performance |
ConstantsUtil.py | Contains the constants used in the notebooks and scripts |
MLPDropout.py | Implements a multilayer perceptron with dropout in scikit-learn |
RBPPredictionUtil.py | Contains the utility functions for the computational prediction of receptor-binding proteins |
SequenceParsingUtil.py | Contains the utility functions for preprocessing host information and identifying annotated receptor-binding proteins |
StructureUtil.py | Contains the utility functions for consolidating the embeddings generated via structure-aware protein language models |
Once you have cloned this repository and finished downloading (or generating) all the additional required files following the instructions in the Jupyter notebooks, your folder structure should be similar to the one below:
- PHIStruct (root)
  - experiments
    - data
      - GenomesDB (Download and unzip)
        - AB002632
        - ...
      - inphared
        - consolidated (Download and unzip)
          - rbp.csv
          - ...
        - embeddings
          - prottransbert (Download and unzip)
            - complete
            - hypothetical
            - rbp
        - fasta (Download and unzip)
          - complete
          - hypothetical
          - nucleotide
          - rbp
        - structure
          - pdb (Download and unzip)
          - rbp_saprot_embeddings (Download and unzip)
            - AAA74324.1_relaxed.r3.pdb.pt
          - rbp_saprot_mask_embeddings (Download and unzip)
            - AAA74324.1_relaxed.r3.pdb.pt
          - rbp_saprot_seq_mask_embeddings (Download and unzip)
            - AAA74324.1_relaxed.r3.pdb.pt
          - rbp_saprot_struct_mask_embeddings (Download and unzip)
            - AAA74324.1_relaxed.r3.pdb.pt
          - rbp_pst_embeddings (Download and unzip)
            - AAA74324.1_relaxed.r3.pdb.pt
          - rbp_prostt5_embeddings.h5 (Download)
          - rbp_prostt5_3di_embeddings.h5 (Download)
          - rbp_saprot_mask_relaxed_r3.csv (Download)
          - rbp_saprot_relaxed_r3.csv (Download)
          - rbp_saprot_seq_mask_relaxed_r3.csv (Download)
          - rbp_saprot_struct_mask_relaxed_r3.csv (Download)
          - rbp_pst_relaxed_r3.csv (Download)
          - rbp_prostt5_relaxed_r3.csv (Download)
          - rbp_prostt5_3di_relaxed_r3.csv (Download)
        - 3Oct2023_data_excluding_refseq.tsv
        - 3Oct2023_phages_downloaded_from_genbank.gb (Download)
    - preprocessing
    - rbp_prediction
    - temp
    - 1. Sequence Preprocessing.ipynb
    - ...
    - ClassificationUtil.py
    - ...
Operating System: Windows (using WSL), Linux, or macOS
Create a virtual environment with the dependencies installed via Conda (we recommend using Miniconda):
conda env create -f environment_experiments.yaml
Activate this environment by running:
conda activate PHIStruct-experiments
- Mark Edward M. Gonzales (gonzales.markedward@gmail.com)
- Ms. Jennifer C. Ureta (jennifer.ureta@gmail.com)
- Dr. Anish M.S. Shrestha (anish.shrestha@dlsu.edu.ph)
This is a research project under the Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Philippines.
This research was partly funded by the Department of Science and Technology Philippine Council for Health Research and Development (DOST-PCHRD) under the e-Asia JRP 2021 Alternative therapeutics to tackle AMR pathogens (ATTACK-AMR) program.
This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC) and with computing resources from the Machine Learning eResearch Platform (MLeRP) of Monash University, University of Queensland, and Queensland Cyber Infrastructure Foundation Ltd.