P3CMQA is a protein model quality assessment tool using 3DCNN.
A paper about this method is in preparation.
The web server version is available at this page.
If you want to use a local version, please download this repository.
- Requirements
- Sample data directory structure
- Usage without Docker
- Usage with docker
- Output format
- Training and test data
- Reference
In this method, there are two main processes: preprocessing and prediction.
In the preprocessing part, homology search and local structure prediction are performed, and in the prediction part, inferences are performed using the files obtained in the preprocessing.
For the prediction part, we have released a Dockerfile and the Docker images to simplify the construction of the environment.
The preprocessing part uses other large software and databases, and it was difficult to include it in the Docker image, so this part is not included in the Docker image.
Therefore, the preprocessing part should be installed regardless of whether you use the Docker image or not.
-
Python
version 3.7 or later
-
PSI-BLAST (to generate PSSM)
If you do not have installed PSI-BLAST, please install it. You can download blast+ package from here.
After the download is complete, make sure to export the PATH so that the
psiblast
command can be used.# example export PATH=$PATH:/path/to/blast/ncbi-blast-2.11.0+/bin
-
uniref90 database (to generate PSSM using
psiblast
)If you do not have downloaded uniref90 database, please download
uniref90.fasta.gz
from here. Please note that it takes quite a long time to download.After downloading, please unzip
uniref90.fasta.gz
.Then, create a database with the command
makeblastdb
included in the blast package as follows.makeblastdb -in uniref90.fasta -dbtype prot -out uniref90
When the database creation is complete, please name the directory where the database is saved.
export uniref90=/path/to/uniref90_directory
-
SCRATCH-1D (SSpro, ACCpro20) (to predict SS and RSA)
Please download and install SCRATCH-1D using the below commands.
wget http://download.igb.uci.edu/SCRATCH-1D_1.2.tar.gz tar -xvzf SCRATCH-1D_1.2.tar.gz cd SCRATCH-1D_1.2 perl install.pl
After installation, please export the PATH to use
run_SCRATCH-1D_predictors.sh
.export PATH=$PATH:path/to/SCRATCH-1D_1.2/bin
The
blast-2.2.26
included inSCRATCH-1D/pkg
is a 32-bit Linux version and may cause errors.SCRATCH-1D_1.2/README.txt
contains instructions on how to replace it with a 64-bit version, if necessary, replace it.The 64-bit version is available at https://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26.
-
Clone this repository and download the pre-trained model
The trained model is not included in the GitHub repository, so you have to download it and put it directly under the repository.
We provide the pre-training model at here.
$ git clone https://github.com/yutake27/P3CMQA.git $ cd P3CMQA $ wget http://www.cb.cs.titech.ac.jp/p3cmqa/trained_model.npz $ ls LICENSE README.md data result src trained_model.npz
We also provide the pre-trained model that use only atom-type features at here.
-
Scwrl4 (to optimize sidechain) Not essential, but optional
Please export PATH to SCWRL4 to use
Scwrl4
-
chainer version 7.7.0 (Deep learning tool)
$ pip install chainer
If you have a GPU, you can make predictions faster. Please Install cupy based on your CUDA version.
For example, If the CUDA version is 10.2, run the below command to install cupy.
$ pip install cupy-cuda102
-
Biopython version 1.7.8
$ pip install biopython
-
Prody version 2.0
$ pip install prody
data
├── pdb
│ └── sample
│ ├── sample_1.pdb
│ └── sample_2.pdb
├── fasta
│ └── sample.fasta
└── profile
└── sample
├── sample.pssm
├── sample.ss
└── sample.acc20
$ python preprocess.py -f ../data/fasta/sample.fasta -d $uniref90/uniref90 -n num_thread
Use -f
to specify the fasta file path, -d
to specify the uniref90 database path, and -n
to specify the number of threads to use (default=1).
Then You can get sample.pssm
, sample.ss
and sample.acc20
under data/profile/sample
.
$ Scwrl4 -i sample_1.pdb -o sample_1.pdb
Usage
usage: predict.py [-h] [--gpu GPU] [--input_path INPUT_PATH]
[--input_dir_path INPUT_DIR_PATH] --fasta_path FASTA_PATH
[--model_path MODEL_PATH] [--preprocess_dir PREPROCESS_DIR]
[--output_dir OUTPUT_DIR] [--save_res]
P3CMQA
optional arguments:
-h, --help show this help message and exit
--gpu GPU, -g GPU GPU ID (negative value indicates CPU, default = -1)
--input_path INPUT_PATH, -i INPUT_PATH
The path to a single PDB/mmCIF file
--input_dir_path INPUT_DIR_PATH, -d INPUT_DIR_PATH
The path to a directory of multiple PDB/mmCIF files
--fasta_path FASTA_PATH, -f FASTA_PATH
The path to a Reference FASTA file
--model_path MODEL_PATH, -m MODEL_PATH
The path to a Pre-trained model
--preprocess_dir PREPROCESS_DIR, -p PREPROCESS_DIR
The path to a preprocess directory
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
The path to a output directory
--save_res, -s Save the score for each residue (not required if using
"-i" option)
$ python predict.py -d ../data/pdb/sample -f ../data/fasta/sample.fasta
Use -d
to specify the pdb directory path and -f
to specify the fasta file path.
The results are written under data/score/sample
.
You will get a file with the global score of all model structures and files with the scores of each residue for each model structure.
In this example, you will get data/score/sample/sample.csv
.
If you have a GPU,
$ python predict.py -d ../data/pdb/sample -f ../data/fasta/sample.fasta -g 0
Use -g
to specify the GPU ID (negative value indicates CPU).
If you want to specify the output directory,
$ python predict.py -d ../data/pdb/sample -f ../data/fasta/sample.fasta -o path/to/dir
Use -o
to specify the directory path where you want to output the results.
The results are written in path/to/dir/sample.csv
.
If you want to sepecify the profile directory,
$ python predict.py -d ../data/pdb/sample -f ../data/fasta/sample.fasta -p path/to/profile/dir
Directory path/to/profile/dir
should have sample.pssm
, sample.ss
and sample.acc20
.
$ python predict.py -i ../data/pdb/sample/samle_1.pdb -f ../data/fasta/sample.fasta
Use -i
to specify the pdb file path and -f
to specify the fasta file path.
The results are written in data/score/sample/sample_1.txt
.
We have released two versions of the image on Dockerhub, a CPU version, and a GPU version. The Dockerhub repository is here.
If you do not have a GPU environment, please pull yutake27/p3cmqa:cpu
.
docker pull yutake27/p3cmqa:cpu
For GPU versions, we have released two types of images for CUDA and cuDNN versions.
The first image is p3cmqa:cuda11.0-cudnn8
, which supports CUDA 11.0 and cuDNN 8.0, and the other is p3cmqa:cuda10.2-cudnn7
, which supports CUDA 10.2 and cuDNN 7.0.
Please pull the appropriate image for your version of CUDA and cuDNN.
docker pull yutake27/p3cmqa:cuda11.0-cudnn8
If you want to use another version of CUDA or cuDNN, please modify the Dockerfile included in the repository and build it. Note that it is possible to perform a prediction without cuDNN.
$ python preprocess.py -f ../data/fasta/sample.fasta -d $uniref90/uniref90 -n num_thread
Use -f
to specify the fasta file path, -d
to specify the uniref90 database path, and -n
to specify the number of threads to use (default=1).
Then You can get sample.pssm
, sample.ss
and sample.acc20
under data/profile/sample
.
$ Scwrl4 -i sample_1.pdb -o sample_1.pdb
There is a python script named docker_predict.py
to simplify prediction with Docker.
This script starts the container, executes the prediction, and terminates the container, so there is no need to enter docker commands by yourself.
To execute the prediction, please give the name of the Docker image (Repository:tag
) and optional arguments as in the following command.
python docker_predict.py yutake27/p3cmqa:cuda11.0-cudnn8 -g 0 -d ../data/pdb/sample -f ../data/fasta/sample.fasta
The arguments other than the name of the docker image are the same as without docker, so please check here.
If you do not use the above script, please run the container and execute the prediction as follows.
# docker run
docker run -dit --rm -u "$(id -u $USER):$(id -g $USER)" --gpus=0 \
-v /absolute/path/to/P3CMQA:/home \
-v /absolute/path/to/P3CMQA/data/pdb/sample:/home/data/pdb/sample \
-v /absolute/path/to/P3CMQA/data/fasta:/home/data/fasta/ \
-v /absolute/path/to/P3CMQA/data/profile/sample:/home/data/profile/sample \
-v /absolute/path/to/P3CMQA/data/score/sample:/home/data/score/sample \
--name p3cmqa yutake27/p3cmqa:cuda11.0-cudnn8
# docker exec
docker exec -it -u "$(id -u $USER):$(id -g $USER)" p3cmqa /bin/bash \
-c "cd home/src && python predict.py -d ../data/pdb/sample -f ../data/fasta/sample.fasta -g 0"
-
# Model name : ../data/pdb/sample/sample_1.pdb # Model Quality Score : 0.04516 Resid Resname Score 1 MET 0.13082 2 ALA 0.17883 3 ALA 0.28698 :
For one model structure, the above text file is generated.
The first line shows the name of the model structure and the second line shows the global score (score for the whole model structure). The global score ranges from 0 to 1, the closer the score is to 1, the more similar it is to the natural structure.
The third and subsequent lines indicate the residue number, residue name, and predicted local score (score for each residue). The Local score ranges from 0 to 1, as well as the global score.
You can load this file as a csv file using python in the following way.
import pandas as pd df = pd.read_csv('sample_1.txt', sep='\t', header=2) > Resid Resname Score 0 1 SER 0.04631 1 2 ASN 0.06995 2 3 ALA 0.07282 .. ... ... ...
-
Model_name, Score sample_1, 0.045157402753829956 sample_2, 0.3176875412464142
For all model structures, the above csv file is generated.
The first column shows the name of the model structure and the second column shows the global score.
You can download the training data from here training_CASP7-10.tar.gz.
You can also download the test data from here test_CASP11-13.tar.gz.
Please note that the download size is large.
- Y. Takei and T.Ishida, “P3CMQA: Single-Model Quality Assessment Using 3DCNN with Profile-Based Features,” Bioengineering, vol. 8, no. 3, 2021. Available from: https://doi.org/10.3390/bioengineering8030040
- R. Sato and T. Ishida, “Protein model accuracy estimation based on local structure quality assessment using 3D convolutional neural network,” PLoS One, vol. 14, no. 9, p. e0221347, 2019. Available from: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0221347
- G. G. Krivov, M. V. Shapovalov, and R. L. Dunbrack, “Improved prediction of protein side-chain conformations with SCWRL4,” Proteins Struct. Funct. Bioinforma., vol. 77, no. 4, pp. 778–795, 2009. Available from: https://pubmed.ncbi.nlm.nih.gov/19603484/
- D. J. Lipman et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Res., vol. 25, no. 17, pp. 3389–3402, 1997. Available from: https://pubmed.ncbi.nlm.nih.gov/9254694/
- C. N. Magnan and P. Baldi, “SSpro/ACCpro 5: Almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity,” Bioinformatics, vol. 30, no. 18, pp. 2592–2597, 2014. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4215083/