Software for "Leveraging Implicit Knowledge in Neural Networks for Functional Dissection and Engineering of Proteins"
With just a few clicks, you can run DeeProtein online in our code ocean compute capsule. To do so, you need to sign up and duplicate this capsule.
This compute capsule includes two modes:
- Classification: predict the functions of any protein sequence
- Sensitivity analysis: run a full sensitivity analysis for a set of GO terms for any protein sequence
For both modes click Start Interactive Session choose Juypter. In the dropdown menu New on the righthand side choose Terminal.
Wait until the terminal opens and enter
bash /code/infer.sh
to start the inference mode.
Example:
A prompt will open where you can enter a protein sequence, e.g the sequence of Src-kinase kinase (PDB 6F3F)
Enter Sequence for Classification:
GARMVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMNKGSLLDFLKGETGKYLRLPQLVDMSAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
For which the result looks like this:
Predicted labels:
-----------------
GO-Term Score Explanation
GO:0003674 1.000 molecular_function
GO:0043168 1.000 anion binding
GO:0004672 1.000 protein kinase activity
GO:0035639 1.000 purine ribonucleoside triphosphate binding
GO:0005524 1.000 ATP binding
GO:0016773 1.000 phosphotransferase activity, alcohol group as acceptor
GO:0032559 1.000 adenyl ribonucleotide binding
GO:0032555 1.000 purine ribonucleotide binding
GO:0017076 1.000 purine nucleotide binding
GO:0032553 1.000 ribonucleotide binding
GO:0016301 1.000 kinase activity
GO:0016772 1.000 transferase activity, transferring phosphorus-containing groups
GO:0000166 1.000 nucleotide binding
GO:1901265 1.000 nucleoside phosphate binding
GO:0036094 1.000 small molecule binding
GO:0097367 1.000 carbohydrate derivative binding
GO:0030554 1.000 adenyl nucleotide binding
GO:0005488 1.000 binding
GO:0003824 1.000 catalytic activity
GO:1901363 1.000 heterocyclic compound binding
GO:0097159 1.000 organic cyclic compound binding
GO:0043167 1.000 ion binding
GO:0016740 1.000 transferase activity
--------------------------------------------------
GO:0004674 0.373 protein serine/threonine kinase activity
GO:0016787 0.012 hydrolase activity
GO:0004871 0.002 signal transducer activity
Scores below 0.5
are interpreted as negative predictions. Only non-zero scores are shown.
Use CTRL-C to leave the interactive mode, for example when running a sensitivity analysis next.
Wait until the terminal opens and enter
bash /code/sense.sh
to start the sensitivity analysis mode.
Example:
A prompt will open where you can enter a protein sequence, e.g the sequence of Src-kinase kinase (PDB 6F3F)
Please enter the sequence to analyze:
GARMVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMNKGSLLDFLKGETGKYLRLPQLVDMSAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
and the GO-terms of interest in comma-seperated form, e.g.
Please enter the GO terms to analyze sperarated by commas:
GO:0016301,GO:0005524
The sensitivity analysis can take a few minutes. Results will be written to a tab-separated .txt file and a plot in .png format. To view and download these, click on the 'Jupyter' logo and move to the directory 'results/'.
There you will find the sensitivities.txt, e.g.
Pos AA sec dis GO:0016301 GO:0005524
-1 wt wt wt 0.0 0.0
0 G . _ 0.6929230000000004 0.1063629999999982
1 A . _ 0.01167100000000332 -0.12102099999999892
2 R . _ 1.0941490000000016 -0.7712800000000009
3 M . _ 0.9523300000000036 0.6096879999999985
4 V . _ 1.514687000000002 1.8048670000000016
5 T . _ -0.01615099999999714 0.1625770000000024
6 T . _ -0.09454199999999703 0.202874999999998
...
Note: The output has more than 300 lines and was truncated here.
Each Line contains information about one position in the protein sequence. Its index is given under Pos and the respective amino acid under AA. For each GO term GO:_______, there is one column containing the sensitivity value in that position for that GO term. The second line and the third and fourth columns ("sec", "dis") are for technical reasons and can be ignored.
Note Sequences far off the protein sequence space as covered by SwissProt might give unexpected results. The same applies to very short sequences since DeeProtein was trained on sequences with more than 150 amino acids.
Additionally, a plot of the sensitivity values along the sequence is produces and saves as sensitivities.png, e.g.
You can download these by marking them and clicking 'Download'.
For visualization of the sensitivity data, we provide a plugin for PyMOL in /DeeProtein/pymol/visualize_sensitivity.py. You need PyMOL installed on your computer and our plugin to perform the visualization as follows. PyMOL must have access to the following packages:
The plugin assumes the sensitivity values to be in a 'labels' directory, therefore move sensitivities.txt to a directory called 'labels'
mkdir labels
mv sensitivities.txt labels/
Next, import the plugin in PyMOL
PyMOL> run visualize_sensitivity.py
and color the structure according to one of the GO-terms sensitivites (e.g. kinase activity):
PyMOL> color_sensitivity(file="sensitivities.txt", column="GO:0016301", on_pdb="6F3F", on_chain="A", show_hetatm=True, normalize=False, min_val=-5, max_val=5)
position the protein and save an image
PyMOL> png sensitivity_3D.png
For details see the section Visualization in PyMOL.
This and the following sections describe how to setup and deploy DeeProtein.
Download the GeneOntology
$ wget http://purl.obolibrary.org/obo/go.obo
Clone this git repository:
$ git clone https://github.com/juzb/DeeProtein && cd DeeProtein
- Python 3.5+
- Cuda 8.0+
- Tensorflow 1.7
- Tensorlayer 1.11.0
- GO-Tools
- pandas 0.23+
- BioPython 1.7.4
- SciPy 0.19+
- scikit-learn 0.20
Sensitivity data Sensitivity data is stored in our zenodo repository.
DeeProtein Weights Weights for DeeProtein are stored in our zenodo repository.
Note:
The following usage guide has been tailored to meet the requirements of the code-ocean compute capsule. It allows the execution of DeeProtein's functionality:
- Inference and classification of a given protein sequence
- Occlusion based sensitivity analysis of a given protein sequence
- Visualization of the computed sensitivity traces
In order to work with DeeProtein in greater detail, i.e. to alter and/or combine models please refer to the main.sh
script in our (GitHub repository)[https://github.com/juzb/DeeProtein]
Download the weights for DeeProtein from our zenodo repository
$ wget https://zenodo.org/record/2574979/files/complete_model_0.npz
mv complete_model_0.npz complete_model.npz
-
Specify the options in config.JSON, see FLAGS explanations.
$ vim config.JSON
Alternatively, options can be superseded by specifying the appropriate FLAG in the function call (as seen below).
-
To infer a sequence on the pretrained DeeProtein Model use the sense.sh script as explained in the Classification section.
Use the sense.sh file as described in the Sensitivity Analysis section or
-
Write a masked_dataset.txt file having the following format: Each line consists of
<PDB ID>;<Chain ID>;<GO:1>,<GO:2>, ...;<Sequence>;<Secondary structure>;<dis string>
e.g.
6F3F;A;GO:0016301,GO:0005524;QRKLEALIRDPRSPINVESLLDGLNSLVLDLDFPALRKNKNIDNFLNRYEKIVKKIRGLQMKAEDYDVVKVIGRGAFGEVQLVRHKASQKVYAMKLLSKFEMIKRSDSAFFWEERDIMAFANSPWVVQLFCAFQDDKYLYMVMEYMPGGDLVNLMSNYDVPEKWAKFYTAEVVLALDAIHSMGLIHRDVKPDNMLLDKHGHLKLADFGTCMKMDETGMVHCDTAVGTPDYISPEVLKSQGGDGYYGRECDWWSVGVFLFEMLVGDTPFYADSLVGTYSKIMDHKNSLCFPEDAEISKHAKNLICAFLTDREVRLGRNGVEEIKQHPFFKNDQWNWDNIRETAAPVVPELSSDIDSSNFDDITFPIPKAFVGNQLPFIGFTYYR;...............................................................................................................................................................................................................................................................................................................................................................................................;_______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
where, sequence, secondary structure and dis string are of the same length. If the secondary structure is unknown a string containing only points is valid. If the dis string is unknown, a string containing only underscores is valid.
-
Score the occluded sequences using
$ python DeeProtein.py -mask=True -restore=True -restorepath=path/to/saves -model_to_use=model/model.py -v path/to/test/parent-directory-of-masked_dataset
then
$ python DeeProtein/scripts/calculate_sensitivity.py path/to/test/parent-directory-of-masked_dataset/masked_dataset.txt path/to/test/parent-directory-of-masked_dataset/masked_dataset.txt
-
Visualize the computed sensitivities in a sensitivity trace plot:
$ python DeeProtein/scripts/plot_sensitivity.py path/to/test/parent-directory-of-masked_dataset/masked_dataset.txt "GO-to-plot" /outputdir/sensitivity_1D.png
A number of FLAGS is available in config.JSON
to specify the behavior of DeeProtein, both for inference and training.
While all FLAGS may be superseded in the function calls, we recommend to set up the config.JSON prior to usage:
FOR INFERENCE AND SENSITIVITY ANALYSIS:
info_path
`str, the path where logs and the metrics calculation related files should be savedinfer
'str', eitherTrue
orFalse
(as 'str'), whether to run the model in inference mode.mask
'str', eitherTrue
orFalse
(as 'str'), ifTrue
enables the model to perform sensitivity analysis on a masked dataset specified invaliddata
restorepath
'str', path to directory which contains a complete_model.npz file, the saves from which to restore the modelprint_all
'str', eitherTrue
orFalse
(as 'str'), whether to print ALL (1000) Go term scores in interactive inference. ifFalse
only the top 20 are printed.restore
'str', eitherTrue
orFalse
(as 'str'), whether to restore the model fromrestorepath
or train a new model. Defaults toTrue
godagfile
'str', the path to the Gene-Ontology .obo filegofile
'str', the path to file specifying the GO-terms the model should consider. One GO per line. Available in examples/go_file_1000.txt
FOR TRAINING A NEW MODEL:
validdata
'str', the path to the datasets as .csv. Alternative use in 'mask' mode (i.e. sensitivity analysis): path to masked dataset filetraindata
'str', the path to the datasets as .csvtest
'str', eitherTrue
orFalse
(as 'str'), whether to activate test mode. Test data to be specified undervaliddata
compare
'str', eitherTrue
orFalse
(as 'str'), Compares multiple models based on their dumps, paths specified comma-separated invaliddata
model_to_use
'str', the path to the .py holding the model to use in DeeProtein. Default:model/model.py
logging_level
'str', One of [INFO, DEBUG, WARNING] the basic level on which the logger should operate.modelname
'str', the name of the Model. Default 'DeeProtein'nblocks
'int', number of residual blocks to build. Defaults to 30, given weights do only comply with that number.print_num_params
'str', eitherTrue
orFalse
(as 'str'). Printing the number of parameters is time intensive but interesting. Choose wisely...windowlength
'int', Max. Number of sequence position to consider. Longer sequences are clipped. Defaults to1000
.learningrate
'float', initial learning rate for the training of DeeProteinepsilon
'float', epsilon parameter for the ADAM optimizerbatchsize
'int', batchsize. We recommend 64 for training. Defaults to1
(for inference).nepochs
'int', number of epochs to train the model. Defaults to20
.valid_after
'int', number of batches after which to perform an early run over the valid set.early_valid_size
'int', number of batches to use in early validationkeep_plots
'str', eitherTrue
orFalse
(as 'str'), whether to keep or overwrite plots from each validation run
PyMOL> color_sensitivity(file,
column=None,
show_hetatm=True,
show_chains=True,
on_chain=None,
on_pdb=None,
reload=True,
normalize=True,
min_val=-1,
max_val=1
)
Use to automatically color a structure by sensitivity. Is called from the PyMOL console. If the 3D structure needed to show a given proteins sensitivity is not available in the specified 'prot_path', the PDB file is automatically downloaded.
file
sensitivity value file, name should be '<prefix><PDB ID><Chain ID>_<suffix>.<ending, usually txt>column
column to color according toshow_ligands
whether to show hetatm-flagged atoms by default, e.g. ligandsshow_chains
whether to show other chains, e.g. interacting proteins in the pdb fileon_chain
redirect coloring to other chain the file-name suggestson_pdb
redirect coloring to other pdb filereload
whether to reload the 3D structure, can take a few secondsnormalize
whether to normalize the values to color, keeps zero at zero, does not work well, if there are few outliersmin_val
instead of normalization, the upper and the lower border for coloring can be given here \nmax_val
min_val and max_val must have the same absolute value, if a value of 0 shall be colored white