ProteoBOOSTER is a tool that automatically infers protein-protein interaction (PPI) networks for an entire proteome and characterizes these networks by inferring protein complexes. If functional annotations are available for the proteome, the complexes are also analyzed for function, providing a functional profile for as many putative complexes as possible.
This repository contains the code to download the source databases needed to do this for any target proteome, as well as the scripts to automatically transfer the network and perform the subsequent analyses. A sister project, ProteoBOOSTER-web, provides a web interface that can display the information generated by ProteoBOOSTER on a website.
This repository is composed of a series of scripts that should be run in a specific sequence, with BLAST and ClusterONE runs in between. Below is an example sequence of commands that runs the pipeline on the Homo sapiens reference proteome:
Install the required packages
pip install -r requirements.txt
Download the required databases
python download_sapshot.py <project-path> -s <data-dir>
Here, <project-path> refers to the directory on your computer where you want to store all the databases; they will be stored in <project-path>/<data-dir>.
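For example, with a hypothetical layout where everything lives under /home/user/proteobooster and the snapshot directory is named after the download date, the call could look like this (both values are placeholders; use whatever fits your setup):
python download_sapshot.py /home/user/proteobooster -s snapshot-2024-01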
Next, we need to pre-process the interaction databases and combine them into a single database.
python create_interaction_file.py <project-path>/<data-dir> <project-path>/interactions
This creates a collection of files under the <project-path>/interactions/<data-dir> path (although we're changing this behavior soon).
Now, let's assume you've downloaded a proteome fasta file from UniProtKB, such as the Homo sapiens reference proteome: UP000005640_9606.fasta.
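If you still need that file, one way to fetch it is from UniProt's reference-proteomes FTP area (the URL below reflects the current release layout, which may change over time):
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.fasta.gz
gunzip UP000005640_9606.fasta.gz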
To process this, we first need to create a BLAST database from the fasta file that contains all the interactors we've identified before:
makeblastdb -in <project-path>/interactions/<data-dir>/sequences.fasta -out <project-path>/interactions/<data-dir>/sequences.fasta -dbtype prot
Then, we need to align our target fasta file against that database:
blastp -outfmt 6 -query UP000005640_9606.fasta -out UP000005640_9606.blast -db <project-path>/interactions/<data-dir>/sequences.fasta -num_threads <num-cores>
Here, <num-cores> is an optional value that speeds up the alignment by running it on multiple cores.
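For reference, -outfmt 6 is the standard 12-column tabular BLAST output: query id, subject id, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, and bit score, one hit per line. A sample line (with made-up values) looks like:
sp|P04637|P53_HUMAN	sp|P02340|P53_MOUSE	77.5	393	85	2	1	393	1	390	1e-180	520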
We carry on by applying our homology criterion to build a homolog database for the target proteome:
python create_homologs.py UP000005640_9606.blast UP000005640_9606.homologs
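The exact criterion is implemented in create_homologs.py; conceptually, the step reduces to filtering the tabular BLAST hits by thresholds. A minimal sketch in Python, with illustrative cutoffs that are assumptions, not the tool's actual values:
# Illustrative homolog filter over BLAST -outfmt 6 output.
# The e-value and identity cutoffs below are assumptions for this
# sketch, not the thresholds used by create_homologs.py.
EVALUE_CUTOFF = 1e-5
IDENTITY_CUTOFF = 40.0

with open("UP000005640_9606.blast") as blast, \
     open("UP000005640_9606.homologs", "w") as out:
    for line in blast:
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        pident, evalue = float(fields[2]), float(fields[10])
        if evalue <= EVALUE_CUTOFF and pident >= IDENTITY_CUTOFF:
            out.write(f"{query}\t{subject}\n")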
Using the calculated homolog database, we can now transfer interactions from the combined database for our target proteome:
python transfer_interactions.py UP000005640_9606.homologs <project-path>/interactions/<data-dir>/interaction_file.tab UP000005640_9606.interologs
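The idea behind this transfer is classic interolog mapping: if a' and b' interact in a source organism, and a is a homolog of a' while b is a homolog of b', then a-b becomes a candidate interaction in the target proteome. A rough sketch of that logic (the two-column file layouts here are simplifying assumptions; the real files carry more metadata and the paths are abbreviated):
# Sketch of the interolog principle behind transfer_interactions.py.
from collections import defaultdict

homologs_of = defaultdict(set)  # source protein -> target proteins
with open("UP000005640_9606.homologs") as f:
    for line in f:
        target, source = line.split()[:2]
        homologs_of[source].add(target)

with open("interaction_file.tab") as f, \
     open("UP000005640_9606.interologs", "w") as out:
    for line in f:
        a_src, b_src = line.split()[:2]
        # every homolog pair inherits the source interaction
        for a in homologs_of[a_src]:
            for b in homologs_of[b_src]:
                out.write(f"{a}\t{b}\n")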
If you know that no experimental interactions are available for your target organism, the next step is optional. For Homo sapiens (or any model organism), however, it's very likely that some of the proteins in the reference proteome have annotated interactors, so we extract them to include in the inferred interactions:
python extract_experimental_interactions.py UP000005640_9606.fasta <project-path>/interactions/<data-dir>/interaction_file.tab UP000005640_9606.exp_interactions
We now have enough data to create a graph and use it to infer protein complexes. Let's get the graph in a format that ClusterONE can use:
python prepare_data_for_clustering.py UP000005640_9606.exp_interactions UP000005640_9606.interologs UP000005640_9606.graph
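ClusterONE reads a plain edge list: each line names two proteins and, optionally, an edge weight, separated by whitespace. A hypothetical excerpt of such a .graph file:
P04637 Q00987 0.92
P04637 P38936 0.87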
And then cluster the graph:
java -jar cluster_one-1.2.jar -F csv UP000005640_9606.graph > UP000005640_9606.complexes
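With -F csv, each output row describes one putative complex: its size, density, internal and external edge weights, a quality score, a p-value, and the list of member proteins (the exact header may vary between ClusterONE versions). The p-values are handy for ranking complexes before the functional analysis below.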
Finally, to get the functional overrepresentation of all these complexes, you may run:
python overrepresentation.py UP000005640_9606.fasta UP000005640_9606.complexes UP000005640_9606.goa <project-path>/<data-dir>/go-basic.obo UP000005640_9606.overrep
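The statistical core of an overrepresentation analysis like this is typically a hypergeometric (Fisher-type) test per GO term per complex; overrepresentation.py's exact test and any multiple-testing correction may differ from this minimal sketch, and all the counts below are illustrative:
# Hypergeometric test: is a GO term more frequent inside a complex
# than expected from its frequency in the whole proteome?
from scipy.stats import hypergeom

M = 20000   # proteins in the proteome (population size)
n = 150     # proteome proteins annotated with the GO term
N = 12      # proteins in the complex (sample size)
k = 5       # complex members annotated with the term

# P(X >= k): probability of seeing k or more annotated members by chance
p_value = hypergeom.sf(k - 1, M, n, N)
print(f"p = {p_value:.3g}")  # a small p suggests overrepresentation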
Preparing files for visualization with ProteoBOOSTER-web
The section above already produces all the files required to get the information you need, but it may be preferable to explore this data using the web-based interactive explorer.
It works by loading this information into a relational database and providing graphical user interfaces to communicate with it. As a convenience, we created a script that helps you transform the set of files created above into files that can be ingested by the data loader that ships with ProteoBOOSTER-web.
This does require a small amount of extra pre-processing and downloading some more files, however.
You will need to:
- download taxonomy information from UniProt by clicking the Download button. This tutorial will continue assuming you downloaded it and named it taxonomy-info.tsv.
- create a file containing protein names; this can be done by running the following commands (the sketch after these commands shows what the header parsing boils down to):
grep ">" <project-path>/<data-dir>/trembl.fasta > trembl.proteins
grep ">" <project-path>/<data-dir>/swissprot.fasta > swissprot.proteins
(these two commands extract all the lines with protein metadata from the fasta files)
cat trembl.proteins swissprot.proteins > proteobooster.proteins
python collect_proteins.py UP000005640_9606.homologs UP000005640_9606.interologs UP000005640_9606.exp_interactions UP000005640_9606.proteins
(this extracts all the proteins from the files we generated above; it is not a required step, but it speeds up the generation)
python build_protein_info_dict_singlethread.py proteobooster.proteins UP000005640_9606.proteins protein-buffer.proteins
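For context, the .proteins files produced above are just UniProt fasta header lines, which follow the pattern >db|ACCESSION|ENTRY_NAME Protein name OS=... GN=.... A rough sketch of the kind of parsing the last step has to do (the real output format of build_protein_info_dict_singlethread.py may differ):
# Map UniProt accessions to protein names from fasta header lines.
import re

HEADER = re.compile(r">(sp|tr)\|(?P<acc>[^|]+)\|(?P<entry>\S+)\s+(?P<name>.+?)\s+OS=")

protein_names = {}
with open("proteobooster.proteins") as f:
    for line in f:
        m = HEADER.match(line)
        if m:
            protein_names[m.group("acc")] = m.group("name")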
Now that we have all the files we need, we simply need to create a directory where the database files will be written. Let's name that directory db-load, and then we can run:
python prepare_database_files.py UP000005640_9606 <project-path>/<data-dir>/go-basic.obo UP000005640_9606.goa UP000005640_9606.interologs UP000005640_9606.exp_interactions UP000005640_9606.homologs protein-buffer.proteins UP000005640_9606.complexes UP000005640_9606.overrep <project-path>/<data-dir>/mi.obo taxonomy-info.tsv db-load