This repository contains data and code to generate the results and reproduce the figures and tables found in Supervised learning is an accurate method for network-based gene classification Bioinformatics DOI: 10.1093/bioinformatics/btaa150
. This work focuses on comparing two classes of network-based gene classification methods: supervised learning (SL) and label propagation (LP). The comparison is performed systematically across diverse prediction tasks (functions, diseases, and traits) and molecular networks using meaningful validation schemes and evaluation metrics. Based on this extensive analysis, we find that supervised learning outperforms label propagation for gene classification, especially for function prediction.
This repo provides:
- The data, results, and figures presented in the manuscript.
- Code to regenerate the results.
- The ability to add a new method.
The data used in this study (networks, embeddings, and genesets) is available on Zenodo. To get the data run
sh get_data.sh
We note that data for KEGG and InBioMap are not included as they do not provide a permissive enough license.
results/
: This directory contains two files:
main_result.tsv
: This file contains all the results used to compare the SL and LP methods.mdlsel_result.tsv
: This file contains all the results used for model selection for supervised learning and random walk with restart.
properties/
: This directory contains the network properties of every geneset based on every network.
figures_tables/
: This directory contains all the pre-generated figures and tables.
This code was tested on an Anaconda distribution of python. The major packages used are:
python 3.6.5
numpy 1.14.3
scipy 1.1.0
pandas 0.23.0
scikit-learn 0.19.1
matplotlib 2.2.2
seaborn 0.8.1
statsmodels 0.9.0
The parallelization of the code was tested with Slurm on the high performance computing cluster at Michigan State University.
Run the following two lines of code to test if the code works on one set of parameters: GOBP (geneset collection), STRING (network), SL-A, and LP-I (two methods):
cd demo
sh test_run.sh
This will save a sample results files named test.tsv
into the results/
directory. This script takes a few minutes to run.
Results can be re-generated either in parallel or on a single machine. The files must be executed within the directory they are stored.
- To generate the results on a single-machine run:
cd run
sh generate_main_result.sh # This takes approximately 9 hours to run
sh generate_mdlsel_result.sh # This takes approximately 25 hours to run
- To generate the results in parallel run:
cd run/SLURM
sbatch generate_main_result_parallel.sb # This takes approximately 1 hour to run
sbatch generate_mdlsel_result_parallel.sb # This takes approximately 1 hour to run
Result outputs will be saved as either results/main_result_new.tsv
or results/mdsel_result_new.tsv
. We note that development of the scikit-learn package is very active and thus the exact results depend on the specific version of scikit-learn used to regenerate the results.
- To generate the geneset properties run:
cd run
sh generate_geneset_properties.sh # This takes approximately 5 minutes to run
Geneset properties will be saved in properties_new/
.
- To generate all the figures and tables in the manuscript run:
cd run
sh generate_figures.sh # This takes approximately 5 minutes to run
sh generate_tables.sh # This takes approximately 5 minutes to run
The figure and table generation scirpts will only run when using the orginal results in results/main_reuslts.tsv
or results/mdlsel_result.tsv
, and the results will be saved in figures_tables/
. We note that if you run sh generate_tables.sh
this will give slightly different results from Table S1 and S2 in the manuscript as these new tables will be generated with the data included in this repository, which does not include KEGG and InBioMap.
A new network-based gene classification method can be added using the following steps:
- Create a new class object in
src/core/models.py
. A template for how to do this is included at the bottom of themodels.py
script. - The file
src/main.py
will need to be updated in two places. First, a new model dictionary will have to be added to the function get_mdl_dict. A template of how to do this is included in that function. Second, the model name will have to be added to the all_methods variable. - The model name will need to be added to the -m argument in the scripts that re-generate the scores. We note that
generate_figures.sh
andgenerate_tables.sh
will not work when a new method is added.
For support please contact Remy Liu at liurenmi@msu.edu or Chris Mancuso at mancus16@msu.edu.
See LICENSE.md for license information for all geneset collections and networks used in this project.
If you use this work, please cite:
Liu R*, Mancuso CA*, Yannakopoulos A, Johnson KA, Krishnan A (2020) Supervised-learning is an accurate method for network-based gene classification. Bioinformatics doi.org/10.1093/bioinformatics/btaa150.
Renming Liu#, Christopher A Mancuso#, Anna Yannakopoulos, Kayla A Johnson, Arjun Krishnan*
# These authors are joint first authors and are listed alphabetically.
* General correspondence should be addressed to AK at arjun@msu.edu.
This work was primarily supported by US National Institutes of Health (NIH) grants R35 GM128765 to AK and in part by MSU start-up funds to AK and MSU Engineering Distinguished Fellowship to AY.
We are grateful for the support from the members of the Krishnan Lab.
- Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS, Zhang R, Hartmann BM, Zaslavsky E, Sealfon SC, Chasman DI, FitzGerald GA, Dolinski K, Grosser T, Troyanskaya OG. (2015) Understanding multicellular function and disease with human tissue-specific networks. Nature Genetics 47:569-576.
- Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, Simonovic M, Roth A, Santos A, Tsafou KP, Kuhn M, Bork P, Jensen LJ, von Mering C. (2015) STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research 43:D447-452.
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Research 34:D535-539.
- Li T, Wernersson R, Hansen RB, Horn H, Mercer J, Slodkowicz G, Workman CT, Rigina O, Rapacki K, Stærfeldt HH, Brunak S, Jensen TS, Lage K. (2017) A scored human protein-protein interaction network to catalyze genomic interpretation. Nature Methods 14:61-64.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 25:25-29.
- The Gene Ontology Consortium. (2019) The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Research 47:D330-338.
- Wu C, MacLeod I, Su AI (2013) BioGPS and MyGene.info: organizing online, gene-centric information. Nucleic Acids Research 41:D561-D565.
- Xin J, Mark A, Afrasiabi C, Tsueng G, Juchler M, Gopal N, Stupp GS, Putman TE, Ainscough BJ, Griffith OL, Torkamani A, Whetzel PL, Mungall CJ, Mooney SD, Su AI, Wu C (2016) High-performance web services for querying gene and variant annotation. Genome Biology 17:1-7.
- Piñero J, Queralt-Rosinach N, Bravo À, Deu-Pons J, Bauer-Mehren A, Baron M, Sanz F, Furlong LI. (2015) DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database bav028.
- Piñero J, Bravo À, Queralt-Rosinach N, Gutiérrez-Sacristán A, Deu-Pons J, Centeno E, García-García J, Sanz F, Furlong LI. (2017) DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Research 45:D833-839.
- Schriml LM, Mitraka E, Munro J, Tauber B, Schor M, Nickle L, Felix V, Jeng L, Bearer C, Lichenstein R, Bisordi K, Campion N, Hyman B, Kurland D, Oates CP, Kibbey S, Sreekumar P, Le C, Giglio M, Greene C. (2019) Human Disease Ontology 2018 update: classification, content and workflow expansion. Nucleic Acids Research 47:D955-D962.
- Smith CL, Blake JA, Kadin JA, Richardson JE, Bult CJ, Mouse Genome Database Group. (2018) Mouse Genome Database (MGD)-2018: knowledgebase for the laboratory mouse. Nucleic Acids Research 46:D836-842.
- Smith CL, Eppig JT. (2009) The mammalian phenotype ontology: enabling robust annotation and comparative analysis. WIREs Systems Biology and Medicine 3:390-399.
GWAS (data downloaded from Synapse at https://www.synapse.org/#!Synapse:syn11944948)
- Choobdar S, Ahsen ME, Crawford J, Tomasoni M, Fang T, Lamparter D, Lin J, Hescott B, Hu X, Mercer J, Natoli T, Narayan R, The DREAM Module Identification Challenge Consortium, Subramanian A, Zhang JD, Stolovitzky G, Kutalik Z, Lage K, Slonim DK, Saez-Rodriguez J, Cowen LJ, Bergmann S, Marbach D. (2019) Open Community Challenge Reveals Molecular Network Modules with Key Roles in Diseases. bioRxiv doi: https://doi.org/10.1101/265553.
- Kanehisa M, Goto S. (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28:27-30.
- Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. (2017) KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research 45:D353-361.
- Kanehisa M, Sato Y, Furumichi M, Morishima K, Tanabe M. (2019) New approach for understanding genome variations in KEGG. Nucleic Acids Research 47:D590-595.
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences U.S.A. 43:15545-50.
- Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP. (2011) Molecular signatures database (MSigDB) 3.0. Bioinformatics 12:1739-40.