Building Knowledge Networks
The KnetMiner web application requires, as input, a genome-scale knowledge graph (KG). More background about the KnetMiner KGs can be found in our paper Hassani-Pak et al. 2016. Here we provide an overview of the steps involved in building KGs for new species. KGs can be created using the Ondex CLI (aka KnetBuilder). This guide will use Ondex CLI to build KGs and Ondex Desktop to inspect them.
- Linux server with 32GB RAM or more
- Java 8 (Java 11 if you want to use the snapshot/development version)
- ondex-knet-builder v3.0 download
- Ondex-Desktop download
We are going to use Solanum tuberosum (potato) as an example organism.
- Potato GFF3 download
- Potato peptide FASTA download
- Potato gene-protein mapping download
- Potato protein domains download
- Potato-Arabidopsis orthologs download
- Arabidopsis KG v42 download
Download and unzip ondex-knet-builder and create a new tutorial-data folder in its root. You can either download the individual datasets into the tutorial-data folder or download the tutorial-data.zip bundle in one go.
Ondex contains parsers (or importers) for a range of data formats including FASTA, GFF3, Tabular, UniProt-XML, Pubmed-XML, OWL etc. The role of an Ondex parser is to transform the raw data into the graph model using the standardized Ondex metadata. Here we describe how to build a core network of genes, proteins and domains for a particular organism.
Download the GFF3 and protein FASTA files, and unzip them if they are compressed. There is an Ondex parser plugin, called fastagff, that we can use to create a gene-protein network. The parser has the following parameters:
Fastagff parser parameters:
- GFF3 File: Path to GFF3
- Fasta File: Path to peptide FASTA
- Mapping File: Path to a tabular file that maps gene IDs to protein IDs. Required if the protein IDs do not have the form gene_id.x
- TaxId [Int]: Taxonomy ID of your organism
- Accession [String]: Cross-reference database (xref)
- DataSource [String]: Data origin (provenance)
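If no ready-made mapping file is at hand, one can sometimes be derived from the peptide FASTA headers. The sketch below is not part of Ondex. It assumes Ensembl-style headers in which the first token is the protein ID and a later token has the form gene:GENE_ID, and the function name make_gene_protein_mapping is hypothetical. It writes the gene ID in column 0 and the protein ID in column 1, which is the layout the workflow below expects.

```shell
# Sketch only: derive a two-column gene/protein mapping from FASTA headers.
# Assumption: headers look like ">PROTEIN_ID pep ... gene:GENE_ID ...".
make_gene_protein_mapping() {
  # $1: path to the peptide FASTA file
  awk '/^>/ {
    protein = substr($1, 2)                # drop the leading ">"
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^gene:/) {                 # does not match "gene_biotype:"
        print substr($i, 6) "\t" protein   # text after "gene:" is the gene ID
      }
    }
  }' "$1"
}

# Example (file names taken from the tutorial workflow):
# make_gene_protein_mapping tutorial-data/pep.all.fa > tutorial-data/mapping.txt
```

Headers without a gene: token produce no output line, so it is worth comparing the number of mapping lines with the number of FASTA records afterwards.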
We will now create an Ondex workflow file (my_workflow.xml) that instructs Ondex-CLI to run the fastagff parser, export the graph to OXL (Ondex Exchange Language), and write some basic graph statistics as XML.
<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
<Workflow>
<Graph name="memorygraph">
<Arg name="GraphName">default</Arg>
<Arg name="graphId">default</Arg>
</Graph>
<Parser name="fastagff">
<Arg name="GFF3 File">tutorial-data/gff3</Arg>
<Arg name="Fasta File">tutorial-data/pep.all.fa</Arg>
<Arg name="Mapping File">tutorial-data/mapping.txt</Arg>
<Arg name="TaxId">4113</Arg>
<Arg name="Accession">ENSEMBL-PLANTS</Arg>
<Arg name="DataSource">ENSEMBL</Arg>
<Arg name="Column of the genes">0</Arg>
<Arg name="Column of the proteins">1</Arg>
<Arg name="graphId">default</Arg>
</Parser>
<Export name="oxl">
<Arg name="pretty">true</Arg>
<Arg name="ExportIsolatedConcepts">true</Arg>
<Arg name="GZip">true</Arg>
<Arg name="ExportFile">tutorial-data/kg_1.oxl</Arg>
<Arg name="graphId">default</Arg>
</Export>
<Export name="graphinfo">
<Arg name="ExportFile">tutorial-data/kg_1_stats.xml</Arg>
<Arg name="graphId">default</Arg>
</Export>
</Workflow>
</Ondex>
To run the workflow go to the Ondex-CLI root folder and type:
bash runme.sh tutorial-data/my_workflow.xml
Once the workflow has completed, it should have created kg_1.oxl in the folder specified in the OXL exporter. You can open this file in Ondex Desktop, where it should look similar to the figure below. The Ondex Metagraph and Legend provide useful summary information. Check the following: Are the gene and protein numbers the same as in the GFF3 and FASTA files? Are the gene and protein concepts connected via a relation? When you search for particular gene names, are the gene and protein names correct?
You can also check the kg_1_stats.xml report that was generated by the graphinfo exporter.
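To compare the graph against the input files, the expected numbers can be counted on the command line. This is a minimal sketch with hypothetical function names. It assumes that genes appear in the GFF3 as features whose type column is exactly "gene"; other gene-like feature types, and whatever filtering the parser itself applies, are not accounted for.

```shell
# Sketch only: count what the input files contain.
# Assumption: genes are GFF3 features whose type (3rd column) is exactly "gene".
count_gff3_genes() {
  awk -F'\t' '!/^#/ && $3 == "gene" { n++ } END { print n + 0 }' "$1"
}

# Count FASTA records, i.e. header lines starting with ">".
count_fasta_records() {
  awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Example (paths taken from the tutorial workflow):
# count_gff3_genes tutorial-data/gff3
# count_fasta_records tutorial-data/pep.all.fa
```

The two numbers can then be compared with the Gene and Protein concept counts in the Ondex Metagraph or in kg_1_stats.xml.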
If everything looks OK, congratulations, you have your beginner's network of genes connected to the proteins they encode.
Note about memory: most datasets will probably require you to tell Java to use more memory (RAM) than the small default usually allocated in the launch scripts above. In Bash, you can do this by running the following before running Mini or Ondex:
export JAVA_TOOL_OPTIONS="-Xmx8G"
which allocates 8 GB of RAM to Java (and hence to Ondex). Do not set this to more than 80-90% of the RAM in your system, since that could make the system unstable or even crash it. The command above is needed every time you start a new Bash session (or only once, if you put it in your Bash configuration file).
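As a rough aid for choosing the value, the following Bash sketch derives a heap size from the total RAM reported by Linux. It is an illustration rather than part of Ondex: the function names are hypothetical, /proc/meminfo only exists on Linux, and the 80% default simply follows the guideline above.

```shell
# Print a -Xmx option equal to a percentage of a RAM size given in kB.
xmx_from_kb() {
  local total_kb="$1" percent="${2:-80}"
  echo "-Xmx$(( total_kb * percent / 100 / 1024 ))m"
}

# Linux only: read MemTotal (in kB) from /proc/meminfo and suggest a heap size.
suggest_java_heap() {
  local total_kb
  total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
  xmx_from_kb "$total_kb" "${1:-80}"
}

# Example: use 80% of the machine's RAM for the Java heap.
# export JAVA_TOOL_OPTIONS="$(suggest_java_heap 80)"
```

The value is expressed in megabytes (the "m" suffix) so that integer arithmetic does not round small machines down to zero gigabytes.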
Download the protein-domain information from BioMart, choose "Solanum tuberosum". Click on "Features", unselect everything under "Attributes" and select only "Protein stable ID". Open "Protein Domains" and select "InterPro ID", "InterPro short description" and "InterPro description". Under "Filters->Protein Domains" you can select "Limit to genes with Interpro ID(s)".
The downloaded tabular file should look like this:
Protein stable ID | Interpro ID | Interpro Short Description | Interpro Description |
---|---|---|---|
PGSC0003DMT400092517 | IPR025558 | DUF4283 | Domain of unknown function DUF4283 |
PGSC0003DMT400092522 | IPR009518 | PSII_PsbX | Photosystem II PsbX |
PGSC0003DMT400092528 | IPR003105 | SRA_YDG | SRA-YDG |
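Before writing a parser configuration, it can help to list the columns of the downloaded file with their positions and to confirm that every row has the same number of columns. The helper names below are hypothetical, and the zero-based numbering is an assumption based on how the column indices in this guide's configurations appear to be counted.

```shell
# Print each header column with a zero-based position (assumed numbering).
list_columns() {
  head -n 1 "$1" | tr '\t' '\n' | awk '{ printf "%d\t%s\n", NR - 1, $0 }'
}

# Report rows whose tab-separated column count differs from the header's.
# Returns non-zero if any such row exists.
check_tab_columns() {
  awk -F'\t' '
    NR == 1 { n = NF; next }
    NF != n { bad = 1; print "line " NR ": " NF " columns, expected " n }
    END     { exit bad }
  ' "$1"
}

# Example (file name taken from the tutorial workflow):
# list_columns tutorial-data/protein_domains.txt
# check_tab_columns tutorial-data/protein_domains.txt
```

The positions printed by list_columns are the values to compare against the column index attributes in the parser configuration.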
Ondex has a generic parser for tabular files, called tabParser2, that can be configured via XML. The XML schema can be found here (human-readable version).
The tabParser2 configuration for the above protein-domain table could look like this (adjust the column index values if the columns in your file are ordered differently):
<?xml version = "1.0" encoding = "UTF-8" ?>
<parser
xmlns = "http://www.ondex.org/xml/schema/tab_parser"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<delimiter>\t</delimiter>
<quote>"</quote>
<encoding>UTF-8</encoding>
<start-line>1</start-line>
<concept id = "prot">
<class>Protein</class>
<data-source>ENSEMBL</data-source>
<accession data-source="ENSEMBL-PLANTS">
<column index='0' />
</accession>
</concept>
<concept id = "protDomain">
<class>ProtDomain</class>
<data-source>ENSEMBL</data-source>
<name preferred="true">
<column index='3' />
</name>
<name>
<column index='1' />
</name>
<accession data-source="IPRO">
<column index='2' />
</accession>
<attribute name="Description" type="TEXT">
<column index='4' />
</attribute>
</concept>
<relation source-ref="prot" target-ref="protDomain">
<type>has_domain</type>
</relation>
</parser>
We are now going to create a new workflow (my_workflow_2.xml) with instructions to parse the tabular file and export to OXL:
<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
<Workflow>
<Graph name="memorygraph">
<Arg name="GraphName">default</Arg>
<Arg name="graphId">default</Arg>
</Graph>
<Parser name="tabParser2">
<Arg name="InputFile">tutorial-data/protein_domains.txt</Arg>
<Arg name="configFile">tutorial-data/protein_domain_config.xml</Arg>
<Arg name="graphId">default</Arg>
</Parser>
<Export name="oxl">
<Arg name="pretty">true</Arg>
<Arg name="ExportIsolatedConcepts">true</Arg>
<Arg name="GZip">true</Arg>
<Arg name="ExportFile">tutorial-data/kg_2.oxl</Arg>
<Arg name="graphId">default</Arg>
</Export>
</Workflow>
</Ondex>
As before, all we need to do is run Ondex-CLI with the above workflow:
bash runme.sh tutorial-data/my_workflow_2.xml
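Because the workflows in this guide set the GZip argument to true, the exported .oxl file is presumably gzip-compressed XML and cannot be read directly in a text editor. The sketch below checks the archive's integrity and prints its first lines. The helper name peek_oxl is hypothetical, and a standard gzip installation is assumed.

```shell
# Sketch only: verify a gzip-compressed OXL file and show its first lines.
peek_oxl() {
  local oxl_file="$1" line_count="${2:-5}"
  # Reading from stdin avoids any dependence on the file name suffix.
  gzip -t < "$oxl_file" || { echo "not a valid gzip file: $oxl_file" >&2; return 1; }
  gzip -dc < "$oxl_file" | head -n "$line_count"
}

# Example (output file of the workflow above):
# peek_oxl tutorial-data/kg_2.oxl 10
```

If the check fails, the export was probably interrupted or written without compression.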
Our next goal is to connect our organism's data to a rich knowledge graph for Arabidopsis that can be licensed from Rothamsted.
To download the Compara data, we use Ensembl BioMart and choose "Solanum tuberosum".
Click on "Attributes", then on "Homologs"; for now we will get the homologs for A. thaliana. Under "Gene", unselect everything under "Gene Attributes" and select only "Protein stable ID". Then open "Orthologs" and select "Arabidopsis thaliana protein stable ID", "Homology type", "%id. target" and "%id. query".
Then scroll back up and click "Results" in the top left corner. You will see a few example results; click the "Go" button next to "Export all results to" to get all results as a tab-delimited file. The first lines of that file should look something like this:
Gene stable ID | Protein stable ID | Arabidopsis thaliana protein or transcript stable ID | Arabidopsis thaliana homology type | %id. target Arabidopsis thaliana gene identical to query gene | %id. query gene identical to target Arabidopsis thaliana gene |
---|---|---|---|---|---|
PGSC0003DMG400042093 | PGSC0003DMT400092522 | AT2G06520.1 | ortholog_one2many | 59.1667 | 61.2069 |
PGSC0003DMG400042126 | PGSC0003DMT400092555 | AT1G05120.2 | ortholog_many2many | 57.8512 | 7.99087 |
PGSC0003DMG400042126 | PGSC0003DMT400092555 | AT1G02670.3 | ortholog_many2many | 44.6281 | 7.82609 |
PGSC0003DMG400042168 | PGSC0003DMT400092597 | AT4G29530.1 | ortholog_one2one | 56 | 51.4286 |
(etc.)
Note: lines with a potato protein but no Arabidopsis ortholog should be deleted, for example with:
awk '{if ( $3 != "") print}' your_file > compara.txt
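A variant that splits strictly on tabs may be more robust when column values contain spaces. The sketch below is an alternative under that assumption, not the command used by the authors. The function name is hypothetical, and the header line is kept because its third field is not empty.

```shell
# Keep only rows whose third tab-separated column (the Arabidopsis ID) is set.
drop_rows_without_ortholog() {
  awk -F'\t' '$3 != ""' "$1"
}

# Example (same file names as in the command above):
# drop_rows_without_ortholog your_file > compara.txt
```

Comparing the line counts of the input and output files shows how many potato proteins had no Arabidopsis ortholog.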
The tabParser2 configuration for the above tabular file could look like this:
<?xml version = "1.0" encoding = "UTF-8" ?>
<parser
xmlns = "http://www.ondex.org/xml/schema/tab_parser"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<delimiter>\t</delimiter>
<quote>"</quote>
<encoding>UTF-8</encoding>
<start-line>1</start-line>
<concept id="protL">
<class>Protein</class>
<data-source>EnsemblCompara</data-source>
<accession data-source="ENSEMBL-PLANTS">
<column index='1' />
</accession>
</concept>
<concept id="protR">
<class>Protein</class>
<data-source>EnsemblCompara</data-source>
<accession data-source="TAIR">
<column index='2' />
</accession>
</concept>
<relation source-ref="protL" target-ref="protR">
<type>ortho</type>
<evidence>EnsemblCompara</evidence>
<attribute name="Homology_type" type="TEXT">
<column index='3' />
</attribute>
<attribute name="%Identity_Arabidopsis" type="NUMBER">
<column index='4' />
</attribute>
<attribute name="%Identity_Potato" type="NUMBER">
<column index='5' />
</attribute>
</relation>
</parser>
You can again construct a workflow similar to the protein-domain workflow and create a network with ortholog relations between potato and Arabidopsis.
We are going to skip building an individual workflow for the Compara data and instead assemble all the previous steps into a single workflow. This will connect the potato gene-protein-domain information with a pre-integrated Arabidopsis KG that contains many types of information, including publications, phenotypes and GO annotations.
<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
<Workflow>
<Graph name="memorygraph">
<Arg name="GraphName">default</Arg>
<Arg name="graphId">default</Arg>
</Graph>
<!-- Gene-Protein -->
<Parser name="fastagff">
<Arg name="GFF3 File">tutorial-data/gff3</Arg>
<Arg name="Fasta File">tutorial-data/pep.all.fa</Arg>
<Arg name="Mapping File">tutorial-data/mapping.txt</Arg>
<Arg name="TaxId">4113</Arg>
<Arg name="Accession">ENSEMBL-PLANTS</Arg>
<Arg name="DataSource">ENSEMBL</Arg>
<Arg name="Column of the genes">0</Arg>
<Arg name="Column of the proteins">1</Arg>
<Arg name="graphId">default</Arg>
</Parser>
<!-- Protein Domain -->
<Parser name="tabParser2">
<Arg name="InputFile">tutorial-data/protein_domains.txt</Arg>
<Arg name="configFile">tutorial-data/protein_domain_config.xml</Arg>
<Arg name="graphId">default</Arg>
</Parser>
<!-- Homology -->
<Parser name="tabParser2">
<Arg name="InputFile">tutorial-data/compara.txt</Arg>
<Arg name="configFile">tutorial-data/compara_config.xml</Arg>
<Arg name="graphId">default</Arg>
</Parser>
<!-- Arabidopsis knowledge network from Rothamsted -->
<Parser name="oxl">
<Arg name="InputFile">tutorial-data/ArabidopsisKG_201610.oxl</Arg>
<Arg name="graphId">default</Arg>
</Parser>
<!-- Mapping -->
<Mapping name="lowmemoryaccessionbased">
<Arg name="IgnoreAmbiguity">false</Arg>
<Arg name="RelationType">collapse_me</Arg>
<Arg name="WithinDataSourceMapping">true</Arg>
<Arg name="graphId">default</Arg>
</Mapping>
<Transformer name="relationcollapser">
<Arg name="CloneAttributes">true</Arg>
<Arg name="CopyTagReferences">true</Arg>
<Arg name="graphId">default</Arg>
<Arg name="RelationType">collapse_me</Arg>
</Transformer>
<!-- Export knowledge network -->
<Export name="oxl">
<Arg name="pretty">true</Arg>
<Arg name="ExportIsolatedConcepts">true</Arg>
<Arg name="GZip">true</Arg>
<Arg name="ExportFile">tutorial-data/kg-final.oxl</Arg>
<Arg name="graphId">default</Arg>
</Export>
</Workflow>
</Ondex>
NOTE: All data and config files used in this workflow are available in tutorial-data. Make sure the PC or server that runs the workflow has sufficient memory (32 GB RAM or more). The resulting network will be very large, but it can still be opened in Ondex if enough memory is available. Ondex will not be able to visualise the entire network, but it can produce some useful information and provides simple search and filter tools for first-pass quality checks of the knowledge graph before it is deployed in KnetMiner for further checks.
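Since the combined workflow reads several files, a quick pre-flight check can save a failed run. The following sketch extracts the file paths from the workflow's Arg elements with sed and reports any that do not exist. It is a convenience script rather than an Ondex feature: the function names are hypothetical, and it assumes one Arg element per line, as in the listings above. Run it from the Ondex-CLI root so that relative paths resolve.

```shell
# Print the values of file-valued Arg elements found in a workflow XML file.
list_workflow_inputs() {
  sed -n -E 's/.*<Arg name="(InputFile|configFile|GFF3 File|Fasta File|Mapping File)">([^<]*)<\/Arg>.*/\2/p' "$1"
}

# Report input paths that do not exist; returns non-zero if any is missing.
check_workflow_inputs() {
  list_workflow_inputs "$1" | (
    missing=0
    while IFS= read -r path; do
      [ -e "$path" ] || { echo "missing: $path" >&2; missing=1; }
    done
    exit "$missing"
  )
}

# Example (replace with the path of your own workflow file):
# check_workflow_inputs path/to/your_workflow.xml
```

Export targets such as ExportFile are deliberately left out, because those files are created by the workflow itself.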
The final OXL (in this case tutorial-data/kg-final.oxl) will be used in the KnetMiner server.
RDF Exporter
Neo4j Exporter
New Tab/CSV Importer
BK-Net Ontology
rdf2neo tool for RDF->Neo4j