Building Knowledge Networks
The KnetMiner web application requires, as input, a genome-scale knowledge graph (KG). More introductory background on the KnetMiner KGs can be found in our paper Hassani-Pak et al. 2016. Here we provide an overview of the steps involved in building KGs for new species. Knowledge graphs can be created using the Ondex CLI (also known as KnetBuilder). This guide uses the Ondex CLI to build KGs and the Ondex Desktop application to inspect them.
- Linux server with 32GB RAM or more
- Java 8 (Java 11 if you want to use the snapshot/development version)
- ondex-knet-builder v3.0 download
- Ondex-Desktop download
We are going to use Solanum tuberosum (potato) as an example organism.
- Potato GFF3 download
- Potato peptide FASTA download
- Potato gene-protein mapping download - Save as mart_export.txt or mapping.txt
- Potato protein domains download
- Potato-Arabidopsis orthologs download
- Arabidopsis KG v45 download (not part of tutorial-data.zip)
Download and unzip ondex-knet-builder, which will create a top-level (root) folder called something like "ondex-mini". Now create a tutorial-data folder anywhere on your file system. You can download the individual datasets into the tutorial-data folder, or download and unzip the tutorial-data.zip data bundle (note: the bundle does not contain arabidopsis-45.oxl, which needs to be downloaded separately).
Ondex contains parsers (or importers) for a range of data formats including FASTA, GFF3, Tabular, UniProt-XML, Pubmed-XML, OWL etc. The role of an Ondex parser is to transform the raw data into the graph model using the standardized Ondex metadata. Here we describe how to build a core network of genes, proteins and domains for a particular organism.
Download the GFF3 and protein FASTA files, and unzip them if they are zipped. There is an Ondex parser plugin, called fastagff, that we can use to create a gene-protein network. The parser has the following parameters:
Fastagff parser parameters:
- GFF3 File: Path to GFF3
- Fasta File: Path to peptide FASTA
- Mapping File: Path to a tabular gene and protein ID mapping file. Required if protein IDs are not equal to gene_id.x
- TaxId [Int]: Taxonomy ID of your organism
- Accession [String]: Cross-reference database (xref)
- DataSource [String]: Data origin (provenance)
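To illustrate the Mapping File format: it is a two-column, tab-separated table of gene and protein IDs, matching the 0-based "Column of the genes"/"Column of the proteins" parameters used in the workflow below. This is a sketch with made-up IDs in the potato naming style; the real file comes from the BioMart download above.

```shell
# Build a toy mapping.txt (IDs below are illustrative, not real download content)
printf 'PGSC0003DMG400042093\tPGSC0003DMT400092517\n'  > mapping.txt
printf 'PGSC0003DMG400042126\tPGSC0003DMT400092555\n' >> mapping.txt

# Sanity-check: every line should have exactly 2 tab-separated fields
awk -F'\t' '{ print NF }' mapping.txt | sort -u   # prints: 2
```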
We are now going to create an Ondex workflow file (my_workflow.xml) that instructs Ondex-CLI to run the fastagff parser and export the graph into OXL (Ondex Exchange Language) and create some basic stats (XML).
<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
<Workflow>
<Graph name="memorygraph">
<Arg name="GraphName">default</Arg>
<Arg name="graphId">default</Arg>
</Graph>
<!-- Gene-Protein -->
<Parser name="fastagff">
<Arg name="GFF3 File">${baseDir}/gff3</Arg>
<Arg name="Fasta File">${baseDir}/protein_fa</Arg>
<Arg name="Mapping File">${baseDir}/mapping.txt</Arg>
<Arg name="TaxId">4113</Arg> <!-- Set to TAXID of your organism -->
<Arg name="Accession">ENSEMBL-PLANTS</Arg>
<Arg name="DataSource">ENSEMBL</Arg>
<Arg name="Column of the genes">0</Arg>
<Arg name="Column of the proteins">1</Arg>
<Arg name="graphId">default</Arg>
</Parser>
<Export name="oxl">
<Arg name="pretty">true</Arg>
<Arg name="ExportIsolatedConcepts">true</Arg>
<Arg name="GZip">true</Arg>
<Arg name="ExportFile">${baseDir}/kg_1.oxl</Arg>
<Arg name="graphId">default</Arg>
</Export>
<Export name="graphinfo">
<Arg name="ExportFile">${baseDir}/kg_1_stats.xml</Arg>
<Arg name="graphId">default</Arg>
</Export>
</Workflow>
</Ondex>
To run the workflow, you need to load Java 8, go to the ondex-mini root folder and execute the runme.sh script:
module load Java/1.8.0_192
cd /home/data/knetminer/software/ondex-mini-3.0/
export JAVA_TOOL_OPTIONS="-Xmx8G"
echo $JAVA_TOOL_OPTIONS
./runme.sh /home/data/knetminer/pub/tutorial-data/my_workflow.xml "baseDir=/home/data/knetminer/pub/tutorial-data/"
Note about memory: most datasets will probably require you to tell Java to use more memory (RAM) than the small default we usually set in the launching scripts above. In Bash, this can be done by running the following before starting ondex-mini or Ondex Desktop:
export JAVA_TOOL_OPTIONS="-Xmx8G"
which allocates 8 GB of RAM for Java (and hence Ondex). Don't set this to more than 80% of the RAM in your system, since that could make it unstable or even crash. The instruction above is needed every time you start a new Bash session (or only once, if you put it in your Bash configuration file).
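For example, following the 80% rule of thumb on a 32 GB server (the figure is an example; adjust it to your machine):

```shell
total_gb=32                                # adjust to your server's RAM
max_heap_gb=$(( total_gb * 80 / 100 ))     # stay within ~80% of physical RAM
export JAVA_TOOL_OPTIONS="-Xmx${max_heap_gb}G"
echo "$JAVA_TOOL_OPTIONS"                  # prints: -Xmx25G
```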
Once the workflow has completed, it should create a kg_1.oxl file in the folder specified by the OXL exporter. You can open this file in Ondex Desktop, where it should look similar to the figure below. Use the Ondex Metagraph and Legend for some useful information. Checks: are the gene and protein counts the same as in the GFF and FASTA files? Are the gene and protein concepts connected via a relation? Search for a few gene names and check that the gene/protein names are correct.
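The expected counts for those checks can be taken directly from the input files. A minimal sketch, using toy stand-ins for the GFF3 and FASTA (the real numbers come from the tutorial downloads):

```shell
# Toy stand-ins for the real inputs (two genes, two proteins)
printf 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene1\n'    > toy.gff3
printf 'chr1\tsrc\tgene\t200\t300\t.\t+\t.\tID=gene2\n' >> toy.gff3
printf '>PGSC0003DMT400092517\nMSTNLV\n>PGSC0003DMT400092522\nMKVQ\n' > toy.fa

# Gene features in the GFF3 (column 3 holds the feature type)
awk -F'\t' '$3 == "gene"' toy.gff3 | wc -l

# Protein sequences in the FASTA (one header line per sequence)
grep -c '^>' toy.fa   # prints: 2
```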
You can also check the kg_1_stats.xml report that was generated by the graphinfo exporter.
If everything looks OK, congratulations, you have your beginner's network of genes connected to the proteins they encode.
Download the protein-domain information from BioMart, choose "Solanum tuberosum". Click on "Features", unselect everything under "Attributes" and select only "Protein stable ID". Open "Protein Domains" and select "InterPro ID", "InterPro short description" and "InterPro description". Under "Filters->Protein Domains" you can select "Limit to genes with Interpro ID(s)".
The downloaded tabular file should look like this:
Protein stable ID | Interpro ID | Interpro Short Description | Interpro Description |
---|---|---|---|
PGSC0003DMT400092517 | IPR025558 | DUF4283 | Domain of unknown function DUF4283 |
PGSC0003DMT400092522 | IPR009518 | PSII_PsbX | Photosystem II PsbX |
PGSC0003DMT400092528 | IPR003105 | SRA_YDG | SRA-YDG |
Ondex has a generic parser for tabular files, called tabParser2, that can be configured via XML. The XML schema can be found here (human-readable version). The tabParser2 configuration for the above protein-domain table could look like this:
<?xml version = "1.0" encoding = "UTF-8" ?>
<parser
xmlns = "http://www.ondex.org/xml/schema/tab_parser"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<delimiter>\t</delimiter>
<quote>"</quote>
<encoding>UTF-8</encoding>
<start-line>1</start-line>
<concept id = "prot">
<class>Protein</class>
<data-source>ENSEMBL</data-source>
<accession data-source="ENSEMBL-PLANTS">
<column index='0' />
</accession>
</concept>
<concept id = "protDomain">
<class>ProtDomain</class>
<data-source>ENSEMBL</data-source>
<name preferred="true">
<column index='2' />
</name>
<accession data-source="IPRO">
<column index='1' />
</accession>
<attribute name="Description" type="TEXT">
<column index='3' />
</attribute>
</concept>
<relation source-ref="prot" target-ref="protDomain">
<type>has_domain</type>
</relation>
</parser>
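Before running the parser, it is worth confirming that every data row of the download really has the four tab-separated columns shown in the table above. A quick sketch on a toy file; run the same awk on your real protein_domains.txt:

```shell
# Toy two-line version of the BioMart protein-domain export
printf 'Protein stable ID\tInterpro ID\tInterpro Short Description\tInterpro Description\n' > protein_domains.txt
printf 'PGSC0003DMT400092517\tIPR025558\tDUF4283\tDomain of unknown function DUF4283\n'    >> protein_domains.txt

# Field counts per data row (skipping the header); a single "4" means the layout is consistent
awk -F'\t' 'NR > 1 { print NF }' protein_domains.txt | sort -u   # prints: 4
```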
We are now going to create a new workflow (my_workflow_2.xml) with instructions to parse the tabular file and export to OXL:
<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
<Workflow>
<Graph name="memorygraph">
<Arg name="GraphName">default</Arg>
<Arg name="graphId">default</Arg>
</Graph>
<Parser name="tabParser2">
<Arg name="InputFile">${baseDir}/protein_domains.txt</Arg>
<Arg name="configFile">${baseDir}/protein_domains_config.xml</Arg>
<Arg name="graphId">default</Arg>
</Parser>
<Export name="oxl">
<Arg name="pretty">true</Arg>
<Arg name="ExportIsolatedConcepts">true</Arg>
<Arg name="GZip">true</Arg>
<Arg name="ExportFile">${baseDir}/kg_2.oxl</Arg>
<Arg name="graphId">default</Arg>
</Export>
</Workflow>
</Ondex>
All we need to do is run Ondex-CLI again with the above workflow:
bash runme.sh /home/data/knetminer/pub/tutorial-data/my_workflow_2.xml "baseDir=/home/data/knetminer/pub/tutorial-data/"
Our next goal is to connect our organism's data to a rich Arabidopsis knowledge graph that can be licensed from Rothamsted.
To download the Compara data, we use Ensembl BioMart and choose "Solanum tuberosum".
Click on "Attributes", then on "Homologs"; for now we'll get the homologs for A. thaliana. Under "Gene", unselect everything under "Gene Attributes" and select only "Protein stable ID". Then open "Orthologs" and select "Arabidopsis thaliana protein stable ID", "Homology type", "%id. target" and "%id. query".
Then scroll back up and click "Results" in the top left corner. You'll see a few example rows; click the "Go" button next to "Export all results to" to download all results as a tab-delimited file. The header of that file should look something like this:
Gene stable ID Protein stable ID Arabidopsis thaliana protein or transcript stable ID Arabidopsis thaliana homology type %id. target Arabidopsis thaliana gene identical to query gene %id. query gene identical to target Arabidopsis thaliana gene
PGSC0003DMG400042093 PGSC0003DMT400092522 AT2G06520.1 ortholog_one2many 59.1667 61.2069
PGSC0003DMG400042126 PGSC0003DMT400092555 AT1G05120.2 ortholog_many2many 57.8512 7.99087
PGSC0003DMG400042126 PGSC0003DMT400092555 AT1G02670.3 ortholog_many2many 44.6281 7.82609
PGSC0003DMG400042168 PGSC0003DMT400092597 AT4G29530.1 ortholog_one2one 56 51.4286
(etc.)
Note that lines with a potato protein but no Arabidopsis ortholog should be removed:
awk -F'\t' '$3 != ""' your_file > compara.txt
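To see what the filter does, here it is on a two-line toy file (the second row is made up; setting -F'\t' explicitly stops awk from collapsing empty tab-separated fields):

```shell
# Toy export: the second row has no Arabidopsis ortholog (empty 3rd column)
printf 'PGSC0003DMG400042093\tPGSC0003DMT400092522\tAT2G06520.1\tortholog_one2many\t59.1667\t61.2069\n' > your_file
printf 'PGSC0003DMG400099999\tPGSC0003DMT400099999\t\t\t\t\n'                                          >> your_file

# Keep only rows whose 3rd tab-separated field is non-empty
awk -F'\t' '$3 != ""' your_file > compara.txt
grep -c . compara.txt   # prints: 1
```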
The 'tabParser2' configuration for the above tabular file could look like this:
<?xml version = "1.0" encoding = "UTF-8" ?>
<parser
xmlns = "http://www.ondex.org/xml/schema/tab_parser"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<delimiter>\t</delimiter>
<quote>"</quote>
<encoding>UTF-8</encoding>
<start-line>1</start-line>
<concept id="protL">
<class>Protein</class>
<data-source>EnsemblCompara</data-source>
<accession data-source="ENSEMBL-PLANTS">
<column index='1' />
</accession>
</concept>
<concept id="protR">
<class>Protein</class>
<data-source>EnsemblCompara</data-source>
<accession data-source="TAIR">
<column index='2' />
</accession>
</concept>
<relation source-ref="protL" target-ref="protR">
<type>ortho</type>
<evidence>EnsemblCompara</evidence>
<attribute name="Homology_type" type="TEXT">
<column index='3' />
</attribute>
<attribute name="%Identity_Arabidopsis" type="NUMBER">
<column index='4' />
</attribute>
<attribute name="%Identity_Potato" type="NUMBER">
<column index='5' />
</attribute>
</relation>
</parser>
You can again construct a workflow similar to the protein-domain workflow and create a network with ortholog relations between potato and Arabidopsis.
We are going to skip building an individual workflow for the Compara data and instead assemble all the previous steps into a single workflow. This will connect the potato gene-protein-domain information with a pre-integrated Arabidopsis KG that has many types of information, including publications, phenotypes and GO annotations.
<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
<Workflow>
<Graph name="memorygraph">
<Arg name="GraphName">default</Arg>
<Arg name="graphId">default</Arg>
</Graph>
<!-- Gene-Protein -->
<Parser name="fastagff">
<Arg name="GFF3 File">${baseDir}/gff3</Arg>
<Arg name="Fasta File">${baseDir}/protein_fa</Arg>
<Arg name="Mapping File">${baseDir}/mapping.txt</Arg>
<Arg name="TaxId">4113</Arg> <!-- Set to TAXID of your organism -->
<Arg name="Accession">ENSEMBL-PLANTS</Arg>
<Arg name="DataSource">ENSEMBL</Arg>
<Arg name="Column of the genes">0</Arg>
<Arg name="Column of the proteins">1</Arg>
<Arg name="graphId">default</Arg>
</Parser>
<!-- Protein Domain -->
<Parser name="tabParser2">
<Arg name="InputFile">${baseDir}/protein_domains.txt</Arg>
<Arg name="configFile">${baseDir}/protein_domains_config.xml</Arg>
<Arg name="graphId">default</Arg>
</Parser>
<!-- Homology -->
<Parser name="tabParser2">
<Arg name="InputFile">${baseDir}/compara.txt</Arg>
<Arg name="configFile">${baseDir}/compara_config.xml</Arg>
<Arg name="graphId">default</Arg>
</Parser>
<!-- Arabidopsis KG from Rothamsted -->
<Parser name="oxl">
<Arg name="InputFile">${baseDir}/arabidopsis-45.oxl</Arg>
<Arg name="graphId">default</Arg>
</Parser>
<!-- Mapping -->
<Mapping name="lowmemoryaccessionbased">
<Arg name="IgnoreAmbiguity">false</Arg>
<Arg name="RelationType">collapse_me</Arg>
<Arg name="WithinDataSourceMapping">true</Arg>
<Arg name="graphId">default</Arg>
</Mapping>
<!-- Collapsing -->
<Transformer name="relationcollapser">
<Arg name="CloneAttributes">true</Arg>
<Arg name="CopyTagReferences">true</Arg>
<Arg name="graphId">default</Arg>
<Arg name="RelationType">collapse_me</Arg>
</Transformer>
<!-- Export knowledge graph -->
<Export name="oxl">
<Arg name="pretty">true</Arg>
<Arg name="ExportIsolatedConcepts">true</Arg>
<Arg name="GZip">true</Arg>
<Arg name="ExportFile">${baseDir}/kg-final.oxl</Arg>
<Arg name="graphId">default</Arg>
</Export>
</Workflow>
</Ondex>
Run the workflow like this:
module load Java/1.8.0_192
export JAVA_TOOL_OPTIONS="-Xmx24G"
echo $JAVA_TOOL_OPTIONS
cd /home/data/knetminer/software/ondex-mini-3.0/
./runme.sh /home/data/knetminer/pub/tutorial-data/workflow.xml "baseDir=/home/data/knetminer/pub/tutorial-data/"
All data and config files used in this workflow are located in the tutorial-data folder, and the path is provided via the baseDir= argument to KnetBuilder (ondex-mini). To run this workflow you will need Java 8 and 24 GB RAM. The resulting knowledge graph will have over a million relationships, but it can still be opened in Ondex if enough memory is available. Ondex won't be able to visualise the entire KG, but it can produce some useful information and provides simple search and filter tools for first-pass quality checks of the knowledge graph before deploying it in KnetMiner for further checks.
The final OXL (in this case named tutorial-data/kg-final.oxl) will be used by the KnetMiner server.
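Because the exporter above is configured with GZip=true, the resulting .oxl file is gzip-compressed XML, so you can take a quick look at it from the shell before loading it anywhere. Sketched here on a tiny stand-in file rather than a real export:

```shell
# Create a tiny gzip-compressed XML stand-in for an exported .oxl
printf '<?xml version="1.0" encoding="UTF-8"?>\n<ondex/>\n' | gzip > kg-final.oxl

# gzip -dc decompresses to stdout without touching the file
gzip -dc kg-final.oxl | head -n 1   # prints: <?xml version="1.0" encoding="UTF-8"?>
```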
- RDF Exporter
- Neo4j Exporter
- New Tab/CSV Importer
- BK-Net Ontology
- rdf2neo tool for RDF->Neo4j