
Semantic Motif Searching in Knetminer

Marco Brandizi edited this page Jun 24, 2020 · 20 revisions

Introduction

Knetminer combines graph patterns with traditional search ranking techniques to estimate how relevant genes are to the search words. This estimate is then used to rank and select the genes shown as search results.

Details are available here. We define a semantic motif as a graph path (or a pattern matching a path) from a gene to another entity in a Knetminer knowledge graph. An example (in an informal syntax):

  Gene -  encodes -> Protein - interacts-with (1-2 links) -> Protein <- mentions <- Publication

This links a gene to the publications that mention a protein interacting (within one or two steps) with the protein that the gene encodes.

Knetminer can link genes to other entities by means of multiple motifs like the one above. Every dataset/species that makes up an instance can be configured with a set of motifs, which are applied to the genes in the dataset to find relevant gene-related entities.

This matching is performed by what we call a graph traverser. Currently, there are two ways to perform semantic motif searches in Knetminer, each having its own language to define the motifs and its own set of configuration options. Each has its own graph traverser, which means you can choose the type of semantic motif search you want to use, and thus the corresponding graph pattern language, by defining the right traverser in a configuration file. Details are given in this document.

The Data Model for the Knetminer Knowledge Graphs

Both of the graph traversers used in Knetminer (or any other traverser, for that matter) allow graph patterns to be defined by referring to the node type names and link (relation) type names used in the underlying Knetminer dataset. This is essentially a knowledge graph, namely a property graph, and those names are based on a predefined schema. The reference for this schema is a metadata file included in Ondex. Examples of it are given in our paper about the Knetminer backend. The same metadata are automatically translated into our BioKNO ontology, and sample SPARQL queries are presented in our SPARQL endpoint.

All the examples in this document are based on the same metadata.

The State Machine Traverser

Historically, the so-called state machine traverser (SM) was the first to be developed, within the Ondex project. It allows semantic motifs to be defined as a graph of transitions between node types (concept classes in Ondex terms), where the transitions are the relation types you want to hold between nodes.

For instance, this is what we use for the Arabidopsis dataset:

[Figure: Arabidopsis State Machine]

Here we are saying, for example, that we want to match a gene with any trait that co-occurs (cooc_wi) with the gene (in the sense of text mining co-occurrence), and that both relations Gene - cooc_wi -> Trait and Trait - cooc_wi -> Gene will be matched (non-directional link). As another example, look again at the figure and find the chain Gene - enc - Protein - genetic|physical -> Protein, which includes self-loops on the first protein, mixed directed and undirected links, and multiple relation types that are valid to link one protein to the next.

A state machine can be defined in a simple flat file format. The file defining the SM in the figure is here. Let's look at an example:

#Finite States *=start state ^=end state
1*	Gene
2^	Publication
3^	MolFunc
...
7^	Protein
8^	Gene
9	Gene
10	Protein
...
10-10	ortho	4
10-10	xref	4
10-10	genetic	6	d
...
1-10	enc
10-7	physical	6	d
10-7	genetic	6	d
...

The format is very simple:

  • Every row in the file defines either a node or a transition; the fields in each row are separated by tab characters
  • Node types are numbered in a first section of the file (type names must match Ondex concept classes)
  • Node numbers are used to define transitions between nodes (similarly to nodes, transition names must match Ondex relation types)
  • A transition has the format:
      <node1>-<node2>	<name>	[limit]	[d]
    
    Where limit is the max "distance" of a path that is found between the gene and <node2>. In calculating such a distance, the first gene counts 1, and every following link or node counts 1. For instance, the distance of protein1 in the path gene0 -> encodes -> protein0 -> ortho -> protein1 is 5.
  • The optional flag 'd' defines a directional transition. Note that these are usually faster to match. So, even where direction makes no practical difference, as for enc in the example above (we don't expect data where proteins encode genes), adding the flag can speed things up if you run into performance problems.
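
As an illustration of the format rules above, here is a minimal Python sketch of a reader for such a file. It is not the parser used by Ondex or Knetminer: the names State, Transition, parse_state_machine and distance_for_links are hypothetical, and the sketch assumes that the optional limit and 'd' fields always follow the transition name and that "..." lines are only elisions in the excerpt.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class State:
    number: int
    concept_class: str  # expected to match an Ondex concept class
    is_start: bool = False
    is_end: bool = False


@dataclass
class Transition:
    source: int
    target: int
    relation_type: str  # expected to match an Ondex relation type
    limit: Optional[int] = None
    directed: bool = False


def parse_state_machine(text: str) -> Tuple[Dict[int, State], List[Transition]]:
    """Read node rows ("1*<TAB>Gene") and transition rows ("10-7<TAB>physical<TAB>6<TAB>d")."""
    states: Dict[int, State] = {}
    transitions: List[Transition] = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines and the "..." elisions of the excerpt
        if not line or line.startswith("#") or line == "...":
            continue
        fields = line.split("\t")
        head = fields[0]
        if "-" in head:
            # Transition row: <node1>-<node2>, name, optional limit, optional 'd'
            src, dst = head.split("-")
            extras = fields[2:]
            limit = next((int(f) for f in extras if f.isdigit()), None)
            transitions.append(
                Transition(int(src), int(dst), fields[1], limit, "d" in extras)
            )
        else:
            # Node row: number with optional '*' (start) or '^' (end) marker
            number = int(head.rstrip("*^"))
            states[number] = State(number, fields[1], "*" in head, "^" in head)
    return states, transitions


def distance_for_links(n_links: int) -> int:
    """Distance as defined above: the first gene counts 1, each link and each node after it counts 1."""
    return 1 + 2 * n_links


SAMPLE = (
    "#Finite States *=start state ^=end state\n"
    "1*\tGene\n"
    "7^\tProtein\n"
    "10\tProtein\n"
    "10-10\tortho\t4\n"
    "10-10\tgenetic\t6\td\n"
    "1-10\tenc\n"
    "10-7\tphysical\t6\td\n"
)
states, transitions = parse_state_machine(SAMPLE)
```

With two links, as in gene0 -> encodes -> protein0 -> ortho -> protein1, distance_for_links(2) gives 5, matching the worked example in the list above.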

How to configure the SM traverser in Knetminer

The SM configuration is part of the configuration required to set up a Knetminer instance, which is described in our wiki. Our pre-configured datasets are examples of it.

Details are:

  • In maven-settings.xml, leave the Maven property knetminer.api.graphTraverserClass empty, i.e., don't add it to your dataset-specific settings, so that it inherits the default empty value. This corresponds to picking the default traverser class, net.sourceforge.ondex.algorithm.graphquery.GraphTraverser, which is the SM traverser (so setting it explicitly to this class would achieve the same result). Note that the Maven property is injected into the Knetminer configuration file, which is generated from a template.

  • Define the state machine for your dataset in the file <dataset>/ws/SemanticMotifs.txt, using the format explained above. This is the path set in the Knetminer configuration file. An alternative would be to place your own data-source.xml file in the dataset directory and define a different path/name in it. Unless you have a particular need, we don't recommend it.

  • In maven-settings.xml, define the right knetminer.specieTaxId (a comma-separated list of NCBITax codes), which is used to pick the genes of your species of interest. Semantic motifs are applied to these genes during the Knetminer initialisation, in order to have pre-computed data to start searches from. These genes are named 'seed genes'. As an alternative to using the species ID, you can define a list of seed genes explicitly, by setting the property knetminer.backend.seedGenesFile in maven-settings.xml; see this example.
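
To make the two ways of choosing seed genes more concrete, the following Python sketch selects them either by taxonomy code or from an explicit list. It only illustrates the idea and is not Knetminer code: the gene records, the field names id and taxid, the gene identifiers, and the rule that an explicit list takes precedence over the taxonomy codes are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Optional, Set


def parse_tax_ids(value: str) -> Set[str]:
    """Split a comma-separated list of NCBI taxonomy codes, ignoring blanks."""
    return {item.strip() for item in value.split(",") if item.strip()}


def select_seed_genes(
    genes: Iterable[Dict[str, str]],
    tax_ids_csv: str = "",
    explicit_ids: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return the IDs of the genes to use as seeds.

    Assumption of this sketch: when an explicit list is given, it is used
    instead of the taxonomy codes.
    """
    genes = list(genes)
    if explicit_ids is not None:
        wanted = set(explicit_ids)
        return [g["id"] for g in genes if g["id"] in wanted]
    tax_ids = parse_tax_ids(tax_ids_csv)
    return [g["id"] for g in genes if g.get("taxid") in tax_ids]


# Hypothetical gene records; 3702 is the NCBI taxonomy code of Arabidopsis thaliana
GENES = [
    {"id": "geneA", "taxid": "3702"},
    {"id": "geneB", "taxid": "4565"},
    {"id": "geneC", "taxid": "3702"},
]
by_species = select_seed_genes(GENES, tax_ids_csv="3702, 4565")
by_list = select_seed_genes(GENES, explicit_ids=["geneC"])
```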

Performance tuning and troubleshooting

The SM traverser is usually rather efficient and doesn't have much to configure/tune. However, there are a few factors that affect its performance:

  • Obviously, the more seed genes you have, the slower the Knetminer initialisation is
  • Similarly, bigger state machines (in terms of total number of nodes + transitions) take more time, but the traversal is parallel and usually scales well.
  • The biggest impact on performance comes from what you match. For instance, if you have self-loops like protein->xref->protein, the traverser can easily hang trying to match long chains of cross-references, especially if there are loops in the graph. In a case like this, you should always define a low-enough limit constraint (see above).
  • Both traversers keep their initialisation results in memory (to be reused during the application lifetime). If you have many paths to match, you might need more RAM (see the Docker documentation). Increasing memory can also make the initialisation stage faster, since it reduces the frequency of intermediate cleaning operations (for the geeks, the garbage collector overhead).
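
The effect of the limit constraint on self-loops can be seen in a toy traversal. The Python sketch below is not the Ondex traverser: find_paths, the tuple layouts and the hard_cap safety bound are hypothetical. It assumes that the limit is compared with the distance of the target node as defined earlier (the real traverser may count differently), and it deliberately does not track visited nodes, so only the limit stops the walk around a cycle of cross-references.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

# (source state, target state, relation type, limit, directed)
Transition = Tuple[int, int, str, Optional[int], bool]
# (from node, relation type, to node)
Edge = Tuple[str, str, str]


def find_paths(
    seed: str,
    node_types: Dict[str, str],
    edges: List[Edge],
    states: Dict[int, str],
    end_states: Set[int],
    transitions: List[Transition],
    start_state: int = 1,
    hard_cap: int = 15,
) -> List[List[str]]:
    """Collect every path from the seed gene that stops in an end state."""
    results: List[List[str]] = []

    def neighbours(node: str, rel: str, directed: bool) -> Iterator[str]:
        for src, edge_type, dst in edges:
            if edge_type != rel:
                continue
            if src == node:
                yield dst
            elif dst == node and not directed:
                yield src  # undirected transitions also match reversed edges

    def visit(node: str, state: int, path: List[str], distance: int) -> None:
        if state in end_states:
            results.append(path)
        for src_state, dst_state, rel, limit, directed in transitions:
            if src_state != state:
                continue
            new_distance = distance + 2  # one link plus one node
            # hard_cap only keeps this sketch finite when no limit is set
            bound = hard_cap if limit is None else min(limit, hard_cap)
            if new_distance > bound:
                continue
            for nxt in neighbours(node, rel, directed):
                if node_types[nxt] == states[dst_state]:
                    visit(nxt, dst_state, path + [rel, nxt], new_distance)

    visit(seed, start_state, [seed], 1)
    return results


# Toy graph: one gene, and two proteins cross-referencing each other in a cycle
NODE_TYPES = {"g1": "Gene", "p1": "Protein", "p2": "Protein"}
EDGES = [("g1", "enc", "p1"), ("p1", "xref", "p2"), ("p2", "xref", "p1")]
STATES = {1: "Gene", 10: "Protein"}


def run(xref_limit: Optional[int]) -> List[List[str]]:
    transitions = [(1, 10, "enc", None, True), (10, 10, "xref", xref_limit, True)]
    return find_paths("g1", NODE_TYPES, EDGES, STATES, {10}, transitions)


limited = run(7)       # 3 paths: the xref loop is cut at distance 7
unlimited = run(None)  # 7 paths: only the artificial hard_cap stops the loop
```

In this sketch the number of matched paths grows with the allowed distance even on a three-node graph. On a real dataset, where each protein can have many cross-references, the growth is much faster, which is why a low limit on such self-loops matters.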

The SM renderer

The Cypher/Neo4j Traverser

How to configure the Cypher Traverser

Performance tuning and troubleshooting

Knetminer logs and performance reporter

Parameters tuning

The Cypher debugger