Semantic Motif Searching in Knetminer
Knetminer uses a combination of graph patterns and traditional search ranking techniques to estimate how relevant genes are to the search words. This estimate is used to rank and select the genes shown as search results.
Details are available here. We define a semantic motif as a graph path (or a pattern matching a path) from a gene to another entity in a Knetminer knowledge graph. An example (in an informal syntax):
Gene - encodes -> Protein - interacts-with (1-2 links) -> Protein <- mentions <- Publication
which links protein-mentioning publications to other interacting proteins and genes that encode the latter.
Knetminer can link genes to other entities by means of multiple motifs like the above. Every dataset/species that makes up an instance can be configured with a set of motifs, which are applied to the genes in the dataset to find relevant gene-related entities.
This matching is performed by what we call a graph traverser. Currently, there are two ways to perform semantic motif searches in Knetminer, each with its own language to define the motifs and its own set of configuration options. Each of these ways has its own graph traverser, which means you can choose the type of semantic motif search you want to use, and thus the corresponding graph pattern language, by defining the right traverser in a configuration file. Details are given in this document.
# The Data Model for the Knetminer Knowledge Graphs
Both the graph traversers used in Knetminer (or any other traverser, for that matter) allow for the definition of graph patterns by referring to the node type names and link names used in the underlying Knetminer dataset. This is essentially a knowledge graph, namely a property graph, and those names are based on a predefined schema. The reference for this schema is a metadata file included in Ondex. Examples of it are given in our paper about the Knetminer backend. The same metadata are automatically translated into our BioKNO ontology, and sample queries in SPARQL are presented in our SPARQL endpoint.
All the examples in this document are based on the same metadata.
Historically, the so-called state machine traverser (SM) was the first to be developed within the Ondex project. It allows you to define semantic motifs as a graph of transitions between node types (concept classes in Ondex terms) and the relation types that you want to hold between nodes.
For instance, this is what we use for the Arabidopsis dataset:
Here we're saying, for example, that we want to match a gene with any trait that co-occurs (cooc_wi) with the gene (in the sense of text mining occurrence), and that both relations Gene - cooc_wi -> Trait and Trait - cooc_wi -> Gene will be matched (non-directional link). As another example, look again at the figure and find the chain Gene - enc - Protein - genetic|physical -> Protein, which includes self-loops on the first protein, mixed directed and undirected links, and multiple relation types that are valid to link from one protein to the next.
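The difference between directional and non-directional links can be illustrated with a small Python sketch. This is not Knetminer or Ondex code: the edge list and the `neighbours` function are hypothetical, and they only show which edges each kind of transition would accept.

```python
# Toy edge list: (source, relation type, target). All identifiers are made up.
EDGES = [
    ("gene1", "cooc_wi", "trait1"),
    ("trait2", "cooc_wi", "gene1"),
    ("gene1", "enc", "protein1"),
    ("protein2", "physical", "protein1"),
]


def neighbours(node, relation, directed):
    """Return the nodes reachable from `node` through `relation`.

    A directional transition follows edges only from source to target;
    a non-directional one also accepts edges that point at `node`.
    """
    found = []
    for source, rel, target in EDGES:
        if rel != relation:
            continue
        if source == node:
            found.append(target)
        elif not directed and target == node:
            found.append(source)
    return sorted(found)


print(neighbours("gene1", "cooc_wi", directed=False))  # ['trait1', 'trait2']
print(neighbours("gene1", "cooc_wi", directed=True))   # ['trait1']
```

With the non-directional lookup, both Gene - cooc_wi -> Trait and Trait - cooc_wi -> Gene edges contribute a match, while the directional lookup sees only the first.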
A state machine can be defined using a simple flat file format. The file defining the SM in the figure is here. Let's look at an example:
#Finite States *=start state ^=end state
1* Gene
2^ Publication
3^ MolFunc
...
7^ Protein
8^ Gene
9 Gene
10 Protein
...
10-10 ortho 4
10-10 xref 4
10-10 genetic 6 d
...
1-10 enc
10-7 physical 6 d
10-7 genetic 6 d
...
The format is very simple:
- Every file row defines either a node or a transition, each line has different fields separated by tab characters
- Node types are numbered in a first section of the file (type names must match Ondex concept classes)
- Node numbers are used to define transitions between nodes (similarly to nodes, transition names must match Ondex relation types)
- A transition has the format:
  <node1>-<node2> <name> [limit] [d]
- limit is the max "distance" of a path that is found between the gene and <node2>. In calculating such a distance, the first gene counts 1, and every following link or node counts 1. For instance, the distance of protein1 in the path gene0 -> encodes -> protein0 -> ortho -> protein1 is 5
- The optional flag 'd' is used to define directional transitions. Note that these are usually faster to match. So, even though the flag isn't set for enc in the example above (we don't expect data where proteins encode genes), adding it would speed things up in case of performance problems.
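To make the format rules above concrete, here is a minimal Python sketch of a reader for this file format. It is not the Ondex parser: the names `State`, `Transition`, `parse_state_machine` and `path_distance` are hypothetical. It also assumes that lines starting with `#` are comments, that type and relation names contain no whitespace, and it splits on any whitespace (which covers tabs).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class State:
    number: int
    type_name: str
    is_start: bool = False
    is_end: bool = False


@dataclass
class Transition:
    source: int
    target: int
    name: str
    limit: Optional[int] = None
    directed: bool = False


def parse_state_machine(text):
    """Parse the flat-file format into (states by number, list of transitions)."""
    states, transitions = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines and '...' elisions (assumptions)
        if not line or line.startswith("#") or line == "...":
            continue
        fields = line.split()
        head = fields[0]
        if "-" in head:
            # Transition row: <node1>-<node2> <name> [limit] [d]
            source, target = (int(n) for n in head.split("-"))
            rest = fields[2:]
            limits = [int(f) for f in rest if f.isdigit()]
            limit = limits[0] if limits else None
            transitions.append(Transition(source, target, fields[1], limit, "d" in rest))
        else:
            # Node row: <number>[*|^] <type name>
            number = int(head.rstrip("*^"))
            states[number] = State(number, fields[1], head.endswith("*"), head.endswith("^"))
    return states, transitions


def path_distance(path):
    """Distance of the last node in a path: every node and every link counts 1."""
    return len(path)


SAMPLE = """#Finite States *=start state ^=end state
1*\tGene
7^\tProtein
10\tProtein
10-10\tortho\t4
1-10\tenc
10-7\tphysical\t6\td
"""

states, transitions = parse_state_machine(SAMPLE)
for transition in transitions:
    print(transition)
print(path_distance(["gene0", "encodes", "protein0", "ortho", "protein1"]))
```

The last line reproduces the distance example given above: the path has three nodes and two links, so the final protein is at distance 5.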
The SM configuration is part of the configuration required to set up a Knetminer instance, which is described in our wiki. Our pre-configured datasets are examples of it.
The details are:
- In maven-settings.xml, leave the Maven property knetminer.api.graphTraverserClass empty, i.e., don't add it to your dataset-specific settings, which will then inherit the default empty value. This corresponds to picking the default traverser class, net.sourceforge.ondex.algorithm.graphquery.GraphTraverser, which is the SM traverser (so defining it explicitly would achieve the same result). Note that the Maven property is injected into the Knetminer configuration file, which is generated from a template.
- Define the state machine for your dataset in the file <dataset>/ws/SemanticMotifs.txt, using the format explained above. This is the path set in the Knetminer configuration file. An alternative would be to place your own data-source.xml file in the dataset directory and define a different path/name. Unless you have a particular need, we don't recommend it.
- In maven-settings.xml, define the right knetminer.specieTaxId (comma-separated list of NCBITax codes), which is used to pick the genes of your species of interest. Semantic motifs are applied to these genes during the Knetminer initialisation, in order to have pre-computed data to start searches from. These are named 'seed genes'. As an alternative to using the species ID, you can define a list of seed genes explicitly, by setting the property knetminer.backend.seedGenesFile in maven-settings.xml, see this example.
The SM traverser is usually rather efficient, without having much to configure/tune. However, there are a few factors that affect its performance:
- Obviously, the more seed genes you have, the slower the Knetminer initialisation is
- Similarly, bigger state machines (in terms of total number of nodes + transitions) take more time, but the traversal is parallel and usually scales well.
- The biggest impact on performance comes from what you match. For instance, if you have self-loops of protein->xref->protein, the traverser can easily hang while trying to match long chains of cross-references, especially if there are loops in the graph. In a case like this, you should always define a low-enough limit constraint (see above).
- Both the traversers save their initialisation results in memory (to be reused during the application lifetime). If you have many paths to match, you might need more RAM (see the Docker documentation). Increasing memory can also make the initialisation stage faster, since this limits the frequency of intermediate results cleaning operations (for the geeks, the garbage collector overhead).
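The effect of a limit on self-loops can be seen in a small Python sketch. It is only an illustration under stated assumptions, not the traverser's actual algorithm: the `XREF` graph and the `follow_xrefs` function are hypothetical, distances use the counting convention described earlier (each link and each node adds 1), and the limit is assumed to be inclusive.

```python
# Toy graph with a cross-reference cycle: p1 -> p2 -> p3 -> p1
XREF = {"p1": ["p2"], "p2": ["p3"], "p3": ["p1"]}


def follow_xrefs(start, start_distance, limit):
    """Follow xref self-loops from `start`, never exceeding `limit`.

    `start_distance` is the distance already accumulated when reaching
    `start` (e.g. 3 for gene -> enc -> protein). Returns the paths found.
    """
    paths = []
    stack = [([start], start_distance)]
    while stack:
        path, distance = stack.pop()
        for target in XREF.get(path[-1], []):
            new_distance = distance + 2  # one link plus one node
            if new_distance > limit:
                continue  # without this check the cycle would never end
            new_path = path + [target]
            paths.append(new_path)
            stack.append((new_path, new_distance))
    return paths


# The first protein is reached at distance 3 (gene, enc, protein)
print(follow_xrefs("p1", start_distance=3, limit=7))
print(len(follow_xrefs("p1", start_distance=3, limit=21)))
```

In this toy graph the number of matched paths grows with the limit even though there are only three proteins, because the cycle is walked repeatedly. This is why a low limit on such self-loops keeps both the running time and the memory needed for the results under control.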