Skip to content

Commit

Permalink
Merge pull request #28 from TheJacksonLaboratory/ic_calculation_and_t…
Browse files Browse the repository at this point in the history
…raversal

Improve IC calculation and graph traversals
  • Loading branch information
ielis authored May 26, 2023
2 parents 161ee46 + 846eb42 commit 01ba82c
Show file tree
Hide file tree
Showing 8 changed files with 231 additions and 50 deletions.
51 changes: 43 additions & 8 deletions notebooks/Tutorial.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -83,7 +83,7 @@
"id": "5278e1d4-506b-4136-a90c-e45715534914",
"metadata": {},
"source": [
"We start by importing the relevant parts and use `load_ontolgy` to load the latest HPO in the Obographs format. We get an instance of `Ontology` which is a container for ontology data.\n",
"We start by importing the relevant parts and use `load_ontology` to load the latest HPO in the Obographs format. We get an instance of `Ontology` which is a container for ontology data.\n",
"\n",
"Let's see how HPO toolkit models the ontology data."
]
Expand Down Expand Up @@ -317,11 +317,14 @@
"source": [
"# Ontology algorithms\n",
"\n",
"The `hpotk.algorithm` module provides several ontology algorithms.\n",
"The `hpo-toolkit` module provides several ontology algorithms.\n",
"\n",
"## Ontology traversal\n",
"\n",
"Use `get_parents` and `get_ancestors` to get a `set` with *parents* or *ancestors* of a `TermId`:"
"> As of `v0.1.6`, the ontology traversal should not be performed using methods from the `hpotk.algorithm` module.\n",
"> Use the methods defined on the ontology `graph` instead.\n",
"\n",
"Use `get_parents` and `get_ancestors` of the ontology `graph` to get an iterable with *parents* or *ancestors* of a `TermId`:"
]
},
{
Expand Down Expand Up @@ -358,15 +361,15 @@
}
],
"source": [
"import hpotk\n",
"arachnodactyly_id = arachnodactyly.identifier\n",
"\n",
"print('#' * 20 + ' Parents ' + '#' * 20)\n",
"for parent in hpotk.algorithm.get_parents(o, current_arachnodactyly_id):\n",
"for parent in o.graph.get_parents(arachnodactyly_id):\n",
" term = o.get_term(parent)\n",
" print(f\"{term.identifier.value} - {term.name}\")\n",
"\n",
"print('\\n'+'#' * 20 + ' Ancestors ' + '#' * 18)\n",
"for parent in hpotk.algorithm.get_ancestors(o, current_arachnodactyly_id):\n",
"for parent in o.graph.get_ancestors(arachnodactyly_id):\n",
" term = o.get_term(parent)\n",
" print(f\"{term.identifier.value} - {term.name}\")"
]
Expand Down Expand Up @@ -405,12 +408,12 @@
],
"source": [
"print('#' * 20 + ' Children ' + '#' * 20)\n",
"for child in hpotk.algorithm.get_children(o, 'HP:0100807'): # Long fingers\n",
"for child in o.graph.get_children(TermId.from_curie('HP:0100807')): # Long fingers\n",
" term = o.get_term(child)\n",
" print(f\"{term.identifier.value} - {term.name}\")\n",
" \n",
"print('\\n'+'#' * 20 + ' Descendants ' + '#' * 17)\n",
"for descendant in hpotk.algorithm.get_descendants(o, 'HP:0006109'): # Absent phalangeal crease \n",
"for descendant in o.graph.get_descendants(TermId.from_curie('HP:0006109')): # Absent phalangeal crease \n",
" term = o.get_term(descendant)\n",
" print(f\"{term.identifier.value} - {term.name}\")"
]
Expand Down Expand Up @@ -1092,11 +1095,43 @@
"By default, the `base` parameter of the `calculate_ic_for_annotated_items` is set to $e$ (Euler's number), returning the IC values in [nats](https://en.wikipedia.org/wiki/Nat_(unit)). Set `base=2` to get the IC in bits."
]
},
{
"cell_type": "markdown",
"id": "caab19bd-c09a-4e2f-84d3-402f3ee40635",
"metadata": {},
"source": [
"### Information content of an ontology module\n",
"\n",
"We can calculate IC for an ontology module - a set of descendants of an ontology term. In case of HPO, it makes sense to analyze \n",
"descendants of [Phenotypic abnormality](https://hpo.jax.org/app/browse/term/HP:0000118):\n",
"\n",
"```python\n",
"from hpotk.constants.hpo.base import PHENOTYPIC_ABNORMALITY\n",
"term_id2ic = calculate_ic_for_annotated_items(diseases, o, module_root=PHENOTYPIC_ABNORMALITY)\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "c4ac12fb-7057-4f5b-b907-ee74f808b2c9",
"metadata": {},
"source": [
"### Use pseudocount for the very rare terms\n",
"\n",
"By default the IC is calculated only for ontology terms that are used at least once to annotate the annotated items (e.g. diseases). However, we may want to use \"pseudocount\" technique and set `count=1` for the very rare terms, resulting in a large IC value:\n",
"\n",
"```python\n",
"term_id2ic = calculate_ic_for_annotated_items(diseases, o, use_pseudocount=True)\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "255a4fbf-f1e9-40e8-8efa-6b91724968a5",
"metadata": {},
"source": [
"### Metadata\n",
"\n",
"The versions of the items and ontology used to calculate the ICs are preserved in the `metadata` attribute of the `AnnotationIcContainer`:"
]
},
Expand Down
1 change: 1 addition & 0 deletions src/hpotk/algorithm/__init__.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
# TODO[v1.0.0] - remove the deprecated methods
from ._traversal import get_ancestors, get_parents
from ._traversal import get_children, get_descendents, get_descendants
from ._traversal import exists_path
Expand Down
12 changes: 12 additions & 0 deletions src/hpotk/algorithm/_traversal.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,9 @@ def get_ancestors(g: typing.Union[GraphAware, OntologyGraph],
:param include_source: whether to include the `source` term in the resulting set
:return: a `frozenset` with ancestor :class:`TermId`\\ (s).
"""
# TODO[v1.0.0] - remove the deprecated method
warn('The method is deprecated and will be removed in v1.0.0. Use `get_ancestors` of the graph instead',
DeprecationWarning, stacklevel=2)
# Check
g = _check_ontology_graph_is_available(g)
source = _check_curie_or_term_id(source)
Expand Down Expand Up @@ -45,6 +48,9 @@ def get_parents(g: typing.Union[GraphAware, OntologyGraph],
:param include_source: whether to include the `source` term ID(s) in the results
:return: a :class:`frozenset` with parent `TermId`s
"""
# TODO[v1.0.0] - remove the deprecated method
warn('The method is deprecated and will be removed in v1.0.0. Use `get_parents` of the graph instead',
DeprecationWarning, stacklevel=2)
# Check
g = _check_ontology_graph_is_available(g)
source = _check_curie_or_term_id(source)
Expand Down Expand Up @@ -72,6 +78,9 @@ def get_descendants(g: typing.Union[GraphAware, OntologyGraph],
:param include_source: whether to include the `source` term ID(s) in the results
:return: a :class:`frozenset` with descendants `TermId`s
"""
# TODO[v1.0.0] - remove the deprecated method
warn('The method is deprecated and will be removed in v1.0.0. Use `get_descendants` of the graph instead',
DeprecationWarning, stacklevel=2)
# Check
g = _check_ontology_graph_is_available(g)
source = _check_curie_or_term_id(source)
Expand Down Expand Up @@ -118,6 +127,9 @@ def get_children(g: typing.Union[GraphAware, OntologyGraph],
:param include_source: whether to include the `source` term in the results
:return: an iterable with child `TermId`s
"""
# TODO[v1.0.0] - remove the deprecated method
warn('The method is deprecated and will be removed in v1.0.0. Use `get_children` of the graph instead',
DeprecationWarning, stacklevel=2)
# Check
g = _check_ontology_graph_is_available(g)
source = _check_curie_or_term_id(source)
Expand Down
49 changes: 40 additions & 9 deletions src/hpotk/algorithm/similarity/_ic.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,43 +2,74 @@
import typing
from collections import Counter

from hpotk.model import TermId
from hpotk.annotations import AnnotatedItemContainer
from hpotk.ontology import MinimalOntology
from hpotk.util import validate_instance
from ._model import SimpleAnnotationIcContainer, AnnotationIcContainer
from .._augment import augment_with_ancestors


def calculate_ic_for_annotated_items(items: AnnotatedItemContainer,
ontology: MinimalOntology,
base: typing.Optional[float] = None) -> AnnotationIcContainer:
base: typing.Optional[float] = None,
module_root: typing.Optional[TermId] = None,
use_pseudocount: bool = False) -> AnnotationIcContainer:
"""
Calculate information content (IC) for each :class:`TermId` based on a collection of annotated `items`.
The calculation can be done for an ontology module - only the descendants of the provided `module_root`
will be included in the analysis. If `assume_annotated` is `True`, then the count of all ontology/module terms
is set to at least 1, even for those terms that do not annotate the `items`.
:param items: a collection of :class:`hpotk.annotations.AnnotatedItem`\ s
:param ontology: ontology with concepts used to annotate the `items`
:param base: information content base or `None` for *e*
(produces IC in `nats <https://en.wikipedia.org/wiki/Nat_(unit)>`_)
:param module_root: the root of the ontology module to calculate the IC for.
:param use_pseudocount: assume that each ontology term annotates at least one of the `items`.
:return: a container with mappings from :class:`TermId` to information content in nats, bits, or else,
depending on the `base` value
"""
ontology = validate_instance(ontology, MinimalOntology, 'ontology')

graph = ontology.graph
root = graph.root

hit_count = Counter()
term_id_count: Counter[TermId] = Counter()
module_term_ids: typing.Optional[typing.Set[TermId]] = None \
if module_root is None \
else set(graph.get_descendants(module_root, include_source=True))

for item in items:
for annotation in item.annotations:
if annotation.is_present:
for ancestor in augment_with_ancestors(graph, annotation.identifier, include_source=True):
hit_count[ancestor] += 1
if module_root is not None and annotation.identifier not in module_term_ids:
# annotation is not from the target module.
continue

for ancestor in graph.get_ancestors(annotation.identifier, include_source=True):
if module_term_ids is None:
# Not doing module
term_id_count[ancestor] += 1
elif ancestor in module_term_ids:
# Doing module and the ancestor is from the module
term_id_count[ancestor] += 1

if use_pseudocount:
# Set the count of all primary term IDs to at least one but DO NOT increment the count of the ancestor
# that already count>=1 .
# Note, in the HPO case, this will set count of non-phenotypic abnormalities (e.g. Clinical modifier)
# to 1 as well.
corpus = map(lambda t: t.identifier, ontology.terms) \
if module_root is None \
else module_term_ids
for term_id in corpus:
if term_id not in term_id_count:
term_id_count[term_id] = 1

log_func = math.log if base is None else lambda c: math.log(c, base)

population_count = hit_count[root]
population_count = term_id_count[graph.root] if module_root is None else term_id_count[module_root]

data = {term_id: -log_func(count / population_count) for term_id, count in hit_count.items()}
data = {term_id: -log_func(count / population_count) for term_id, count in term_id_count.items()}
metadata = {'annotated_items_version': items.version, 'ontology_version': ontology.version}
return SimpleAnnotationIcContainer(data, metadata=metadata)
8 changes: 4 additions & 4 deletions src/hpotk/graph/_api.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,28 +27,28 @@ def root(self) -> NODE:
pass

@abc.abstractmethod
def get_children(self, source: NODE) -> typing.Iterable[NODE]:
def get_children(self, source: NODE, include_source: bool = False) -> typing.Iterable[NODE]:
"""
Get an iterable with the children of the `source` node.
"""
pass

@abc.abstractmethod
def get_descendants(self, source: NODE) -> typing.Iterable[NODE]:
def get_descendants(self, source: NODE, include_source: bool = False) -> typing.Iterable[NODE]:
"""
Get an iterable with the descendants of the `source` node.
"""
pass

@abc.abstractmethod
def get_parents(self, source: NODE) -> typing.Iterable[NODE]:
def get_parents(self, source: NODE, include_source: bool = False) -> typing.Iterable[NODE]:
"""
Get an iterable with the parents of the `source` node.
"""
pass

@abc.abstractmethod
def get_ancestors(self, source: NODE) -> typing.Iterable[NODE]:
def get_ancestors(self, source: NODE, include_source: bool = False) -> typing.Iterable[NODE]:
"""
Get an iterable with the ancestors of the `source` node.
"""
Expand Down
45 changes: 31 additions & 14 deletions src/hpotk/graph/_csr_graph.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,30 +27,38 @@ def __init__(self, root: NODE,
def root(self) -> NODE:
return self._root

def get_children(self, source: NODE) -> typing.Iterable[NODE]:
def get_children(self, source: NODE, include_source: bool = False) -> typing.Iterable[NODE]:
# In other words, find a row in the CSR corresponding to the `source`
# and retrieve the columns to which the `source` is the PARENT.
return map(self._get_node_for_idx,
self._get_node_indices_with_relationship(source, BaseCsrOntologyGraph.PARENT_RELATIONSHIP_CODE))
self._get_node_indices_with_relationship(source,
BaseCsrOntologyGraph.PARENT_RELATIONSHIP_CODE,
include_source))

def get_descendants(self, source: NODE) -> typing.Iterable[NODE]:
def get_descendants(self, source: NODE, include_source: bool = False) -> typing.Iterable[NODE]:
# See `self.get_children()` for explanation of `BaseCsrOntologyGraph.PARENT_RELATIONSHIP_CODE`.
return map(self._get_node_for_idx,
self._traverse_graph(source, BaseCsrOntologyGraph.PARENT_RELATIONSHIP_CODE))
self._traverse_graph(source,
BaseCsrOntologyGraph.PARENT_RELATIONSHIP_CODE,
include_source))

def get_parents(self, source: NODE) -> typing.Iterable[NODE]:
def get_parents(self, source: NODE, include_source: bool = False) -> typing.Iterable[NODE]:
# In other words, find a row in the CSR corresponding to the `source`
# and retrieve the columns to which the `source` is the CHILD.
return map(self._get_node_for_idx,
self._get_node_indices_with_relationship(source, BaseCsrOntologyGraph.CHILD_RELATIONSHIP_CODE))
self._get_node_indices_with_relationship(source,
BaseCsrOntologyGraph.CHILD_RELATIONSHIP_CODE,
include_source))

def get_ancestors(self, source: NODE) -> typing.Iterable[NODE]:
def get_ancestors(self, source: NODE, include_source: bool = False) -> typing.Iterable[NODE]:
# See `self.get_parents()` for explanation of `BaseCsrOntologyGraph.CHILD_RELATIONSHIP_CODE`.
return map(self._get_node_for_idx,
self._traverse_graph(source, BaseCsrOntologyGraph.CHILD_RELATIONSHIP_CODE))
self._traverse_graph(source,
BaseCsrOntologyGraph.CHILD_RELATIONSHIP_CODE,
include_source))

def is_leaf(self, node: NODE) -> bool:
for _ in self._get_node_indices_with_relationship(node, BaseCsrOntologyGraph.PARENT_RELATIONSHIP_CODE):
for _ in self._get_node_indices_with_relationship(node, BaseCsrOntologyGraph.PARENT_RELATIONSHIP_CODE, False):
return False
return True

Expand All @@ -60,12 +68,14 @@ def __contains__(self, item: NODE) -> bool:
def __iter__(self) -> typing.Iterator[NODE]:
return iter(self._nodes)

def _traverse_graph(self, source: NODE, relationship) -> typing.Generator[int, None, None]:
def _traverse_graph(self, source: NODE,
relationship,
include_source: bool) -> typing.Generator[int, None, None]:
seen: set[int] = set()
buffer: typing.Deque[int] = deque()

# Init
for idx in self._get_node_indices_with_relationship(source, relationship):
for idx in self._get_node_indices_with_relationship(source, relationship, include_source):
seen.add(idx)
buffer.append(idx)

Expand All @@ -79,11 +89,18 @@ def _traverse_graph(self, source: NODE, relationship) -> typing.Generator[int, N

yield current

def _get_node_indices_with_relationship(self, source: NODE, relationship) -> typing.Generator[int, None, None]:
def _get_node_indices_with_relationship(self, source: NODE,
relationship,
include_source: bool) -> typing.Generator[int, None, None]:
row_idx = self._get_idx_for_node(source)
return self._get_cols_with_relationship(row_idx, relationship)
if include_source:
yield row_idx

def _get_cols_with_relationship(self, idx: typing.Optional[int], relationship) -> typing.Generator[int, None, None]:
for idx in self._get_cols_with_relationship(row_idx, relationship):
yield idx

def _get_cols_with_relationship(self, idx: typing.Optional[int],
relationship) -> typing.Generator[int, None, None]:
if idx is None:
return
col_indices = self._adjacency_matrix.col_indices_of_val(idx, relationship)
Expand Down
Loading

0 comments on commit 01ba82c

Please sign in to comment.