Annotations in NexSON

The original NexSON documentation is at https://github.com/OpenTreeOfLife/phylesystem-api/wiki/NexSON.

This page is a short-term scratchpad/sandbox for fleshing out the details of how annotations that are added to NexSON files during a process of commenting or validation should be structured.

1 Background

NexSON is a JSON format derived from NexML using BadgerFish conventions. NexML is intended to be used to encode phylogenetic trees, alignments, and associated information. NexSON implements several of the NexML first-class object types including a top-level study object (the nexml object), which may contain a number of other first-class objects of the types: otus, otu, trees, tree, node, and edge. For more information, see the other NexSON pages and the NexML schema.

1.1 The `meta` element

As a means to store additional metadata, the NexML schema defines a meta tag, which may be a child of any first-class NexML object. These meta tags may contain arbitrarily complex, discretionary data structures, and are intended to contain metadata annotating the first-class element to which they are attached. In NexSON, sets of meta tags are represented as a JSON array, stored under the meta key in the first-class object to which the meta tags apply. For example:

{  
    "tree": {
        "meta": [
            {
                "@property": "the meta element property type",
                "childElement": { "$": "inner text of this child element" },
                "aDifferentChild": { "$": "foo" }
            },
            {
                "@property": "a different property type",
                "arrayValue": [
                    {"$": "inner text of an 'arrayValue' child of this meta element" },
                    {"$": "another arrayValue" },
                    {"$": "and a final one" }
                ]
            }
        ]
    }
}

The content of these meta tags is unconstrained by the NexML schema. For example, phenoscape embeds data structured according to other XML schemas inside meta tags. See some example phenoscape files here.
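As a rough illustration of the layout above, the following Python sketch looks up meta entries by their "@property" value. The helper name `metas_with_property` is hypothetical, and the handling of a single meta object (rather than an array) is an assumption about how some BadgerFish converters behave, not something this page specifies.

```python
from typing import Any, Dict, List


def metas_with_property(element: Dict[str, Any], prop: str) -> List[Dict[str, Any]]:
    """Return every meta entry of `element` whose "@property" equals `prop`."""
    meta = element.get("meta", [])
    # Assumption: a lone meta child may appear as an object instead of a list.
    if isinstance(meta, dict):
        meta = [meta]
    return [m for m in meta if m.get("@property") == prop]


# A trimmed copy of the "tree" object from the example above.
tree = {
    "meta": [
        {
            "@property": "the meta element property type",
            "childElement": {"$": "inner text of this child element"},
        },
        {
            "@property": "a different property type",
            "arrayValue": [{"$": "another arrayValue"}, {"$": "and a final one"}],
        },
    ]
}

matches = metas_with_property(tree, "a different property type")
# BadgerFish stores the inner text of an element under the "$" key.
texts = [child["$"] for child in matches[0]["arrayValue"]]
```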

1.2 NexSON annotations for the Open Tree of Life Project

This document proposes a standardized model to facilitate efficient and straightforward storage and retrieval of human- and machine-generated annotation metadata regarding a NexSON study and its contained objects. The goals of this proposed model are limited to the scope of the Open Tree of Life project. Thus, no attempt is made to generalize a model suitable for all conceivable annotation purposes under the sun. Rather, the concepts are tailored to suit the activities expected to occur as part of the OToL workflow, including (but not necessarily limited to):

1.2A Example use-cases for NexSON annotations

Study curation
NexSON structural validation
Data quality assessment
Metadata persistence
Cross-purpose communication of OToL tools, including external tools intended to complement and extend OToL tools.

Extensions to this model, or other models, may be required for other purposes outside the defined scope.

1.3 Historical information

Much of the content documented here was originally discussed in the Annotations thread on the opentreeoflife-software@googlegroups.com mailing list. Where appropriate, attempts have been made to incorporate concepts from related projects addressing the formalization of annotation data, including Open Annotation, Annotation Ontology, and the W3C PROV Ontology.

2 Primary annotation objects

Three primary annotation object classes are proposed: annotationEvent, agent, and message. This three-part breakdown corresponds to the PROV data model, with its Activity, Agent, and Entity types corresponding to NexSON annotationEvent, agent, and message objects respectively. The roles of these object types are defined as follows:

2.1 `annotationEvent` class

An annotationEvent is a one-time event, during which an agent generates one or more messages related to a study or element(s) within it. Each annotationEvent should generate one or more message objects. annotationEvent objects relate message objects to associated agent objects and contain information about the event itself, including the date [more info here].

After some discussion (Jan 13, 2014 software G+ hangout), we decided to put the message objects inside their containing annotationEvent object.

2.2 `agent` class

An agent is a person or program that creates annotations, possibly acting on behalf of another agent. agent objects contain information identifying and describing real-world annotating agents, including names, URLs, information about the execution environment (for automated agents), version (for automated agents), etc. For more information, refer to the agent syntax below.

2.3 `message` class

A message is a simple data structure that provides information about a particular target object or set of objects. Messages are generalized, and contain features to accommodate diverse annotation data. For more information, refer to the message object syntax below.

3 Storage conventions

Annotation information may conceivably be stored anywhere (e.g. within single-file NexSON documents, or externally accessible via URL). For convenience and simplicity, at this time we propose storing annotations within NexSON documents themselves.

3.1 Container elements

Two top-level NexSON meta element containers are proposed to store collections of primary annotation objects. These container elements are in fact NexSON meta elements, whose @property value may be equal to "ot:annotationEvents" or "ot:agents". Corresponding annotation elements of each respective type should be stored in the appropriate meta container. (The "ot:messages" container is now deprecated, in favor of storing messages inside annotation events.)

3.2 Storage of `annotationEvent` and `agent` objects

Exactly one meta element with the property "ot:annotationEvents" and one with "ot:agents" should exist for a given study, as children of the nexml object itself. These containers should contain all of the annotationEvent and agent objects associated with message objects applied to elements within the study.
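A minimal sketch of how a tool might enforce the "exactly one container" rule is shown below. The function name `ensure_container` is hypothetical. The child keys ("annotation" and "agent") and the "@xsi:type" value follow the container tables in section 4, and the sketch assumes that the nexml object's meta entry is a list.

```python
from typing import Any, Dict

# Container property -> key holding the list of child objects (see section 4).
CONTAINER_CHILD_KEY = {
    "ot:annotationEvents": "annotation",
    "ot:agents": "agent",
}


def ensure_container(nexml: Dict[str, Any], prop: str) -> Dict[str, Any]:
    """Return the single container for `prop`, creating it if it is absent."""
    child_key = CONTAINER_CHILD_KEY[prop]  # KeyError for unsupported properties
    meta = nexml.setdefault("meta", [])    # assumption: meta is a list
    found = [m for m in meta if m.get("@property") == prop]
    if len(found) > 1:
        raise ValueError(f"expected one {prop!r} container, found {len(found)}")
    if found:
        return found[0]
    container = {"@property": prop, "@xsi:type": "nex:ResourceMeta", child_key: []}
    meta.append(container)
    return container


study = {}
agents = ensure_container(study, "ot:agents")
agents["agent"].append({"@id": "agentX", "@name": "normalize_ot_nexson.py"})
events = ensure_container(study, "ot:annotationEvents")
```

Calling `ensure_container` a second time with the same property returns the existing container rather than adding a duplicate.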

3.3 Storage and placement of `message` objects

message objects are stored inside the annotationEvent that created them.

Deprecated recommendation: The message objects themselves should be stored in "ot:messages" containers that are attached to the least inclusive NexSON element to which the information in the message applies. Thus, one "ot:messages" container may exist as a child of each annotated object (see 3.3A for a list of annotatable object types) in the study. meta containers of the "ot:messages" type should only be assigned to the following first-class NexSON objects:

3.3A NexSON elements suitable for storing "ot:messages" `meta` containers

N.B. This entire section is deprecated, in favor of storing messages inside annotation events.

nexml (the study itself)
tree
node
edge
otu

Determining the best location to attach these "ot:messages" containers may be a rather arbitrary choice in many cases, but placements facilitating ease of interpretation and semantic consistency are encouraged. By convention, the "ot:messages" element attached to top-level nexml element should contain message objects that describe information about the study itself, or about one of its associated annotationEvent or agent objects; "ot:messages" containers attached to a tree element should contain message objects specific to that tree, but message objects specific to a single node within that tree should be stored in a "ot:messages" container attached to the node itself; etc.

It may be instructive to consider a negative example: it is possible to store every message object in the "ot:messages" meta container attached to the nexml element itself, and simply use the "refersTo" field (see below) to associate message objects with the NexSON objects to which they pertain. This usage pattern is discouraged since it complicates the association of the message objects to their relevant NexSON elements. With that in mind, it is worth recognizing that there may be rare cases where it is appropriate to store all of the message objects associated with a given annotationEvent in the "ot:messages" container attached to the nexml object. For instance, when every message object refers to both tree and otu objects, or to other annotationEvent objects, then the most inclusive placement of each message is the nexml object.

JA: If we put warnings/queries/errors in the meta that is inside the top-level nexml object, then the curator application could grab the annotations, and quickly ascertain which parts of the study have problems. This will require that app to hold onto tree, node, edge and otu IDs until it has the data to instantiate objects of those types. But that does not seem too onerous.

CEH: I think we want to avoid the need for the curator app (or any other app for that matter) to download and parse the entire NexSON study and/or entire set of messages in order to find the relevant ones. I would suggest that implementing services (such as OTI) capable of returning the information based on queries (e.g. "return all warnings/queries/errors for anything in study X") would be more scalable than searching the NexSON for them on every load. In this case, the placement of the messages within the file is arbitrary. I would argue that storing them as children of the objects to which they most closely pertain makes more intuitive sense than not doing so, and that it will be easier to parse in many cases (e.g. no need to hold onto node, edge, otu, etc. ids as mentioned above). So this is my recommendation.

4 Syntax for `meta` container objects

In accordance with BadgerFish conventions (for XML compatibility), each container object in the JSON representation will contain an array of objects of the corresponding type under the defined key. Each element of these arrays corresponds to a single tag in the XML with the same type name as the array key in the BadgerFished JSON (e.g. annotationEvent, agent, or message).

4.1 annotationEvents collection

tag	legal value(s)	explanation
@property	"ot:annotationEvents"
@xsi:type	"nex:ResourceMeta"
annotation	list of `annotationEvent` elements	See details below

4.2 agents collection

tag	legal value(s)	explanation
@property	"ot:agents"
@xsi:type	"nex:ResourceMeta"
agent	list of `agent` elements	See details below

4.3 messages collection Deprecated in favor of message objects inside `annotationEvents`

tag	legal value(s)	explanation
@property	"ot:messages"
@xsi:type	"nex:ResourceMeta"
message	list of `message` elements	See details below.

5 Primary object syntax

5.1 `annotationEvent` object

tag	legal value(s)	explanation
@id	string	unique among the set of IDs used in this file (not necessarily globally unique)
@description	string	human-readable description of the type of annotation performed (e.g. "NexSON validation" or "treemachine import check")
@wasAssociatedWithAgentId	string	id of the `agent` (person or tool; see below) that created the `annotationEvent`
@dateCreated	String in ISO 8601	date that the `annotationEvent` occurred
@passedChecks	boolean	default True. False indicates that the author is a validating service (rather than just a commenting tool), and some aspect of the validation procedure failed in some serious way. The details should be in the messages.
@preserve	boolean	False by default. True serves as a flag to future invocations of the same tool (software agent), indicating that the message should be retained
otherProperty	array of `otherProperty` elements	Optional. See below for additional information
message	list of `message` elements	See details below.
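To make the table concrete, here is a hedged Python sketch that assembles an annotationEvent with its nested message objects. The function `make_annotation_event` is hypothetical, and deriving "@passedChecks" from the presence of an ERROR-level message is an assumption of this sketch; the table only says that False signals a serious validation failure.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def make_annotation_event(
    event_id: str,
    agent_id: str,
    description: str,
    messages: List[Dict[str, Any]],
    when: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build an annotationEvent dict using the tag names from the table above."""
    when = when or datetime.now(timezone.utc)
    # Assumption: any ERROR-level message counts as a failed check.
    has_error = any(m.get("@severity") == "ERROR" for m in messages)
    return {
        "@id": event_id,
        "@description": description,
        "@wasAssociatedWithAgentId": agent_id,
        "@dateCreated": when.isoformat(),  # ISO 8601 string
        "@passedChecks": not has_error,
        "@preserve": False,                # documented default
        "message": list(messages),         # messages live inside the event
    }


event = make_annotation_event(
    "anno1",
    "agentX",
    "NexSON validation",
    [{"@id": "msg1", "@severity": "WARNING", "@code": "NO_ROOT_NODE", "data": {}}],
    when=datetime(2014, 1, 13, tzinfo=timezone.utc),
)
```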

5.2 `agent` object

An Agent can be a human author or a program. (Is there a standard way of describing a software tool that we should be using here? <-- Yes, we are adapting this from the PROV model.) Here is the basic info we want:

tag	legal value(s)	explanation
@id	string	unique among the set of IDs used in this file (not necessarily globally unique)
@name	string	Name of software that produced the annotation, or authorized user (GitHub username or email)
@url	string	URL of service or page that describes the tool (blank for a human)
@description	string	human-readable description of the tool, or full name for a human
@version	string	version number string of the authoring tool (blank for a human)
invocation	object	Only applicable to automated (i.e. software) agents. `invocation` object that contains relevant info about the execution environment and operating parameters
otherProperty	array of `otherProperty` objects	Optional. See below for more information

5.2.1 `invocation` object (sub-object of `agent`)

tag	legal value(s)	explanation
commandLine	list of strings	(optional) args
method	string	GET, PUT... for web services
data	string	data parameters passed to the web-services call
checksPerformed	list of strings	list of Message Codes (see below) that the service claims to have checked for
otherProperty	array of `otherProperty` objects for additional information	Optional. See below for more information
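One plausible use of checksPerformed is to distinguish "checked and found clean" from "never checked". The sketch below illustrates that idea; the function name `code_status` and the three status strings are hypothetical and not part of this proposal.

```python
from typing import Any, Dict


def code_status(agent: Dict[str, Any], event: Dict[str, Any], code: str) -> str:
    """Classify `code` for one annotationEvent produced by `agent`."""
    performed = agent.get("invocation", {}).get("checksPerformed", [])
    flagged = any(m.get("@code") == code for m in event.get("message", []))
    if flagged:
        return "flagged"
    # No message with this code: only meaningful if the agent checked for it.
    return "clean" if code in performed else "not checked"


agent = {
    "@id": "agentX",
    "invocation": {"checksPerformed": ["NO_ROOT_NODE", "MULTIPLE_TREES"]},
}
event = {"@id": "anno1", "message": [{"@code": "NO_ROOT_NODE"}]}

statuses = {
    code: code_status(agent, event, code)
    for code in ("NO_ROOT_NODE", "MULTIPLE_TREES", "CYCLE_DETECTED")
}
```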

5.3 `message` object

tag	legal value(s)	explanation
@id	string	unique among the set of IDs used in this file (not necessarily globally unique)
@wasGeneratedById	string	Deprecated: no longer used, because message objects now occur inside the annotation event that generates them. Formerly the id of the `annotationEvent` object with which this message is associated
wasAttributedToId	string	Optional. The id of an `agent` object that this message is attributed to, which may be different from the `agent` associated with the generating `annotationEvent`. For example, "wasAttributedToId" could identify a human `agent` operating a software `agent` with which the `annotationEvent` itself may be associated.
@severity	string	one of the defined Severity values (like logger message levels; see below)
@code	string	one of the Message Codes (see below)
@humanMessageType	string	one of the Message Types (see below). Optional if the Message Code indicates that a front end should be able to generate a message from the code (see below).
@humanMessage	string	human-interpretable message (i.e. no NexSON IDs). Optional if the Message Code indicates that a front end should be able to generate a message from the code (see below).
dataAnnotation	string	Optional. More precise message for machine consumption
data	object	fields depend on the Message Code (see below)
refersTo	path object (see below)	path to the object that the message refers to (see path syntax below)
other	object	object (key to string, number, or boolean) for additional information
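The following Python sketch shows one way a consumer might sanity-check a message object against the table above and the controlled vocabularies in sections 6.3 and 6.4. Treating "@id", "@severity", and "@code" as required is an assumption of this sketch (the table marks only some fields as optional), and `check_message` is a hypothetical name.

```python
from typing import Any, Dict, List

SEVERITIES = {"ERROR", "WARNING", "INFO"}
MESSAGE_TYPES = {
    "NONE", "NOTE", "COMMENT", "REPLY",
    "EXPLANATORY_NOTE", "QUESTION", "ERRATUM",
}
# Assumption: this sketch treats these three tags as mandatory.
REQUIRED = ("@id", "@severity", "@code")


def check_message(message: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing {key}" for key in REQUIRED if key not in message]
    severity = message.get("@severity")
    if severity is not None and severity not in SEVERITIES:
        problems.append(f"unknown @severity {severity!r}")
    msg_type = message.get("@humanMessageType")
    if msg_type is not None and msg_type not in MESSAGE_TYPES:
        problems.append(f"unknown @humanMessageType {msg_type!r}")
    # A message type other than NONE implies that a human message is present.
    if msg_type not in (None, "NONE") and "@humanMessage" not in message:
        problems.append("@humanMessageType set but @humanMessage missing")
    return problems


ok = check_message({"@id": "msg1", "@severity": "WARNING", "@code": "NO_ROOT_NODE"})
bad = check_message({"@severity": "FATAL"})
```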

6 Secondary/supporting object syntax

6.1 `otherProperty` object

These objects are used to designate optional properties. They are intended to be used as a catch-all for necessary round-trip information that does not belong in any pre-defined property for a given object. This feature is intentionally restrictive to reduce complexity and increase consistency/adherence to the annotation spec.

tag	explanation
name	the name of this property
value	a value of one of the predefined value types below

6.2 Value types

Acceptable values are defined by JSON.

tag	explanation
STRING	a string wrapped in quotes.
NUMBER	a floating point or integer value.
BOOLEAN	a boolean value. Acceptable values are either of the strings "true" or "false", without the quotes.
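As a small illustration of the restriction to these three value types, the sketch below builds an otherProperty object and rejects anything else. The helper name `make_other_property` is hypothetical. Note that Python's `json` module serializes `True` as the unquoted JSON literal `true`.

```python
import json
from typing import Any, Dict


def make_other_property(name: str, value: Any) -> Dict[str, Any]:
    """Build an otherProperty object, allowing only STRING, NUMBER, BOOLEAN."""
    # bool is listed for clarity even though it is a subclass of int in Python.
    if not isinstance(value, (str, int, float, bool)):
        raise TypeError(f"unsupported value type: {type(value).__name__}")
    return {"name": name, "value": value}


flag = json.dumps(make_other_property("reviewed", True))
count = json.dumps(make_other_property("tipCount", 42))
```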

6.3 Message levels

tag	explanation
ERROR	an error. generally designates a fail condition
WARNING	a warning. designates a condition that is not encouraged but is not generally a fail condition
INFO	neither a warning nor an error

6.4 Message types

The following message types (borrowed from the Open Annotation and Annotation Ontology projects) are used to define different cases for the human-readable message (if there is one):

tag	explanation
NONE	there is no human-readable message
NOTE	a general, human-readable note
COMMENT	this suggests editorial intent
REPLY	points to another annotation (Note, Comment, or Reply)
EXPLANATORY_NOTE	by the curator?
QUESTION	specifically asks for reply or clarification
ERRATUM	identifies an error (added by curator to historical stuff? or by a reviewer?)

7 Referring to other objects from within `message` objects

Because message objects may relate to more than a single NexSON element, we define here a lightweight, NexSON-specific method of describing the paths to the objects that a message refers to.

Avoiding strict use of a JSON version of XPath means that implementations need not parse path strings or deal with unusual IDs (which are legal but could make naive parsing hard to implement).

7.1 Path syntax

Used in the refersTo field to indicate the target of the comment. It seems like we can just expand the parts of an absolute path expression (taking advantage of Ids and the fact that NexSONs are not that "deep").

tag	legal value(s)	explanation
@idref	string	ID of the object referred to. This ID will also be found in one of the subsequent fields, but duplicating it here makes it easy for an id->object map to quickly interpret this path blob
@top	"meta" "otus" or "trees"	child of the nexml element
@otusID	string	only if otus is top
@otuID	string	only if otus is top. Optional
@treesID	string	only if trees is top
@treeID	string	only if trees is top. Optional
@edgeID	string	only if trees is top and treeID is specified. Optional
@nodeID	string	only if trees is top and treeID is specified. Optional
@metaID	string	only if meta is top (useful for replies)
@annotationID	string	only if meta is top
@messageID	string	only if meta is top; NOTE that message may be "localized" anywhere in the study!
@property	string	optional. property of the element referenced by the preceding parts of the path
@inMeta	bool	The property is in the meta list of the element referenced by the preceding parts of the path

The pseudocode for processing one of these paths would be something like this (assuming that [] looks up a property or contained Id in an object):

function find_prop_in_meta(meta_list, prop) {
  // return the first meta element whose property matches prop
  for (element in meta_list) {
    if (element.property == prop) {
        return element
    }
  }
  throw InvalidPathException()
}

function resolve(nexml, path) {
  if (path.top == "meta") {
    el = nexml.meta
  } else if (path.top == "otus") {
    otus = nexml.otus[path.otusID];
    if (defined(path.otuID)) {
      el = otus[path.otuID]
    } else {
      el = otus
    }
  } else if (path.top == "trees") {
    trees = nexml.trees[path.treesID]
    if (defined(path.treeID)) {
      tree = trees[path.treeID]
      if (defined(path.nodeID)) {
        el = tree[path.nodeID]
      } else if (defined(path.edgeID)) {
        el = tree[path.edgeID]
      } else {
        el = tree
      }
    } else {
      el = trees;
    }
  } else {
    throw InvalidPathException();
  }
  // inMeta is a boolean, so test its value rather than just its presence
  if (defined(path.inMeta) && path.inMeta) {
    return find_prop_in_meta(el.meta, path.property)
  }
  if (defined(path.property)) {
    return el[path.property]
  } else {
    return el
  }
}
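The pseudocode above can be turned into a runnable Python sketch under some stated assumptions. It assumes that nexml holds "otus" and "trees" entries whose children ("otu", "tree", "node", "edge") are objects carrying an "@id", that path keys carry the "@" prefix used in the table, and that a path with "meta" as its top simply searches the study-level meta list (the @metaID, @annotationID, and @messageID fields are not handled). The names `InvalidPathError`, `as_list`, and `by_id` are hypothetical helpers.

```python
from typing import Any, Dict, List


class InvalidPathError(Exception):
    """Raised when a refersTo path does not match the study."""


def as_list(value: Any) -> List[Any]:
    # Assumption: a lone child may be stored as an object rather than a list.
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def by_id(parent: Dict[str, Any], key: str, wanted: Any) -> Dict[str, Any]:
    """Find the child stored under `key` whose "@id" equals `wanted`."""
    if wanted is not None:
        for child in as_list(parent.get(key)):
            if child.get("@id") == wanted:
                return child
    raise InvalidPathError(f"no {key} with @id {wanted!r}")


def find_prop_in_meta(meta_list: List[Dict[str, Any]], prop: Any) -> Dict[str, Any]:
    for element in meta_list:
        if element.get("@property") == prop:
            return element
    raise InvalidPathError(f"no meta with @property {prop!r}")


def resolve(nexml: Dict[str, Any], path: Dict[str, Any]) -> Any:
    top, prop = path.get("@top"), path.get("@property")
    if top == "meta":
        metas = as_list(nexml.get("meta"))
        return find_prop_in_meta(metas, prop) if prop else metas
    if top == "otus":
        el = by_id(nexml, "otus", path.get("@otusID"))
        if "@otuID" in path:
            el = by_id(el, "otu", path["@otuID"])
    elif top == "trees":
        el = by_id(nexml, "trees", path.get("@treesID"))
        if "@treeID" in path:
            el = by_id(el, "tree", path["@treeID"])
            if "@nodeID" in path:
                el = by_id(el, "node", path["@nodeID"])
            elif "@edgeID" in path:
                el = by_id(el, "edge", path["@edgeID"])
    else:
        raise InvalidPathError(f"unknown @top {top!r}")
    if path.get("@inMeta"):  # test the boolean value, not just its presence
        return find_prop_in_meta(as_list(el.get("meta")), prop)
    if prop is None:
        return el
    if prop not in el:
        raise InvalidPathError(f"no property {prop!r}")
    return el[prop]


study = {"trees": [{"@id": "trees1003", "tree": [{
    "@id": "tree1945",
    "node": [{"@id": "node503822", "meta": [{"@property": "ot:tag", "$": "x"}]}],
    "edge": [],
}]}]}
base = {"@top": "trees", "@treesID": "trees1003", "@treeID": "tree1945"}
tree = resolve(study, base)
node = resolve(study, {**base, "@nodeID": "node503822"})
tag = resolve(study, {**base, "@nodeID": "node503822",
                      "@inMeta": True, "@property": "ot:tag"})
```

Resolving by walking lists keeps the sketch short; a real consumer would more likely build an id-to-object map once and use the @idref field directly.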

8 Lists of OpenTree message codes

This is intended to be an extensible, controlled vocabulary of the types of messages that we anticipate seeing/generating. Preferably, many of the codes, along with the data blob in the message, will be rich enough to create a meaningful user interface for the message (rather than forcing the UI to simply display the messageForUser and hope that the user will know how to react to it).

8.1 Codes that we should anticipate seeing regularly during curation

code name	`data` contents	explanation
REFERENCED_ID_NOT_FOUND	{key: string, value: string}	The NexSON attribute with the name key refers to an ID, but the ID is not in the NexSON. We have about 3000 cases of this with @otu in nodes or @source in edge objects not matching.
TIP_WITHOUT_OTU	{}	`refersTo` object is a node that is a tip on the tree, but is not mapped to any OTU object. This is a NexSON error, not a failure to map to OTT. We have about 3000 cases
UNRECOGNIZED_PROPERTY_VALUE	{key: string, value: string}	a key-value pair in the `meta` array in which the key is recognized, but the value is not valid. We have about 51 cases of this in which key is "ot:branchLengthMode" and value is "ot:years" (which is deprecated, I think).
MISSING_OPTIONAL_KEY	string	the attribute is not found. Used to report lack of "ot:dataDeposit", "ot:focalClade", "ot:inGroupClade", "ot:ottolid", and "ot:studyPublication" fields. So we have about 32 thousand of these
NO_ROOT_NODE	{}	tree that is `refersTo` has no node flagged as the root. We have 12 cases
TIP_WITHOUT_OTT_ID	{}	`refersTo` is a node with an otu, but the otu has no OTT ID. (about 31 thousand cases)
MULTIPLE_TIPS_MAPPED_TO_OTT_ID	{nodes:[list of IDs]}	`refersTo` is a tree; the nodes listed are tips in the tree that map to the same OTT ID (about 31 thousand nodes)
MULTIPLE_TREES	{}	trees element is `refersTo` and it has multiple trees with no indication of which one treemachine should prefer to use
UNRECOGNIZED_TAG	string	value of an `ot:tag` meta is not understood. This is not unexpected at all (and this sort of message will probably be suppressed), but the validator does emit it currently so we can see what tags are being used.
UNVALIDATED_ANNOTATION	{key: string, value: string}	an object in the `meta` list had an unrecognized key. Not surprising (will be suppressed).
CONFLICTING_PROPERTY_VALUES	list of key-value pairs that conflict	Flags with conflicting meanings, for example the "delete me" and the "choose me" tags
NO_TREES	{}	file contains no trees that are not flagged for deletion
NON_MONOPHYLETIC_TIPS_MAPPED_TO_OTT_ID	list of lists of IDs. each sublist is a set of nodes that are monophyletic on the tree and for which all the tips have the same OTT ID	This code is more serious than MULTIPLE_TIPS_MAPPED_TO_OTT_ID because it indicates cases in which different arbitrary prunings could lead to different phylogenetic statements
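As a sketch of how a front end might generate text from a code and its data blob, the Python below maps a few of the codes above to message templates. The template wording, the `TEMPLATES` table, and the `render` function are all hypothetical; only the code names and the shapes of their data come from the table.

```python
from typing import Any, Dict

# Hypothetical display templates for a handful of codes.
TEMPLATES = {
    "REFERENCED_ID_NOT_FOUND":
        'The attribute "{key}" refers to the ID "{value}", which is not in the NexSON.',
    "MISSING_OPTIONAL_KEY": 'The optional field "{data}" is missing.',
    "NO_ROOT_NODE": "No node in this tree is flagged as the root.",
}


def render(message: Dict[str, Any]) -> str:
    """Produce display text from "@code" and "data", else fall back."""
    code = message.get("@code", "")
    template = TEMPLATES.get(code)
    if template is None:
        # Unknown code: fall back to the supplied human message, or the code.
        return message.get("@humanMessage", code)
    data = message.get("data")
    if isinstance(data, dict):
        return template.format(**data)  # object-valued data blobs
    return template.format(data=data)   # string-valued data blobs


text = render({
    "@code": "REFERENCED_ID_NOT_FOUND",
    "data": {"key": "@otu", "value": "otu9"},
})
```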

8.2 Codes emitted when POSTing NexSON (when we support it)

These indicate serious problems with the NexSON (and we can probably be unfriendly about them in terms of UI, because they'll probably be encountered by developers):

code name	`data` contents	explanation
MISSING_MANDATORY_KEY	string - key name	`refersTo` object lacks a mandatory attribute.
UNRECOGNIZED_KEY	string - key name	`refersTo` object has an attribute that is not allowed by the NexML schema
MISSING_LIST_EXPECTED	?	element (e.g. edge) that should be a list, was not
DUPLICATING_SINGLETON_KEY	string	the attribute specified was encountered more than one time, though it should have been found only once (e.g. a doi)
REPEATED_ID	string	ID found more than once
MULTIPLE_ROOT_NODES	{}	tree has more than one node marked as root
MULTIPLE_EDGES_FOR_NODES	{}	node has more than one edge to parent
CYCLE_DETECTED	{node : id string}	tree has a cycle (including the referenced node)
DISCONNECTED_GRAPH_DETECTED	{}	tree is not a connected graph
INCORRECT_ROOT_NODE_LABEL	{}	the node labelled as the root has a parent

8.3 Codes generated by the opentree web app (curation UI)

code name	`data` contents	explanation
OTU_MAPPING_HINTS	object	Object describing 'searchContext' (string) and required 'substitutions' (sub-objects)
SUPPORTING_FILE_INFO	object	Object describing 'files' (sub-objects)

9 Examples

9.1 Use cases

Presumably the curator app (see the "curator" subdir of the opentree repo) will try to render a subset of this information to curators. Specifically, the annotations could be warnings, error messages, and queries to the curators.

Some annotations could also be "extra" contributions to the study data, that need not be shown to curators. These could still be useful for users of the git repo of the studies (currently this is the treenexus repo, but that name will probably change soon).
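To illustrate the first use case, the sketch below walks the "ot:annotationEvents" container of a study and tallies messages by severity, which is roughly what a curation front end would need before deciding what to show. It assumes the container layout from sections 3 and 4 with "@"-prefixed tags; `iter_messages` and `severity_counts` are hypothetical names.

```python
from collections import Counter
from typing import Any, Dict, Iterator, Tuple


def iter_messages(nexml: Dict[str, Any]) -> Iterator[Tuple[Dict[str, Any], Dict[str, Any]]]:
    """Yield (annotationEvent, message) pairs stored in the study."""
    for meta in nexml.get("meta", []):
        if meta.get("@property") != "ot:annotationEvents":
            continue
        for event in meta.get("annotation", []):
            for message in event.get("message", []):
                yield event, message


def severity_counts(nexml: Dict[str, Any]) -> Counter:
    # Messages without a severity are counted under the key None.
    return Counter(message.get("@severity") for _, message in iter_messages(nexml))


study = {"meta": [
    {"@property": "ot:agents", "agent": [{"@id": "agentX"}]},
    {"@property": "ot:annotationEvents", "annotation": [
        {"@id": "anno1", "message": [
            {"@id": "m1", "@severity": "WARNING", "@code": "NO_ROOT_NODE"},
            {"@id": "m2", "@severity": "ERROR", "@code": "REPEATED_ID"},
            {"@id": "m3", "@severity": "WARNING", "@code": "MULTIPLE_TREES"},
        ]},
    ]},
]}

counts = severity_counts(study)
errors = [m["@id"] for _, m in iter_messages(study) if m.get("@severity") == "ERROR"]
```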

9.2 An example annotation object representation

Taken from study 1003:

{
    "id": "anno1",
    "description": "Open Tree NexSON validation", 
    "agent": "agentX",
    "checksPassed": false
}

9.3 An example agent object representation

{
    "id": "agentX",
    "description": "validator of NexSON constraints as well as constraints that would allow a study to be imported into the Open Tree of Life's phylogenetic synthesis tools", 
    "invocation": {
        "checksPerformed": [
            "MISSING_MANDATORY_KEY", 
            "MISSING_OPTIONAL_KEY", 
            "UNRECOGNIZED_KEY", 
            "MISSING_LIST_EXPECTED", 
            "DUPLICATING_SINGLETON_KEY", 
            "REFERENCED_ID_NOT_FOUND", 
            "REPEATED_ID", 
            "MULTIPLE_ROOT_NODES", 
            "NO_ROOT_NODE", 
            "MULTIPLE_EDGES_FOR_NODES", 
            "CYCLE_DETECTED", 
            "DISCONNECTED_GRAPH_DETECTED", 
            "INCORRECT_ROOT_NODE_LABEL", 
            "TIP_WITHOUT_OTU", 
            "TIP_WITHOUT_OTT_ID", 
            "MULTIPLE_TIPS_MAPPED_TO_OTT_ID", 
            "NON_MONOPHYLETIC_TIPS_MAPPED_TO_OTT_ID", 
            "INVALID_PROPERTY_VALUE", 
            "PROPERTY_VALUE_NOT_USEFUL", 
            "UNRECOGNIZED_PROPERTY_VALUE", 
            "MULTIPLE_TREES", 
            "UNRECOGNIZED_TAG", 
            "CONFLICTING_PROPERTY_VALUES", 
            "NO_TREES"
        ], 
        "commandLine": [
            "--validate"
        ]
    }, 
    "name": "normalize_ot_nexson.py", 
    "url": "https://github.com/OpenTreeOfLife/api.opentreeoflife.org", 
    "version": "0.0.1a"
}

9.4 An example message object representation

{
    "parentAnnotationId": "anno1",
    "code": "NON_MONOPHYLETIC_TIPS_MAPPED_TO_OTT_ID",
    "comment": "Multiple nodes that do not form the tips of a clade are mapped to the OTT ID \"210453\". The clades are \"node503822\" +++ \"node503824\" +++ \"node503827\" +++ \"node503832\" in \"tree(id=tree1945)\"\n", 
    "data": {
        "nodes": [
            [
                "node503822"
            ], 
            [
                "node503824"
            ], 
            [
                "node503827"
            ], 
            [
                "node503832"
            ]
        ]
    }, 
    "preserve": false, 
    "refersTo": {
        "inMeta": false, 
        "top": "trees", 
        "treeID": "tree1945", 
        "treesID": "trees1003"
    }, 
    "severity": "WARNING"
}