An important first step in developing the openDS specification is to position open digital specimens as a data structure in relation to other important structures, standards and initiatives relevant to the domain. This firmly establishes openDS in the context of several important areas of Internet technology and of the functional capabilities of modern information and communication technology (ICT) systems, together with the practices and procedures these support. This is essential for ease of implementation, interoperability and future capabilities. Important topics addressed in the sections below include positioning openDS in relation to:
- Provenance (origin, history), attribution of work and citation
- Knowledge representation, logical reasoning and inference i.e., ontologies
- Integration with information representations in the wider cultural heritage sector
- Digital Object Architecture, Internet of FAIR Data and Services, European Open Science Cloud
- Semantic Web and Linked Data
- Compatibility with TDWG and other domain-specific specifications
- Global relevance
Positioning openDS in relation to provenance and to knowledge representation reveals how the two main components of openDS, i.e., the openDS data model and the Ontology for open Digital Specimens (ODS), relate to each other.
Accurate recording and tracing of the provenance (origin, history, what was done, when and by whom) of digital specimens, and the ability to attribute (credit) persons and organisations for the performance of such work, are essential for many reasons. In common with all heritage objects, people want to know where specimens came from, when they were gathered and by whom, and what has happened to them subsequently. Work on describing and determining (identifying) a specimen must be recorded. Scientists want to know how such determinations have changed in the light of new knowledge over the years. They need to cite specimens as evidence for their conclusions in analytical work. They want to gain recognition for work done; for example, in identifying specimens and discovering new species, for digitization, for derivation of other data and in curatorial work. Organisations want to be able to show the contribution of their collections and scientific work/education to the advancement of knowledge, to demonstrate the positive value to science in general, and to show impact on the economy and society. Funding agencies want to know where and how their money is being spent. In particular, the curatorial work performed in and on collections is under-recognised, as are the contributions of research software engineers and, increasingly, the creative work of data scientists.
In the era of digital data and Web (software) applications, the W3C PROV model (overview) is a leading approach for capturing and representing provenance data as work is performed. PROV (primer) defines a core data model for building representations of the entities, people and processes involved in producing a piece of data or thing in the world i.e., provenance. To fill the gap in creating an attributable record of work involving collections heritage and other objects used for research, especially in the curation and maintenance of such collections, the Research Data Alliance (RDA) has published an RDA Recommendation on 'Attribution Metadata'. This Recommendation provides mechanisms to say what activity generated some outcome (such as a determination of species); who that activity was associated with (i.e., who performed it) and in what capacity (role) they performed the activity and why. A technical document accompanying the formal recommendation provides detailed explanation and examples.
PROV's fundamental 'Entity/Activity/Agent' model and its properties, as applied by the 'Attribution Metadata' recommendation, act in openDS as the top-level primary object model. open Digital Specimens are a subclass of class prov:Entity. This is explained in the introduction to the openDS data model.
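The Entity/Activity/Agent pattern described above can be illustrated with a minimal Python sketch. It is not the openDS data model: the class `DigitalSpecimen`, the `Association` helper, the `credits` function, the role label and all identifiers are hypothetical names chosen for illustration. Only the three concepts Entity, Activity and Agent, and the idea of recording the role in which an agent performed an activity, come from PROV and the 'Attribution Metadata' recommendation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Agent:
    """Counterpart of prov:Agent: a person or organisation that did some work."""
    identifier: str
    name: str


@dataclass(frozen=True)
class Association:
    """Links an activity to an agent and records the role the agent played."""
    agent: Agent
    role: str  # hypothetical role label, e.g. "digitizer"


@dataclass
class Activity:
    """Counterpart of prov:Activity: work performed at some point in time."""
    identifier: str
    description: str
    ended_at: datetime
    associations: list[Association] = field(default_factory=list)


@dataclass
class Entity:
    """Counterpart of prov:Entity: the thing whose provenance is recorded."""
    identifier: str
    was_generated_by: Activity | None = None


@dataclass
class DigitalSpecimen(Entity):
    """Hypothetical stand-in for an open Digital Specimen, a kind of Entity."""
    scientific_name: str = ""


def credits(entity: Entity) -> list[tuple[str, str]]:
    """Return (agent name, role) pairs for the activity that generated the entity."""
    if entity.was_generated_by is None:
        return []
    return [(a.agent.name, a.role) for a in entity.was_generated_by.associations]


# Example data; every identifier and name below is invented.
curator = Agent("https://example.org/agent/1", "A. Curator")
digitization = Activity(
    "https://example.org/activity/1",
    "Imaging and transcription of a specimen label",
    datetime(2021, 3, 1, tzinfo=timezone.utc),
    [Association(curator, "digitizer")],
)
specimen = DigitalSpecimen("https://example.org/ds/1", digitization, "Bellis perennis")
print(credits(specimen))
```

Modelling the digital specimen as a subclass of the entity type means that any code written against `Entity`, such as `credits`, works unchanged for specimens, which mirrors the subclass relationship to prov:Entity stated above.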
Support for future processing of Digital Specimen content by machines (e.g., by inference, classification, etc.) as well as by humans is essential. This means that from the beginning open Digital Specimens must be machine-readable representations that allow computers/software to ‘understand’ the data being processed. This is to be achieved by associating data elements (classes, attributes) with specific terms in vocabularies and with concepts in ontologies, especially relevant ontologies in the biomedical domain. Note that being machine-readable does not necessarily equate to being machine-actionable, which has additional requirements.
The most important ontologies relevant to the subject domain are several of those in the suite of interoperable reference ontologies gathered under the banner of the OBO Foundry. The Open Biological and Biomedical Ontologies (OBO), of which there are presently more than one hundred, offer a structured reference for terms used in different research fields and their interconnections across the biological and medical domains, with the aim of improving data integration and logical reasoning across the life sciences.
All OBO ontologies stem from the Basic Formal Ontology (BFO), a true upper-level ontology designed for supporting information retrieval, analysis and integration in scientific and other domains. The Relation Ontology (RO) standardizes many relationship types and ensures that different ontologies across the OBO family can work together seamlessly when these relations are expressed in a standard manner. For the purposes of openDS and situating the notion of a Digital Specimen in the biomedical domain, there are three important domain ontologies from the OBO family to consider:
- Ontology for Biomedical Investigations (OBI) covers the description of biological and clinical investigations, providing a model for the design of an investigation, the protocols and instrumentation used, the materials used, the data generated, and the type of analysis performed on it.
- Biological Collections Ontology (BCO) is an ontology for biodiversity data and is highly relevant as it is the place where physical specimens are defined as a concept.
- Information Artifact Ontology (IAO) is an ontology of information entities and is relevant because it allows us to situate digital specimens into the digital realm.
There are many other ontologies in the OBO Foundry suite that can have relevance for specific elements of the digital specimen idea. The Biological Imaging Methods Ontology (FBBI), for instance, is an ontology for methods in biomedical imaging. The Environment Ontology (ENVO) is an ontology of environmental features and habitats. The Flora Phenotype Ontology (FLOPO) is an ontology for traits and phenotypes of flowering plants. Depending on extensions and the inclusion of different third-party data types, these and other OBO ontologies can become relevant in future releases of ODS.
Nevertheless, there are terms, concepts and relations needed to bring openDS to fruition that don’t exist as definitions within the OBO Foundry suite at all. Many of these might be found in existing domain-specific standards such as ABCD 3.0, the EFG extension for geosciences and Darwin Core (DwC). Many others, though, are not yet formally defined.
The way that openDS semantically relates to and roots its origins in the important ontologies and vocabularies explained above (OBI, BCO, IAO, ABCD/EFG, DwC) is explained in the introduction to the Ontology for open Digital Specimens (ODS). ODS is the ontological representation of the openDS model. The ODS will be defined alongside the data model as a community/domain-specific ontology extending existing concepts and defining new terms, concepts and relations. This latter aspect, the relations, is especially important for building controlled links between a specimen and other specimens, and between a specimen and the data derived from or about that specimen, whether directly related to it or indirectly relevant to it. In time, such a graph of connections can grow to become an important tool for further exploration, analysis and discovery.
Whilst not presently foreseen to become part of the OBO Foundry suite, ODS development will nevertheless follow OBO Foundry best practices; specifically, adhering to the Principles to achieve a logically well-formed and scientifically accurate ontology.
Here we are concerned with integration of the openDS model with information representations used in the wider cultural heritage sector.
As an initiative of the International Council of Museums, the CIDOC Conceptual Reference Model (CRM), also published as ISO 21127:2014, is an approach for information integration in the field of cultural heritage. When implemented by information systems, CIDOC CRM helps researchers, administrators and the public explore complex questions with regard to humankind's past across diverse and dispersed datasets. The CIDOC CRM achieves this by providing definitions and a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation and of general interest for the querying and exploration of such data. This formal description allows the integration of data from multiple sources in a software- and schema-agnostic fashion. Developed over twenty years, the model is widely used around the world. Thus, it is important to understand how open Digital Specimens, as digital expressions of natural science objects abundant in museums around the world, fit into the CRM.
ISO 21127:2014, reconfirmed as a valid International Standard in April 2020, is based on version 5.0.4 of the CRM. The current version (7.0.1) was published in October 2020. In the comparison that follows, references to class identifiers of the form 'Enn' refer to class definitions in this current version. Class names are capitalized.
As an identifiable immaterial item with an objectively recognizable (defined) structure, documented as a single identifiable unit, a Digital Specimen object structured in accordance with the openDS specification corresponds to an instance of the E73 Information Object class in the CIDOC CRM. In contrast, a physical specimen is an instance of the class E19 Physical Object (or one of its subclasses, such as E20 Biological Object). Both are instances of Persistent Item (class E77), and both E73 and E19 are eventual subclasses of E72 Legal Object, meaning that both a digital specimen and a physical specimen can have Rights (class E30) attached to them.
Interestingly, a digital specimen can also be considered (by being an instance of class E73) as being an instance of a Human-Made Thing (class E71); whereas a physical specimen is a Human-Made Thing only in specific cases: for example, when a photograph, microscope slide or audio-/video-recording is treated by an institution as a specimen for curation purposes.
Class E78 Curated Holding comprises aggregations of physical things that are curated and preserved together, i.e., as a collection. Using the example of digital libraries, the class definition states that aggregations of electronic content should be treated as instances of this class because such aggregation requires keeping ‘physical carriers’ (i.e., computer servers and storage) of the electronic content. Thus, a curated collection of specimens, whether physical or digital, corresponds to an instance of this class. Accordingly, the openDS Digital Collection object, a subclass of the Object Group class in the TDWG CD (Collection Description) Standard, corresponds to class E78.
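The class relationships used in the comparison above can be checked mechanically with a small Python sketch. The superclass table below is a hand transcription of a handful of CIDOC CRM classes and should be treated as an assumption to be verified against the published CRM definition; the names `SUPERCLASSES` and `ancestors` are illustrative only and are not part of openDS or the CRM.

```python
# Direct superclasses of selected CIDOC CRM classes.
# Assumption: transcribed by hand for illustration; verify against the
# published CRM definition before relying on it. Classes above E77 are omitted.
SUPERCLASSES = {
    "E77 Persistent Item": [],
    "E70 Thing": ["E77 Persistent Item"],
    "E72 Legal Object": ["E70 Thing"],
    "E71 Human-Made Thing": ["E70 Thing"],
    "E18 Physical Thing": ["E72 Legal Object"],
    "E19 Physical Object": ["E18 Physical Thing"],
    "E20 Biological Object": ["E19 Physical Object"],
    "E24 Physical Human-Made Thing": ["E18 Physical Thing", "E71 Human-Made Thing"],
    "E78 Curated Holding": ["E24 Physical Human-Made Thing"],
    "E28 Conceptual Object": ["E71 Human-Made Thing"],
    "E89 Propositional Object": ["E28 Conceptual Object"],
    "E90 Symbolic Object": ["E28 Conceptual Object", "E72 Legal Object"],
    "E73 Information Object": ["E89 Propositional Object", "E90 Symbolic Object"],
}


def ancestors(cls: str) -> set[str]:
    """Return every direct and indirect superclass of cls listed in the table."""
    seen: set[str] = set()
    stack = list(SUPERCLASSES[cls])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(SUPERCLASSES[parent])
    return seen


digital = ancestors("E73 Information Object")
physical = ancestors("E19 Physical Object")

# Classes shared by digital and physical specimens.
print(sorted(digital & physical))
# Classes that apply to a digital specimen but not to a physical one in general.
print(sorted(digital - physical))
```

Under the transcribed table, E72 Legal Object appears among the ancestors of both E73 and E19, while E71 Human-Made Thing appears only among the ancestors of E73, which matches the distinction drawn in the text.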
There are two extensions to the CIDOC core model that might later become relevant and useful:
- CRMsci – Scientific Observation Model for integrating metadata about scientific observation, measurements and processed data may be relevant; specifically, class S13 Sample. A physical specimen, being an individual member of a sample (a subset taken from a larger population or mass), might be a subclass of this.
- CRMinf – Argumentation Model for integrating metadata about argumentation and inference making. Argumentation isn’t an aspect that we’ve thought too much about capturing yet, except to the extent that we’ve discussed the notion of attaching multiple annotations and interpretations to digital specimens. The CRMinf argumentation model supports the activity of making honest inferences or observations about e.g., an object. An honest inference or observation is one in which the actor carrying out the argumentation justifies and believes that a specific belief (view, opinion, conclusion, argument) is the correct one at the time the activity was carried out and that any logic or method employed to arrive at that belief was correctly applied.
It's worth noting that the CIDOC CRM core has interesting models for space/time concepts and for relations between objects that bear further looking at for the openDS case.
We have already considered much relating to these topics, including the positioning of a Digital Specimen as a kind of FAIR Digital Object. openDS and its manifestation are a direct consequence of those considerations. Relevant further reading includes:
- A conceptual design blueprint for the DiSSCo digitization infrastructure, which is deliverable D8.1 out of the EC-funded project (2018-2020) on innovation and consolidation for large scale digitisation of natural heritage (ICEDIG). ICEDIG aimed at supporting the coming implementation phase of the new DiSSCo Research Infrastructure by designing and addressing the technical, financial, policy and governance aspects necessary to operate such a large distributed initiative for natural sciences collections across Europe.
- A paper, "FAIR data and services in biodiversity science and geoscience", examines the intersection of the FAIR Guiding Principles (Findable, Accessible, Interoperable and Reusable), the challenges and opportunities presented by the aggregation of widely distributed and heterogeneous data about biological and geological specimens, and the use of the Digital Object Architecture (DOA) model and components as an approach to solving those challenges that offers adherence to the FAIR principles as an integral characteristic.
- A paper, "Incorporating RDA outputs in the design of a European Research Infrastructure for Natural Science Collections", shows how DOA-related recommendations and supporting documents from the Research Data Alliance have been applied to the various stages of the DiSSCo data lifecycle.
- A paper, "FAIR Digital Objects for science: From data pieces to actionable knowledge units", concludes that the FAIR Digital Object concept has the potential to act as the interoperable federative core of a hyperinfrastructure initiative such as the European Open Science Cloud (EOSC).
- The FAIR Digital Object Framework describing rules that must be met by all implementations of FAIR Digital Objects.
As an extension of the World Wide Web, the Semantic Web (a.k.a. Web of Data) aims to make data on the Internet machine-readable. To achieve this, the Semantic Web uses languages such as Resource Description Framework (RDF), Web Ontology Language (OWL), Extensible Markup Language (XML) and, increasingly these days, JavaScript Object Notation for Linked Data (JSON-LD) and schema.org, in conjunction with some simple principles for linking (the Linked Data Principles). Semantic Web markup/Linked Data supplements and/or replaces human-readable HTML web pages with machine-readable descriptions of arbitrary data elements and their relations, making links between data that are understandable to both machines and humans.
These technologies are already widely deployed in biodiversity and geodiversity informatics settings. Museum collection data is routinely being published in both HTML and RDF forms by members of the Consortium of European Taxonomic Facilities (CETAF) through their institutional data portals, as this CETAF URI tester example demonstrates.
openDS meets and exceeds Semantic Web/Linked Data principles by ensuring that Digital Specimen data is machine-actionable (see knowledge representation above) as well as human-readable. The Ontology for open Digital Specimens (ODS) provides the definitions of terms, concepts and relations necessary to achieve this. The serialization of openDS as JSON-LD organises and connects Digital Specimens in a linked manner.
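To make the idea of a linked JSON-LD serialization more concrete, the following Python sketch builds and serializes a small JSON-LD document. It is an illustration under stated assumptions, not the openDS serialization: the `ods` namespace URL, the terms `ods:DigitalSpecimen` and `ods:physicalSpecimen`, and all `example.org` identifiers are hypothetical. The `prov` and `dwc` prefixes point at the W3C PROV and Darwin Core term namespaces.

```python
import json

# A hand-written JSON-LD document for one digital specimen.
digital_specimen = {
    "@context": {
        "prov": "http://www.w3.org/ns/prov#",
        "dwc": "http://rs.tdwg.org/dwc/terms/",
        "ods": "https://example.org/ods#",  # hypothetical namespace
    },
    "@id": "https://example.org/ds/1",  # hypothetical identifier
    "@type": ["ods:DigitalSpecimen", "prov:Entity"],
    "dwc:scientificName": "Bellis perennis",
    # Links are expressed as references to other identified resources,
    # which is what connects objects into a graph.
    "ods:physicalSpecimen": {"@id": "https://example.org/specimen/1"},
    "prov:wasGeneratedBy": {"@id": "https://example.org/activity/1"},
}

serialized = json.dumps(digital_specimen, indent=2)
print(serialized)

# Reading the document back yields the same structure.
loaded = json.loads(serialized)
```

Because each link is a reference to another identified resource rather than an embedded copy, a consumer can follow `prov:wasGeneratedBy` or the hypothetical `ods:physicalSpecimen` to retrieve the related object, and the `@context` tells it which vocabulary each term belongs to.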
openDS is intended to form the basis of the next generation of biodiversity and geodiversity informatics infrastructure for collections-based science, compatible with the emerging European Open Science Cloud (EOSC). This next generation of infrastructure - the Internet of FAIR Data and Services where data is implicitly findable, accessible, interoperable and reusable (FAIR) - transforms the Internet from an infrastructure predominantly designed for just conveying information in digital form from one location to another to an infrastructure where information can be proactively managed, manipulated and processed.
openDS builds compatibly on a body of standards extant in the biodiversity and geodiversity informatics domain developed by TDWG (Biodiversity Information Standards), GeoCASe and others over many years.
Specifically, openDS builds on: (Consider and elaborate each of the following.)
- Result of alignment activity - principal informant for openDS
- Darwin Core - principal informant for openDS.
- ABCD/EFG - principal informant for openDS.
- MIDS - openDS, MIDS and CD must align.
- CD - openDS, MIDS and CD must align.
- Audubon Core vocabulary for metadata of multimedia resources and collections - May have relevance for information related to multimedia resources encoding in digital specimens.
- GGBN data standard - May have relevance for information related to sequence data encoding in digital specimens.
- Relevant standards from the geosciences - There may be other relevant standards to consider.
To be completed.
openDS must demonstrate compatibility and alignment to support the extended specimen concept, as introduced by Webster (2017).
Encouragingly, digital specimens as conceived by DiSSCo in Europe and extended specimens as developed by the Biodiversity Collections Network in the USA share much of their conceptual basis.