dwc:occurrenceID in the context of dwc:catalogNumber review #21

Closed
cboelling opened this issue Oct 16, 2021 · 74 comments

Comments

@cboelling
Member

cboelling commented Oct 16, 2021

Originally posted by @tucotuco in #6 (comment)

An occurrenceID is meant to be, ideally, a resolvable (IRI that returns metadata when requested) global unique identifier for an assertion that an Organism was present or absent, or that no Organism identifiable as a member of a Taxon was present at a particular place at a particular time following some protocol for detection. It does not identify a material entity or a digital entity - it is an identifier for instances of an abstract concept of an exemplar of a Taxon having been present or absent, possibly backed by evidence in the form of material or digital entities.

From the current definitions of dwc:Occurrence and dwc:occurrenceID I understand that a dwc:occurrenceID is intended as an identifier for a dwc:Occurrence (the being present of one or more individuals of taxon X (somewhere) within geographical location L (at some time) during time interval T), which is, while less tangible than a preserved specimen, a real thing and exists independently of what biodiversity researchers think or do.

The concept described in the quote is different and sits on the level of assertions (i.e. what a human agent thinks about an occurrence for a given X, L, T), including assertions of absence, i.e. that a given occurrence, specified by X, L, T, did not occur.

I would just like to make sure that I understand the existing terms correctly, or whether there are new requirements for them. I created this as a new issue to keep the discussion in the original issue #6 close to its core topic.

@cboelling cboelling changed the title dwc:occurrenceID in the context of dwc:catalogNumber review dwc:occurrenceID in the context of dwc:catalogNumber review Oct 16, 2021
@Jegelewicz
Collaborator

One issue that I see here is that for most museums - the "catalog number" describes BOTH the "occurrence" AND the object(s) or organism in the collection related to it. So museums are going to need to up their game in creating unique IDs for occurrence, material sample, and so on....

@cboelling
Member Author

cboelling commented Oct 21, 2021

for most museums - the "catalog number" describes BOTH the "occurrence" AND the object(s) or organism in the collection related to it. So museums are going to need to up their game in creating unique IDs for occurrence, material sample, and so on....

I think that this is a key point for the issues tackled in this task group and in particular also for #11 on linking the type of artefact used to infer an occurrence to a pointer for that occurrence.

I am tempted to think that given the current definitions of the relevant terms in DwC that a dataset that uses the "catalog number" to describe BOTH the "occurrence" AND the object(s) or organism in the collection related to it is not DwC compliant and cannot be treated as such by recipients of that data.

@dagendresen
Contributor

Recall, as is already covered in our discussions, that an important contributing reason why museums describe their specimens as occurrences is that dwc:occurrenceID is required when publishing specimens in GBIF ;-)

@deepreef

One issue that I see here is that for most museums - the "catalog number" describes BOTH the "occurrence" AND the object(s) or organism in the collection related to it. So museums are going to need to up their game in creating unique IDs for occurrence, material sample, and so on....

THANK YOU!!! This is something I've been making noise about for years (and why I'm so excited about this Task Group!)

In summary: Catalog Numbers are assigned to physical things in collections (i.e., MaterialSample instances). When DwC expanded to accommodate unvouchered observations, the core record became an Occurrence (~=intersection of the organism represented by the physical thing and the place and time of its extraction from nature), and its (mandatory) identifier was occurrenceID. In the vast majority of cases, the cardinality of Catalog Numbers to (true) occurrenceID values is 1:1 (because most specimens in Museums were only collected once) -- so representing physical objects identified by a unique catalog number using the unique identifier representing its extraction from nature is not that big of a problem. But as soon as you get derivative MaterialSample instances (e.g., tissue samples extracted from whole specimens), we break the 1:1 cardinality. In many/most cases, the tissue sample gets its own unique catalog number, separate from the voucher specimen, but it shares the same circumstances of extraction from nature (Occurrence), so now we have two catalog numbers connected to the same occurrenceID. The problem, though, is that I bet in most cases when Museums have more than one MaterialSample derived from the same Occurrence (e.g., voucher and tissue sample), they end up being represented by two different occurrenceID values.

I think the recent activity surrounding MaterialSample (this Task Group and the DwC discussions that prompted it) shows that we've reached critical mass, where this problem (deviation from 1:1 cardinality between MaterialSample and Occurrence) now needs to be addressed at a community-wide level.

Back to your point, years ago we added materialSampleID to all our specimen records, in addition to the occurrenceID. To the uninitiated, this seems like a redundant identifier (why have two different unique identifiers for the same record?!?!). But the reason we do this is to accommodate cases where there is not a 1:1 correspondence between MaterialSample and Occurrence. For example, when a tissue sample is extracted from a voucher, the tissue sample gets its own catalog number, and its own materialSampleID, but it inherits the same occurrenceID as the voucher.
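A minimal sketch of the record layout described above, assuming hypothetical catalog numbers (`EX-0001`, `EX-0001-T1`) and freshly generated UUIDs; none of these values come from a real dataset, and the helper names are illustrative only. The voucher and its tissue sample each get their own catalogNumber and materialSampleID, while the tissue inherits the voucher's occurrenceID.

```python
import uuid
from collections import defaultdict


def new_id():
    """Return a fresh UUID in URN form (hypothetical identifier scheme)."""
    return f"urn:uuid:{uuid.uuid4()}"


def derive_sample(parent, catalog_number):
    """Create a derived MaterialSample record.

    The derived record gets its own catalogNumber and materialSampleID,
    but inherits the occurrenceID of the record it was taken from.
    """
    return {
        "catalogNumber": catalog_number,
        "materialSampleID": new_id(),
        "occurrenceID": parent["occurrenceID"],
    }


def samples_by_occurrence(records):
    """Group catalog numbers under the occurrenceID they share."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["occurrenceID"]].append(record["catalogNumber"])
    return dict(grouped)


# Hypothetical voucher specimen: one occurrence, one physical object.
voucher = {
    "catalogNumber": "EX-0001",
    "materialSampleID": new_id(),
    "occurrenceID": new_id(),
}

# Tissue sample extracted from the voucher: new object, same occurrence.
tissue = derive_sample(voucher, "EX-0001-T1")

grouped = samples_by_occurrence([voucher, tissue])
```

With this layout, a 1:many relationship between an Occurrence and its MaterialSample instances falls out of a simple grouping on occurrenceID.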

The logical consequence of this is that the DwC terms catalogNumber and otherCatalogNumbers should be organized in the MaterialSample class instead of the Occurrence class (where they are currently organized). I really think we need to get there, but this has huge implications for DwC data providers and consumers, because the so-called "Darwin Core triplet" (institutionCode+collectionCode+catalogNumber) has a long history of being a "natural key" for Occurrence instances; and indeed, I think a non-trivial number of content providers concatenate these three values to function as the value of occurrenceID.

In other words, getting the community on board with disentangling MaterialSample properties from what has traditionally been represented as Occurrence properties has non-trivial consequences.

@RogerBurkhalter

@deepreef I know that many of our neontological collections at my museum use the so-called "Darwin Core triplet" (institutionCode+collectionCode+catalogNumber) as the occurrenceID, and it will be very difficult to get them to change to a machine-generated UUID or PID. The one collection I am CM for, Invertebrate Paleontology, has UUIDs where recommended; it's not that hard. Getting smaller collections on board, especially those with limited resources of money or people, will be a major task.

@deepreef

@RogerBurkhalter: yeah, that's exactly what I did. For each specimen table that was the source of records for Occurrence instances (which already had an occurrenceID field with auto-generated UUID), I simply added a second field to the same table for materialSampleID with auto-generated UUID. As long as I know internally that the occurrenceID represents the "specimen at collecting event" (actually dwc:organism at dwc:event, but that's already discussed in issue #2 of this Task Group), and that the materialSampleID represents the physical specimen itself, it's easy to manage the links when cardinality differs from 1:1.

Indeed, I imagine most collections are at the mercy of their respective CMS and how it manages data and translates/exports it to DwC. But if we can achieve some sort of clarity and stability on the definitions of these various DwC classes (especially MaterialSample, Organism and Occurrence) and better understand the relationships among them, perhaps the CMS developers will begin adjusting their underlying data models to accommodate the various IDs properly.

@Jegelewicz
Collaborator

In the vast majority of cases, the cardinality of Catalog Numbers to (true) occurrenceID values is 1:1 (because most specimens in Museums were only collected once)

I think this may be less true than you think, especially as physical specimens have been subsampled and shared for molecular study. See ArctosDB/arctos#4032 (comment)

@deepreef

deepreef commented Nov 9, 2021

I think this may be less true than you think, especially as physical specimens have been subsampled and shared for molecular study. See ArctosDB/arctos#4032 (comment)

Agreed -- "vast" was an overstatement; but I bet if we looked at GBIF or iDigBio data, we'd still get a 1:1 correspondence between Catalog Number and occurrenceID in the majority of cases. But my larger point was that we can't rely on that -- even if exceptions are a minority, we need to accommodate them more robustly than we have been historically using DwC.

@dagendresen
Contributor

At least ALL museum collections using the Darwin-Core-Triplet (see also Guralnick et al 2014) approach to build their occurrenceIDs (as is STILL today recommended in the Darwin Core definition for occurrenceID!!!) would by design have a 1:1 cardinality of catalogNumber to occurrenceID!

occurrenceID (...) In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique. (...) Examples (...) urn:catalog:UWBM:Bird:89776

When generating a sub-sample from a museum specimen (or any MaterialSample), the Darwin-Core-Triplet as occurrenceID would be less of a problem if only the occurrenceID identifier-string was maintained unchanged as the occurrenceID for the sub-sample as well (and not generated anew from a new catalogNumber).
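A minimal sketch of that point, using the `urn:catalog:UWBM:Bird:89776` example from the occurrenceID definition quoted above. The function name and the sub-sample catalog number `89776-T1` are hypothetical, chosen only for illustration.

```python
def triplet_occurrence_id(institution_code, collection_code, catalog_number):
    """Build a Darwin-Core-Triplet style occurrenceID string."""
    return f"urn:catalog:{institution_code}:{collection_code}:{catalog_number}"


# occurrenceID of the original specimen, built from its triplet.
voucher_occurrence_id = triplet_occurrence_id("UWBM", "Bird", "89776")

# A sub-sample gets a new (hypothetical) catalogNumber, but keeps the
# occurrenceID string of the specimen it was taken from.
subsample = {
    "catalogNumber": "89776-T1",
    "occurrenceID": voucher_occurrence_id,
}

# Regenerating the occurrenceID from the new catalogNumber would instead
# produce a second identifier for the same occurrence.
regenerated = triplet_occurrence_id("UWBM", "Bird", subsample["catalogNumber"])
```

Keeping the original string preserves the link to the occurrence; regenerating it silently splits one occurrence into two.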

Indeed, I imagine most collections are at the mercy of their respective CMS and how it manages data and translates/exports it to DwC (deepreef)

Because MaterialSample (approx 2013-03-28) and materialSampleID (approx 2013-05-25) are relatively recent additions to Darwin Core, most museum collections would likely not have any materialSampleIDs assigned to their specimens (yet)? I see an important mission of the MaterialSample task group to (finally) build the foundation for museum collections to start implementing MaterialSample and materialSampleID - and to demand such implementations from their collection management systems.

@afuchs1

afuchs1 commented Nov 9, 2021

as the data manager of a combined herbarium, living collection and seed bank collection we have many use cases where an occurrence (when and where material was collected) has many catalogue numbers (how it is physically represented in the collection).

@deepreef

I see an important mission of the MaterialSample task group to (finally) build the foundation for museum collections to start implementing MaterialSample and materialSampleID - and to demand such implementations from their collection management systems.

Indeed! I think that those of us connected with data management systems that do incorporate materialSampleID values should coordinate and compare notes, so we (this Task Group) can develop a series of recommendations for how to disentangle occurrenceID values from materialSampleID values, in future presentations of DwC content.

@deepreef

as the data manager of a combined herbarium, living collection and seed bank collection we have many use cases where an occurrence (when and where material was collected) has many catalogue numbers (how it is physically represented in the collection).

Do you already assign materialSampleID values? If so (and even if not), how would you recommend collections like yours develop a process for associating catalog numbers with materialSampleID values, then aggregating sets of those values (i.e., MaterialSample instances) linked to a single occurrenceID value?

@dagendresen
Contributor

Here is a recent example with material from the same occurrence (collected in June 2021) deposited in the vascular plant herbarium (CMS = MUSIT) and in the DNA tissue bank (CMS = Corema) at the museum in Oslo. When publishing the DNA bank in GBIF a few years ago we quickly became aware of the restrictive requirement for distinct occurrenceID in each dataset - in practice blocking us from publishing derived tissue samples with the "correct" occurrenceID (because multiple tissue samples are often extracted from the same material sample/specimen). The tissue sample specimen is thus (unfortunately) published with the occurrenceID mapped to the assigned materialSampleID and the (correct) occurrenceID is instead published as relatedResourceID. (An organismID was minted as well for linking, but the vascular plant herbarium CMS did not support this term).

Herbarium Oslo https://doi.org/10.15468/wtlymk
occurrenceKey https://www.gbif.org/occurrence/3393367309
catalogueNumber 1628289
occurrenceID urn:catalog:O:V:1628289
DNA tissue bank https://doi.org/10.15468/nzszik
occurrenceKey https://www.gbif.org/occurrence/3357457309
catalogueNumber O-DP-81046/1-T
organismID urn:uuid:52036a5e-1943-5f16-9326-217c3c4a4fa1
occurrenceID urn:uuid:9516f60c-d4c0-4e39-9235-37c89fee38f2
materialSampleID urn:uuid:9516f60c-d4c0-4e39-9235-37c89fee38f2
relatedResourceID urn:catalog:O:V:1628289
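A minimal sketch of how a data consumer might follow that workaround back to the herbarium record, using the identifier values listed above. The dictionary layout and the function name are assumptions for illustration; this is not how GBIF or either CMS resolves the link.

```python
# Herbarium sheet record (values from the listing above).
herbarium = {
    "catalogNumber": "1628289",
    "occurrenceID": "urn:catalog:O:V:1628289",
}

# Tissue sample record: occurrenceID is mapped to the materialSampleID,
# and the "correct" occurrenceID is carried in relatedResourceID.
tissue = {
    "catalogNumber": "O-DP-81046/1-T",
    "organismID": "urn:uuid:52036a5e-1943-5f16-9326-217c3c4a4fa1",
    "occurrenceID": "urn:uuid:9516f60c-d4c0-4e39-9235-37c89fee38f2",
    "materialSampleID": "urn:uuid:9516f60c-d4c0-4e39-9235-37c89fee38f2",
    "relatedResourceID": "urn:catalog:O:V:1628289",
}


def resolve_source_occurrence(record, published_records):
    """Return the record whose occurrenceID matches relatedResourceID.

    Returns None when the record has no relatedResourceID or when no
    published record carries that occurrenceID.
    """
    index = {r["occurrenceID"]: r for r in published_records}
    return index.get(record.get("relatedResourceID"))


source = resolve_source_occurrence(tissue, [herbarium, tissue])
```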

@Jegelewicz
Collaborator

Here is an example of what happens with Arctos data and how we could (now) pass a MaterialSampleID.

https://arctos.database.museum/guid/DMNS:Mamm:12344
is from the same individual/collection event as
https://arctos.database.museum/guid/MSB:Mamm:233616

If you look at DMNS:Mamm:12344, this is how it would work:

| Term | Value | Note |
| --- | --- | --- |
| catalogNumber | DMNS:Mamm:12344 | Ideally, we would pass the url http://arctos.database.museum/guid/DMNS:Mamm:12344 |
| occurrenceID | http://arctos.database.museum/guid/DMNS:Mamm:12344?seid=877493 | this url is built at the time data is published each month and is a concatenation of the catalog record url, "?seid=", and the specimen event id |
| OrganismID | http://arctos.database.museum/guid/DMNS:Mamm:12344 | Arctos does have a place to enter this, but if nothing is entered there, the default is the url for the catalog record (assumes there are no other samples of this organism) |
| Associated Occurrences | (same individual as) MSB:Mamm http://arctos.database.museum/guid/MSB:Mamm:233616 | Here we place a concatenation of all "relationships". Because this record has the "same individual as" relationship, this is where one would find the "parts from a single organism" |
| MaterialSampleID | https://arctos.database.museum/guid/DMNS:Mamm:12344/PID21958887 | currently we do not pass anything here, but we have recently assigned on-the-fly numbers to individual parts in a catalog record which you can see at the bottom of the page. These numbers can be stabilized by the collection and turned into PIDs which means that the part can then never be deleted. We just added this feature recently and I am not aware of anyone making use of it yet. |
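A minimal sketch of the two rules stated in the notes above: the occurrenceID is the catalog record url, `?seid=`, and the specimen event id concatenated, and the OrganismID falls back to the catalog record url when nothing is entered. The function names are hypothetical and this is not Arctos code, only a restatement of the described behaviour.

```python
def build_occurrence_id(catalog_record_url, specimen_event_id):
    """Concatenate the catalog record url, "?seid=", and the event id."""
    return f"{catalog_record_url}?seid={specimen_event_id}"


def build_organism_id(catalog_record_url, entered_organism_id=None):
    """Use the entered organism ID if present, else the catalog record url."""
    return entered_organism_id or catalog_record_url


record_url = "http://arctos.database.museum/guid/DMNS:Mamm:12344"

occurrence_id = build_occurrence_id(record_url, 877493)
organism_id = build_organism_id(record_url)
```

Because the fallback OrganismID is derived from one catalog record, two records for the "same individual" (such as DMNS:Mamm:12344 and MSB:Mamm:233616) would get different default OrganismIDs unless a shared value is entered explicitly.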

Each of these exercises makes me realize how differently we are all approaching this and how I need to work with the Arctos community to get data in the appropriate places....

@afuchs1

afuchs1 commented Nov 11, 2021

as the data manager of a combined herbarium, living collection and seed bank collection we have many use cases where an occurrence (when and where material was collected) has many catalogue numbers (how it is physically represented in the collection).

Do you already assign materialSampleID values? If so (and even if not), how would you recommend collections like yours develop a process for associating catalog numbers with materialSampleID values, then aggregating sets of those values (i.e., MaterialSample instances) linked to a single occurrenceID value?

We don't have separate materialSampleIDs. Everything is treated as catalogued items within a collecting occurrence (currently we allocate an internal sequential number to it, as there are no real-world candidates, and deliver it to DwC by appending the institutionCode), and each item has a unique catalogNumber by virtue of adding a suffix for the different physical items across all 3 collections, e.g. CANB897925.1 herbarium sheet; CANB 897925.9 seed packet; CANB 897925.6 cutting (now dead). A DNA sample collected at the time of collection is also given a catalogNumber. The institutionCode and accNo are not necessarily unique within an occurrence, as we have combined separate institutional collections over time and used different accession numbering schemes, but they all link to the same occurrenceID.
Currently we don't handle allocation of an ID to samples taken from items well (i.e. DNA taken from a preservedSpecimen or livingSpecimen), but in my mind whether these get a materialSampleID or catalogNo is less important than their having an ID which can be resolved, so that we know what the status of that 'thing' is. Does it still exist or was it transitory, what was/is the type of material, and can we create explicit relationships between these 'things' and hold data about them? If this is held in the data then we can deliver it to any schema.
E.g. an image taken of this sheet, DNA sampled from a leaf on this sheet, a cutting taken from a plant grown in the gardens that was originally collected as a cutting in the wild. Each of these is essentially 'object' relationship 'object', about which we can hold additional data. (I think I have gone off track)

@Jegelewicz
Collaborator

@afuchs1 I don't think you went off track at all! I think that is the essence of what MaterialSample should be about - "this"!

@deepreef

Agreed! I think this gets to the heart of what we're trying to address in this Task Group.

I'm still wrestling with the boundary between Organisms and MaterialSamples. In many cases, the distinction is clear -- but in some cases it gets murky. For example, consider this real-world use-case:

Diver encounters a rare fish on the reef, and gets in-situ video of it. The fish is collected and brought to the surface alive, and transported half-way around the world (still alive). It is then photographed again (alive) in an aquarium. Some years later it dies and becomes a specimen at a Museum, where it is photographed again before preservation. Several tissue samples are taken, and the remaining specimen is preserved in alcohol. Over time, the specimen is moved from one shelf to another, or put on display, or loaned, or whatever.

There's a lot to unpack there, and while it may seem like a bit of an edge case, it's not that sharp of an edge, and whatever we come up with ought to be able to accommodate this kind of use case.

I'm still (mostly) confident that an instance of dwc:Occurrence represents an intersection of an instance of dwc:Event and dwc:Organism, and that any associated dwc:MaterialSample instances (living or dead or extracted tissue) do not participate in any dwc:Occurrence instances directly (in the same way that dwc:Identification instances do not participate directly in dwc:Occurrence instances). Instead, dwc:Occurrence instances associated with dwc:MaterialSample instances are inherited through a dwc:Organism intermediary.
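A minimal sketch of that model, assuming hypothetical class and field names that only loosely mirror the DwC classes: an Occurrence joins an Event and an Organism, a MaterialSample points only at its Organism, and the Occurrences for a sample are found through that Organism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str


@dataclass(frozen=True)
class Organism:
    organism_id: str


@dataclass(frozen=True)
class Occurrence:
    """Intersection of one Event and one Organism."""
    occurrence_id: str
    event: Event
    organism: Organism


@dataclass(frozen=True)
class MaterialSample:
    """Physical thing; linked to an Organism, not to an Occurrence."""
    material_sample_id: str
    organism: Organism


def occurrences_for_sample(sample, occurrences):
    """Find Occurrences for a sample via its Organism intermediary."""
    return [o for o in occurrences if o.organism == sample.organism]


# Hypothetical identifiers for the fish example above.
fish = Organism("org:fish-1")
other = Organism("org:fish-2")
reef_video = Occurrence("occ:1", Event("evt:reef-dive"), fish)
unrelated = Occurrence("occ:2", Event("evt:other-dive"), other)
tissue = MaterialSample("ms:tissue-1", fish)

linked = occurrences_for_sample(tissue, [reef_video, unrelated])
```

In this sketch a MaterialSample never stores an occurrenceID of its own; whether that holds up for the aquarium and museum photographs is exactly the open question in the next paragraph.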

However, I also recognize that dwc:MaterialSample instances participate in what could be framed as dwc:Event instances directly (assuming a dwc:Event instance is an action that happens at a particular place and time). For example, when I photographed the live fish in an aquarium in the example above, is it the Organism instance that was photographed, or the MaterialSample? Same question for photographing the dead specimen at the Museum. And same for when the tissue sample was analyzed for DNA sequencing. In all those cases (live aquarium photo, dead specimen photo, DNA sequencing), information about where, when and by whom fits nicely into a dwc:Event instance -- but at least in some cases, it would be a MaterialSample that participated in the Event, not an Organism. But would we call that intersection of MaterialSample+Event an Occurrence (sensu DwC)? Or is the nature of that intersection somehow different from a "proper" dwc:Occurrence instance? Same applies to a tiger in a zoo, I guess.

Ultimately, we want to be able to attach media items to both Organisms and MaterialSamples, and somehow track the circumstances of the Events where those media captures took place. I'm just not clear in my head whether all, or some of those involve dwc:Occurrence instances.

OK, now I'm the one who has gone way off track! Sorry about that! I know the above is probably way too abstract and conceptual for what we're trying to accomplish with dwc:MaterialSample -- but it seems to me that having this sort of stuff more or less understood and sorted out will only improve whatever standards recommendations emerge from this Task Group.

@dagendresen
Contributor

Following your line of thought - what is the thing/class taking part in an Event ... in an Occurrence?

How do we model environment or ecosystem or nature types or geology (which are not appropriately modeled as Organism)? Would these only be properties of a Location? Or is there room for a new class for these things (in Darwin Core? from before they became MaterialSamples?). In my mind, we sample MaterialSamples (which also can become accessioned collection specimens) from such things, e.g. water samples, minerals, geological samples, etc., for purposes other than recording any living things. In my mind, we thus already have many MaterialSamples (accessioned specimens) at the museum in Oslo that are not derived from any Organism.

Apropos - Is an Occurrence with occurrenceStatus = absent then actually an Occurrence at all?

(However, I am jumping outside the topic of this thread here)

@deepreef

These are the kinds of questions that keep me up at night contemplating. I guess one fundamental thing we ought to pin down is: is the "Sample" of MaterialSample a noun or a verb?
Noun: "Some material thing that represents a sample of some abstract or material thing"
Verb: "Some material thing that has been sampled from some abstract or material thing"
The distinction is subtle (if it even exists), but I tend to lean toward noun, which doesn't require that there be a "sampling event". Sure, there may be a sampling event, and this may be true for the vast majority of MaterialSample instances, but treating it as a noun means that the sampling event is not necessarily intrinsic to the MaterialSample itself.
This is probably way too esoteric (and maybe unnecessary), but I guess it boils down to whether a MaterialSample instance is always, or merely usually, the result of a sampling event.

We deal with non-organism stuff the same way we deal with organism stuff, in that we treat "Organism" as a subclass of "Individual" (other subclasses could be things like "vehicle", "sunset", "habitat", etc.) They're all fundamentally abstract, and many (but not all) of them have material manifestations. So instances of MaterialSample are not limited to being physical representations of living things. A non-fossil rock can be a MaterialSample in my mind (even if outside the scope of DwC).

I'm reluctant to apply any of these things directly to Location, because most of them are bounded by time ... which I think would make them technically Events (at least in my mind).

Apropos - Is an Occurrence with occurrenceStatus = absent then actually an Occurrence at all?

Yeah, that's another one that keeps me up at night.

I worry this may lead down one of those very distracting philosophical paths that would be appropriate in some context, but not this one. On the other hand, I think some of these fundamentals are important to allow us to nail down the scope & definition of dwc:MaterialSample.

@dr-shorthair

The key property of a Sample - material- or otherwise - is the intention that it be representative of something larger.
This is particularly obvious from the verb form 'to sample'. If you don't want to consider the act that created it, or the intention to represent something, then 'sample' is just a fancy name for 'thing'.

@dagendresen
Contributor

I worry this may lead down one of those very distracting philosophical paths that would be appropriate in some context, but not this one. On the other hand, I think some of these fundamentals are important to allow us to nail down the scope & definition of dwc:MaterialSample.

What is a MaterialSample?

(PreservedSpecimen + FossilSpecimen + LivingSpecimen + tissue samples & environment samples => MaterialSample)

Could a material non-organism thing in situ that is not yet sampled still be a kind of "MaterialSample"? Could such a thing in situ qualify for a dwc:catalogNumber if it is accessioned/catalogued (by a museum)? I guess such a non-organism thing would anyway not be part of an Occurrence (and would never be assigned a dwc:occurrenceID)?

My use case is a "nature type" (which could also be lifeless) that has been evaluated and designated for active conservation by national nature protection legislation. (Would we even want or care to enable Darwin Core to describe the monitoring and conservation of, and ecological research on, such things?)

(Sorry for staying outside of the thread main topic)

@albenson-usgs

From my perspective the only time you have an occurrence is when you have an organism (or some part of an organism that can be identified, e.g. DNA) in its natural environment. Therefore the fish photographed in the aquarium is not an occurrence, nor when it dies and goes to a museum, or when tissue samples are taken. Those are all events, sure, but they are not occurrences.

I don't have trouble with the idea that occurrenceStatus = absent is still an occurrence. You went and looked for an organism using methods that would usually find it if it was there, and didn't see it. I think you all are getting too philosophical here. Researchers use these zeros in their analyses all the time. It's not like Darwin Core invented this out of thin air. I was just on a call yesterday for seagrass monitoring where they want to make sure to include when a species of seagrass occurs in one plot at a field site but is absent from another plot, because it has importance in the analyses they do.

@dagendresen
Contributor

an occurrence is when you have an organism (or some part of an organism that can be identified, e.g. DNA) in its natural environment

What is an Occurrence?

If the Occurrence is the intersection between Event and Organism, there would be an Occurrence (that CAN be described and identified by an occurrenceID) each and all times this intersection happens - not limited to the "natural environment" (sensu in situ in "wild" nature)?!

If a LivingSpecimen can be BOTH "MaterialSample" and "Organism" at the same time (??), then the places and times where it CAN be described as an "Occurrence" would be broader than the original "collecting" event when it was sampled from "wild" nature (sensu in situ)?! Thus, even a tiger in a Zoo is a valid Occurrence?

For a cultivated crop resulting from crop breeding and conserved as a LivingSpecimen there is no "wild natural environment" at all -- so would we then agree that the "natural environment" is the agricultural field?

@dagendresen
Contributor

I think that occurrenceID in practice is used to identify the different "Evidence" of (mostly) "Occurrence"s and not as an identifier for the "Occurrence" itself! And that this is causing all kinds of problems!

@baskaufs

Way behind on this conversation, but I can say with confidence that @tucotuco confirmed years ago that occurrences do not have to be restricted to natural occurrences. It's buried somewhere in the tdwg-content email archives.

@tucotuco
Member

tucotuco commented Nov 12, 2021 via email

@albenson-usgs

albenson-usgs commented Nov 12, 2021

@tucotuco from Rich's example above (fish in the water -> aquarium -> museum -> tissue sample) can you tell me which of those are occurrences and therefore get an occurrenceID?

@tucotuco
Member

Occurrence: "An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time."

A record of the time and place the video of the fish (Organism) was taken is a good candidate for a distinct Occurrence.
A record of the time and place the fish was put in the aquarium is a good candidate for a distinct Occurrence.
A record of the time and place the fish was photographed in the aquarium is a good candidate for a distinct Occurrence.
A record of the time and place the fish died in the aquarium is a good candidate for a distinct Occurrence.

A record of the time and place the fish was accessioned in the museum could be an Occurrence, but not one that anyone in our community has expressed publicly as an interesting one from the perspective of science, rather, it is interesting from the perspective of collection management. Similarly with the specimen as it moves around.

@Jegelewicz
Collaborator

Jegelewicz commented Nov 12, 2021

Wouldn't they all be occurrences? However, I don't think that establishmentMeans does what is necessary here. None of the terms in the controlled vocabulary accurately describe any of the occurrences besides the fish in the water. The paleo people have discussed the use of in situ/ex situ as a method for getting to "natural" or not.

@RogerBurkhalter

I think many of the terms used in Occurrence are legacy terms and definitions from museum collections. I do not agree with changing catalogNumber to occurrenceNumber; we have occurrenceID to handle that. A catalogNumber implies a physical object that has been cataloged in a museum or repository, whereas machine and human observations are not routinely cataloged or numbered by a collection (or have not been). Yes, when I run across an observation in a field notebook or measured section log of an occurrence that has no corresponding collected objects, I record that as a humanObservation, and the CMS gives it an occurrenceID, but I use a UUID that does not resemble my institution's museum catalog numbers. Of the hundreds of thousands of images (analog photographs and digital images) we have, not all are of objects collected, nor are all of the objects in the images reposited at our institution. I have not even begun to work on these machine observations, and will not until such time as we have a DAM and students/volunteers to scan negatives and prints.

+1 @tucotuco

@deepreef

@dagendresen:

Describing the occurrences of things is a rather poor proxy for describing the things themselves.

The value of dwc:Occurrence is not about serving as a proxy for describing the things themselves. It is about understanding something about biodiversity in space and time. That is the primary value of aggregating data about specimens (in the context of their collecting events) and observations. But just because that's the primary value doesn't mean it's the only value. This group (dwc:MaterialSample) is focused on accommodating the information exchange needs of physical objects curated in collections, independently of whatever biodiversity research value can be derived from understanding the distribution of biodiversity in space and time.

@Jegelewicz:

[Occurrence =] Evidence of an Organism at an Event.

I'm not comfortable with this collapsing of "Occurrence" into "Evidence of an Occurrence". To me, the Occurrence instance has always been, and should always remain, the abstract fact of a particular organism existing at a particular place and time. There may be multiple pieces of Evidence supporting the truth of an Occurrence (a specimen, a photo, a record in a field notebook, a published reference, etc.) Each one of these individual pieces of Evidence should not be treated as separate instances of Occurrence. So I would re-frame your definition to something like:

Occurrence = Presence of an Organism at an Event, [or absence of an Organism at an Event] that is supported by some form of documented Evidence.

I included the "absence" clause in brackets because I'm not sure there is universal consensus that absences are in scope for dwc:Occurrence (I support it). But the good news is that we've recently affirmed that the scope of an instance of dwc:Organism can extend up to and including every single individual that is identifiable to a particular taxon. Thus, a single instance of dwc:Organism can be a direct proxy for every individual of a dwc:Taxon, and therefore can be used as the dwc:Organism instance participating in an "absence" dwc:Occurrence instance. Indeed, this approach gives us the flexibility (through dwc:Identification) to tie that dwc:Organism to a specific taxon concept (via anchoring to a TNU). But I digress...

The point is, a definition of this sort would accommodate recording both absences of individuals at an Event (e.g., "Wolf # 427 was not with the pack at this location today"), as well as absences of Taxa at an event (e.g., "we saw no individuals of this taxon at this place and time".) The only difference is the scope of the associated Organism instance.
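
A minimal sketch of this framing in Python may help make it concrete. The class and field names here (`Organism`, `Event`, `Evidence`, `Occurrence`, `scope`, `present`) are assumptions chosen for illustration, not Darwin Core terms, and the values are placeholders. It shows one Occurrence carrying several pieces of Evidence, and an individual-level absence and a taxon-level absence that differ only in the scope of the Organism.

```python
from dataclasses import dataclass, field
from typing import List
import uuid


@dataclass(frozen=True)
class Organism:
    label: str
    # Hypothetical field: "individual" for one animal, "taxon" for every
    # individual identifiable to a taxon.
    scope: str


@dataclass(frozen=True)
class Event:
    location: str
    date: str


@dataclass(frozen=True)
class Evidence:
    kind: str        # e.g. "specimen", "photo", "field note"
    reference: str


@dataclass
class Occurrence:
    organism: Organism
    event: Event
    present: bool = True
    evidence: List[Evidence] = field(default_factory=list)
    # A machine-friendly identifier; a UUID is one possible choice.
    occurrence_id: str = field(default_factory=lambda: str(uuid.uuid4()))


wolf = Organism("Wolf # 427", scope="individual")
taxon_x = Organism("all individuals of taxon X", scope="taxon")
event_a = Event(location="Site A", date="2021-11-01")
event_b = Event(location="Site B", date="2021-11-02")

# One presence supported by two pieces of evidence is still one Occurrence.
sighting = Occurrence(
    organism=wolf,
    event=event_a,
    evidence=[Evidence("photo", "IMG_0001"), Evidence("field note", "p. 12")],
)

# Absence of an individual vs. absence of a whole taxon at the same Event.
individual_absence = Occurrence(wolf, event_b, present=False)
taxon_absence = Occurrence(taxon_x, event_b, present=False)
```

In this sketch the Evidence objects hang off the Occurrence rather than each becoming an Occurrence of their own, which is the distinction argued for above.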

catalogNumber - seems like this would be better labeled as occurrenceNumber

I agree with @RogerBurkhalter on this. We generate values of dwc:catalogNumber as a human-friendly convenience mechanism for humans to refer to a specific instance of a thing that other humans can understand easily. This makes perfect sense in the context of MaterialSample because these are the things we humans curate and manage (by definition in my view). But it doesn't make as much sense to me to assign "occurrenceNumber" values to the more abstract "presence of an Organism at a place and time" instances, because we humans less often need to communicate with other humans about specific Occurrence instances. More often we aggregate Occurrence instances (e.g., same taxon at same location), and only tunnel down to individual instances in cases of doubt or other need for verification. Thus, I think computer-friendly GUIDs (captured with occurrenceID) are better for referencing instances of dwc:Occurrence. In contrast, we humans communicate with other humans about MaterialSample instances all the time (identifications, loans, storage locations, tissue extractions, etc., etc.), so it makes sense to maintain a human-friendly catalogNumber value for them.

The way catalogNumber has and continues to be used in our community, I think it makes MUCH more sense to organize this term with the MaterialSample class, rather than re-define it as something like "occurrenceNumber". In summary, I don't think we need a human-friendly proxy for occurrenceID, but I think there is value in having a human-friendly proxy for materialSampleID; and catalogNumber seems to fit that role perfectly.

@dagendresen
Contributor

I think we might have an idealized idea of Occurrence as Organism at Event. (my intention is not to argue against this concept)

And then we have how Occurrence dominantly is used in practice as the Evidence of an Organism at Event. (what I think for the most part is the actual nature of the things that are identified by occurrenceID)

I tend to think that our community might have painted itself into a corner and that maybe accepting that dwc:Occurrence has become predominantly used as the Evidence of might maybe be a possible least bad way out. (... and MAYBE instead consider minting a new class "OrganismOccurrence")

In my mind, the semantics of an "occurrenceNumber" is already exactly covered by dwc:recordNumber.

recordNumber = An identifier given to the Occurrence at the time it was recorded. Often serves as a link between field notes and an Occurrence record, such as a specimen collector's number.

I also think that we are in agreement of the value of "Occurrence" - and that we agree (as is the motivation for this task group) that we need ANOTHER concept to describe objects in collections (PreservedSpecimen, MaterialSample, ...). It is the latter need I intended to express by "the occurrences of things is a rather poor proxy for describing the things themselves".

@deepreef

I tend to think that our community might have painted itself into a corner and that maybe accepting that dwc:Occurrence has become predominantly used as the Evidence of might maybe be a possible least bad way out.

I think the real corner we painted ourselves in, years ago, was the (mis)interpretation that specimen=Occurrence. This group and the class MaterialSample offer us a pathway out of that corner by separating the physical things we deal with (preservation methods and subsampling and loans and such) from the research value of those things (traditionally focused on information about the where and when of its extraction from nature). Most of it, I think, is pretty straightforward. The main fuzzy part (in terms of both idealized conceptualization and practical implementation) is this business of the boundary between dwc:Organism and dwc:MaterialSample, and the respective lifespans of each. Depending on how we lock in those boundaries and lifespans, and whether and to what extent they overlap in space and time, we may (or may not) have another question of how to manage instances of "An existence of a MaterialSample (sensu however we end up defining it) at a particular place at a particular time."

In my mind, the semantics of an "occurrenceNumber" is already exactly covered by dwc:recordNumber.

Agreed!

I also think that we are in agreement of the value of "Occurrence" - and that we agree (as is the motivation for this task group) that we need ANOTHER concept to describe objects in collections (PreservedSpecimen, MaterialSample, ...). It is the latter need I intended to express by "the occurrences of things is a rather poor proxy for describing the things themselves".

Ah! OK, understood. In that case, we agree -- and I think the convergence on MaterialSample (and its definition, scope, and properties) -- which is what this Task Group is focused on -- is the path to sorting that stuff out (i.e., once and for all dispelling the "specimen=Occurrence" issue). But I don't think we need to mess around with dwc:Occurrence, except to steal a few of the properties that are organized in that class, so they are instead organized in the MaterialSample class.

@dagendresen
Contributor

Is not

the (mis)interpretation that specimen=Occurrence

just a subset of the larger (in scope) misconception that Occurrence = any evidence of an organism-occurrence? (as in effect treating specimens only as such evidence)

@RogerBurkhalter

I see Occurrence data as one of the main paths forward in paleontology, especially human observations in the form of measured section notes, as a primary information source for new finds and new studies. Often, a researcher has a bias towards collecting a particular taxon type, for example, Devonian gastropods. While documenting the section, they happen upon an occurrence of ostracods. The gastropod researcher may have no interest in those ostracods, but notes the occurrence. Later, when another researcher is seeking Devonian ostracods for research, having that occurrence documented and findable is a major plus. There are literally tens of thousands of detailed, documented measured sections, published and unpublished, in museum collections and other repositories (like the USGS) that have hundreds of thousands of human observations of similar type occurrences. These are very important and under-documented resources that could certainly influence the future of study.

@deepreef

Is not "the (mis)interpretation that specimen=Occurrence" just a subset of the larger (in scope) misconception that Occurrence = any evidence of an organism-occurrence? (as in effect treating specimens only as such evidence)

I guess you could look at it that way, but the history more or less boils down to:

  • Originally DwC was meant to exchange specimen data
  • It was changed to Occurrence to accommodate observation data, but most Museums interpreted Occurrence=Specimen (i.e., Specimen at the time and place of its extraction from nature)
  • MaterialSample was established with the intent for application on water/soil samples, but was defined more generally to include specimens
  • Several of us started talking about the need for some sort of "Evidence" class to accommodate multiple lines of evidence supporting an Occurrence (documented in Darwin-SW)

So, yeah, "specimens"/MaterialSample were certainly part of the "Evidence" conversation, but the conflation of Specimen=Occurrence predates that by quite a bit.

However, I guess it is fair to say that "Specimen=Occurrence" is something of a subset of Occurrence=Evidence -- and this is also supported by cases where the same Occurrence instance is represented separately for the specimen and for the image of the specimen. But that's often another set of issues because in many cases, the image wasn't taken at the same place and time where the specimen was collected; rather some time later at a different location (e.g., in a lab). So in those cases, the image doesn't even represent evidence of the same occurrence that the specimen represents.

In any case... my original point is that we should not re-define "Occurrence" as being the Evidence (as here). In other words, the specimen and the image and the field notebook are not the Occurrence -- the Occurrence was the presence of the organism at the place and time.

@Jegelewicz
Collaborator

On one side we have a name and a label for a concept - a shortcut or alias, and on the other side we have a definition (canonically in English) to convey the better understanding of what it means. The definition has to stand on its own, and cannot rely on the label for further edification.

@tucotuco that may be, but clearly we don't have a common idea of what "organism" means.

It is if there is an understanding of what "organism" (not dwc:Organism) means. If there isn't, then that needs to be added to the definition or the usage notes to clarify what meaning of "organism" in English is being used.

So I think a clarification is needed, because until we can disentangle dwc:Organism from dwc:MaterialSample, I don't think we can move on.

The main fuzzy part (in terms of both idealized conceptualization and practical implementation) is this business of the boundary between dwc:Organism and dwc:MaterialSample, and the respective lifespans of each. Depending on how we lock in those boundaries and lifespans, and whether and to what extent they overlap in space and time, we may (or may not) have another question of how to manage instances of "An existence of a MaterialSample (sensu however we end up defining it) at a particular place at a particular time."

@Jegelewicz
Collaborator

Depending upon our clarification for "organism", the second wrench in the works that I think we need to address is how are dwc:Organism and dwc:LivingSpecimen different?

@dagendresen
Contributor

dagendresen commented Nov 14, 2021

Would something along the lines of ... be useful:

  • dwc:occurrenceID (based on current predominant use 🙈 🙉) be renamed as recordID and at least be moved from Occurrence to the record-level terms (😜)
  • dwc:recordNumber be renamed as occurrenceNumber (and remain an Occurrence property)
  • dwc:catalogNumber and other MaterialSample properties be moved from Occurrence to MaterialSample
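
To see what the three bullets above would do to an existing flat record, here is a rough Python sketch. It only illustrates the suggestion as floated in this comment; nothing here is an adopted Darwin Core change. The `PROPOSED_MOVES` table, the `regroup` helper, the "Unchanged" bucket and all record values are hypothetical.

```python
# Hypothetical mapping: current term -> (proposed class, proposed term name).
PROPOSED_MOVES = {
    "occurrenceID": ("Record-level", "recordID"),
    "recordNumber": ("Occurrence", "occurrenceNumber"),
    "catalogNumber": ("MaterialSample", "catalogNumber"),
}


def regroup(flat_record: dict) -> dict:
    """Group a flat record's values by proposed class, under the proposed names."""
    grouped: dict = {}
    for term, value in flat_record.items():
        # Terms the proposal does not mention are kept under "Unchanged".
        target_class, new_name = PROPOSED_MOVES.get(term, ("Unchanged", term))
        grouped.setdefault(target_class, {})[new_name] = value
    return grouped


# Placeholder values, not real identifiers.
flat = {
    "occurrenceID": "urn:uuid:example-1",
    "recordNumber": "field-no-1",
    "catalogNumber": "12345",
    "basisOfRecord": "PreservedSpecimen",
}

regrouped = regroup(flat)
for class_name, terms in regrouped.items():
    print(class_name, terms)
```

Only the three terms named in the proposal are moved or renamed; every other term in the record passes through untouched.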

@stanblum
Member

stanblum commented Nov 14, 2021 via email

@deepreef

A living specimen is, of course, an organism. I think the key distinction between the two concepts is that LivingSpecimen is a kind of MaterialSample, whereas the DwC Organism class is intended to represent an organism that is inferred to exist or to have existed (past tense).

I agree, and was going to make similar points in the as-yet-unwritten "Chapter 6" of my unsolicited dissertation.

To me, the two core properties of a MaterialSample are:

  1. It consists, in essence, of physical matter; and
  2. It is under the direct control and care of humans.

These need to be fleshed out more (as I had intended within my concluding "Chapter 7"), and I'm still getting my head around whether I agree that the "sample" necessarily requires it to be some subset of a larger thing and/or whether the verb part of "sample" is definitive.

The critical role for the Organism class is that the concept and in particular the property dwc:organismID ties together multiple occurrence records that derive from the same organism.

I would consider that "a" critical role; not "the" critical role. Certainly it was the original critical role (sensu the old dwc:individualID within the Occurrence class). But I think the other roles you alluded to:

Other properties derive from the Organism class (most importantly what taxon the organism represents), but in our "shorthand" practice they are commonly recorded as properties of something that has a 1:1 relationship with organism, i.e., the Occurrence or the whole-animal MaterialSample.

... which I would summarize as:

  1. The bridge between a dwc:MaterialSample and a dwc:Identification; and
  2. The bridge between a dwc:MaterialSample and a dwc:Occurrence.

... are actually more directly relevant to this Task Group, and in our modern thinking of representing DwC as more than just a bag of terms loosely organized into Classes.
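
The two bridging roles can be sketched in Python as follows. This is only an illustration under assumed names (`organism_id`, `catalog_number`, `taxa()`, `occurrences()` and the placeholder values are all hypothetical), not a proposed implementation. A MaterialSample holds no taxon or place/time data of its own and reaches both through its Organism.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Identification:
    taxon: str


@dataclass
class Occurrence:
    place: str
    date: str


@dataclass
class Organism:
    organism_id: str
    identifications: List[Identification] = field(default_factory=list)
    occurrences: List[Occurrence] = field(default_factory=list)


@dataclass
class MaterialSample:
    catalog_number: str
    # The sample reaches taxon and occurrence data only via this link.
    organism: Organism

    def taxa(self) -> List[str]:
        return [ident.taxon for ident in self.organism.identifications]

    def occurrences(self) -> List[Occurrence]:
        return self.organism.occurrences


bird = Organism(
    organism_id="org-1",
    identifications=[Identification("Taxon X")],
    occurrences=[Occurrence("Site A", "2021-11-01")],
)

# Two samples derived from the same organism share its identification
# and its occurrence without duplicating either.
skin = MaterialSample("CAT-1", bird)
tissue = MaterialSample("CAT-1-T", bird)
```

A later re-identification added to `bird.identifications` would then be visible from both samples at once, which is one practical reason to route the link through the Organism.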

@dr-shorthair

To me, the two core properties of a MaterialSample are:

  1. It consists, in essence, of physical matter; and
  2. It is under the direct control and care of humans.
  3. It is a sample of something

@stanblum
Member

@deepreef wrote:

I agree, and was going to make similar points in the as-yet-unwritten "Chapter 6" of my unsolicited dissertation.

Maybe I've started your Chapter 6? In any case, I've been working on one myself, which I posted on the Wiki home page.

I think we can start working towards definitions, and analyzing scenarios (not really full use cases) to come up with recommendations about how the resulting records should be published and interpreted. I've been struggling a little with formatting and how to represent the critical concepts and data structures. So please make edits or add new representations if what I've done is unclear.

@dr-shorthair

By which I mean

  • it was obtained by an act of sampling
  • there is an intention that it be representative of something bigger, which should be identifiable now or later

@deepreef

deepreef commented Nov 14, 2021

  3. It is a sample of something

By which I mean

  • it was obtained by an act of sampling
  • there is an intention that it be representative of something bigger, which should be identifiable now or later

Yeah, I get that. But here's why I'm still wrestling with it:

So if I collect a specimen of a bird and put it in a collection, what bigger thing is it a representative of? A flock? A population? A species? A vector of a disease? Ok, let's say one of those works, and it doesn't matter which. What, then, is an example of something physical that is not a representative of something bigger? I mean, if it's made of matter, then isn't it ultimately a representative of the universe?

I guess my question is: what are some examples of physical things that would not fulfill this third criterion? If it doesn't help us understand what is not in scope, then what purpose does it serve in the definition?

EDIT

OK, maybe when you say "is a sample of something", you mean the same thing that I mean when I say "It is under the direct control and care of humans"? That is, "it was obtained by an act of sampling" means the same thing as "it was taken into custody by humans". If those two cancel each other out, then that leaves the criterion that it must be a subset of something larger. To which I refer back to the rest of this post above.

@deepreef

@stanblum: Thanks for the link to the Wiki page! Maybe I should have captured my "Dissertation" in that sort of template, rather than a series of Issue posts? I can reformat accordingly.

@stanblum
Member

I think the wiki formatting tools are too limited and too hard to edit. I think we should switch over to GoogleDocs. Do we have a folder already?

@dr-shorthair

dr-shorthair commented Nov 14, 2021

Q. Why do you collect and manage a sample?
A. So that you can make observations on it.
Q. Why are the observations interesting?
A. In a science context: Because they tell us something about the taxon/population/ecosystem ...
In a non-science context: Because we want to describe the artefact in its own right.

I think we are doing science, right?
It is certainly true that the artefact may represent more than one thing, in context of different observations.
It is also true that for some samples we don't know what they represent at the time they are collected and catalogued.
But if we are doing science, then the path is from the particular to the general, and we should keep the general in view from the beginning.

@smrgeoinfo

@deepreef -- examples of 'physical things that would not fulfill this third criterion' (not samples):

  • The rocks I've picked up in the desert to bring home to use for landscaping the yard
  • The wine glasses in my kitchen cabinet
  • The boxes of laundry detergent on the shelves at the grocery store.

They are just things-- yes they can be categorized, but there is no intention of using them to learn anything about the world, they are just attractive or useful.

I assume the bird that is collected and preserved to put in a museum is not just a decoration-- there is some intention to learn something about the world from it...

@deepreef

I think we are doing science, right?

Probably, but I don't think our definitions should hinge on intent. I think these things should focus on capturing facts, regardless of whether we want to do science with the information, or just look at pretty dead bugs. I mean, pretty dead bugs in a non-scientist's personal/private collection still function as evidence of occurrence -- assuming the data are accurate.

The rocks I've picked up in the desert to bring home to use for landscaping the yard

If you recorded the kind of rocks they were, and where they came from, then wouldn't that still be potentially useful information? How is it different for rocks in your yard vs. scientific specimens that are lost or destroyed after they are collected? In both cases, the Occurrence data are still valuable, and during the period of time when the samples (noun) were in the possession/control of a human, I would still consider them to be candidates for instances of MaterialSample.

As for the wine glasses and laundry detergent, these are out of the TDWG scope (non-biological), but I wouldn't automatically rule them out of scope for non-biological data nerds. Imagine I was a collector of rare wine glasses and found a dead insect in one of them. From my perspective, the insect would be worthless, but to an Entomologist, it might represent a new geographic record.

I realize I'm stretching things here, but I guess my point is that motivation/intent should probably not be among the criteria for defining the scope of Occurrences and MaterialSamples. What should matter is that someone took the time to record and document the information, and to share the information -- whatever their motivation was.

@Jegelewicz
Collaborator

motivation/intent should probably not be among the criteria for defining the scope of Occurrences and MaterialSamples.

Agree! I think there are a lot of things currently recorded in museum catalogs that were collected because they were pretty or unique without any intent to study them. That doesn't make them less valuable or unable to be used as MaterialSamples now especially if there is some data to go with them (but for some forms of study, data isn't even that important).

@dr-shorthair

On the contrary - I suggest that motivation/intent is central here. We do science. We deliberately design a sampling and observational program, in order to describe the world in a systematic way. This is not random.

@stanblum
Member

stanblum commented Nov 17, 2021

Natural existence versus human intention: maybe the compromise is to acknowledge that nearly infinite organism-space-time intersections have existed in nature, from the origin of life to now, but we can't/don't document them all. They enter our world of "stuff we care about and document as data" when we "sample" them or observe them. They cross the threshold into our information space. Acknowledging the similarity between biodiversity specimens and other material samples lets us "play nice" with the rest of the Open Biological and Biomedical Ontologies (OBO) world. I don't think accepting that subclassing scheme imposes a cost or an impediment. While I don't know what the logical implied benefits might be (thinking ontologies and reasoning), it seems worth it.

@stanblum
Member

stanblum commented Nov 17, 2021

It's also probably uncontroversial that our specimens/samples enable us to discover and document the characteristics of the biological systems they were drawn from. The systems represented don't have to be declared at the time of collection. The systems represented can be determined later from the documentation of context.

@deepreef

@stanblum : Agreed!

@dr-shorthair :

On the contrary - I suggest that motivation/intent is central here. We do science. We deliberately design a sampling and observational program, in order to describe the world in a systematic way. This is not random.

I see where you're coming from, and it reminds me of a debate I had a while back with an esteemed anthropologist. His point was that you need to design science projects (and data models for capturing results) around hypotheses, so you need to know in advance why you're gathering the data, so your sampling design (and data model) allows you to properly test your hypothesis.

I agreed, but countered that the mark of a good data model is that it allows you to answer questions you never even thought to ask when you were gathering the data.

I think both of these are in play here. I suspect that the vast majority of specimens in Museums (fodder for MaterialSample) were captured/killed/preserved with scientific intent. But when I record an observation of a fish on a reef using my video camera, I may have no idea at the time that it represents a depth record or a geographic range extension. So my intent in recording the video doesn't change the scientific value of the Occurrence record that it documents. This is true even if I am taking video of another diver, and the fish just happens to swim into frame. This is why I think intent (at the time of documenting an Occurrence record) isn't a prerequisite to capturing useful information. Obviously, if I killed the fish and put it in a Museum as an instance of MaterialSample, it's more likely that scientific intent was in play. But what if the fish was regurgitated from the stomach of a larger fish that I caught for dinner? I still think that counts as useful data destined for MaterialSample records to be shared with GBIF, even if there was zero scientific intent when the MaterialSample was obtained.

@Jegelewicz
Collaborator

closing for focus on MaterialSample and properties
