materialSampleID #20

Closed
Jegelewicz opened this issue Oct 14, 2021 · 19 comments

Comments

@Jegelewicz
Collaborator

Jegelewicz commented Oct 14, 2021

Current Definition

https://dwc.tdwg.org/terms/#dwc:materialSampleID

An identifier for the MaterialSample (as opposed to a particular digital record of the material sample). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the materialSampleID globally unique.

Comments

Recommended best practice is to use a persistent, globally unique identifier.

Please suggest changes/improvements in this issue.

See also

@Jegelewicz Jegelewicz changed the title from "Other Deliverable - materialSampleID" to "materialSampleID" Oct 14, 2021
@tucotuco
Member

tucotuco commented Oct 14, 2021 via email

@dagendresen
Contributor

dagendresen commented Oct 14, 2021

I would prefer removing "In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the materialSampleID globally unique." from the definition.

@Jegelewicz
Collaborator Author

HMMMM, I think @dagendresen has a point. It feels like we are encouraging not-so-good practices right in the definition.

@albenson-usgs

albenson-usgs commented Oct 14, 2021

I disagree with @dagendresen and @Jegelewicz. After many issues with creating globally unique occurrenceIDs I have come to believe that it's better to create a unique ID using information in the dataset itself as opposed to creating a completely arbitrary GUID. My reason for this is that a unique ID can be recreated when needed (e.g. dataset updates) whereas randomly generated GUIDs cannot be recreated. They must be stored somewhere and that is currently not happening. Until we have some kind of DOI-like system for creating occurrenceIDs, materialSampleIDs, measurementIDs, and all the other IDs I think it's actually a good practice to create a globally unique identifier from a combination of identifiers in the record.
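A minimal Python sketch of the reproducibility argument above. The record values, the colon-separated layout, and the function name `composite_id` are assumptions made for illustration, not a Darwin Core recommendation.

```python
import uuid

def composite_id(record: dict) -> str:
    """Build an identifier from fields already in the record (hypothetical layout)."""
    parts = (record["institutionCode"], record["collectionCode"], record["catalogNumber"])
    return ":".join(parts)

# Hypothetical record; the values are made up for illustration.
record = {"institutionCode": "XYZ", "collectionCode": "Mamm", "catalogNumber": "12345"}

# Rebuilding the dataset yields the same identifier every time.
first_build = composite_id(record)
second_build = composite_id(record)

# A random GUID differs on every call, so it must be stored to be reused.
random_a = str(uuid.uuid4())
random_b = str(uuid.uuid4())

print(first_build == second_build)  # True
print(random_a == random_b)         # False
```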

@tucotuco
Member

I hear you, @albenson-usgs. My reason for not changing is more pragmatic: if we change this one, we should change every DwC ID term that uses the same pattern of definition. I'm not opposed to that. I think it would actually be cleaner to put that part in the usage comments.

@albenson-usgs

albenson-usgs commented Oct 14, 2021

Sorry @tucotuco that my comment was not clear. I agree with you that it should not be changed. I think it's good advice actually.

Although if we are talking about taking it out of the definition and moving it to the usage comments instead that would be fine with me. I am not in agreement that it should be removed completely from anywhere in the term.

@dagendresen
Contributor

I would also be opposed to moving a similar text to the usage comments. I think composite identifiers are really bad advice!

@stanblum
Member

stanblum commented Oct 15, 2021

As a point of clarification: What do you think the prevailing attitude would be now towards pushing a distinction between the physical-object (<= material-sample <= preserved-specimen) and its digital catalog-record (=>information artifact)? The ID on the physical thing, our traditional catalog (aka accession# in botany), ties the specimen to the information record in the catalog (and is often reported in Material Examined of scientific publications); whereas the ID of the digital record containing data about the specimen can be made to comply with modern requirements (persistent, resolvable, globally unique). I added this distinction to the second figure I posted on a wiki page. Is this still viewed as an unnecessary complication, or is there growing compliance in creating and using GUIDs for the digital record?

image

@dagendresen
Contributor

dagendresen commented Oct 15, 2021

@stanblum If I read your question correctly, I do not think we should publicly publish identifiers for the database record (information artifact). I believe most people working in the collections would think of the specimen IDs as identifying the physical specimen, and yes, persistently identifying the database record of the specimen would be a confusing complication.

I think that materialSampleID identifies the physical material sample.

Note that I think the Digital Extended Specimen concept on the other hand is useful. (As in a Digital Specimen concept that anyone in the world can contribute to describing).

@dagendresen
Contributor

Regarding pseudo-identifiers composed of pieces of data, or even worse of pieces of other identifiers, maybe we need something along materialSampleCode, similar to the institutionCode and collectionCode and catalogNumber (!) (could we simply use catalogNumber?). I would maybe rather suggest recommending to leave materialSampleID blank if the data publisher has no appropriate persistent identifier to put here ;-)

@jbstatgen

@stanblum #20 (comment)

To me, your proposal makes sense, and your diagram is intuitively clear and helpful for visualizing the overall concept.

@dagendresen: my hope is that using both IDs and gaining experience with them will make them intuitively understandable and matter-of-factly acceptable. This process towards understanding and acceptance worked in the journal sector. There,

  • the DOI corresponds to the ID of the Digital Specimen, ie. the "digital catalog-record (=>information artifact)". It is a persistent, globally unique, etc. ID (a PID).
  • The local catalog (and/or accession) number/ID of the physical specimen corresponds to the <JournalName> <volume>(<issue>): <pages> (<year>) notation. This is a local ID, which in addition is a compound of other IDs.

Both IDs are (more or less) human- and machine-readable and -actionable.

Twenty years ago, when DOIs were first introduced, personally I had no use for them and thought them a bit superfluous, a tech thing. However, after two decades of practical experience, today I, and I would guess other users too, now seem to understand the advantages of both IDs and work with both routinely. For example, to find similar papers bundled in e.g. a special issue, the old-fashioned "hardcopy" ID still is a good starting point. On the other hand, it is simply nifty to be able to click on a DOI in a publication's reference list. Yet my impression (from my personal use) is that DOIs, i.e. PIDs, are used more and more, while the old notation might be phasing out over the next decades.

@deepreef

> As a point of clarification: What do you think the prevailing attitude would be now towards pushing a distinction between the physical-object (<= material-sample <= preserved-specimen) and its digital catalog-record (=>information artifact)? The ID on the physical thing, our traditional catalog (aka accession# in botany), ties the specimen to the information record in the catalog (and is often reported in Material Examined of scientific publications); whereas the ID of the digital record containing data about the specimen can be made to comply with modern requirements (persistent, resolvable, globally unique). I added this distinction to the second figure I posted on a wiki page. Is this still viewed as an unnecessary complication, or is there growing compliance in creating and using GUIDs for the digital record?

This is a complicated issue. I see a distinction between "computer-generated identifier assigned to represent the conceptual object" (e.g., a UUID), and "ID of the digital record containing data about the specimen". It's a subtle, but important, distinction.

For example, I have a field that automatically assigns a UUID to every instance of MaterialSample in our databases. Although the UUID was "born" digitally, and does indeed uniquely represent the digital record for the physical object, the intention of the identifier is that it represents the physical thing, not the digital record for the thing. So, when I generate an IPT dataset that is shared with GBIF, and the contents of that record are captured in the GBIF aggregated dataset, the UUID is retained. If the identifier was for the digital record, then it should not be transmitted to GBIF, because the record in GBIF is actually a different digital record, so would need to have a different identifier to represent that different digital record.
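A rough Python sketch of that pattern, assuming a hypothetical `MaterialSampleRecord` row and a hypothetical `export_for_aggregator` step (neither is the actual database field or the IPT workflow described above). The UUID is minted once with the row and travels unchanged into the copy, because it names the physical sample rather than either digital record.

```python
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class MaterialSampleRecord:
    """Hypothetical local database row describing one physical sample."""
    catalogNumber: str
    # Minted once when the row is created; meant to name the physical thing.
    materialSampleID: str = field(default_factory=lambda: str(uuid.uuid4()))

def export_for_aggregator(row: MaterialSampleRecord) -> dict:
    """Copy the row into a new digital record, retaining the sample identifier."""
    return asdict(row)

local_row = MaterialSampleRecord(catalogNumber="12345")
shared_copy = export_for_aggregator(local_row)

# Two distinct digital records, one identifier for the same physical sample.
print(shared_copy["materialSampleID"] == local_row.materialSampleID)  # True
```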

Indeed, I think that our community only very rarely assigns or uses unique identifiers for digital records, and when they do, it should be extremely explicit (e.g., "this is the identifier assigned to the Bishop Museum database record for this specimen, and this is the identifier assigned to the GBIF database record for the same specimen, [etc.]")

Yes, catalog numbers and accession numbers and sheet numbers and other human-friendly identifiers can be thought of as additional identifiers assigned to the same physical object, and that's fine -- many physical and abstract instances captured as data records in databases have more than one unique(ish) identifier assigned to them (e.g., see bioguid.org).

But my point is, I think we need to be extremely explicit when trying to distinguish identifiers intended to represent the physical (or abstract) object, vs. identifiers intended to represent a particular digital record about that object.

As to the issue at hand, I'm fine leaving the definition unchanged, mostly for the reasons mentioned by @tucotuco. My personal philosophy of persistent identifiers matches closely the sentiments expressed by @dagendresen.

@cboelling
Member

cboelling commented Oct 15, 2021

> I think we need to be extremely explicit when trying to distinguish identifiers intended to represent the physical (or abstract) object, vs. identifiers intended to represent a particular digital record about that object.

I wholeheartedly agree.

An identifier, first and foremost, is like a name for a thing. It is used to refer to that thing and pick it out among other things (without necessarily providing a detailed representation of that thing). The scope of its uniqueness and its suitability for one purpose or another (e.g., human discourse, machine actionability) are important, but orthogonal concerns. While striving for uniqueness (a given identifier is used for one thing and one thing alone, and vice versa) is a worthwhile goal, in many cases a thing will effectively be referred to by more than one identifier (name), if only for legacy reasons.

> Regarding pseudo-identifiers composed of pieces of data, or even worse of pieces of other identifiers, maybe we need something along materialSampleCode, similar to the institutionCode and collectionCode and catalogNumber (!) (could we simply use catalogNumber?)

Using composite keys composed of attribute data is an acceptable way to enforce uniqueness and therefore act as identifier. However, persistence of these is quite problematic (knowledge about things can change) - which is why artificial, opaque identifiers that carry no semantics by themselves are preferred in many use cases.
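The persistence problem can be sketched in a few lines of Python. The record values, the `opaqueID` key, and the `composite_key` function are hypothetical and chosen only to illustrate the point: when an attribute is corrected, the composite key changes, while an opaque identifier stored alongside the record does not.

```python
import uuid

def composite_key(record: dict) -> str:
    """Join attribute values into a key (hypothetical colon-separated layout)."""
    return f'{record["institutionCode"]}:{record["collectionCode"]}:{record["catalogNumber"]}'

# Made-up record with an opaque identifier stored next to its attributes.
record = {
    "institutionCode": "XYZ",
    "collectionCode": "Mamm",
    "catalogNumber": "12345",
    "opaqueID": str(uuid.uuid4()),
}
key_before = composite_key(record)
opaque_before = record["opaqueID"]

# Knowledge changes: the collection code is later revised.
record["collectionCode"] = "Mammals"
key_after = composite_key(record)

print(key_before == key_after)              # False
print(opaque_before == record["opaqueID"])  # True
```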

@cboelling
Member

I would be happy with this definition update:

An identifier for a the MaterialSample. (as opposed to a particular digital record of the material sample). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the materialSampleID globally unique.

Comments

Recommended best practice is to use a persistent, globally unique identifier.

@Jegelewicz
Collaborator Author

Jegelewicz commented Oct 15, 2021

> But my point is, I think we need to be extremely explicit when trying to distinguish identifiers intended to represent the physical (or abstract) object, vs. identifiers intended to represent a particular digital record about that object.

See also #6 (comment)

@Jegelewicz
Collaborator Author

Jegelewicz commented Apr 4, 2022

From 2022-03-16 monthly meeting notes:

  1. Not requiring GUID for materialSampleID - we need to be aware of the consequences. Do we need to add anything or change anything?
  2. Do we need better examples? How do you pass a doi? Include the https or no? UGH!
  3. Do we need materialSampleLODID? Or materialSampleActionableID - make recommendation for “higher up” in TDWG to think about this
  4. Do we steer people away from the DwC triplet?
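On the "include the https or no?" question, one possible approach is to accept any common form and normalize before comparing. The Python sketch below is an assumption, not an agreed convention: the prefix list, the function names `bare_doi` and `doi_url`, and the example DOI `10.1234/abc` are all made up for illustration.

```python
# Common ways a DOI gets written; the list is illustrative, not exhaustive.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)

def bare_doi(value: str) -> str:
    """Strip a known resolver or scheme prefix, leaving the bare DOI name."""
    text = value.strip()
    lowered = text.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            text = text[len(prefix):]
            break
    if not text.startswith("10."):
        raise ValueError(f"not a DOI: {value!r}")
    return text

def doi_url(value: str) -> str:
    """Return the resolvable https form of a DOI given in any accepted form."""
    return "https://doi.org/" + bare_doi(value)

examples = ["10.1234/abc", "doi:10.1234/abc", "https://doi.org/10.1234/abc"]
print({doi_url(v) for v in examples})  # {'https://doi.org/10.1234/abc'}
```

Whichever form is published, normalizing both sides before matching avoids treating the same DOI as two different identifiers.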

Stephen R., Jutta B., and Teresa M. discussed this at length on 2022-03-31 during working hour.

  1. Physical samples WILL have more than one identifier - how do you choose the one that goes here? (Jutta discuss identifier class)
  2. Are we commingling identifiers for the physical object and the digital representation?
  3. Do we need a term for the identifier for the digital representation of the MaterialSample?
  4. Identifiers have/need their own set of metadata: who issued them?, who applied them?, are they verified?
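One way to picture points 2 and 4 is a small record type in which each identifier carries its own metadata. This Python sketch is purely hypothetical: the class `SampleIdentifier`, its field names, and the example values are not taken from any proposed identifier class.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SampleIdentifier:
    """One identifier plus the metadata asked about above (hypothetical fields)."""
    value: str
    refers_to: str                    # "physical" or "digital"
    issued_by: Optional[str] = None   # who issued it?
    applied_by: Optional[str] = None  # who applied it?
    verified: bool = False            # has it been verified?

# Made-up identifiers attached to one sample.
identifiers = [
    SampleIdentifier("12345", "physical", "Example Museum", "Example Museum", True),
    SampleIdentifier("urn:uuid:123e4567-e89b-12d3-a456-426614174000", "physical", "Example Museum"),
    SampleIdentifier("db-row-987", "digital", "Example Museum"),
]

# Keep identifiers for the physical object apart from those for a digital record.
physical = [i.value for i in identifiers if i.refers_to == "physical"]
digital = [i.value for i in identifiers if i.refers_to == "digital"]
print(len(physical), len(digital))  # 2 1
```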

@Jegelewicz
Collaborator Author

Also discussed today with @deepreef

See http://bioguid.org/about

What will happen when institutions have multiple samples (skin, skeleton, tissue) with only a single identifier (catalog number)? We will need methods for making the "split" easy.
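A minimal Python sketch of what such a "split" could look like, assuming each part gets a freshly minted UUID while the shared catalog number is retained. The function `split_by_part` and the `part` key are hypothetical and not an agreed method.

```python
import uuid

def split_by_part(catalog_number: str, parts: list[str]) -> list[dict]:
    """Return one row per physical part, each with its own identifier (hypothetical scheme)."""
    return [
        {"catalogNumber": catalog_number, "part": part, "materialSampleID": str(uuid.uuid4())}
        for part in parts
    ]

rows = split_by_part("12345", ["skin", "skeleton", "tissue"])

# Three distinct sample identifiers, all still linked by one catalog number.
print(len({row["materialSampleID"] for row in rows}))  # 3
print({row["catalogNumber"] for row in rows})          # {'12345'}
```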

@dr-shorthair

I'd suggest being very clear about which IDs are keys, in what context;
and which 'IDs' are being stored as 'annotations' related to some prior context.

@Jegelewicz
Collaborator Author

closing as no change needed and identifier class is out of scope for this Task Group
