Organism ID #1966

Closed
Jegelewicz opened this issue Mar 13, 2019 · 302 comments
Labels
Aggregator issues (e.g., GBIF, iDigBio, etc.); Function-CodeTables; Function-Relationship; Priority-High (Needed for work): High because this is causing a delay in important collection work.

Comments

@Jegelewicz
Member

Issue Documentation is http://handbook.arctosdb.org/how_to/How-to-Use-Issues-in-Arctos.html

Is your feature request related to a problem? Please describe.

We have been working with organisms for which we have multiple occurrences, specifically Mexican Wolves in the Mexican Wolf recovery program. Throughout their lives, samples of blood are taken from these animals and deposited in the genomic resources collection at MSB. Traditionally, each set of samples (all from the same day) has been given a single catalog number. This results in multiple cataloged items for a single organism, which we can link to each other using the “same individual as” relationship.

image

These relationships are nice, but they don't allow us to see ALL events for an individual in one place, and they require the addition of a new relationship for ALL related cataloged items every time a new collection of blood is made. Each cataloged item includes the other ID “Mexican Wolf Studbook Number”, and we have modified the Other ID URL so that clicking this other ID allows us to find all of the samples from any given animal.

image

This method works, but there is one issue we need to address.

When our data leaves Arctos and is ingested by aggregators such as GBIF and iDigBio, there is no easy way for anyone using the data there to make the connection that the various cataloged items are all from the same animal. Although the Mexican Wolf Studbook numbers are included in the list of related IDs, the connection just isn’t as tight as we would like it to be.

image
image

Describe the solution you'd like

Our proposed solution is to make use of the Darwin Core field “Organism ID”. We envision this as a separate and distinct other ID – one which provides a link to all related specimens (the results of that link would look just like the search result you see when you search one of the Mexican Wolf Studbook numbers):

image

This identifier would be passed to aggregators in the “Organism ID” field – allowing those using the data there to make the appropriate connection between the related cataloged items. Currently it appears that we are just passing the catalog item to that field

image

which is what led to the solution we have been attempting to make work in #1545. This has created problems with data entry and maintenance on our end. This new solution will allow us to keep events matched with parts and parts matched with accessions. It will simplify data entry and end the need for the links between events and parts.

We envision a new code table: CTCOLL_ORGANISM_ID set up very much like CTCOLL_OTHER_ID_TYPE where:

IDType = text “Mexican Wolf Studbook Number”

Description = definition of the IDType Studbook number assigned by the Mexican Wolf Recovery Program

BaseURI = http://arctos.database.museum/SpecimenResults.cfm?oidtype=Mexican%20wolf%20studbook%20number&oidnum=
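As a rough illustration of how such a code table row could turn an identifier into a link, here is a minimal Python sketch. The dictionary layout and the `organism_link` helper are hypothetical and only mirror the IDType / Description / BaseURI fields proposed above; this is not how Arctos is implemented.

```python
from urllib.parse import quote

# Hypothetical stand-in for one row of the proposed code table.
# Field names mirror the IDType / Description / BaseURI layout above.
ORGANISM_ID_TYPES = {
    "Mexican Wolf Studbook Number": {
        "description": "Studbook number assigned by the Mexican Wolf Recovery Program",
        "base_uri": (
            "http://arctos.database.museum/SpecimenResults.cfm"
            "?oidtype=Mexican%20wolf%20studbook%20number&oidnum="
        ),
    },
}


def organism_link(id_type: str, id_value: str) -> str:
    """Append the percent-encoded identifier value to the type's base URI."""
    base_uri = ORGANISM_ID_TYPES[id_type]["base_uri"]
    return base_uri + quote(str(id_value), safe="")


print(organism_link("Mexican Wolf Studbook Number", "1216"))
```

Because the ID type comes from a controlled list, every record with the same number produces the same link.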

When the Organism ID is used, there would be no need for all of the “same organism as” relationships, but they could be used if a collection so desired. Every cataloged item that included an Organism ID would instead appear like this:

image

With the text “Mexican Wolf Studbook Number: 1216” being a link taking you to the search results:

image

We would hope that this link could also be what appears at the aggregators in their “Organism ID” field:

image

Describe alternatives you've considered
The major challenge we see with this method is how to assign unique Organism IDs for things where there isn’t an obvious one. The Mexican Wolves (and eventually the Red Wolves that are expected to come in from Arkansas) and NEON recaptures are examples of when we would be using this method. These all have obvious unique identifiers (studbook numbers and NEON sample ID numbers). However, when the skin and skeleton of an animal are at DMNS and the tissues for that same animal are at MSB, there is no obvious organism ID type and we would need to come up with one. We are open to suggestions for how best to accomplish this.

What have we missed?

Additional context
See above

Priority
I would like to have this resolved by date: soonish

@Jegelewicz
Member Author

I have passed this by John Wieczorek and here is our discussion:

The proposal to use dwc:organismID in Darwin Core resource is right on target. That is exactly what the field is meant for. You are right that Arctos is passing the id for the cataloged item in that field right now. The reasoning was based on the majority of cases, where the cataloged item corresponds to an Organism. Rigorously speaking, I think this is a mistake, because cataloged item does not always correspond to an Organism, and in Arctos, we don't have a fail-proof method of a knowing when it does, and when it doesn't. Given that, I think we should unmap organismID from the cataloged item in all Arctos resources.

I have looked at the proposal for the new code table (CTCOLL_ORGANISM_ID). I think this is unnecessary and unsustainable. I think a sufficient solution, which is also the most scalable, is to add a new type in CTCOLL_OTHER_ID_TYPE, called "organism identifier" or similar. Curators would have the freedom to create a (single) organism identifier, and that should be a persistent resolvable GUID. It could refer to any organism within Arctos, or outside it. Note that in the case of the Mexican Wolf Studbook Number, there would be two entries in the COLL_OTHER_ID table for each cataloged item - one with type "Mexican Wolf Studbook Number", which holds the number, and one with type "organism identifier" with the resolvable GUID to the organism.

There will be issues of "persistence" and of primacy (if two data publishers have distinct organismIDs, which should be used?), but those will exist outside of the scope of the immediate problem anyway. It's something that could conceivably be solved at a level above the publication of primary occurrence data.

Following what I am proposing above, there would be no need to communicate anything to GBIF, iDigBio, or GGBN. We would be following the intended use of dwc:organismID. The misunderstandings from iDigBio and GGBN are around the conflation of Occurrences by Arctos, not about the concept of Organism. The proposed solutions do not save us with respect to GGBN either. With them the issue is that they want records of tissue samples, while everyone else in the world expects Occurrences, and these are not always the same thing, especially in Arctos. So, we still have to make distinct resources for GGBN, unfortunately.
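The two-entry arrangement described above (one other ID holding the studbook number, one holding the organism identifier) might look like the following sketch. The record layout, the example GUIDs, and the `to_darwin_core` helper are all invented for illustration; only the two other ID type names and the idea of not falling back to the cataloged item come from the discussion.

```python
# Hypothetical in-memory stand-in for one cataloged item and its other IDs.
# Both identifiers below are made-up placeholders, not real records.
cataloged_item = {
    "guid": "EXAMPLE:Mamm:1",
    "other_ids": [
        {"type": "Mexican Wolf Studbook Number", "value": "1216"},
        {"type": "organism identifier", "value": "https://example.org/organism/abc123"},
    ],
}


def to_darwin_core(item: dict) -> dict:
    """Map a cataloged item to a small Darwin Core-like row.

    organismID is taken only from an explicit "organism identifier" entry.
    When none exists it is left empty instead of reusing the catalog GUID.
    """
    organism_ids = [
        other["value"]
        for other in item["other_ids"]
        if other["type"] == "organism identifier"
    ]
    return {
        "occurrenceID": item["guid"],
        "organismID": organism_ids[0] if organism_ids else "",
    }


print(to_darwin_core(cataloged_item))
```

A record with no "organism identifier" entry exports an empty organismID, which matches the suggestion to unmap the cataloged item from that field.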

My response:

I'm not sure I can wrap my brain around the other ID type solution. I feel like what you describe is what we do with the Mexican Wolves now - how would the GUIDs be created and where would they "live"?

I'm not the most technical person, so without a demo, it's just hard for me to see how two independent Other IDs will resolve to a GUID somewhere...but the idea seems the same as what I proposed just technically more stable? If so, I am on board and I agree that we need to stop sending catalog number as Organism ID AND that MSB needs to stop trying to catalog all collections for a single wolf in a single catalog number - which is why I proposed the solution I did - it is just too messy and information is lost in the process.

This is coming to the forefront for other reasons: tdwg/dwc-qa#131

I'd like to create a simple solution to the organism issue - it really shouldn't be that difficult within Arctos. The problem of everyone agreeing on an ID when you consider stuff outside of Arctos is something we need to tackle as a larger community and is related to unique identifiers in general. Let me know how I can help push a solution forward and I'll do everything I can!

John responds:

I'm not sure I can wrap my brain around the other ID type solution. I feel like what you describe is what we do with the Mexican Wolves now - how would the GUIDs be created and where would they "live"?

In Arctos the GUIDs would live in the Coll_Obj_Other_ID_Num table with an OTHER_ID_TYPE of "organism identifier". Curators would be responsible for entering these (read "danger").

I'm not the most technical person, so without a demo, it's just hard for me to see how two independent Other IDs will resolve to a GUID somewhere...but the idea seems the same as what I proposed just technically more stable? If so, I am on board and I agree that we need to stop sending catalog number as Organism ID AND that MSB needs to stop trying to catalog all collections for a single wolf in a single catalog number - which is why I proposed the solution I did - it is just too messy and information is lost in the process.

Two independent Other IDs do not resolve to a GUID somewhere. One of the IDs says "I am this Mexican Wolf Studbook Number", the other says, "my dwc:organismID is this". Hey, maybe that's what to put in the CTOTHER_ID_TYPE table - "dwc:organismID" - it would be quite explicit.

This is coming to the forefront for other reasons: tdwg/dwc-qa#131

I'd like to create a simple solution to the organism issue - it really shouldn't be that difficult within Arctos. The problem of everyone agreeing on an ID when you consider stuff outside of Arctos is something we need to tackle as a larger community and is related to unique identifiers in general. Let me know how I can help push a solution forward and I'll do everything I can!

True. It is a community issue. Arctos is a great resource for pushing the limits of what we are able to do. For many outside it is way too far ahead, despite the fact that for some inside it doesn't do all we might want.

From me:

In Arctos the GUIDs would live in the Coll_Obj_Other_ID_Num table with an OTHER_ID_TYPE of "organism identifier". Curators would be responsible for entering these (read "danger").

The "danger" is what I was hoping to avoid with the separate table for organism ID - using "Mexican Wolf Studbook Number" as the base of the ID means we don't get "Mexican wolf studbook number 1216", "Mex Wolf Studbook No. 1216", etc.

@Jegelewicz
Member Author

Two independent Other IDs do not resolve to a GUID somewhere. One of the IDs says "I am this Mexican Wolf Studbook Number", the other says, "my dwc:organismID is this". Hey, maybe that's what to put in the CTOTHER_ID_TYPE table - "dwc:organismID" - it would be quite explicit.

To be clear - I don't propose there be two IDs, but to MOVE those other IDs that are truly Organism IDs to the new table.

@dustymc
Contributor

dustymc commented Mar 13, 2019

In general, I think having some sort of "individual ID" would be very useful. It's not at all clear to me why it would be in a separate table; that invites more denormalization (doing the same thing multiple ways), inevitably leading to even bigger messes.

If the scope of this is Arctos, we could exploit relationships to assemble "individuals" and/or individualID without adding any overhead - there's much more discussion on that in #1545 - and see below.

I believe that this is implicitly a proposal to recatalog http://arctos.database.museum/guid/MSB:Mamm:292063 as 5 specimens. At least for some use cases that goes against the "catalog the item of scientific interest" mantra; eventually two of the samples from the same wolf will be compared in a publication. I'm not sure that's more evil than the current situation, where 5 samples collected at different times under different conditions are likely seen as equivalent to 5 tubes from the same liver of another specimen, but it should be acknowledged. I think any consistent documented approach is an improvement.

"Occurrences" are occasionally recorded in different collections, both in and out of Arctos, so cataloging Occurrences rather than individuals would make Arctos data more comparable with the rest of the world. I'm not sure how much weight that should carry, but again it is a consideration that should be addressed.

All of that said, I don't think Arctos can or should dictate how material is cataloged. I think the most we can do is to provide documentation/guidance.

This should extend beyond Arctos. A sample of http://arctos.database.museum/guid/MSB:Mamm:292063 stored in another system and shared with GBIF would ideally bear the same "individual ID" as the record(s) in Arctos. If it did, it would be trivial to assemble the individual in GBIF or similar systems.

The "danger" is in assigning the identifiers, and I don't believe there is any technical solution to that - it's a social problem that needs a social solution. It took seconds to find https://arctos.database.museum/guid/MSB:Mamm:317312 and https://arctos.database.museum/guid/MSB:Mamm:324187 which share a NEON ID and probably are not the same organism. I have never encountered a "number series" that didn't have similar issues, and if that exists the NEON ID cannot do what you want. I think this would be best implemented as GUIDs, and for social reasons those should probably not be minted by Arctos. Drawing those from an independent source would let Curators determine what is or is not an Individual on a case-by-case basis independent of any problems with identifiers assigned by other organizations, and at least maintains some possibility that other collections holding material from the same individuals would buy in and assign those IDs to their specimens. Two candidates are UUIDs, which would not be resolvable or actionable, or ARKs which could be resolvable and could point to some shared view (eg, GBIF, which in turn could point to the various bits and pieces of the individual in various systems/collections).

I think that also could be implemented only as guidance; I don't think Arctos can or should prevent someone from using "1" as an IndividualID, but we can help them understand the implications of doing so.
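For the UUID option mentioned above, minting an identifier outside of Arctos is a one-liner in most languages. The sketch below uses Python's standard `uuid` module; the record names are placeholders, and, as noted, a bare UUID is unique but not resolvable or actionable on its own.

```python
import uuid


def mint_organism_uuid() -> str:
    """Return a random (version 4) UUID in URN form, e.g. 'urn:uuid:...'."""
    return uuid.uuid4().urn


# One identifier is minted per organism and then shared by every
# occurrence of that organism, whichever collection holds it.
wolf_id = mint_organism_uuid()
occurrences = [
    {"record": "EXAMPLE:Mamm:1", "organismID": wolf_id},
    {"record": "EXAMPLE:Mamm:2", "organismID": wolf_id},
]

print(wolf_id)
```

An ARK would fill the same role while also being resolvable, but minting one depends on an external service, so it is not sketched here.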

@Jegelewicz
Member Author

How would this not be denormalization?

organismID = Mexican Wolf Studbook Number 1216
organismID = Mex Wolf Studbook No 1216
organismID = Mexican wolf studbook number 1216

These are all the same organism, but now we have three IDs for it. If we have:

ORGANISM_ID where:

IDType = text “Mexican Wolf Studbook Number”

Description = definition of the IDType Studbook number assigned by the Mexican Wolf Recovery Program

BaseURI = http://arctos.database.museum/SpecimenResults.cfm?oidtype=Mexican%20wolf%20studbook%20number&oidnum=

At least we eliminate the problem of the many ways "Mexican Wolf Studbook Number" might be spelled.
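A small Python sketch of the point being made: free-text identifiers that differ only in spelling compare as different strings, while a controlled type plus a number does not. The `make_organism_id` helper and the `CONTROLLED_TYPES` set are hypothetical.

```python
# Three free-text spellings of what a human reads as the same organism.
free_text_ids = [
    "Mexican Wolf Studbook Number 1216",
    "Mex Wolf Studbook No 1216",
    "Mexican wolf studbook number 1216",
]
# Exact string comparison treats them as three different identifiers.
distinct_free_text = set(free_text_ids)

# Hypothetical controlled vocabulary of organism ID types.
CONTROLLED_TYPES = {"Mexican Wolf Studbook Number"}


def make_organism_id(id_type: str, number) -> tuple:
    """Build an ID from a controlled type and a number; reject unknown types."""
    if id_type not in CONTROLLED_TYPES:
        raise ValueError(f"unknown organism ID type: {id_type!r}")
    return (id_type, str(number).strip())


controlled_ids = {
    make_organism_id("Mexican Wolf Studbook Number", n)
    for n in ["1216", " 1216", 1216]
}

print(len(distinct_free_text), len(controlled_ids))  # prints: 3 1
```

This only removes variation in the type label; typos in the number itself would still need a human to catch.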

I think this would be best implemented as GUIDs, and for social reasons those should probably not be minted by Arctos.

I agree with this statement - but no one is stepping up to the plate for biological specimens (at least no one I am aware of). While the solution above does not fix the problems of the world, it would be a start for Arctos collections and maybe we could use that to press the issue with the community.

I looked up ARKs and I'm not clear on how that works - if it is a solution, then let's explore, but I need an example because it seems very fuzzy to me and doesn't solve the social problem as far as I can tell.

@Jegelewicz
Member Author

I believe that this is implicitly a proposal to recatalog http://arctos.database.museum/guid/MSB:Mamm:292063 as 5 specimens. At least for some use cases that goes against the "catalog the item of scientific interest" mantra; eventually two of the samples from the same wolf will be compared in a publication.

Yep - and the cataloging of separate events with one catalog number results in events and parts that are not properly associated with their accessions, their collectors and preparators, nor their attributes. (The event links are OK, but easily broken or incorrectly made).

@Jegelewicz
Member Author

Should OrganismIDs be a DOI?

@dustymc
Contributor

dustymc commented Mar 13, 2019

I'm still not following. You want another table that's the same structure and does the same thing as OtherIDs??

And yes those data are denormalized - that's a lot easier to deal with than denormalized structure, and one of many reasons a GUID of some sort would be a useful value.

There is no technical solution to social problems. We can make it enticing to assign unifying IDs, but that's about it.

ARKs are functionally much like DOIs, but they're free (and don't come with the buy-in, which I suspect means they also don't come with the persistence).

https://n2t.net/ark:/87299/x6d50k1v

If I had a couple million dollars and nothing better to do, everything in Arctos would have a DOI. DOIs would be great "individualIDs" but I don't think I can supply them. And that would lead back into the whole "controlled by Arctos" thing, which I don't think has any chance of being adopted by anyone outside of Arctos. I can provide tools, but the folks who own these specimens should also own the unifying identifiers.

@Jegelewicz
Member Author

Jegelewicz commented Mar 13, 2019

I'm still not following. You want another table that's the same structure and does the same thing as OtherIDs??

EXCEPT - those IDs would be passed to GBIF and other aggregators as "Organism_ID".

I have also considered just using a check box in the Other_ID table "this is an organism ID"....

@dustymc
Contributor

dustymc commented Mar 13, 2019

Thanks - I might actually get it now!

It's Arctos-centric and not very pretty, but at least it's not denormalization: http://arctos.database.museum/SpecimenResults.cfm?oidtype=Mexican%20wolf%20studbook%20number&oidnum=none is a perfectly valid value for other_id_type=OrgID (whatever we call it).

That could be generated by a "this is an orgid" button. I could even abstract it to a saved search or ARK, but that gets us back to the "Arctos-centric" thing.

And again, if the scope of this is just "works for Arctos" then I think we'd be better off doing something with relationships. (@tucotuco pointed out that an ID works from a spreadsheet where a relationship may not, so "something" might be generating a URL that finds ID=value as above - IDK, that's details, I'm totally open to ideas).

@campmlc

campmlc commented Mar 13, 2019 via email

@dustymc
Contributor

dustymc commented Mar 13, 2019

No, in the interface.

http://arctos.database.museum/SpecimenResults.cfm?oidtype=Mexican%20wolf%20studbook%20number&oidnum=none is a GUID - and an actionable one at that. There's only one of them on the planet and it's easy to tell what it does. (It's not very pretty and may or may not be very persistent, but that's details.)

Mexican Wolf Studbook Number: 1216 is a string. Anyone can use it for any purpose anywhere; it doesn't natively do anything, and trying to do anything with it comes with a big pile of indefensible assumptions.

Edit for completeness: https://n2t.net/ark:/87299/x68g8hqw currently does the same thing as http://arctos.database.museum/SpecimenResults.cfm?oidtype=Mexican%20wolf%20studbook%20number&oidnum=none. It's prettier and likely more persistent. If I find another Occurrence of "none" I could re-point the ARK to somewhere mutually agreeable (eg, GBIF) in order to build a more complete picture of the Organism. It's a MUCH better solution than the URL, but also likely to take more investment than clicking a button.

2nd edit: I'm throwing ARKs around only because they're not-Arctos and super easy to create. They're not the only possible GUID, just a convenient and functional example.

@tucotuco

tucotuco commented Mar 13, 2019 via email

@Jegelewicz
Member Author

Jegelewicz commented Mar 13, 2019

Mexican Wolf Studbook Number: 1216 is a string. Anyone can use it for any purpose anywhere; it doesn't natively do anything, and trying to do anything with it comes with a big pile of indefensible assumptions.

I don't get how what you propose is different from:

IDType = text “Mexican Wolf Studbook Number”

Description = definition of the IDType Studbook number assigned by the Mexican Wolf Recovery Program

base URL = http://arctos.database.museum/SpecimenResults.cfm?oidtype=Mexican%20wolf%20studbook%20number&oidnum=

@tucotuco

I had been thinking there would be only one allowed organismID. Maybe that is silly. Maybe it is fine to have as many as you like. That way you could include your own AND those of other collections (in or out of Arctos). That way you could also potentially go directly to GBIF to get the set of Occurrences for all matching organismIDs.

@Jegelewicz
Member Author

HMMMM..I hadn't considered that.

Maybe it is fine to have as many as you like. That way you could include your own AND those of other collections (in or out of Arctos). That way you could also potentially go directly to GBIF to get the set of Occurrences for all matching organismIDs.

BUT when searching AT GBIF, how would they be related - so that some person who was unaware the two organism IDs were the same organism could make the connection?
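One way to think about this question: two different organism IDs can only be connected by machine if at least one record carries both of them. The sketch below groups records that are linked, directly or transitively, by any shared ID. It is an illustration of the idea, not something GBIF is known to do, and all record names and IDs are made up.

```python
from collections import defaultdict


def group_by_shared_ids(records: dict) -> list:
    """Group records linked directly or transitively by a shared organismID.

    `records` maps a record name to the set of organismIDs it carries.
    """
    parent = {record: record for record in records}

    def find(x):
        # Follow parent links to the group's root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_record_with = {}
    for record, ids in records.items():
        for organism_id in ids:
            if organism_id in first_record_with:
                # Same ID seen before: merge the two records' groups.
                parent[find(record)] = find(first_record_with[organism_id])
            else:
                first_record_with[organism_id] = record

    groups = defaultdict(set)
    for record in records:
        groups[find(record)].add(record)
    return list(groups.values())


records = {
    "MSB:1": {"msb-org-1"},
    "AMNH:9": {"amnh-org-7"},
    "DMNS:3": {"msb-org-1", "amnh-org-7"},  # carries both IDs, bridging them
    "UTEP:5": {"utep-org-2"},
}
for group in group_by_shared_ids(records):
    print(sorted(group))
```

Without the bridging record, the MSB and AMNH records would fall into separate groups, which is the gap the question points at.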

@campmlc

campmlc commented Mar 13, 2019 via email

@tucotuco

I don't get how what you propose is different from:

IDType = text “Mexican Wolf Studbook Number”

Description = definition of the IDType Studbook number assigned by the Mexican Wolf Recovery Program

base URL = http://arctos.database.museum/SpecimenResults.cfm?oidtype=Mexican%20wolf%20studbook%20number&oidnum=

It is very different outside the world of Arctos. The organismID would have to be constructed from this, and what would you do to create the organismIDs of the ten collections that have parts of the same plant? Create ten new ID types and base URLS (just to cover that one organism - multiply by all the collections that share any parts of any Organisms in Arctos)?

@dustymc
Contributor

dustymc commented Mar 13, 2019

different

It eliminates data stored in arbitrary places.

only one

Yea, I suspect reality will find a way to stomp all over that, but it would be nice....

link specimens

Arctos can link to anything with a URL, and provides a mechanism for incoming links.

shared field number

Everybody starts at "1." If you want links, you need actionable GUIDs. If you want discoverable, you need shared actionable GUIDs. You might get at "shared" by tracking down the other 40 samples in GBIF and adding their IDs to Arctos, although "here's a nice neutral persistent actionable identifier, would you mind using it so we can talk to each other?" would greatly simplify things.

@tucotuco

All share the same field number, they are all the same organism, but how would we relate them in GBIF if AMNH assigns one and MSB assigns a different one?

I think that is what I am getting at in tdwg/dwc-qa#131 (comment)

@tucotuco

Something akin to IGSNs, but for Organisms instead of for samples.

@Jegelewicz
Member Author

Jegelewicz commented Mar 13, 2019

The organismID would have to be constructed from this, and what would you do to create the organismIDs of the ten collections that have parts of the same plant? Create ten new ID types and base URLS (just to cover that one organism - multiply by all the collections that share any parts of any Organisms in Arctos)?

I don't understand - you would only need one ID type. From any record in Arctos, I can click the link from the Mexican Wolf Studbook Number (no matter what number it is) and I'll get the specimen results page that shows all of the wolves that share the same number.

If UTEP or UMNH or any other Arctos collection had a wolf specimen and put the studbook number in the "Mexican Wolf Studbook Number" other ID, then it would show up in the search too, because the link is an actionable guid like Dusty described.

It would be a social issue to decide upon an "ID Type" for the situation that you describe, but we should only need one. The challenge - as I pointed out in the very beginning is assigning the individual organism ID numbers, so that all collections with parts of the same plant would use "Individual Plant ID" = 1, etc.

I guess I am missing something (which doesn't surprise me...) The wolves are easy because they are all here and they have a (somewhat) logical identifier. Everything else will be messy until we have a unique BOI (Biological Organism Identifier).

@campmlc

campmlc commented Mar 13, 2019 via email

@dustymc
Contributor

dustymc commented Mar 13, 2019

organisms, mint compliant ID

Don't half-bake this! - I want those for events, localities, agents, .... too.

Seriously, Arctos is built to plug in to something like that. If we have a local identifier for something it's only because nobody else would do it for us.

relationships are pairwise

Not really - there's always an implied second THING out there, but we don't have to be able to find it. "{whatever relationship of} ABC:XYZ:1234" is fine even if ABC:XYZ isn't online, "{whatever relationship of} NK 1" is fine even if 40 specimens (that we can find) wear "NK 1", etc.

reciprocally

I don't think a lack of reciprocity will ever be Arctos' fault.

I know many of your examples are not capable of acting as unique identifiers, and I suspect that's true of all of them.

Can we mint DOIs

Yes, in limited quantities - there are "get a DOI" links scattered all over the place.

IGSNs

Beats me - if they have a service and are willing to provide access we should be able to.

We could also mint ARKs in unlimited quantities if there's a reason to do so.

@Jegelewicz
Member Author

relationships are pairwise

Not really - there's always an implied second THING out there, but we don't have to be able to find it. "{whatever relationship of} ABC:XYZ:1234" is fine even if ABC:XYZ isn't online, "{whatever relationship of} NK 1" is fine even if 40 specimens (that we can find) wear "NK 1", etc.

But we WANT to find it! 40 fish with "same lot as" requires 39 relationships on all 40 records and then I have no easy way to see them all in one place (or I just don't know how to do it). In the same way, 20 events of blood samples from Mexican wolf studbook number 1216 require 19 relationships on each of 20 records (and a relationship needs to be added to ALL of them every time a new set of samples comes in!). It is a lot of work....
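The arithmetic behind this complaint can be sketched as follows, assuming each relationship is entered once on each record (so every pair of records needs two entries). The function names are made up for illustration.

```python
def pairwise_rows(n_records: int) -> int:
    """Each of n records carries a relationship to each of the other n - 1."""
    return n_records * (n_records - 1)


def rows_for_one_more(n_existing: int) -> int:
    """A new record needs n links out, plus one link back on each existing record."""
    return 2 * n_existing


def shared_id_rows(n_records: int) -> int:
    """With a shared organism ID, each record carries exactly one identifier."""
    return n_records


print(pairwise_rows(40))      # 40 fish in one lot: 1560 relationship entries
print(pairwise_rows(20))      # 20 wolf sampling events: 380 entries
print(rows_for_one_more(20))  # a 21st sampling event adds 40 more entries
print(shared_id_rows(21))     # versus 21 identifier entries in total
```

The relationship count grows with the square of the number of records, while the shared-identifier count grows linearly.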

@campmlc

campmlc commented Mar 14, 2019 via email

@dustymc
Contributor

dustymc commented Mar 14, 2019

easy way to see them

That's an interface problem.

a relationship needs to be added to ALL of them every time a new set of samples comes in!

That MAY be an interface problem too - eg, MAYBE I could just magic in reciprocals instead of the email. Not much problem technically, but there are social implications.

40 fish

That does occasionally happen, but more normal is a coyote, a beaver, 3 mice (all because the printer stuck), and all of their parasites (for reasons that don't make much sense to me).

siblings

There's an Issue somewhere about making inferences from relationships - also just a display problem.

organism IDs to deal with the latter, and relationships that can deal with the former.

Yea, there's some overlap that I don't think we can avoid. I think we need both anywhere we can - orgID is useless unless all of the bits are accessible, and relationships can't be used to find all the bits in places like GBIF. I'm not real happy with that, but I think it's reality.

@ccicero

ccicero commented May 24, 2021

OK, I'll be there at noon.

I'm not doing something right. Here is a record with two events:
https://arctos.database.museum/guid/MVZ:Bird:193195

I created an observational record for the second event:
https://arctos.database.museum/guid/MVZObs:Bird:4777

and selected for both an 'Organism ID' identifier
https://arctos.database.museum/entity/0709-02237

(manually entered the URL which I'm sure is not correct, but I didn't see a base URL in the code table)

When I click on the Organism ID link, I get "Entity not found! Please let us know what happened."

@dustymc
Contributor

dustymc commented May 24, 2021

"Entity not found!

You didn't create one.

https://handbook.arctosdb.org/documentation/entity.html

I did this for you:

Screen Shot 2021-05-24 at 10 30 43 AM

Screen Shot 2021-05-24 at 10 30 50 AM

nope not there so

Screen Shot 2021-05-24 at 10 30 55 AM

Screen Shot 2021-05-24 at 10 32 03 AM

and now you have the bare minimum.

The next step would (ideally - this is now functional) be to add the components.

Then clicking "pull" and accepting whatever it says would add some discoverability.

@ewommack

@Jegelewicz was amazing and added the office hours to the calendar.
Do we want any note or explanation @dustymc?
"Dusty's Office Hours are discussions with Dusty on specific problems and production developments in Arctos. Come join the conversation and help us figure out how to make Arctos better"

@dustymc
Contributor

dustymc commented May 24, 2021

Thanks!

I'm up for anything. I'll probably be more useful with some warning, I think we can/should prioritize if someone wants to schedule a topic, otherwise just see what happens?

@ewommack

I'll probably be more useful with some warning, I think we can/should prioritize if someone wants to schedule a topic

How about:
"Dusty's Office Hours are discussions with Dusty on specific problems and production developments in Arctos. Suggest a topic ahead of time in GitHub, or just come join the conversation and help us figure out how to make Arctos better"

@dustymc
Contributor

dustymc commented May 25, 2021

From meeting:

  • clarify search before create functionality
  • Search is one field, hits everything possible, has usage hint
  • show derived data (component IDs and such) in some less-central way

Changes

  • entityID is assigned by Arctos; you get what you get and don't have a fit
  • entity description (new field in table entity, required, editable, @campmlc will write documentation)
  • pull is automagic
  • manage_collection is required to create/edit

Unresolved:

  • show more dynamic view in search result
  • DO NOT show more dynamic view in search result

It's less-dynamic for now, not sure we have the CPU to pull everything in anyway. Looking forward, this needs to (theoretically) work for hundreds (zoo critters have a rough life) if not thousands (GPS collar, maybe) of components, which probably demands separate search results and 'details' views.

Needs further discussion:

Entities are but one option for Organism ID, and therefore the code is "Entity-centric." Organism ID can be exported from Entities to catalog records, but Entity ID cannot be exported/created from catalog records. I suggest that this is sufficient; Entities are "super objects" that only need exist when there's something additional to say. If the only goal is a common identifier for Organism ID, there are many options which do not involve Entities (bird banding lab numbers, for example). Entities are "better" identifiers, and making sure that they are in fact "better" requires a small amount of focus.

Yea But Anyway:

Consider something in SpecimenResults-->Manage-->Add All Records to {pick an entity}

  • I think I'm comfortable with this to ADD, not so sure about CREATE

Needs Clarification

re: "bird banding lab numbers, for example" above: There is confusion around this point, and it needs to be clarified somewhere. A number may/should be used in multiple types, because those types convey different information and have different functionality. For example, to use a BBL number as an Organism ID, the following should be entered (assuming BBL was an OtherID Type in Arctos):

  • BBL: 12345
  • Organism ID: BBL 12345

The BBL number supports "find records with a BBL number" (and perhaps value, but free-text fields aren't very good at that), and potentially (should BBL come online) can serve as a link to external resources or additional data.

The Organism ID serves as an Organism ID; it's an identifier that spans multiple Occurrences and links them together as one THING. In this case that link is dependent on users being consistent (eg, not using Organism ID: BBL{nospace}12345 in one of the involved records), and should be recognized as having limited scope (somewhere on the planet, there's probably an unrelated, perhaps even similar, "BBL 12345.") There's no realistic way for machines to determine if BBL{nospace}12345 and BBL 12345 should be the same thing; error detection requires (patient) humans.

Entities (of type Organism) serve the same purpose; they're linking identifiers. They differ in two significant ways:

  • There's a verifiable "correct" format; identifiers issued by Arctos behave differently than those which were not (eg, typos).
  • The Entity can carry data of its own, and this data can be used in things like error detection.

tl;dr: Any string can serve as Organism ID, but some can DO THINGS that others cannot.
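The consistency problem described above (BBL{nospace}12345 vs BBL 12345) can be made a little less painful by having a machine propose candidates for a human to review. The following is only a rough sketch, not anything that exists in Arctos: `normalize_id` and `flag_possible_duplicates` are hypothetical names, and the normalization rule (ignore case, spaces, and punctuation) is an assumption. It cannot decide whether two strings are the same organism; it only surfaces spellings worth a look.

```python
import re
from collections import defaultdict


def normalize_id(value):
    """Collapse case, whitespace, and punctuation so near-identical strings share a key.

    Assumption: spelling differences of this kind are likely typos. That is a
    guess, not a rule; unrelated organisms can still collide.
    """
    return re.sub(r"[^a-z0-9]", "", value.lower())


def flag_possible_duplicates(organism_ids):
    """Group distinct Organism ID strings whose normalized forms collide.

    Returns only groups with more than one distinct spelling. A (patient)
    human still has to decide whether they refer to the same organism.
    """
    groups = defaultdict(set)
    for value in organism_ids:
        groups[normalize_id(value)].add(value)
    return {key: sorted(values) for key, values in groups.items() if len(values) > 1}


if __name__ == "__main__":
    ids = ["BBL 12345", "BBL12345", "BBL 12345", "BBL 67890"]
    print(flag_possible_duplicates(ids))
```

Identifiers issued by Arctos would not need this kind of guessing, which is the point of the "verifiable correct format" difference noted above.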

Bulk Tools:

MSB's biopark data is recent and decent, but should have enough problems to be interesting. Try to make and "componentize" Entities from it, with a view towards developing bulk tools. (This may address any gaps left by the entity-centric approach described above.)

Reports:

  • See if the stuff from edit entity (components don't use entity ID, records using entity ID aren't components) can be made into reports and/or bulkloaders.
  • All entities should have components or preferred entity ID
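As a rough illustration of what the report checks above might look like, here is a sketch over plain in-memory data. Everything about the data shape is an assumption made for the example (a dict of entities with a `components` set and an optional `preferred_entity_id`, and a dict mapping catalog records to their Organism ID); it does not reflect the actual Arctos tables, and `entity_report` is a hypothetical name.

```python
def entity_report(entities, record_organism_ids):
    """Run three consistency checks and return the offending items.

    entities: {entity_id: {"components": set of record ids,
                           "preferred_entity_id": optional str}}  (assumed shape)
    record_organism_ids: {record_id: Organism ID string or None}  (assumed shape)
    """
    components_missing_id = []   # components that don't use the entity ID
    records_not_components = []  # records using the entity ID that aren't components
    empty_entities = []          # entities with no components and no preferred entity ID

    for entity_id, entity in sorted(entities.items()):
        components = entity.get("components", set())
        for record in sorted(components):
            if record_organism_ids.get(record) != entity_id:
                components_missing_id.append((entity_id, record))
        for record, organism_id in sorted(record_organism_ids.items()):
            if organism_id == entity_id and record not in components:
                records_not_components.append((entity_id, record))
        if not components and not entity.get("preferred_entity_id"):
            empty_entities.append(entity_id)

    return {
        "components_missing_id": components_missing_id,
        "records_not_components": records_not_components,
        "empty_entities": empty_entities,
    }


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    entities = {
        "E1": {"components": {"R1", "R2"}},
        "E2": {"components": set(), "preferred_entity_id": "E1"},
        "E3": {"components": set()},
    }
    records = {"R1": "E1", "R2": None, "R3": "E1"}
    for check, items in entity_report(entities, records).items():
        print(check, items)
```

The sketch only reports; it deliberately changes nothing, in keeping with the preference for a small amount of manual review over automated fixes.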

Possibilities:

Rather than Export, we could write to the ID loader with status=autoload

  • Yay: one click instead of ~4
  • Not so yay: Fixing the giant messes that approach is capable of creating could be a tremendous amount of work (which usually means it'll never happen, and then nobody will use this because it's all a giant mess). Suggest the small amount of review required to manually use the loader is well invested.

"Reports" above has the same implications; we could save a few minutes by automating, which might then require much more than a few minutes to fix the giant mess which could result from a relatively minor error.

@campmlc @Jegelewicz @ccicero what'd I miss/mangle?

@dustymc
Contributor

dustymc commented May 25, 2021

There's some new stuff in test, https://handbook.arctosdb.org/documentation/entity.html#the-process-v2 documents creating http://test.arctos.database.museum/entity/2

Questions:

  1. What should I auto-pull into Entity Assertions from catalog records; what data might lead someone to an existing Entity and prevent them from creating a duplicate?
  2. What should I dynamically pull on the detail page; what's useful there?

@ewommack

Not sure if this will be helpful, but here are several references for BBL bands: https://www.usgs.gov/centers/eesc/science/about-federal-bird-bands?qt-science_center_objects=0#qt-science_center_objects

BBL bands always have two sets of numbers: XXXX-XXXX or XXXX-XXXXX. The first string relates to the size of the band, and the second string is a sequential number assigned to individual banders. I can't find a reference for the numeric codes for the different sizes, but I'm sure it exists somewhere. I could dig deeper if you need me to.
They keep strong track of which of us has which bands, because as you can guess mistakes get made all the time. That way they know who to poke/yell at if a warbler band comes back being reported on a Red-tailed Hawk.
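For what it's worth, the format described above is simple enough to check mechanically. This is a minimal sketch based only on the description in this comment (four digits, a hyphen, then four or five digits); it is not an official BBL validation rule, and `parse_band` is a hypothetical helper name.

```python
import re

# Assumed pattern, taken from the description above: a 4-digit prefix,
# a hyphen, then a 4- or 5-digit sequence number.
BAND_PATTERN = re.compile(r"(\d{4})-(\d{4,5})")


def parse_band(text):
    """Return (prefix, sequence) if text looks like a band number, else None."""
    match = BAND_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for candidate in ["1234-56789", "1234-5678", "12345678", "BBL 1234-5678"]:
        print(repr(candidate), parse_band(candidate))
```

A check like this can only say that a string has the right shape; it says nothing about whether the band was ever issued or which bird carries it.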

@dustymc
Contributor

dustymc commented May 26, 2021

Thanks. Nothing can really change how unresolvable strings work, but entities could serve as a place to gather identifiers - the Entity itself can hold all the variations that might be found in GBIF-n-such (BBL:XXXX-XXXX; BBL XXXX-XXXX; XXXX-XXXX, XXXXXXXX, etc., etc.) and that has some possibility of leading users to those records if they find the Arctos record.
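A quick sketch of what "gathering the variations" could mean in practice: given the two halves of a band number, list the spellings mentioned above so they can be stored with the Entity. The function name `band_variants` and the default `BBL` label are assumptions for illustration, and the list covers only the four forms named in this comment, not every form that might turn up in the wild.

```python
def band_variants(prefix, sequence, label="BBL"):
    """List the spellings of one band number that were mentioned above.

    prefix and sequence are the two halves of the band number, as strings.
    """
    hyphenated = f"{prefix}-{sequence}"
    return [
        f"{label}:{hyphenated}",  # BBL:XXXX-XXXX
        f"{label} {hyphenated}",  # BBL XXXX-XXXX
        hyphenated,               # XXXX-XXXX
        f"{prefix}{sequence}",    # XXXXXXXX
    ]


if __name__ == "__main__":
    # Made-up band number, for illustration only.
    known = set(band_variants("1234", "5678"))
    print(sorted(known))
    print("12345678" in known)
```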

@Jegelewicz
Member Author

What should I auto-pull into Entity Assertions from catalog records; what data might lead someone to an existing Entity and prevent them from creating a duplicate?

Identification (taxon)
All other identifiers
Attributes (of the catalog record item)
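To make the "search before create" idea concrete, here is a rough sketch of a one-field search over pulled assertions. It assumes the three kinds of data listed above have already been copied onto each Entity; the dict layout, the example values, and the name `search_entities` are all hypothetical and not how Arctos stores or searches anything.

```python
def search_entities(query, entities):
    """Return IDs of entities with any assertion value containing the query.

    entities: {entity_id: {assertion_type: [values]}}  (assumed shape)
    Matching is a case-insensitive substring test across every assertion type.
    """
    needle = query.strip().lower()
    if not needle:
        return []
    hits = []
    for entity_id, assertions in sorted(entities.items()):
        values = [str(v).lower() for group in assertions.values() for v in group]
        if any(needle in value for value in values):
            hits.append(entity_id)
    return hits


if __name__ == "__main__":
    # Made-up data, for illustration only.
    entities = {
        "E1": {
            "identification": ["Canis lupus baileyi"],
            "identifiers": ["Mexican Wolf Studbook Number 1234"],
            "attributes": ["sex: female"],
        },
        "E2": {
            "identification": ["Buteo jamaicensis"],
            "identifiers": ["BBL 1234-56789"],
            "attributes": [],
        },
    }
    print(search_entities("baileyi", entities))
    print(search_entities("1234", entities))
```

Substring matching across everything is deliberately loose: when the goal is to avoid creating a duplicate Entity, too many hits is a better failure than too few.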

@dustymc
Contributor

dustymc commented May 28, 2021

Latest is in production, I rebuilt the two Entities I could, old data is in arctos-assets.

@Jegelewicz
Member Author

Sorry I haven't worked on this - I've been busy cleaning ichnotaxa and part names.....

@dustymc
Contributor

dustymc commented May 28, 2021

I think we've all had our distractions lately!

@campmlc

campmlc commented May 28, 2021 via email

@Jegelewicz
Member Author

One problem with the "multiple events for a cataloged organism" model. This one, where NONE of the parts are associated with any one of the 12(!) events.

I can tell you that at GBIF and iDigBio, each of the 12 occurrences includes all 28 parts, which is pretty misleading. Here is one of the GBIF occurrences: https://www.gbif.org/occurrence/1300283344

image

Also, ALL media are associated with ALL occurrences at GBIF, again misleading. This is sort of true at iDigBio as the "associated media" field links up with a search of media by the catalog number (at least I think that is what is happening), although this link has 9 results and there are 10 images at GBIF.

How does this stuff look at GGBN? Interestingly enough, I was unable to find any Canis lupus baileyi at all through their search page! @campmlc you may want to follow up on why this is so. I did find Canis lupus baileyi x Canis familiaris

I notice that GGBN results include this:

78 records found (unique samples, not counting multiple samples from the same specimen).

Well if all of the samples for this "specimen" get narrowed down to just one vial of blood in search results, then people would be missing out on the "over time" component of the sampling. Not to mention the fact that there may be more than one kind of sample (hair, blood, serum). HOWEVER, there's this

image

so what exactly is a "specimen"?

If I were someone looking in on this, it just looks like a big pile of things and I don't have the time or inclination to sort it out amongst the 4 different resources (Arctos, GBIF, GGBN, iDigBio). The information for one cataloged item should really not look so incredibly different in all of these resources. Some of that is on the resources, but some of it is on us.

Sorry for this, but I am looking into issues related to MaterialSample and as I was researching, I fell into this rabbit hole. I wanted to document it so when the time is right I can return to it.

@Jegelewicz
Member Author

And extra infuriating is this. It looks like GGBN takes all of our individual "occurrences" and mashes them together.

See https://www.ggbn.org/ggbn_portal/search/record?unitID=MSB%3AMamm%3A255471&collectioncode=Mamm&institutioncode=MSB

image

WHY do we have to split everything up for them? I don't understand how they couldn't take the data at GBIF and parse it into the separate "samples".

https://www.gbif.org/occurrence/1229671489

image

And why in blazes are there only three samples when the "preparations" clearly show 6?

AND the individual samples don't even show what they are, just "tissue"

image

@campmlc

campmlc commented Aug 5, 2021

Great observations - we need a designated discussion on this.
@jldunnum

@Jegelewicz
Member Author

I'd really like you guys to look at some of your stuff in all the various portals and think about what is happening!

Projects
Status: To do
Development

No branches or pull requests

8 participants