Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DWC mapping #7348

Open
dustymc opened this issue Jan 30, 2024 · 47 comments
Open

DWC mapping #7348

dustymc opened this issue Jan 30, 2024 · 47 comments
Labels
Aggregator issues e.g., GBIF, iDigBio, etc Enhancement I think this would make Arctos even awesomer!

Comments

@dustymc
Copy link
Contributor

dustymc commented Jan 30, 2024

The Map

https://docs.google.com/spreadsheets/d/1aCBYX9ErjicL8VdNdHbJUI0JTwWu6L4D_37gJ7IneRY/edit?gid=0#gid=0 will be the primary Arctos-->DWC mapping document; please make suggestions/corrections/etc in this issue.

Mapping Test

Here's a sample of DWC generated from the spreadsheet: temp_dwc_sample.csv.zip

  • is it mapping correctly?
  • are there any encumbered data or other problems?

Let me know if you need to see this with some particular data, or what I can do to make things clear.

Goals

A clear and functional DWC mapping document.

  • a spreadsheet map which can be updated, expanded, clarified, versioned, etc. as necessary
  • a place to discuss all issues and needs in context of the whole

Scope

This Issue is for mapping to "flat DWC" (DwC-A). Media/AudubonCore (existing mapping) can be addressed elsewhere. Extensions (new mapping) would also need dedicated Issues and justification. (Because some - perhaps most - don't do much.)

Major Change

@mkoo and I believe mapping should be simplified, where only each "best occurrence" (eg what's in FLAT) is shared via DWC; that's in line with current cataloging practices, will exclude mostly things like lower-quality georeferences, will be a huge simplification in mapping and understanding the data, and will not require us to mint fake identifiers (which make GBIF nervous and might well end up in publications).

working comments

In progress: "translate" SQL (https://github.com/ArctosDB/PG_DDL/blob/master/shared_data/dwc_occurrence.sql) to spreadsheet (in a way that can be used to write dynamic SQL).

I'll merge related issues here so they can be addressed in context. It'll take a while.

Some possibly-related issues: https://github.com/ArctosDB/arctos/issues?q=is%3Aissue+is%3Aopen+label%3A%22Aggregator+issues%22

@dustymc dustymc added Enhancement I think this would make Arctos even awesomer! Aggregator issues e.g., GBIF, iDigBio, etc labels Jan 30, 2024
@dustymc dustymc added this to the Needs Discussion milestone Jan 30, 2024
@Jegelewicz

This comment was marked as resolved.

@campmlc

This comment was marked as resolved.

@ekrimmel

This comment was marked as resolved.

@Jegelewicz

This comment was marked as off-topic.

@tucotuco

This comment was marked as resolved.

@AJLinn

This comment was marked as resolved.

@Jegelewicz

This comment was marked as resolved.

@Jegelewicz

This comment was marked as outdated.

@Jegelewicz

This comment was marked as resolved.

@Jegelewicz

This comment was marked as resolved.

@Nicole-Ridgwell-NMMNHS

This comment was marked as resolved.

@Jegelewicz

This comment was marked as off-topic.

@dustymc

This comment was marked as off-topic.

@Jegelewicz

This comment was marked as resolved.

@dustymc

This comment was marked as outdated.

@Jegelewicz

This comment was marked as off-topic.

@happiah-madson
Copy link

Also, is the order of these fields intentional? Does it only go to aggregators so it doesn't matter?

@dustymc
Copy link
Contributor Author

dustymc commented Nov 26, 2024

collection code is TEACH

It's just a dumb mapping, we don't really have anything equivalent (but I think maybe can't leave it NULL??).

submit an issue

https://github.com/ArctosDB/code-table-work/issues/77 (we're aiming for fewer, not more! - but comment there...)

order of these fields

Order is arbitrary/irrelevant (not just here...) - although I think IPT has some very not-great ideas about order which might eventually influence something.

@happiah-madson
Copy link

not-great ideas about order

😭

@dustymc
Copy link
Contributor Author

dustymc commented Dec 13, 2024

GGBN meeting discussed this (https://docs.google.com/document/d/1Qs-RQKkIqpJr5xx2VbfDnxPbw4vFDcypGWcVDjyBYo4/edit?tab=t.0#heading=h.mn8up28dmjnq) consensus is to move ahead with simple remapping, ideally before talking to GGBN in ~January.

@dustymc
Copy link
Contributor Author

dustymc commented Dec 13, 2024

@happiah-madson says

in the old days of OGL, we put all external identifiers in this field (DWC:recordNumber) and ignored fieldNumber.

This sounds most excellent to me, I propose merging

  • otherCatalogNumbers
  • recordNumber
  • fieldNumber

somewhere, but I think perhaps https://dwc.tdwg.org/terms/#dwc:otherCatalogNumbers is better than https://dwc.tdwg.org/terms/#dwciri:recordNumber

and associatedSequences is a bit redundant - should we drop it too?

@dustymc
Copy link
Contributor Author

dustymc commented Dec 13, 2024

@mkoo in case its handy for your new column (I changed the name to make valid CSV), here a sample of MVZ:Herp data built from the spreadsheet as of NOW.

temp_dwc_sample.csv.zip

@dustymc
Copy link
Contributor Author

dustymc commented Dec 13, 2024

@campmlc here's a sample of some MSB Organism-having records:

temp_dwc_sample(1).csv.zip

@dustymc
Copy link
Contributor Author

dustymc commented Dec 13, 2024

I added a new tab, I'm bringing Media into this discussion. Media and Occurrences are closely linked, and the unanimous consensus seems to be to act on this sooner rather than later so let's just fix everything at once.

Here's some sample data generated by the Media tab of https://docs.google.com/spreadsheets/d/1aCBYX9ErjicL8VdNdHbJUI0JTwWu6L4D_37gJ7IneRY/edit?gid=1920472984#gid=1920472984

See also https://github.com/ArctosDB/internal/issues/365

temp_media_test.csv.zip

@happiah-madson
Copy link

Somewhere, but I think perhaps https://dwc.tdwg.org/terms/#dwc:otherCatalogNumbers is better than https://dwc.tdwg.org/terms/#dwciri:recordNumber

I think there could be an argument for actual other collection catalog numbers being in the former and all the rest in the latter (only b/c in an issues meeting, there seemed to be strong preference from the community to distinguish between catalog numbers and all the other chaos (i.e., yes, we'll use "identifier" for nearly everything except for actual catalog numbers).

@happiah-madson
Copy link

Also, @dustymc, I was going to file an issue to put life stage into cache, but filtered_flat.age_class doesn't seem to be in the DwC mapping anymore? Am I just confused?

@dustymc
Copy link
Contributor Author

dustymc commented Dec 17, 2024

actual other collection catalog numbers

I haven't seen the slightest glimmer of a hint that there might be an actual distinction in there (and if there is then it's probably completely fatal to the idea of collector number) so this seems very premature. If such a distinction somehow arises then remapping to it should be trivial.

strong preference

I get that, but I can't write code to it and until that's possible (and obvious enough that everyone else who's sharing data to DWC makes the same choices) it can only serve as a means to make data less accessible.

file an issue to put life stage into cache

Go for it (and please frame it as a functional need!).

filtered_flat.age_class doesn't seem to

... exist, because the source of it doesn't exist.

@dustymc
Copy link
Contributor Author

dustymc commented Dec 19, 2024

Crazy-maybe-dumb cake+eat idea:

identifiers -- > https://dwc.tdwg.org/terms/#dwc:otherCatalogNumbers
othercatalognumbers --> https://dwc.tdwg.org/terms/#dwciri:recordNumber

Example:

select jsonb_pretty(identifiers) identifiers,othercatalognumbers from flat where guid='DMNS:Mamm:11098';


                              identifiers                              |                                                                                                 othercatalognumbers                                                                                                  
-----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 [                                                                    +| collector number JRD423 | DZTM: Denver Zoology Tissue Mammal 178 | DOI Foundation identifier https://doi.org/10.5061/dryad.52mp1 | NCBI Nucleotide (GenBank) identifier http://www.ncbi.nlm.nih.gov/nuccore/KJ139497
     {                                                                +| 
         "remarks": null,                                             +| 
         "issued_by": null,                                           +| 
         "identifier": "JRD423",                                      +| 
         "assigned_by": "unknown",                                    +| 
         "assigned_date": "2022-04-27",                               +| 
         "identifier_type": "collector number"                        +| 
     },                                                               +| 
     {                                                                +| 
         "remarks": null,                                             +| 
         "issued_by": null,                                           +| 
         "identifier": "178",                                         +| 
         "assigned_by": "unknown",                                    +| 
         "assigned_date": "2022-04-27",                               +| 
         "identifier_type": "DZTM: Denver Zoology Tissue Mammal"      +| 
     },                                                               +| 
     {                                                                +| 
         "remarks": null,                                             +| 
         "issued_by": "DOI Foundation",                               +| 
         "identifier": "https://doi.org/10.5061/dryad.52mp1",         +| 
         "assigned_by": "unknown",                                    +| 
         "assigned_date": "2022-04-27",                               +| 
         "identifier_type": "identifier"                              +| 
     },                                                               +| 
     {                                                                +| 
         "remarks": null,                                             +| 
         "issued_by": "NCBI Nucleotide (GenBank)",                    +| 
         "identifier": "http://www.ncbi.nlm.nih.gov/nuccore/KJ139497",+| 
         "assigned_by": "unknown",                                    +| 
         "assigned_date": "2022-04-27",                               +| 
         "identifier_type": "identifier"                              +| 
     }                                                                +| 
 ]                                                                     | 
(1 row)


@campmlc
Copy link

campmlc commented Dec 19, 2024

So my understanding (and I believe everyone else who has been in this discussion over the years) is that the only "othercatalognumber" identifier type the community considers to be an actual catalog number is "institutional catalog number", not collector number or anything else (except for Arctos Guids, which are catalog numbers which would only apply here if the relationship is "same individual as" to this occurrence) Apologies if I am not familiar enough with the mapping issue and am misunderstanding.
The msb organism ID mapping looks OK.
What do non-Arctos collections map to AssociatedSequences? It appears from the symbiota file I received from another collection submitting samples to DGR that they map the GenBank accession number there for each sample=part. So we would add all GenBank accessions here?

@happiah-madson
Copy link

What do non-Arctos collections map to AssociatedSequences? It appears from the symbiota file I received from another collection submitting samples to DGR that they map the GenBank accession number there for each sample=part. So we would add all GenBank accessions here?

That is certainly what we use to do: concatenated list of all genbank/bold accession numbers.

@happiah-madson
Copy link

identifiers -- > https://dwc.tdwg.org/terms/#dwc:otherCatalogNumbers
othercatalognumbers --> https://dwc.tdwg.org/terms/#dwciri:recordNumber

I feel like this should be the other way round w/ identifiers as record number and othercatalognumbers as otherCatalogNumbers. Am I missing something @dustymc

Also: we use the identifier type "othercatalognumbers" rather than an actual collection bit like DZTM: Denver Zoology Tissue Mammal. Is that okay? If we had every other collection that we have materials co-located in, this code table would become dare I say, even more unwieldy.

@dustymc
Copy link
Contributor Author

dustymc commented Dec 19, 2024

other way round

Sure, no problem (but it's the same data...).

temp_dwc_sample(2).csv.zip

So @mkoo @happiah-madson @dbloom can I just replace our current ipt_cache.occurrence with this (and the media tab) and figure it out from there? There's a timeline (#8301 (comment)) and IDK how else to proceed...

@happiah-madson
Copy link

Sure, no problem (but it's the same data...).

Your data made perfect sense to me.

@dustymc
Copy link
Contributor Author

dustymc commented Jan 9, 2025

Pulling some @happiah-madson comments here, they need more discussion than I can tolerate in a comment chain:

Is this mapped to assigned by agent because mapping it to locality_attribute_determiner for the attribute "georeference source" would be an ungodly decision?

assigned by agent is correct because they're the person who's linked the record and a place
assigned by agent is incorrect because that may have nothing to do with georef
assigned by agent is convenient because there's one in how we flatten

locality_attribute_determiner is correct because it is (sometimes)
locality_attribute_determiner is weird because this attribute is very optional (it's just part of the locality stack in my view, I'd never willingly use it...)
locality_attribute_determiner is inconvenient because there may be zero or 6745 of them in how we flatten

This whole conversation is a little weird in that DWC has a baked-in assumption that coordinates come from georeferences fed by primary locality strings, which of course isn't how reality or Arctos sees the world. Possibly ignoring all of the 'georef' fields would be the most appropriate response?

@happiah-madson
Copy link

they need more discussion than I can tolerate in a comment chain

sorry 😬😬😬😬😬😬😬

Possibly ignoring all of the 'georef' fields would be the most appropriate response?

That's kind of what I'm wondering! We give all the appropriate attributions in Arctos and the work is tracked, but...incorrectly linking it in DwC seems unnecessary.

@dustymc
Copy link
Contributor Author

dustymc commented Jan 10, 2025

I'm making the mappings live per conversation with @mkoo

@dustymc
Copy link
Contributor Author

dustymc commented Jan 11, 2025

Tables have been rebuilt (and performance is, comparatively anyway, fabulous), I believe next step is to publish.

arctosprod@arctos>> select count(*) from ipt_cache.occurrence;
  count  
---------
 5436959
(1 row)
arctosprod@arctos>> \d ipt_cache.occurrence;
                             Unlogged table "ipt_cache.occurrence"
             Column             |            Type             | Collation | Nullable | Default 
--------------------------------+-----------------------------+-----------+----------+---------
 collectionid                   | text                        |           |          | 
 language                       | text                        |           |          | 
 license                        | character varying           |           |          | 
 basisofrecord                  | character varying(20)       |           |          | 
 recordedby                     | text                        |           |          | 
 informationwithheld            | character varying           |           |          | 
 individualcount                | integer                     |           |          | 
 modified                       | timestamp without time zone |           |          | 
 associatedoccurrences          | character varying           |           |          | 
 occurrenceremarks              | character varying           |           |          | 
 typestatus                     | character varying           |           |          | 
 associatedtaxa                 | character varying           |           |          | 
 othercatalognumbers            | character varying           |           |          | 
 recordnumber                   | character varying           |           |          | 
 organismid                     | character varying           |           |          | 
 catalognumber                  | character varying(67)       |           |          | 
 associatedsequences            | character varying           |           |          | 
 fieldnumber                    | character varying           |           |          | 
 sex                            | character varying           |           |          | 
 dynamicproperties              | text                        |           |          | 
 identificationremarks          | character varying           |           |          | 
 family                         | character varying           |           |          | 
 higherclassification           | character varying           |           |          | 
 genus                          | character varying           |           |          | 
 identificationreferences       | character varying           |           |          | 
 identifiedby                   | character varying           |           |          | 
 taxonrank                      | character varying(255)      |           |          | 
 kingdom                        | character varying           |           |          | 
 dateidentified                 | character varying(22)       |           |          | 
 nomenclaturalcode              | character varying           |           |          | 
 preparations                   | character varying           |           |          | 
 class                          | character varying           |           |          | 
 order                          | character varying           |           |          | 
 phylum                         | character varying           |           |          | 
 scientificname                 | character varying(255)      |           |          | 
 previousidentifications        | character varying           |           |          | 
 associatedmedia                | character varying(121)      |           |          | 
 type                           | text                        |           |          | 
 samplingprotocol               | character varying           |           |          | 
 degreeofestablishment          | character varying(15)       |           |          | 
 locationaccordingto            | character varying(255)      |           |          | 
 georeferencedby                | character varying(255)      |           |          | 
 georeferenceddate              | timestamp without time zone |           |          | 
 habitat                        | character varying           |           |          | 
 georeferenceverificationstatus | character varying(40)       |           |          | 
 eventremarks                   | text                        |           |          | 
 geodeticdatum                  | character varying(55)       |           |          | 
 verbatimeventdate              | character varying(60)       |           |          | 
 verbatimlocality               | character varying           |           |          | 
 verbatimcoordinatesystem       | character varying(20)       |           |          | 
 eventdate                      | text                        |           |          | 
 eventtime                      | text                        |           |          | 
 day                            | text                        |           |          | 
 month                          | text                        |           |          | 
 year                           | text                        |           |          | 
 dayofyear                      | integer                     |           |          | 
 verbatimcoordinates            | character varying(255)      |           |          | 
 coordinateuncertaintyinmeters  | double precision            |           |          | 
 decimallatitude                | double precision            |           |          | 
 decimallongitude               | double precision            |           |          | 
 locality                       | character varying(255)      |           |          | 
 georeferenceprotocol           | character varying(255)      |           |          | 
 georeferencesources            | character varying           |           |          | 
 minimumdepthinmeters           | double precision            |           |          | 
 maximumdepthinmeters           | double precision            |           |          | 
 minimumelevationinmeters       | double precision            |           |          | 
 maximumelevationinmeters       | double precision            |           |          | 
 locationremarks                | character varying           |           |          | 
 continent                      | character varying(255)      |           |          | 
 country                        | character varying(50)       |           |          | 
 county                         | character varying(50)       |           |          | 
 island                         | text                        |           |          | 
 islandgroup                    | text                        |           |          | 
 waterbody                      | character varying(50)       |           |          | 
 stateprovince                  | character varying(75)       |           |          | 
 highergeography                | character varying(255)      |           |          | 
 institutioncode                | character varying(20)       |           |          | 
 collectioncode                 | character varying(5)        |           |          | 
 earliesteraorlowesterathem     | character varying           |           |          | 
 earliesteonorlowesteonothem    | character varying           |           |          | 
 earliestepochorlowestseries    | character varying           |           |          | 
 earliestageorloweststage       | character varying           |           |          | 
 earliestperiodorlowestsystem   | character varying           |           |          | 
 formation                      | character varying           |           |          | 
 group                          | character varying           |           |          | 
 member                         | character varying           |           |          | 
 occurrenceid                   | text                        |           |          | 
 institutionid                  | character varying(20)       |           |          | 
 accessrights                   | text                        |           |          | 
 footprintwkt                   | text                        |           |          | 
 scientificnameid               | character varying           |           |          | 

arctosprod@arctos>> select count(*) from  ipt_cache.audubonmedia;
  count  
---------
 1320184

arctosprod@arctos>> \d ipt_cache.audubonmedia
                     Table "ipt_cache.audubonmedia"
    Column    |          Type           | Collation | Nullable | Default 
--------------+-------------------------+-----------+----------+---------
 collectionid | text                    |           |          | 
 occurrenceid | text                    |           |          | 
 type         | character varying(255)  |           |          | 
 format       | character varying(255)  |           |          | 
 identifier   | text                    |           |          | 
 references   | text                    |           |          | 
 title        | character varying       |           |          | 
 description  | character varying       |           |          | 
 created      | character varying(4000) |           |          | 
 creator      | character varying       |           |          | 
 contributor  | text                    |           |          | 
 publisher    | text                    |           |          | 
 audience     | text                    |           |          | 
 source       | text                    |           |          | 
 license      | character varying(255)  |           |          | 
 rightsholder | text                    |           |          | 
 datasetid    | text                    |           |          | 

@dustymc dustymc modified the milestones: DWC, Active Development Jan 11, 2025
@happiah-madson
Copy link

fabulous

image

@dbloom
Copy link

dbloom commented Jan 24, 2025

@dusty Testing the theory, so to speak....

Currently building a resource on the VN IPT for UCM FossilVert (data-migration#1968 above).

Noting that re: DwC Occurrence Core:

  1. identificationVerificationStatus is no longer represented in the list of field in the tables.
  2. dayofyear is now present in the table, but is not mapped automatically given there is no matching dwc term. QUESTION: I can map dayofyear to dwc:startdayofyear and dwc:enddayofyear to demonstrate that the even occurred only on that day OR I can leave it unmapped. Please advise.

I am mapping the Audiovisual Media Ext (formerly Audubon Media), although I cannot test it's effectiveness since there are no media associated with this collection, but I can report the following in the IPT.

  1. The following fields do not map automatically: collectionID, references, created, contributor, publisher, audience, license, rightsholder datasetID.

Before I spend a lot of time detailed how I might map each of these, do you want all of these fields mapped. If so, I will respond with recommended mappings. Please advise.

Additional Question: With the new tables and fields to any of the following fields contain URIs?
type
source
creator
format

I ask because those fields map automatically to both dc:format and dcterms:format. The latter of each of these is expecting a URI as the standard input. It has been the VN policy dating back to Laura Russell to map only to dc:format and to un-map to dcterms:format because these fields have not contained URIs. If this has changed, please let me know (and I'll update all 200000 Arctos resources on the IPT).

@dustymc
Copy link
Contributor Author

dustymc commented Jan 24, 2025

identificationVerificationStatus

Yes, I think that's correct, we have no such concept. (Err, we do, but it involves analyzing agents and dates and methods and such, not a text string.)

dayofyear

Unless there's some compelling use case for retaining it, I propose to remove this mapping. (Arctos data are complicated - there could be like 40, or none but still dates, etc., etc. - attempting to pull this out seems potentially misleading.)

I will respond with recommended mappings.

PLEASE!!

History is that we had an antique and inappropriate mapping, I flailed around trying to figure out what to do that properly includes license, then did some random things. See also https://github.com/ArctosDB/internal/issues/365. We do want to publish media, do not know how, any help greatly appreciated.

any of the following fields contain URIs?

type - no, tentatively mapped to terms in https://arctos.database.museum/info/ctDocumentation.cfm?table=ctmedia_type
source - maybe, I'm not clear what's intended, POSSIBLY this could be mapped to https://arctos.database.museum/info/ctDocumentation.cfm?table=ctmedia_relationship#derived_from_media, it's not mapped for now
creator - maybe-eventually this could be agentID eg https://arctos.database.museum/agent/10002371, it's string for now
format - strings from https://arctos.database.museum/info/ctDocumentation.cfm?table=ctmime_type

If this has changed

In all of the above, I think we are all very receptive to any sort of suggestion, nudge, hand-holding, whatever. I don't look at DWC more than I have to, things change WAY faster than I can keep up (and I didn't necessarily know what I was doing when I thought I did...), none of this is very "core" to me, I think we'd all like to share as much as our resources allow, any and all help in getting there is greatly appreciated.

THANKS!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Aggregator issues e.g., GBIF, iDigBio, etc Enhancement I think this would make Arctos even awesomer!
Projects
None yet
Development

No branches or pull requests

9 participants