Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rebuild ggbn stuff #3699

Closed
dustymc opened this issue Jul 1, 2021 · 43 comments
Closed

rebuild ggbn stuff #3699

dustymc opened this issue Jul 1, 2021 · 43 comments
Assignees
Labels
Priority-Low (Wish list) I don't want to forget this, but it doesn't need to be done immediately

Comments

@dustymc
Copy link
Contributor

dustymc commented Jul 1, 2021

From https://github.com/ArctosDB/arctos/issues/1460

these use column is_tissue of table ctspecimen_part_name and need rebuilt to use attribute-centric view of tissues

drop view ipt_cache.ggbn_pre_permit_view;
drop view ipt_cache.ggbn_pre_permit_type_cache_view;
drop view  ipt_cache.ggbn_permit_view;
drop view ipt_cache.v_ggbn_flat_amplification;
drop view ipt_cache.ggbn_tissue_view;

and turn the scheduler back on

@dustymc dustymc added the Priority-Low (Wish list) I don't want to forget this, but it doesn't need to be done immediately label Jul 1, 2021
@dustymc dustymc added this to the Next Task milestone Jul 1, 2021
@dustymc
Copy link
Contributor Author

dustymc commented Aug 24, 2021

These are MIA from @dbloom 's schedule, I think there was talk of someone (@Jegelewicz ??) talking to GGBN to see if they can use the normal DWC core file instead of the horribly denormalized mess we've been sending. I'm not sure what to do nor how to prioritize it - help please.

@Jegelewicz
Copy link
Member

I think the main reason GGBN cannot use our "normal" DwC-a is that it includes things other than tissues and does NOT include permit information. As the TDWG MaterialSample Task Group works through some things, I hope we can eventually get a single DwC format that everyone can use, but that is probably more than a year away (probably longer unless someone tackles the permit stuff).

@Jegelewicz
Copy link
Member

See also #1966 (comment)

and #1966 (comment)

@dustymc
Copy link
Contributor Author

dustymc commented Aug 24, 2021

I'm not seeing any barrier in there.

  • "Normal DWC" has OccurrenceID as PKEY
  • Material sample has OccurrenceID as FKEY
  • Permit has OccurrenceID as FKEY

For GGBN, we're building an Occurrence for every relevant part-at-placetime.

At some point I'll likely complain about GGBN having their own flavor of materialsample and making me guess what they think a 'tissue' is rather than filtering on whatever users want to filter on, but that's not much of a barrier for now, and the specialized file makes ignoring not-tissue unavoidable (vs. "super easy" under a more generalized approach).

This has little/nothing to do with OrganismID or our imperfect Occurrence mapping.

https://www.gbif.org/occurrence/1229671489 is licensed https://creativecommons.org/licenses/by-nc/3.0/

You are free to...Adapt — remix, transform, and build upon the material

They've been given explicit permission to do (almost) whatever they want with those data, I'm not sure why we'd then complain about them doing what we've said they can do??

@Jegelewicz
Copy link
Member

I hear you - and my first response to this was a tirade about GGBN and "their" idea of what DarwinCore is/should be that I erased in favor of a more politically correct tirade.

I have already complained in excess about this to people at TDWG and GBIF, so hopefully this will get taken seriously and something will be done sooner rather than later.

All I have been saying for the last six months is - "I should not have to provide more than one DwC-a of my data. If it is really an exchange standard, once my data is in it, anyone should be able to take it and know what to do with whatever I have provided."

But there are two problems.

  1. DwC terms are poorly defined and often overlap, so providers don't know what to put where and users cannot be certain they are getting what they expect.
  2. DwC is occurrence-centric, but museums are object-centric these two points of view don't always work well together.

@dustymc
Copy link
Contributor Author

dustymc commented Aug 24, 2021

should not have to provide more than one DwC

Amen.

DwC terms are poorly defined

https://dwc.tdwg.org/rdf/ exists, but I don't know how to write to it.

occurrence-centric, but museums are ...

... whatever the heck someone felt like cataloging-centric

It took a while, but I've come around to the idea that that's a feature, not a bug - standardization is good, even if it's not "native" to CMS's and we're imperfect at mapping to it.

@Jegelewicz
Copy link
Member

Perhaps we will get relief from https://github.com/ArctosDB/internal/issues/198

@dustymc
Copy link
Contributor Author

dustymc commented Sep 20, 2023

From email "Updating GGBN resources"

  • GGBN is not likely to change to follow the Standard
  • There is some indication that this might be rebuilt using only part name 'tissue'
  • Perhaps rebuilding should receive some priority, but still not clear

@dustymc
Copy link
Contributor Author

dustymc commented Nov 14, 2023

tables rebuilt, need to turn scheduler back on after confirming that everything's mapped correctly

@dustymc dustymc modified the milestones: Next Task, Active Development Nov 14, 2023
@dbloom
Copy link

dbloom commented Nov 14, 2023

@dustymc I assume the Arctos IDs will need to be updated for all of the GGBN resources on the IPT. That is not something I completed in the last round of updates, but I will do it if (a) you confirm that I should, and (b) after you confirm that your mappings are correct and ready for prime time.

@dustymc
Copy link
Contributor Author

dustymc commented Nov 14, 2023

Arctos IDs will need to be updated for all of the GGBN resources

Not a clue...

confirm that your mappings are correct and ready for prime time

Ditto, hoping @Jegelewicz will help with that....

@Jegelewicz
Copy link
Member

confirming that everything's mapped correctly

Point me to the mapping and I'll take a look.

@dustymc
Copy link
Contributor Author

dustymc commented Nov 14, 2023

@Jegelewicz
Copy link
Member

For review - I am assuming we are attempting to build the GGBN MaterialSample extension - https://rs.gbif.org/extension/ggbn/materialsample.xml

@Jegelewicz
Copy link
Member

OOF - Are we sending ALL of the GGBN extensions or just some? I found this document that I think was the original mapping.

@dustymc
Copy link
Contributor Author

dustymc commented Nov 14, 2023

GGBN MaterialSample

I think so, but it can't be done as an extension of the 'normal' DWC (Because Reasons) - I think we'd mapped every part as an Occurrence.

I was wondering if it would be a simplification to pretend that every tissue-having Occurrence has precisely one tissue for GGBN, which lead me to wondering if pretending that every record has at most one Occurrence wouldn't be a better way to deal with eg https://github.com/ArctosDB/internal/issues/253.

ALL of the GGBN extensions

Beats me, whatever the minimum required to get to Arctos is what I'd want if they were my data....

OOF

yea....

@dbloom
Copy link

dbloom commented Nov 14, 2023

All I'm asking is if, for example, DNS Mamm = 45 now needs to become https://arctos.database.museum/collection/DMNS:Mamm in the SQL statement in the IPT. I am assuming this is the case, but want to confirm.


Separately, the GGBN resources all used the DwC Occurrence Core. They also used the following extensions:

Preparation
Amplification
Permit
Materials Sample
Resource Relationship

All of these were populated with data from Arctos Tables:

Arctos [sql]
arctosggbnpermit [sql]
ggbn_applification [sql]

@dustymc
Copy link
Contributor Author

dustymc commented Nov 14, 2023

, DNS Mamm = 45 now needs to become https://arctos.database.museum/collection/DMNS:Mamm

AHA - yes.

all used the DwC Occurrence Core.

I think no, if I'm not lost (and there's no certainty in that assumption!!) those pivot off of ggbn_specimen_view, not ipt_cache.occurrence (something about star schemas).

@Jegelewicz
Copy link
Member

if it would be a simplification to pretend that every tissue-having Occurrence has precisely one tissue for GGBN

I don't think that would work. In addition to tissues, we want to present instances of DNA (with appropriate terms). I do think it might be easier to think of the things listed by Dave above as separate files to build instead of trying to mash everything together, which seems like what we are doing now? I need a day or two to work through the extensions and think about it. Unfortunately, I have other tasks in process that also need my attention. I'd like to put together my thoughts then share them with everyone - especially interested parties that are publishing to GGBN.

@Jegelewicz Jegelewicz self-assigned this Nov 14, 2023
@dbloom
Copy link

dbloom commented Nov 14, 2023

Sorry @dustymc, my error. Existing SQL statements:

select * from ipt_cache.ggbn_tissue where collection_id=105 which is sourcename arctos in IPT

SELECT * from ipt_cache.ggbn_permit where collection_id = 105 which is sourcename arctosggbnpermit in IPT

select * from ipt_cache.ggbn_flat_amplification where collection_id=105 which is sourcename ggbn_amplification in IPT

@Jegelewicz
Copy link
Member

Separately, the GGBN resources all used the DwC Occurrence Core. They also used the following extensions:

Preparation
Amplification
Permit
Materials Sample
Resource Relationship

@dustymc Do they actually use the OccurrenceCore or are we creating a separate OccurenceCore for GGBN? When I look at the mapping above, it appears we are sending all kinds of stuff not requested in any of the GGBN extensions....

@Jegelewicz
Copy link
Member

Here is what I am starting with - the GGBN extensions listed above by @dbloom

In our current mapping we have things like sex, identification, etc. that are included in the OccurrenceCore. I get that any given occurrence might have multiple tissue parts, but why can't we just associate each individual part with one occurrence that we are passing to GBIF? I feel like we have made this harder than it should be? Or maybe there is something I don't know?

@dustymc
Copy link
Contributor Author

dustymc commented Nov 14, 2023

select * from ipt_cache.ggbn_tissue where collectionid='https://arctos.database.museum/collection/DMNS:Mamm' 


SELECT * from ipt_cache.ggbn_permit where collectionid='https://arctos.database.museum/collection/DMNS:Mamm' 

select * from ipt_cache.ggbn_flat_amplification where collectionid='https://arctos.database.museum/collection/DMNS:Mamm' 

creating a separate OccurenceCore for GGBN?

YES!!!! That's why this issue exists!!

why can't we just associate each individual part with one occurrence that we are passing to GBIF?

That's been my ongoing question for a few years now.... (Something about DWC technically being a star schema and that involving RDBMS-ish relationships, I think.)

@dbloom
Copy link

dbloom commented Nov 14, 2023

@Jegelewicz let me know if you want to look at the mappings in the IPT. Might help to see what things are going where.

@Jegelewicz
Copy link
Member

@dbloom once I get my head around things, I'll set up a meeting - probably bring @dustymc along for the ride....

@Jegelewicz
Copy link
Member

It seems like we need a file with a row for every "tissue" or DNA part that includes everything in the Occurrence file we build for GBIF for the record plus whatever we are sending from the GGBN the extensions. No need to re-build the GBIF stuff, just use it and tack on whatever is needed? Possible?

Then we need to decide what to send from the extensions and how that is mapped from Arctos.

That doesn't seem awfully hard, unless I am really missing something.

@Jegelewicz
Copy link
Member

@Jegelewicz
Copy link
Member

As it is, we are apparently calling everything "tissue" or we are not passing any DNA information.

@dustymc
Copy link
Contributor Author

dustymc commented Nov 14, 2023

No need to re-build the GBIF stuff, just use it and tack on whatever is needed? Possible?

Not sure what that means. That's sorta what's there now, but with unique OccurrenceIDs involving a pile of denormalizers - which is about as far from a tacking-on as one can get....

apparently calling everything "tissue" or we are not passing any DNA information.

What gives you that idea???

@Jegelewicz
Copy link
Member

What gives you that idea???

I can find no DNA when searching the MSB collections at GGBN.

Also - line 114 in the mapping

'tissue' materialSampleType,

seems to indicate we are setting all rows to "tissue" for materialSampleType?

@dustymc
Copy link
Contributor Author

dustymc commented Nov 14, 2023

searching the MSB collections at GGBN.

That'll be kinda antique and involving an entirely (and entirely arbitrary, record-by-record) definition of 'tissues' - not sure anything there could surprise me....

setting all rows to "tissue" for materialSampleType?

AHA - correct, but not for preparationType. Let me know if I need to remap something.

@Jegelewicz
Copy link
Member

Also - I cannot see that we are sending any information from the loan extension, but there is data at GGBN

image

@Jegelewicz
Copy link
Member

Let me know if I need to remap something.

See my working document. Once I am done, we need to get a few others to weigh in.

@dustymc
Copy link
Contributor Author

dustymc commented Nov 14, 2023

loan extension

Sorta-sure that's all just hardcoded in on their end but IDK

working document

Missing from my perspective are the keys - those are the Great Mystery from here.

@Jegelewicz
Copy link
Member

the Great Mystery from here.

Agree!!! I don't understand how these are all supposed to be connected! I suspect that it is resourceRelationshipID, but that is really hard to say?

@dustymc
Copy link
Contributor Author

dustymc commented Dec 22, 2023

I've rebuilt these tables for our current model:

ipt_cache.ggbn_specimen_view
ipt_cache.ggbn_tissue
ipt_cache.ggbn_permit
ipt_cache.ggbn_permit_type_cid
ipt_cache.ggbn_flat_amplification

Hopefully that's all correct, it would still be INCREDIBLY useful to find some way to not replicate a bunch of stuff. I didn't consider anything in the mapdoc, I just found a performant way to calculate tissueness at the part level and rebuilt the old code (plus some recentish changes) around it.

I have not turned any automation on, that will need done once the dust - of which there is hopefully none - has settled.

@Jegelewicz
Copy link
Member

I didn't consider anything in the mapdoc

is disappointing because there are things in there we should be sending and probably aren't because they weren't around when the original map was made (but I could be wrong about that). In any case, my part of this work was creating that - it would be nice to have GGBN publishers review it and have it considered as part of the task.

for example

ratioOfAbsorbance260_280

Also - it seems like we should be using the resource relationship extension to link up "tissue occurrences" from the same catalog record with the sameAs relationship.

Just for grins (more likely angry scowls), here is a sample from the files OGL has been sending to GGBN from their FileMaker database. Will we be able to replicate that? Doubtful, but we should make our best effort when the time comes.

IPTExtracts20220815_editmacro_forTeresa.xlsx
IPTSamples20220815_editmacro_forTeresa.xlsx

@dbloom
Copy link

dbloom commented Dec 22, 2023 via email

@dbloom
Copy link

dbloom commented Jan 5, 2024

@dustymc Attempting to update GGBN resources in the VN IPT.

Right out of the gate and error, e.g.,:

Previous SQL: select * from ipt_cache.ggbn_tissue where where collection_id = 3

With new ArctosID (based on other non-GGBN resources):
Select * from ipt_cache.ggbn_tissue collectionID = 'https://arctos.database.museum/collection/UAM:Mamm'

Error says "syntax error at or near "=" Position 50

Tried removing spaces, same issue. Don't know enough SL to know why = won't work here, while it works for all of the non-GGBN rources.

@dustymc
Copy link
Contributor Author

dustymc commented Jan 5, 2024

Two wheres is too many and zeros not enough!

Select * from ipt_cache.ggbn_tissue WHERE collectionID = 'https://arctos.database.museum/collection/UAM:Mamm'

should do it.

update

I think so but I'll try to do whatever Teresa wants. Their requirements are difficult to understand and don't play nice with anything else, I say we get the basics functional again and then maybe think about improvements. (And maybe someone can somehow help them find some way of using the normal DWC, this thing is no fun.)

@dbloom
Copy link

dbloom commented Jan 5, 2024

WHERE does it. Thank you. Updating IP addresses, too. Will notify when done. It's something like 23 or so resources, each with at least three table data sources and as many as 6 unique mappings.

@Jegelewicz
Copy link
Member

Sounds like the beer tab is coming due....

@dustymc
Copy link
Contributor Author

dustymc commented Feb 16, 2024

AFAIK this is done and happy.

@dustymc dustymc closed this as completed Feb 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority-Low (Wish list) I don't want to forget this, but it doesn't need to be done immediately
Projects
None yet
Development

No branches or pull requests

3 participants