rebuild ggbn stuff #3699
Comments
These are MIA from @dbloom's schedule. I think there was talk of someone (@Jegelewicz ??) talking to GGBN to see if they can use the normal DWC core file instead of the horribly denormalized mess we've been sending. I'm not sure what to do or how to prioritize it - help please. |
I think the main reason GGBN cannot use our "normal" DwC-a is that it includes things other than tissues and does NOT include permit information. As the TDWG MaterialSample Task Group works through some things, I hope we can eventually get a single DwC format that everyone can use, but that is probably more than a year away (probably longer unless someone tackles the permit stuff). |
See also #1966 (comment) and #1966 (comment) |
I'm not seeing any barrier in there.
For GGBN, we're building an Occurrence for every relevant part-at-placetime. At some point I'll likely complain about GGBN having their own flavor of materialsample and making me guess what they think a 'tissue' is rather than filtering on whatever users want to filter on, but that's not much of a barrier for now, and the specialized file makes ignoring not-tissue unavoidable (vs. "super easy" under a more generalized approach). This has little/nothing to do with OrganismID or our imperfect Occurrence mapping. https://www.gbif.org/occurrence/1229671489 is licensed https://creativecommons.org/licenses/by-nc/3.0/
They've been given explicit permission to do (almost) whatever they want with those data, I'm not sure why we'd then complain about them doing what we've said they can do?? |
I hear you - and my first response to this was a tirade about GGBN and "their" idea of what DarwinCore is/should be that I erased in favor of a more politically correct tirade. I have already complained in excess about this to people at TDWG and GBIF, so hopefully this will get taken seriously and something will be done sooner rather than later. All I have been saying for the last six months is - "I should not have to provide more than one DwC-a of my data. If it is really an exchange standard, once my data is in it, anyone should be able to take it and know what to do with whatever I have provided." But there are two problems.
|
Amen.
https://dwc.tdwg.org/rdf/ exists, but I don't know how to write to it.
... whatever the heck someone felt like cataloging-centric
It took a while, but I've come around to the idea that that's a feature, not a bug - standardization is good, even if it's not "native" to CMS's and we're imperfect at mapping to it. |
Perhaps we will get relief from https://github.com/ArctosDB/internal/issues/198 |
From email "Updating GGBN resources"
|
tables rebuilt, need to turn scheduler back on after confirming that everything's mapped correctly |
@dustymc I assume the Arctos IDs will need to be updated for all of the GGBN resources on the IPT. That is not something I completed in the last round of updates, but I will do it if (a) you confirm that I should, and (b) after you confirm that your mappings are correct and ready for prime time. |
Not a clue...
Ditto, hoping @Jegelewicz will help with that.... |
Point me to the mapping and I'll take a look. |
For review - I am assuming we are attempting to build the GGBN MaterialSample extension - https://rs.gbif.org/extension/ggbn/materialsample.xml |
OOF - Are we sending ALL of the GGBN extensions or just some? I found this document that I think was the original mapping. |
I think so, but it can't be done as an extension of the 'normal' DWC (Because Reasons) - I think we'd mapped every part as an Occurrence. I was wondering if it would be a simplification to pretend that every tissue-having Occurrence has precisely one tissue for GGBN, which led me to wondering if pretending that every record has at most one Occurrence wouldn't be a better way to deal with eg https://github.com/ArctosDB/internal/issues/253.
Beats me, whatever the minimum required to get to Arctos is what I'd want if they were my data....
yea.... |
All I'm asking is if, for example, DMNS Mamm = 45 now needs to become https://arctos.database.museum/collection/DMNS:Mamm in the SQL statement in the IPT. I am assuming this is the case, but want to confirm. Separately, the GGBN resources all used the DwC Occurrence Core. They also used the following extensions: Preparation All of these were populated with data from Arctos Tables: Arctos [sql] |
AHA - yes.
I think no, if I'm not lost (and there's no certainty in that assumption!!) those pivot off of ggbn_specimen_view, not ipt_cache.occurrence (something about star schemas). |
I don't think that would work. In addition to tissues, we want to present instances of DNA (with appropriate terms). I do think it might be easier to think of the things listed by Dave above as separate files to build instead of trying to mash everything together, which seems like what we are doing now? I need a day or two to work through the extensions and think about it. Unfortunately, I have other tasks in process that also need my attention. I'd like to put together my thoughts then share them with everyone - especially interested parties that are publishing to GGBN. |
Sorry @dustymc, my error. Existing SQL statements:
|
@dustymc Do they actually use the OccurrenceCore or are we creating a separate OccurrenceCore for GGBN? When I look at the mapping above, it appears we are sending all kinds of stuff not requested in any of the GGBN extensions.... |
Here is what I am starting with - the GGBN extensions listed above by @dbloom In our current mapping we have things like sex, identification, etc. that are included in the OccurrenceCore. I get that any given occurrence might have multiple tissue parts, but why can't we just associate each individual part with one occurrence that we are passing to GBIF? I feel like we have made this harder than it should be? Or maybe there is something I don't know? |
YES!!!! That's why this issue exists!!
That's been my ongoing question for a few years now.... (Something about DWC technically being a star schema and that involving RDBMS-ish relationships, I think.) |
@Jegelewicz let me know if you want to look at the mappings in the IPT. Might help to see what things are going where. |
It seems like we need a file with a row for every "tissue" or DNA part that includes everything in the Occurrence file we build for GBIF for the record plus whatever we are sending from the GGBN extensions. No need to re-build the GBIF stuff, just use it and tack on whatever is needed? Possible? Then we need to decide what to send from the extensions and how that is mapped from Arctos. That doesn't seem awfully hard, unless I am really missing something. |
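To make the "use the GBIF row and tack on the part" idea concrete, here is a minimal sketch using Python's built-in sqlite3. The table and column names (`occurrence`, `part`, `material_sample_type`) and the sample rows are hypothetical stand-ins, not the real `ipt_cache` structures, and the tissue/DNA filter is an assumption about how parts might be classified.

```python
import sqlite3

# Hypothetical, simplified stand-ins for the occurrence cache and a part-level table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE occurrence (occurrence_id TEXT PRIMARY KEY, scientific_name TEXT, sex TEXT);
CREATE TABLE part (part_id TEXT PRIMARY KEY, occurrence_id TEXT,
                   part_name TEXT, material_sample_type TEXT);
INSERT INTO occurrence VALUES ('occ1', 'Sorex cinereus', 'female');
INSERT INTO occurrence VALUES ('occ2', 'Peromyscus maniculatus', 'male');
INSERT INTO part VALUES ('p1', 'occ1', 'liver', 'tissue');
INSERT INTO part VALUES ('p2', 'occ1', 'DNA extract', 'DNA');
INSERT INTO part VALUES ('p3', 'occ2', 'skull', NULL);
""")

# One output row per tissue or DNA part; the occurrence columns are reused, not rebuilt.
rows = conn.execute("""
SELECT p.part_id, p.material_sample_type, o.occurrence_id, o.scientific_name, o.sex
FROM part p
JOIN occurrence o ON o.occurrence_id = p.occurrence_id
WHERE p.material_sample_type IN ('tissue', 'DNA')
ORDER BY p.part_id
""").fetchall()

for row in rows:
    print(row)
```

The skull on `occ2` drops out because it is neither tissue nor DNA, while `occ1` appears twice, once per qualifying part. Whether GGBN can consume rows keyed this way is exactly the open question about keys raised in this thread.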
As it is, we are apparently calling everything "tissue" or we are not passing any DNA information. |
Not sure what that means. That's sorta what's there now, but with unique OccurrenceIDs involving a pile of denormalizers - which is about as far from a tacking-on as one can get....
What gives you that idea??? |
I can find no DNA when searching the MSB collections at GGBN. Also - line 114 in the mapping
seems to indicate we are setting all rows to "tissue" for materialSampleType? |
That'll be kinda antique and involving an entirely (and entirely arbitrary, record-by-record) definition of 'tissues' - not sure anything there could surprise me....
AHA - correct, but not for preparationType. Let me know if I need to remap something. |
See my working document. Once I am done, we need to get a few others to weigh in. |
Sorta-sure that's all just hardcoded in on their end but IDK
Missing from my perspective are the keys - those are the Great Mystery from here. |
Agree!!! I don't understand how these are all supposed to be connected! I suspect that it is resourceRelationshipID, but that is really hard to say? |
I've rebuilt these tables for our current model: ipt_cache.ggbn_specimen_view Hopefully that's all correct, it would still be INCREDIBLY useful to find some way to not replicate a bunch of stuff. I didn't consider anything in the mapdoc, I just found a performant way to calculate tissueness at the part level and rebuilt the old code (plus some recentish changes) around it. I have not turned any automation on, that will need done once the dust - of which there is hopefully none - has settled. |
"I didn't consider anything in the mapdoc" is disappointing because there are things in there we should be sending and probably aren't because they weren't around when the original map was made (but I could be wrong about that). In any case, my part of this work was creating that - it would be nice to have GGBN publishers review it and have it considered as part of the task. For example: ratioOfAbsorbance260_280. Also - it seems like we should be using the resource relationship extension to link up "tissue occurrences" from the same catalog record with the sameAs relationship. Just for grins (more likely angry scowls), here is a sample from the files OGL has been sending to GGBN from their FileMaker database. Will we be able to replicate that? Doubtful, but we should make our best effort when the time comes. IPTExtracts20220815_editmacro_forTeresa.xlsx |
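A rough sketch of what linking "tissue occurrences" from the same catalog record could look like, assuming one relationship row per pair of part-level occurrences. The record and occurrence identifiers are made up for illustration; `resourceID`, `relatedResourceID`, and `relationshipOfResource` are Darwin Core ResourceRelationship terms, and `sameAs` is simply the value suggested above, not a confirmed GGBN requirement.

```python
from itertools import combinations

# Hypothetical part-level occurrence IDs, grouped by the catalog record they came from.
parts_by_record = {
    "DMNS:Mamm:1": ["occ-1a", "occ-1b", "occ-1c"],
    "DMNS:Mamm:2": ["occ-2a"],
}

def relationship_rows(parts_by_record, relationship="sameAs"):
    """Emit one relationship row for every pair of occurrences sharing a record."""
    rows = []
    for _record, occurrence_ids in sorted(parts_by_record.items()):
        for left, right in combinations(sorted(occurrence_ids), 2):
            rows.append({
                "resourceID": left,
                "relatedResourceID": right,
                "relationshipOfResource": relationship,
            })
    return rows

rows = relationship_rows(parts_by_record)
for row in rows:
    print(row)
```

A record with a single qualifying part produces no rows, and a record with three parts produces three pairs. Pairwise rows grow quickly for records with many parts, so pointing every part at one parent occurrence may be a lighter alternative.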
Would you like me to update the GGBN resources on the IPT with the new
Arctos IDs, now that the tables have been rebuilt, or should I wait until
you decide what you want to do with the additional mapdoc content?
…On Fri, Dec 22, 2023 at 7:08 AM Teresa Mayfield-Meyer < ***@***.***> wrote:
I didn't consider anything in the mapdoc
<https://docs.google.com/spreadsheets/d/1nY8ppu0Sz_YJGN4QbNXVPO9mrMRjVNP7WjUP-EjqoqM/edit#gid=0>
is disappointing because there are things in there we should be sending
and probably aren't because they weren't around when the original map was
made (but I could be wrong about that). In any case, my part of this work
was creating that - it would be nice to have GGBN publishers review it and
have it considered as part of the task.
for example
ratioOfAbsorbance260_280
Also - it seems like we should be using the resource relationship
extension to link up "tissue occurrences" from the same catalog record with
the sameAs relationship.
Just for grins (more likely angry scowls), here is a sample from the files
OGL has been sending to GGBN from their FileMaker database. Will we be able
to replicate that? Doubtful, but we should make our best effort when the
time comes.
IPTExtracts20220815_editmacro_forTeresa.xlsx
<https://github.com/ArctosDB/arctos/files/13753603/IPTExtracts20220815_editmacro_forTeresa.xlsx>
IPTSamples20220815_editmacro_forTeresa.xlsx
<https://github.com/ArctosDB/arctos/files/13753604/IPTSamples20220815_editmacro_forTeresa.xlsx>
|
@dustymc Attempting to update GGBN resources in the VN IPT. Right out of the gate, an error, e.g.: Previous SQL: select * from ipt_cache.ggbn_tissue where where collection_id = 3 With new ArctosID (based on other non-GGBN resources): Error says: syntax error at or near "=", position 50. Tried removing spaces, same issue. Don't know enough SQL to know why = won't work here, while it works for all of the non-GGBN resources. |
Two wheres is too many and zero's not enough!
should do it.
I think so but I'll try to do whatever Teresa wants. Their requirements are difficult to understand and don't play nice with anything else, I say we get the basics functional again and then maybe think about improvements. (And maybe someone can somehow help them find some way of using the normal DWC, this thing is no fun.) |
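For anyone hitting the same error later, here is a small reproduction of the "two wheres / zero wheres" problem using Python's built-in sqlite3. The table is a hypothetical one-column stand-in for `ipt_cache.ggbn_tissue`, the collection value is only illustrative, and the zero-WHERE statement is a guess at what the failing query looked like, since it is not shown in this thread. The IPT talks to a different database, so exact error text and positions will differ; the sketch only shows which statement shapes parse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for ipt_cache.ggbn_tissue.
conn.execute("CREATE TABLE ggbn_tissue (collection_id TEXT, part TEXT)")
collection = "https://arctos.database.museum/collection/DMNS:Mamm"
conn.execute("INSERT INTO ggbn_tissue VALUES (?, 'liver')", (collection,))

statements = {
    # No WHERE: the parser reads collection_id as a table alias, then fails at "=".
    "zero": "select * from ggbn_tissue collection_id = ?",
    # Exactly one WHERE: valid.
    "one": "select * from ggbn_tissue where collection_id = ?",
    # Duplicate WHERE: the second keyword is a syntax error.
    "two": "select * from ggbn_tissue where where collection_id = ?",
}

def runs(sql):
    """Return True if the statement parses and executes, False on a syntax error."""
    try:
        conn.execute(sql, (collection,)).fetchall()
        return True
    except sqlite3.OperationalError:
        return False

results = {label: runs(sql) for label, sql in statements.items()}
print(results)
```

With no WHERE at all, the complaint lands on the `=` sign rather than on the missing keyword, which would be consistent with the "syntax error at or near =" message reported above.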
WHERE does it. Thank you. Updating IP addresses, too. Will notify when done. It's something like 23 or so resources, each with at least three table data sources and as many as 6 unique mappings. |
Sounds like the beer tab is coming due.... |
AFAIK this is done and happy. |
From https://github.com/ArctosDB/arctos/issues/1460
these use column is_tissue of table ctspecimen_part_name and need rebuilt to use attribute-centric view of tissues
and turn the scheduler back on