part vocabulary (or model?) #1020

Closed
dustymc opened this issue Jan 4, 2017 · 24 comments
Labels
Function-CodeTables Priority-High (Needed for work) High because this is causing a delay in important collection work..

Comments

@dustymc
Contributor

dustymc commented Jan 4, 2017

We recently added "shell (fossil)" as an Inv part, and now usage is expanding. I believe we now have multiple ways of saying "fossilized shell" - by being explicit, or, some of the time in some collections, by where the part is cataloged.

We also have no indication of what we mean by "fossil" (and there are many accepted-by-someone-for-some-purpose definitions).

I think the situation is actively preventing discovery, not facilitating it.

I don't think adding a parenthetical "fossil" to some parts is an acceptable solution; I don't think we could possibly agree on a workable definition of "fossil," and I don't see how we could add that determination to existing specimens if we could.

This is a time-sensitive issue; if we wish to recover to less ambiguous ground, we need to move soon while the data can (hopefully) still be separated.

(I don't really have any great ideas. Requiring definitions for parts might help me grasp the situation. Perhaps some new part attribute could be used to assert fossilness; at least those are easy for users to avoid!)

@dustymc dustymc added Function-CodeTables Priority-Critical (Arctos is broken) Critical because it is breaking functionality. labels Jan 4, 2017
@dustymc dustymc added this to the Needs Discussion milestone Jan 4, 2017
@campmlc

campmlc commented Jan 4, 2017 via email

@dustymc
Contributor Author

dustymc commented Jan 4, 2017

different types of fossils

If we use the Wikipedia definition ("preserved remains or traces of animals, plants, and other organisms from the remote past") then eg http://arctos.database.museum/guid/UAM:Mamm:53942 would use something like "muscle (frozen) (fossil)"....

A few other random issues that should be considered if we're doing anything serious:

Some "preservation method" information is (or would be if we all had the resources to fully embrace the container model) duplicated from container environment. That is, ideally I'd be able to get "frozen" because the part is in a tube which is in a ..... freezer which has a temperature history (from which I could ideally tell HOW frozen - eg, maybe it's in LN2 now but went through a freezer failure after 20 years at 0F).

At some point we'd defined "bare" parts to be "the normal thing" but I don't think that really works. Eg, we have....

UAM@ARCTOS> select part_name || ' @ ' || count(*) from specimen_part where part_name like '%muscle%' group by part_name order by part_name;

PART_NAME||'@'||COUNT(*)
------------------------------------------------------------------------------------------------------------------------
muscle @ 1260
muscle (95% ethanol) @ 2658
muscle (DMSO) @ 47
muscle (RNAlater) @ 7072
muscle (dry) @ 1354
muscle (ethanol) @ 80
muscle (ethanol-fixed) @ 76
muscle (frozen) @ 39549

The existence of "muscle (frozen)" (what I'd assume to be "normal") makes me wonder what plain ol' "muscle" is.

That query actually returns....

UAM@ARCTOS> select part_name || ' @ ' || count(*) from specimen_part where part_name like '%muscle%' group by part_name order by part_name;

PART_NAME||'@'||COUNT(*)
------------------------------------------------------------------------------------------------------------------------
blood serum, muscle (frozen) @ 8
heart, kidney, liver, muscle (frozen) @ 23
heart, liver, muscle (alcohol) @ 1
heart, liver, muscle (frozen) @ 10748
heart, muscle (frozen) @ 571
kidney, muscle (frozen) @ 2
liver, muscle (frozen) @ 577
muscle @ 1260
muscle (95% ethanol) @ 2658
muscle (DMSO) @ 47
muscle (RNAlater) @ 7072
muscle (dry) @ 1354
muscle (ethanol) @ 80
muscle (ethanol-fixed) @ 76
muscle (frozen) @ 39549
muscle, eye (frozen) @ 36
muscle, spleen @ 1
muscle, spleen (frozen) @ 15

.... a bunch of "compound parts" - ideally we'd have "muscle" and "eye" which just happen to be in the same container rather than existing as a mixed part, but again that would rely on a more complete usage of containers.

And just to make sure it stays on the radar, if I search for "rib" I should also get things which contain ribs - "whole organism" and "skeleton" and ....
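One way to picture the "search for rib, also get things that contain ribs" requirement is a lookup of which parts contain which others. This is only a minimal sketch: the `CONTAINS` table and its entries are made up for illustration and are not Arctos data or code.

```python
# Hypothetical containment table: each part maps to the parts it directly contains.
CONTAINS = {
    "whole organism": ["skeleton", "muscle"],
    "skeleton": ["rib", "skull"],
}

def parts_containing(target, contains=CONTAINS):
    """Return the target plus every part that directly or indirectly contains it."""
    found = {target}
    changed = True
    while changed:  # keep sweeping until no new containing part turns up
        changed = False
        for parent, children in contains.items():
            if parent not in found and any(c in found for c in children):
                found.add(parent)
                changed = True
    return found

print(sorted(parts_containing("rib")))
```

A search for "rib" would then be run against every name in the returned set rather than against the literal string alone.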

@dustymc
Contributor Author

dustymc commented Jan 5, 2017

see also #991

@dustymc dustymc mentioned this issue Jan 5, 2017
@Jegelewicz
Member

I also do not like the combination of part with preservation. I am one without funds or capability to make use of containers, so I probably don't see the whole picture. That being said, from my perspective it would be nice to have a field for each part to tell potential users how it is preserved and just leave the part name to describe the part.

With regard to "fossil": In my mollusk collection I have both recent and "fossil" shells. From the standpoint of someone doing research, they might want to know the relative age of the specimens and the collection date wouldn't do that for the "fossil" specimens. In that case, perhaps instead of "wild caught" we could use a different collecting source to demonstrate their "fossilness"? As I've cataloged my paleo collection it has always seemed strange to call the specimens "wild caught" when "extracted from matrix" would be more appropriate. Of course this is still ambiguous as a freshly dead shell picked up off the beach could be given either collecting source. The other option is to add Geology data to any fossil specimens, thus telling users they are "fossil".

Don't know if that was helpful or not, but it's what I have right now... :-)

@dustymc
Contributor Author

dustymc commented Jan 6, 2017

Yes, very helpful, thanks. The more we all know about what everybody else is thinking, the more likely finding a good solution seems.

I keep leaning towards part attributes for both use cases.

Preservation method (storage environment, etc.) changes, and part attributes can handle that:

part=somepart
-- attribute=presmeth date=date1 value=whatever1
-- attribute=presmeth date=date2 value=whatever2
....
as many times as you need.

Basically I'm agreeing with you - confounding what a thing IS with what we've done to it can't be the "correct" approach from a data modeling standpoint. Part attributes are modeled as metadata of parts, and I think that lines up perfectly with those sorts of data.
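The "part plus dated attributes" idea can be sketched roughly as follows. The class and field names (`Part`, `PartAttribute`, `determined`) and the sample values are assumptions for illustration, not the Arctos schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class PartAttribute:
    name: str          # e.g. "presmeth"
    value: str
    determined: date   # when the assertion was made

@dataclass
class Part:
    part_name: str     # what the thing IS, nothing else
    attributes: list = field(default_factory=list)

    def current(self, name):
        """Most recently dated value for an attribute, or None if never recorded."""
        matches = [a for a in self.attributes if a.name == name]
        if not matches:
            return None
        return max(matches, key=lambda a: a.determined).value

part = Part("muscle")
part.attributes.append(PartAttribute("presmeth", "95% ethanol", date(2016, 5, 1)))
part.attributes.append(PartAttribute("presmeth", "frozen", date(2017, 1, 6)))
print(part.current("presmeth"))
```

The part name never changes; each new treatment is just another attribute row, and the full list remains available as history.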

(Parenthetical BUT: We used to have a lot more structure, and that caused a couple orders of magnitude more distinct values - much less discoverability - than we have now primarily because the distinction between things like preservation method and condition is very hard to define, so they get used interchangeably. And it still couldn't deal with eg, changing preservation methods. If we do add structure, it should be very targeted and unambiguous - nobody should have to guess which field might be most appropriate for some data.)

I don't like using metadata of the conceptual stuff (cataloged items - the things that get a specimen event type/collecting method, defined as "whatever some Curator felt like slapping a catalog number on") to attempt inferences regarding physical bits (parts). I can imagine lots of ways that trying to assert "fossilness" in collecting method or via geology would get complicated - I'm not sure how often float gets a geology determination, frozen critters that don't seem very fossil-ey to me sometimes do get that, stuff gets "collected" on ebay, etc., etc. (And FWIW collecting method is now NULLable - the ethnologists had strong opinions about wild-caught motorcycles and such - and we're always up for new/better vocabulary ideas, there or anywhere else.)

"Extracted from matrix" or "it's a fossil because I say so" or etc. (on date by person, optionally) fits in part attributes, and I think that's a much more direct assertion which would lead to much more predictable/discoverable/understandable data.

All that said, part attributes have usability issues - they're 2 big steps away from specimens, so eg, adding them to the bulkloader (6 extra fields * number of parts [currently 12*7 columns] * number of needed part attributes) is not going to be much fun to deal with, and flattening them out for things like specimenresults (eg, 10 parts each with 10 part attributes [each with 6 "columns"]) could get messy and unreadable very quickly.
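To put rough numbers on that, here is the arithmetic as a small sketch. The 6 fields per attribute and the 12 parts by 7 columns come from the comment above; the attribute counts tried in the loop are arbitrary.

```python
FIELDS_PER_ATTRIBUTE = 6   # "6 extra fields"
PARTS_IN_BULKLOADER = 12   # "currently 12*7 columns"
COLUMNS_PER_PART = 7

def extra_columns(attributes_per_part, parts=PARTS_IN_BULKLOADER):
    """Bulkloader columns added by flattening part attributes."""
    return FIELDS_PER_ATTRIBUTE * parts * attributes_per_part

existing = PARTS_IN_BULKLOADER * COLUMNS_PER_PART
for n in (1, 3, 10):  # arbitrary attribute counts for illustration
    print(f"{n} attributes per part: {existing} existing part columns + {extra_columns(n)} new")
```

The same formula covers the specimenresults case: `extra_columns(10, parts=10)` is the "10 parts each with 10 part attributes" example.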

And that's not so different from the situation which led us to denormalize container data into parts - even if you do have everything containerized and barcoded, it's a long and expensive trip from specimens to parts then up a container tree until you find something frozen and .... - "part like %frozen%" is just MUCH easier to interact with. I'd rather have weird and redundant data than a perfect model that nobody can use!

@campmlc

campmlc commented Jan 6, 2017 via email

@dustymc
Contributor Author

dustymc commented Mar 28, 2017

I don't think "fossil" is that much different from "ethanol" or "frozen".

Maybe....

What concentration of ethanol? Fixed or stored? How many changes of solution? By whom? When? Frozen how? At what temp? For how long? How quickly after demise? Freeze/thaw history?

That all fits in container environment, and so the 'frozen' (pickled, whatever) in part name can be seen as an indication that there is/might be more data elsewhere - it's a convenience. You don't have to use the "more data" bits, but it is available and so it's not necessary to embed that information into part strings. To treat fossils the same way, I'd like to see something roughly equivalent to container environment - some way of fully expressing WHY someone thought it was a fossil. Maybe part attributes is sufficient.

(The details of frozen-ness could be handled in part attributes as well, but not elegantly. Containers let you update 50K parts by recording the temperature of a freezer, for example.)
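The "record one freezer temperature, update every part inside" behavior might look something like this. It is a sketch under assumptions: the `Container` class, its fields, and the sample reading are hypothetical and say nothing about how Arctos actually stores containers.

```python
class Container:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.environment = []  # (when, parameter, value) readings on this container

    def record(self, when, parameter, value):
        self.environment.append((when, parameter, value))

def environment_for(container):
    """Collect readings from this container and every container above it."""
    readings = []
    node = container
    while node is not None:
        readings.extend((node.label, *r) for r in node.environment)
        node = node.parent
    return readings

freezer = Container("freezer 1")
rack = Container("rack A", parent=freezer)
tubes = [Container(f"tube {i}", parent=rack) for i in range(3)]

# One reading on the freezer is visible from every tube inside it.
freezer.record("2017-03-28", "temperature C", -80)
print(environment_for(tubes[0]))
```

Nothing is written to the tubes themselves; the data is found by walking up the tree, which is why one reading can cover any number of parts.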

every possible part ... additional value

Yup.

UAM@ARCTOS> select count(distinct(part_name)) from ctspecimen_part_name;

COUNT(DISTINCT(PART_NAME))
--------------------------
		       939

One untested model revision is a dictionary: have a list of terms

  • left
  • right
  • shell
  • frozen

and let people put them together however they want as part names. That MIGHT let us be more precise (how many of us add to the code tables when we find a formalin-fixed left eyeball?) with fewer terms ("heart" appears once, rather than in 46 - really! - combinations), wouldn't allow scull (creative spelling) as a part, but it would set us up to find "frozen left right" parts cataloged. It lacks predictability - there's no finite set of part terms associated with that model. A thousand part strings probably don't look terribly finite to most users either....
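A minimal sketch of that dictionary check, assuming part names are simply space-separated terms. The `TERMS` set here is a tiny illustrative sample, not the real code table.

```python
TERMS = {"left", "right", "shell", "frozen", "heart", "eye"}  # illustrative sample only

def unknown_terms(part_name, terms=TERMS):
    """Return the words in a proposed part name that are not dictionary terms."""
    return [word for word in part_name.split() if word not in terms]

for name in ("frozen left eye", "scull", "frozen left right"):
    print(name, "->", unknown_terms(name) or "accepted")
```

This shows both sides of the trade-off: "scull" is rejected, but "frozen left right" passes because every word is a valid term and nothing checks how they combine.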

@campmlc

campmlc commented Mar 28, 2017 via email

@dustymc
Contributor Author

dustymc commented Mar 28, 2017

I will strongly resist any efforts to resuscitate preservation method. I'm certainly open to different models, but chucking the thing that made giant messes (~10K unique "part strings" with presmeth-->~100 without) right back in there in an effort to fix a relatively small mess doesn't really make much sense to me. It's also wholly incapable of doing its one job, as you pointed out above - what is the presmeth for a formalin-->ethanol-->oh crud-->better ethanol-->freezer-->oops-->colder freezer pathway? (That all fits nicely in container environment, much of it could be automated, and it's normalized so updates affect many parts.)

That aside, I'm thinking one dictionary, as many parts as you want.

heart, kidney, lung, spleen would be a viable part. (So would/is heart + kidney + lung + spleen as four parts in one container - you can fix that mess now if you want to.)

I suppose we'd have to make "95%" a term, so "95% ethanol" could be constructed. (I'd rather just say "ethanol" and use container environment.)

"formalin-fixed, ethanol-preserved, currently-frozen, heart, liver, eyeball, lung, spleen" could be constructed, if someone insists.

"95% frozen" would also be a valid (and perhaps occasionally accurate...) part name; I don't see how to add "grammar" controls to this, it is a less-structured model which will demand a bit more care from operators (and that may be a fatal flaw).

I'm not sure how the UI would work - that would take some experimentation, there are lots of things that might be technically feasible, hopefully some of them are also usable.

I grabbed unique "part terms" (space-split current data).

-- scratch table: one row per space-delimited piece of every part name
create table temp_pt (t VARCHAR2(255));

declare
  v_tab     parse_list.varchar2_table;
  v_nfields integer;
begin
  for r in (select distinct part_name from ctspecimen_part_name) loop
    -- parse_list is a helper package not shown in this thread; here it splits part_name on spaces into v_tab
    parse_list.delimstring_to_table (r.part_name, v_tab, v_nfields, ' ');
    for i in 1..v_nfields loop
      insert into temp_pt(t) values(v_tab(i));
    end loop;
  end loop;
end;
/
-- collapse to the distinct terms
create table temp_ptu as select distinct(t) from temp_pt;

temp_ptu.csv.zip

There are 479 terms, including at least a few dozen very obvious duplicates (photo, photograph; section(s), sectioned, sections - should we clean some stuff up now?). 50% fewer choices! (And infinitely more combinations...)
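For anyone who wants to repeat the term count outside the database, the same space-split can be approximated in a few lines. This is a sketch, not the code that produced the 479: the sample names are taken from the muscle query earlier in the thread, and it leaves punctuation attached to each piece, which may or may not match what the `parse_list` helper does.

```python
def unique_terms(part_names):
    """Split each part name on single spaces and return the distinct pieces."""
    terms = set()
    for name in part_names:
        terms.update(piece for piece in name.split(" ") if piece)
    return terms

sample = ["muscle (frozen)", "heart, muscle (frozen)", "muscle (95% ethanol)"]
print(sorted(unique_terms(sample)))
```

Pieces such as "heart," and "(frozen)" keep their commas and parentheses, so a real cleanup pass would likely need to strip those before comparing terms.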

@campmlc

campmlc commented Mar 28, 2017 via email

@dustymc
Contributor Author

dustymc commented Mar 28, 2017

We'd need to alter data entry form

No, that's one of the powerful things about the container environment model. Just record the "environment" (ethanol concentration in a jar, freezer temp, room humidity, whatever) of a container and that data become available to the specimens in that container. If you have a probe which can talk to the Internet (eg, freezer temp log) I could set up an API for it to talk to. (And that could lead to things like "your freezer is melting" alerts.)
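The "your freezer is melting" idea reduces to a threshold check over logged readings. The sketch below is purely hypothetical: no such Arctos API exists in this thread, and the function name, the -60 C default, and the sample log are all invented for illustration.

```python
def freezer_alerts(readings, max_temp_c=-60.0):
    """Return the (timestamp, temperature) readings warmer than the threshold."""
    return [(when, temp) for when, temp in readings if temp > max_temp_c]

# Hypothetical probe log: (timestamp, temperature in C)
log = [
    ("2017-03-28T08:00", -80.0),
    ("2017-03-28T09:00", -79.5),
    ("2017-03-28T10:00", -41.0),
]

for when, temp in freezer_alerts(log):
    print(f"your freezer is melting: {temp} C at {when}")
```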

clean up

The part code table is http://arctos.database.museum/info/ctDocumentation.cfm?table=CTSPECIMEN_PART_NAME - I'm happy to SQL-merge stuff or whatever.

@campmlc

campmlc commented Mar 28, 2017 via email

@dustymc
Contributor Author

dustymc commented Mar 28, 2017

@campmlc none of that need be explicitly recorded in relation to any specific part. When the part is scanned into a container with an environment (not necessarily directly - "part-->tube-->position-->.....-->freezer [{temp}@{time} recorded by {agent}]" works), that environmental data is accessible from the part. (Forms will surely need to be developed if this gets used, but you can get there now by clicking part history and browsing up the container tree.)

Arctos maintains container history, so when you pull the tube out of the freezer and scan it into ethanol you still don't need to do anything extra - you can see that the tube was {there} which has {environmental history} and on {date} was moved {here} which has {more environment} etc. by following around existing linkages (which again could be summarized however we want).
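Following those linkages amounts to asking "which container held this tube on a given date?". A rough sketch, assuming the move history is available as a date-sorted list; the `moves` data and the function name are hypothetical.

```python
from bisect import bisect_right

# Hypothetical move history for one tube, sorted by ISO date: (date moved, container).
moves = [
    ("2015-06-01", "LN2 dewar"),
    ("2015-07-15", "freezer 1"),
    ("2017-03-28", "ethanol jar 12"),
]

def container_on(day, history=moves):
    """Return the container holding the tube on the given ISO date, or None."""
    dates = [moved for moved, _ in history]
    i = bisect_right(dates, day)  # number of moves on or before that day
    return history[i - 1][1] if i else None

print(container_on("2016-01-01"))
```

Once the container for a date is known, its environmental readings for that period can be looked up without anything extra having been recorded on the part.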

Nothing (except $ perhaps!) prevents you from starting that process by slapping a battery-powered temp logger on the LN2 dewar you take to the field.

Arctos just pulls together the data about parts (from data entry), location (from scanning stuff), and environment - you're already using containers, the only thing you need to do to use this is to record container metadata in Arctos, and you probably already have that information - "none of the jars on {shelf} looked particularly funky on {date}" is useful information, even if it's not quite as precise as we'd all like.

@dustymc
Contributor Author

dustymc commented Apr 14, 2017

Doesn't look like we're going to find a quick solution, de-escalating priority a bit

@dustymc dustymc added Priority-High (Needed for work) High because this is causing a delay in important collection work.. and removed Priority-Critical (Arctos is broken) Critical because it is breaking functionality. labels Apr 14, 2017
@campmlc

campmlc commented Sep 15, 2017

discussed denormalizing parts to have fixation and preservation as part attributes, which can be added iteratively as parts are transferred to different environments.

@dustymc
Contributor Author

dustymc commented Sep 15, 2017

#1119 (comment)

MSB prefers (1), and the obvious place to "do something weird" is in part attributes. "Sorta stinky, but frozen again" would be recorded as multiple Attributes:

part_name=muscle

  • part_attribute "preservation method (or whatever)"=frozen (optionally by PERSON on DATE etc.)
  • part_attribute "preservation method (or whatever)"=thawed (optionally by PERSON on DATE etc.)
  • part_attribute "preservation method (or whatever)"=stinky (optionally by PERSON on DATE etc.)
  • part_attribute "preservation method (or whatever)"=frozen (optionally by PERSON on DATE etc.)

A "combined history display value" could be auto-generated - eg, the above example could display as "part_name=muscle (frozen, thawed, stinky, frozen)." (Details or uncombined data would be available from the partdetail specimen results column, edit forms, and probably the parts grid on specimendetail.)
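Generating that display value is a one-liner over the attribute values in recorded order. A sketch only; the function name is an assumption.

```python
def display_value(part_name, attribute_values):
    """Build a combined display string such as 'muscle (frozen, thawed)'."""
    if not attribute_values:
        return part_name
    return f"{part_name} ({', '.join(attribute_values)})"

print(display_value("muscle", ["frozen", "thawed", "stinky", "frozen"]))
```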

No model changes are necessary. New part attributes are likely necessary (code table addition), and we may want to control vocabulary for some attributes (would require app development).

This approach would probably also require a bulkloader/data entry update to include part attributes, and possibly a display adjustment.

A complete implementation would involve normalizing part name, so our current 18 parts containing "muscle" might become one ("muscle") and a bunch of Attributes.

Those 18 parts include things like 'heart, muscle (frozen)', which could (now, and it's always been possible) be two parts in the same container. You'd need to update both parts' attributes when something happens to the tube.

@campmlc

campmlc commented Sep 16, 2017 via email

@dustymc
Contributor Author

dustymc commented Sep 16, 2017

change current data entry

This change would require duplicating even more data than we do now - saying the same thing multiple places - and I'd expect that to be apparent in the entry tools. We can certainly make the forms better than they are now, but creating more data is ultimately going to be a more complex process.

parts autocreated separately

Something like that might be possible, but I'd expect it to reduce initial data quality and add work to the approval process, which may or may not be a good trade-off. Error logs suggest we already struggle with one controlled vocabulary, I think you're suggesting multiple instances of multiple vocabularies in the same "field" (how many ways are there to say "formalin-fixed ethanol-preserved heart, kidney, lung, spleen that we keep in the freezer"?).

automate ... bulkloaders

That's technically trivial but has data quality implications - eg, you could "approve" data which you never see by approving the specimen record. This deserves its own issue.

pop-up

That's how part attributes works now?

@dustymc
Contributor Author

dustymc commented Sep 26, 2017

A possible solution to a few of these problems:

  • add one new field to the specimen bulkloader, JSON_PARTS
  • Adjust the parts grid on the data entry screens, or perhaps add an alternative form if there's some reason to keep "the old way" as an option
  • add a JSON parser to the server-side bulkloader to deal with the new "field."

The popup-form (or sub-form or whatever - there's lots of flexibility in presentation) could be infinitely expandable:

[Screenshot (2017-09-26): mockup of the expandable parts form]

so 500 parts each with 500 attributes works, and it would be easy to add more part-stuff (container-stuff, for example).

Most users would not need to know about any of this - the parts grid would just have some new possibilities.

The form-data would be compressed into a string for transport, so negligible effect on the bulkloader.

The JSON would be available in the normal place as part of the (potential) specimen record, so no need to blindly "approve" things you may not bother looking at (eg, parts in the parts bulkloader linked to specimens by local unique IDs).

"View JSON in a form" links could be scattered around wherever they're useful (eg, specimenresults/partdetail).

JSON is a Standard, so converting your locally-produced data (eg, spreadsheet with columns "part_name_17" and "part_name_17_attribute_value_23" (=17 parts, at least one of them having 23 attributes) into standard JSON should be straightforward (and Arctos could provide a service).
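As an illustration of that conversion, the sketch below turns one spreadsheet row with `part_name_N` and `part_name_N_attribute_value_M` columns into JSON. The column patterns follow the example in the comment; the output shape (a list of parts, each with an `attributes` list) is an assumption, since no JSON_PARTS format has been defined here.

```python
import json
import re

PART = re.compile(r"^part_name_(\d+)$")
ATTR = re.compile(r"^part_name_(\d+)_attribute_value_(\d+)$")

def row_to_json(row):
    """Convert one spreadsheet row (dict of column -> value) to a JSON string."""
    parts = {}
    for column, value in row.items():
        if not value:
            continue  # skip empty cells
        m = PART.match(column)
        if m:
            entry = parts.setdefault(int(m.group(1)), {"part_name": None, "attributes": {}})
            entry["part_name"] = value
            continue
        m = ATTR.match(column)
        if m:
            entry = parts.setdefault(int(m.group(1)), {"part_name": None, "attributes": {}})
            entry["attributes"][int(m.group(2))] = value
    out = []
    for n in sorted(parts):  # keep parts and attributes in column-number order
        attrs = parts[n]["attributes"]
        out.append({
            "part_name": parts[n]["part_name"],
            "attributes": [{"value": attrs[k]} for k in sorted(attrs)],
        })
    return json.dumps(out)

row = {
    "part_name_1": "muscle",
    "part_name_1_attribute_value_1": "frozen",
    "part_name_1_attribute_value_2": "thawed",
    "part_name_2": "heart",
    "part_name_2_attribute_value_1": "",
}
print(row_to_json(row))
```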

The parser (takes the JSON string, creates parts+attributes+whatever) should be relatively straightforward, and isn't anything that users need to be concerned with.

@campmlc

campmlc commented Dec 11, 2017

We need to move this forward for the GGBN grant. Dusty, if this were implemented, what would the interface look like? The JSON string in part detail in specimen results is not very pretty, and not something we can ask students to come up with. How hard to put something together in test for us to look at?

@dustymc
Contributor Author

dustymc commented Dec 11, 2017

I think there are two things here.

  1. For GGBN we need a way of addressing tissue quality. I think the verbiage in the proposal contains my assumption that we're going to use container environment - things like freezer temperature - to do so. We seem to be heading in a different direction, so ya'll need to develop protocols and vocabulary - I see no model or major interface changes in that approach. I'm happy to help with the vocab however I can, but ya'll know what you have and what you can tolerate at data entry and what your users need and etc.

  2. If we're denormalizing, then we'll need to move more stuff around with every part, and smooshing it all into a compact transport protocol like JSON is occasionally a convenient way of doing so.

JSON is beautiful, and anyone who says otherwise should be sentenced to another decade of XML!

what would the interface look like

#1020 (comment) is one possibility.

People can't type JSON; JSON must be generated.

I don't think ANYTHING in Arctos has a "the interface." These are presumably data you'd want to capture in the field, so one interface might be Excel (at least until we can build an app). If we're committed to this approach I can develop WHATEVER as ya'll need it, but I don't think there's going to be any sort of demo that's more informative than the screenshot above or the current edit parts form (it generates data which could easily be converted to JSON).

JSON is a transport mechanism. Instead of ~80 bulkloader columns for each part (that covers 10 attributes, which would almost certainly quickly become limiting anyway) there'd be one column into which you could stuff however many parts each with however many attributes you want (eg, by clicking "save" on some app that looks like the screenshot mockup). Arctos would just add a JSON unroller wherever those data might land, and once they're unrolled they work like all other normalized data.
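A JSON "unroller" in this sense could be as small as the sketch below, which turns one JSON string into flat part rows and attribute rows. The payload shape, key names, and the locally assigned `part_id` are all assumptions for illustration; nothing here is the actual Arctos parser.

```python
import json

def unroll(json_parts):
    """Turn a JSON parts string into (part_rows, attribute_rows)."""
    part_rows, attribute_rows = [], []
    for part_id, part in enumerate(json.loads(json_parts), start=1):
        part_rows.append({"part_id": part_id, "part_name": part["part_name"]})
        for attr in part.get("attributes", []):
            # each attribute row points back at its part, like a foreign key
            attribute_rows.append({"part_id": part_id, **attr})
    return part_rows, attribute_rows

payload = (
    '[{"part_name": "muscle", "attributes": '
    '[{"type": "preservation", "value": "frozen", "date": "2017-09-26"}]}, '
    '{"part_name": "heart"}]'
)
parts, attrs = unroll(payload)
print(parts)
print(attrs)
```

One string in, normalized rows out: the number of parts and attributes is limited only by the payload, not by the number of bulkloader columns.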

You don't have to see JSON for any of that to happen.

JSON is also sometimes a convenient way to display complex data in a simple format, which is all that's going on with specimenresults/partdetail. I'm happy to do something else there, I just need to know what ya'll want to see/how you want to see it.

Maybe we need an all-hands meeting dedicated to "preservation method"? As a replacement for container environment this is a major change in direction, and I'm not sure how effectively that's being communicated.

@campmlc

campmlc commented Dec 11, 2017 via email

@dustymc
Contributor Author

dustymc commented Dec 11, 2017

I don't think we need to get rid of container environment, but I don't think we currently have the resources to develop both, and I think having two ways of getting at the same thing would be a major usability issue. Eg if attributes doesn't get us where we want to be we'd probably need another proposal to further develop/integrate containers, document that mixed approach, etc.

I can initially provide GGBN with part condition, and it will be easy to add/replace/adjust that as we begin supplementing those data with part attributes. I don't think this will be a replacement, which would require "translating" existing part condition (and remarks and wherever else these types of data have been recorded). The data we have is just what we have for "legacy" tissues (eg, those collected before today), GGBN has provided a framework for going forward.

@dustymc
Contributor Author

dustymc commented Mar 4, 2020

@dustymc dustymc closed this as completed Mar 4, 2020
3 participants