Sex - curation before uploading first vocabulary version #83
I'd like to work on this one, have requested access. |
As we define this vocabulary, Isabel (DanBIF) writes:
|
Nice dataset indeed :) But to the question, I'd say no. I've seen similar cases happen with sex / reproductive condition / life stage. Back to castes, I understand that there is no specific field to capture "worker" right now, and that we don't want to lose information. However, misusing another term (like sex) does not help solve the problem. An alternative would be to try to find community support for creating a new "caste" term in Darwin Core. Otherwise, if just using Simple Darwin Core, "worker" would be best shared under dwc:dynamicProperties, encoded as key:value. (and yes, sadly a bit obscured...) |
I will begin preparing this vocabulary for production. |
|
I have removed mapping of a sex + ? to the sex, as I would interpret this as unknown/best guess, and it is now mapped to unknown instead. |
Regarding comment #83 (comment): the argumentation in the comment is valid, but I really think it is a pity to lose information on caste for datasets on Hymenoptera. How about "redefining" the term "Sex" to be "Sex/Caste"? Then we could include "worker" as a valid value. |
We would want to keep the vocabularies strict so that each accounts for only one specific term, to avoid confusion for publishers as well as users. Values that refer to e.g. 'queen', 'worker', 'drone' etc. are present in the sex field in 0.34% of the datasets and in 0.05% of occurrences, so it is not a prevalent issue. It would still be possible to get the values from the verbatim data if you carry out a full download. Caste systems/eusociality and other sorts of hierarchical social structures extend beyond Hymenoptera and Insecta, and it might be interesting to try to capture the variation in a DwC term (although for practical purposes it may be too complex and specific to capture in a standard). For now, I will map queen = female, but workers are not necessarily females (e.g. in termites), so they will not be interpreted as female (worker would still be in the data as the verbatim sex). I am not aware of whether the term "drone" is used beyond Hymenoptera(?), but I will leave it unmapped for sex to avoid any misinterpretation. |
@DanBIF A proposal has been made for a term "caste" in Darwin Core. It would be good to lend support for the term in this issue. |
Ok @jhnwllr over to you. We have >3,500 hidden values for four concepts so please only
I have not added any new concepts, but have decided to translate VertNet's |
I missed this and have accounted for all the numbers after all... they are the reason for the high number of hidden values.
|
The vocabulary is ready for you to check now @jhnwllr. Please use the |
@CecSve this is my review. Feel free to disagree or question my comments. might mean MALE https://en.wikipedia.org/wiki/XO_sex-determination_system I think all of these should be Unknown
Why is this marked Unknown? Seems fair to call this Female This might mean Male https://en.wikipedia.org/wiki/Hemipenis This should be Unknown I bet this should be Female. (later these are all marked as Female)
Should be Unknown. Should be Unknown. Previously such combinations were Male. Should be Unknown. I think this means Adult Male. Should be Female. Should be Unknown. All Exx ex ex. ect have to be some system... I think this is Male. This should be Mixed. Interesting that this one is considered Female but other similar are marked Unknown. Should be Mixed or Unknown I bet this stands for female gonads or something. Should Hermaphrodite There are a lot of the bracketed reclassifications. I think we need some standard way of dealing with them. Others with question mark have been marked as unknown, but these are marked as male. I think this should be marked Unknown Should be mixed These are all marked as Unknown but in other places similar are marked as Male or Female.
Should be Female. |
In my experience the parenthetical and bracketed entries often signify uncertainty. Whether uncertainty should result in "unknown" or in the suggested probable interpretation should be, at the very least, consistent. |
Be aware that a null interpreted Sex field can be populated based on data in dynamicProperties gbif/pipelines#478 - both interpretations use the same vocabulary to map |
Thank you for the thorough breakdown @jhnwllr - I will go through them one by one. Based on discussions with the NAOC group, the I will let you know if I have any questions. |
After consulting with the NAOC work group a while back, I have switched the |
In cases where there is no information on sex, e.g. only values relating to age: |
Non-binary sexes will be mapped to the concept |
The |
The vocabulary is now uploaded to PROD: https://registry.gbif.org/vocabulary/Sex |
When the vocabulary is ready to be implemented in the pipeline, the following clean-up of the verbatim values should be carried out before mapping the values: remove trailing, remove within text string, remove (leading)numbers (not zeros) recode:[] and {} and {] and [} to () |
I was taking a look at the hidden values that we have and I think we need to redefine the cleanup rules. We have hidden labels that contain
So if we apply the suggested cleanup they won't match with any concept. For example, a verbatim value Also, if there are several rules to apply it's important to consider the order. For example, removing leading or trailing @CecSve what are the cases that we want to solve with the cleanup so I can get a better understanding of what we need to clean? |
Thank you for checking! I was unsure whether the clean-up made 100% sense (I redid it 3 times). I see the two comments I made here are conflicting #83 (comment) and #83 (comment). Let us stick to this (probably need to remap some values so the vocabulary is consistent with what users see, though it won't affect interpretation):
I wonder if it is worth me going over it all first with these rules and checking to see if it still makes sense with the clean-up suggested in #83 (comment)? I can't find my script from last time, but then I can share it with you this time? |
@CecSve There are conventions in Mammalogy and probably in other disciplines by association that parentheses signify uncertainty, so strings that contain them should be mapped to indeterminate also. The square brackets are different, as the convention for that is to signify information that was not recorded originally, but that does not carry with it the uncertainty of the parentheses, so that one should be fine to interpret by dropping the square brackets. |
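The convention described in this comment could be sketched as a small pre-cleaning step. This is only an illustration in Python, not the pipeline's actual code: the function name `preclean_sex` is hypothetical, and `indeterminate` is assumed to be the target concept, as it is named elsewhere in this thread.

```python
import re


def preclean_sex(verbatim: str) -> str:
    """Apply the bracket conventions to a verbatim sex value (illustrative only)."""
    # Parentheses signal uncertainty, so the whole value becomes indeterminate.
    if "(" in verbatim or ")" in verbatim:
        return "indeterminate"
    # Square brackets mark information that was not recorded originally;
    # drop the brackets and keep their content for the normal lookup.
    without_brackets = re.sub(r"[\[\]]", "", verbatim)
    # Collapse any whitespace left behind by the removal.
    return " ".join(without_brackets.split())
```

Under these assumptions, `preclean_sex("(male)")` would yield `indeterminate`, while `preclean_sex("[female]")` would yield `female` and then go through the ordinary vocabulary lookup.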
Thanks for the reminder John! You have mentioned this to me before and I forgot. I will correct the suggestion. I have still opted to have |
It's important to remember that the mappings should be in the labels so the interpretation is transparent and there are no rules in the Java code that can change the mappings of the labels. The only thing we can do is to clean up the values before the interpretation so we don't have to create a hidden label for each possible case. But the cleanup should be for characters that don't have any value. For example, in
And we can't add a hidden label for every number. Therefore, I can't do things like this because it's a mapping done in java that doesn't take the vocabulary into account:
Removing the brackets is fine, although I see that we have many hidden labels that contain brackets, so it doesn't feel right to me. If we are not sure about the need for the cleanup, I suggest not doing it and adding it at a later point if we see it's necessary. |
I agree. I have decided to take one last look to see if any cleanup would make sense. So I am remapping everything and will share the JSON with the steps I took. |
@marcos-lg I think I am missing something. If you can clean up by removing things like square brackets, why can you not clean up by substituting "indeterminate", which is in the vocabulary, for anything with the patterns @CecSve mentions? |
My concern is that the cleanup was intended to remove characters that don't have any significant value but a substitution might override the labels that are present in the vocabulary. And I see that we have hidden labels like |
To put it in context what I am preparing with the example you gave.
I do this to have a shorter list than 3000+ verbatim values to map, leaving me with approximately half the amount of verbatim values to map. I will upload the original sheet and the JSON for the cleaning steps soon. |
@marcos-lg I have attached the edits I made in OpenRefine using GREL in JSON format. I haven't attached the full history of what I did since it is only relevant for standardizing concepts and hidden values before vocabulary mapping. We end up with 1,447 verbatim values mapped to 5 concepts. If you think the following makes sense, then I will update the sex vocabulary based on what I just did:
Does this plan make sense, please? I hope this makes it more transparent what type of cleanup leads to constructing a vocabulary. |
What does this expression mean? We can try with that cleanup. I still see many hidden labels that won't be used (for example all the labels that contain numbers) but I checked some in prod and don't seem to be used anymore. |
This tells me that the approach is not right. The method isn't sacred, the result is. So what is the goal? I would have thought that the goal is to do the best matching with the least vocabulary maintenance. I know vocabulary maintenance is important, or GBIF would not have decided to discard hidden labels that apply to very few records or very few data sets. @CecSve also mentions it in this issue. So what happens when "male, maybe, sort of?" starts getting published and isn't in the list of hidden values? The proposed approach means that "male, maybe, sort of?" has to be in the hidden values and mapped, or it will not be improved. If processing turned everything with a '?' in it into "indeterminate", every appropriate improvement would be made immediately without any reliance on vocabulary maintenance. That seems like a serious win to me. Again, if I misunderstand something, my apologies. |
AFAIK the goal was to move the interpretation from Java code (enums and parsers) to the vocabulary so that it's users who decide how values should be mapped. That's why hardcoding that rule of replacing ? with indeterminate breaks this approach. One thing we can do is to extend the vocabulary to allow regular expressions (in the hidden labels or as a new field), although this will bring the case where 2 regular expressions in 2 different concepts might overlap (right now this doesn't happen because I check that the labels are unique within the vocabulary). |
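To make the overlap concern concrete, here is a minimal Python sketch of how overlapping regular expressions in two concepts might be detected against a sample of verbatim values. Everything in it is an assumption for illustration: the function `find_overlaps`, the example patterns and the sample values are hypothetical and are not taken from the vocabulary or the vocabulary server. It only finds overlaps that show up in the sample; it does not prove two patterns can never overlap.

```python
import re
from collections import defaultdict


def find_overlaps(patterns_by_concept, sample_values):
    """Return {value: sorted concepts} for values matched by more than one concept."""
    hits = defaultdict(set)
    for concept, patterns in patterns_by_concept.items():
        compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
        for value in sample_values:
            # fullmatch: the pattern must describe the whole verbatim value.
            if any(rx.fullmatch(value) for rx in compiled):
                hits[value].add(concept)
    return {value: sorted(concepts) for value, concepts in hits.items() if len(concepts) > 1}


# Hypothetical patterns, not taken from the real vocabulary.
patterns = {
    "male": [r"m(ale)?\??"],
    "indeterminate": [r".*\?.*"],
}
overlaps = find_overlaps(patterns, ["male", "male?", "?"])
```

With these made-up patterns, only `male?` is claimed by both concepts, which is the kind of conflict the current uniqueness check on plain labels rules out.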
Ok, I understand. It just seemed that if you were doing things in code (which it still seems like you are), why not do something even more useful? Regular expressions in the vocabulary would be interesting, but would be beyond most vocabulary maintenance participants.
if re.search(r'.*[\?\(].*', input_string): return "indeterminate". :-)
…On Mon, Sep 2, 2024 at 4:28 AM Marcos Lopez Gonzalez < ***@***.***> wrote:
AFAIK the goal was to move the interpretation from java code(enums and
parsers) to the vocabulary so it's users who decide how values should be
mapped. That's why hardcoding that rule of replacing ?with indeterminate
breaks this approach.
One thing we can do is to extend the vocabulary to allow regular
expressions (in the hidden labels or as a new field).
|
Yeah, that's true. To be honest, I'm not 100% sure that adding the cleanup in the code was the best decision, but it was convenient at that time. I'll give some thought to the regular expressions to see if we can allow them without introducing many more problems. |
Since we already have the lookup endpoint in UAT, I did a test with the verbatim values of this vocabulary (e.g. https://api.gbif-uat.org/v1/vocabularies/Sex/concepts/lookup?q=macho). I took the verbatim values from prod with this query:
I tried the lookup without any cleanup and I got these results: Matches: 301 Then I played with the suggested cleanups in #83 (comment) (trimming and the case are not needed since they are handled by the lookup service in the API):
So the best cleanup was by removing these characters:
Then I took a look at the hidden labels and I noticed that we were missing some combinations for the Having those labels I ran the lookup by using a cleanup that only removes numbers and the results were: Matches: 1645 The verbatim values that weren't mapped were these:
We could add some other hidden labels to fix some of those cases too. There are also many hidden labels that are not needed because the cleanup takes care of those cases such as:
but we can leave them there since they don't do any harm and we don't spend time on that. I did this because the less cleanup we do the better: we have less logic in the Java code, and changes in the Java code take us more time than changes in the labels. @CecSve @tucotuco if you agree with this, I'll create the hidden labels in prod and start integrating this vocabulary into the pipelines interpretation with the cleanup that removes numbers only. |
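A numbers-only cleanup followed by a label lookup could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the pipeline's Java code: `strip_numbers`, `lookup` and the tiny `HIDDEN_LABELS` table are hypothetical stand-ins, and the lowercasing only imitates the case handling that the lookup service in the API is said to do.

```python
import re

# Hypothetical hidden labels; the real ones live in the GBIF Sex vocabulary.
HIDDEN_LABELS = {
    "macho": "male",
    "hembra": "female",
}


def strip_numbers(verbatim: str) -> str:
    """Remove digits and collapse the whitespace they leave behind."""
    no_digits = re.sub(r"\d+", "", verbatim)
    return " ".join(no_digits.split())


def lookup(verbatim: str, labels=HIDDEN_LABELS):
    """Clean the value, then match it against the labels; None means no match."""
    key = strip_numbers(verbatim).lower()
    return labels.get(key)
```

The point of keeping the cleanup this small is that every actual mapping stays visible in the labels: `lookup("1 Macho")` resolves only because `macho` is a label, and a value with no label simply stays unmatched.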
I agree that these are moves in the right direction. This may not be the place for it, but out of the BDQ Task Group 2, we believe that the right model for controlled vocabularies is to have a controlled community vetting of all published values without arbitrary cutoffs by popularity. Everything is transparent that way, with nothing hidden in code. I believe @ArthurChapman had the chance to express this view at TDWG in Japan this year. |
As discussed with @CecSve and @ymgan at TDWG2024 in Okinawa, the operation of tests against the GBIF Vocabularies in the proposed BDQ Core works better if we can test against as many terms as possible, especially where these can be linked as synonyms. As mentioned by @marcos-lg (#83 (comment)) above, we would strongly support the suggestion of @tucotuco of retaining all published values (except for those cleaned up, such as numbers, etc.). This would apply to all the vocabularies maintained by GBIF. |
Arctos code tables have |
Here is a file to edit: https://drive.google.com/file/d/1qyBLQnpLyF3qXlNJ2oP0QN6uoHbIbhxd/view?usp=sharing
It contains:
NB: We can make sure that the matching doesn't take into account any numbers for the values. For example, there is no need to match 1 Macho | 1 Hembra and 1 Macho | 2 Hembra; the data can simply be Macho | Hembra. We just need to notify @marcos-lg before he does the import.
Please check instructions here: #70