magic taxonomy relationships #757
Find part of them with...
Should we create a "potential alternate spelling" (better name? Should probably be distinguish between "pattern looks similar" and "a person asserts") relationship, populate it with these, and set a trigger to autorelate new (potential) gender variations? |
See also http://arctos.database.museum/info/reviewAnnotation.cfm?ANNOTATION_GROUP_ID=1343. Two names exist, one of them may be a "literature artifact," nobody is going to immediately invest the time to track down the three-or-more relevant publications. A "these are probably the same" relationship type (eg, something explicitly less authoritative than "synonym of") would be very useful. |
Sounds like a can of worms to me. I'm not sure but doing this properly would require programming in Latin & Greek equivalencies so Arctos would be able to guess correctly which epithets are merely different versions of the same thing. I expect someone somewhere is already working on this problem. In the meantime I think it would be a better use of time & help solve this issue to improve the synonymy (name-relationship) tools so users can more easily assert 2 or more names are synonyms and specify which is the valid name. (Currently, for example, if I assert name X is the senior synonym of name Y but forget to specify that name Y is the junior synonym of name X the relationship is half-formed. It would be better if it auto-completed the other half). It would also be nice if when someone was say searching for an ID to add to a specimen Arctos provided more than just a list of names - but also a little information about the names like if they were valid or not. |
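A minimal sketch of the "auto-complete the other half" idea, assuming relationships are stored as (subject, relationship, object) triples. The `INVERSE` table and the relationship labels are hypothetical illustrations, not the actual Arctos code table.

```python
# Hypothetical map from a relationship to its reciprocal. The labels are
# assumptions for illustration, not values taken from Arctos.
INVERSE = {
    "senior synonym of": "junior synonym of",
    "junior synonym of": "senior synonym of",
    "synonym of": "synonym of",
}


def with_reciprocals(relationships):
    """Return the input triples plus any missing reciprocal triples."""
    completed = set(relationships)
    for subject, relationship, obj in relationships:
        inverse = INVERSE.get(relationship)
        if inverse is not None:
            # Adding to a set is a no-op when the other half already exists.
            completed.add((obj, inverse, subject))
    return completed


half_formed = [("Name X", "senior synonym of", "Name Y")]
for triple in sorted(with_reciprocals(half_formed)):
    print(triple)
```

Relationships without an entry in the table are passed through unchanged, so a one-directional relationship type would not get a made-up reciprocal.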
If they are and they've shared with GlobalNames, then users can find specimens by either variant and it's not much of a problem. (But I know of no such initiatives, I can't recall seeing those sorts of data from GlobalNames, and most "taxon projects" seem focused on "current" or similar. I'd really love to be wrong....)
To do that we'd have to move relationships to, or closer to, classifications. I don't THINK that's an insurmountable obstacle, but there are some details that would need considered.
I strongly agree - but see #735; we have some cleaning to do before we can get there.
See #756 - I think clearing up relationships (and possibly moving them closer to classifications) would be useful (necessary?) here too. |
agreed! |
I don't have anything substantive to add here, really, other than to emphasize that these issues are important to our collection (UAM Herbarium) as well. It's a bit of a mess when specimens of the same taxon get entered under several different names. Anyone searching for specimens by scientific name isn't going to get all of them unless they do substantial research to figure out all the potential names. I just changed 6 records recorded as Veronica wormskioldii (with an "i") to (what is nearly universally considered the correct transliteration) V. wormskjoldii (with a "j"). I'd rather not have to do it again a year from now. It's easy to just hit the first name that comes up on data entry rather than looking carefully at them. And not everyone doing data entry might know which one is considered valid. But we do have to retain the option of using currently unaccepted names when entering type specimens and literature citations. It would just be nice to have clues as to what current conventions are. Sometimes these names change back and forth. The accepted name for the arctic larkspur went from Delphinium brachycentrum in Hulten 68 to D. chamissonis in Hulten 73 back to D. brachycentrum in the Flora North America volume. Another example is that Thlaspi arcticum in Hulten 68 is now Noccaea arctica. I often forget which name is currently accepted for Alaskan specimens, so it would be nice to be able to give myself (and whoever is doing data entry) a clue. For us at least, we would like to have a fairly well-established list of accepted names for the Alaskan flora and would like clues in the taxonomic choices that appear during data entry to help us enter names consistently. However, we also commonly get specimens from other places (eastern Canada, Chukotka, Greenland) so need access to other names for those. I'm sorry but I just don't understand the internal working of Arctos well enough to have any suggestions as to how all this could be accomplished. Thanks! |
closing for consolidation with #1136 |
Reopening and reprioritizing. There are a bunch of recent names (not created by the WoRMS scripts) that look a lot like 'synonyms' to me. I'm hesitant to use "synonym of" because these would be a different type of data - existing relationships are "some person says..." where these would be "some script, based only on string patterns, thinks that it's possible...." I suspect there are three things in these data:
That would help users get where they want to be, possibly help us detect and eliminate data which is just wrong, and occasionally lead to false positives in certain search results. Here are some recent examples that caught my eye.
I'm not sure how many of those scripts might detect, or how many existing variations there might be in Arctos; this might not work at all. All of those could be valid, but it would probably take way more resources than we're likely to have (I think this is Teresa helping a new collection) to really tell on an individual basis. I was going to suggest some sort of multiple-review process, but that's probably just twice as many people who don't have the resources to do anything about it.... All of these are "valid" according to at least one of the sources in our validator (wikidata, globalnames, marinespecies, eol, gbif). That may be sufficient reason to include them in Arctos even if they're unambiguously wrong; they're used elsewhere and so people are likely to search by them. I don't think these data are outliers - this is about what I'd expect when any big blob of data comes in.
Here's the last ~20 days of new taxonomy, excluding names from WoRMS. https://docs.google.com/spreadsheets/d/19deCk0WSdJ4x7IjrrpPlfOFrZ7UUybtmUWFdXnBEYrI/edit?usp=sharing |
FWIW, most of these names were taken from Hymenoptera Online as I attempted to fill in needed taxa for wasps I am bulkloading. I have no idea if the "valid" assertion - which comes directly from HOL is good, bad, or indifferent; but I elected to trust it for the purposes of creating so much taxonomy AND I generally did not load any taxa that were labeled as "invalid" by HOL (may or may not be a good thing) because I did not want to go back and create all of the "synonym" relationships, especially since many of the accepted names are also not in Arctos. I think that any way we can facilitate search is useful, so the fuzzy matches are good. A suggestion for the new relationship: "potential spelling variation" |
https://docs.google.com/spreadsheets/d/1YmCqv0aeVa5JpMqvlYPpfng9YCBHczpjfzDF1rmEgA0/edit?usp=sharing PLEASE let me know of any problems in that, particularly things that are in the spreadsheet and do not need flagged as "potential spelling variation." And please let me know if there's any pattern I haven't detected. This is essentially a cartesian join and will have to be heavily throttled - it's going to take ~weeks to run, I'd like to get it as correct as possible as early as possible. Things not in the spreadsheet: "Higher taxonomy" variations. I intend to ignore these for the first pass; I'm not sure we want to push all 500 ...ii subspecies to ...i and vice versa, but I'm also not sure I see a less-evil alternative.
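One possible way to shrink the cartesian join, sketched under the assumption that only epithets vary and the genus is spelled identically in both names: group names by genus first ("blocking") and compare only within each group. This is a swapped-in technique for illustration, not how the Arctos scripts are known to work, and `pairs_within_genus` is a hypothetical helper. It would miss variants in the genus itself.

```python
from collections import defaultdict
from itertools import combinations


def pairs_within_genus(names):
    """Yield each unordered pair of names that share the same first word."""
    by_genus = defaultdict(list)
    for name in names:
        genus = name.split(" ", 1)[0]
        by_genus[genus].append(name)
    for group in by_genus.values():
        # Compare only inside the genus instead of against every other name.
        yield from combinations(sorted(group), 2)


names = [
    "Abarenicola claparedi",
    "Abarenicola claparedii",
    "Baeckea carnosula",
]
for pair in pairs_within_genus(names):
    print(pair)
```

Each candidate pair would still need to go through whatever similarity checks are chosen; this only reduces how many pairs are examined.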
Are these detectable variations or just jumbly things that happen to share a few characters? Baeckea carnosula (possible check: same characters, different order) Bactrocera neocognata (possible check: one extra character) Bactrocera nigella (possible check: same root) |
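The three "possible checks" above can be sketched roughly as follows. This is an illustration only: the function names, the same-genus requirement, and the `min_root` threshold are assumptions, not anything taken from the Arctos scripts.

```python
from collections import Counter


def same_letters(a, b):
    """True when two different epithets use exactly the same letters."""
    return a != b and Counter(a) == Counter(b)


def one_char_apart(a, b):
    """True when one epithet is the other with a single character removed."""
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = sorted((a, b), key=len)
    # Try deleting each character of the longer string in turn.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def shared_prefix_len(a, b):
    """Length of the common leading substring."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def classify(name_a, name_b, min_root=4):
    """Return a label for the pattern linking two names, or None."""
    genus_a, _, ep_a = name_a.partition(" ")
    genus_b, _, ep_b = name_b.partition(" ")
    if genus_a != genus_b or not ep_a or not ep_b or ep_a == ep_b:
        return None
    if same_letters(ep_a, ep_b):
        return "same characters, different order"
    if one_char_apart(ep_a, ep_b):
        return "one extra character"
    if shared_prefix_len(ep_a, ep_b) >= min_root:
        return "same root"
    return None


print(classify("Baeckea carnosula", "Baeckea carnulosa"))
print(classify("Bactrocera neocognata", "Bactrocera neocongnata"))
print(classify("Bactrocera nigra", "Bactrocera nigrescens"))
```

The "same root" check is by far the loosest of the three and would pair up many names that are probably distinct taxa, so it may be better treated as a reporting aid than as grounds for a relationship.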
I like this approach.
- shall I try to detect minor string variations? (DLM vote: yes)
- if so, what should we do with them? (DLM vote: new relationship)
…On Fri, Mar 8, 2019 at 1:32 PM dustymc ***@***.***> wrote:
1.
This should be implemented such that it can be run against new names,
or can create the relationships as the name is created. (I think we *DO
NOT* want to prevent new names - they're useful for search - we just
want to flag them as they're introduced. Are we in agreement on that?)
2.
HELP!! I scrolled through the first couple thousand names that start
with A or B and grabbed stuff that jumped out at me.
https://docs.google.com/spreadsheets/d/1YmCqv0aeVa5JpMqvlYPpfng9YCBHczpjfzDF1rmEgA0/edit?usp=sharing
PLEASE let me know of any problems in that, particularly things that are
in the spreadsheet and do not need flagged as "potential spelling
variation."
And please let me know if there's any pattern I haven't detected. This is
essentially a cartesian join and will have to be heavily throttled - it's
going to take ~weeks to run, I'd like to get it as correct as possible as
early as possible.
Things not in the spreadsheet:
"Higher taxonomy" variations. I intend to ignore these for the first pass;
I'm not sure we want to push all 500 ...ii subspecies to ...i and
vice versa, but I'm also not sure I see a less-evil alternative.
***@***.***> select scientific_name from taxon_name where scientific_name like 'Abarenicola claparedi%';
SCIENTIFIC_NAME
------------------------------------------------------------------------------------------------------------------------
Abarenicola claparedi
Abarenicola claparedi oceanica
Abarenicola claparedi vagabunda
Abarenicola claparedii
Are these detectable variations or just jumbly things that happen to share
a few characters?
Baeckea carnosula
Baeckea carnulosa
(possible check: same characters, different order)
Bactrocera neocognata
Bactrocera neocongnata
(possible check: one extra character)
Bactrocera nigella
Bactrocera nigra
Bactrocera nigrescens
Bactrocera nigrescentis
Bactrocera nigricula
Bactrocera nigrifacies
Bactrocera nigrita
Bactrocera nigrivenata
Bactrocera nigrofemoralis
Bactrocera nigroscutata
Bactrocera nigrotibialis
Bactrocera nigrovittata
(possible check: same root)
|
Here are potentially related names from a first run. Please scroll around a bit - should any of these not be linked, did I miss anything obvious, etc. I think some of these are probably NOT related, but I don't know how to avoid that and still get the ones that obviously are, and the only effect should be the occasional false positive in search results. If nobody stops me in a few days I'll create a new relationship and, when these pairs are not already related, link them. New relationship: potential alternate spelling=Possible spelling variation detected by automation. |
60K names to scroll through - not sure I'll catch anything. Seems like the best thing would be to wait for someone to tell us they are definitely NOT spelling variations, then remove the relationship and probably add a "not the same as" relationship so that scripts don't undo the work? |
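A rough sketch of how a "not the same as" relationship could keep a later script run from undoing human review, assuming candidate, existing, and rejected pairs are all available as pairs of name strings. `links_to_create` and its arguments are hypothetical names for illustration.

```python
def pair_key(a, b):
    """Order-independent key so (a, b) and (b, a) count as the same pair."""
    return frozenset((a, b))


def links_to_create(candidates, existing, rejected):
    """Return candidate pairs that are neither already related nor rejected."""
    blocked = {pair_key(a, b) for a, b in existing}
    blocked |= {pair_key(a, b) for a, b in rejected}
    seen = set()
    result = []
    for a, b in candidates:
        key = pair_key(a, b)
        if a == b or key in blocked or key in seen:
            continue
        seen.add(key)
        result.append((a, b))
    return result


candidates = [
    ("Acartophthalmus nigrinus", "Acartophthalmus nigrina"),
    ("Acacia wightii", "Acacia wightiana"),
]
# A human has asserted that this pair is "not the same as" each other.
rejected = [("Acacia wightiana", "Acacia wightii")]
print(links_to_create(candidates, existing=[], rejected=rejected))
```

Because the key ignores order, a rejection recorded in either direction blocks the pair.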
Yea, the problem with that is currently all relationships are functionally identical. I guess I'm surprised we made it this far with that! I don't think there's much of anything too funky in there, but sometimes something will jump out for new eyeballs. |
Since I no longer use the Arctos source, I probably shouldn't comment, but I would be very cautious. When I sort alphabetically, the majority of the names are plants in my garden waiting for spring and I have no idea how their taxonomy works. Dusty, can you split these into Arctos and Arctos Plants so we know where to concentrate our efforts? But I did find a few mollusks and some were ok to link (one was invalid - usually misspelled but in WoRMS as a misspelled taxon - and one was valid). So the linkage only helps if it also says which is valid and which is invalid. Do any of these have Taxon Status? Also, in my experience, most of the Mollusca misspellings came from UCMP so it would be helpful to know the source. Lastly, the author is very important as shown in the first example. Some of them should not be linked.
So I would start by wanting the author if there is one and if it isn't the same on both taxa, then I wouldn't be comfortable linking them. The higher classification should also be the same at least one level above (e.g. family).
Again, what is the author? These are two different genera in different families and are totally unrelated.
These are both in WoRMS, so perhaps start by running all of these against WoRMS and adding at least whether it's valid or invalid and that it's in WoRMS so there's more context and a more meaningful link.
Unlike Dusty, I don't like having totally erroneous taxa in Arctos even if it's somewhere out there on the Internet (fake taxonomy?). Despite lots of training and instructions to copy directly from WoRMS into Arctos during data entry, we always had a few errors pop up when invalid taxa autofilled. If I were still using Arctos, I would use this list to find misspelled taxa and delete them rather than linking them to possibly valid taxa, but it looks like a lifetime's work. I picked out one plant taxon at random and found this: Acacia wightii = Acacia wightii Graham ex Wight & Arn. This is a synonym of Albizia amara (Roxb.) Boivin Here's the last one Tulipa grisebachii -Tulipa grisebachii Borbás is a synonym of Tulipa sylvestris subsp. Sylvestris would be linked to So these are both junior synonyms of the same species but they are different to start with. Source http://www.theplantlist.org/tpl/record/kew-309073. Maybe we should load their list? |
I'm really looking for "scripts are [not] completely busted" and not this level of detail at this time. I am looking only at names. I think the definition ("Possible spelling variation detected by automation.") covers the possibility that eg, Acacia wightii and Acacia wightiana are bycatch and have nothing to do with each other beyond some common characters. "Synonyms" are a problem mostly when they're obscure. StudentA used Acartophthalmus nigrinus so now it's in their autofill, StudentB uses Acartophthalmus nigrina so now it follows them around, a user searches one or the other, presumes they've found everything, goes away - that can (and did - it's why we're here) happen for a long time. This fixes that - users find what they're looking for (from "any taxon") with this relationship. (And maybe some stuff they're not looking for, but that's ~less-evil.) Hopefully with time "we" will add better relationships (possibly including Teresa's "these have nothing to do with each other") and indicate preference and all that jazz, but that's beyond the scope of what I'm trying to accomplish right now. Baby steps! I like "fake taxonomy" when it DOES SOMETHING. We could do more with relationships ("horrible mangling of"?) and/or UI (exclude those from various views, maybe), but a user copying a name from some generally-trusted source and then not being able to use it to find specimens in Arctos because we've invested extra work in making our data more obscure just seems evil to me. If you can get CSV or something like it from anyone, I'm happy to help load it. |
taxonomy committee: go! |
I agree that finding more than one wants is less bad than finding less, but
it's a stop-gap. We need to do better, eventually. Unfortunately, to do so
requires lots of human, rather than machine labor.
…-Derek
On Wed, Mar 20, 2019 at 1:07 PM dustymc ***@***.***> wrote:
taxonomy committee: go!
--
+++++++++++++++++++++++++++++++++++
Derek S. Sikes, Curator of Insects
Professor of Entomology
University of Alaska Museum
1962 Yukon Drive
Fairbanks, AK 99775-6960
dssikes@alaska.edu
phone: 907-474-6278
FAX: 907-474-5469
University of Alaska Museum - search 400,276 digitized arthropod records
http://arctos.database.museum/uam_ento_all
<http://www.uaf.edu/museum/collections/ento/>
+++++++++++++++++++++++++++++++++++
Interested in Alaskan Entomology? Join the Alaska Entomological
Society and / or sign up for the email listserv "Alaska Entomological
Network" at
http://www.akentsoc.org/contact_us <http://www.akentsoc.org/contact.php>
|
First pass at this is running in production; closing for now. |
Given Acartophthalmus nigrinus & Acartophthalmus nigrina and similar, can we detect that those might be gender variations and automagic in (waffly - "potential alternate spelling" or something?) relationships to facilitate search?
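A minimal sketch of what "detect that those might be gender variations" could look like, assuming names share a genus and differ only in a Latin adjectival ending. The ending groups below are a small illustrative subset and an assumption; real Latin and Greek equivalencies are more involved.

```python
# Illustrative, incomplete groups of interchangeable Latin gender endings.
# These are assumptions for the sketch, not a rule set taken from Arctos.
GENDER_GROUPS = (("us", "a", "um"), ("is", "e"))


def split_ending(epithet, group):
    """Return (stem, ending) if the epithet ends with one of the group's endings."""
    for ending in sorted(group, key=len, reverse=True):
        if epithet.endswith(ending) and len(epithet) > len(ending):
            return epithet[: -len(ending)], ending
    return None


def possible_gender_variants(name_a, name_b):
    """True when two same-genus names differ only by endings from one group."""
    genus_a, _, ep_a = name_a.partition(" ")
    genus_b, _, ep_b = name_b.partition(" ")
    if genus_a != genus_b or not ep_a or not ep_b:
        return False
    for group in GENDER_GROUPS:
        a = split_ending(ep_a, group)
        b = split_ending(ep_b, group)
        if a and b and a[0] == b[0] and a[1] != b[1]:
            return True
    return False


print(possible_gender_variants("Acartophthalmus nigrinus", "Acartophthalmus nigrina"))
print(possible_gender_variants("Bactrocera nigra", "Bactrocera nigrita"))
```

A match here only says the strings fit a gender-ending pattern; it says nothing about which name, if either, is valid.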