magic taxonomy relationships #757

Closed
dustymc opened this issue Sep 22, 2015 · 19 comments
Labels
Function-Taxonomy/Identification; Priority-High (Needed for work): high because this is causing a delay in important collection work.

Comments

@dustymc
Contributor

dustymc commented Sep 22, 2015

Given Acartophthalmus nigrinus & Acartophthalmus nigrina and similar, can we detect that those might be gender variations and automagically add (waffly: "potential alternate spelling" or something?) relationships to facilitate search?

@dustymc dustymc added this to the Needs Discussion milestone Sep 22, 2015
@dustymc
Contributor Author

dustymc commented Sep 24, 2015

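Find some of them with:

-- Binomials (exactly one space) where one name ends in -us, the other in -a,
-- and the two are otherwise identical, eg nigrinus / nigrina.
select
    a.scientific_name,
    b.scientific_name
from
    taxon_name a,
    taxon_name b
where
    regexp_count(a.scientific_name,' ')=1 and
    regexp_count(b.scientific_name,' ')=1 and
    a.scientific_name!=b.scientific_name and
    a.scientific_name like '%us' and
    b.scientific_name like '%a' and
    -- compare the stems: drop the 2-character "us" and the 1-character "a"
    -- (Oracle treats a start position of 0 as 1; 1 is used here for clarity)
    substr(a.scientific_name,1,length(a.scientific_name)-2)=substr(b.scientific_name,1,length(b.scientific_name)-1)
;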

Should we create a "potential alternate spelling" (better name? Should probably distinguish between "pattern looks similar" and "a person asserts") relationship, populate it with these, and set a trigger to autorelate new (potential) gender variations?
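For anyone who wants to try the pattern outside the database, here is a minimal Python sketch of the same -us/-a check. The function names are made up for illustration, and the input is assumed to be a plain list of scientific names rather than the taxon_name table. Indexing the -a names by stem avoids comparing every name against every other name.

```python
def is_us_a_variant(name_a: str, name_b: str) -> bool:
    """True if both names are binomials that differ only by a final -us versus -a."""
    # Exactly one space, mirroring regexp_count(scientific_name, ' ') = 1.
    if name_a.count(" ") != 1 or name_b.count(" ") != 1:
        return False
    if not (name_a.endswith("us") and name_b.endswith("a")):
        return False
    # Compare the stems: drop "us" from one name and "a" from the other.
    return name_a[:-2] == name_b[:-1]


def find_us_a_pairs(names):
    """Return (us_name, a_name) pairs found in a list of names."""
    # Index every binomial ending in -a by its stem (hypothetical helper structure).
    by_stem = {n[:-1]: n for n in names if n.count(" ") == 1 and n.endswith("a")}
    return [
        (n, by_stem[n[:-2]])
        for n in names
        if n.count(" ") == 1 and n.endswith("us") and n[:-2] in by_stem
    ]


pairs = find_us_a_pairs(["Acartophthalmus nigrina", "Acartophthalmus nigrinus"])
print(pairs)
```

This only covers the one ending pair named above; other gender endings would each need their own rule.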

@dustymc
Contributor Author

dustymc commented Nov 8, 2016

See also http://arctos.database.museum/info/reviewAnnotation.cfm?ANNOTATION_GROUP_ID=1343. Two names exist, one of them may be a "literature artifact," nobody is going to immediately invest the time to track down the three-or-more relevant publications. A "these are probably the same" relationship type (eg, something explicitly less authoritative than "synonym of") would be very useful.

@Jegelewicz

@DerekSikes

DerekSikes commented Nov 8, 2016

Sounds like a can of worms to me. I'm not sure, but doing this properly would require programming in Latin & Greek equivalencies so Arctos would be able to guess correctly which epithets are merely different versions of the same thing. I expect someone somewhere is already working on this problem. In the meantime I think it would be a better use of time, and would help solve this issue, to improve the synonymy (name-relationship) tools so users can more easily assert 2 or more names are synonyms and specify which is the valid name. (Currently, for example, if I assert name X is the senior synonym of name Y but forget to specify that name Y is the junior synonym of name X, the relationship is half-formed. It would be better if it auto-completed the other half.) It would also be nice if, when someone was searching for an ID to add to a specimen, Arctos provided more than just a list of names, with a little information about the names such as whether they are valid.
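To make the "auto-complete the other half" idea concrete, here is a small Python sketch. It assumes relationships can be modelled as (name, relationship, related name) tuples and that each relationship type has a known inverse. The INVERSE map and the function name are hypothetical, not the Arctos vocabulary or code.

```python
# Hypothetical inverse map; the real Arctos relationship vocabulary may differ.
INVERSE = {
    "senior synonym of": "junior synonym of",
    "junior synonym of": "senior synonym of",
}


def assert_relationship(relationships: set, name: str, rel: str, related: str) -> None:
    """Store (name, rel, related) and, when an inverse is known, the other half too."""
    relationships.add((name, rel, related))
    inverse = INVERSE.get(rel)
    if inverse is not None:
        relationships.add((related, inverse, name))


rels = set()
assert_relationship(rels, "X", "senior synonym of", "Y")
print(sorted(rels))
```

A relationship type with no entry in the map is stored one-way, so nothing is guessed for types whose inverse is not defined.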

@dustymc
Contributor Author

dustymc commented Nov 8, 2016

> I expect someone somewhere is already working on this problem.

If they are and they've shared with GlobalNames, then users can find specimens by either variant and it's not much of a problem. (But I know of no such initiatives, I can't recall seeing those sorts of data from GlobalNames, and most "taxon projects" seem focused on "current" or similar. I'd really love to be wrong....)

> can more easily assert 2 or more names are synonyms and specify which is the valid name.

To do that we'd have to move relationships to, or closer to, classifications. I don't THINK that's an insurmountable obstacle, but there are some details that would need to be considered.

> It would be better if it auto-completed the other half

I strongly agree - but see #735; we have some cleaning to do before we can get there.

> searching for an ID to add to a specimen

See #756 - I think clearing up relationships (and possibly moving them closer to classifications) would be useful (necessary?) here too.

@DerekSikes

agreed!

@AlanBatten

I don't have anything substantive to add here, really, other than to emphasize that these issues are important to our collection (UAM Herbarium) as well. It's a bit of a mess when specimens of the same taxon get entered under several different names. Anyone searching for specimens by scientific name isn't going to get all of them unless they do substantial research to figure out all the potential names. I just changed 6 records recorded as Veronica wormskioldii (with an "i") to (what is nearly universally considered the correct transliteration) V. wormskjoldii (with a "j"). I'd rather not have to do it again a year from now. It's easy to just hit the first name that comes up on data entry rather than looking carefully at them. And not everyone doing data entry might know which one is considered valid. But we do have to retain the option of using currently unaccepted names when entering type specimens and literature citations. It would just be nice to have clues as to what current conventions are.

Sometimes these names change back and forth. The accepted name for the arctic larkspur went from Delphinium brachycentrum in Hulten 68 to D. chamissonis in Hulten 73 back to D. brachycentrum in the Flora North America volume. Another example is that Thlaspi arcticum in Hulten 68 is now Noccaea arctica. I often forget which name is currently accepted for Alaskan specimens, so it would be nice to be able to give myself (and whoever is doing data entry) a clue. For us at least, we would like to have a fairly well-established list of accepted names for the Alaskan flora and would like clues in the taxonomic choices that appear during data entry to help us enter names consistently. However, we also commonly get specimens from other places (eastern Canada, Chukotka, Greenland) so need access to other names for those.

I'm sorry but I just don't understand the internal working of Arctos well enough to have any suggestions as to how all this could be accomplished.

Thanks!

@Jegelewicz
Member

closing for consolidation with #1136

@dustymc
Contributor Author

dustymc commented Mar 7, 2019

Reopening and reprioritizing. There are a bunch of recent names (not created by the WoRMS scripts) that look a lot like 'synonyms' to me.

I'm hesitant to use "synonym of" because these would be a different type of data - existing relationships are "some person says..." where these would be "some script, based only on string patterns, thinks that it's possible...."

I suspect there are three things in these data:

  1. "legitimate" variations - there are multiple "official" ways to spell the thing. An automated "waffly relationship" will help people searching for Acartophthalmus nigrinus find Acartophthalmus nigrina.

  2. Misspellings; one of them is just wrong, but it's close enough that the scripts can detect it. One of (Pompilus calcaratus accolens, Pompilus calcaratus accoleus) is PROBABLY wrong. (And that doesn't necessarily mean it should be excluded from Arctos if eg, it's in common usage - http://handbook.arctosdb.org/documentation/taxonomy.html.)

  3. Unrelated taxa that just happen to be similarly constructed. Maybe (Pompilus calcaratus accolens, Pompilus calcaratus accoleus) really are two taxa, so the relationship would be unwanted. I suspect this is rare, but it certainly exists.

A relationship like that would help users get where they want to be, possibly help us detect and eliminate data which is just wrong, and occasionally lead to false positives in certain search results.

Here are some recent examples that caught my eye.

2019-02-26 might_be_valid Pompilus calcaratus accolens Teresa J. Mayfield-Meyer    
2019-02-26 might_be_valid Pompilus calcaratus accoleus Teresa J. Mayfield-Meyer    
2019-02-24 might_be_valid Pepsis seifferti Teresa J. Mayfield-Meyer    
2019-02-24 might_be_valid Pepsis seiffertii Teresa J. Mayfield-Meyer    
2019-02-24 might_be_valid Pepsis purpurea Teresa J. Mayfield-Meyer    
2019-02-24 might_be_valid Pepsis purpureipes Teresa J. Mayfield-Meyer    
2019-02-24 might_be_valid Pepsis purpureus Teresa J. Mayfield-Meyer    
2019-02-26 might_be_valid Pompilus bilineatus Teresa J. Mayfield-Meyer    
2019-02-24 might_be_valid Pompilus bilunatus Teresa J. Mayfield-Meyer    
2019-02-26 might_be_valid Pompilus bilunulatus Teresa J. Mayfield-Meyer    
2019-02-24 might_be_valid Pompilus funebris Teresa J. Mayfield-Meyer    
2019-02-24 might_be_valid Pompilus funereus Teresa J. Mayfield-Meyer    
2019-02-24 might_be_valid Pompilus handlirchii Teresa J. Mayfield-Meyer    
2019-02-24 might_be_valid Pompilus handlirschi Teresa J. Mayfield-Meyer    
2019-02-24 might_be_valid Pompilus handlirschii Teresa J. Mayfield-Meyer    

I'm not sure how many of those scripts might detect, or how many existing variations there might be in Arctos; this might not work at all. All of those could be valid, but it would probably take way more resources than we're likely to have (I think this is Teresa helping a new collection) to really tell on an individual basis. I was going to suggest some sort of multiple-review process, but that's probably just twice as many people who don't have the resources to do anything about it....

All of these are "valid" according to at least one of the sources in our validator (wikidata, globalnames, marinespecies, eol, gbif). That may be sufficient reason to include them in Arctos even if they're unambiguously wrong; they're used elsewhere and so people are likely to search by them.

I don't think these data are outliers - this is about what I'd expect when any big blob of data comes in.

  • shall I try to detect minor string variations? (DLM vote: yes)
  • if so, what should we do with them? (DLM vote: new relationship)

Here's the last ~20 days of new taxonomy, excluding names from WoRMS.

https://docs.google.com/spreadsheets/d/19deCk0WSdJ4x7IjrrpPlfOFrZ7UUybtmUWFdXnBEYrI/edit?usp=sharing

@dustymc dustymc reopened this Mar 7, 2019
@dustymc dustymc added the Priority-High (Needed for work) High because this is causing a delay in important collection work.. label Mar 7, 2019
@Jegelewicz
Member

FWIW, most of these names were taken from Hymenoptera Online as I attempted to fill in needed taxa for wasps I am bulkloading. I have no idea if the "valid" assertion - which comes directly from HOL is good, bad, or indifferent; but I elected to trust it for the purposes of creating so much taxonomy AND I generally did not load any taxa that were labeled as "invalid" by HOL (may or may not be a good thing) because I did not want to go back and create all of the "synonym" relationships, especially since many of the accepted names are also not in Arctos.

I think that any way we can facilitate search is useful, so the fuzzy matches are good.

A suggestion for the new relationship: "potential spelling variation"

@dustymc
Contributor Author

dustymc commented Mar 8, 2019

  1. This should be implemented such that it can be run against new names, or can create the relationships as the name is created. (I think we DO NOT want to prevent new names - they're useful for search - we just want to flag them as they're introduced. Are we in agreement on that?)

  2. HELP!! I scrolled through the first couple thousand names that start with A or B and grabbed stuff that jumped out at me.

https://docs.google.com/spreadsheets/d/1YmCqv0aeVa5JpMqvlYPpfng9YCBHczpjfzDF1rmEgA0/edit?usp=sharing

PLEASE let me know of any problems in that, particularly things that are in the spreadsheet and do not need to be flagged as "potential spelling variation."

And please let me know if there's any pattern I haven't detected. This is essentially a Cartesian join and will have to be heavily throttled - it's going to take ~weeks to run, so I'd like to get it as correct as possible as early as possible.
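One hedged way to shrink the comparison is to compare names only within a genus instead of every name against every other name. This is not necessarily how the production scripts work. It assumes spelling variants share the first word of the name, so it would miss genus-level variants, and the function name is made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(names):
    """Yield pairs of names that share a genus, instead of all N*N combinations."""
    by_genus = defaultdict(list)
    for name in names:
        # Assumption: the first word of the name is the genus.
        by_genus[name.split(" ", 1)[0]].append(name)
    for group in by_genus.values():
        # Only names inside the same genus are compared with each other.
        yield from combinations(sorted(group), 2)


names = ["Pepsis seifferti", "Pompilus funebris", "Pepsis seiffertii", "Pompilus funereus"]
for a, b in candidate_pairs(names):
    print(a, "|", b)
```

Each candidate pair would still have to pass a pattern check before being linked; grouping only reduces how many pairs are looked at.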

Things not in the spreadsheet:

"Higher taxonomy" variations. I intend to ignore these for the first pass; I'm not sure we want to push all 500 ...ii subspecies to ...i and vice versa, but I'm also not sure I see a less-evil alternative.

UAM@ARCTOS> select scientific_name from taxon_name where scientific_name like 'Abarenicola claparedi%';

SCIENTIFIC_NAME
------------------------------------------------------------------------------------------------------------------------
Abarenicola claparedi
Abarenicola claparedi oceanica
Abarenicola claparedi vagabunda
Abarenicola claparedii

Are these detectable variations or just jumbly things that happen to share a few characters?

Baeckea carnosula
Baeckea carnulosa

(possible check: same characters, different order)

Bactrocera neocognata
Bactrocera neocongnata

(possible check: one extra character)

Bactrocera nigella
Bactrocera nigra
Bactrocera nigrescens
Bactrocera nigrescentis
Bactrocera nigricula
Bactrocera nigrifacies
Bactrocera nigrita
Bactrocera nigrivenata
Bactrocera nigrofemoralis
Bactrocera nigroscutata
Bactrocera nigrotibialis
Bactrocera nigrovittata

(possible check: same root)
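The three possible checks above can be sketched in Python as follows. This is an illustration under assumptions, not the script that was run: the function names are invented, names are treated as "Genus epithet" strings, and the "same root" check only reports the length of the shared epithet prefix, leaving the threshold to the caller.

```python
def split_binomial(name):
    """Split 'Genus epithet' into (genus, epithet); epithet is '' for a bare genus."""
    genus, _, epithet = name.partition(" ")
    return genus, epithet


def same_letters_different_order(a, b):
    """Same genus, and the epithets are rearrangements of the same characters."""
    genus_a, ep_a = split_binomial(a)
    genus_b, ep_b = split_binomial(b)
    return genus_a == genus_b and ep_a != ep_b and sorted(ep_a) == sorted(ep_b)


def one_extra_character(a, b):
    """The longer name is the shorter one with a single character inserted."""
    short, long_ = sorted((a, b), key=len)
    if len(long_) - len(short) != 1:
        return False
    # Try deleting each character of the longer name in turn.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def shared_root_length(a, b):
    """Length of the common epithet prefix; 0 when the genera differ."""
    genus_a, ep_a = split_binomial(a)
    genus_b, ep_b = split_binomial(b)
    if genus_a != genus_b:
        return 0
    n = 0
    for ca, cb in zip(ep_a, ep_b):
        if ca != cb:
            break
        n += 1
    return n


print(same_letters_different_order("Baeckea carnosula", "Baeckea carnulosa"))
print(one_extra_character("Bactrocera neocognata", "Bactrocera neocongnata"))
print(shared_root_length("Bactrocera nigella", "Bactrocera nigra"))
```

The shared-root check is the weakest of the three: many of the Bactrocera nigr... names are presumably distinct taxa, so a short shared prefix on its own is probably not enough to link two names.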

@campmlc

campmlc commented Mar 8, 2019 via email

@dustymc
Contributor Author

dustymc commented Mar 19, 2019

Here are potentially related names from a first run.

temp_p_d.csv.zip

Please scroll around a bit - should any of these not be linked, did I miss anything obvious, etc. I think some of these are probably NOT related, but I don't know how to avoid that and still get the ones that obviously are, and the only effect should be the occasional false positive in search results.

If nobody stops me in a few days I'll create a new relationship and, when these pairs are not already related, link them.

New relationship: potential alternate spelling=Possible spelling variation detected by automation.
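The "link them when not already related" step could look roughly like this in Python. It is a sketch only: relationships are modelled as a set of (name, relationship, related name) tuples, which is an assumption rather than the Arctos schema, and link_if_unrelated is a hypothetical name.

```python
POTENTIAL = "potential alternate spelling"  # relationship name proposed in this thread


def link_if_unrelated(relationships, pairs):
    """Add the automated relationship only for pairs with no relationship in either direction."""
    related = {frozenset((a, b)) for a, _, b in relationships}
    added = []
    for a, b in pairs:
        key = frozenset((a, b))
        if key not in related:
            relationships.add((a, POTENTIAL, b))
            related.add(key)
            added.append((a, b))
    return added


rels = set()
print(link_if_unrelated(rels, [("Pepsis seifferti", "Pepsis seiffertii")]))
# Running it again adds nothing, because the pair is now related.
print(link_if_unrelated(rels, [("Pepsis seiffertii", "Pepsis seifferti")]))
```

Because any existing relationship blocks the automated one, a relationship asserted by a person is never overwritten by the script in this sketch.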

@Jegelewicz
Member

60K names to scroll through - not sure I'll catch anything. Seems like the best thing would be to wait for someone to tell us they are definitely NOT spelling variations, then remove the relationship and probably add a "not the same as" relationship so that scripts don't undo the work?
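A rough Python sketch of that idea: a person's rejection replaces the automated link and is stored as its own relationship, and the script checks for it before linking again. Everything here is hypothetical, including the tuple model and the function names. It also assumes the "not the same as" relationship is treated differently from the others, which may not match how relationships currently behave.

```python
POTENTIAL = "potential alternate spelling"  # relationship name proposed in this thread
NOT_SAME = "not the same as"                # suggested above; hypothetical


def reject_link(relationships, a, b):
    """Replace an automated link between a and b with an explicit rejection."""
    relationships.discard((a, POTENTIAL, b))
    relationships.discard((b, POTENTIAL, a))
    relationships.add((a, NOT_SAME, b))


def may_autolink(relationships, a, b):
    """A rerun of the script should skip pairs that a person has rejected."""
    return (a, NOT_SAME, b) not in relationships and (b, NOT_SAME, a) not in relationships


rels = {("Name one", POTENTIAL, "Name two")}
reject_link(rels, "Name one", "Name two")
print(may_autolink(rels, "Name two", "Name one"))
```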

@dustymc
Contributor Author

dustymc commented Mar 19, 2019

> "not the same as" relationship

Yea, the problem with that is currently all relationships are functionally identical. I guess I'm surprised we made it this far with that!

I don't think there's much of anything too funky in there, but sometimes something will jump out for new eyeballs.

@sharpphyl

sharpphyl commented Mar 20, 2019

Since I no longer use the Arctos source, I probably shouldn't comment, but I would be very cautious. When I sort alphabetically, the majority of the names are plants in my garden waiting for spring and I have no idea how their taxonomy works. Dusty, can you split these into Arctos and Arctos Plants so we know where to concentrate our efforts?

But I did find a few mollusks and some were ok to link (one was invalid - usually misspelled but in WoRMS as a misspelled taxon - and one was valid). So the linkage only helps if it also says which is valid and which is invalid. Do any of these have Taxon Status? Also, in my experience, most of the Mollusca misspellings came from UCMP so it would be helpful to know the source. Lastly, the author is very important as shown in the first example.

Some of them should not be linked.

  1. Cerithidea varicosum Valenciennes, 1832 - invalid - now Cerithidea valida  (C. B. Adams, 1852) linked to
    Cerithidea varicosa Morch, 1876 – invalid- in ITIS as a synonym for Cerithidea pliculosa

So I would start by wanting the author if there is one and if it isn't the same on both taxa, then I wouldn't be comfortable linking them. The higher classification should also be the same at least one level above (e.g. family).
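As a sketch of what such a stricter filter might look like in Python: the Taxon record and comfortable_to_link are hypothetical, and two details are assumptions on my part rather than anything stated above. A missing author does not block a link, and a missing family does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Taxon:
    """Hypothetical record; not the Arctos data model."""
    name: str
    author: Optional[str] = None
    family: Optional[str] = None


def comfortable_to_link(a: Taxon, b: Taxon) -> bool:
    """Authors must match when both are known, and the families must be known and equal."""
    if a.author and b.author and a.author != b.author:
        return False
    return a.family is not None and a.family == b.family


a = Taxon("Acanthinula", family="Valloniidae")
b = Taxon("Acanthina", family="Muricidae")
print(comfortable_to_link(a, b))
```

Note that a strict author rule would also reject the Acanthocardium / Acanthocardia pair listed below as ok to link, since the two names have different authors, so the rule may need to be looser than this.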

  1. Acanthinula - a genus in the family Valloniidae
    linked to
    Acanthina, a genus in the family Muricidae

Again, what is the author? These are two different genera in different families and are totally unrelated.

  1. Acanthocardium Römer, 1865 in WoRMS - invalid - accepted as Acanthocardia J.E. Gray, 1851
    linked (ok) to
    Acanthocardia J.E. Gray, 1851 in WoRMS in the family Cardiidae. Wikipedia

These are both in WoRMS, so perhaps start by running all of these against WoRMS and adding at least whether it's valid or invalid and that it's in WoRMS so there's more context and a more meaningful link.

  1. Hipponix conicus - valid species - family Hipponicidae
    linked to
    Hipponix conica - totally invalid - should be deleted

  2. Hipponix foliaceus - invalid misspelling - should be deleted
    linked to
    Hipponix foliacea - WoRMS Hipponix foliacea Quoy & Gaimard, 1835 - valid

  3. Cymatium testudinaria - misspelling of either Cymatium testudinarium or of Ranularia testudinaria - a garbage taxon that should be deleted
    linked to
    Cymatium testudinarium (A. Adams & Reeve, 1850) AphiaID 211081 - invalid.  Accepted as Ranularia testudinaria (A. Adams & Reeve, 1850) AphiaID 476562

Unlike Dusty, I don't like having totally erroneous taxa in Arctos even if it's somewhere out there on the Internet (fake taxonomy?). Despite lots of training and instructions to copy directly from WoRMS into Arctos during data entry, we always had a few errors pop up when invalid taxa autofilled. If I were still using Arctos, I would use this list to find misspelled taxa and delete them rather than linking them to possibly valid taxa, but it looks like a lifetime's work.

I picked out one plant taxon at random and found this:

Acacia wightii = Acacia wightii Graham ex Wight & Arn.  This is a synonym of Albizia amara (Roxb.) Boivin
linked to
Acacia wightiana = Acacia wightiana Graham
Both are in the same family but are they really the same species?

Here's the last one

Tulipa grisebachii - Tulipa grisebachii Borbás is a synonym of Tulipa sylvestris subsp. sylvestris
would be linked to
Tulipa grisebachiana - Tulipa grisebachiana Pant. is a synonym of Tulipa sylvestris subsp. sylvestris

So these are both junior synonyms of the same species but they are different to start with.  Source http://www.theplantlist.org/tpl/record/kew-309073. Maybe we should load their list?

@dustymc
Contributor Author

dustymc commented Mar 20, 2019

I'm really looking for "scripts are [not] completely busted" and not this level of detail at this time. I am looking only at names. I think the definition ("Possible spelling variation detected by automation.") covers the possibility that eg, Acacia wightii and Acacia wightiana are bycatch and have nothing to do with each other beyond some common characters.

"Synonyms" are a problem mostly when they're obscure. StudentA used Acartophthalmus nigrinus so now it's in their autofill, StudentB uses Acartophthalmus nigrina so now it follows them around, a user searches one or the other, presumes they've found everything, goes away - that can (and did - it's why we're here) happen for a long time. This fixes that - users find what they're looking for (from "any taxon") with this relationship. (And maybe some stuff they're not looking for, but that's ~less-evil.) Hopefully with time "we" will add better relationships (possibly including Teresa's "these have nothing to do with each other") and indicate preference and all that jazz, but that's beyond the scope of what I'm trying to accomplish right now. Baby steps!

I like "fake taxonomy" when it DOES SOMETHING. We could do more with relationships ("horrible mangling of"?) and/or UI (exclude those from various views, maybe), but a user copying a name from some generally-trusted source and then not being able to use it to find specimens in Arctos because we've invested extra work in making our data more obscure just seems evil to me.

If you can get CSV or something like it from anyone, I'm happy to help load it.

@dustymc
Contributor Author

dustymc commented Mar 20, 2019

taxonomy committee: go!

@DerekSikes

DerekSikes commented Mar 20, 2019 via email

@dustymc
Contributor Author

dustymc commented Mar 27, 2019

First pass at this is running in production; closing for now.

7 participants