TG2-VALIDATION_SCIENTIFICNAME_FOUND #46

Open
iDigBioBot opened this issue Jan 5, 2018 · 23 comments
Labels
Conformance CORE TG2 CORE tests NAME Parameterized Test requires a parameter Test Tests created by TG2, either CORE, Supplementary or DO NOT IMPLEMENT TG2 Validation VOCABULARY

Comments

@iDigBioBot
Collaborator

iDigBioBot commented Jan 5, 2018

| TestField | Value |
|---|---|
| GUID | 3f335517-f442-4b98-b149-1e87ff16de45 |
| Label | VALIDATION_SCIENTIFICNAME_FOUND |
| Description | Is there a match of the contents of dwc:scientificName with the bdq:sourceAuthority? |
| TestType | Validation |
| Darwin Core Class | dwc:Taxon |
| Information Elements ActedUpon | dwc:scientificName |
| Information Elements Consulted | |
| Expected Response | EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:scientificName is bdq:Empty; COMPLIANT if there is a match of the contents of dwc:scientificName in the bdq:sourceAuthority; otherwise NOT_COMPLIANT |
| Data Quality Dimension | Conformance |
| Term-Actions | SCIENTIFICNAME_FOUND |
| Parameter(s) | bdq:sourceAuthority |
| Source Authority | bdq:sourceAuthority default = "GBIF Backbone Taxonomy" {[https://doi.org/10.15468/39omei]} {API endpoint [https://api.gbif.org/v1/species?datasetKey=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&name=]} |
| Specification Last Updated | 2023-09-17 |
| Examples | [dwc:scientificName="Eucalyptus camaldulensis": Response.status=RUN_HAS_RESULT, Response.result=COMPLIANT, Response.comment="dwc:scientificName found in the bdq:sourceAuthority"]<br>[dwc:scientificName="Capulus intort": Response.status=RUN_HAS_RESULT, Response.result=NOT_COMPLIANT, Response.comment="dwc:scientificName was not found in the bdq:sourceAuthority"] |
| Source | ALA |
| References | |
| Example Implementations (Mechanisms) | Kurator/FilteredPush sci_name_qc Library |
| Link to Specification Source Code | https://github.com/FilteredPush/sci_name_qc/blob/v1.1.2/src/main/java/org/filteredpush/qc/sciname/DwCSciNameDQ.java#L216 |
| Notes | The purpose of this test is to detect errors in the scientific name but is dependent on the abilities of the parsing of the bdq:sourceAuthority. For research users of biodiversity data doing quality assurance, VALIDATION_TAXON_UNAMBIGUOUS (4c09f127-737b-4686-82a0-7c8e30841590) handles their needs, but for curators of data sets doing quality control, this test provides a specific subset of targeted data cleaning, making it a valuable test to include for the quality control case. |
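
The Expected Response in the specification can be read as a small decision procedure. The following is only a sketch in Python, not the sci_name_qc implementation: the `lookup` callable, the `AuthorityUnavailable` exception and the `Response` class are hypothetical names introduced for illustration, and bdq:Empty is approximated as a missing or whitespace-only value.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class AuthorityUnavailable(Exception):
    """Hypothetical: raised by a lookup when the source authority cannot be reached."""


@dataclass(frozen=True)
class Response:
    status: str
    result: Optional[str]
    comment: str


def validation_scientificname_found(
    scientific_name: Optional[str],
    lookup: Callable[[str], bool],
) -> Response:
    """Evaluate the Expected Response clauses in the order they are written.

    `lookup` stands in for a query against the bdq:sourceAuthority (assumption).
    """
    # Approximation of bdq:Empty: missing or whitespace-only.
    if scientific_name is None or not scientific_name.strip():
        return Response("INTERNAL_PREREQUISITES_NOT_MET", None,
                        "dwc:scientificName is bdq:Empty")
    try:
        found = lookup(scientific_name)
    except AuthorityUnavailable:
        return Response("EXTERNAL_PREREQUISITES_NOT_MET", None,
                        "bdq:sourceAuthority is not available")
    if found:
        return Response("RUN_HAS_RESULT", "COMPLIANT",
                        "dwc:scientificName found in the bdq:sourceAuthority")
    return Response("RUN_HAS_RESULT", "NOT_COMPLIANT",
                    "dwc:scientificName was not found in the bdq:sourceAuthority")
```

Passing the authority in as a callable keeps the decision logic testable without network access; a real lookup would wrap whichever bdq:sourceAuthority is configured.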
@godfoder godfoder changed the title TG2-VALIDATION_SCIENTIFICNAME_NOTSTANDARD TG2-VALIDATION_POLYNOMIAL_NOTSTANDARD Jan 18, 2018
@ArthurChapman ArthurChapman added the Test Tests created by TG2, either CORE, Supplementary or DO NOT IMPLEMENT label Jan 18, 2018
@tucotuco tucotuco added the Parameterized Test requires a parameter label Nov 5, 2018
@chicoreus
Collaborator

The specification leaves the case of a mononomial scientific name undefined. It also doesn't explain what to do with information elements other than dwc:scientificName. This may be the VALIDATION_SCIENTIFICNAME_NOTFOUND that @Tasilee is looking for.

@Tasilee
Collaborator

Tasilee commented Jun 24, 2020

So, should this test be "VALIDATION_TAXON_NOTFOUND" to match #122, #77, #83, #22, #28, with Information elements dwc:scientificName and dwc:genus?

@chicoreus
Collaborator

@Tasilee I'm thinking this test takes just dwc:scientificName and is VALIDATION_SCIENTIFICNAME_NOTFOUND. TAXON_NOTFOUND would imply more terms, perhaps dwc:taxonID, but perhaps dwc:genus and higher. We do need to consider whether #101 needs additional companion tests that examine the TAXON; as phrased in the specification, this one currently does not.

@Tasilee
Collaborator

Tasilee commented Jun 24, 2020

I agree @chicoreus - I would be much happier with VALIDATION_SCIENTIFICNAME_NOTFOUND to match the other VALIDATIONS.

Regarding #101 (and #46): I am now wondering about the set of TAXON type tests we should conform to. For example, #123 and #70.

@ArthurChapman
Collaborator

See above where it was once called "TG2-VALIDATION_SCIENTIFICNAME_NOTSTANDARD" but was changed during our Gainesville discussions. Same with #45. Do you have some notes, @Tasilee, on the Gainesville discussion?

@Tasilee
Collaborator

Tasilee commented Jun 25, 2020

@ArthurChapman - I checked my Gainesville notes and I had nothing specific to this one or #45. Maybe related tests have changed leaving this orphaned?

@ArthurChapman
Collaborator

I have changed the wording of the notes to make them a little clearer

FROM: This test is not intended to detect errors of a taxonomic nature. The intent of this test is not to detect errors or inconsistencies in the format of the Authorship. For the purpose of this amendment, if the genus in the dwc:genus field does not match the genus of the polynomial, the genus of the polynomial takes precedence for standardization.

TO: The purpose of this test is to detect errors in spelling and typography only. It is not intended to detect errors of a taxonomic nature or to detect errors or inconsistencies in the format of the Authorship. For the purpose of this amendment, if the genus in the dwc:genus field does not match the genus of the polynomial, the genus of the polynomial takes precedence for standardization.

@Tasilee
Collaborator

Tasilee commented Jul 14, 2020 via email

@ArthurChapman
Collaborator

I have also deleted the last paragraph - as that refers to the AMENDMENT (#45)

@Tasilee Tasilee changed the title TG2-VALIDATION_POLYNOMIAL_NOTSTANDARD TG2-VALIDATION_SCIENTIFICNAME_NOTFOUND Aug 10, 2020
@Tasilee
Collaborator

Tasilee commented Aug 10, 2020

With the quorum of agreement (email response July 15, 2020), I have changed this test to match the higher taxonomic equivalents, e.g., #122 with the recognition that the utility of the test is heavily dependent on the abilities of the bdq:sourceAuthority.

@ArthurChapman
Collaborator

This one again overlaps with #82. If there was a workflow in which #82 was run before #46, then there would be no point in running #46 (cf. the similar comment under #101).

@chicoreus
Collaborator

@ArthurChapman no overlap: #82 tests for non-emptiness, which is on an axis of completeness; #46 tests for the value being found in an authority, and is thus on an axis of conformance. The framework treats the response status (internal prerequisites not met) and the response result (not compliant) as orthogonal concerns, and test order is not specified, to allow for implementations that run tests in parallel or implementations that run tests on distinct values (as in @tucotuco's SQL implementation in the paper).
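
To make the "distinct values" point concrete, here is a rough Python sketch, not taken from any existing implementation. It assumes a hypothetical `run_test` function that maps one dwc:scientificName value to a (status, result) pair; each distinct value is evaluated once, in no particular relation to other tests, and the response is mapped back onto every record.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# (Response.status, Response.result); result is None when the test did not run.
Response = Tuple[str, Optional[str]]


def run_on_distinct_values(
    values: Iterable[Optional[str]],
    run_test: Callable[[Optional[str]], Response],
) -> List[Response]:
    """Run `run_test` once per distinct value and fan the responses back out."""
    values = list(values)
    cache: Dict[Optional[str], Response] = {}
    for value in values:
        if value not in cache:
            cache[value] = run_test(value)
    return [cache[value] for value in values]


def toy_scientificname_found(value: Optional[str]) -> Response:
    """Toy stand-in for the test, using a one-entry in-memory 'authority'."""
    if value is None or not value.strip():
        # Status and result are separate: the test did not run, so no result.
        return ("INTERNAL_PREREQUISITES_NOT_MET", None)
    known = {"Eucalyptus camaldulensis"}
    return ("RUN_HAS_RESULT", "COMPLIANT" if value in known else "NOT_COMPLIANT")
```

Because each response depends only on the value, the same function gives the same answer whether or not an emptiness test has been run first.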

@ArthurChapman
Collaborator

@chicoreus - I have no problem with running all. Just saying that if you run #82 and it fails - i.e. the Scientific Name is EMPTY, then there would be no use running #42 (or #101) as they would both just return INTERNAL_PREREQUISITES_NOT_MET. But if all tests are standalone - and not part of a workflow, then it doesn't really matter.

ArthurChapman added a commit that referenced this issue Oct 6, 2020
In accord with #189 added test data for SCIENTIFICNAME_NOTFOUND #46
@Tasilee Tasilee changed the title TG2-VALIDATION_SCIENTIFICNAME_NOTFOUND TG2-VALIDATION_SCIENTIFICNAME_FOUND Mar 22, 2022
@Tasilee Tasilee removed the NEEDS WORK label Apr 3, 2022
chicoreus added a commit to FilteredPush/sci_name_qc that referenced this issue Jun 9, 2022
…rent TG2 Specifications. DESCRIPTION: Adding an implementation of VALIDATION_SCIENTIFICNAME_FOUND. Changing the Validator interface to throw an exception to distinguish between service failures (EXTERNAL_PREREQUISITES_NOT_MET), and no match found (RUN_HAS_RESULT+NOT_COMPLIANT). Updating BatchRunnier, WoRMSService, and GBIFService to support this.
@Tasilee
Collaborator

Tasilee commented Jun 19, 2022

Changed "NOT COMPLIANT" to "NOT_COMPLIANT" in Example

@chicoreus
Collaborator

Feedback from deployment of the sci_name_qc implementation at the MCZ: Most services, including GBIF's API, support a lookup on the scientific name without authorship, but not on the literal dwc:scientificName including the authorship. This means that implementations which aren't working off a local data store must implement a parser to extract the part of the name to look up in the authority, then compare the returned name and authorship with the provided name and authorship. Parsing fails readily with historical orthographic variants in which specific epithets based on people's names started with a capital letter, e.g. Ophiocoma Alexandri Lyman, 1860 or Pentagonaster Alexandri Perrier, 1881; both fail in MCZbase because the validation looks up the generic name rather than the binomial. When parsing fails, the lookup fails.

I would recommend that we address this by including dwc:scientificNameAuthorship in the information elements, and changing the specification from

EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:scientificName is EMPTY; COMPLIANT if there is a match of the contents of dwc:scientificName with the bdq:sourceAuthority; otherwise NOT_COMPLIANT

to:

EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:scientificName is EMPTY; COMPLIANT if there is a match of the contents of dwc:scientificName with the bdq:sourceAuthority (using dwc:scientificNameAuthorship to assist with separating which part of the dwc:scientificName to query on a source authority that does not support lookup on scientific name with authorship); otherwise NOT_COMPLIANT

Or, leaving the specification unchanged, and changing the notes from:

The purpose of this test is to detect errors in the scientific name but is dependent on the abilities of the parsing of the bdq:sourceAuthority.

To:

The purpose of this test is to detect errors in the dwc:scientificName. dwc:scientificNameAuthorship is included as an information element to allow implementors to potentially identify which portion of the dwc:scientificName is the authorship string without having to parse the dwc:scientificName into component parts. Many possible bdq:sourceAuthority services support the lookup of a scientific name by the scientific name without authorship, and return no results when given the full dwc:scientificName with authorship, so implementations are likely to have to (1) extract the portion of the name to look up from the dwc:scientificName, (2) perform the lookup, and (3) determine if a full name with authorship in lookup results is an exact match on the provided dwc:scientificName. Step (1) is simplified if dwc:scientificNameAuthorship contains the scientific name authorship portion of dwc:scientificName, with parsing being a fallback when it is not available. An empty dwc:scientificNameAuthorship or an inconsistency between dwc:scientificName and dwc:scientificNameAuthorship should result in dwc:scientificNameAuthorship being ignored.
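
Steps (1) to (3) in the proposed notes could look roughly like the following Python sketch. It is an illustration under assumptions, not the sci_name_qc code: `search` is a hypothetical callable that takes a name without authorship and returns full name strings (with authorship) from the authority, and `parse` is a hypothetical parser fallback that defaults to passing the string through unchanged. "Inconsistency" is approximated here as the authorship not being a trailing substring of the name.

```python
from typing import Callable, Iterable, Optional


def name_without_authorship(scientific_name: str,
                            authorship: Optional[str]) -> Optional[str]:
    """Step (1): strip a trailing authorship string from the name.

    Returns None when dwc:scientificNameAuthorship is empty or inconsistent
    with dwc:scientificName, meaning it should be ignored.
    """
    name = scientific_name.strip()
    author = (authorship or "").strip()
    if not author or not name.endswith(author):
        return None
    remainder = name[: -len(author)].strip()
    return remainder or None


def is_exact_match(
    scientific_name: str,
    authorship: Optional[str],
    search: Callable[[str], Iterable[str]],
    parse: Callable[[str], str] = lambda s: s,  # placeholder parser fallback
) -> bool:
    query = name_without_authorship(scientific_name, authorship)
    if query is None:
        query = parse(scientific_name.strip())
    # Step (2): look up the name without authorship.
    candidates = search(query)
    # Step (3): exact comparison of full names, authorship included.
    target = scientific_name.strip()
    return any(candidate.strip() == target for candidate in candidates)
```

With the historical example from the comment above, `name_without_authorship("Ophiocoma Alexandri Lyman, 1860", "Lyman, 1860")` yields `"Ophiocoma Alexandri"` without any parsing of the name itself.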

@jhnwllr

jhnwllr commented Jun 24, 2022

I have written a blog post about unmatched names GBIF receives from publishers that might be somewhat relevant to this discussion:
https://data-blog.gbif.org/post/2022-03-24-reasons-why-names-don-t-match-to-the-gbif-backbone/

This work relies heavily on the GBIF name parser, which while not perfect, does a fairly good job of telling whether a name is a valid dwc:scientificName.
https://www.gbif.org/tools/name-parser
https://api.gbif.org/v1/parser/name?name=Steroma%20superba%20nr.%20Butler,%201868
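
For anyone wanting to try the parser from a script, a minimal Python sketch using only the standard library follows. The endpoint is the one linked above; the helper names `parser_url` and `parse_name` are made up for this example, and the sketch makes no assumptions about the fields in the JSON that comes back.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint taken from the link above.
PARSER_ENDPOINT = "https://api.gbif.org/v1/parser/name"


def parser_url(name: str) -> str:
    """Build the parser request URL for one name string."""
    # Commas are left unescaped so the URL has the same form as the link above.
    return f"{PARSER_ENDPOINT}?name={quote(name, safe=',')}"


def parse_name(name: str, timeout: float = 10.0):
    """Fetch and decode the parser's JSON response (needs network access)."""
    with urlopen(parser_url(name), timeout=timeout) as response:
        return json.load(response)
```

`parser_url("Steroma superba nr. Butler, 1868")` reproduces the example URL given above.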

@debpaul

debpaul commented Jun 24, 2022

@jhnwllr thanks for the post. I think many will find it helpful as it illuminates and provides transparency about the decision making process going on. I see this related post that really helps those submitting data to engage them to supply names -- that is, in the cases where their names are not matched because they fill gaps. In other words, great to see ways to get the system to work bidirectionally. We need to think of ways to spread the knowledge / use of this information you've collated so nicely ...

@Tasilee
Collaborator

Tasilee commented Aug 21, 2022

In our zoom today, we (@tucotuco, @chicoreus and @ArthurChapman) concluded that as #70 provides testing of dwc:scientificName (in part 2 of the Expected Response), we can set this test to NON CORE. Yea, one less test!

There is a recognition that the 'hard part' of dwc:scientificName is the likely/potential inclusion of dwc:scientificNameAuthorship as per Darwin Core Standard (https://dwc.tdwg.org/list/#dwc_scientificName) and the way some source authorities handle dwc:scientificName. For example, GBIF will not return a match if dwc:scientificName contains an authorship.

A principle (?) was also suggested by @tucotuco in the Zoom meeting of August 17: We should avoid parsing 'input strings'. Output strings yes.

@Tasilee Tasilee removed the Test Tests created by TG2, either CORE, Supplementary or DO NOT IMPLEMENT label Aug 21, 2022
chicoreus added a commit to FilteredPush/sci_name_qc that referenced this issue Aug 26, 2022
…problematic name parsing such as historical capitalized specific epithets by checking for known parse of name in GNI before trying the GBIF name parser. Adding minimal implementation over the GNI web service, adding a utility function to wrap various means of separating authorship from canonical name, adding unit tests, and changing failing test case that GBIF service fails but generalized parsing succeeds at.
@Tasilee Tasilee added the Test Tests created by TG2, either CORE, Supplementary or DO NOT IMPLEMENT label Sep 17, 2022
@chicoreus
Collaborator

chicoreus commented Sep 17, 2022

We should consider adding this back to core tests. (1) For research users of biodiversity data doing quality assurance, #70 handles their needs, but for curators of data sets doing quality control, #46 provides a specific subset of targeted data cleaning, making this a valuable test to include for the quality control case.
(2) This can be handled without parsing the dwc:scientificName, even for services that require just the name without authorship, by running a two-phase process (as in early Filtered Push implementations): querying a scientific name string authority, followed by a query on a nomenclator or taxon authority.
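
One possible reading of the two-phase process in point (2), sketched in Python. This is an interpretation, not the early Filtered Push code: both authorities are passed in as hypothetical callables, the first resolving a verbatim name string to a name without authorship (or None if the string is unknown), the second reporting whether that name exists in the nomenclator or taxon authority.

```python
from typing import Callable, Optional


def two_phase_found(
    scientific_name: str,
    name_string_authority: Callable[[str], Optional[str]],
    taxon_authority: Callable[[str], bool],
) -> bool:
    """Look a name up in two phases, without parsing the input string locally."""
    # Phase 1: the name string authority already knows how this verbatim
    # string breaks down, so no local parser is needed.
    name_only = name_string_authority(scientific_name)
    if name_only is None:
        return False
    # Phase 2: query the nomenclator or taxon authority with the name only.
    return taxon_authority(name_only)
```

In this reading an unknown verbatim string is simply reported as not found; an implementation might prefer to treat that case differently.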

@Tasilee
Collaborator

Tasilee commented Jun 13, 2023

Changed Parameter(s) to "bdq:sourceAuthority" as per discussions 12th June 2023 and restructured Source authority entries

@ArthurChapman
Collaborator

Changed Notes from

The purpose of this test is to detect errors in the scientific name but is dependent on the abilities of the parsing of the bdq:sourceAuthority. For research users of biodiversity data doing quality assurance, #70 handles their needs, but for curators of data sets doing quality control, #46 provides a specific subset of targeted data cleaning, making this a valuable test to include for the quality control case.

To

The purpose of this test is to detect errors in the scientific name but is dependent on the abilities of the parsing of the bdq:sourceAuthority. For research users of biodiversity data doing quality assurance, VALIDATION_TAXON_UNAMBIGUOUS (4c09f127-737b-4686-82a0-7c8e30841590) handles their needs, but for curators of data sets doing quality control, this test provides a specific subset of targeted data cleaning, making it a valuable test to include for the quality control case.

chicoreus added a commit to FilteredPush/sci_name_qc that referenced this issue Jul 2, 2023
…tdwg/bdq specifications. Updated metadata (added ProvidesVersion and Specification) for tdwg/bdq#46 VALIDATION_SCIENTIFICNAME_FOUND Removed reviewed stub method.
chicoreus added a commit to FilteredPush/sci_name_qc that referenced this issue Jul 3, 2023
… where name is provided without authorship. Adding test case.
chicoreus added a commit to FilteredPush/sci_name_qc that referenced this issue Jul 3, 2023
@Tasilee
Collaborator

Tasilee commented Jul 4, 2023

Amended Source Authority values to align with @chicoreus syntax

From

bdq:sourceAuthority default = "GBIF Backbone Taxonomy" [https://doi.org/10.15468/39omei] |
| | API endpoint [https://api.gbif.org/v1/species?datasetKey=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&name=]

to

bdq:sourceAuthority default = "GBIF Backbone Taxonomy" {[https://doi.org/10.15468/39omei]} {API endpoint [https://api.gbif.org/v1/species?datasetKey=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&name=]}

@Tasilee
Collaborator

Tasilee commented Sep 16, 2023

Splitting bdqffdq:Information Elements into "Information Elements ActedUpon" and "Information Elements Consulted". Also changed "Field" to "TestField" and "Output Type" to "TestType".


7 participants