Standardized Reference Data Set for Vertebrate Taxon Name Resolution
This data set has two files. The first, stored at https://github.com/tucotuco/DwCVocabs/blob/master/vocabs/tests/VertNetTaxonomyTestSet.csv, consists of a random set of 1000 records of distinct taxon combinations from the 18 April 2015 snapshots [Bloom D. 2015a, 2015b, 2015c, 2015d, 2015e] of the data aggregated in VertNet [http://vertnet.org] resolved to determine valid taxon names wherever possible using a well-defined workflow. The second file, stored at https://github.com/tucotuco/DwCVocabs/blob/master/vocabs/tests/VertNetTaxonomyMatchingOccurrences.txt, contains the content of the occurrence records with the names from the first file that were available in VertNet from the same snapshots.
These data are meant to serve as a test set against which to compare results of taxon name resolution workflows for verbatim vertebrate names expressed in Darwin Core (Wieczorek et al. 2009, 2012) fields as found in the Darwin Core Quick Reference Guide [http://rs.tdwg.org/dwc/terms/index.htm]. The Matching Occurrences data set is meant for those who wish to investigate in further detail the data used in the research.
The data set includes the fields described below in a utf8-encoded, comma-separated value file with " as the quoting character:
The data set includes an id field (a signed long integer) to identify distinct name combination records. The data set also includes the following Darwin Core Taxon fields [http://rs.tdwg.org/dwc/terms/#taxonindex] that were used as input:
genus, subgenus, specificEpithet, infraspecificEpithet, scientificNameAuthorship, scientificName
For all records in this data set, one can introduce kingdom = 'Animalia' and phylum = 'Chordata' if these fields are required or beneficial for the process that is being tested.
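As a minimal sketch of how the test set might be loaded for such a process, the following reads a comma-separated file with " as the quoting character and adds the two optional classification values. It assumes that the file has a header row whose column names match the field names described here. The function name read_test_set and the in-memory sample row (including its id value) are hypothetical, for illustration only.

```python
import csv
import io


def read_test_set(handle):
    """Read test-set rows from an open text handle and add the optional
    kingdom and phylum values that apply to every record."""
    reader = csv.DictReader(handle, delimiter=",", quotechar='"')
    rows = []
    for row in reader:
        row["kingdom"] = "Animalia"
        row["phylum"] = "Chordata"
        rows.append(row)
    return rows


# Made-up sample standing in for the real file. For the real file, one
# would use open(path, encoding="utf-8", newline="") instead.
sample = io.StringIO(
    'id,genus,specificEpithet,scientificName\n'
    '-12345,Puma,concolor,"Puma concolor (Linnaeus, 1771)"\n'
)
rows = read_test_set(sample)
print(rows[0]["scientificName"], "|", rows[0]["kingdom"], "|", rows[0]["phylum"])
```

Opening the real file explicitly as UTF-8 matters here, because several of the fields below flag encoding problems in the names themselves.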
While processing the verbatim inputs it was found to be convenient to create fields to store concatenated subsets of the original fields that might be used by scientific name parsers and resolvers.
scientificnameplus - a concatenation of the scientificName with the infraspecificEpithet for records where the scientificName did not contain the infraspecific epithet information. It is clear that some data publishers interpret the scientificName to contain only the species binomial and leave the infraspecificEpithet separate. The combination into scientificnameplus essentially creates an interim field where this misuse of the standard is accounted for.
dwcsn-rank - the most specific taxon rank in the scientificName (not scientificnameplus).
sn-rank - the most specific taxon rank in the scientificnameplus.
constructedscientificname - a concatenation of the atomized scientific name fields (genus, subgenus, specificEpithet, infraspecificEpithet, scientificNameAuthorship).
con-rank - the most specific taxon rank in the constructedscientificname.
The data set also includes fields that were meant to be the testable outputs of the processes described in the Methods section below. These include:
validCanonical - the canonical form (monomial, binomial, or trinomial, no authorship) of the scientific name based on the verbatim Darwin Core taxonomy fields and the external sources given in the validSource and sourceURL fields as of the date given in the sourceDate field.
validSource - the name(s) of the source(s) used to determine the name in the validCanonical field. Multiple sources are separated by " | ".
sourceURL - the URL(s) of the source(s) used to determine the name in the validCanonical field. Multiple URLs are separated by " | ".
sourceDate - the date on which the source(s) in validSource and sourceURL were consulted to determine the scientific name given in the validCanonical field.
comments - additional information about the resolution of the record.
checked - the way in which the record was vetted (see Methods, below).
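Because validSource and sourceURL can each hold several values separated by " | ", a consumer of the file will usually want to split them before comparing results. The helper below is a small sketch of that step. The function name split_multi and the example value are hypothetical, and nothing here assumes that the sources and URLs line up position by position.

```python
def split_multi(value, separator=" | "):
    """Split a multi-valued field such as validSource or sourceURL
    into a list of its individual values."""
    if not value:
        return []
    return [part.strip() for part in value.split(separator) if part.strip()]


# Hypothetical example value, for illustration only.
print(split_multi("Source A | Source B"))
```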
Finally, the data set includes fields that show detectable characteristics of the taxon data in the record. All fields for which the name begins with 'con-' refer to characteristics of the constructedscientificname. All fields for which the name begins with 'sn-' refer to characteristics of the scientificnameplus. Many of these are assessments for errors in one of several categories - spelling errors, errors based on the applicable nomenclatural code ("format errors" in the descriptions below), and Darwin Core usage errors ("conceptual errors" in the descriptions below).
con-ms - there is some misspelling in the constructedscientificname. If TRUE, this constitutes a spelling error.
con-sp - there is a species identificationQualifier [http://rs.tdwg.org/dwc/terms/#identificationQualifier] in the constructedscientificname. If TRUE, this constitutes a conceptual error.
con-inf - there is an infraspecific identificationQualifier [http://rs.tdwg.org/dwc/terms/#identificationQualifier] in the constructedscientificname. If TRUE, this constitutes a conceptual error.
con-ws - there is extra whitespace (space, tab, non-printing character) in the constructedscientificname. If TRUE, this constitutes a format error.
con-cap - there is incorrect capitalization in the constructedscientificname. If TRUE, this constitutes a format error.
con-sg - there is a subgenus in the constructedscientificname. If TRUE, this does not necessarily constitute an error. It is an error only if the subgenus is given somewhere other than the subgenus field (usually occurs in genus as an error).
con-sgerror - there is a subgenus in the constructedscientificname, but not from the subgenus field. If TRUE, this constitutes a conceptual error.
con-rnk - there is a name from an incorrect taxon rank in the constructedscientificname. Because the constructedscientificname is composed of the genus, subgenus, specificEpithet, infraspecificEpithet, and scientificNameAuthorship, it is not appropriate to have names for other ranks (such as family) in the constructedscientificname, even if it is appropriate to have names of other ranks in the scientificName according to the Darwin Core definition [http://rs.tdwg.org/dwc/terms/#scientificName] of that term. If con-rnk is TRUE, this constitutes a conceptual error.
con-auth - there is a scientificNameAuthorship string in the constructedscientificname. Note: this might occur even if the scientificNameAuthorship field is empty, if one of the other fields constituting the constructedscientificname contains an author string. If TRUE, this does not necessarily constitute an error. This is only an error if the name appears in a field other than scientificNameAuthorship.
con-autherror - there is a scientific name authorship string in the constructedscientificname, but not from the scientificNameAuthorship input field. If TRUE, this constitutes a conceptual error.
con-authcap - there is incorrect capitalization in the scientificNameAuthorship string in the constructedscientificname. If TRUE, this constitutes a format error.
con-hyb - there is a hybrid formula in the constructedscientificname. If TRUE, this constitutes a conceptual error.
con-cf - there is a 'cf' identificationQualifier [http://rs.tdwg.org/dwc/terms/#identificationQualifier] in the constructedscientificname. If TRUE, this constitutes a conceptual error.
con-qu - there is a question mark identificationQualifier [http://rs.tdwg.org/dwc/terms/#identificationQualifier] in the constructedscientificname. If TRUE, this constitutes a conceptual error.
con-ab - there is an inappropriate abbreviation of a taxon name in the constructedscientificname. If TRUE, this constitutes a format error. Note: Abbreviations in scientificNameAuthorship may be appropriate.
con-ex - there is something extra in the constructedscientificname (distinct from the other errors captured in other con- fields). If TRUE, this constitutes a conceptual error.
con-enc - there is a character encoding issue in the constructedscientificname. If TRUE, this ultimately constitutes a spelling error, though the source may be from a sordid history of transformations of the data. Note: this can happen when the data have passed through an incorrect encoding at some point in their history, for example, when UTF8-encoded data have been opened in Excel and then saved again, without importing the original as UTF8.
con-valid - a statement about the validity of the constructedscientificname. One of “valid”, “invalid” or “not applicable”, where "not applicable" applies if constructedscientificname is empty or consists of only a scientificNameAuthorship.
sn-ms - there is some misspelling in the scientificnameplus. If TRUE, this constitutes a spelling error.
sn-sp - there is a species identificationQualifier [http://rs.tdwg.org/dwc/terms/#identificationQualifier] in the scientificnameplus. If TRUE, this constitutes a conceptual error.
sn-inf - there is an infraspecific identificationQualifier [http://rs.tdwg.org/dwc/terms/#identificationQualifier] in the scientificnameplus. If TRUE, this constitutes a conceptual error.
sn-inf-missing - the infraspecificEpithet field is not empty and the record has a scientificName that does not include an infraspecific epithet. If TRUE, this constitutes a conceptual error.
sn-ws - there is extra whitespace (space, tab, non-printing character) in the scientificnameplus. If TRUE, this constitutes a format error.
sn-cap - there is incorrect capitalization in the scientificnameplus. If TRUE, this constitutes a format error.
sn-sg - there is a subgenus in the scientificnameplus. If TRUE, this does not constitute an error.
sn-auth - there is a scientificNameAuthorship string in the scientificnameplus. If TRUE, this does not constitute an error.
sn-authcap - there is incorrect capitalization in the scientificNameAuthorship part of the scientificnameplus. If TRUE, this constitutes a format error.
sn-hyb - there is a hybrid formula in the scientificnameplus. If TRUE, this constitutes a conceptual error.
sn-cf - there is a 'cf' identificationQualifier [http://rs.tdwg.org/dwc/terms/#identificationQualifier] in the scientificnameplus. If TRUE, this constitutes a conceptual error.
sn-qu - there is a question mark identificationQualifier [http://rs.tdwg.org/dwc/terms/#identificationQualifier] in the scientificnameplus. If TRUE, this constitutes a conceptual error.
sn-ab - there is an inappropriate abbreviation of a taxon name in the scientificnameplus. If TRUE, this constitutes a format error. Note: Abbreviations in scientificNameAuthorship may be appropriate.
sn-ex - there is something extra in the scientificnameplus (distinct from the other errors captured in other sn- fields). If TRUE, this constitutes a conceptual error.
sn-enc - there is a character encoding issue in the scientificnameplus. If TRUE, this ultimately constitutes a spelling error, though it may be the result of post-digitization data transfers with incorrect encodings. Note: this can happen when the data have passed through an incorrect encoding at some point in their history, for example, when UTF8-encoded data have been opened in Excel and then saved again, without importing the original as UTF8.
sn-valid - a statement about the validity of the scientificnameplus. One of “valid”, “invalid” or “not applicable”, where "not applicable" applies if scientificnameplus is empty or consists of only a scientificNameAuthorship.
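The flag descriptions above sort each con- and sn- field into one of three error categories, or into no category at all (for example con-sg, con-auth, sn-sg and sn-auth, which are not errors in themselves). The sketch below summarizes the categories present for one record. It assumes that flag values are stored as the string TRUE, which the descriptions suggest but which should be checked against the file. The function name error_categories is hypothetical.

```python
# Flag suffixes grouped by the error category given in the descriptions.
SPELLING = {"ms", "enc"}
FORMAT = {"ws", "cap", "authcap", "ab"}
CONCEPTUAL = {"sp", "inf", "inf-missing", "sgerror", "rnk", "autherror",
              "hyb", "cf", "qu", "ex"}


def error_categories(record, prefix):
    """Return the sorted error categories flagged TRUE for the given
    prefix ('con' or 'sn') in one record (a dict of field -> value)."""
    found = set()
    for field, value in record.items():
        if not field.startswith(prefix + "-"):
            continue
        # Assumption: a set flag is stored as the text TRUE.
        if str(value).strip().upper() != "TRUE":
            continue
        suffix = field[len(prefix) + 1:]
        if suffix in SPELLING:
            found.add("spelling")
        elif suffix in FORMAT:
            found.add("format")
        elif suffix in CONCEPTUAL:
            found.add("conceptual")
        # Other suffixes (sg, auth, valid, rank) are not errors by themselves.
    return sorted(found)


record = {"con-ws": "TRUE", "con-sg": "TRUE", "con-cf": "FALSE",
          "sn-inf-missing": "TRUE", "sn-ms": "TRUE"}
print(error_categories(record, "con"))
print(error_categories(record, "sn"))
```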
VertNet is comprised of data aggregated from self-published sources as well as sources published with the help of and hosted by VertNet. Most vertebrate data hosted on the VertNet IPT [ipt.vertnet.org:8080/ipt] have first passed through customized "migrators" based on the VertNet Darwin Core Migrator Toolkit [Wieczorek 2015], while self-hosted sources have not. The migrators provide standardized scientific names and classifications in the Darwin Core taxon fields and preserve the verbatim classification from the source in the Darwin Core higherClassification field. Thus, any study of the raw, unadulterated, verbatim taxonomic content of data sets participating in VertNet must use the pre-migration data for those data sets that are hosted by VertNet, while the rest of the data in VertNet can be used as is.
The input for this data set was derived from published occurrence records integrated in the VertNet network as of 18 April 2015 as well as from pre-publication sources as explained above. The methods for these two types of sources are explained below.
The 18 April 2015 VertNet snapshots [Bloom 2015a, 2015b, 2015c, 2015d, 2015e] with a total of 17,412,547 occurrence records were uploaded into a Google BigQuery table (vertnet_latest) for easy filtering and extraction of subsets. Then a summary table of taxa by data set was extracted from BigQuery using the following query:
SELECT icode, collectioncode, gbifdatasetid, scientificname, genus, subgenus, specificepithet, infraspecificepithet, scientificnameauthorship, migrator, count(*) as reps FROM [dumps.vertnet_latest] group by icode, collectioncode, gbifdatasetid, scientificname, genus, subgenus, specificepithet, infraspecificepithet, scientificnameauthorship, migrator
This summary was saved in a new table called vn_scinames_by_dataset and contained 892128 distinct combinations. This new table was used to extract a subset of non-migrated taxon combinations using the following query:
SELECT scientificname, genus, subgenus, specificepithet, infraspecificepithet, scientificnameauthorship, sum(reps) as totaloccurrences FROM [dumps.vn_scinames_by_dataset] where migrator is null or migrator='no migrator' group by scientificname, genus, subgenus, specificepithet, infraspecificepithet, scientificnameauthorship
This summary was saved in a new table called vn_scinames_from_selfhosted and contained 304127 distinct combinations. The following query was run to extract complementary data for migrated data sets (for reference only - these data were not used to construct this test data set):
SELECT scientificname, genus, subgenus, specificepithet, infraspecificepithet, scientificnameauthorship, sum(reps) as totaloccurrences FROM [dumps.vn_scinames_by_dataset] where migrator like '%-%' group by scientificname, genus, subgenus, specificepithet, infraspecificepithet, scientificnameauthorship
This summary was saved in a new table called vn_scinames_from_migrators and contained 158250 distinct combinations after migration (compared to 399173 combinations before migration determined separately).
The vn_scinames_from_selfhosted table was saved to Google Cloud storage as a gzipped CSV file vn_scinames_from_selfhosted.csv, downloaded, and unzipped. The resulting CSV file was loaded into a table vn_names_from_selfhosted in a Microsoft Access database using the same database schema as in BigQuery.
The table vn_names_from_selfhosted was duplicated with the same data to a new table vn_names_all. The table vn_names_from_migrators was created without data using the same structure as vn_names_from_selfhosted.
Each customized VertNet migrator contains a table SimpleDwC-verbatim into which the verbatim source data are loaded before further processing. For each migrated data set, the distinct combinations of institutionCode, scientificName, genus, subgenus, specificEpithet, infraspecificEpithet, scientificNameAuthorship, count(*) as reps were extracted from the migrator's table SimpleDwC-verbatim and appended to the table vn_names_from_migrators.
When all migrators had been processed and the data added to the vn_names_from_migrators table, the distinct combinations of scientificName, genus, subgenus, specificEpithet, infraspecificEpithet, scientificNameAuthorship, and reps were extracted and appended to table vn_names_all.
At this point the table vn_names_all contained the full set of verbatim name combinations of those fields across all of published VertNet as of 18 April 2015. The table vn_names_distinct was created empty with the same structure as table vn_names_all. The field 'id' was added to table vn_names_distinct of type Autonumber with random new long integer values. This enables each new record added to the table to receive a distinct random id.
The table vn_names_distinct was filled by appending from a view defined as
SELECT scientificName, genus, subgenus, specificEpithet, infraspecificEpithet, scientificNameAuthorship, sum(totaloccurrences) as totaloccurrences FROM vn_names_all group by scientificName, genus, subgenus, specificEpithet, infraspecificEpithet, scientificNameAuthorship
This resulted in a completed table vn_names_distinct with 522163 distinct combinations of taxon records from across all of VertNet as of 18 April 2015. This table constitutes the raw material from which to draw the random subset of 1000 records for this test data set.
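For readers who want to reproduce the aggregation outside of a database, the grouping in the view above can be sketched in Python as follows. This is an illustration, not the workflow that was actually used. One difference is labeled in the code: missing values and empty strings are merged into a single group here, whereas SQL GROUP BY keeps NULL and the empty string apart. The function name distinct_combinations and the sample rows are hypothetical.

```python
from collections import defaultdict

NAME_FIELDS = ("scientificName", "genus", "subgenus", "specificEpithet",
               "infraspecificEpithet", "scientificNameAuthorship")


def distinct_combinations(rows):
    """Sum totaloccurrences per distinct combination of the name fields."""
    totals = defaultdict(int)
    for row in rows:
        # Assumption: a missing value is treated the same as an empty string.
        key = tuple(row.get(field) or "" for field in NAME_FIELDS)
        totals[key] += int(row["totaloccurrences"])
    return dict(totals)


# Made-up rows, for illustration only.
rows = [
    {"scientificName": "Puma concolor", "genus": "Puma",
     "specificEpithet": "concolor", "totaloccurrences": "3"},
    {"scientificName": "Puma concolor", "genus": "Puma",
     "specificEpithet": "concolor", "totaloccurrences": "4"},
    {"scientificName": "Puma", "genus": "Puma", "totaloccurrences": "1"},
]
combos = distinct_combinations(rows)
print(len(combos))
```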
The table 1000VNNames was created with the fields described under the Content section above, along with various auxiliary fields to help with the management of processing subsets of the test data. To populate the table 1000VNNames with 1000 names, the id of the 1000th record when sorted in ascending order was found and used as a filter to select the first 1000 id-sorted names. Because the ids were generated randomly, this selection is a random sample. The table 1000VNNames was then populated by appending records from the following query:
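INSERT INTO [1000VNNames] (id, scientificname, genus, subgenus, specificepithet, infraspecificepithet, scientificnameauthorship) SELECT id, scientificname, genus, subgenus, specificepithet, infraspecificepithet, scientificnameauthorship FROM vn_names_distinct WHERE id <= {value of 1000th sorted id}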
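The sampling idea, random ids followed by a cut at the id of the Nth sorted record, can be sketched outside of Access as follows. This is an illustration under stated assumptions rather than the procedure that was run: the ids are drawn from the signed 32-bit range to mimic a random long integer Autonumber, and the function names assign_random_ids and first_n_by_id are hypothetical.

```python
import random


def assign_random_ids(records, rng):
    """Give each record a distinct random id from the signed 32-bit range."""
    ids = rng.sample(range(-2**31, 2**31), len(records))
    return [dict(record, id=new_id) for record, new_id in zip(records, ids)]


def first_n_by_id(records, n):
    """Select the n records with the smallest ids by finding the id of the
    nth record in ascending order and filtering on id <= that value."""
    ordered = sorted(records, key=lambda record: record["id"])
    threshold = ordered[n - 1]["id"]
    return [record for record in records if record["id"] <= threshold]


rng = random.Random(2015)  # fixed seed so the sketch is repeatable
records = assign_random_ids([{"name": f"name{i}"} for i in range(50)], rng)
subset = first_n_by_id(records, 10)
print(len(subset))
```

Because the ids are independent of the names, cutting the sorted list at any position yields a uniform random sample of that size.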
During the course of name resolution and vetting, 39 records from among the original 1000 were found to be for non-vertebrates (VertNet does indeed have some non-vertebrate records from some sources for various reasons). These were replaced using the same method to bring the final count of the test data set to 1000 records from vertebrates.
A field called constructedscientificname, to facilitate valid name resolution, was populated with the space-separated concatenation of genus, subgenus, specificEpithet, infraspecificEpithet, and scientificNameAuthorship. In the cases where only the infraspecificEpithet field in the record was populated with data, the constructedscientificname was left blank.
A field called scientificnameplus was created to provide a corollary to the constructedscientificname based on the scientificName field. Inspection of the dataset revealed that some verbatim records relied on a combination of scientificName and infraspecificEpithet to produce the full scientific name, apparently taking the scientificName field to be populated with nothing more specific than a binomial. Thus, for records where scientificName and infraspecificEpithet were both populated and the scientificName did not contain an infraspecificEpithet, the field scientificnameplus was constructed from the space-separated concatenation of the two, otherwise it was populated by only the value in the scientificName field.
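The two derived fields can be sketched as below. Several details are assumptions that the text does not settle: empty fields are skipped rather than contributing extra spaces, field values are not trimmed (so that whitespace problems remain visible to the con-ws and sn-ws checks), and "the scientificName did not contain an infraspecificEpithet" is approximated by testing whether the epithet already appears as a whole word in scientificName. That approximation would miss cases where the infraspecific epithet repeats the specific epithet. The function names are hypothetical.

```python
ATOMIZED_FIELDS = ("genus", "subgenus", "specificEpithet",
                   "infraspecificEpithet", "scientificNameAuthorship")


def constructed_scientific_name(record):
    """Space-separated concatenation of the populated atomized fields,
    left blank when only infraspecificEpithet is populated."""
    populated = [(field, record.get(field) or "") for field in ATOMIZED_FIELDS]
    populated = [(field, value) for field, value in populated if value.strip()]
    if [field for field, _ in populated] == ["infraspecificEpithet"]:
        return ""
    return " ".join(value for _, value in populated)


def scientific_name_plus(record):
    """scientificName, extended with infraspecificEpithet when both are
    populated and the epithet is not already a word in scientificName."""
    name = record.get("scientificName") or ""
    infra = record.get("infraspecificEpithet") or ""
    name_words = name.lower().split()
    if name.strip() and infra.strip() and infra.strip().lower() not in name_words:
        return f"{name} {infra}"
    return name


record = {"genus": "Peromyscus", "specificEpithet": "maniculatus",
          "infraspecificEpithet": "bairdii",
          "scientificName": "Peromyscus maniculatus"}
print(constructed_scientific_name(record))
print(scientific_name_plus(record))
```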
A field called who was created to track the researcher to whom primary responsibility for vetting was assigned. The first 500 records of the 1000 in the data set were assigned to one researcher and the remaining 500 were assigned to the other researcher. Neither researcher was an expert in vertebrate taxonomy and nomenclature.
A first pass of the full data set was made by a third researcher to assign preliminary values to most of the fields whose names begin with con- or sn- (see Contents section above). All of these values would later be checked and amended if necessary by the data set vetters.
A preliminary methodology for resolving the records and creating outputs was defined. These methods were used on a subset of 50 records that were shared with both vetters to resolve. The results from the two vetters were compared and used to refine the methods and outputs.
A second subset of 24 records not used in the first subset were shared and resolved by both vetters using the refined methods to determine if results were reasonably in accord. Thereafter distinct subsets of the remaining records were given to the vetters, each set with roughly 20% of records in common between the two. When the two vetters both finished with a subset, they compared the records in common to check for consistency in the use of the methodology. Thus, when all records were resolved, 200 of them were resolved by a detailed consensus of both vetters.
As an additional check of consistency, 100 records from the remaining 400 of each vetter were shared with the other vetter for review. The review covered assessments of validity, errors in the input, and assigned ranks of the output, but did not go into the documented sources to assess the assignment of the final valid taxon name (field validcanonical). This pass revealed only two minor inconsistencies among the 200 reviewed records, which were amended. Thus, an additional 200 records were checked by both vetters.
Finally, each vetter reviewed the remainder of his/her own records for errors or inconsistencies. This pass revealed only about ten additional minor inconsistencies, which were amended for the final data set.
In all, 600 records were resolved entirely by one vetter. These records are designated in the field called checked with the value "self review".
Another 200 records were reviewed by the other vetter. These records have the value "independent review" in the field checked.
Of the remaining 200 records, the results for 132 of them were completely in accord. These records are designated with "independent consensus" in the field checked. Another 59 of the remaining records were not originally in accord, but an accord was reached through a consensus of the two vetters. These records have "consensus discussion" in the field checked. Finally, after discussion the solutions of both vetters for the remaining 9 records were agreed to be equally reasonable. For these nine records, both solutions were kept in this test data set (resulting in a total of 1009 records for 1000 input records), and the value for the checked field for these records was set to "primary opinion" or "secondary opinion" based on who the record was originally assigned to. The original assignee's opinion is recorded as the "primary opinion". This designation is not a statement of the relative merits of the opinions.
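The counts given above can be cross-checked with a few lines of arithmetic. The dictionary below simply restates the numbers from the text, keyed by the values of the checked field. The variable names are hypothetical.

```python
# Record counts per value of the checked field, as stated in the text.
checked_counts = {
    "self review": 600,
    "independent review": 200,
    "independent consensus": 132,
    "consensus discussion": 59,
    "primary opinion": 9,
    "secondary opinion": 9,
}

total_records = sum(checked_counts.values())
# Each "secondary opinion" record duplicates an input that already has a
# "primary opinion" record, so it does not add a new input record.
distinct_inputs = total_records - checked_counts["secondary opinion"]
print(total_records, distinct_inputs)  # prints "1009 1000"
```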
In all shared data sets, care was taken throughout the process to preserve the original UTF8 encoding to avoid introducing spurious errors for non-ascii characters.
We would like for this test data set to be useful in providing benchmarks for error detection and name resolution services. To do so, it is important that a) it be kept up-to-date with the latest taxonomic revisions, and b) that new versions are identified, dated, and cited with this information intact. If you suspect that there is an error in the data set, or that an update is warranted for any records, there are two ways to share that information. One way is to open a new issue (https://github.com/tucotuco/DwCVocabs/issues), in which please describe as completely as possible the change that you think should be made, with citations if appropriate. The other way, especially if the changes are extensive, is to Fork the repository, make changes in your fork, commit the changes, and send a pull request. We will review pull requests and merge them if acceptable, making a new version with each merge.
The data set includes the following fields in a utf8-encoded, tab-separated value file with no quoting character:
namecomboid - a signed long integer corresponding to the id field in the VertNetTaxonomyTestSet.csv file. This field serves as the link between the names and the occurrences that use those names.
newnamekey - this is the name combination key made by concatenating the input fields of the file VertNetTaxonomyTestSet.csv with ‘ | ‘ as field separator. This was necessary to be able to connect name combinations to their associated Occurrences.
clade - this is a best-effort, post-hoc assignment of the class to the name string combination. Whether the values are classes or not depends on the classification authority. Here we follow Catalog of Life for all except Placodermi and Conodonta, which come from Paleobiology Database.
migrated - shows if the data as published (as opposed to the original used in this study) were among those that were “migrated” using the VertNet Toolkit (https://github.com/vertnet/toolkit).
datasetid - the global unique identifier for the source data set in the GBIF registry, or a proxy if the data set was not yet registered in GBIF. Note that a data set may have data from more than one collection (e.g., LACM and BPBM have combined collections of vertebrates in one data set).
icode - the acronym for the source institution of the data set. Note that some institutions have more than one acronym (e.g., MSU/MSUM, UWBM/UWFC).
collectionCode - the name given by the institution for the collection in which the occurrence belongs. Note that a data set may have data from more than one collection (e.g., LACM, BPBM).
catalognumber - the catalogNumber given by the institution for the Occurrence record.
basisofrecord - the standardized basisOfRecord for the Occurrence.
vbasisofrecord - the verbatim original basisOfRecord for the Occurrence.
year - the cleaned value of year of the Occurrence. The year presented here was determined from the original data for year, eventDate, and verbatimEventDate, and is only filled in if the data indicated a single year rather than a date range that covered more than one year.
vyear - the verbatim original value of the year field in the Occurrence record.
eventdate - the verbatim original value of the eventDate field in the Occurrence record.
verbatimeventdate - the verbatim original value of the verbatimEventDate field in the Occurrence record.
continent - the cleaned value of continent of the Occurrence. The continent presented here was determined from the original data for all of the geography fields using the same methods and authority used in the VertNet migrators (https://github.com/vertnet/toolkit).
country - the cleaned value of country of the Occurrence. The country presented here was determined from the original data for all of the geography fields using the same methods and authority used in the VertNet migrators (https://github.com/vertnet/toolkit).
countrycode - the cleaned value of countryCode of the Occurrence. The countryCode presented here was determined from the original data for all of the geography fields using the same methods and authority used in the VertNet migrators (https://github.com/vertnet/toolkit).
waterbody - the cleaned value of waterbody of the Occurrence. This field was not thoroughly vetted to determine the validity of the name of the water body. It was only provided if the original data had a value in the waterbody field. Thus there may be other records that are in a waterbody if that information was held somewhere else in the higher geography fields or in the locality field. In other words, use with caution.
island - the cleaned value of island of the Occurrence. This field was generally only provided if the original data had a value in the island field. Thus there may be other records that are on an island if that information was held somewhere else in the higher geography fields or in the locality field. In other words, use with caution.
islandgroup - the cleaned value of islandGroup of the Occurrence. This field was generally only provided if the original data had a value in the islandGroup field. Thus there may be other records that are in an islandGroup if that information was held somewhere else in the higher geography fields or in the locality field. In other words, use with caution.
geogkey - this is the geography combination key made from the ‘;‘-separated concatenation of original higher geography fields. This was necessary to be able to connect geography combinations to their associated records in the VertNet geographic authority used in the migrators (https://github.com/vertnet/toolkit).
vcontinent - the pre-cleaned value of the continent field in the Occurrence record. For migrated data sets, this is already the post-migration cleaned up value. For self-hosted data sets, this is the verbatim original value of the continent field.
vcountry - the pre-cleaned value of the country field in the Occurrence record. For migrated data sets, this is already the post-migration cleaned up value. For self-hosted data sets, this is the verbatim original value of the country field.
vcountrycode - the pre-cleaned value of the countryCode field in the Occurrence record. For migrated data sets, this is already the post-migration cleaned up value. For self-hosted data sets, this is the verbatim original value of the countryCode field.
vstateprovince - the pre-cleaned value of the stateProvince field in the Occurrence record. For migrated data sets, this is already the post-migration cleaned up value. For self-hosted data sets, this is the verbatim original value of the stateProvince field.
vcounty - the pre-cleaned value of the county field in the Occurrence record. For migrated data sets, this is already the post-migration cleaned up value. For self-hosted data sets, this is the verbatim original value of the county field.
vmunicipality - the pre-cleaned value of the municipality field in the Occurrence record. For migrated data sets, this is already the post-migration cleaned up value. For self-hosted data sets, this is the verbatim original value of the municipality field.
vwaterbody - the pre-cleaned value of the waterBody field in the Occurrence record. For migrated data sets, this is already the post-migration cleaned up value. For self-hosted data sets, this is the verbatim original value of the waterBody field.
visland - the pre-cleaned value of the island field in the Occurrence record. For migrated data sets, this is already the post-migration cleaned up value. For self-hosted data sets, this is the verbatim original value of the island field.
vislandgroup - the pre-cleaned value of the islandGroup field in the Occurrence record. For migrated data sets, this is already the post-migration cleaned up value. For self-hosted data sets, this is the verbatim original value of the islandGroup field.
hascoordinates - “Yes” or “No” indicating that the Occurrence has data somewhere in the coordinate fields that are meant to be a real set of coordinates. Could include UTM coordinates as well as geographic coordinates in various formats in various fields.
isgeoreferenced - “Yes” or “No” indicating that the Occurrence has coordinates based on the definition above and has a value that is meant to be a real measure of the uncertainty in meters.
validcoordinates - “Yes” or “No” indicating that the Occurrence has coordinates based on the definition above and those coordinates fall within the valid numerical ranges. “Yes” here does not necessarily indicate that the coordinates are consistent with the geography.
validgeorefs - “Yes” or “No” indicating that the Occurrence has valid coordinates based on the definition above and has a coordinateUncertaintyInMeters that falls within the valid numerical range (i.e., >=1).
verbatimlatitude - the verbatim original value of the verbatimLatitude field in the Occurrence record.
verbatimlongitude - the verbatim original value of the verbatimLongitude field in the Occurrence record.
verbatimcoordinates - the verbatim original value of the verbatimCoordinates field in the Occurrence record.
decimallatitude - the verbatim original value of the decimalLatitude field in the Occurrence record as a string.
decimallongitude - the verbatim original value of the decimalLongitude field in the Occurrence record as a string.
coordinateuncertaintyinmeters - the verbatim original value of the coordinateUncertaintyInMeters field in the Occurrence record as a string.
declat - the decimallatitude field converted to a double precision floating point number.
declng - the decimallongitude field converted to a double precision floating point number.
unc - the coordinateuncertaintyinmeters field converted to a double precision floating point number.
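As a rough sketch of how the Matching Occurrences file might be consumed, the following reads tab-separated text with quoting disabled, groups occurrences by namecomboid so they can be linked to the id field of the test set, and re-derives the coordinate checks described above. The assumptions are that the file has a header row with the field names listed here, and that the "valid numerical ranges" are the usual Darwin Core ranges of -90 to 90 for latitude and -180 to 180 for longitude. The function names and the sample row are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_occurrences(handle):
    """Read tab-separated rows; QUOTE_NONE keeps any " characters literal."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def group_by_name(occurrences):
    """Group occurrence rows by namecomboid (the id in the test set)."""
    groups = defaultdict(list)
    for occurrence in occurrences:
        groups[occurrence["namecomboid"]].append(occurrence)
    return dict(groups)


def has_valid_coordinates(declat, declng):
    """True if both values parse as numbers within the assumed ranges."""
    try:
        lat, lng = float(declat), float(declng)
    except (TypeError, ValueError):
        return False
    return -90 <= lat <= 90 and -180 <= lng <= 180


def has_valid_georef(declat, declng, unc):
    """Valid coordinates plus an uncertainty in meters of at least 1."""
    try:
        return has_valid_coordinates(declat, declng) and float(unc) >= 1
    except (TypeError, ValueError):
        return False


# Made-up sample row standing in for the real file.
sample = io.StringIO(
    "namecomboid\tcatalognumber\tdeclat\tdeclng\tunc\n"
    '-12345\tX "A" 1\t37.87\t-122.26\t100\n'
)
occurrences = read_occurrences(sample)
groups = group_by_name(occurrences)
first = occurrences[0]
print(first["catalognumber"], has_valid_georef(first["declat"], first["declng"], first["unc"]))
```

Disabling quoting matters for this file because a quotation mark inside a value would otherwise be interpreted as the start of a quoted field.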
David Bloom. 2015a. VertNet Portal - Class Amphibia Records. KNB Data Repository. [doi:10.5063/F1VX0DF9]
David Bloom. 2015b. VertNet Portal - Class Aves Records. KNB Data Repository. [doi:10.5063/F1MG7MDB]
David Bloom. 2015c. VertNet Portal - Class Fishes Records. KNB Data Repository. [doi:10.5063/F1R49NQB]
David Bloom. 2015d. VertNet Portal - Class Mammalia Records. KNB Data Repository. [doi:10.5063/F1GQ6VPM]
David Bloom. 2015e. VertNet Portal - Class Reptilia Records. KNB Data Repository. [doi:10.5063/F10P0WX6]
Wieczorek J, Döring M, De Giovanni R, Robertson T, Vieglais D. Darwin Core. [Internet]. 2009. Available from: [http://www.tdwg.org/standards/450/]
Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, Giovanni R, et al. Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS ONE 2012; 7(1): e29715. [doi:10.1371/journal.pone.0029715]
Wieczorek J. VertNet Darwin Core Data Migrator Toolkit. GitHub repository. [Internet]. 2015. Available from: [https://github.com/vertnet/toolkit]