Darwin Core Continent and Water Body #128

Open
Jegelewicz opened this issue Sep 22, 2018 · 9 comments

@Jegelewicz

Over a year ago, I found that some of the UTEP specimens on islands in the Pacific are flagged by iDigBio with "dwc_continent_replaced | Darwin Core Continent Corrected." Life kept moving and the issue fell by the wayside.

Then, while at SPNHC, Robert Mesibov offered to review some Arctos data for me. He downloaded the MSB fish data from iDigBio and reviewed the RAW file. One of the issues he found was that all of the stuff coming from oceans had no water body and instead the body of water was in the DWC_Continent field.

In iDigBio, Atlantic Ocean is a body of water; in Arctos, it is a continent/ocean.

I thought that it would make sense to call the tectonic plate the "continent", but that isn't how iDigBio does it. They use political boundaries for continent.

So DMNS:Bird:18967 in Arctos shows a dwc_continent of "Atlantic Ocean" and no associated water body.

and DMNS:Bird:18967 in iDigBio shows a dwc_continent of "Europe" and has the flag DWC Continent Replaced.

Strictly speaking, we are both wrong but I doubt that anyone searching in iDigBio for Europe wants stuff from the South Georgia Islands. And when I search iDigBio for institution code "DMNS" plus water body "Atlantic Ocean" I get no results. At least anyone searching Arctos for stuff from the Continent/Ocean field for "Atlantic Ocean" will find this specimen (I tried it and it worked!).

All this being said, it seems to me that there needs to be a wider community discussion about Continent and Bodies of Water. I have suggested to Arctos that, in the interest of making our data show up in appropriate searches in iDigBio (and GBIF, I'm betting), we should add Water Body to our higher geography, and for anything with a "continent/ocean" that is really a water body, we add the correct name to the water body field. iDigBio will still replace our "Continent/Ocean" information, but the correct water body will get there, so people searching the oceans will find our data. However, people searching "Europe" will still get records from the South Georgia Islands.

@tucotuco
Member

tucotuco commented Sep 23, 2018

Thanks @Jegelewicz for this contribution. There are a lot of issues raised here. I think it might be worth separating them to simplify the discussion. As a preamble (something apparently obligatory when one is about to ramble), I would like to echo @ekrimmel from ArctosDB/arctos#1291 (comment), "...that obviously different collections/databases use different but equally correct ways to say the same thing..."

  • Arctos: For those who might not know about Arctos (https://arctos.database.museum/home.cfm), it is an online collections data management system that shares hardware and software infrastructure as well as geographies, taxonomic classifications, and other controlled vocabularies (called code tables by that community). As with any collection data management system, it should be modeled to most effectively meet collection data management needs.
  • Darwin Core: Darwin Core is a model for sharing biodiversity data on a global scale. That doesn't mean it is a good model for collection data management, but it does mean that, to the extent reasonable, collection data management systems can benefit from using Darwin Core terms, because their definitions are part of a global community ratified standard and to the extent that those terms are used properly, people will be more able to communicate information and find data. This is true both at the field level (fields such as dwc:continent and dwc:waterbody - the dwc: means we are speaking specifically of the meanings of those terms as defined under Darwin Core) and at the level of the values populating those fields (such as "Asia" and "North Atlantic Ocean"). We are at a stage when people are becoming much more interested in the use of controlled vocabularies when data are aggregated as Darwin Core, because doing so (in theory) will help people find what they are looking for. The organization Biodiversity Information Standards (TDWG) (https://www.tdwg.org/) has a Vocabularies of Values Data Quality Task Group (https://www.tdwg.org/community/bdq/tg-4/), which is gearing up to tackle the problem of building and managing community-managed controlled vocabularies for all the terms in Darwin Core that recommend having them. Data in biodiversity data sets in their original form or primary usage or data management systems do not have to follow Darwin Core in either structure (field names) or content (recommended vocabularies), but if they don't, then somewhere along the line it is useful to transform them to do so. This is what the aggregators face, and what VertNet tries to "bring closer to home" by doing "migrations" of data sets to transform them into Darwin Core following current best practices, reporting that to the data providers, and agreeing on what the data look like before they "go out the door" as Darwin Core.
  • Flags: Aggregators such as GBIF and iDigBio put "flags" on records to point out demonstrable errors, putative errors, or interpretations made on the original data. TDWG has a Tests and Assertions Data Quality Task Group (https://www.tdwg.org/community/bdq/tg-2/) well advanced in defining a core set of such tests so that anyone who implements them does so the same way, with the same results. That way, no matter where you get your feedback from, in these core tests at least it would be consistent.
  • iDigBio geography: The issue with mis-assignment of the South Georgia Islands to Europe isn't a Darwin Core issue, per se. It is one iDigBio would need to resolve (turns out GBIF has the same problem). They did so based on the country given as "United Kingdom", but not all claims by the United Kingdom are in Europe, so it would be great for someone to submit an issue with sufficient detail to https://www.idigbio.org/contact/Portal_feedback and via the feedback icon at https://www.gbif.org/.
  • Continent: Continents are funny (https://www.youtube.com/watch?v=3uBcq1x7P34&t=16s). The term dwc:continent is defined (http://rs.tdwg.org/dwc/terms/index.htm#continent) as "The name of the continent in which the Location occurs." The commentaries on the term recommend using a controlled vocabulary for the values, and suggest The Getty Thesaurus of Geographic Names (TGN) as the source. TGN has a 12-continent system (just kidding, five of those are "former physical features" such as "Pangaea"; http://www.getty.edu/vow/TGNServlet?english=Y&find=&place=continent&page=1&nation=). The seven remaining currently are "Africa", "Antarctica", "Asia", "Europe", "North and Central America", "Oceania" and "South America". So, the recommendation is pretty clear about what the concept is and what values are valid to use for it. It does not include the oceans.
  • Waterbody: The term dwc:waterbody is a lot broader than dwc:continent, as it can include everything from a pond to an ocean. Some use it for drainage basin systems (cuencas). Darwin Core is consistent about how its geography definitions are formed, even when that isn't particularly helpful. The definition for dwc:waterbody (http://rs.tdwg.org/dwc/terms/index.htm#waterBody) is "The name of the water body in which the Location occurs." and again recommends TGN as a source. There is no single query one can make of the TGN to get water bodies, though subsets of them can be retrieved if you know what "Place Type" to look for (try, for example, http://www.getty.edu/vow/TGNServlet?english=Y&find=&place=ocean&page=1&nation= or http://www.getty.edu/vow/TGNServlet?english=Y&find=&place=sea&page=1&nation=). Sadly, the definition doesn't tell us what to do if the Location is on an island in a water body. Should dwc:waterbody be left empty? Or should it be filled in with the name of the surrounding water body? Strictly speaking, if the location itself is not in the water, dwc:waterbody should be left empty, otherwise we end up with some incongruent assertions some day when the semantics become rigorously important.
  • Standardization: When we do geography standardization in the VertNet "migration" process, we look at all of the values of the incoming Darwin Core higher geography field equivalents (dwc:continent, dwc:country, dwc:stateProvince, dwc:county, dwc:municipality, dwc:waterbody, dwc:islandGroup, dwc:island) as a unit and do lookups against the combination of them all to fill in the values that should be in those same fields in Darwin Core. Teresa's DMNS Bird example (http://arctos.database.museum/guid/DMNS:Bird:18967?seid=409311, http://portal.vertnet.org/o/dmns/bird-specimens?id=http-arctos-database-museum-guid-dmns-bird-18967-seid-409311, https://www.idigbio.org/portal/records/f24f49d6-8c7c-4d31-8b64-15d82c1d2908, https://www.gbif.org/occurrence/1145060690) is a really good one to demonstrate the issues. Arctos has the following field:value pairs for the higher geography - "Continent":"Atlantic Ocean", "Country":"United Kingdom", "State":"South Georgia & South Sandwich Islands", "IslandGroup":"South Georgia Islands", "Island":"South Georgia". The problem is that Arctos' concept of the country for the South Georgia & South Sandwich Islands (United Kingdom) is not the same concept as in Darwin Core (South Georgia and the South Sandwich Islands). Arctos isn't wrong, it's just different, and so the mapping Arctos has made of their data into Darwin Core isn't always strictly in accord with the sharing mechanism, Darwin Core, which is what the aggregators expect. It looks like iDigBio and GBIF take the country value at face value, and derive the continent from that (clearly with flawed assumptions and algorithm). 
If the record had passed through a VertNet "migrator" (or the equivalent Kurator Geography Cleaner workflow) with the input mapped to Darwin Core with Continent-->dwc:continent, Country-->dwc:country, State-->dwc:stateProvince, IslandGroup-->dwc:islandGroup, Island-->dwc:island, the higher geography after standardization in Darwin Core would be "dwc:continent":"South America", "dwc:country":"South Georgia and the South Sandwich Islands", "dwc:countryCode":"GS", "dwc:waterbody":"South Atlantic Ocean", "dwc:islandGroup":"South Georgia Islands", "dwc:island":"South Georgia" (dwc:stateProvince would be left blank). Sure, one can argue about whether the South Georgias should be in South America, Antarctica, or neither, but TGN says it is in South America, so there you have it. The thing about the country might have seemed tricky. Here's why. We work with certain principles. One is to fill all the Darwin Core fields we can. So, we fill dwc:countryCode. The definition for that field (http://rs.tdwg.org/dwc/terms/index.htm#countryCode) is, "The standard code for the country in which the Location occurs." The recommendation is to use an ISO 3166-1-alpha-2 country code. Well, South Georgia and the South Sandwich Islands has its own country code (GS). The dwc:countryCode is supposed to correspond with the dwc:country, so that must be "South Georgia and the South Sandwich Islands". For further explanation of the principles behind the standardizations done by VertNet, have a look at the README.md at https://github.com/VertNet/DwCVocabs.
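
To make the "look up the combination as a unit" idea concrete, here is a minimal sketch in Python. It is not the VertNet migrator: the lookup table is a hypothetical stand-in holding only the one DMNS example from this thread, and the names `FIELDS`, `GEOGRAPHY_LOOKUP`, and `standardize` are invented for illustration.

```python
# Hypothetical stand-in for a combined higher geography vocabulary.
# The key holds the eight fields in the order listed in the comment above:
# continent, country, stateProvince, county, municipality,
# waterBody, islandGroup, island.
FIELDS = (
    "continent", "country", "stateProvince", "county",
    "municipality", "waterBody", "islandGroup", "island",
)

# Single illustrative entry, using the values quoted in this thread.
GEOGRAPHY_LOOKUP = {
    (
        "Atlantic Ocean", "United Kingdom",
        "South Georgia & South Sandwich Islands", "", "", "",
        "South Georgia Islands", "South Georgia",
    ): {
        "continent": "South America",
        "country": "South Georgia and the South Sandwich Islands",
        "countryCode": "GS",
        "stateProvince": "",
        "county": "",
        "municipality": "",
        "waterBody": "South Atlantic Ocean",
        "islandGroup": "South Georgia Islands",
        "island": "South Georgia",
    },
}


def standardize(record):
    """Look up all higher geography fields together, never one at a time.

    Returns the standardized fields, or None when the combination is
    not in the table (such a record would need human review).
    """
    key = tuple(record.get(field, "").strip() for field in FIELDS)
    return GEOGRAPHY_LOOKUP.get(key)


arctos_record = {
    "continent": "Atlantic Ocean",
    "country": "United Kingdom",
    "stateProvince": "South Georgia & South Sandwich Islands",
    "islandGroup": "South Georgia Islands",
    "island": "South Georgia",
}
result = standardize(arctos_record)
```

Because the whole combination is the key, "United Kingdom" on its own never gets a chance to drag the record into Europe, which is the failure described for the aggregators above.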

@Jegelewicz
Author

@tucotuco Thanks for taking the time to put together such a comprehensive ramble! It explains a lot and I hope we can use what you have said here to improve the geography reported from Arctos.

@debpaul
Contributor

debpaul commented Sep 24, 2018

Marvelous thread @Jegelewicz @tucotuco Note too that some including @chicoreus have wanted to discuss and encourage use of for some time. @tucotuco I think it's in our To-Do list of webinar topics too (at least broadly speaking).

@sharpphyl

sharpphyl commented Jul 20, 2020

To add another example, the DMNS:Inv collection in GBIF is 29,608 catalog records today. But because we map our continents differently, if you select the seven GBIF continents for DMNS:Inv, you only get 20,206 records. The difference seems to be all the records with geography that begin with an Ocean rather than a continent. For example, our specimens from Hawaii, New Zealand, etc. that begin with the Pacific Ocean can't be found by searching the GBIF continents.

If we were to separate continent (and add Oceania) and water body (as it appears some other collections including MCZ do) our records would map more accurately and would be found in searches based on the continents.

GBIF continents:
[screenshot, 2020-07-20 10:35 AM; image not preserved]

Arctos continents:
[screenshot, 2020-07-20 10:36 AM; image not preserved]

@tucotuco
Member

Despite https://www.youtube.com/watch?v=3uBcq1x7P34&t=25s, VertNet uses the following principle for its vocabularies with respect to continents, exactly matching GBIF.

We use the geopolitical concept of continents following the seven continent model, which include Africa, Antarctica, Asia, Europe, North America (with Central America and the Caribbean), Oceania (with Australasia) and South America.
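
As a rough illustration of how a collection could audit its own Continent/Ocean values against that seven-continent list, here is a small Python sketch. The function name `audit_continents` and the sample values are made up for the example; only the seven continent names come from the principle quoted above.

```python
from collections import Counter

# The seven continents named in the principle quoted above.
SEVEN_CONTINENTS = {
    "Africa", "Antarctica", "Asia", "Europe",
    "North America", "Oceania", "South America",
}


def audit_continents(values):
    """Count verbatim continent values, split by whether they are in the list.

    Returns two dicts of value -> record count: (matched, unmatched).
    """
    counts = Counter(value.strip() for value in values)
    matched = {v: n for v, n in counts.items() if v in SEVEN_CONTINENTS}
    unmatched = {v: n for v, n in counts.items() if v not in SEVEN_CONTINENTS}
    return matched, unmatched


# Made-up sample values, not real collection data.
sample = ["Europe", "Pacific Ocean", "Atlantic Ocean", "Pacific Ocean", "Asia"]
matched, unmatched = audit_continents(sample)
```

The `unmatched` counts are the records that a search restricted to the seven continents would miss.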

@sharpphyl

sharpphyl commented Aug 19, 2020

@tucotuco So when an Arctos higher geography starts with an ocean or an unlisted land mass, what does VertNet (thus GBIF) do with it? We have 20 entries in our Continent/Ocean field and six of them match the VertNet continents. The seventh (Oceania) isn't in our Arctos vocabulary.

BTW, loved the youtube video.

@tucotuco
Member

@sharpphyl I might be about to over-answer your question, but I think it is important. In case you only need the bottom line, the answer is that VertNet keeps the geography just as it comes through the Darwin Core Archive from the IPT, and GBIF does not provide an interpreted continent because the original value "Atlantic Ocean" isn't a continent under their interpretation. I agree with their interpretation.

Now, the rest of the story...

When a data set from Arctos is published to the VertNet IPT, it does its best to manipulate the Arctos world view to the Darwin Core world view (VertNet made that interpretation in collaboration with @dustymc). That same Darwin Core world view is the one shared through the IPT to all aggregators, including VertNet (portal), iDigBio, and GBIF, who grab it from the IPT Darwin Core archive and put the data through their respective ingestion processes, which are not the same.
VertNet takes the geographic data as published and does no further interpretation of it. The "migrator" interpretations I mentioned in an earlier comment happen long before publishing, in a feedback loop in which VertNet uses its vocabularies (with geography being a single integrated vocabulary because dividing it up by field is certain to create a mess, see https://github.com/VertNet/DwCVocabs/blob/master/vocabs/Geography.csv) to provide reports with suggestions about how various errors can be fixed and standards followed. That loop is closed when the data publisher is satisfied that the data they are publishing meet their own quality criteria. VertNet only offers the suggestions and publishes when the data publisher asks. On the portal, the geography searches use exactly what was published.

GBIF ingests the data as published, but also adds interpretations (to an ever-increasing number of fields) to aid searches (https://www.gbif.org/article/5i3CQEZ6DuWiycgMaaakCo/gbif-infrastructure-data-processing, technical details at https://github.com/gbif/pipelines#interpretation).

For Arctos, all of this means that, for example, this specimen https://arctos.database.museum/guid/DMNS:Bird:18967 with the following data in Arctos:

  • higherGeography: Atlantic Ocean, United Kingdom, South Georgia & South Sandwich Islands, South Georgia Islands, South Georgia
  • continent: Atlantic Ocean
  • islandGroup: South Georgia Islands
  • island: South Georgia
  • country: United Kingdom
  • stateProvince: South Georgia & South Sandwich Islands

comes through the Darwin Core Archive (and thus the VertNet portal http://portal.vertnet.org/o/dmns/bird-specimens?id=http-arctos-database-museum-guid-dmns-bird-18967-seid-409311) as:

  • higherGeography: Atlantic Ocean, United Kingdom, South Georgia & South Sandwich Islands, South Georgia Islands, South Georgia
  • continent: Atlantic Ocean
  • waterbody:
  • islandGroup: South Georgia Islands
  • island: South Georgia
  • country: United Kingdom
  • countrycode:
  • stateProvince: South Georgia & South Sandwich Islands
  • county:
  • municipality:

If these data had run through the VertNet migrator (or other tools based on the geography vocabulary mentioned above) they would have come out in Darwin Core as:

  • higherGeography: Atlantic Ocean | United Kingdom | South Georgia & South Sandwich Islands | | | | South Georgia Islands | South Georgia
  • continent: South America
  • waterbody: South Atlantic Ocean*
  • islandGroup: South Georgia Islands
  • island: South Georgia
  • country: South Georgia and the South Sandwich Islands
  • countrycode: GS
  • stateProvince:
  • county:
  • municipality:

* The vocabulary provides the ocean because the original did, but note that one of the geography principles that governs the VertNet vocabularies is that if the original location is terrestrial, the surrounding waterbody should NOT be supplied - it should be there only if the location is IN the waterbody. Thus, if the original did not include "Atlantic Ocean", the waterbody would have been left blank.
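
The pipe-delimited higherGeography in the list above appears to be the eight higher geography fields joined in a fixed order, with empty slots kept for blank fields. That reading is an inference from the example, not a documented format, so treat the following Python sketch (with invented names `KEY_FIELDS`, `make_key`, `parse_key`) as an assumption-laden illustration.

```python
# Assumed slot order, inferred from the example in this thread.
KEY_FIELDS = (
    "continent", "country", "stateProvince", "county",
    "municipality", "waterBody", "islandGroup", "island",
)


def make_key(record):
    """Join the eight fields with pipes, keeping empty slots for blanks."""
    return " | ".join(record.get(field, "").strip() for field in KEY_FIELDS)


def parse_key(key):
    """Split a pipe-delimited key back into a field -> value dict."""
    parts = [part.strip() for part in key.split("|")]
    if len(parts) != len(KEY_FIELDS):
        raise ValueError(f"expected {len(KEY_FIELDS)} slots, got {len(parts)}")
    return dict(zip(KEY_FIELDS, parts))


# The key as quoted in the comment above.
thread_key = (
    "Atlantic Ocean | United Kingdom | "
    "South Georgia & South Sandwich Islands | | | | "
    "South Georgia Islands | South Georgia"
)
parsed = parse_key(thread_key)
```

Keeping the empty slots is what lets the position alone say which field a value belongs to, so "South Georgia" is read as the island and not as a county.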

In GBIF the raw Darwin Core data are there, but there are also the following interpretations (https://www.gbif.org/occurrence/1145060690):

  • higher geography: Atlantic Ocean, United Kingdom, South Georgia & South Sandwich Islands, South Georgia Islands, South Georgia
  • continent:
  • country or area: United Kingdom
  • country code: GB
  • island: South Georgia
  • islandGroup: South Georgia Islands
  • state province: South Georgia & South Sandwich Islands

Along with these, GBIF provides the following flags:

  • original continent "Atlantic Ocean" excluded
  • country coordinate mismatch
  • geodetic datum assumed WGS84 (because it was explicitly "unknown")

@sharpphyl

Thank you so much for your detailed response. In my world, there's no such thing as "over-answering" a technical question. I'm still working through the details of your response and the way our data is ingested into aggregators so I may have more questions, but I better understand now why some of our data is marked Invalid and why various searches may not return all our records.

While GBIF lists the seven continents, it doesn't appear that GBIF lists water bodies and just references TGN. Correct?

It does appear that GBIF makes one exception to substituting Invalid for any country that should be in the continent Oceania. They do map Australia (perhaps the continent, not the country) to Oceania. But New Zealand, Fiji etc. that we put in the Pacific Ocean are marked "Invalid" rather than remapping to Oceania. And Hawaii which we have in the Pacific Ocean is not remapped to North America.

I think @Jegelewicz will add this to an AWG agenda for discussion and your data will be very helpful. Thanks.

@tucotuco
Member

For reference, continent interpretation in GBIF is a matter of a simple lookup on the verbatim continent value provided by the data publisher using this table. Problems with this approach and a potential solution have been presented to GBIF in this issue.
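
A simple lookup of that kind can be sketched as follows in Python. This is not GBIF's code or its actual table: `CONTINENT_TABLE` is a tiny hypothetical stand-in, `interpret_continent` is an invented name, and the flag text just mirrors the wording shown on the GBIF record earlier in this thread.

```python
# Hypothetical lookup table: lowercased verbatim value -> continent.
CONTINENT_TABLE = {
    "africa": "Africa",
    "antarctica": "Antarctica",
    "asia": "Asia",
    "europe": "Europe",
    "north america": "North America",
    "oceania": "Oceania",
    "south america": "South America",
}


def interpret_continent(verbatim):
    """Return (interpreted_continent, flags) for a verbatim continent value.

    A value missing from the table yields no continent plus a flag; the
    lookup never consults country, coordinates, or any other field.
    """
    cleaned = (verbatim or "").strip()
    if not cleaned:
        return None, []
    match = CONTINENT_TABLE.get(cleaned.lower())
    if match is None:
        return None, [f'original continent "{cleaned}" excluded']
    return match, []
```

For example, `interpret_continent("Atlantic Ocean")` gives no continent and one flag, which matches the blank interpreted continent seen for DMNS:Bird:18967.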
