Integrating biodiversity data curation functionality
Biodiversity research is evolving rapidly, progressively changing into a more collaborative and data-intensive science. The integration and analysis of large amounts of data is inevitable, as researchers increasingly address questions at broader spatial, taxonomic, and temporal scales than before. Until recently, biodiversity data were scattered across natural history collections, survey reports, and the literature, in a variety of formats. In the last fifteen years, considerable effort has gone into establishing a standard structure for biodiversity databases (the Darwin Core standard, DwC). However, none of the hundreds of DwC fields is mandatory, and none imposes strong rules on the content associated with a record; thus, data vary in precision and quality. To date, several centralized portals aggregate large volumes of biodiversity records from around the world and publish them in DwC format. These aggregators are prone to numerous data errors, due to incomplete or erroneous information at the publisher level, errors introduced during the publishing process (e.g. formatting of date information), and errors during the central harvesting and indexing procedures.
Data cleaning is the process of detecting inaccurate, incomplete, unreasonable, or unsuitable data, and of improving data quality by correcting the errors and omissions found. The cleaning process may include validations such as format checks, completeness checks, reasonableness checks, and limit checks. These checks usually result in flagging and documenting suspect records, which are subsequently corrected or eliminated. The user is responsible for identifying these errors and assessing whether the data are suitable for a particular application or purpose.
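As a minimal illustration, such checks can be expressed as logical flag columns in base R. This is a hypothetical sketch: `occ` is assumed to be a data.frame of DwC occurrence records, and the `flag_*` column names are ours, not a standard.

```r
# 'occ' is assumed to be a data.frame of DwC occurrence records;
# the 'flag_*' columns are illustrative names, not a standard

# Completeness check: are the coordinates present?
occ$flag_missing_coords <- is.na(occ$decimalLatitude) |
  is.na(occ$decimalLongitude)

# Limit check: are the coordinates within valid ranges?
occ$flag_coords_out_of_range <- !occ$flag_missing_coords &
  (abs(occ$decimalLatitude) > 90 | abs(occ$decimalLongitude) > 180)

# Format check: is eventDate parseable as an ISO date?
occ$flag_bad_date <- is.na(as.Date(occ$eventDate, format = "%Y-%m-%d"))

# Suspect records are flagged and documented, not silently dropped
suspect <- occ[occ$flag_missing_coords |
                 occ$flag_coords_out_of_range |
                 occ$flag_bad_date, ]
```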
An increasing number of scientists use R for their data analyses; however, the skill set required to handle biodiversity data in R varies considerably. Since users need to retrieve, manage, and assess high-volume data with a complex structure (DwC), currently only users with a very sound R programming background can attempt this. Recently, various R packages dealing with biodiversity data, and specifically with data cleaning, have been published (e.g. finch, scrubr, biogeo, rgeospatialquality, and taxize). Although numerous new procedures are now available, implementing them requires users to prepare the data according to the format of each of these packages. Even an experienced R user spends a lot of time exploring and learning each package; for the average user this task can be daunting. To truly facilitate data cleaning in R, one must fully understand and address the capabilities and skills of the average user. Developing an R package that fully integrates the functionality of existing packages, enhances it, and thereby simplifies its use can greatly serve the scientific community. Only by offering a cohesive framework for data quality assessment can we fully harness the synergistic quality of the R package ecosystem.
As mentioned above, there has recently been a burst of R packages dealing with biodiversity data and specifically with data cleaning. The key packages are described below:
Data retrieval:

| Package | Description |
| --- | --- |
| finch | Read Darwin Core Archive files |
| rgbif | Search and retrieve data from the Global Biodiversity Information Facility (GBIF) |
| spocc | Collect species occurrence data from GBIF, ALA, iDigBio, iNaturalist, AntWeb, eBird, BISON, and more |

Taxonomic cleaning and enrichment:

| Package | Description |
| --- | --- |
| taxize | A taxonomic toolbelt for R, which wraps APIs for a large suite of taxonomic databases available on the web |
| traits | Species trait data from many different sources |
| rredlist | IUCN Red List API client |

Biodiversity data cleaning:

| Package | Description |
| --- | --- |
| biogeo | Assessing and improving the data quality of occurrence record datasets |
| rgeospatialquality | A set of basic geospatial assessment functions over sets of primary biodiversity records, using the Geospatial Quality API |
| scrubr | A toolbox for cleaning biological occurrence records |
| assertr | A suite of functions designed to verify assumptions about data |
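Two of these packages can already be chained together: the sketch below retrieves records with rgbif and flags them with rgeospatialquality. It is a rough sketch assuming the package interfaces current at the time of writing (`occ_search()` and `add_flags()`); argument and field names may differ between package versions.

```r
library(rgbif)
library(rgeospatialquality)

# Retrieve georeferenced GBIF records of the long-nosed bandicoot,
# requesting only the fields the Geospatial Quality API needs
res <- occ_search(scientificName = "Perameles nasuta",
                  hasCoordinate = TRUE, limit = 500,
                  fields = c("scientificName", "decimalLatitude",
                             "decimalLongitude", "countryCode"))

# add_flags() posts the records to the Geospatial Quality API and
# appends a 'flags' column with the assessment of each record
flagged <- add_flags(as.data.frame(res$data))
```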
We propose to develop an R package that will serve as a framework for biodiversity data quality assessment, by integrating existing packages, enhancing certain functionalities and developing new ones, and producing appropriate supporting materials. The project will consist of the following five elements:
- Data retrieval & prepping: A set of functions dealing with the retrieval of high-volume biodiversity data and the standardization of its structure.
- Data management: Develop a fields-grouping system for handling the numerous DwC fields in R, a uniform flagging system, an easy comparison against verbatim values, and a read-only backup of the original data (a sketch of these conventions follows this list).
- Data cleaning: Develop a biodiversity data quality assessment workflow by integrating functions from the relevant packages. The workflow will be divided thematically (e.g. taxonomic, spatial, temporal, and duplicates).
- Compatibility: Enhance the usability of the developed code by ensuring its compatibility with the Kurator project. In addition, we plan to submit the package to the rOpenSci suite; we will therefore follow the rOpenSci guidelines, and the package will go through a process of open peer review to ensure a consistent level of quality.
- Process documentation and reproducibility: To support process documentation, we propose to construct Jupyter notebooks and to develop a diagnostic-report template. We also propose to develop detailed vignettes demonstrating different types of cleaning procedures, and a generic template for a standardized data cleaning procedure.
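To make the data-management element more concrete, here is a purely hypothetical sketch; all function, attribute, and column names below are illustrative design ideas, not an existing API.

```r
# Hypothetical sketch: bd_backup(), bd_flag_spatial() and bd_verbatim_diff()
# are illustrative names, not an existing API

# Keep an untouched copy of the raw data as an attribute, so any cleaned
# value can later be compared against its verbatim original
bd_backup <- function(occ) {
  attr(occ, "verbatim") <- occ
  occ
}

# Each thematic check appends a logical 'flag_<theme>_<check>' column
# instead of modifying or dropping records
bd_flag_spatial <- function(occ) {
  occ$flag_spatial_bad_coords <-
    is.na(occ$decimalLatitude) | is.na(occ$decimalLongitude) |
    abs(occ$decimalLatitude) > 90 | abs(occ$decimalLongitude) > 180
  occ
}

# Report which rows of a field differ from their verbatim values
bd_verbatim_diff <- function(occ, field) {
  verbatim <- attr(occ, "verbatim")[[field]]
  current  <- occ[[field]]
  which(xor(is.na(current), is.na(verbatim)) |
          (!is.na(current) & !is.na(verbatim) & current != verbatim))
}

# Usage: occ <- bd_flag_spatial(bd_backup(raw_occ))
```

Storing the verbatim copy as an attribute is only one possible design; a locked environment, or a slot in an S4/R6 class, would serve the same purpose.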
Given the volume of work required in this area, we propose to break the tasks into two categories. This proposal deals with all the integration-related tasks, which will facilitate the data format conversions and the smooth execution of all the available data cleaning functions from the various packages mentioned above. A second proposal, listed separately (Biodiversity data cleaning), covers building the missing functionality.
The proposed package will be the first crucial step toward a complete workflow for handling large biodiversity datasets in R. With the foundations laid here, we plan to improve the developed components and add new ones (e.g. data enrichment, interactive visualisations, a GUI). Improving the quality of biodiversity research depends, in some measure, on improving user-level data cleaning tools and skills. Adopting a more comprehensive approach that incorporates data cleaning into data analysis will not only improve the quality of biodiversity data, but will also promote more appropriate usage of such data. This can greatly serve the scientific community and, consequently, our ability to address urgent conservation issues more accurately.
Tomer Gueta <tomer.gu@gmail.com>
Yohay Carmel <yohay@technion.ac.il>
Vijay Barve <vijay.barve@gmail.com>
Please contact Tomer Gueta and Vijay Barve after solving at least one of the tests below.
- Easy: Install the ‘rgbif’ and ‘rgeospatialquality’ packages. Execute the ‘add_flags’ function on 500 records of the species ‘Perameles nasuta’ (follow this vignette). Then do the same, but retrieve the data from the GBIF portal.
- Medium: Write a function that retrieves 5,000 georeferenced records of Australian mammals from GBIF, and then successfully sends all of them to the Geospatial Quality API.
- Medium: Write an Rd file and a vignette for the function you developed above.
- Hard: How would you build a package for the inexperienced R user? Describe all the methods and tricks you know, or that could be developed; please be creative.
- Hard: Please read this and this, and suggest a feasible concept for managing DwC data in R.
Students, please post a link to your test results here.