Biodiversity data cleaning

Ashwin Agrawal edited this page Apr 3, 2017 · 13 revisions

Background

Biodiversity data cleaning is an essential step before biodiversity occurrence data can be used for any meaningful analysis or model building. The R environment already has several functions that address this, but some crucial functionality is still missing, so the whole workflow cannot yet be completed within R. This project is an attempt to fill some of those gaps and take the workflow to the next level.

Related work

There are a variety of R packages for biodiversity data handling. For data retrieval there are packages such as rgbif, rvertnet and rinat; for data cleaning there are packages such as rgeospatialquality, taxize and scrubr.

Details of your coding project

The project will provide functions that help clean and flag spatial, temporal, taxonomic and discrepancy issues, for example:

  • Extend the function scrubr::coord_within to accept basemaps in WKT or shape files
  • Check the validity of all spatial description fields, such as country and continent.
  • Match scientific names against various backbone taxonomies, such as GBIF, EOL, GNI and NCBI.
  • Check the validity of the date format
  • Identify spatial and temporal outliers in the data
  • Develop a flexible flagging matrix with interactions between certain flags, and develop a set of data quality indices using this matrix.
  • Develop a Darwin Core (DwC) summary table based on fields and vocabulary.
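
As an illustration of the date validity check listed above, here is a minimal sketch in base R. It is not an existing function of scrubr or any other package: the name flag_event_date, the flag labels, and the default lower bound of 1700-01-01 are all assumptions made for this example. It also accepts only plain YYYY-MM-DD strings, so values carrying a time component or a date range would be flagged as "bad_format" and would need extra handling in a real implementation.

```r
# Hypothetical helper (not part of any existing package): flags the quality
# of date strings such as the Darwin Core eventDate field.
flag_event_date <- function(x,
                            earliest = as.Date("1700-01-01"),  # assumed lower bound
                            today = Sys.Date()) {
  x <- as.character(x)
  flag <- rep("ok", length(x))

  # Missing values: NA or empty / whitespace-only strings.
  missing <- is.na(x) | !nzchar(trimws(x))

  # Only strict YYYY-MM-DD strings are treated as well formatted here.
  iso_like <- !missing & grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)

  # Parse the well-formatted strings; impossible dates (e.g. Feb 30) become NA.
  parsed <- rep(as.Date(NA), length(x))
  parsed[iso_like] <- as.Date(x[iso_like], format = "%Y-%m-%d")

  flag[missing] <- "missing"
  flag[!missing & !iso_like] <- "bad_format"
  flag[iso_like & is.na(parsed)] <- "invalid_date"
  flag[!is.na(parsed) & (parsed < earliest | parsed > today)] <- "out_of_range"

  data.frame(eventDate = x, dateFlag = flag, stringsAsFactors = FALSE)
}

# Example usage on a few made-up values.
flag_event_date(c("2017-04-03", "2017-02-30", "03/04/2017", NA))
```

Returning the flag as an extra column, rather than dropping records, leaves the decision about what to exclude to the user, which fits the idea of a flagging matrix described above.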

Expected impact

A growing part of the biodiversity research community uses R in its data analysis workflows. This package would add much-needed protocols to the data cleaning process and bring the workflow one step closer to reproducible research.

Mentors

Vijay Barve <vijay.barve@gmail.com>

Tomer Gueta <tomer.gu@gmail.com>

Yohay Carmel <yohay@technion.ac.il>

Tests

  • Easy: Install the package rgbif and download sample data for a few species. Use the package mapr to plot the occurrence data on maps.
  • Easy: List existing R packages that have useful functions for biodiversity data cleaning.
  • Medium: Write an R function to check the dates of all records downloaded from GBIF for a set of species (more than 10,000 records) and add a flag field indicating the quality of the date data.
  • Hard: Write a function to identify records very close to the centroid of any country.
  • Hard: Submit a pull request to the package scrubr for any pending issue.

Solutions of tests

Please contact Tomer Gueta and Vijay Barve after solving at least one of the tests above.