
Implementing biodiversity data checks for the bdchecks package

Tomer Gueta edited this page Mar 25, 2019 · 1 revision

Background

bdchecks is an infrastructure for performing, filtering and managing various biodiversity data checks using R. Data checks are key to promoting biodiversity data quality. bdchecks offers features for different types of R users:

  • An interactive and user-friendly Shiny app for inexperienced R users.
  • Full command line functionality for more experienced R users.
  • Advanced R users can easily edit, add and manage their own collection of data checks, using one single YAML file and only two supporting R functions.

Related work

bdchecks (available on CRAN) is part of the bdverse infrastructure and is a dependency of another bdverse package, bdclean.

Details of your coding project

Our main mission is to implement the full core suite of tests and assertions being developed by TDWG’s Biodiversity Data Quality ‘Task Group 2: Data Quality Tests and Assertions’. Although the bdchecks core is designed to match the structure of these tests, achieving and maintaining complete synchronization will be challenging.

Your coding project key points:

  • Get familiar with the bdchecks package and its data checks infrastructure (YAML file incorporation, dataCheck class)
  • Construct and test as many data checks as possible
  • Implement a report that lists unsuccessful data checks and describes the errors
  • Implement analysis reproducibility
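
As a rough illustration of the reporting key point, here is a minimal base R sketch of a report that lists which checks failed and for which records. Everything in it is an assumption made for illustration: the check names, the `report_failures()` helper and the sample data are hypothetical and are not part of the bdchecks API.

```r
# Hypothetical checks: each takes a data frame and returns a logical vector
# (TRUE = record passes). These are illustrative, not bdchecks functions.
checks <- list(
  latitude_in_range = function(d) {
    !is.na(d$decimalLatitude) & abs(d$decimalLatitude) <= 90
  },
  longitude_in_range = function(d) {
    !is.na(d$decimalLongitude) & abs(d$decimalLongitude) <= 180
  }
)

# Run every check and summarise the records that fail each one.
# Checks with no failing records are dropped from the report.
report_failures <- function(data, checks) {
  rows <- lapply(names(checks), function(name) {
    passed <- checks[[name]](data)
    data.frame(
      check = name,
      n_failed = sum(!passed),
      failed_rows = paste(which(!passed), collapse = ","),
      stringsAsFactors = FALSE
    )
  })
  report <- do.call(rbind, rows)
  report[report$n_failed > 0, , drop = FALSE]
}

# Small made-up occurrence table using Darwin Core style column names.
occurrences <- data.frame(
  decimalLatitude = c(32.1, 95.0, NA),
  decimalLongitude = c(34.8, 10.0, 200.0)
)

report <- report_failures(occurrences, checks)
print(report)
```

Keeping each check as a function that returns one logical value per record makes the report a simple aggregation step. A real implementation would presumably read the check descriptions from the package's YAML file rather than from a hard-coded list.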

Skills Required

R and Shiny.

Advantage: experience in working with biodiversity big-data.

Expected impact

Improving the quality of biodiversity research depends, in some measure, on improving user-level data cleaning tools and skills. Adopting a more comprehensive approach to incorporating data cleaning into data analysis will not only improve the quality of biodiversity data but also encourage more appropriate use of such data.

Mentors

Students, please contact mentors below after completing at least one of the tests below.

  • Tomer Gueta tomer.gu@gmail.com is leading the bdverse project. He is a postdoctoral fellow at the Faculty of Civil and Environmental Engineering at the Technion, working with Prof. Yohay Carmel. His research deals with developing tools and methodologies for data-intensive biodiversity research. During the last two years, Tomer served as a GSoC mentor with the R project organization.

  • Thiloshon Nagarajah thiloshon@gmail.com is a key member of the bdverse development team. He is a past GSoC and GCI student with the Fedora Project, Sahana Foundation and R Language.

  • Vijay Barve vijay.barve@gmail.com is the author and maintainer of bdvis and a key member of the bdverse development team. Vijay is a biodiversity data scientist who has been a GSoC student and mentor since 2012 with the R project organization. Vijay has contributed to several packages on CRAN.

Tests

Students, please do one or more of the following tests before contacting the mentors.

  • Medium: Re-implement an existing data check (i.e., improve an existing data check function) and import the check into R using the bdchecks YAML file. Provide benchmarks for performance improvements.
  • Hard: Implement a data check that does not exist yet, create an entry for it in dataChecks.yaml and import it into R using.
  • Hard: Implement code tests using the testthat package for any data check (or multiple data checks).
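
To give a feel for what a data check and its testthat tests might look like, here is a minimal sketch. The function `check_event_date_valid()` is hypothetical and is not an existing bdchecks check; it is plain R and does not show how checks are registered in dataChecks.yaml. The tests run only if testthat is installed.

```r
# Hypothetical data check (not part of bdchecks): TRUE where eventDate is a
# real calendar date written as YYYY-MM-DD, FALSE otherwise (including NA).
check_event_date_valid <- function(event_date) {
  parsed <- as.Date(as.character(event_date), format = "%Y-%m-%d")
  !is.na(parsed)
}

# Made-up input: one valid date, one impossible date, one malformed value
# and one missing value.
dates <- c("2019-03-25", "2019-02-30", "unknown", NA)
result <- check_event_date_valid(dates)
print(result)

# Unit tests with testthat, run only if the package is installed.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("valid ISO dates pass", {
    testthat::expect_true(check_event_date_valid("2019-03-25"))
  })
  testthat::test_that("impossible, malformed and missing dates fail", {
    testthat::expect_equal(
      check_event_date_valid(c("2019-02-30", "unknown", NA)),
      c(FALSE, FALSE, FALSE)
    )
  })
}
```

For the benchmarking part of the Medium test, one simple option is to wrap the old and new implementations in base R's `system.time()` on the same large input and compare the elapsed times.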

Solutions of tests

Students, please post a link to your test results here in the format: Name - Email - University - Link to solutions
