Parser for Biodiversity checklists

thomvee edited this page Mar 24, 2017 · 13 revisions

Background

Compiling taxonomic checklists from varied data sources is a common task for biodiversity informaticians. Checklist data usually occur as free text, and extracting taxon names from that text into a tabular format takes significant manual effort. Textual sources such as research publications and websites frequently also contain additional attributes such as synonyms, common names, higher taxonomy and distribution. A facility to quickly extract textual data into tabular lists would make it easy to aggregate biodiversity data in a structured format that can be processed further and uploaded to data aggregation initiatives.

Related work

R has a few packages, such as httr, rvest and hunspell, that cover basic operations like fetching files and attempting to parse them. A taxonomy-specific package is still needed, because taxonomy has its own unique structure and complexities.

Details of your coding project

  • Functions to search for names of organisms within supplied text
  • Functions to manipulate taxon names, such as assigning ranks to a name string, e.g. splitting the string ‘Papilio machaon Seyer, 1976’ into Genus = ‘Papilio’, Species = ‘machaon’, Author = ‘Seyer’ and Year = ‘1976’
  • Functions to parse taxonomic lists and return the information in table format
  • Recursive functions to crawl websites
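
As an illustration of the name-manipulation item above, here is a minimal sketch of splitting a name string into its parts. It is written in Python purely to show the logic; the proposed package targets R, and the function name `parse_name`, the pattern `NAME_PATTERN` and the returned field names are assumptions made for this example, not part of any existing package. The pattern only covers the simple "Genus species Author, Year" form and does not handle subgenera, infraspecific ranks, bracketed authors or multiple authors.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern (an assumption for this sketch): a capitalised genus,
# a lowercase species epithet, then an optional "Author, Year" part.
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+"
    r"(?P<species>[a-z-]+)"
    r"(?:\s+(?P<author>[^,\d]+?),?\s*(?P<year>\d{4}))?$"
)


def parse_name(name: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a simple name string into genus, species, author and year.

    Returns None when the string does not match the simple form.
    Author and Year are None when the string has no authorship part.
    """
    match = NAME_PATTERN.match(name.strip())
    if match is None:
        return None
    author = match.group("author")
    return {
        "Genus": match.group("genus"),
        "Species": match.group("species"),
        "Author": author.strip() if author is not None else None,
        "Year": match.group("year"),
    }


if __name__ == "__main__":
    print(parse_name("Papilio machaon Seyer, 1976"))
```

Under these assumptions, the example string yields Genus = ‘Papilio’, Species = ‘machaon’, Author = ‘Seyer’ and Year = ‘1976’, and a string that does not start with a capitalised genus returns `None`. A list of such results maps directly onto the rows of a table, which is the output format the parsing functions are meant to produce.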

Expected impact

A growing part of the biodiversity research community uses R in its data analysis workflows. This package would add a tool to extract taxonomic name lists and related data from different file formats, such as txt, html or pdf, to quickly build checklists.

Mentors

Please contact Vijay Barve vijay.barve@gmail.com after solving at least one of the tests below.

Tests

  • Easy: Read the html from URL [http://ftp.funet.fi/pub/sci/bio/life/insecta/lepidoptera/ditrysia/papilionoidea/papilionidae/papilioninae/lamproptera/] and get the genus name
  • Easy: List out all the species from the list [https://www.abdb-africa.org/genus/Papilio]
  • Medium: Read the html from URL [http://ftp.funet.fi/pub/sci/bio/life/insecta/lepidoptera/ditrysia/papilionoidea/papilionidae/papilioninae/lamproptera/] and get all the species names
  • Medium: Convert the above task into a function
  • Hard: Read in the file [https://github.com/vijaybarve/Parser-GSOC2017-idea/blob/master/taxo01.txt] and output the parsed data as a .csv file in the form of [https://github.com/vijaybarve/Parser-GSOC2017-idea/blob/master/taxo_out01.csv]
  • Hard: Convert the above task into a function

Solutions of tests

Students, please post a link to your test results here.