Parser for Biodiversity checklists
Compiling taxonomic checklists from varied data sources is a common task for biodiversity informaticians. Checklist data usually occur in textual formats, and significant manual effort is required to extract taxon names from the text into a tabular format. Textual sources such as research publications and websites frequently also contain additional attributes such as synonyms, common names, higher taxonomy and distribution. A facility to quickly extract textual data into tabular lists would make it easier to aggregate biodiversity data in a structured format that can be processed further and uploaded to data aggregation initiatives.
R has a few packages, such as httr, rvest and hunspell, for basic operations like fetching files and attempting to parse them. However, a taxonomy-specific package is needed because taxonomy has its own unique structure and complexities.
- Functions to search for names of organisms within supplied text
- Functions to manipulate taxon names, such as assigning ranks to the parts of a name string, e.g. splitting the string ‘Papilio machaon Seyer, 1976’ into Genus = ‘Papilio’, Species = ‘machaon’, Author = ‘Seyer’ and Year = ‘1976’
- Functions to parse taxonomic lists and return the information in table format
- Recursive functions to crawl websites
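The name-splitting and list-parsing functions above could look roughly like the following. This is only a minimal sketch, written in Python for illustration (the proposed package itself targets R). The function names `parse_name` and `parse_checklist` and the regular expression are hypothetical, and the pattern assumes simple binomials of the form "Genus species Author, Year"; real checklists would need handling for subspecies, synonyms and other cases.

```python
import re

# Hypothetical pattern for "Genus species Author, Year".
# Author and year are optional; parentheses around them are tolerated.
NAME_PATTERN = re.compile(
    r"^\s*(?P<genus>[A-Z][a-z]+)\s+"          # capitalised genus
    r"(?P<species>[a-z-]+)"                    # lower-case specific epithet
    r"(?:\s+\(?(?P<author>[^,()\d]+?)\s*,?\s*"  # optional author string
    r"(?P<year>\d{4})?\)?)?\s*$"               # optional four-digit year
)


def parse_name(name):
    """Split a name string into its parts, or return None if it does not match."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    return {
        "Genus": match.group("genus"),
        "Species": match.group("species"),
        "Author": match.group("author"),
        "Year": match.group("year"),
    }


def parse_checklist(text):
    """Parse a block of text line by line and keep only lines that look like names."""
    rows = []
    for line in text.splitlines():
        parsed = parse_name(line)
        if parsed is not None:
            rows.append(parsed)
    return rows


if __name__ == "__main__":
    print(parse_name("Papilio machaon Seyer, 1976"))
    # {'Genus': 'Papilio', 'Species': 'machaon', 'Author': 'Seyer', 'Year': '1976'}
```

The list of dictionaries returned by `parse_checklist` maps directly onto table rows, so it could be written out as a .csv file. Because the matching is purely pattern-based, any line that happens to start with a capitalised word followed by a lower-case word would be accepted as a name, which is one reason a taxonomy-aware approach is needed.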
A growing number of biodiversity researchers use R in their data analysis workflows. This package would give them a tool to extract taxonomic name lists and related data from file formats such as txt, html or pdf and quickly build checklists.
- [[http://vijaybarve.net/][Vijay Barve]] vijay.barve@gmail.com
- Rohit George rohitmg@gmail.com
- Thomas Vattakaven thomas.vee@gmail.com
- Narayani Barve narayani.ku@gmail.com
Please contact Vijay Barve vijay.barve@gmail.com after solving at least one of the tests below.
- Easy: Read the html from URL [http://ftp.funet.fi/pub/sci/bio/life/insecta/lepidoptera/ditrysia/papilionoidea/papilionidae/papilioninae/lamproptera/] and get the genus name
- Easy: List out all the species from the list [https://www.abdb-africa.org/genus/Papilio]
- Medium: Read the html from URL [http://ftp.funet.fi/pub/sci/bio/life/insecta/lepidoptera/ditrysia/papilionoidea/papilionidae/papilioninae/lamproptera/] and get all the species names
- Medium: Convert the above task into a function
- Hard: Read in the file [https://github.com/vijaybarve/Parser-GSOC2017-idea/blob/master/taxo01.txt] and output the parsed data in the form of a .csv file [https://github.com/vijaybarve/Parser-GSOC2017-idea/blob/master/taxo_out01.csv]
- Hard: Convert the above task into a function
Students, please post a link to your test results here.
- Sumedh Mool (https://github.com/Sumedh04/Praser)
- Vishwajeet shukla (https://github.com/vishwajeet993511/gsoc2017tests)
- Xing Xiong (https://github.com/XingXiong/gsoc2017)
- Qingyue Xu (https://github.com/qingyuexu/Parser-for-Biodiversity-checklists)