This project is a fork of the larger Neotoma Meta-Review project, maintained as a stand-alone article relevance machine learning (ML) project.
NeotomaART aims to identify research articles that are relevant to the Neotoma Paleoecological Database (Neotoma), extract article metadata (title, journal, contributing authors), and pass that information to the relevant data stewards at Neotoma. This will allow Neotoma to solicit data submissions from a broader range of authors and, potentially, reduce spatial and disciplinary biases in its datasets.
Significant work on this project was performed as part of the University of British Columbia (UBC) Master of Data Science (MDS) program in partnership with the Neotoma Paleoecological Database.
There are 3 primary components to this project:
- Article Relevance Prediction - retrieve the latest published articles, predict which ones are relevant to Neotoma, and submit them for processing.
Information on each component is outlined below.
The goal of this component is to monitor and identify new articles that are relevant to Neotoma. This is done by using the public xDD API to regularly get recently published articles. Article metadata is queried from the CrossRef API to obtain data such as journal name, title, abstract and more. The article metadata is then used to predict whether the article is relevant to Neotoma or not.
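As a rough illustration of the metadata step, the sketch below builds a CrossRef works URL for a DOI and pulls the title, journal, abstract, and author names out of a response that has already been decoded from JSON. The function names `crossref_url` and `extract_metadata` are hypothetical and are not taken from this project's source code; the field names (`title`, `container-title`, `abstract`, `author`) follow the public CrossRef works response format.

```python
from urllib.parse import quote

# Public CrossRef REST endpoint for a single work, addressed by DOI.
CROSSREF_WORKS_URL = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """Return the CrossRef works URL for a DOI (hypothetical helper)."""
    return CROSSREF_WORKS_URL + quote(doi, safe="/")


def extract_metadata(response: dict) -> dict:
    """Flatten a decoded CrossRef works response into a simple record.

    CrossRef returns 'title' and 'container-title' as lists, so the
    first entry is used; missing fields fall back to empty values.
    """
    message = response.get("message", {})

    def first(key: str) -> str:
        values = message.get(key) or []
        return values[0] if values else ""

    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in message.get("author", [])
    ]
    return {
        "doi": message.get("DOI", ""),
        "title": first("title"),
        "journal": first("container-title"),
        "abstract": message.get("abstract", ""),
        "authors": authors,
    }
```

In practice the response would come from an HTTP GET on `crossref_url(doi)`; the network call is left out here so that the sketch stays self-contained.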
The model was trained on ~900 positive examples (a sample of articles currently contributing to Neotoma) and ~3500 negative examples (a sample of articles unrelated or closely related to Neotoma). A logistic regression model was chosen for its strong performance and interpretability.
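To show why logistic regression is easy to interpret, here is a minimal pure-Python sketch of how such a model turns per-feature weights into a relevance probability. Everything in it is an assumption made for illustration: the bag-of-words features, the weights, and the bias are invented and do not come from the trained model, whose actual features are not described here.

```python
import math
import re

# Hypothetical weights, invented for this example only: positive values
# push an article towards "relevant", negative values away from it.
EXAMPLE_WEIGHTS = {"pollen": 1.5, "holocene": 1.2, "genome": -1.0}
EXAMPLE_BIAS = -1.0


def tokenize(text: str) -> set:
    """Lower-case the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def relevance_probability(text: str, weights: dict, bias: float) -> float:
    """Score text with a logistic regression over binary token features."""
    score = bias + sum(weights.get(token, 0.0) for token in tokenize(text))
    # The sigmoid maps the linear score onto a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))


def is_relevant(text: str, threshold: float = 0.5) -> bool:
    """Apply a decision threshold (0.5 is an assumed default)."""
    return relevance_probability(text, EXAMPLE_WEIGHTS, EXAMPLE_BIAS) >= threshold
```

Because the score is a plain weighted sum, each weight can be read directly as the contribution of one feature to the prediction, which is the interpretability property referred to above.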
Articles predicted to be relevant will then be submitted to the Data Extraction Pipeline for processing.
To run the Docker image for the article relevance prediction pipeline, please refer to the instructions here.
The model can be retrained using reviewed article data. Please refer to the instructions here.
First, begin by installing the requirements.
For pip:
pip install -r requirements.txt
For conda:
conda env create -f environment.yml
Please refer to the project wiki for the development and analysis workflow details: article-relevance Wiki
Each component of this project has different data requirements, which are outlined below.
The article relevance prediction component requires a list of journals that are relevant to Neotoma. The dataset used to train and develop the model is available for download HERE. Download all files and extract the contents into article-relevance/data/article-relevance/raw/.
The prediction pipeline requires the trained model object. The model is available HERE. Download the model file and put the .joblib file in article-relevance/models/article-relevance/.
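After downloading, a quick check like the following can confirm that the files ended up where the pipeline expects them. This is only a sketch: the `check_layout` function is hypothetical and not part of the project, and it assumes the two locations named above, taken relative to the repository root.

```python
from pathlib import Path

# Locations from the data requirements, relative to the repository root.
RAW_DATA_DIR = Path("data/article-relevance/raw")
MODEL_DIR = Path("models/article-relevance")


def check_layout(repo_root) -> list:
    """Return a list of problems found; an empty list means all is well."""
    root = Path(repo_root)
    problems = []

    # The training data directory must exist and contain at least one entry.
    raw_dir = root / RAW_DATA_DIR
    if not raw_dir.is_dir() or not any(raw_dir.iterdir()):
        problems.append(f"no training data found in {raw_dir}")

    # The model directory must contain at least one .joblib file.
    model_dir = root / MODEL_DIR
    if not model_dir.is_dir() or not any(model_dir.glob("*.joblib")):
        problems.append(f"no .joblib model file found in {model_dir}")

    return problems
```

Calling `check_layout(".")` from the repository root returns an empty list when both the raw data and the model file are in place.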
The project has been developed and tested on the following systems:
- macOS Monterey 12.5.1
- Windows 11 Pro Version: 22H2
- Ubuntu 22.04.2 LTS
The pre-built Docker images were built using Docker Desktop version 4.20.0 but should work with any Docker Desktop version from 4.0 onward.
├── .github/ <- Directory for GitHub files
│ ├── workflows/ <- Directory for workflows
├── assets/ <- Directory for assets
├── data/ <- Directory for data
│ ├── article-relevance/ <- Directory for data related to article relevance prediction
│ │ ├── raw/ <- Raw unprocessed data
│ │ ├── processed/ <- Processed data
│ │ └── interim/ <- Temporary data location
├── results/ <- Directory for results
│ ├── article-relevance/ <- Directory for results related to article relevance prediction
│ ├── ner/ <- Directory for results related to named entity recognition
│ └── data-review-tool/ <- Directory for results related to data review tool
├── models/ <- Directory for models
│ ├── article-relevance/ <- Directory for article relevance prediction models
├── notebooks/ <- Directory for notebooks
├── src/ <- Directory for source code
│ ├── entity_extraction/ <- Directory for named entity recognition code
│ ├── article_relevance/ <- Directory for article relevance prediction code
│ └── data_review_tool/ <- Directory for data review tool code
├── reports/ <- Directory for reports
├── tests/ <- Directory for tests
├── Makefile <- Makefile with commands to perform analysis
└── README.md <- The top-level README for developers using this project.
This is an open project, and contributions are welcome from anyone. All contributors to this project are bound by a code of conduct. Please review and follow this code of conduct as part of your contribution.
The UBC MDS project team consists of:
- Ty Andrews
- Kelly Wu
- Shaun Hutchinson
- Jenit Jain
Sponsors from Neotoma supporting the project are:
Issues and bug reports are always welcome. Code clean-up and feature additions can be contributed through pull requests from project forks or project branches.
All products of the Neotoma Paleoecology Database are licensed under an MIT License unless otherwise noted.