
Repository based on the UBC MDS Meta Extractor Project that just uses the article relevance component.


NeotomaDB/article-relevance

 
 


Neotoma Article Relevance Tool (NeotomaART): Finding Fossils in the Literature

This project is forked from the larger Neotoma Meta-Review project, as a stand-alone relevance ML project.

NeotomaART aims to identify research articles that are relevant to the Neotoma Paleoecological Database (Neotoma) and to extract article metadata (title, journal, contributing authors) so that the information can be passed to the relevant data stewards at Neotoma. This will allow Neotoma to solicit data submissions from a broader range of authors and, potentially, reduce spatial and disciplinary biases in datasets.

Significant work on this project was performed as part of the University of British Columbia (UBC) Master of Data Science (MDS) program in partnership with the Neotoma Paleoecological Database.

Table of Contents

The original project had 3 primary components. This stand-alone repository keeps one of them:

  1. Article Relevance Prediction - get the latest articles published, predict which ones are relevant to Neotoma and submit for processing.

About

Information on this component is outlined below.

Article Relevance Prediction

The goal of this component is to monitor and identify new articles that are relevant to Neotoma. This is done by using the public xDD API to regularly get recently published articles. Article metadata is queried from the CrossRef API to obtain data such as journal name, title, abstract and more. The article metadata is then used to predict whether the article is relevant to Neotoma or not.
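The metadata lookup described above can be sketched as follows. This is a minimal illustration, not the project's own code: it assumes the public CrossRef REST endpoint `https://api.crossref.org/works/{doi}`, and the function names `crossref_url`, `extract_fields` and `fetch_metadata` are hypothetical. The xDD query that supplies the DOIs is not shown.

```python
import json
import urllib.parse
import urllib.request

# Public CrossRef REST API route for a single work (assumed here; the
# project's own client code may differ).
CROSSREF_WORKS_URL = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """Build the CrossRef URL for one DOI, escaping unsafe characters."""
    return CROSSREF_WORKS_URL + urllib.parse.quote(doi, safe="/")


def extract_fields(message: dict) -> dict:
    """Pull title, journal, abstract and authors out of a CrossRef 'message'.

    CrossRef returns 'title' and 'container-title' as lists and omits
    'abstract' when none was deposited, so every field gets a default.
    """
    titles = message.get("title") or [""]
    journals = message.get("container-title") or [""]
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in message.get("author", [])
    ]
    return {
        "doi": message.get("DOI", ""),
        "title": titles[0],
        "journal": journals[0],
        "abstract": message.get("abstract", ""),
        "authors": authors,
    }


def fetch_metadata(doi: str, timeout: float = 10.0) -> dict:
    """Query CrossRef for one DOI and return the simplified record."""
    with urllib.request.urlopen(crossref_url(doi), timeout=timeout) as resp:
        payload = json.load(resp)
    return extract_fields(payload["message"])
```

Keeping the parsing step (`extract_fields`) separate from the network call makes it easy to test against saved responses and to handle articles that lack an abstract.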

The model was trained on ~900 positive examples (a sample of articles currently contributing to Neotoma) and ~3500 negative examples (a sample of articles unrelated or closely related to Neotoma). A logistic regression model was chosen for its performance and interpretability.
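To illustrate why logistic regression is easy to interpret, here is a from-scratch sketch on word counts. It is not the project's training code: the features, the toy texts, the hyperparameters and the function names (`tokenize`, `train`, `predict_proba`) are all assumptions made for the example, and a real pipeline would more likely use an established library.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lower-case the text and keep purely alphabetic tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(texts, labels, epochs=200, lr=0.5):
    """Fit one weight per word plus a bias by batch gradient descent on log-loss."""
    docs = [Counter(tokenize(t)) for t in texts]
    weights = {word: 0.0 for doc in docs for word in doc}
    bias = 0.0
    n = len(docs)
    for _ in range(epochs):
        grad_w = dict.fromkeys(weights, 0.0)
        grad_b = 0.0
        for doc, y in zip(docs, labels):
            z = bias + sum(weights[w] * c for w, c in doc.items())
            error = sigmoid(z) - y  # derivative of log-loss with respect to z
            grad_b += error
            for w, c in doc.items():
                grad_w[w] += error * c
        bias -= lr * grad_b / n
        for w in weights:
            weights[w] -= lr * grad_w[w] / n
    return weights, bias


def predict_proba(text, weights, bias):
    """Probability that a text is relevant; unseen words contribute nothing."""
    counts = Counter(tokenize(text))
    z = bias + sum(weights.get(w, 0.0) * c for w, c in counts.items())
    return sigmoid(z)


# Toy data invented for this example (1 = relevant, 0 = not relevant).
TOY_TEXTS = [
    "fossil pollen record from lake sediment",
    "holocene pollen and charcoal lake core",
    "deep learning for image classification",
    "protein folding with neural networks",
]
TOY_LABELS = [1, 1, 0, 0]

weights, bias = train(TOY_TEXTS, TOY_LABELS)
```

Each learned weight can be read directly: words seen only in relevant examples end up with positive weights, and words seen only in irrelevant examples end up with negative ones.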

Articles predicted to be relevant will then be submitted to the Data Extraction Pipeline for processing.

To run the Docker image for the article relevance prediction pipeline, please refer to the instructions here.

The model can be retrained using reviewed article data. Please refer to the instructions here.

How to use this repository

First, begin by installing the requirements.

For pip:

pip install -r requirements.txt

For conda:

conda env create -f environment.yml

Article Relevance

Please refer to the project wiki for the development and analysis workflow details: article-relevance Wiki

Data Requirements

Each component of this project has its own data requirements, which are outlined below.

Article Relevance Prediction

The article relevance prediction component requires a list of journals that are relevant to Neotoma. The dataset used to train and develop the model is available for download HERE. Download all files and extract the contents into article-relevance/data/article-relevance/raw/.

The prediction pipeline requires the trained model object. The model is available HERE. Download the model file and put the .joblib file in article-relevance/models/article-relevance/.
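A quick way to confirm that the downloads landed in the right place is a small check like the one below. It is a sketch only: `check_setup` is a hypothetical helper, not part of the repository, and it assumes the two directories named above are relative to the repository root.

```python
from pathlib import Path

# Locations taken from the instructions above, relative to the repository root.
RAW_DATA_DIR = Path("data/article-relevance/raw")
MODEL_DIR = Path("models/article-relevance")


def check_setup(repo_root: Path) -> list[str]:
    """Return a list of problems; an empty list means the layout looks right."""
    problems = []
    raw_dir = repo_root / RAW_DATA_DIR
    if not raw_dir.is_dir() or not any(raw_dir.iterdir()):
        problems.append(f"no training data found in {raw_dir}")
    model_dir = repo_root / MODEL_DIR
    if not model_dir.is_dir() or not list(model_dir.glob("*.joblib")):
        problems.append(f"no .joblib model file found in {model_dir}")
    return problems


if __name__ == "__main__":
    # Run from the repository root; prints one line per missing item.
    for problem in check_setup(Path.cwd()):
        print(problem)
```

The check only looks at the file layout. It does not verify that the model file itself loads.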

System Requirements

The project has been developed and tested on the following systems:

  • macOS Monterey 12.5.1
  • Windows 11 Pro Version: 22H2
  • Ubuntu 22.04.2 LTS

The pre-built Docker images were built using Docker version 4.20.0 but should work with any Docker version from 4 onward.

Directory Structure and Description

├── .github/                            <- Directory for GitHub files
│   ├── workflows/                      <- Directory for workflows
├── assets/                             <- Directory for assets
├── data/                               <- Directory for data
│   ├── article-relevance/              <- Directory for data related to article relevance prediction
│   │   ├── raw/                        <- Raw unprocessed data
│   │   ├── processed/                  <- Processed data
│   │   └── interim/                    <- Temporary data location
├── results/                            <- Directory for results
│   ├── article-relevance/              <- Directory for results related to article relevance prediction
│   ├── ner/                            <- Directory for results related to named entity recognition
│   └── data-review-tool/               <- Directory for results related to data review tool
├── models/                             <- Directory for models
│   ├── article-relevance/              <- Directory for article relevance prediction models
├── notebooks/                          <- Directory for notebooks
├── src/                                <- Directory for source code
│   ├── entity_extraction/              <- Directory for named entity recognition code
│   ├── article_relevance/              <- Directory for article relevance prediction code
│   └── data_review_tool/               <- Directory for data review tool code
├── reports/                            <- Directory for reports
├── tests/                              <- Directory for tests
├── Makefile                            <- Makefile with commands to perform analysis
└── README.md                           <- The top-level README for developers using this project.

Contributors

This is an open project, and contributions are welcome from anyone. All contributors to this project are bound by a code of conduct. Please review and follow this code of conduct as part of your contribution.

The UBC MDS project team consists of:

Sponsors from Neotoma supporting the project are:

Tips for Contributing

Issues and bug reports are always welcome. Code clean-up and feature additions can be done through pull requests, either from project forks or from project branches.

All products of the Neotoma Paleoecology Database are licensed under an MIT License unless otherwise noted.
