The repository aims to create an overview and comparison of software used for systematically screening large amounts of textual data using machine learning.

FelixWdm/software-overview-machine-learning-for-screening-text

Overview of software that uses machine learning to screen large amounts of textual data

The repository aims to create an overview and comparison of software used for systematically screening large amounts of textual data using machine learning.

Overview

The table below provides a quick overview of the software. The following properties are evaluated:

  • Is there a website?
  • Is the software open-source (provide a 🔗 to the source code)?
  • Has the software been described in a peer-reviewed scientific article?
  • Is documentation or a manual available (provide a 🔗)?
  • Is the full version of the software free of charge?

| Software | Website | Open-Source | Published | Documentation | Free |
|---|---|---|---|---|---|
| Abstrackr | 🔗 | ❌ | DOI | ❌ | ✅ |
| ASReview | 🔗 | ✅🔗 | DOI | ✅🔗 | ✅ |
| Colandr | 🔗 | ❌ | DOI | ✅🔗 | ✅ |
| DistillerSR | 🔗 | ❌ | DOI | ✅🔗 | ❌ |
| EPPI-Reviewer | 🔗 | ❌ | ❌ | ✅🔗 | ❌ |
| FASTREAD | ❌ | ✅🔗 | DOI | ✅🔗 | ✅ |
| Rayyan | 🔗 | ❌ | DOI | ✅🔗 | ❌ |
| RobotAnalyst | 🔗 | ❌ | DOI | ❌ | ❔¹ |
| SWIFT-Active Screener | 🔗 | ❌ | DOI | ✅🔗 | ❌ |

✅ Yes/Implemented; ❌ No/Not implemented; ❔ Unknown (requires an issue).

1 See issue Rensvandeschoot#29

Installation

The table below provides an overview of the installation options.

  • Can the software be installed locally so that data and labeling decisions are only stored on the user's device (yes/no)?
  • Is the software installable on a server (yes/no)?
  • Is the software available as an online service (software as a service, SaaS; yes/no; provide a link to the registration page)?

| Software | Local | Server | Online Service |
|---|---|---|---|
| Abstrackr | ❌ | ❌ | ✅🔗 |
| ASReview | ✅ | ✅ | ❌ |
| Colandr | ❌ | ❌ | ✅🔗 |
| DistillerSR | | | |
| EPPI-Reviewer | ❌ | ❌ | ✅🔗 |
| FASTREAD | ✅ | ✅ | ❌ |
| Rayyan | ❌ | ❌ | ✅🔗 |
| RobotAnalyst | ❌ | ❌ | ✅🔗¹ |
| SWIFT-Active Screener | ❌ | ❌ | ❌🔗 |

✅ Yes; ❌ No; ❔ Unknown (requires an issue).

1 To use RobotAnalyst, you need to request an account via email.

Data

The table below provides an overview of input/output data.

  • Which data formats can be imported?
  • Can partly labeled data be imported (yes/no; if yes, as S(ingle) or M(ultiple) files)?
  • Which data formats can be exported?
  • Does the export file contain the labeling decisions?
  • Can the export file be re-imported into the same software, retaining the labeling decisions (Re-Import-1: yes/no)?
  • Can the export file be re-imported into reference manager software, retaining the labeling decisions (Re-Import-2: yes/no)?

| Software | Input data format | Partly labeled | Output data format | Labeling decisions | Re-Import-1 | Re-Import-2 |
|---|---|---|---|---|---|---|
| Abstrackr | RIS, TAB, TXT¹ | ❌ | CSV, XML, RIS | ✅ | ❌ | ✅ |
| ASReview | RIS, TSV, CSV, XLSX, TAB, +² | ✅(S)+² | RIS, TSV, CSV, XLSX, TAB | ✅ | ✅ | ✅ |
| Colandr | RIS, BIB, TXT | ✅(M) | CSV | ✅ | ❌ | ❌ |
| EPPI-Reviewer | RIS, TXT, +³ | ✅(M) | RIS, XLSX | ❔⁴ | ❔⁴ | ❔⁴ |
| FASTREAD | CSV | ✅(S) | CSV | ✅ | ✅ | ❌ |
| Rayyan | RIS, ENW, BIB, CSV, XML, CIW, NBIB | ✅(M) | RIS, BIB, ENW, CSV | ✅ | ❌ | ✅ |
| RobotAnalyst | RIS, NBIB | ✅❔⁵ | ❔⁵ | ✅ | ❔⁵ | ❌ |
| SWIFT-Active Screener | TXT, RIS, XML, BibTex | ✅(M) | CSV, RIS | ✅ | ❔⁶ | ✅ |

✅ Yes/Implemented; ❌ No/Not implemented; ⚑ Only for some extensions (add a footnote for more explanation); ❔ Unknown (requires an issue).

1 List of PubMed IDs

2 ASReview provides several open-source tools to convert file formats (e.g., CSV->RIS or RIS->XLSX), combine datasets (labeled, partly labeled, or unlabeled), and deduplicate records based on title/abstract/DOI.

3 EPPI-Reviewer provides a closed-source online file converter to convert several file formats to RIS.

4 See issue Rensvandeschoot#21

5 See issue Rensvandeschoot#29

6 See issue Rensvandeschoot#40
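
To make the kind of format conversion and deduplication mentioned in footnote 2 concrete, here is a minimal sketch in plain Python. It is not ASReview's tooling; the function names (`parse_ris`, `dedup_key`, `deduplicate`, `to_csv`) are hypothetical, and it assumes the common RIS layout of a two-character tag, two spaces, a hyphen, and a value, with `ER` closing each record. Continuation lines and most tags are ignored.

```python
import csv
import io
import re


def parse_ris(text):
    """Parse RIS text into a list of {tag: [values]} dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        match = re.match(r"^([A-Z][A-Z0-9])  - ?(.*)$", line)
        if not match:
            continue  # blank lines and continuation text are skipped here
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":  # end-of-record marker
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    return records


def dedup_key(record):
    """Prefer the DOI; fall back to a normalized title; None if neither exists."""
    doi = record.get("DO", [""])[0].strip().lower()
    if doi:
        return "doi:" + doi
    title = (record.get("TI") or record.get("T1") or [""])[0]
    normalized = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    return "title:" + normalized if normalized else None


def deduplicate(records):
    """Keep the first record for each key; records without a key are all kept."""
    seen, unique = set(), []
    for record in records:
        key = dedup_key(record)
        if key is None or key not in seen:
            unique.append(record)
        if key is not None:
            seen.add(key)
    return unique


def to_csv(records):
    """Write title, abstract, and DOI columns to a CSV string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "abstract", "doi"])
    for r in records:
        title = (r.get("TI") or r.get("T1") or [""])[0]
        writer.writerow([title, " ".join(r.get("AB", [])), r.get("DO", [""])[0]])
    return out.getvalue()
```

A real converter would also need to carry the labeling decisions through to the export, which is what the Re-Import columns above are about; this sketch leaves that out.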

Machine Learning Properties

The tables below provide an overview of the machine learning properties.

Active Learning

Training Data

  • Can training data (prior knowledge) be selected by the user to train the first iteration of the model (yes/no)?
  • What is the minimum training data size (provide a number for Relevant and Irrelevant records)?

| Software | Training data by user | Minimum training data |
|---|---|---|
| Abstrackr | ❌ | ❔¹ |
| ASReview | ✅ | ≥1R+≥1I |
| Colandr | ✅ | 10 |
| EPPI-Reviewer | ✅ | ≥5R |
| FASTREAD | ✅ | ≥1R |
| Rayyan | ✅ | ≥50 with ≥5R |
| RobotAnalyst | ✅ | ≥1R |
| SWIFT-Active Screener | ✅² | ≥1R³ |

✅ Yes/Implemented; ❌ No/Not implemented; ⚑ With some effort (add a footnote for more explanation); ❔ Unknown (requires an issue).

1 See issue Rensvandeschoot#34

2 Only relevant records can be provided as training data prior to screening.

3 If no relevant records are uploaded prior to screening, training is initiated after ≥30 records have been screened, with at least 1R and 1I among them.
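
The minimum training data requirements in the table above can be read as simple count thresholds. The sketch below encodes only the rows whose meaning is unambiguous (the Colandr value and the SWIFT-Active Screener fallback are left out). The dictionary `MINIMUM_PRIOR` and the function `meets_minimum` are hypothetical names for illustration, not part of any of the tools.

```python
from collections import Counter

# Thresholds transcribed from the table above (R = relevant, I = irrelevant).
# "total" is the minimum number of labeled records overall.
MINIMUM_PRIOR = {
    "ASReview": {"R": 1, "I": 1, "total": 2},
    "EPPI-Reviewer": {"R": 5, "I": 0, "total": 5},
    "FASTREAD": {"R": 1, "I": 0, "total": 1},
    "Rayyan": {"R": 5, "I": 0, "total": 50},
    "RobotAnalyst": {"R": 1, "I": 0, "total": 1},
}


def meets_minimum(tool, labels):
    """Check whether a list of 'R'/'I' labels satisfies a tool's minimum."""
    need = MINIMUM_PRIOR[tool]
    counts = Counter(labels)
    total = counts["R"] + counts["I"]
    return (
        counts["R"] >= need["R"]
        and counts["I"] >= need["I"]
        and total >= need["total"]
    )
```

For example, `meets_minimum("Rayyan", ["R"] * 5 + ["I"] * 45)` is true, while five relevant and ten irrelevant records fall short of the 50-record minimum.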

Model Selection

  • Can the user select the active learning model (yes/no)?
  • Can a user upload their own model (yes/no)?
  • Can the feature extraction results be stored (yes/no)?
  • Does (re-)training proceed Automatically or is it triggered Manually?
  • Can the user continue labeling during training (yes/no)?
  • Can the user select batch size (yes/no; provide the default)?
  • Is it possible to switch to a different model during screening (yes/no)?
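
| Software | Select model | User model | Store feature matrix | Training | Continue | Batch size | Switch |
|---|---|---|---|---|---|---|---|
| Abstrackr | ❌ | ❌ | ❌ | A | ✅ | ❌ | ❌ |
| ASReview | ✅ | ✅ | ✅ | A | ✅ | ❌ (1) | ⚑¹ |
| Colandr | ❌ | ❌ | ❌ | A | ✅ | ❌ (10) | ❌ |
| EPPI-Reviewer | ❌ | ❌ | ❌ | M | ✅ | ❌ | ❌ |
| FASTREAD | ❌ | ❌ | ❌ | M | ❌ | ❌ | ❌ |
| Rayyan | ❌ | ❌ | ❌ | M | ✅ | ❌ | ❌ |
| RobotAnalyst | ❌ | ❌ | ❌ | M | ❔² | ❌ | ❌ |
| SWIFT-Active Screener | ❌ | ❌ | ❌ | A | ❔³ | ❌ (30) | ❌ |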

✅ Yes/Implemented; ❌ No/Not implemented; ⚑ With some effort (add a footnote with more explanation); ❔ Unknown (requires an issue).

1 Switching to a different model in ASReview is available by exporting the data of the first model and importing the data back into ASReview. The software will recognize all previous labeling decisions, and a new model can be trained.

2 See issue Rensvandeschoot#29

3 See issue Rensvandeschoot#40
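
To illustrate what automatic (re-)training and batch size mean in the table above, here is a toy active learning loop in plain Python. It is a sketch under strong assumptions, not the implementation of any listed tool: the "model" is a hypothetical token-count scorer, `oracle` stands in for the human reviewer, and all function names are invented for this example.

```python
from collections import Counter


def train(labeled):
    """Toy model: token weight = times seen in relevant minus irrelevant records."""
    weights = Counter()
    for text, label in labeled:
        for token in set(text.lower().split()):
            weights[token] += 1 if label == 1 else -1
    return weights


def score(model, text):
    """Higher score = more likely relevant under the toy model."""
    return sum(model[token] for token in set(text.lower().split()))


def screen(records, prior, oracle, batch_size=1):
    """Certainty-based screening with automatic retraining after each batch.

    records: unlabeled texts
    prior: list of (text, label) pairs chosen by the user as training data
    oracle: callable standing in for the reviewer, returns 1 or 0
    Returns the order in which records were presented.
    """
    labeled, pool, order = list(prior), list(records), []
    while pool:
        model = train(labeled)  # retrain on everything labeled so far
        pool.sort(key=lambda t: score(model, t), reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        for text in batch:  # the reviewer labels the whole batch
            labeled.append((text, oracle(text)))
            order.append(text)
    return order
```

With `batch_size=1` the model is updated after every labeling decision; a larger batch size means the reviewer sees several records ranked by the same model before it is retrained.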

Overview of Available Models

  • Which feature extraction methods are available? BOW = bag of words; Doc2Vec = document to vector; sBERT = sentence bidirectional encoder representations from transformers; TF-IDF = term frequency-inverse document frequency; Word2Vec = word to vector; ML = multi-language.

  • Which classifiers are available? CNN = convolutional neural network; DNN = dense neural network; LDA = latent Dirichlet allocation; LL = log-linear; LR = logistic regression; LSTM = long short-term memory; NB = naive Bayes; RF = random forests; SGD = stochastic gradient descent; SVM = support vector machine.

  • Which balancing strategies are available? S / Simple = no balancing; D / Double = double balance strategy; T / Triple = triple balance strategy; U / Under = undersampling; A / Aggressive = aggressive undersampling (after the classifier is stable); W / Weighting = weighting for data balancing (before and after the classifier is stable); M / Mixing = weighting before the classifier is stable and aggressive undersampling after it is stable.

  • Which query strategies are available? R / Random = records are selected randomly; C / Certain = certainty-based; U / Uncertain = uncertainty-based; M / Mixed = a combination of query strategies, for example 90% certainty-based and 10% random; Cl / Clustering = clustering query strategy.
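
The query strategies listed above differ only in how they pick the next records from the model's predicted relevance scores. The following sketch shows one plausible reading of the random, certainty, uncertainty, and mixed strategies; the function `query` and its parameters are hypothetical, clustering is omitted, and individual tools may define these strategies differently.

```python
import random


def query(scores, strategy, n=1, rng=None, certain_share=0.9):
    """Return the indices of the n records to present next.

    scores: predicted probability of relevance for each unlabeled record.
    certain_share: for "mixed", the chance that a pick is certainty-based.
    """
    rng = rng or random.Random()
    indices = list(range(len(scores)))
    if strategy == "random":
        return rng.sample(indices, min(n, len(indices)))
    if strategy == "certain":  # highest predicted relevance first
        return sorted(indices, key=lambda i: scores[i], reverse=True)[:n]
    if strategy == "uncertain":  # closest to the 0.5 decision boundary first
        return sorted(indices, key=lambda i: abs(scores[i] - 0.5))[:n]
    if strategy == "mixed":  # each pick is certainty-based or random
        picked, remaining = [], indices[:]
        while remaining and len(picked) < n:
            if rng.random() < certain_share:
                choice = max(remaining, key=lambda i: scores[i])
            else:
                choice = rng.choice(remaining)
            picked.append(choice)
            remaining.remove(choice)
        return picked
    raise ValueError(f"unknown strategy: {strategy}")
```

Certainty-based querying surfaces likely relevant records early, which suits screening, while uncertainty-based querying targets the records the model is least sure about.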

| Software | Feature extraction | Classifiers | Balancing | Query strategies |
|---|---|---|---|---|
| Abstrackr | TF-IDF ❔¹ | SVM | ❔¹ | R, C, U |
| ASReview | TF-IDF, Doc2Vec, sBERT, ML | CNN, DNN, LR, LSTM, NB, RF, SVM | S, D, U, T | R, C, U, M, Cl |
| Colandr | Word2Vec ❔² | SGD ❔² | ❔² | C |
| EPPI-Reviewer | TF-IDF | SVM | ❔³ | R, C, Cl |
| FASTREAD | TF-IDF | SVM | S, A, W, M | C, U |
| Rayyan | ❔⁴ | SVM | ❔⁴ | C, U |
| RobotAnalyst | TF-IDF + BOW + LDA2vec | SVM | ❔⁵ | R, C, U, Cl |
| SWIFT-Active Screener | TF-IDF | LL | S ❔⁶ | C |

✅ Yes/Implemented; ❌ No/Not implemented; ❔ Unknown (requires an issue).

1 See issue Rensvandeschoot#34

2 See issue Rensvandeschoot#16

3 See issue Rensvandeschoot#21

4 See issue Rensvandeschoot#19

5 See issue Rensvandeschoot#29

6 See issue Rensvandeschoot#40
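
TF-IDF is the feature extraction method that appears most often in the table above. As a rough illustration of what it computes, the sketch below uses the textbook form (term frequency times the log of inverse document frequency) with whitespace tokenization. The function name `tf_idf` is hypothetical, and the tools listed here may use smoothed or normalized variants, so their values will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

Terms that occur in every document get a weight of zero, while terms concentrated in a few documents get higher weights, which is what makes the representation useful as classifier input.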

Supervised Learning

| Software | Feature extraction | Classifiers | Balancing | Query strategies |
|---|---|---|---|---|
| EPPI-Reviewer¹ | TF-IDF | SVM ❔² | ❔² | R, C, Cl |

1 EPPI-Reviewer offers the option to choose from, or use custom, pre-trained models to find a specific type of literature, e.g., for RCTs.

2 See issue Rensvandeschoot#21

Unsupervised Learning

Software Q1

Software

This section briefly describes the software in alphabetical order.

Abstrackr is a collaborative (i.e., multiple reviewers can simultaneously screen citations for a review), web-based annotation tool for the citation screening task.

ASReview, developed at Utrecht University, helps scholars and practitioners to get an overview of the most relevant records for their work as efficiently as possible while being transparent in the process. It allows multiple machine learning models, and ships with exploration and simulation modes, which are especially useful for comparing and designing algorithms. Furthermore, it is intended to be easily extensible, allowing third parties to add modules that enhance the pipeline with new models, data, and other extensions.

Colandr is a free, web-based, open-access tool for conducting evidence synthesis projects.

DistillerSR automates the management of literature collection, screening, and assessment using AI and intelligent workflows. From a systematic literature review to a rapid review to a living review, DistillerSR makes any project simpler to manage and configure to produce transparent, audit-ready, and compliant results.

EPPI-Reviewer is a web-based software program for managing and analysing data in literature reviews. It has been developed for all types of systematic review (meta-analysis, framework synthesis, thematic synthesis, etc.) but also has features that would be useful in any literature review. It manages references, stores PDF files, and facilitates qualitative and quantitative analyses such as meta-analysis and thematic synthesis. It also contains some new ‘text mining’ technology that promises to make systematic reviewing more efficient.

FASTREAD (FAST2) is a tool to support primary study selection in systematic literature review.

Rayyan is a free web and mobile app that helps expedite the initial screening of abstracts and titles using a process of semi-automation while incorporating a high level of usability.

RobotAnalyst was developed as part of the Supporting Evidence-based Public Health Interventions using Text Mining project to support the literature screening phase of systematic reviews.

SWIFT-Active Screener (SWIFT is an acronym for “Sciome Workbench for Interactive computer-Facilitated Text-mining”) is a freely available interactive workbench that provides numerous tools to assist with problem formulation and literature prioritization.

Contributing

Do you know of other software that meets the inclusion criteria? Please make a Pull Request and add it to the overview. If information is missing, wrong, or incomplete, please open an issue.

Licence

This project is CC-BY 4.0 licensed.

Contact

For any suggestions, questions, or remarks, please file an issue in the issue tracker.

This comparison is maintained by Rens van de Schoot. I aim to make a fair comparison and not to be prejudiced. If there is any concern about the comparison, please file an issue in the issue tracker so that it can be openly discussed.
