Code for collecting and preparing the BioMAISx corpus.

uchicago-dsi/BioMAISx

BioMAISx

This repository releases the BioMAISx (Biotechnology: Media, Agriculture, Investment, (and) Sentiment Excerpts) dataset annotated for Aspect-Based Sentiment Analysis (ABSA). It includes all code required for collecting and processing the raw data used for annotation, details on how the data was annotated, and code for post-processing the annotated data.

The dataset is available as a CSV here. The polarity distribution per aspect category is shown here.

A Zenodo link will be made available later.

Examples of preparing and using this data to train ABSA models are located in tutorials.
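As a rough illustration of what a per-aspect polarity distribution looks like, the sketch below counts polarity labels for each aspect category in a CSV. The column names aspect_category and polarity, and the toy rows, are assumptions made up for this example; they are not taken from the released file, so adjust them to the actual header.

```python
import csv
import io
from collections import Counter, defaultdict


def polarity_distribution(rows, aspect_col="aspect_category", polarity_col="polarity"):
    """Count polarity labels per aspect category.

    `rows` is any iterable of dict-like rows (e.g. a csv.DictReader).
    The default column names are hypothetical.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[aspect_col]][row[polarity_col]] += 1
    return {aspect: dict(counter) for aspect, counter in counts.items()}


# Invented toy rows, used only so the sketch runs without the real dataset.
sample = io.StringIO(
    "aspect_category,polarity\n"
    "safety,positive\n"
    "safety,negative\n"
    "safety,positive\n"
    "economics,neutral\n"
)
distribution = polarity_distribution(csv.DictReader(sample))
print(distribution)
```

To run this against the released file, replace the in-memory sample with an opened file handle, for example open(path, newline="", encoding="utf-8"), where path points at your local copy of the CSV.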

Collecting Data

The raw articles from which the quotes in this corpus were sourced came from Factiva. You need to obtain access to Factiva articles (for a fee), along with a user key and CID. To download the articles, set the environment variables FACTIVA_USER_KEY and FACTIVA_CID to your key and CID, respectively, and then run: python scripts/download-source.py
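Before starting a download, it can help to confirm that both credentials are actually visible to Python. The following is a minimal sketch of such a check; the helper name missing_factiva_vars is hypothetical and is not part of this repository, and only the two variable names come from the instructions above.

```python
import os

# Variable names as given in the instructions above.
REQUIRED_VARS = ("FACTIVA_USER_KEY", "FACTIVA_CID")


def missing_factiva_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_factiva_vars()
    if missing:
        print("Set these before downloading: " + ", ".join(missing))
    else:
        print("Credentials found; run: python scripts/download-source.py")
```

This only checks that the variables are set; it does not verify that the key and CID are valid for Factiva.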

Preprocessing Data

From the raw text data, we kept only articles containing specific key terms, extracted quotations from those articles, and then kept only the quotations containing terms from the desired lexicon.
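The three filtering steps can be sketched roughly as follows. This is an illustration only, not the code in scripts/preprocess-source.py: the term lists are invented placeholders, and the regular-expression quote extraction is a simple stand-in for whatever method the project actually uses.

```python
import re

# Hypothetical term lists for illustration; not the project's key terms or lexicon.
ARTICLE_KEYTERMS = {"gmo", "biotechnology"}
QUOTE_LEXICON = {"crop", "crops", "seed", "seeds"}

# Matches text inside straight or curly double quotes (a simplification).
QUOTE_PATTERN = re.compile(r'"([^"]+)"|“([^”]+)”')
WORD_PATTERN = re.compile(r"[a-z]+")


def tokens(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(WORD_PATTERN.findall(text.lower()))


def extract_quotes(article):
    """Return the quoted passages found in an article."""
    return [straight or curly for straight, curly in QUOTE_PATTERN.findall(article)]


def filter_quotes(articles):
    """Keep quotes with a lexicon term, taken from articles with a key term."""
    kept = []
    for article in articles:
        # Step 1: skip articles that contain none of the key terms.
        if not tokens(article) & ARTICLE_KEYTERMS:
            continue
        # Steps 2 and 3: extract quotes, then keep those containing a lexicon term.
        for quote in extract_quotes(article):
            if tokens(quote) & QUOTE_LEXICON:
                kept.append(quote)
    return kept
```

Matching on whole tokens avoids false hits inside longer words, at the cost of missing multi-word terms; a real lexicon with phrases would need phrase matching instead.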

The quotes were then reformatted for annotation with Label Studio, and proposed entities (noun chunks) were extracted using spaCy. The code for this transformation is in scripts/preprocess-source.py.
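A minimal sketch of this transformation is shown below, assuming a Label Studio labeling configuration with a text object named "text" and a labels control named "label". Those names, the "ENTITY" label, and both helper functions are assumptions for illustration; the actual configuration lives with the annotation materials and may differ.

```python
import json


def noun_chunk_spans(text, nlp):
    """Return (start, end, text) character spans for each noun chunk.

    `nlp` is a loaded spaCy pipeline with a parser, passed in by the caller.
    """
    doc = nlp(text)
    return [(chunk.start_char, chunk.end_char, chunk.text) for chunk in doc.noun_chunks]


def to_label_studio_task(text, spans, from_name="label", to_name="text", label="ENTITY"):
    """Build one Label Studio task with the spans attached as pre-annotations.

    from_name, to_name, and label must match the labeling configuration;
    the defaults here are hypothetical.
    """
    results = [
        {
            "from_name": from_name,
            "to_name": to_name,
            "type": "labels",
            "value": {"start": start, "end": end, "text": chunk, "labels": [label]},
        }
        for start, end, chunk in spans
    ]
    return {"data": {"text": text}, "predictions": [{"result": results}]}


# Example with a hand-written span, so the sketch runs without spaCy installed.
quote = "GMO crops are safe."
task = to_label_studio_task(quote, [(0, 9, "GMO crops")])
print(json.dumps(task, indent=2))
```

With spaCy and an English model installed, the spans could instead come from noun_chunk_spans(quote, spacy.load("en_core_web_sm")); the model name is an assumption, since the one used by the project is not stated here.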

Annotating

Relevant information and code for annotation are included in annotation/README.md.
