Walk through to convert Kaggle's COVID-19 Open Research Dataset Challenge into a text corpus

TextCorpusLabs/covid19

COVID-19 To Text Corpus

Kaggle has provided an excellent data source for COVID-19, courtesy of AI2. The purpose of this repo is to convert it from the given format into the normal text corpus format, i.e. one document per file, one sentence per line, with a blank line between paragraphs.
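To make the target layout concrete, here is a minimal sketch of how one document could be rendered in that format. The function name `format_document` and the placeholder sentences are made up for illustration and are not part of this repo's scripts.

```python
from typing import List


def format_document(paragraphs: List[List[str]]) -> str:
    """Render one document: one sentence per line, blank line between paragraphs."""
    # Each paragraph is a list of sentences; empty paragraphs are skipped.
    blocks = [
        "\n".join(sentence.strip() for sentence in sentences)
        for sentences in paragraphs
        if sentences
    ]
    return "\n\n".join(blocks) + "\n"


# Placeholder content, only to show the layout.
example = [
    ["First sentence of paragraph one.", "Second sentence of paragraph one."],
    ["Only sentence of paragraph two."],
]
print(format_document(example), end="")
```

Each document would then be written to its own file, so a corpus is just a folder of such files.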

Prerequisites

The following packages need to be installed. I recommend using Chocolatey.

if('Unrestricted' -ne (Get-ExecutionPolicy)) { Set-ExecutionPolicy Bypass -Scope Process -Force }
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
refreshenv

choco install 7zip.install -y
choco install python3 -y

Modules

All scripts have been tested on Python 3.8.2. The modules below are needed to run the scripts. The scripts were tested on the noted versions, so YMMV. Note: not all modules are required for all scripts. If this is the first time running the scripts, the modules will need to be installed. They can be installed by navigating to the ~/code folder, then using the code below.

  • nltk 3.4.5
  • progressbar2 3.47.0
pip install -r requirments.txt
python -c "import nltk;nltk.download('punkt')"
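If results differ from what is described here, it may help to confirm which versions are actually installed. The following is a small sketch that compares the installed versions against the tested ones using the standard library's `importlib.metadata` (available from Python 3.8). The helper name `installed_versions` is hypothetical and not part of this repo.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Versions the scripts were tested on, as listed above.
TESTED = {"nltk": "3.4.5", "progressbar2": "3.47.0"}


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, version in installed_versions(TESTED).items():
    if version is None:
        status = "not installed"
    elif version == TESTED[name]:
        status = "matches the tested version"
    else:
        status = f"installed {version}, tested on {TESTED[name]}"
    print(f"{name}: {status}")
```

A mismatch is not necessarily a problem; it only tells you where to look first.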

Steps

The below document describes how to recreate the text corpus. It assumes that a particular path structure will be used, but the commands can be modified to target a different directory structure without changing the code. I am choosing the d:/covid19 directory because my d drive is big enough to hold everything.

  1. Clone this repo then open a shell to the ~/code directory.
  2. Retrieve the dataset by hand. Click on the download link, saving the file to a known location.
  3. Extract the data in-place with no folder structure.
    • The e switch flattens the extract so the custom code does not need to recursively search the folder structure.
"C:/Program Files/7-Zip/7z.exe" e -od:/covid19/raw "d:/covid19/*.zip"
  4. Extract the metadata. This will create a single metadata.csv containing some useful information. In general, this would be used as part of segmentation or as part of a MANOVA.
python extract_metadata.py -in d:/covid19/raw -out d:/covid19/metadata.csv
  5. Convert the raw JSON files into the normal folder corpus format. This will create a text corpus folder at the given location, e.g. ./corpus, containing two subfolders, one for the abstracts and one for the bodies. Some of the files provided by Kaggle are not full-text articles, i.e. they have an empty abstract or body. These incomplete files are filtered out of the final folders and noted in error.csv.
python convert_to_corpus.py -in d:/covid19/raw -out d:/covid19/corpus
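For readers who want a feel for what the conversion step involves, here is a rough sketch. It is not the code in convert_to_corpus.py. It assumes each raw JSON file has `abstract` and `body_text` lists whose entries carry a `text` field; those field names are an assumption about the dataset's schema. The regex splitter `naive_split` is a stand-in so the example runs without extra downloads; `nltk.sent_tokenize` could be passed in its place.

```python
import json
import re
from typing import Callable, Dict, List, Optional


def naive_split(text: str) -> List[str]:
    """Stand-in sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def section_to_text(paragraphs: List[Dict],
                    split: Callable[[str], List[str]] = naive_split) -> str:
    """One sentence per line, blank line between paragraphs."""
    blocks = []
    for paragraph in paragraphs:
        sentences = split(paragraph.get("text", ""))  # assumed field name
        if sentences:
            blocks.append("\n".join(sentences))
    return "\n\n".join(blocks)


def convert(raw_json: str,
            split: Callable[[str], List[str]] = naive_split) -> Optional[Dict[str, str]]:
    """Return abstract and body text, or None when either part is empty."""
    doc = json.loads(raw_json)
    abstract = section_to_text(doc.get("abstract", []), split)   # assumed field name
    body = section_to_text(doc.get("body_text", []), split)      # assumed field name
    if not abstract or not body:
        return None  # incomplete article; a real run would log it instead
    return {"abstract": abstract, "body": body}


# Made-up document, only to show the shape of the output.
sample = json.dumps({
    "abstract": [{"text": "A one. A two."}],
    "body_text": [{"text": "B one."}, {"text": "B two. B three."}],
})
print(convert(sample)["body"])
```

Returning None for incomplete articles mirrors the filtering described in the last step, where such files are left out of the corpus folders and recorded separately.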
