HHousen/ArXiv-PubMed-Sum

ArXiv-PubMed-Sum

process.py processes the ArXiv-PubMed dataset. ArXiv and PubMed (Cohan et al., 2018) are two long-document datasets of scientific publications from arXiv.org (113k) and PubMed (215k). The task is to generate each paper's abstract from its body.

Stats/Visualizations

These visualizations were created by running python graphs.py <arxiv_articles_dir> <pubmed_articles_dir>

ArXiv

| Split Name | Avg Num Sents per Article | Avg Num Sents per Abstract |
| --- | --- | --- |
| Test | 205.68 | 5.69 |
| Train | 206.38 | 9.87 |
| Validation | 204.24 | 5.60 |

ArXiv Test

[Figures: arXiv-test-abstract_sents (sentences per abstract), arXiv-test-article_sents (sentences per article)]

ArXiv Train

[Figures: arXiv-train-abstract_sents (sentences per abstract), arXiv-train-article_sents (sentences per article)]

ArXiv Validation

[Figures: arXiv-val-abstract_sents (sentences per abstract), arXiv-val-article_sents (sentences per article)]

PubMed

| Split Name | Avg Num Sents per Article | Avg Num Sents per Abstract |
| --- | --- | --- |
| Test | 87.47 | 6.93 |
| Train | 86.22 | 6.84 |
| Validation | 87.90 | 6.84 |

PubMed Test

[Figures: PubMed-test-abstract_sents (sentences per abstract), PubMed-test-article_sents (sentences per article)]

PubMed Train

[Figures: PubMed-train-abstract_sents (sentences per abstract), PubMed-train-article_sents (sentences per article)]

PubMed Validation

[Figures: PubMed-val-abstract_sents (sentences per abstract), PubMed-val-article_sents (sentences per article)]
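The per-split averages in the tables above could be computed along the following lines. This is a minimal sketch, not the code in graphs.py: it assumes each split file holds one JSON object per line and that the fields `article_text` and `abstract_text` are lists of sentences. The function name `average_sentence_counts` is hypothetical.

```python
import json
from statistics import mean


def average_sentence_counts(split_file):
    """Return (avg sentences per article, avg sentences per abstract).

    Assumes one JSON object per line, with `article_text` and
    `abstract_text` each stored as a list of sentences (an assumption
    about the dataset layout, not taken from graphs.py).
    """
    article_counts = []
    abstract_counts = []
    with open(split_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            article_counts.append(len(record["article_text"]))
            abstract_counts.append(len(record["abstract_text"]))
    # statistics.mean raises StatisticsError if the file had no records
    return mean(article_counts), mean(abstract_counts)
```

Calling `average_sentence_counts("arxiv-dataset/test.txt")` would then give the two numbers for one row of the table, provided the assumptions about the file layout hold.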

Instructions

The script writes six files, two per dataset split. For each split file (train.txt, val.txt, and test.txt), it reads the articles from both the arxiv and pubmed directories and writes them to the text files train.source, train.target, val.source, val.target, test.source, and test.target. These files are placed in a newly created arxiv-pubmed directory.
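The merging step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of process.py: it assumes each split file is JSON lines with `article_text` and `abstract_text` stored as lists of sentences, that the article body goes to `.source` and the abstract to `.target`, and that each example occupies one line. The names `one_line` and `write_split` are hypothetical.

```python
import json
from pathlib import Path


def one_line(sentences):
    # Join a list of sentences and collapse all whitespace so that
    # each example fits on a single output line.
    return " ".join(" ".join(sentences).split())


def write_split(split, article_dirs, out_dir):
    """Merge `<dir>/<split>.txt` from every input directory into
    `<out_dir>/<split>.source` and `<out_dir>/<split>.target`.

    Returns the number of examples written. Field names and the
    one-example-per-line layout are assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(out_dir / f"{split}.source", "w", encoding="utf-8") as src, \
         open(out_dir / f"{split}.target", "w", encoding="utf-8") as tgt:
        for article_dir in article_dirs:
            split_path = Path(article_dir) / f"{split}.txt"
            with open(split_path, encoding="utf-8") as f:
                for line in f:
                    if not line.strip():
                        continue  # skip blank lines
                    record = json.loads(line)
                    src.write(one_line(record["article_text"]) + "\n")
                    tgt.write(one_line(record["abstract_text"]) + "\n")
                    count += 1
    return count
```

Under these assumptions, looping `write_split(split, ["arxiv-dataset", "pubmed-dataset"], "arxiv-pubmed")` over `"train"`, `"val"`, and `"test"` would produce the six files. Writing source and target in the same pass keeps line `i` of each file pointing at the same paper.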

The output can be used with HHousen/TransformerExtSum to perform extractive summarization.

Steps:

  1. Download the data from armancohan/long-summarization or with the following direct links: PubMed (mirror) and ArXiv (mirror).
  2. Run the command python process.py <arxiv_articles_dir> <pubmed_articles_dir> (runtime: 5-10 minutes).

Commands:

pip install gdown
gdown "https://drive.google.com/uc?id=1lvsqvsFi3W-pE1SqNZI0s8NR9rC1tsja"
gdown "https://drive.google.com/uc?id=1b3rmCSIoh6VhD4HKWjI4HOW-cSwcwbeC"
unzip pubmed-dataset.zip
unzip arxiv-dataset.zip
python process.py arxiv-dataset/ pubmed-dataset/
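After the commands finish, a quick sanity check is to confirm that each `.source` file has as many lines as its `.target` file. The sketch below assumes the output stores one example per line, which is not stated in this README; `count_lines` and `check_alignment` are hypothetical helper names.

```python
from pathlib import Path


def count_lines(path):
    # Count lines without loading the whole file into memory.
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_alignment(out_dir, splits=("train", "val", "test")):
    """Return {split: number of examples}.

    Raises ValueError if a split's .source and .target files differ in
    line count (assumes one example per line).
    """
    out_dir = Path(out_dir)
    sizes = {}
    for split in splits:
        n_source = count_lines(out_dir / f"{split}.source")
        n_target = count_lines(out_dir / f"{split}.target")
        if n_source != n_target:
            raise ValueError(
                f"{split}: {n_source} source lines vs {n_target} target lines"
            )
        sizes[split] = n_source
    return sizes
```

For example, `check_alignment("arxiv-pubmed")` would return the per-split example counts, or raise if any pair of files is out of step.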