process.py is a script to process the ArXiv-PubMed dataset. ArXiv and PubMed (Cohan et al., 2018) are two long document datasets of scientific publications from arXiv.org (113k) and PubMed (215k). The task is to generate the abstract from the paper body.
These visualizations were created by running python graphs.py <arxiv_articles_dir> <pubmed_articles_dir>
Split Name | Avg Num Sents per Article | Avg Num Sents per Abstract |
---|---|---|
Test | 205.68 | 5.69 |
Train | 206.38 | 9.87 |
Validation | 204.24 | 5.60 |
Split Name | Avg Num Sents per Article | Avg Num Sents per Abstract |
---|---|---|
Test | 87.47 | 6.93 |
Train | 86.22 | 6.84 |
Validation | 87.90 | 6.84 |
The script processes the data into 6 files based on dataset splits. For each of the dataset split files (train.txt
, val.txt
and test.txt
), the articles are read from the arxiv and pubmed sections and written to text files train.source
, train.target
, val.source
, val.target
, and test.source
and test.target
. These will be placed in the newly created arxiv-pubmed
directory.
The output can be used for HHousen/TransformerExtSum to perform extractive summarization.
Steps:
- Download the data from armancohan/long-summarization or with the following direct links: PubMed (mirror) and ArXiv (mirror).
- Run the command
python process.py <arxiv_articles_dir> <pubmed_articles_dir>
(runtime: 5-10m).
Commands:
pip install gdown
gdown https://drive.google.com/uc?id=1lvsqvsFi3W-pE1SqNZI0s8NR9rC1tsja
gdown https://drive.google.com/uc?id=1b3rmCSIoh6VhD4HKWjI4HOW-cSwcwbeC
unzip pubmed-dataset.zip
unzip arxiv-dataset.zip
python process.py arxiv-dataset/ pubmed-dataset/