Skip to content

mattbierbaum/arxiv-public-datasets

Repository files navigation

arXiv public datasets

This project is part of a submission to an ICLR 2019 workshop, RLGM Representation Learning on Graphs and Manifolds. The manuscript can be found on arXiv:1905.00075. Our primary purpose is to develop a set of tools to standardize and facilitate use of the arXiv as a dataset. Due to licensing and distribution issues, our work is primarily a set of scripts which builds the dataset from various public data sources. There are additional cleaning, organization, and aggregation functions that it performs as well.

This project is under development as we try to best fit the needs of the community. We have adopted semver and as such will denote major releases with the first numeral in the tagged version.

Currently, the project grabs or generates:

  • Article metadata -- title, authors string, category, doi, abstract, submitter
  • PDFs -- all PDFs available through arXiv bulk download
  • Plain text -- PDFs converted to UTF-8 encoded plain text
  • Citation graph -- intra-arXiv citation graph between arXiv IDs only (generated from plain text)
  • Author string parsing -- convert metadata author strings into standardized list of name, affiliations

We are able to host certain generated portions of this dataset as released snapshots. The iterations can be found under the releases tab: Releases. However, the rest of it must be generated locally.

Kaggle hosting arXiv data

Kaggle has agreed to cover the cost of hosting the arXiv PDFs, and has created some challenges and tasks associated with the arXiv data. You can use our tools from their data source: download it from Google cloud using their instructions, storing the PDFs in your local $ARXIV_DATA/tarpdfs directory. Skip the downloading stage of downloading the PDFs below.

Contributing

Thank you for your interest in contributing! We welcome many forms of input -- GitHub issues, email, pull requests to name a few. Let us know how if there are ways we can better present and preserve the dataset.

Citing

If you wish to cite the dataset in your own work, a suggested bibtex entry is:

@misc{clement2019arxiv,
    title={On the Use of ArXiv as a Dataset},
    author={Colin B. Clement and Matthew Bierbaum and Kevin P. O'Keeffe and Alexander A. Alemi},
    year={2019},
    eprint={1905.00075},
    archivePrefix={arXiv},
    primaryClass={cs.IR}
}

Setup on Linux

Install the required system packages (or use an alternative Python distribution of your choice). For Debian / Ubuntu / similar:

sudo apt install python3 python3-pip python3-virtualenv poppler-utils

Download the code and prepare the python environment:

git clone https://github.com/mattbierbaum/arxiv-public-datasets
cd arxiv-public-datasets

virtualenv venv
. venv/bin/activate

pip3 install -e .
pip3 install -r requirements.txt

Decide where the data should live and modify the config.json file. This directory needs to have adequate space to hold ~ 1TB of pdfs and ~ 70GB of text if you so choose to retrieve them:

cp config.json.example config.json
[set ARXIV_DATA in config.json to your own directory]

The scripts in bin will then create any of the three subdirectories:

$ARXIV_DATA/tarpdfs   # raw pdf files from Amazon AWS bucket
$ARXIV_DATA/fulltext  # .txt from raw .pdf
$ARXIV_DATA/output    # co-citation network, parsed author strings, etc

Article metadata

Run OAI metadata harvester

python bin/metadata.py [OPTIONAL filepath.json.gz]

This will download the entire ArXiv metadata set, saving it as a series of gzip-compressed JSON entries. The default save location is $ARXIV_DATA/arxiv-metadata-oai-<date>.json.gz. This process will take at least 6 hours, as the OAI server only sends 1000 entries every 15 seconds. A resumption token is saved, so the process can be restarted by running again.

PDFs (AWS download only)

Prepare credentials

In addition to the setup above, you need to prepare your AWS credentials for use with boto3, the Python AWS library. A long explanation is available here while the quick method is to:

apt install awscli
aws configure

Bulk download of ArXiv PDFs

This download costs about 100 USD (and is 1.1TB) at the time of writing, as the ArXiv bulk download only allows requester-pays AWS S3 downloads. Ensure that you have at least 1TB of free space in the directory specified in config.json:

python bin/pdfdownload.py [OPTIONAL manifest_file.json.gz]

Plain text

The tool works for both AWS download (.tar files) and Kaggle download (plain PDFs) by adding --PLAIN_PDFS optional argument.

python bin/fulltext.py -N [OPTIONAL number_of_processes, default cpu_count] --PLAIN_PDFS [OPTIONAL, use if plain PDFs are dowloaded]

Bulk PDF conversion for AWS download

To use our tool for text conversion of all the PDFs from the ArXiv bulk download described above, execute the following. NOTE: if you have not already downloaded the PDFs, this tool will do so. If you have downloaded them, be sure to not change the $ARXIV_DATA so that it will not re-download the tars.

python bin/fulltext.py -N [OPTIONAL number_of_processes, default cpu_count] 

At the time of writing, converting 1.39 million articles requires over 400 core-hours using two Intel Xeon E5-2600 CPUs.

Bulk PDF conversion for plain PDFs, e.g. downdloaded from Kaggle (Google Cloud)

Download from AWS are .tar files. If plains PDFs are downloaded, e.g. from Kaggle, use the optional argument --PLAIN_PDFS:

python bin/fulltext.py -N [OPTIONAL number_of_processes, default cpu_count] --PLAIN_PDFS

Cocitation network

To generate the cocitation network, you first must have the full text. Then, with the directories still set up, run:

python bin/cocitations.py [OPTIONAL number_of_processes, default cpu_count]

The cocitation network will by default be saved in $ARXIV_DATA/output/internal-citations.json.gz.

Author string split

The OAI metadata from the ArXiv features author strings as submitted by article authors. In order to use them in a principled way, theses strings must be parsed and split. To generate and save these author splittings, run:

python bin/authorsplit.py [OPTIONAL number_of_processes, default cpu_count]

The split author strings will by default be saved in $ARXIV_DATA/output/authors-parsed.json.gz.

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property value
name arXiv Public Datasets
url
sameAs https://github.com/mattbierbaum/arxiv-public-datasets
description The arXiv pre-print service is the de facto venue for publishing in many scientific disciplines. This repository provides tools for using all the publicly available information provided by the arXiv to download all of the publications and their metadata, extract fulltext from PDFs, and build a co-citation graph. For each publication the tools provide access to: * **Article metadata** -- title, authors string, category, doi, abstract, submitter * **PDFs** -- all PDFs available through arXiv bulk download * **Plain text** -- PDFs converted to UTF-8 encoded plain text * **Citation graph** -- intra-arXiv citation graph between arXiv IDs only (generated from plain text) * **Author string parsing** -- convert metadata author strings into standardized list of name, affiliations
citation https://arxiv.org/abs/1905.00075
license https://raw.githubusercontent.com/mattbierbaum/arxiv-public-datasets/master/LICENSE