This tool lets you create a dataset from the EuropePMC Open Access dumps. At the end of the process you will have a final CSV file in which each row contains the identifier of the citing document and the related cited documents.
This tool was created as part of the Open Biomedical Citations in Context Corpus research project, and it is currently used to speed up the OpenCitations Bibliographic Entries Extractor (BEE) process.
The workflow is divided into the following steps:
- download the dumps (skippable)
- download the IDs file and generate a pickle dump of it to enable fast lookups
- unzip the articles from each dump and store their XML files separately, deleting the original dump at the end (run concurrently; the number of threads can be specified)
- process the articles (a sketch of this step follows the list). Each XML file is transformed into a row of the dataset with the following fields:
  - cur_doi
  - cur_pmid
  - cur_pmcid
  - cur_name (the reference to the XML file needed for BEE/Jats2OC)
  - references (a JSON-dumped string containing a list of identifiers of the cited documents)

  If any of the previous IDs is not contained in the XML, the PMID or PMCID is used to look up the missing ones in the IDs file. A citing article or a cited one without any ID is not saved, and a citing article without cited references is not saved either. This step runs in parallel (the number of threads can be specified), and you can choose to store everything directly in a single dataset.csv (slow) or in many CSV files that are concatenated at the end (fast).
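To make the processing step more concrete, here is a minimal sketch of what could happen for each extracted XML file. It is not the tool's actual implementation: the JATS element paths, the structure of the pickled IDs index, and the function name process_article are assumptions made for illustration only.

```python
import json
import pickle
from xml.etree import ElementTree as ET


def process_article(xml_path, ids_index):
    """Turn one JATS XML file into a dataset row, or return None if the
    article must be skipped (no usable IDs or no cited references).

    `ids_index` is assumed to be the pickled IDs file loaded as a dict
    mapping a PMID/PMCID to the other identifiers of that article.
    """
    root = ET.parse(xml_path).getroot()

    # Identifiers of the citing article, taken from the article metadata.
    ids = {el.get("pub-id-type"): (el.text or "").strip()
           for el in root.findall(".//article-meta/article-id")}
    cur_doi, cur_pmid, cur_pmcid = ids.get("doi"), ids.get("pmid"), ids.get("pmcid")

    # If some IDs are missing, try to recover them from the IDs index
    # using the PMID or the PMCID as key.
    for key in (cur_pmid, cur_pmcid):
        if key and key in ids_index and not (cur_doi and cur_pmid and cur_pmcid):
            found = ids_index[key]  # assumed shape: {"doi": ..., "pmid": ..., "pmcid": ...}
            cur_doi = cur_doi or found.get("doi")
            cur_pmid = cur_pmid or found.get("pmid")
            cur_pmcid = cur_pmcid or found.get("pmcid")

    # A citing article without any identifier is not saved.
    if not (cur_doi or cur_pmid or cur_pmcid):
        return None

    # Identifiers of the cited documents, taken from the reference list.
    references = []
    for ref in root.findall(".//ref-list/ref"):
        ref_ids = {el.get("pub-id-type"): (el.text or "").strip()
                   for el in ref.findall(".//pub-id")}
        if ref_ids:  # cited entries without any ID are dropped
            references.append(ref_ids)

    # A citing article without cited references is not saved either.
    if not references:
        return None

    return {
        "cur_doi": cur_doi,
        "cur_pmid": cur_pmid,
        "cur_pmcid": cur_pmcid,
        "cur_name": xml_path,  # reference to the XML file, used by BEE/Jats2OC
        "references": json.dumps(references),
    }


if __name__ == "__main__":
    # Hypothetical usage: load the pickled IDs index, then process one file.
    with open("ids.pickle", "rb") as f:
        ids_index = pickle.load(f)
    print(process_article("PMC1234567.xml", ids_index))
```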
All the files needed to build the dataset are automatically downloaded by the script.
You'll find the result in {path}/csv/dataset.csv.
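As an example, once the run is finished the output can be inspected as follows. This is only a convenience snippet: pandas is not required by the tool itself, and {path} stands for the start_path configured in config.py.

```python
import json
import pandas as pd

# Replace {path} with the start_path set in config.py.
df = pd.read_csv("{path}/csv/dataset.csv")

# The `references` column is a JSON-dumped string: decode it back to a
# Python list of cited-document identifiers for each citing article.
df["references"] = df["references"].apply(json.loads)

print(df[["cur_doi", "cur_pmid", "cur_pmcid"]].head())
print(df.loc[0, "references"])  # identifiers cited by the first article
```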
Install the dependencies with pip install -r requirements.txt.
Specify the parameters in the config.py file (an example configuration is shown after the parameter list):
- start_path: the full path to the directory where everything will be stored, e.g. "/temp_data_europepubmed-central-dataset"
- writing_multiple_csv: a boolean that lets you specify whether to store the results directly in the final CSV during the process, or to store them in separate CSV files and then merge them. Set to True for high speed.
- skip_download: a boolean to specify whether to download the dumps or to skip this phase (e.g. you have already manually downloaded the ones you want)
- download_workers: the number of processes spawned to download the dumps
- max_retry: max number of retries if something goes wrong while downloading an OA dump
- sec_between_retry: seconds between each retry
- unzip_threads: number of threads involved in the extraction of the dumps
- process_article_threads: number of threads involved in the processing of the extracted XML articles
- max_file_to_download: max number of OA dumps to download. Set to None in order to download all.
- folder_articles: the number of directories that will be created for each dump, in which the extracted XML files will be stored
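A minimal example config.py using the parameters listed above; the values are placeholders, not recommended settings:

```python
# config.py -- example values only; tune them for your machine.

start_path = "/temp_data_europepubmed-central-dataset"  # where everything is stored

writing_multiple_csv = True   # write many CSV files and merge them at the end (fast)
skip_download = False         # set to True if the dumps are already on disk

download_workers = 4          # processes spawned to download the dumps
max_retry = 3                 # retries for a failed OA dump download
sec_between_retry = 30        # seconds to wait between retries

unzip_threads = 2             # threads extracting the dumps
process_article_threads = 2   # threads processing the extracted XML articles
max_file_to_download = None   # None = download all the OA dumps

folder_articles = 10          # directories created per dump to store the XMLs
```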
Don't set unzip_threads and process_article_threads too high, or you may run into memory saturation and "Too many open files" errors.
Run it with $ python3 EuropePubMedCentralDataset.py