Data Reusability Pipeline

Analysis pipeline for the paper "A comprehensive analysis of the reusability of public omics data across 2.8 million research publications", produced by the Mangul Lab at USC.

This analysis includes publications that interact with one or both of two online resources hosted by the National Center for Biotechnology Information (NCBI): the Sequence Read Archive (SRA) and the Gene Expression Omnibus (GEO).

Authors: Nicholas Darci-Maher, Kerui Peng, Dat Duong, Richard J. Abdill, Eleazar Eskin, Serghei Mangul

Download data

Download the most recent open access subset of PubMed Central (PMC) publications. Rename journals with commas in their names to avoid issues downstream. Download metadata reference tables for every public SRA and GEO dataset.

Note: this data is large. Create a directory outside this repository to store the data, and point each script to that directory where appropriate.

cd scripts
./download_publications.sh
./rename_CommaJournals.sh
./download_refs.py
cd ../

Select papers mentioning SRA or GEO

Parse the text of every publication for regular expressions matching SRA and GEO accession IDs.

cd scripts
./preFilterPMCscrape.sh
cd ../
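
The accession matching described above can be sketched in Python. This is only an illustration: the patterns below are assumptions based on the common shapes of SRA accessions (for example SRP, SRX, SRR, SRS and their ERx/DRx counterparts) and GEO accessions (GSE, GSM, GPL, GDS). The expressions actually used by preFilterPMCscrape.sh may differ, and the function name find_accessions is hypothetical.

```python
import re

# Assumed patterns, not taken from the pipeline's scripts.
# SRA-style accessions: S/E/D + R + P/R/S/X, followed by six or more digits.
SRA_PATTERN = re.compile(r"\b[SED]R[PRSX]\d{6,}\b")
# GEO-style accessions: GSE, GSM, GPL or GDS followed by digits.
GEO_PATTERN = re.compile(r"\bG(?:SE|SM|PL|DS)\d+\b")


def find_accessions(text):
    """Return the unique SRA-like and GEO-like accession IDs found in text."""
    return {
        "SRA": sorted(set(SRA_PATTERN.findall(text))),
        "GEO": sorted(set(GEO_PATTERN.findall(text))),
    }


sample = (
    "Reads were deposited in SRA under SRP123456 and in GEO under "
    "GSE98765 (sample GSM1234567)."
)
print(find_accessions(sample))
```

A paper would be selected if either list is non-empty. Word boundaries (\b) keep the patterns from matching inside longer tokens.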

Extract the publication date from every selected paper

Create a key file containing the path to each selected paper. Then, parse the XML files to find the earliest listed publication date.

cd scripts
./gen_pmc_paths.sh
./extractDate.sh
cd ../
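
As a rough illustration of the date extraction step, the sketch below assumes the PMC XML follows the JATS convention of pub-date elements with year, month and day children. It is not the logic of extractDate.sh, and the function name earliest_pub_date is hypothetical. Missing months or days default to 1, and entries with non-numeric values are skipped.

```python
import xml.etree.ElementTree as ET
from datetime import date


def earliest_pub_date(xml_text):
    """Return the earliest <pub-date> in an article as a date, or None."""
    root = ET.fromstring(xml_text)
    dates = []
    for pub_date in root.iter("pub-date"):
        year = pub_date.findtext("year")
        if year is None:
            continue  # a date without a year cannot be ordered
        # Assumption: treat a missing month or day as 1.
        month = pub_date.findtext("month") or "1"
        day = pub_date.findtext("day") or "1"
        try:
            dates.append(date(int(year), int(month), int(day)))
        except ValueError:
            continue  # skip non-numeric or out-of-range values
    return min(dates) if dates else None


sample = """<article><front><article-meta>
<pub-date pub-type="ppub"><month>6</month><year>2019</year></pub-date>
<pub-date pub-type="epub"><day>14</day><month>3</month><year>2019</year></pub-date>
</article-meta></front></article>"""
print(earliest_pub_date(sample))
```

Here the electronic publication date precedes the print date, so it is the one returned.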

Create a master table containing all the data

Launch Jupyter Notebook

Requires a Jupyter installation: https://jupyter.org/install

cd jupyter_notebooks
jupyter notebook

Merge data scraped from the PMC publications onto reference data from SRA and GEO.

Run jupyter_notebooks/create_metadata_table.ipynb
Run jupyter_notebooks/create_impactFactor_table.ipynb
Run jupyter_notebooks/analyze_metadata_table.ipynb
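
Conceptually, the merge is a left join of scraped accession mentions onto the reference metadata. The plain-Python sketch below shows the idea only. The column names (pmc_id, accession, release_date) and the function left_join are hypothetical, and the notebooks may well use a dataframe library instead.

```python
def left_join(mentions, reference, key="accession"):
    """Attach reference metadata to each mention, keeping unmatched mentions."""
    # Index the reference rows by the join key for constant-time lookup.
    by_key = {row[key]: row for row in reference}
    merged = []
    for mention in mentions:
        ref_row = by_key.get(mention[key], {})
        # Values from the mention take precedence if a column name collides.
        merged.append({**ref_row, **mention})
    return merged


# Hypothetical rows scraped from publications.
mentions = [
    {"pmc_id": "PMC0000001", "accession": "GSE98765"},
    {"pmc_id": "PMC0000002", "accession": "SRP123456"},
]
# Hypothetical rows from a reference metadata table.
reference = [
    {"accession": "GSE98765", "release_date": "2015-01-02"},
]

for row in left_join(mentions, reference):
    print(row)
```

Mentions with no matching reference row are kept without the extra columns, so they can be counted or inspected later.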

Create figures

Use the tables generated in the previous steps to visualize the findings.

Run jupyter_notebooks/vizualize_data.ipynb
