PubMed-FAST-Scrape is a Python package for biomedical NLP scientists. It scrapes PubMed article metadata quickly and efficiently by combining batched Bio.Entrez lookups with direct parsing of PubMed pages, making bulk article scraping around 200x faster than conventional one-article-at-a-time approaches. It lets researchers and data scientists easily gather articles by field of interest, year range, and minimum number of citations.
- Fast scraping of PubMed abstracts & article metadata.
- Filter articles by field of interest, year range, and minimum citations.
- Easy integration into data analysis pipelines.
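Much of the speedup described above comes from batching: instead of one network round trip per article, many PMIDs are retrieved in a single Entrez request. The snippet below is a minimal, illustrative sketch of that idea using Bio.Entrez directly; it is not the package's internal code, and the query string, year range, and retmax value are only examples.

from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI asks callers to identify themselves

# Step 1: search for PMIDs matching a field of interest and a publication-year range.
handle = Entrez.esearch(db="pubmed", term='"cancer research"[All Fields] AND 2010:2020[dp]', retmax=200)
record = Entrez.read(handle)
handle.close()
pmids = record["IdList"]

# Step 2: fetch metadata for all PMIDs in one batched request instead of one request per article.
handle = Entrez.efetch(db="pubmed", id=",".join(pmids), rettype="medline", retmode="text")
print(handle.read()[:500])  # preview the raw MEDLINE records
handle.close()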
Install PubMed-FAST-Scrape using pip:
pip install git+https://github.com/jimnoneill/pubmed-fast-scrape.git
PubMed-FAST-Scrape can be used as a command-line tool or imported into your Python scripts.
To use PubMed-FAST-Scrape from the command line:
python3 pubmed_scraper_cli.py --field "Cancer Research" --start_year 2010 --end_year 2020
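The flags above are only the ones shown in this README; assuming the CLI script exposes standard help output (an assumption, not confirmed here), the full list of available options can be printed with:

python3 pubmed_scraper_cli.py --help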
from pubmed_fast_scrape.scraper import PubMedScraper
# A real email address is not required, as the placeholder value suggests.
scraper = PubMedScraper(email='email_not_required@makesitfaster.com')
# Arguments: field of interest, (start_year, end_year) range, minimum number of citations.
results = scraper.scrape('Cancer Research', (2023, 2024), 1)
results.head()  # preview the scraped article metadata
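Because results supports .head(), it appears to behave like a pandas DataFrame (an assumption based on the example above, not a documented guarantee). If so, it slots straight into a typical analysis pipeline; the column name 'year' below is hypothetical and should be replaced with whatever columns the scraper actually returns.

# Continuing from the snippet above: persist the scraped metadata for downstream analysis.
results.to_csv("cancer_research_2023_2024.csv", index=False)

# Example aggregation: article counts per publication year
# (assumes a hypothetical 'year' column in the scraped metadata).
print(results["year"].value_counts().sort_index())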
To contribute to PubMed-FAST-Scrape, clone the repository and create a new branch for your feature or bug fix.
git clone https://github.com/yourusername/pubmed-fast-scrape.git
cd pubmed-fast-scrape
git checkout -b your-feature-branch
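After committing your changes, push the branch and open a pull request on GitHub (standard GitHub flow; the branch name below is the placeholder used above):

git push origin your-feature-branch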
This project is licensed under the MIT License - see the LICENSE file for details.
- The PubMed API for providing access to their invaluable database of articles.
- BioPython and BeautifulSoup for making data extraction easier.
This tool is intended for academic and research purposes. Please ensure you adhere to PubMed's terms of use when using this scraper.
For more information, or to report issues or ask questions about PubMed-FAST-Scrape, please visit the GitHub repository.