Python implementation of the NCBI's Entrez Direct (EDirect) API to query the PubMed Database

marcel8168/edirect-python


EDirect Python Implementation


This repository contains Python code for querying the PubMed database via NCBI's Entrez Direct (EDirect). It also points to other libraries that support similar queries, possibly with limitations.

Common usage

The typical use is retrieving publications from PubMed. I implemented it for my master's thesis to build a custom dataset for fine-tuning NLP models.

Installation and Execution

  1. Clone the repository:
git clone https://github.com/marcel8168/edirect-python edirect-python
  2. Copy your API key from PubMed (see How to get API key) into api_key.txt
  3. Customize the query for your use case in the file query.py. The current query returns all articles from the journal "N Engl J Med" (New England Journal of Medicine) that include an abstract.
  4. Build and run the Docker container, which automatically executes the query.py script:
cd edirect-python
# Docker runs all installations and executes the query.py script
docker compose up
  5. The saved XML can then be converted into a DataFrame:
# Extract data from XML and create a DataFrame
from xml.etree.ElementTree import ElementTree

import pandas as pd

xml_file = "nejm_data.xml"
data_path = "../edirect-python/results/"

data = []

# parse() returns the root element of the document
root = ElementTree().parse(data_path + xml_file)

for rec in root.findall('.//Rec'):
    pmid = None  # reset so the error message never reports a stale PMID
    try:
        common = rec.find('.//Common')
        pmid = common.find('PMID').text
        title = common.find('Title').text
        abstract = common.find('Abstract').text
        mesh_term_list = rec.find('.//MeshTermList')
        mesh_terms = [term.text for term in mesh_term_list.findall('MeshTerm')]
    except Exception as e:
        print(f"An error occurred: {e}")
        print(f"Error occurred for PMID: {pmid}")
        continue  # skip incomplete records instead of appending stale values

    data.append({'pmid': pmid, 'title': title,
                 'abstract': abstract, 'meshtermlist': mesh_terms, 'label': 0})
df = pd.DataFrame(data)
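
The same extraction can be tried without a results file. The sketch below is a minimal, self-contained variant that parses an inline sample and returns plain dictionaries instead of a DataFrame. The sample XML, the `Set` root element, and the helper name `parse_records` are assumptions made for illustration; only the element names `Rec`, `Common`, `PMID`, `Title`, `Abstract`, `MeshTermList`, and `MeshTerm` come from the snippet above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real results file may be laid out differently.
SAMPLE_XML = """<Set>
  <Rec>
    <Common>
      <PMID>1</PMID>
      <Title>First title</Title>
      <Abstract>First abstract.</Abstract>
    </Common>
    <MeshTermList>
      <MeshTerm>Humans</MeshTerm>
      <MeshTerm>Neoplasms</MeshTerm>
    </MeshTermList>
  </Rec>
  <Rec>
    <Common>
      <PMID>2</PMID>
      <Title>Second title</Title>
    </Common>
  </Rec>
</Set>"""


def parse_records(xml_text):
    """Return (records, skipped_pmids) parsed from an XML string."""
    records, skipped = [], []
    root = ET.fromstring(xml_text)
    for rec in root.iter("Rec"):
        # findtext() returns None when the element is missing
        pmid = rec.findtext(".//Common/PMID")
        title = rec.findtext(".//Common/Title")
        abstract = rec.findtext(".//Common/Abstract")
        if not (pmid and title and abstract):
            skipped.append(pmid)  # remember which record was incomplete
            continue
        mesh_terms = [t.text for t in rec.findall(".//MeshTermList/MeshTerm")]
        records.append({"pmid": pmid, "title": title, "abstract": abstract,
                        "meshtermlist": mesh_terms, "label": 0})
    return records, skipped


records, skipped = parse_records(SAMPLE_XML)
```

Passing `records` to `pd.DataFrame` should give the same columns as the snippet above. Unlike that snippet, this variant keeps a record that has no MeSH terms and stores an empty list for it.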

Further Information

EDirect and E-Utilities

| Topic | Link |
| --- | --- |
| PubMed API | https://www.ncbi.nlm.nih.gov/pmc/tools/developers/ |
| Entrez Direct | https://www.ncbi.nlm.nih.gov/books/NBK179288/ |
| EDirect Installation | https://dataguide.nlm.nih.gov/edirect/install.html |
| ESearch | https://dataguide.nlm.nih.gov/edirect/esearch.html |
| Xtract | https://dataguide.nlm.nih.gov/edirect/xtract.html |
| E-Utilities | https://www.ncbi.nlm.nih.gov/books/NBK25499/ |
| Journal IDs | https://ftp.ncbi.nih.gov/pubmed/J_Medline.txt |

Library Options

| Topic | Link |
| --- | --- |
| URL query | https://github.com/dtoddenroth/medicaleponyms/blob/main/downloadabstracts/pubmedcache.py |
| MetaPub | https://github.com/metapub/metapub |
| PyMed | https://github.com/gijswobben/pymed |
| EntrezPy | https://gitlab.com/ncbipy/entrezpy |
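
As a rough illustration of the "URL query" option, the sketch below builds an E-utilities ESearch request URL for the journal-plus-abstract search described earlier. It is not taken from query.py or from the linked project. The helper name `build_esearch_url` and the placeholder API key are assumptions. The parameters `db`, `term`, `retmax`, `retmode`, and `api_key`, and the `hasabstract` filter, are standard E-utilities and PubMed search syntax. The code only builds the URL and sends no request.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(journal, api_key=None, retmax=100):
    """Build an ESearch URL for articles of one journal that have an abstract."""
    # [Journal] restricts the match to the journal field;
    # hasabstract keeps only records with an abstract.
    term = f'"{journal}"[Journal] AND hasabstract'
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key  # optional; raises the allowed request rate
    return f"{ESEARCH_URL}?{urlencode(params)}"


# "YOUR_API_KEY" is a placeholder; in practice the key would be read from api_key.txt
url = build_esearch_url("N Engl J Med", api_key="YOUR_API_KEY")
```

The resulting URL can be fetched with any HTTP client. The response lists matching PMIDs, which would then be passed to EFetch to retrieve the full records.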

License

MIT License (Marcel Hiltner, 2023)
