This repository contains Python code for querying the PubMed database via NCBI's Entrez Direct (EDirect). It also includes examples of other library functions that can run similar queries, possibly with limitations.
The typical use case is retrieving publications from PubMed. The code was written for my master's thesis to create a custom dataset for fine-tuning NLP models.
- Clone the repository:
git clone https://github.com/marcel8168/edirect-python edirect-python
- Copy your API key from PubMed (see How to get API key) into api_key.txt
- Customize the query for your use case in the file query.py. The current query returns all articles from the journal "N Engl J Med" (New England Journal of Medicine) that include an abstract.
- Build and run the Docker container, which automatically executes the query.py script:
cd edirect-python
# Docker runs all installations and executes the query.py script
docker compose up
- The saved XML file can then be converted into a pandas DataFrame:
# Extract data from XML and create a DataFrame
from xml.etree.ElementTree import ElementTree

import pandas as pd

xml_file = "nejm_data.xml"
data_path = "../edirect-python/results/"
data = []
tree = ElementTree()
xml = tree.parse(data_path + xml_file)  # parse() returns the root element
for rec in xml.findall('.//Rec'):
    pmid = None  # reset so the error message never reports a stale PMID
    try:
        common = rec.find('.//Common')
        pmid = common.find('PMID').text
        title = common.find('Title').text
        abstract = common.find('Abstract').text
        mesh_term_list = rec.find('.//MeshTermList')
        mesh_terms = [term.text for term in mesh_term_list.findall('MeshTerm')]
    except Exception as e:
        print(f"An error occurred: {e}")
        print(f"Error occurred for PMID: {pmid}")
        continue  # skip incomplete records instead of appending stale values
    data.append({'pmid': pmid, 'title': title,
                 'abstract': abstract, 'meshtermlist': mesh_terms, 'label': 0})
df = pd.DataFrame(data)
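
The parsing step above relies on the saved file using the element names Rec, Common, PMID, Title, Abstract, MeshTermList and MeshTerm. To try the extraction without running a query first, here is a minimal sketch on a small in-memory sample. The sample records, the root element name Set, and the helper name records_from_xml are made up for illustration and are not taken from the repository's actual output. Unlike the loop above, this variant keeps records that lack MeSH terms and stores an empty list for them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the element names used by the parsing code.
SAMPLE_XML = """<Set>
  <Rec>
    <Common>
      <PMID>1</PMID>
      <Title>Example title</Title>
      <Abstract>Example abstract.</Abstract>
    </Common>
    <MeshTermList>
      <MeshTerm>Humans</MeshTerm>
      <MeshTerm>Aged</MeshTerm>
    </MeshTermList>
  </Rec>
  <Rec>
    <Common>
      <PMID>2</PMID>
      <Title>Record without MeSH terms</Title>
      <Abstract>Another abstract.</Abstract>
    </Common>
  </Rec>
</Set>"""


def records_from_xml(xml_text):
    """Return one dict per Rec element; missing MeSH lists become []."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall(".//Rec"):
        mesh_list = rec.find(".//MeshTermList")
        mesh_terms = [] if mesh_list is None else [
            term.text for term in mesh_list.findall("MeshTerm")
        ]
        records.append({
            "pmid": rec.findtext(".//Common/PMID"),
            "title": rec.findtext(".//Common/Title"),
            "abstract": rec.findtext(".//Common/Abstract"),
            "meshtermlist": mesh_terms,
            "label": 0,
        })
    return records


records = records_from_xml(SAMPLE_XML)
```

The resulting list of dicts can be passed to pd.DataFrame(records) in the same way as the data list above.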
Topic | Link |
---|---|
PubMed API | https://www.ncbi.nlm.nih.gov/pmc/tools/developers/ |
Entrez Direct | https://www.ncbi.nlm.nih.gov/books/NBK179288/ |
EDirect Installation | https://dataguide.nlm.nih.gov/edirect/install.html |
ESearch | https://dataguide.nlm.nih.gov/edirect/esearch.html |
Xtract | https://dataguide.nlm.nih.gov/edirect/xtract.html |
E-Utilities | https://www.ncbi.nlm.nih.gov/books/NBK25499/ |
Journal IDs | https://ftp.ncbi.nih.gov/pubmed/J_Medline.txt |
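
As a rough illustration of the ESearch documentation linked above, the sketch below assembles an EDirect command line for the kind of query described in the setup steps (one journal, only articles with an abstract). It is not the content of query.py, which may build its query differently. The helper names build_pubmed_term and build_edirect_command are hypothetical, and the use of the [Journal] field tag, the hasabstract filter, and a following efetch step are assumptions about how such a query could be expressed.

```python
import shlex


def build_pubmed_term(journal, require_abstract=True):
    """Build a PubMed search term for one journal (hypothetical helper)."""
    term = f'"{journal}"[Journal]'
    if require_abstract:
        # hasabstract restricts results to records that include an abstract
        term += " AND hasabstract"
    return term


def build_edirect_command(term):
    """Return a shell pipeline: search PubMed, then fetch the hits as XML."""
    # shlex.quote protects the quotes and brackets in the term from the shell
    return f"esearch -db pubmed -query {shlex.quote(term)} | efetch -format xml"


term = build_pubmed_term("N Engl J Med")
command = build_edirect_command(term)
print(command)
```

Inside an environment where EDirect is installed, such as the Docker container, the command string could be run with subprocess.run(command, shell=True, capture_output=True, text=True).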
MIT License (Marcel Hiltner, 2023)