MedSpider is a collection of web-scraping scripts for gathering online conversations from health forums. It targets the online forums listed below, each categorized by the type of interaction between Patients (P) and Medics (M), for example P2M.
- Python 2.7
- lxml 4.1.0, installed via pip install lxml==4.1.0
- Pandas, which some of the scrapers also need
- Please note that the BMJ's Doc2Doc forum is discontinued; the scraper uses cached web pages from the Wayback Machine (Internet Archive)
- Specify the output directory to write results to by editing the doc2doc.py file's main entry point, e.g. Spidey().crawl('doc2doc') (default is 'doc2doc' if not specified)
- Run the script via command line or terminal with python doc2doc.py, which will create tab-separated output files in the output directory you specified
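Since the Doc2Doc scraper reads archived pages, it may help to see how a Wayback Machine snapshot address is formed. The following is a minimal sketch, not code from doc2doc.py: `wayback_url` is a hypothetical helper, and the forum address shown is a placeholder because the real one is not given here.

```python
def wayback_url(original_url, timestamp):
    """Build a Wayback Machine address for a snapshot of original_url.

    timestamp is a string of digits, typically in YYYYMMDDhhmmss form.
    """
    if not timestamp.isdigit():
        raise ValueError('timestamp must contain digits only')
    return 'https://web.archive.org/web/{0}/{1}'.format(timestamp, original_url)


# Placeholder address for illustration only (assumption, not the real forum URL).
print(wayback_url('http://example.com/forum/thread/1', '20150101000000'))
```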
DocCheck Blogs [M2M]
- This scraper requires a registered medic-related account on DocCheck
- Specify the output directory to write results to by editing the doccheck.py file's main entry point, e.g. Spidey().crawl('doccheck') (default is 'doccheck' if not specified)
- Run the script via command line or terminal with python doccheck.py, which will create tab-separated output files in the specified directory: blogs.tsv, comments.tsv, and topics.tsv
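Once the scraper has finished, the tab-separated files can be inspected with the standard csv module. This is a minimal sketch under assumptions: the column layout, the presence of a header row, and the UTF-8 encoding are not documented here, so `read_tsv` (a hypothetical helper) simply returns every row as a list of strings. The demo writes its own small sample file rather than relying on real scraper output.

```python
import csv
import os
import sys
import tempfile


def read_tsv(path):
    """Return all rows of a tab-separated file as lists of strings."""
    if sys.version_info[0] < 3:
        # Python 2: the csv module expects a binary file handle.
        handle = open(path, 'rb')
    else:
        # Python 3: text mode with newline='' as the csv docs recommend.
        # UTF-8 is an assumption about the scraper's output encoding.
        handle = open(path, newline='', encoding='utf-8')
    with handle:
        return list(csv.reader(handle, delimiter='\t'))


# Demo on made-up sample data (not real DocCheck output).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'comments.tsv')
with open(demo_path, 'w') as out:
    out.write('1\tfirst comment\n2\tsecond comment\n')

rows = read_tsv(demo_path)
print(len(rows))
```

The same helper would apply to blogs.tsv and topics.tsv, since all three are described as tab-separated.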
eHealth Forum Questions [P2M]
- Specify the output directory to write results to by editing the ehealthforum.py file's main entry point, e.g. Spidey().crawl('ehealthforum') (default is 'ehealthforum' if not specified)
- Run the script via command line or terminal with python ehealthforum.py, which will create a tab-separated output file called chats.tsv in the specified directory
- To run the unit tests, use pytest -q ehealthforum.py
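Passing the scraper file itself to pytest implies that the tests live in the same module as the scraping code. The sketch below shows what that arrangement can look like; `clean_text` and both test functions are hypothetical and are not taken from ehealthforum.py. pytest collects functions whose names start with `test_` from any file named explicitly on the command line.

```python
def clean_text(raw):
    """Hypothetical helper: collapse runs of whitespace into single spaces."""
    return ' '.join(raw.split())


def test_clean_text_collapses_whitespace():
    assert clean_text('  first \n second\t third ') == 'first second third'


def test_clean_text_handles_empty_input():
    assert clean_text('') == ''
```

Keeping tests next to the code makes each scraper a single self-contained file, at the cost of mixing test and production code.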
Scrape the Doctors Lounge Forum in 3 Steps [P2M]
- Specify the output directory to write results to by editing the doctorslounge.py file's main entry point, e.g. Spidey().crawl('doctorslounge') (default is 'doctorslounge' if not specified)
- Run the script via command line or terminal with python doctorslounge.py, which will create a tab-separated output file called discussions.tsv in the specified directory
- To run the unit tests, use pytest -q doctorslounge.py
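Whether doctorslounge.py creates a missing output directory is not stated here. A minimal sketch, assuming it does not, is to create the directory before running the scraper. `ensure_dir` is a hypothetical helper, written so that it works on Python 2.7 as well as Python 3.

```python
import errno
import os
import tempfile


def ensure_dir(path):
    """Create path (and any parents) unless it already exists as a directory."""
    try:
        os.makedirs(path)
    except OSError as exc:
        # Ignore "already exists" only when the existing entry is a directory.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise
    return path


# Demo inside a temporary location so nothing is left in the working directory.
base = tempfile.mkdtemp()
target = ensure_dir(os.path.join(base, 'doctorslounge'))
ensure_dir(target)  # calling it a second time is harmless
print(os.path.isdir(target))
```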
Scrape the Optimal Health Network (OHN) Live Chat Archives in 3 Steps [P2M]
- Specify the output directory to write results to by editing the ohn.py file's main entry point, e.g. Spidey().crawl('ohn') (default is 'ohn' if not specified)
- Run the script via command line or terminal with python ohn.py, which will create a tab-separated output file called chats.tsv in the specified directory
- To run the unit tests, use pytest -q ohn.py
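The "default if not specified" behaviour described above is most likely a default argument on the crawl method. The sketch below illustrates that pattern with a stand-in class; `SpideySketch` and the parameter name `output_dir` are assumptions for illustration, not the actual definitions in ohn.py.

```python
class SpideySketch(object):
    """Stand-in for illustration only; not the Spidey class from ohn.py."""

    def crawl(self, output_dir='ohn'):
        # A real scraper would fetch pages and write chats.tsv into
        # output_dir; this stub only reports which directory it would use.
        return output_dir


if __name__ == '__main__':
    print(SpideySketch().crawl())            # falls back to the default
    print(SpideySketch().crawl('my_chats'))  # explicit output directory
```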
Johns Hopkins Breast Center Expert Answers in 3 Steps [P2M]
- Specify the output directory to write results to by editing the hopkins.py file's main entry point, e.g. Spidey().crawl('hopkins') (default is 'hopkins' if not specified)
- Run the script via command line or terminal with python hopkins.py, which will create a tab-separated output file called discussions.tsv in the specified directory
- To run the unit tests, use pytest -q hopkins.py
Scrape the Health Stack Exchange Q&A Forums in 3 Steps [P2P]
- Specify the output directory (must exist) to write results to by editing the healthse.py file's main entry point, e.g. Spidey().crawl('healthse') (default is 'healthse' if not specified)
- Run the script via command line or terminal with python healthse.py, which will create a collection of tab-separated output files (please note that Stack Exchange has rate limits): questions.tsv, answers.tsv, question_comments.tsv, and answer_comments.tsv
- To run the unit tests, use pytest -q healthse.py
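Because of the rate limits mentioned above, a scraper generally needs to space out its requests. How healthse.py handles this is not described here; the following is a generic sketch of a fixed minimum delay between requests. `Throttle` is a hypothetical class, and the interval is an arbitrary illustrative value rather than a documented Stack Exchange limit.

```python
import time


class Throttle(object):
    """Enforce a minimum number of seconds between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call wait() immediately before each request.
throttle = Throttle(min_interval=0.01)  # tiny interval so the demo runs quickly
for _ in range(3):
    throttle.wait()
    # a page fetch would go here
```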
Parse the Health Stack Exchange Q&A Archives in 3 Steps [P2P]
- Download the health.stackexchange.com.7z archive file and extract it using 7-Zip, which has Ubuntu and Windows versions
- Note the dataset folder where the extracted XML files are located
- The SEParser.py script can create question pairs using the XML files via python SEParse.py dataset-folder, for example python SEParse.py SEparse. It will save the results to a CSV file within the dataset folder (in the case of the example, the file will be called SEparse.csv). The script can be modified to perform other extraction and parsing tasks from the XML files.
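What the script means by "question pairs" is not spelled out here. One plausible reading is pairs of questions that are linked as duplicates, and the sketch below follows that assumption; it is not the script's actual logic. In Stack Exchange data dumps, Posts.xml holds one row element per post (PostTypeId 1 marks a question) and PostLinks.xml records links between posts (LinkTypeId 3 marks a duplicate). The XML strings are made-up samples in that layout, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample data in the layout of Posts.xml and PostLinks.xml.
POSTS_XML = """<posts>
  <row Id="1" PostTypeId="1" Title="Sample question A" />
  <row Id="2" PostTypeId="2" ParentId="1" />
  <row Id="3" PostTypeId="1" Title="Sample question B" />
</posts>"""

LINKS_XML = """<postlinks>
  <row Id="10" PostId="3" RelatedPostId="1" LinkTypeId="3" />
  <row Id="11" PostId="3" RelatedPostId="2" LinkTypeId="1" />
</postlinks>"""


def question_titles(posts_root):
    """Map question Id to Title, skipping answers and other post types."""
    return dict(
        (row.get('Id'), row.get('Title'))
        for row in posts_root.iter('row')
        if row.get('PostTypeId') == '1'
    )


def duplicate_pairs(posts_root, links_root):
    """Return (title, related_title) tuples for duplicate-linked questions."""
    titles = question_titles(posts_root)
    pairs = []
    for row in links_root.iter('row'):
        if row.get('LinkTypeId') != '3':
            continue  # keep only duplicate links
        post_id, related_id = row.get('PostId'), row.get('RelatedPostId')
        if post_id in titles and related_id in titles:
            pairs.append((titles[post_id], titles[related_id]))
    return pairs


pairs = duplicate_pairs(ET.fromstring(POSTS_XML), ET.fromstring(LINKS_XML))
print(pairs)
```

For the real dump, `ET.parse(path).getroot()` would replace `ET.fromstring`, with the paths pointing into the dataset folder noted above.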