ContentMine facilitates scraping journals via `getpapers`, `quickscrape`, and `journal-scrapers`, but finding the links to feed into `quickscrape` remains a tedious job if done manually. This repository provides a way of spidering journals that requires only minimal user adjustment.
The main workhorse, `get_links.py`, was written by Laszlo Szathmary in 2011. This script returns all links on a webpage, which is all we really need (a minimal sketch of the idea follows the steps below). Link extraction currently works for SAGE journals and Springer journals. To run it:
- Import `spiderer` as a module into Python (make sure to have installed the `BeautifulSoup` module; run `pip install BeautifulSoup` to do this).
- Run `spiderer.sage(journal = '')` or `spiderer.springer(journal = '')` to download all links for that specific journal. For `spiderer.sage()` you only need the first three letters of the journal's web URL (e.g., `pss` for Psychological Science); for `springer()` you need the unique journal identifier (e.g., `13428` for Behavior Research Methods); for `elsevier()` you need the unique journal identifier (e.g., `2212683X` for Biologically Inspired Cognitive Architectures).
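For readers curious what the spidering boils down to, here is a minimal sketch of get_links-style link extraction: fetch a page and return every `<a href>` on it. This is not the repository's actual code; it assumes the `requests` library and BeautifulSoup 4 (`pip install requests beautifulsoup4`), and the example URL is purely illustrative.

```python
# Minimal sketch of get_links-style link extraction (not the actual
# get_links.py): fetch a page and return every link on it as an
# absolute URL. Assumes requests and BeautifulSoup 4 are installed.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def get_links(url):
    """Return all absolute links found on the page at `url`."""
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Collect the href of every anchor tag, resolved against the page URL.
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    # Hypothetical journal table-of-contents URL, for illustration only.
    for link in get_links("https://journals.sagepub.com/toc/pss/current"):
        print(link)
```

In the repository itself the end-to-end equivalent is simply, e.g., `spiderer.sage(journal = 'pss')`.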
If you want to collect the links for all journals available in `journal_list.csv`, you only need to run `python run_all.py` in the command line of your choosing.
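A hedged sketch of what such a driver could look like is below. The layout of `journal_list.csv` is an assumption here (a publisher column naming the `spiderer` function plus a journal column); the real file and `run_all.py` may well differ.

```python
# Sketch of a run_all-style driver (not the actual run_all.py). It assumes
# journal_list.csv has two columns, publisher and journal -- a hypothetical
# layout -- and that spiderer exposes one function per publisher.
import csv

import spiderer


def run_all(path="journal_list.csv"):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # assumed columns: publisher, journal
            scrape = getattr(spiderer, row["publisher"], None)
            if scrape is None:
                print("No scraper for publisher:", row["publisher"])
                continue
            scrape(journal=row["journal"])


if __name__ == "__main__":
    run_all()
```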
Still to do:

- Incorporate some form of selection mechanism into the `journal_list`
- Incorporate a date checker to prevent re-spidering of recently spidered journals (what is a reasonable timeframe for this?); one possible approach is sketched after this list
- Incorporate Elsevier
- Incorporate Taylor & Francis
- Incorporate Wiley
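As food for thought on the date checker, here is one hedged sketch: keep a small JSON file of last-spidered timestamps and skip any journal visited within a chosen window. The file name, the 30-day default, and the function names are all assumptions, not decisions the repository has made.

```python
# Hedged sketch of the proposed date checker (not part of the repository):
# record when each journal was last spidered and skip it if that happened
# within the last `days` days. File name and window are assumptions.
import json
import time

TIMESTAMP_FILE = "last_spidered.json"  # hypothetical bookkeeping file


def _load_stamps():
    try:
        with open(TIMESTAMP_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def should_spider(journal, days=30):
    """True if `journal` was not spidered within the last `days` days."""
    stamps = _load_stamps()
    return time.time() - stamps.get(journal, 0) > days * 24 * 3600


def mark_spidered(journal):
    """Record now as `journal`'s last-spidered time."""
    stamps = _load_stamps()
    stamps[journal] = time.time()
    with open(TIMESTAMP_FILE, "w") as f:
        json.dump(stamps, f)
```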