Repository with tools to spider journal websites for links to articles

chartgerink/journal-spiders

journal-spider: facilitating the spidering of journal articles to scrape


The ContentMine facilitates scraping journals via getpapers, quickscrape, and journal-scrapers, but finding the links to feed into quickscrape remains tedious when done manually. This repository provides a way of spidering journals that requires only minimal user adjustment.

The main workhorse, get_links.py, was written by Laszlo Szathmary in 2011. It returns all links on a webpage, which is all we really need. Link extraction currently works for SAGE and Springer journals. To run it:

  1. Import spiderer as a module in Python (make sure the BeautifulSoup module is installed first; run pip install BeautifulSoup to install it)
  2. Run spiderer.sage(journal = '') or spiderer.springer(journal = '') to download all links for that journal. For spiderer.sage() you only need the first three letters of the web URL (e.g., pss for Psychological Science); for spiderer.springer() you need the unique journal identifier (e.g., 13428 for Behavior Research Methods); for elsevier() you need the unique journal identifier (e.g., 2212683X for Biologically Inspired Cognitive Architectures).
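
As a rough illustration of the link extraction step described above, the sketch below collects every href found in a page's HTML. It uses only the Python standard library rather than BeautifulSoup, and the names LinkCollector and extract_links are hypothetical; it is not the code in get_links.py.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper, not get_links.py)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all links in an HTML string as absolute URLs, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

A spider for one publisher would then filter this list down to the links that point at articles or tables of contents, which is the part that differs per journal website.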

To collect the links for all journals listed in journal_list.csv, run python run_all.py from the command line.
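
A batch run of this kind could look roughly like the sketch below. The column names publisher and journal, the run_all function, and the dispatch table are assumptions for illustration; the actual layout of journal_list.csv and the contents of run_all.py may differ.

```python
import csv


def run_all(csv_lines, spiders):
    """Call the matching spider function for every row of a journal list.

    csv_lines: an open file or any iterable of CSV lines, with the
               (assumed) header columns 'publisher' and 'journal'.
    spiders:   dict mapping a publisher name to a callable that accepts
               a 'journal' keyword argument.
    Returns the (publisher, journal) pairs that were skipped because no
    spider is registered for that publisher.
    """
    skipped = []
    for row in csv.DictReader(csv_lines):
        publisher = row["publisher"].strip().lower()
        journal = row["journal"].strip()
        spider = spiders.get(publisher)
        if spider is None:
            # No spider for this publisher yet; record it and move on.
            skipped.append((publisher, journal))
            continue
        spider(journal=journal)
    return skipped
```

With the real module, the dispatch table would presumably be something like {"sage": spiderer.sage, "springer": spiderer.springer}, and csv_lines would be the opened journal_list.csv.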

To-do

  • Incorporate some form of selection mechanism into the journal_list
  • Incorporate a date checker to prevent re-spidering of recently spidered journals (what is a reasonable timeframe for this?)
  • Incorporate Elsevier
  • Incorporate Taylor & Francis
  • Incorporate Wiley
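
For the date checker item above, one possible approach is sketched here. The function name needs_respider is hypothetical, and the 30-day default is an arbitrary placeholder, since the reasonable timeframe is still an open question.

```python
from datetime import datetime, timedelta


def needs_respider(last_spidered, now=None, max_age_days=30):
    """Return True if a journal was never spidered or its last run is too old.

    last_spidered: datetime of the previous run, or None if there was none.
    now:           datetime to compare against (defaults to the current time).
    max_age_days:  placeholder threshold; adjust once a timeframe is chosen.
    """
    if last_spidered is None:
        return True
    if now is None:
        now = datetime.now()
    return now - last_spidered > timedelta(days=max_age_days)
```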
