ContentMine facilitates scraping journals via `getpapers`, `quickscrape`, and `journal-scrapers`, but finding the links to feed into `quickscrape` remains a tedious job if done manually. This repository provides a way of spidering journals that requires only minimal user adjustment.
The main workhorse, `get_links.py`, was written by Laszlo Szathmary in 2011. This script returns all links on a webpage, which is all we really need (a minimal sketch of the idea follows the steps below). Link extraction currently works for SAGE journals and Springer journals. To run it:
- Import `spiderer` as a module into Python (make sure to have installed the `BeautifulSoup` module; run `pip install BeautifulSoup` to do this).
- Run `spiderer.sage(journal = '')` or `spiderer.springer(journal = '')` to download all links for that specific journal. For `spiderer.sage()` you only need the first three letters of the journal's web URL (e.g., `pss` for Psychological Science); for `springer()` you need the unique journal identifier (e.g., `13428` for Behavior Research Methods); for `elsevier()` you need the unique journal identifier (e.g., `2212683X` for Biologically Inspired Cognitive Architectures).
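For readers curious what the spidering boils down to, here is a minimal sketch of get_links-style link extraction: fetch a page and return every `<a href>` on it. This is not the repository's actual code; it assumes the `requests` library and BeautifulSoup 4 (`pip install requests beautifulsoup4`), and the example URL is purely illustrative.

```python
# Minimal sketch of get_links-style link extraction (not the actual
# get_links.py): fetch a page and return every link on it as an
# absolute URL. Assumes requests and BeautifulSoup 4 are installed.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def get_links(url):
    """Return all absolute links found on the page at `url`."""
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Collect the href of every anchor tag, resolved against the page URL.
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    # Hypothetical journal table-of-contents URL, for illustration only.
    for link in get_links("https://journals.sagepub.com/toc/pss/current"):
        print(link)
```

In the repository itself the end-to-end equivalent is simply, e.g., `spiderer.sage(journal = 'pss')`.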
If you want to collect the links for all journals available in `journal_list.csv`, you only need to run `python run_all.py` in the command line of your choosing.
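A hedged sketch of what such a driver could look like is below. The layout of `journal_list.csv` is an assumption here (a publisher column naming the `spiderer` function plus a journal column); the real file and `run_all.py` may well differ.

```python
# Sketch of a run_all-style driver (not the actual run_all.py). It assumes
# journal_list.csv has two columns, publisher and journal -- a hypothetical
# layout -- and that spiderer exposes one function per publisher.
import csv

import spiderer


def run_all(path="journal_list.csv"):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # assumed columns: publisher, journal
            scrape = getattr(spiderer, row["publisher"], None)
            if scrape is None:
                print("No scraper for publisher:", row["publisher"])
                continue
            scrape(journal=row["journal"])


if __name__ == "__main__":
    run_all()
```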
Still to do:

- Incorporate some form of selection mechanism into the `journal_list`
- Incorporate a date checker to prevent re-spidering of recently spidered journals (what is a reasonable timeframe for this?); one possible approach is sketched after this list
- Incorporate Elsevier
- Incorporate Taylor & Francis
- Incorporate Wiley
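As food for thought on the date checker, here is one hedged sketch: keep a small JSON file of last-spidered timestamps and skip any journal visited within a chosen window. The file name, the 30-day default, and the function names are all assumptions, not decisions the repository has made.

```python
# Hedged sketch of the proposed date checker (not part of the repository):
# record when each journal was last spidered and skip it if that happened
# within the last `days` days. File name and window are assumptions.
import json
import time

TIMESTAMP_FILE = "last_spidered.json"  # hypothetical bookkeeping file


def _load_stamps():
    try:
        with open(TIMESTAMP_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def should_spider(journal, days=30):
    """True if `journal` was not spidered within the last `days` days."""
    stamps = _load_stamps()
    return time.time() - stamps.get(journal, 0) > days * 24 * 3600


def mark_spidered(journal):
    """Record now as `journal`'s last-spidered time."""
    stamps = _load_stamps()
    stamps[journal] = time.time()
    with open(TIMESTAMP_FILE, "w") as f:
        json.dump(stamps, f)
```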