Skip to content

Automatic scraper that tracks changes in news articles over time.

License

Notifications You must be signed in to change notification settings

rafamerino/newsdiffs

 
 

Repository files navigation

NewsDiffs
==========

A website and framework that tracks changes in online news articles over time.

Original installation at newsdiffs.org.
A product of the Knight Mozilla MIT news hackathon in June 2012.
Authors: Eric Price (ecprice@mit.edu), Greg Price (gnprice@gmail.com),
 and Jennifer 8. Lee (jenny@jennifer8lee.com)

This is free software under the MIT/Expat license; see LICENSE.
The project's source code lives at http://github.com/ecprice/newsdiffs .


Requirements
------------

You need to have installed on your local machine
* Git
* Python 2.6 or later
* Django and other Python libraries

On a Debian- or Ubuntu-based system, it may suffice (untested) to run
  $ sudo apt-get install git-core python-django python-django-south python-simplejson

On Mac OS, the easiest way may be to install pip:
  http://www.pip-installer.org/en/latest/installing.html
and then
  $ pip install Django

 
Initial setup
-------------

  $ python website/manage.py syncdb
  $ git init articles
  $ cd articles && touch x && git add x && git commit -m 'Initial commit'


Running NewsDiffs Locally
-------------------------

Do the initial setup above.  Then to start the webserver for testing:
  $ python website/manage.py runserver

and visit http://localhost:8000/

Running the scraper
-------------------

Do the initial setup above.  You will also need additional Python
libraries; on a Debian- or Ubuntu-based system, it may suffice
(untested) to run
  $ sudo apt-get install python-bs4 python-beautifulsoup

on a Mac, you will want something like

 $ pip install beautifulsoup4
 $ pip install beautifulsoup
 $ pip install html5lib

Note that we need two versions of BeautifulSoup, both 3.2 and 4.0;
some websites are parsed correctly in only one version.

Then run
  $ python website/manage.py scraper

This will populate the articles repository with a list of current news
articles.  This is a snapshot at a single time, so the website will
not yet have any changes. To get changes, wait some time (say, 3
hours) and run 'python website/manage.py scraper' again.  If any of
the articles have changed in the intervening time, the website should
display the associated changes.

To run the scraper every hour, run something like:

 $ while true; do python website/manage.py scraper; sleep 60m; done

About

Automatic scraper that tracks changes in news articles over time.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published