Scraper

Crawls all the URLs on a specified domain and creates an entry for each one in the Django database.
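
The crawl described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration of a same-domain breadth-first crawl, not the project's actual code: the real scraper hands work to Celery and stores results through Django models, while the names `LinkParser`, `crawl`, and the caller-supplied `fetch` function here are assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl of every URL on start_url's host.

    fetch is a hypothetical caller-supplied function that maps a URL to
    its HTML text, or returns None if the page cannot be retrieved.
    Returns the set of URLs discovered on the starting domain.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            # Follow only links that stay on the starting domain.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

In the real project each discovered URL would be saved as a database entry rather than collected in a set; the set is used here only to keep the sketch self-contained.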

Installation

Create the database: python manage.py syncdb

The location of the database file is hardcoded in the settings file. I couldn't figure out a way around this, so it has to be modified by hand. Before running syncdb, edit website/settings.py and set the NAME field under DATABASES to the absolute path of the database.
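
One possible way to avoid editing the path by hand is to derive it from the location of the settings file itself. The sketch below is only a suggestion under stated assumptions: it assumes the project uses SQLite (the text above refers to a database file), and the helper name `sqlite_databases` and the file name `scraper.db` are hypothetical, not taken from the project.

```python
import os


def sqlite_databases(settings_file, filename="scraper.db"):
    """Build a Django DATABASES dict with an absolute SQLite path.

    settings_file is the path of the settings module (normally __file__).
    The database is placed in the directory above the settings package.
    Both the SQLite engine and the default filename are assumptions.
    """
    settings_dir = os.path.dirname(os.path.abspath(settings_file))
    project_root = os.path.dirname(settings_dir)
    return {
        "default": {
            "ENGINE": "django.db.backends.sqlite3",
            "NAME": os.path.join(project_root, filename),
        }
    }
```

In website/settings.py this would be used as DATABASES = sqlite_databases(__file__), so the path follows the checkout wherever it lives.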

Invocation

Set environment variables: . ./setup

Start RabbitMQ: sudo rabbitmq-server

Start Celery: cd $SCRAPER_HOME/scraper; celery -A tasks worker --loglevel=info

Run the scraper: cd $SCRAPER_HOME; python manage.py crawl <url>

Start the admin interface: cd $SCRAPER_HOME; python manage.py runserver

View the admin interface: http://localhost:8000/admin

Cleanup

Discard pending Celery tasks: celeryctl purge

Stop RabbitMQ: sudo rabbitmqctl stop

Testing

Run the test website: cd $SCRAPER_HOME/test; python testsite.py

Run the scraper against the test website: cd $SCRAPER_HOME; python manage.py crawl http://localhost:8080/page1
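
For readers who want a feel for what a local test site of this kind involves, here is a minimal sketch using only the Python standard library. It is not the contents of test/testsite.py; the page set, the handler class `TestSiteHandler`, and the `make_server` helper are all hypothetical, and only the port 8080 and the /page1 path are taken from the commands above.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical pages that link to each other, so a crawler has
# something to follow.
PAGES = {
    "/page1": '<html><body><a href="/page2">page2</a></body></html>',
    "/page2": '<html><body><a href="/page1">page1</a></body></html>',
}


class TestSiteHandler(BaseHTTPRequestHandler):
    """Serves the fixed pages in PAGES and returns 404 for anything else."""

    def do_GET(self):
        body = PAGES.get(self.path)
        if body is None:
            self.send_error(404)
            return
        data = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        # Suppress the default per-request logging to stderr.
        pass


def make_server(port=8080):
    """Create (but do not start) a test server bound to localhost."""
    return HTTPServer(("localhost", port), TestSiteHandler)


def start_in_background(server):
    """Run the server in a daemon thread and return the thread."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread
```

Calling make_server().serve_forever() would serve the pages on http://localhost:8080 until interrupted; start_in_background is convenient when the crawl is launched from the same process.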

staceysern/scraper
