
incyWebCrawler

A basic web crawler / scraper that can be used to map a site.

1. Visit localhost:3000/pages/new to get started.
2. Type in a URL, for example http://www.makersacademy.com
   - Be sure to include http:// and remove any trailing /
3. Set the number of pages to crawl.
4. Search!
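
The URL requirements above (include http://, no trailing /) can also be applied automatically before a crawl starts. The following is a minimal sketch only; `normalize_start_url` is a hypothetical helper name and is not taken from this project's code.

```ruby
require "uri"

# Hypothetical helper (an assumption, not part of this project):
# turns user input into the form the instructions ask for.
def normalize_start_url(input)
  url = input.strip
  # Prepend a scheme when the user left it out.
  url = "http://#{url}" unless url.match?(%r{\Ahttps?://}i)
  # Drop any trailing slashes.
  url = url.sub(%r{/+\z}, "")
  # Raises URI::InvalidURIError if the result is still malformed.
  URI.parse(url)
  url
end

normalize_start_url("www.makersacademy.com/")
# => "http://www.makersacademy.com"
```

Normalising the input this way would make the form more forgiving, at the cost of silently rewriting what the user typed.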

To keep crawling and scraping under control, we set some restrictions:

- Only a single domain can be searched per crawl.
- Only a set number of pages within that domain are crawled (this is the links limit).
- Only links that are part of the domain are scraped, i.e. external links are ignored.
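
One way these restrictions could fit together is a breadth-first crawl that stops at the page limit and skips links on other hosts. This is a sketch under assumptions, not the project's actual implementation: `crawl`, `same_domain?`, and the `fetch_links` callable (which stands in for the real page download and link extraction) are hypothetical names, and the limit is interpreted here as the number of pages visited.

```ruby
require "uri"
require "set"

# True when the link's host matches the start URL's host.
def same_domain?(link, host)
  URI.parse(link).host == host
rescue URI::InvalidURIError
  false
end

# Breadth-first crawl limited to one domain and `limit` pages.
# `fetch_links` is a hypothetical callable that returns the absolute
# link URLs found on a page.
def crawl(start_url, limit, fetch_links)
  host = URI.parse(start_url).host
  visited = []
  seen = Set.new([start_url])
  queue = [start_url]

  until queue.empty? || visited.size >= limit
    url = queue.shift
    visited << url
    fetch_links.call(url).each do |link|
      next unless same_domain?(link, host) # external links are not followed
      next unless seen.add?(link)          # add? returns nil if already seen
      queue << link
    end
  end

  visited
end

# Usage with an in-memory stand-in for a real site:
site = {
  "http://example.com"   => ["http://example.com/a", "http://other.org/x", "http://example.com/b"],
  "http://example.com/a" => ["http://example.com", "http://example.com/c"]
}
fetch = ->(url) { site.fetch(url, []) }
crawl("http://example.com", 3, fetch)
# => ["http://example.com", "http://example.com/a", "http://example.com/b"]
```

Passing the fetcher in as a callable keeps the traversal logic testable without network access; a real fetcher would download each page and resolve relative links to absolute URLs first.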

This project has been deployed to Heroku, and can be accessed here.
