Crawler

This project implements a concurrent web crawler limited to a single subdomain (external URLs are not followed) and produces a simple textual sitemap. The main goal was to exercise concurrency in Go.
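The repository's internals are not shown here, but the pattern described (concurrent fetching feeding a sitemap) typically boils down to a shared visited set guarded by a mutex, with one goroutine per discovered page. A minimal sketch of that pattern in Go; the names crawl and fetchLinks and the toy link graph are illustrative assumptions, not the repository's actual API:

package main

import (
	"fmt"
	"sync"
)

// crawl visits every page reachable from seed, spawning one goroutine per
// discovered link. A mutex-guarded visited map prevents re-fetching the
// same URL; fetchLinks stands in for the real HTTP/HTML-parsing logic.
func crawl(seed string, fetchLinks func(string) []string) []string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		visited = map[string]bool{}
		order   []string
	)

	var visit func(raw string)
	visit = func(raw string) {
		defer wg.Done()

		mu.Lock()
		if visited[raw] {
			mu.Unlock()
			return
		}
		visited[raw] = true
		order = append(order, raw)
		mu.Unlock()

		for _, link := range fetchLinks(raw) {
			wg.Add(1)
			go visit(link) // one goroutine per discovered page
		}
	}

	wg.Add(1)
	go visit(seed)
	wg.Wait()
	return order
}

func main() {
	// Toy link graph standing in for real HTTP fetches.
	graph := map[string][]string{
		"https://example.com":   {"https://example.com/a", "https://example.com/b"},
		"https://example.com/a": {"https://example.com"},
	}
	for _, p := range crawl("https://example.com", func(u string) []string { return graph[u] }) {
		fmt.Println(p)
	}
}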

Install

go get github.com/scanterog/crawler

Usage

crawler https://gobyexample.com

To redirect output to a file:

crawler -output-file /tmp/gobyexample.com https://gobyexample.com
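A plausible sketch of how the -output-file flag could be wired with Go's standard flag package, writing to stdout when the flag is omitted. This illustrates the CLI behaviour shown above and is not necessarily the repository's actual main:

package main

import (
	"flag"
	"fmt"
	"log"
	"os"
)

func main() {
	// -output-file redirects the sitemap; with no flag, output goes to stdout.
	outputFile := flag.String("output-file", "", "write the sitemap to this file instead of stdout")
	flag.Parse()

	if flag.NArg() != 1 {
		log.Fatal("usage: crawler [-output-file path] <seed-url>")
	}
	seed := flag.Arg(0)

	out := os.Stdout
	if *outputFile != "" {
		f, err := os.Create(*outputFile)
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		out = f
	}

	fmt.Fprintln(out, "sitemap for", seed) // real crawl output would go here
}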

Limitations

  • Only one seed URL is accepted; a list of initial URLs is not supported.
  • One subdomain. Starting from https://wikipedia.org, the crawler visits every page within wikipedia.org but does not follow external links (for example facebook.com) or other subdomains (for example uk.wikipedia.org); see the sketch after this list.
  • No politeness mechanisms, such as honoring robots.txt, are supported.
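The single-subdomain restriction amounts to comparing each discovered link's host against the seed's host, which also rejects sibling subdomains. A minimal sketch, where sameHost is a hypothetical helper rather than the repository's code:

package main

import (
	"fmt"
	"net/url"
)

// sameHost reports whether link belongs to the seed's exact host.
// Subdomains (uk.wikipedia.org vs wikipedia.org) have different hosts,
// so they are rejected along with fully external links.
func sameHost(seed, link string) bool {
	su, err := url.Parse(seed)
	if err != nil {
		return false
	}
	lu, err := url.Parse(link)
	if err != nil {
		return false
	}
	return su.Host == lu.Host
}

func main() {
	seed := "https://wikipedia.org"
	for _, link := range []string{
		"https://wikipedia.org/wiki/Go",
		"https://uk.wikipedia.org",
		"https://facebook.com",
	} {
		fmt.Println(link, sameHost(seed, link))
	}
}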
