Crawler

This project implements a concurrent web crawler limited to a single subdomain (external URLs are not followed) and produces a simple textual sitemap. The main goal was to exercise concurrency in Go.
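The repository's internals are not shown here, but the pattern described (concurrent fetching feeding a sitemap) typically boils down to a shared visited set guarded by a mutex, with one goroutine per discovered page. A minimal sketch of that pattern in Go; the names crawl and fetchLinks and the toy link graph are illustrative assumptions, not the repository's actual API:

package main

import (
	"fmt"
	"sync"
)

// crawl visits every page reachable from seed, spawning one goroutine per
// discovered link. A mutex-guarded visited map prevents re-fetching the
// same URL; fetchLinks stands in for the real HTTP/HTML-parsing logic.
func crawl(seed string, fetchLinks func(string) []string) []string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		visited = map[string]bool{}
		order   []string
	)

	var visit func(raw string)
	visit = func(raw string) {
		defer wg.Done()

		mu.Lock()
		if visited[raw] {
			mu.Unlock()
			return
		}
		visited[raw] = true
		order = append(order, raw)
		mu.Unlock()

		for _, link := range fetchLinks(raw) {
			wg.Add(1)
			go visit(link) // one goroutine per discovered page
		}
	}

	wg.Add(1)
	go visit(seed)
	wg.Wait()
	return order
}

func main() {
	// Toy link graph standing in for real HTTP fetches.
	graph := map[string][]string{
		"https://example.com":   {"https://example.com/a", "https://example.com/b"},
		"https://example.com/a": {"https://example.com"},
	}
	for _, p := range crawl("https://example.com", func(u string) []string { return graph[u] }) {
		fmt.Println(p)
	}
}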

Install

go get github.com/scanterog/crawler

Usage

crawler https://gobyexample.com

To redirect output to a file:

crawler -output-file /tmp/gobyexample.com https://gobyexample.com
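A plausible sketch of how the -output-file flag could be wired with Go's standard flag package, writing to stdout when the flag is omitted. This illustrates the CLI behaviour shown above and is not necessarily the repository's actual main:

package main

import (
	"flag"
	"fmt"
	"log"
	"os"
)

func main() {
	// -output-file redirects the sitemap; with no flag, output goes to stdout.
	outputFile := flag.String("output-file", "", "write the sitemap to this file instead of stdout")
	flag.Parse()

	if flag.NArg() != 1 {
		log.Fatal("usage: crawler [-output-file path] <seed-url>")
	}
	seed := flag.Arg(0)

	out := os.Stdout
	if *outputFile != "" {
		f, err := os.Create(*outputFile)
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		out = f
	}

	fmt.Fprintln(out, "sitemap for", seed) // real crawl output would go here
}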

Limitations

  • Only one seed URL is accepted; a list of initial URLs is not supported.
  • One subdomain. Starting from https://wikipedia.org, the crawler visits every page within wikipedia.org but does not follow external links (for example facebook.com) or other subdomains (for example uk.wikipedia.org); see the sketch after this list.
  • No politeness mechanisms, such as honoring robots.txt, are supported.
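The single-subdomain restriction amounts to comparing each discovered link's host against the seed's host, which also rejects sibling subdomains. A minimal sketch, where sameHost is a hypothetical helper rather than the repository's code:

package main

import (
	"fmt"
	"net/url"
)

// sameHost reports whether link belongs to the seed's exact host.
// Subdomains (uk.wikipedia.org vs wikipedia.org) have different hosts,
// so they are rejected along with fully external links.
func sameHost(seed, link string) bool {
	su, err := url.Parse(seed)
	if err != nil {
		return false
	}
	lu, err := url.Parse(link)
	if err != nil {
		return false
	}
	return su.Host == lu.Host
}

func main() {
	seed := "https://wikipedia.org"
	for _, link := range []string{
		"https://wikipedia.org/wiki/Go",
		"https://uk.wikipedia.org",
		"https://facebook.com",
	} {
		fmt.Println(link, sameHost(seed, link))
	}
}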
