A multithreaded web crawler written in Go.

This is an improved version of Spidey. Spidey took 1 minute to crawl 16,572 links.
- Clone this repository:

```sh
git clone https://github.com/anirudhsudhir/Spidey-v2.git
cd Spidey-v2
```
- Create a `seeds.txt` file and add the seed links in quotes, one after another (a sketch of how these links could be parsed follows the sample).

Sample `seeds.txt`:

```
"http://example.com"
"https://abcd.com"
```
- Build the project and run Spidey, passing the crawl time, request delay and worker count as arguments (a sketch of how these parameters could fit together follows the usage example below):
  - Crawl Time: the time, in seconds, during which Spidey adds new links to the crawl queue (positive integer)
  - Request Delay: the minimum delay, in seconds, between requests to links of the same domain (positive integer)
  - Worker Count: the number of crawl workers to run concurrently (positive integer)
```sh
go build
./spidey 10 1 5
# Here, the crawl time is 10s, the request delay is 1s and the worker count is 5
```
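
To illustrate how the three arguments could interact, here is a minimal, self-contained sketch of a worker pool with a per-domain request delay and a crawl-time cutoff. The structure and names (`perDomainLimiter`, the channel-based queue, the single hard-coded seed) are assumptions made for this sketch and are not Spidey's actual code:

```go
// Illustrative sketch only — not Spidey's actual implementation.
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"
	"strconv"
	"sync"
	"time"
)

// perDomainLimiter enforces the request delay between hits to the same domain.
type perDomainLimiter struct {
	mu      sync.Mutex
	delay   time.Duration
	lastHit map[string]time.Time
}

// wait blocks until at least `delay` has passed since the last request to host.
func (l *perDomainLimiter) wait(host string) {
	for {
		l.mu.Lock()
		next := l.lastHit[host].Add(l.delay)
		now := time.Now()
		if !now.Before(next) {
			l.lastHit[host] = now
			l.mu.Unlock()
			return
		}
		l.mu.Unlock()
		time.Sleep(next.Sub(now))
	}
}

func main() {
	// Usage: ./crawler <crawl time> <request delay> <worker count>
	// Argument validation is omitted for brevity.
	crawlTime, _ := strconv.Atoi(os.Args[1])
	delay, _ := strconv.Atoi(os.Args[2])
	workers, _ := strconv.Atoi(os.Args[3])

	limiter := &perDomainLimiter{
		delay:   time.Duration(delay) * time.Second,
		lastHit: make(map[string]time.Time),
	}

	// In a real crawler the queue would be fed from seeds.txt and from links
	// discovered in fetched pages; here a single seed is enqueued directly.
	queue := make(chan string, 1024)
	queue <- "http://example.com"

	// Closing `done` after the crawl time tells the workers to stop.
	done := make(chan struct{})
	time.AfterFunc(time.Duration(crawlTime)*time.Second, func() { close(done) })

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-done:
					return
				case link := <-queue:
					u, err := url.Parse(link)
					if err != nil {
						continue
					}
					limiter.wait(u.Host)
					resp, err := http.Get(link)
					if err != nil {
						continue
					}
					resp.Body.Close()
					fmt.Println("crawled:", link)
				}
			}
		}()
	}
	wg.Wait()
}
```

A mutex-guarded map of last-request times is the simplest way to express a per-domain delay in a sketch like this; a larger crawler might keep a dedicated rate limiter per domain instead.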