A web crawler written as part of a CS3103 Computer Networks Practices assignment at the National University of Singapore.
The crawler runs in parallel with 4-8 threads, depending on the hardware concurrency available. All threads share a common storage that records visited URLs and hands out URLs still to be visited, so that no thread revisits a page already crawled by any other thread (an assignment requirement).
The web crawler aims for breadth rather than depth: it visits pages from the same domain at most 20 times. The number of domains to visit can be customized and defaults to 30. The seed websites can also be customized.
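The sketch below illustrates this scheme under stated assumptions: a shared, mutex-guarded URL store with a 20-page-per-domain cap, and a worker count clamped to 4-8 based on hardware concurrency. The names (UrlStore, push, pop) and the exact locking are illustrative, not the crawler's actual implementation.

#include <algorithm>
#include <map>
#include <mutex>
#include <queue>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Illustrative only: a shared store of visited and pending URLs guarded by a mutex.
class UrlStore {
public:
    // Rejects URLs already seen or whose domain has hit the 20-page cap.
    bool push(const std::string& url, const std::string& domain) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (visited_.count(url) || domainCount_[domain] >= 20) return false;
        ++domainCount_[domain];
        visited_.insert(url);
        pending_.push(url);
        return true;
    }
    // Hands out the next URL to visit; returns false when the queue is empty.
    bool pop(std::string& url) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pending_.empty()) return false;
        url = pending_.front();
        pending_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::set<std::string> visited_;
    std::map<std::string, int> domainCount_;
    std::queue<std::string> pending_;
};

int main() {
    // Use between 4 and 8 worker threads, depending on what the hardware reports.
    unsigned n = std::thread::hardware_concurrency();
    n = std::max(4u, std::min(8u, n == 0 ? 4u : n));
    UrlStore store;
    store.push("http://www.comp.nus.edu.sg/", "www.comp.nus.edu.sg");  // seed
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([&store] {
            std::string url;
            while (store.pop(url)) { /* fetch url, parse links, store.push(...) */ }
        });
    for (auto& t : workers) t.join();
}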
See the Running section for more details on how to run the crawler and customize its parameters.
You can terminate the crawler at any time by sending an interrupt signal (e.g. Ctrl-C).
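Whether the crawler installs its own handler or simply relies on the default signal behaviour is not described here; the snippet below is just one conventional pattern for reacting to Ctrl-C, shown for illustration.

#include <atomic>
#include <csignal>

// Illustrative only: stop the worker loops cleanly when SIGINT (Ctrl-C) arrives.
std::atomic<bool> stopRequested(false);

void handleSigint(int) { stopRequested = true; }

int main() {
    std::signal(SIGINT, handleSigint);
    while (!stopRequested) {
        // crawl one URL, then re-check the flag
    }
}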
The crawler does not parse the HTTP response status; it simply checks the body of the response and retrieves all the links inside <a> tags. There is no special handling, since for most HTTP error responses there is nothing the crawler can do anyway.
This program uses htmlcxx as its HTML parser.
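As a rough illustration of the link extraction described above, the sketch below pulls href attributes out of <a> tags with htmlcxx. The function name and surrounding structure are assumptions; only the htmlcxx calls themselves (ParserDom, parseTree, parseAttributes, attribute) reflect the library's API.

#include <htmlcxx/html/ParserDom.h>
#include <iostream>
#include <string>
#include <vector>

// Illustrative only: extract href values from <a> tags in an HTML body.
std::vector<std::string> extractLinks(const std::string& body) {
    using namespace htmlcxx;
    HTML::ParserDom parser;
    tree<HTML::Node> dom = parser.parseTree(body);
    std::vector<std::string> links;
    for (auto it = dom.begin(); it != dom.end(); ++it) {
        if (it->isTag() && (it->tagName() == "a" || it->tagName() == "A")) {
            it->parseAttributes();
            std::pair<bool, std::string> href = it->attribute("href");
            if (href.first) links.push_back(href.second);
        }
    }
    return links;
}

int main() {
    for (const auto& link : extractLinks("<a href=\"http://example.com\">x</a>"))
        std::cout << link << "\n";
}

Link against the installed library (typically -lhtmlcxx) when compiling.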
To build and install:
- Download the package from the link above.
- ./configure && make && [sudo] make install
- (May not be necessary) export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
- (May not be necessary) sudo ldconfig
For any problems with the htmlcxx installation, consult the INSTALL file in the package or search online. Note that such problems usually surface only during program compilation and/or at runtime.
The program was built on Ubuntu 15.04 and tested (run) on [to be filled in]. It is not guaranteed to work on any other platform, and it definitely will not work on non-Unix environments, as the socket code is Unix (Linux)-specific.
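For context, the kind of Unix-specific socket API involved looks like the sketch below (getaddrinfo/socket/connect/recv from the POSIX headers). This is a generic HTTP/1.0 GET for illustration, not the crawler's actual fetch routine.

#include <netdb.h>       // getaddrinfo, addrinfo
#include <sys/socket.h>  // socket, connect, send, recv
#include <unistd.h>      // close
#include <string>

// Illustrative only: fetch a page over a plain POSIX TCP socket.
std::string httpGet(const std::string& host, const std::string& path) {
    addrinfo hints{}, *res = nullptr;
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host.c_str(), "80", &hints, &res) != 0) return "";
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        freeaddrinfo(res);
        if (fd >= 0) close(fd);
        return "";
    }
    freeaddrinfo(res);
    std::string request = "GET " + path + " HTTP/1.0\r\nHost: " + host + "\r\n\r\n";
    send(fd, request.c_str(), request.size(), 0);  // error handling omitted
    std::string response;
    char buf[4096];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof(buf), 0)) > 0) response.append(buf, n);
    close(fd);
    return response;  // status line + headers + body
}

int main() {
    // Returns non-zero if nothing could be fetched.
    return httpGet("www.comp.nus.edu.sg", "/").empty() ? 1 : 0;
}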
The program uses the C++11 regex standard library. Make sure your C++ compiler supports it (for gcc, the minimum version is 4.9).
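The crawler's actual regular expressions are not listed here; the snippet below only demonstrates the kind of std::regex usage that requires a C++11-capable compiler, with an illustrative pattern for pulling the domain out of a URL.

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string url = "http://www.comp.nus.edu.sg/index.html";
    // Optional scheme, then everything up to the first slash (the domain).
    std::regex domainRe(R"(^(?:https?://)?([^/]+))");
    std::smatch match;
    if (std::regex_search(url, match, domainRe))
        std::cout << match[1] << "\n";  // prints www.comp.nus.edu.sg
}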
Make sure you have all the prerequisites above before running the following.
make
make run [SEED=<path/to/seed> | LIMIT=<url log limit>]
If SEED is left empty, the default seed file will be used. If LIMIT is left empty, the default LIMIT is 30.
The two arguments are optional.
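For example (the seed path here is purely illustrative):
make run SEED=my_seeds.txt LIMIT=50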
Alternatively, run the executable generated under the build/ directory.
<path/to/program> [<path/to/seed> | <url log limit>]
As with its make counterpart, the arguments are optional and the order does not matter. An integer argument is taken as the url log limit; any other argument is taken as the path/to/seed.
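For example, assuming the executable is build/crawler (the actual name may differ) and my_seeds.txt is your seed file:
build/crawler my_seeds.txt 50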
Before putting a URL into the seed file, you can test whether it will make the web crawler get stuck (e.g. a 404 or 301 response):
make mockbuild
make mockrun [URL=<url>]
If URL is left empty, it defaults to www.comp.nus.edu.sg.
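For example (the URL here is only illustrative):
make mockrun URL=www.example.com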