crawly crawls the web starting from a set of seed urls. It sends a request to each url, parses the urls out of the response, stores them in a repository, and prints them to STDOUT as it fetches them. If the number of urls to fetch is specified, it stops crawling after successfully fetching that many urls.
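The loop described above can be sketched as follows. This is a minimal illustration using only the standard library, not crawly's actual code: the names `LinkParser`, `extract_links`, `fetch_page` and `crawl` are hypothetical, and the "repository" is assumed to be a plain in-memory set.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in html, resolved against base_url."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def fetch_page(url):
    """Hypothetical fetcher: download a page and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_urls, count=None, fetch=fetch_page):
    """Breadth-first crawl; stops after `count` successful fetches if given."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)  # stands in for the repository of known urls
    fetched = []
    while frontier and (count is None or len(fetched) < count):
        url = frontier.popleft()
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue  # failed fetches do not count toward the limit
        fetched.append(url)
        print(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched
```

Passing the fetcher in as a parameter keeps the loop testable without network access; with `count=None` the loop runs until the frontier is empty or the process is interrupted.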



$ python [-h] [-s SEED_URLS [SEED_URLS ...]] [-c COUNT]

optional arguments:
  -h, --help            show this help message and exit
  -s SEED_URLS [SEED_URLS ...], --seed_urls SEED_URLS [SEED_URLS ...]
                        Set of seed urls
  -c COUNT, --count COUNT
                        Number of links to be fetched
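A parser that accepts the options listed above could look roughly like this. It is a sketch with `argparse`, not crawly's own source: `build_parser` is a hypothetical name, and `DEFAULT_SEED_URLS` is a placeholder because the real default seed url is not shown here.

```python
import argparse

# Placeholder only; the project's actual default seed url is not known here.
DEFAULT_SEED_URLS = ["https://example.com/"]


def build_parser():
    """Build a parser exposing -s/--seed_urls and -c/--count."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-s", "--seed_urls",
        nargs="+",                  # one or more urls after the flag
        default=DEFAULT_SEED_URLS,
        help="Set of seed urls",
    )
    parser.add_argument(
        "-c", "--count",
        type=int,
        default=None,               # None means "crawl until interrupted"
        help="Number of links to be fetched",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.seed_urls, args.count)
```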

sample execution command

The following command starts crawling from a set of urls [,] and stops when 10 urls are successfully fetched. If no seed url is specified, it takes as the default seed url. If no count is specified, it crawls the web indefinitely until it receives a keyboard interrupt.

$ python --seed_urls '' '' --count 10


All logs (debug, error, info) generated during the execution of the program are stored in logs/crawly.log.
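One way to route debug, info and error messages to that file with the standard `logging` module is sketched below. The function name `configure_logging`, the logger name `"crawly"` and the message format are assumptions for illustration, not taken from the project.

```python
import logging
import os


def configure_logging(log_path="logs/crawly.log"):
    """Send DEBUG-and-above records for the 'crawly' logger to log_path."""
    directory = os.path.dirname(log_path)
    if directory:
        os.makedirs(directory, exist_ok=True)  # create logs/ on first run

    logger = logging.getLogger("crawly")
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid adding a second handler on repeated calls
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A caller would then use `logger = configure_logging()` once at startup and call `logger.debug(...)`, `logger.info(...)` or `logger.error(...)` as needed.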

make commands

  • make clean
    Clears all the .pyc and .log files generated during execution of the program.

  • make clean-logs
    Clears only the .log files generated during execution of the program.

  • make clean-pyc
    Clears only the .pyc files generated during execution of the program.

  • make run
    Executes the program taking as the default seed url and crawls until it receives a keyboard interrupt.


  • Multithreaded or distributed crawler that issues many HTTP requests in parallel
  • Obey robots.txt before crawling a website
  • Skip fetching image, video and document urls