Drahflow/crawler

I needed a serious web crawler for search engine applications. This is it.
Problem domain: You want to crawl a non-trivial amount of webpages
using only your personal computer.

This crawler offers:

* integrated duplicate detection, only processing each unique line once
  (see the Bloom filter sketch below)
* modest memory and CPU requirements:
  400,000,000+ line duplicate cache,
  128 different domains in parallel,
  10 MBit/s download speed (before removal of duplicates)
  => 1 GB RAM + ~10% of a single core
* stores results into a single stream file, optimal for later batch processing
* short pauses between requests to the same server
  (see the cooldown sketch below)
* a simplistic HTML "parser"
* asynchronous DNS resolution via libadns
  (see the resolver sketch below)
* short and concise program code
* liberal licensing terms

= "Architecture" (i.e. what I wrote before the code) =

Input: List of URLs

Create Bloom filter for seen URLs
Create Bloom filter for seen Lines

For each domain:
  Open output stream
  Fetch robots.txt, parse into patterns
  Create client socket and read / write buffers

  Fetch page
  Parse into lines, check with filter
  Put surviving lines into output stream
  Parse target URLs, check with filter
  Put surviving URLs into search front;
    if the search front limit per domain has been reached, drop the URL
  Cooldown timer
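The duplicate detection above maps naturally onto Bloom filters: one bit
array per filter, with each item setting and testing k hash-derived bits.
The following is a minimal C++ sketch under assumed names (BloomFilter and
testAndSet are invented here, not taken from this repository's code):

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    // Minimal Bloom filter: two base hashes generate the k probe
    // positions (Kirsch-Mitzenmacher double hashing).
    class BloomFilter {
    public:
      BloomFilter(size_t bits, int k) : bits_(bits), k_(k), data_((bits + 7) / 8) {}

      // Insert the item; return true if it was (probably) seen before.
      bool testAndSet(const std::string &item) {
        uint64_t h1 = std::hash<std::string>{}(item);
        uint64_t h2 = h1 * 0x9e3779b97f4a7c15ULL | 1; // second hash, forced odd
        bool seen = true;
        for (int i = 0; i < k_; ++i) {
          uint64_t bit = (h1 + i * h2) % bits_;
          uint8_t mask = uint8_t(1) << (bit % 8);
          if (!(data_[bit / 8] & mask)) {
            seen = false;
            data_[bit / 8] |= mask;
          }
        }
        return seen;
      }

    private:
      size_t bits_;
      int k_;
      std::vector<uint8_t> data_;
    };

As a rough sizing illustration: 1 GB of filter is 8 Gbit; over
400,000,000 lines that is ~20 bits per entry, so with k = 14 the false
positive rate is about 2^-14, i.e. roughly 0.006% of unique lines
wrongly dropped as duplicates.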
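The per-server pauses and the final "Cooldown timer" step amount to a
schedule of earliest-next-contact times per domain. One way to express
that (hypothetical code, again not from this repository; the real
crawler would skip a cooling-down domain in its event loop rather than
sleep) is a priority queue ordered by ready time:

    #include <chrono>
    #include <queue>
    #include <string>
    #include <thread>
    #include <utility>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    struct DomainSlot {
      Clock::time_point readyAt; // earliest time the domain may be hit again
      std::string domain;
      bool operator>(const DomainSlot &o) const { return readyAt > o.readyAt; }
    };

    class CooldownQueue {
    public:
      explicit CooldownQueue(Clock::duration delay) : delay_(delay) {}

      // After each completed request, re-schedule the domain.
      void requeue(std::string domain) {
        queue_.push({Clock::now() + delay_, std::move(domain)});
      }

      // Pop the next domain, blocking until its cooldown has expired.
      std::string next() {
        DomainSlot slot = queue_.top();
        queue_.pop();
        std::this_thread::sleep_until(slot.readyAt);
        return slot.domain;
      }

      bool empty() const { return queue_.empty(); }

    private:
      Clock::duration delay_;
      std::priority_queue<DomainSlot, std::vector<DomainSlot>,
                          std::greater<DomainSlot>> queue_;
    };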
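The libadns resolution follows a submit-then-collect pattern: queries
are handed to the library up front and answers are harvested as they
complete, so name lookups never stall the download sockets. A hedged
standalone sketch against the adns API (a single blocking adns_wait()
stands in for merging the adns file descriptors into the crawler's
select()/poll() set; build with -ladns):

    #include <adns.h>
    #include <arpa/inet.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
      adns_state st;
      if (adns_init(&st, adns_if_noenv, 0)) return 1;

      // Submit an A-record query; this returns immediately.
      adns_query q;
      adns_submit(st, "example.com", adns_r_a, (adns_queryflags)0, 0, &q);

      // ... the crawler would service its client sockets here ...

      // Collect the answer (adns_check() is the non-blocking variant).
      adns_answer *ans = 0;
      if (adns_wait(st, &q, &ans, 0) == 0 && ans->status == adns_s_ok) {
        for (int i = 0; i < ans->nrrs; ++i)
          printf("%s\n", inet_ntoa(ans->rrs.inaddr[i]));
      }

      free(ans); // answers are malloc()ed by adns; free(NULL) is a no-op
      adns_finish(st);
      return 0;
    }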