# Web Crawler System Design

## Design Plan

The system will involve 7 parts overall:

- Seed URL (the starting URL for crawling)
  - For this project only a single URL input will be allowed
  - The client will be a long-running web server
- URL Frontier
  - A data structure that stores URLs scheduled for future download
  - Enforces prioritization and politeness so the crawler does not DDoS a website
  - A queue router puts URLs into queues; a queue selector picks URLs from the given queues
  - A managed Redis queue holds the FIFO data
    - Each Redis key, one per primary host, will have a queue associated with it
  - Workers will spin up to ingest data from the FIFO queue for each key
- HTML Downloader (including DNS resolution)
  - Gets IP addresses from the DNS resolver and starts downloading the HTML content (a rough sketch of this step, together with link extraction, follows this list)
- Content Parser
  - Parses the HTML to ensure the raw text is not malformed
- Content Seen?
  - A data store of MD5 hashes of HTML content: if the store already contains the hash produced by the parser, the data is thrown away and work continues; if it doesn't, the hash is stored
- Link Extractor
  - Extracts links from the HTML page
- URL Filter
  - Receives the extracted links and filters the URLs
  - Filtered URLs will then be stored in the URL Frontier and the whole process will continue
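As a rough sketch of the HTML Downloader and Link Extractor steps, the snippet below assumes Go's `net/http` for fetching (which performs DNS resolution internally) and `golang.org/x/net/html` for parsing; the function name and structure are illustrative, not part of the design above.

```go
package main

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// fetchAndExtract downloads a page and returns every href found in it.
// net/http resolves DNS internally; a dedicated resolver/cache could be swapped in later.
func fetchAndExtract(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	var walk func(n *html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" {
					links = append(links, attr.Val)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links, nil
}

func main() {
	links, err := fetchAndExtract("https://example.com")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(links)
}
```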

## Diagram

```
                If either the DNS resolver or the
                parser fails, log the error and restart
                  ┌─────────────────────────┐
                  │                         │
                  │         ┌─────────┐     │
                  │         │DNS      │     │
                  │   ┌─────┤Resolver │     │
                  │   │     └───▲─────┘     │
                  │   │         │           │
                  │   │         │           │
┌─────────┐   ┌───▼───▼─┐   ┌───┴─────┐   ┌─┴───────┐
│         │   │         │   │         │   │         │
│Client   ├───►Frontier ├───►Html     ├───►Html     │
│         │   │         │   │Download │   │Parser   │
└─────────┘   └──▲───▲──┘   └─────────┘   └────┬────┘
                 │   │                         │
                 │   │       ┌───────┐    ┌────▼────┐  ┌─────────┐
                 │   │       ├───────┘    │         │  │         │
                 │   │       │ Data  ◄────┤Content  ├──►Link     │
                 │   │       │ Store │    │seen?    │  │extract  │
                 │   │       └───────┘    └┬────────┘  └────┬────┘
                 │   │                     │                │
                 │   └─────────────────────┘           ┌────▼────┐
                 │     If MD5 hash exists              │         │
                 │     restart to beginning            │URL      │
                 │                                     │Filter   │
                 │           ┌───────┐                 └───┬─────┘
                 │           ├───────┘                     │
                 └───────────┤Redis  ◄─────────────────────┘
                             │MQ     │   URLs are pushed to the
                             └───────┘  Redis MQ for processing
```

## Models

The data models for this will be very simple. The queue data model will take the form of one Redis queue per host.

{ "wikipedia": ["https://wikipedia.com", "https://wikipedia.com/test"] }
{ "go": ["https://pkg.go.com/net/http", "go.com"] }

Using this model makes it possible to group crawler workers so that each group only operates within its designated host.
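A minimal sketch of this per-host queue, assuming the `github.com/redis/go-redis/v9` client and a `frontier:<host>` key format (both are assumptions for illustration, not requirements of the design):

```go
package main

import (
	"context"
	"fmt"
	"net/url"

	"github.com/redis/go-redis/v9"
)

// enqueueURL pushes a URL onto the FIFO list for its host,
// so workers dedicated to that host can drain it in order.
func enqueueURL(ctx context.Context, rdb *redis.Client, raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	return rdb.RPush(ctx, "frontier:"+u.Hostname(), raw).Err()
}

// dequeueURL pops the oldest URL for a host; redis.Nil means the queue is empty.
func dequeueURL(ctx context.Context, rdb *redis.Client, host string) (string, error) {
	return rdb.LPop(ctx, "frontier:"+host).Result()
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	_ = enqueueURL(ctx, rdb, "https://wikipedia.com/test")
	next, err := dequeueURL(ctx, rdb, "wikipedia.com")
	fmt.Println(next, err)
}
```

Because each host has its own list, a worker assigned to one host never touches another host's queue, which is what keeps the crawler polite.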

The seen-content data store will simply be a SQLite DB containing the MD5 hashes of all pages seen so far.

```
interface SeenContentModel {
  id   PK int unique
  hash string
}
```
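A sketch of the "Content Seen?" check against that model, assuming `database/sql` with the `github.com/mattn/go-sqlite3` driver; the `seen_content` table name mirrors the interface above but is an assumption:

```go
package main

import (
	"crypto/md5"
	"database/sql"
	"encoding/hex"

	_ "github.com/mattn/go-sqlite3"
)

// seenBefore hashes the page body and records the hash; it returns true
// if the same MD5 hash was already stored, so the caller can skip the page.
func seenBefore(db *sql.DB, body []byte) (bool, error) {
	sum := md5.Sum(body)
	hash := hex.EncodeToString(sum[:])

	var count int
	if err := db.QueryRow("SELECT COUNT(1) FROM seen_content WHERE hash = ?", hash).Scan(&count); err != nil {
		return false, err
	}
	if count > 0 {
		return true, nil // hash already seen: throw the page away
	}
	_, err := db.Exec("INSERT INTO seen_content (hash) VALUES (?)", hash)
	return false, err
}

func main() {
	db, _ := sql.Open("sqlite3", "seen.db")
	defer db.Close()
	db.Exec("CREATE TABLE IF NOT EXISTS seen_content (id INTEGER PRIMARY KEY, hash TEXT)")

	dup, _ := seenBefore(db, []byte("<html>...</html>"))
	_ = dup
}
```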

## Frontier

The queue router and queue selector will be contained within one module. The queue router will receive links from crawlers and enqueue them into the Redis queue. The queue selector will work on a pub/sub mechanism.

Each queue has its own crawler, and crawl depth will be set by an environment variable when the program runs. When a new queue is added, a subscriber in the queue selector module will spin up a new worker/crawler, as sketched below.
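A sketch of that subscriber loop, again assuming go-redis pub/sub; the `new-queues` channel name and the `crawlHost` worker are hypothetical stand-ins:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// crawlHost is a stand-in for a crawler worker that drains the host's FIFO queue.
func crawlHost(ctx context.Context, rdb *redis.Client, host string) {
	for {
		url, err := rdb.LPop(ctx, "frontier:"+host).Result()
		if err == redis.Nil {
			return // queue drained; a real worker might idle and retry instead
		}
		if err != nil {
			return
		}
		fmt.Println("crawling", url)
	}
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// The queue router publishes a host name whenever it creates a new per-host queue;
	// the selector subscribes and spins up one worker goroutine per new host.
	sub := rdb.Subscribe(ctx, "new-queues")
	defer sub.Close()

	for msg := range sub.Channel() {
		go crawlHost(ctx, rdb, msg.Payload)
	}
}
```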
