jiaminghong/camelcrawler

Camel Crawler

//TODO-description

Web Crawler

  • Write basic crawler to get list of URLs for single domain
  • Feed base root URLs from text file
  • Connect the crawler to a persistent database (MySQL)
  • Convert the list of domains into a data frame/structure for analysis
  • Build a database of domains to crawl
  • Blacklist websites that sit behind CDNs or render their content dynamically
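
As a rough illustration of the first item, the sketch below pulls the same-domain links out of one already-downloaded page. It is not the project's crawler: `LinkExtractor` and `sameDomainLinks` are hypothetical names, and the regex-based `href` matching is a simplification that a real HTML parser would replace.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper, not taken from the project's source.
object LinkExtractor {
  // Matches href="..." or href='...' values; a simplification of real HTML parsing.
  private val Href = """(?i)href\s*=\s*["']([^"']+)["']""".r

  /** Resolves each href in `html` against `base` and keeps http(s) links on the same host. */
  def sameDomainLinks(base: String, html: String): List[String] = {
    val baseUri = new URI(base)
    val host = baseUri.getHost
    Href.findAllMatchIn(html)
      .map(_.group(1).trim.takeWhile(_ != '#'))           // drop any #fragment
      .filter(_.nonEmpty)
      .flatMap(raw => Try(baseUri.resolve(raw)).toOption)  // skip malformed URLs
      .filter(u => u.getScheme == "http" || u.getScheme == "https")
      .filter(u => u.getHost == host)
      .map(_.toString)
      .toList
      .distinct
  }
}
```

Calling `LinkExtractor.sameDomainLinks("https://example.com/index.html", html)` returns absolute URLs in page order, with duplicates, off-site links, and non-HTTP schemes such as `mailto:` dropped.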

Middleware: API for the database (Redis)

  • Configure the MySQL-to-Redis connection
  • Use a graph data structure in Redis to retrieve the required information
  • Write end-points to retrieve data-points for Front-end
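
One way to picture the graph structure is as a map from each domain to the set of domains linking to it. The sketch below is an in-memory stand-in under that assumption, not the project's Redis schema; `BacklinkGraph` and its methods are hypothetical names. In Redis, the same shape could plausibly be kept as one set per target domain (`SADD` to record a link, `SCARD` to count them).

```scala
// Hypothetical in-memory stand-in for a Redis-backed backlink graph.
final case class BacklinkGraph(inbound: Map[String, Set[String]] = Map.empty) {

  /** Records that `source` links to `target`; repeated links are stored once. */
  def addLink(source: String, target: String): BacklinkGraph =
    copy(inbound = inbound.updated(target, inbound.getOrElse(target, Set.empty[String]) + source))

  /** Domains that link to `domain`. */
  def backlinksOf(domain: String): Set[String] =
    inbound.getOrElse(domain, Set.empty[String])

  /** Total number of distinct (source, target) links in the graph. */
  def totalBacklinks: Int = inbound.valuesIterator.map(_.size).sum
}
```

An end-point that reports the total number of backlinks would then only need to return `totalBacklinks`.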

Front-end (React)

  • Display the total number of backlinks in the database
  • Query the API to retrieve total number of backlinks

Test Cases

  • Domain Object Spec
  • JDBC Connection Spec
  • Crawler Spec
  • Webclient Spec
  • getBody() Spec -> parser

How to Run It

//TODO

Advanced Features

  • Work with dynamic websites by implementing a headless browser (HtmlUnit or Selenium)
  • Rotate proxies to avoid bot checks and get around CDNs, e.g. Cloudflare, Distil Networks
  • Build a JAR file of the web crawler to run on multiple physical machines (Akka compile issue)
  • Memory optimization - use of CountDownLatch
  • Integrate Kafka+Kafka API to stream data from crawler to database and onto a front-end for 'real-time' feedback
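
For the `CountDownLatch` item, here is a minimal sketch of how a latch can hold the caller until a batch of fetch tasks has finished. It assumes a plain fixed thread pool rather than Akka; `LatchDemo`, `crawlAll`, the `fetch` parameter, and the 30-second timeout are all hypothetical.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, Executors, TimeUnit}

// Hypothetical example, not the project's implementation.
object LatchDemo {

  /** Runs `fetch` for each URL on `workers` threads (must be > 0) and waits
    * until every task has finished or the timeout elapses. */
  def crawlAll(urls: List[String], workers: Int)(fetch: String => String): List[String] = {
    val pool    = Executors.newFixedThreadPool(workers)
    val latch   = new CountDownLatch(urls.size)
    val results = new ConcurrentLinkedQueue[String]()
    try {
      urls.foreach { url =>
        pool.execute(new Runnable {
          def run(): Unit =
            try { results.add(fetch(url)); () }
            finally latch.countDown() // count down even if fetch throws
        })
      }
      latch.await(30, TimeUnit.SECONDS) // assumed timeout
    } finally pool.shutdown()
    results.toArray(new Array[String](0)).toList
  }
}
```

Results arrive in completion order, not input order, so callers that need a stable order should sort or key them by URL.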

Team Members

  • Abdusamed
  • Ming
  • Chi

About

distributed web crawler built using Scala
