
How To Derive Crawl Statistics


The following code snippet crawls a single website's HTML pages, then derives some statistical analysis about the crawl.

require 'wgit'
require 'wgit/core_ext'

crawler = Wgit::Crawler.new
url     = 'https://vlang.io'.to_url
crawled = []

# Crawl every HTML page of the site, collecting each Wgit::Document.
crawler.crawl_site(url) { |doc| crawled << doc }

crawled.first.class
# => Wgit::Document
# Map each crawled URL to its crawl duration (in seconds).
stats = crawled.map { |doc| [doc.url, doc.url.crawl_duration] }.to_h
# => {
#  "https://vlang.io"                            => 0.968147,
#  "https://vlang.io/docs"                       => 0.271327,
#  "https://vlang.io/compare"                    => 0.253156,
#  "https://vlang.io/cdn-cgi/l/email-protection" => 0.006685
# }
total_crawl_time = stats.values.sum
# => 1.499315
average_crawl_time = total_crawl_time / stats.length
# => 0.37482875

# etc...

Because each crawled doc is a Wgit::Document instance, we have the full power of that class available to us. For example, we could derive information about the following (see the sketch after this list):

  • The fastest/slowest loading pages
  • The smallest/largest pages (based on their HTML sizes)
  • The number of internal or external links
  • Which pages have keywords and which don't
  • etc.
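
As a rough illustration, here's a minimal sketch deriving a few of these stats from the crawled Array above. It assumes the default Wgit::Document extractors (such as internal_links, external_links and keywords) are available in your version of Wgit; check the API docs if in doubt.

# The slowest loading page (largest crawl duration).
slowest = crawled.max_by { |doc| doc.url.crawl_duration }
slowest.url

# The largest page, based on its raw HTML size (in bytes).
largest = crawled.max_by { |doc| doc.html.size }
largest.url

# The total number of internal vs external links across the site.
internal_count = crawled.sum { |doc| doc.internal_links.size }
external_count = crawled.sum { |doc| doc.external_links.size }

# Pages which don't define any meta keywords.
no_keywords = crawled.select { |doc| doc.keywords.nil? || doc.keywords.empty? }
no_keywords.map(&:url)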

Obviously the above example could be made more efficient; for instance, we could store only the data we need in crawled rather than the entire Document object. We would also likely want to run the crawl a few times and average out the crawl durations (a rough sketch of both ideas follows). Nevertheless, the above serves as a basic how-to for statistical analysis using Wgit.
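
The following sketch is illustrative only: it stores just the crawl durations (not the full Documents) and averages them over several runs. The number of runs (3) and the variable names are arbitrary choices, not part of the Wgit API.

RUNS = 3
durations = Hash.new { |hash, key| hash[key] = [] }

RUNS.times do
  crawler.crawl_site(url) do |doc|
    # Store only the data we need, not the whole Wgit::Document.
    durations[doc.url.to_s] << doc.url.crawl_duration
  end
end

# Average each page's crawl duration over the runs.
average_stats = durations.transform_values { |times| times.sum / times.size }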