
How To Derive Crawl Statistics


The following code snippet crawls a single website's HTML pages, then derives some statistical analysis about the crawl.

require 'wgit'
require 'wgit/core_ext'

crawler = Wgit::Crawler.new
url     = 'https://vlang.io'.to_url
crawled = []

# Crawl every HTML page of the site, collecting each Wgit::Document.
crawler.crawl_site(url) { |doc| crawled << doc }

crawled.first.class
# => Wgit::Document
# Map each crawled URL to its crawl duration (in seconds).
stats = crawled.map { |doc| [doc.url, doc.url.crawl_duration] }.to_h
# => {
#  "https://vlang.io"                            => 0.968147,
#  "https://vlang.io/docs"                       => 0.271327,
#  "https://vlang.io/compare"                    => 0.253156,
#  "https://vlang.io/cdn-cgi/l/email-protection" => 0.006685
# }
total_crawl_time = stats.values.sum
# => 1.499315
average_crawl_time = total_crawl_time / stats.length
# => 0.37482875

# etc...

Because each crawled doc is a Wgit::Document instance, we have the full power of that class available to us. For example, we could derive information about the following (see the sketch after this list):

  • The fastest/slowest loading pages
  • The smallest/largest pages (based on their HTML sizes)
  • The number of internal or external links
  • Which pages have keywords and which don't
  • etc.
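
As a rough illustration, here's a minimal sketch deriving a few of these stats from the crawled Array above. It assumes the default Wgit::Document extractors (such as internal_links, external_links and keywords) are available in your version of Wgit; check the API docs if in doubt.

# The slowest loading page (largest crawl duration).
slowest = crawled.max_by { |doc| doc.url.crawl_duration }
slowest.url

# The largest page, based on its raw HTML size (in bytes).
largest = crawled.max_by { |doc| doc.html.size }
largest.url

# The total number of internal vs external links across the site.
internal_count = crawled.sum { |doc| doc.internal_links.size }
external_count = crawled.sum { |doc| doc.external_links.size }

# Pages which don't define any meta keywords.
no_keywords = crawled.select { |doc| doc.keywords.nil? || doc.keywords.empty? }
no_keywords.map(&:url)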

Obviously the above example could be made more efficient; for instance, we could store only the data we need in crawled rather than the entire Document object. We would also likely want to run the crawl a few times and average out the crawl durations (a rough sketch of both ideas follows). Nevertheless, the above serves as a basic how-to for statistical analysis using Wgit.
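
The following sketch is illustrative only: it stores just the crawl durations (not the full Documents) and averages them over several runs. The number of runs (3) and the variable names are arbitrary choices, not part of the Wgit API.

RUNS = 3
durations = Hash.new { |hash, key| hash[key] = [] }

RUNS.times do
  crawler.crawl_site(url) do |doc|
    # Store only the data we need, not the whole Wgit::Document.
    durations[doc.url.to_s] << doc.url.crawl_duration
  end
end

# Average each page's crawl duration over the runs.
average_stats = durations.transform_values { |times| times.sum / times.size }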