How To Crawl More Than Just HTML

Although Wgit is primarily designed for crawling and serialising HTML documents, other MIME types will work. This article details how to crawl more than just HTML documents.

When we use Wgit::Crawler#crawl_site, we crawl all of the linked-to HTML pages on that domain. Under the hood, this is achieved by crawling each webpage and recording its internal links (to HTML pages), which are then crawled in later iterations. This continues until all HTML links on the domain have been found and crawled.
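
As a reminder, a plain HTML-only site crawl looks something like the sketch below (the target URL is just an example):

require 'wgit'

crawler = Wgit::Crawler.new
site    = Wgit::Url.new('https://vlang.io')

# Each yielded doc is a crawled HTML page belonging to the site's domain.
crawler.crawl_site(site) do |doc|
  puts doc.url
end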

But what if we want to crawl for more than just HTML files on a site? Say we want all of the images on that site instead?

The solution is to extend the API. We do this by calling Wgit::Document.define_extractor to obtain each image's src attribute URL; once we have it, we can crawl and save each image as we find it.

Let's start by defining an extractor for the images we're interested in:

require 'wgit'
require 'wgit/core_ext'

Wgit::Document.define_extractor(
  :images,                # The name of our new instance var available on each crawled Wgit::Document.
  '//img/@src',           # The xpath to "extract" the image URLs out of the domain's HTML as we crawl.
  singleton: false,       # We want all images on the page, not just the first one found.
  text_content_only: true # We want the `src` attribute URL, not the underlying Nokogiri object.
) do |links, source|
  # Map each URL String into an absolute Wgit::Url e.g. /mascot.png becomes https://vlang.io/mascot.png
  # making it crawlable. The result of this block gets set as #images.
  links.map { |link| link.to_url.prefix_base(source) }
end

Now every crawled Wgit::Document will respond to #images, returning the image src Wgit::Urls.
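
As a quick sanity check, we could crawl a single page and inspect #images; the exact URLs returned depend on the page, so the one shown below is illustrative:

doc = Wgit::Crawler.new.crawl('https://vlang.io'.to_url)
doc.images.first
# => "https://vlang.io/img/v-logo.png" (a Wgit::Url)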

Next, let's crawl the site we're interested in. Whether we crawl a single webpage or an entire site doesn't matter; the principle is the same, as is the extractor we defined above.

# Our process_image function could save the image (Wgit::Document) or pass it to a queue etc.
def process_image(image)
  puts "Image #{image.url} is #{image.size} bytes"
  # save_image(image.content) etc.
end

url     = 'https://vlang.io'.to_url
crawler = Wgit::Crawler.new encode: false # Turn off encoding to keep the image bytes intact.

# Crawl each page extracting and processing the images.
crawler.crawl_site(url) do |doc|
  next if doc.images.empty?
  crawler.crawl(*doc.images) { |image| process_image(image) }
end

which prints:

Image https://vlang.io/img/v-logo.png is 29026 bytes
Image https://vlang.io/img/patreon.png is 9063 bytes
Image https://vlang.io/img/down.png is 610 bytes
Image https://vlang.io/img/vscode.png is 6179 bytes
Image https://vlang.io/img/vim.png is 50335 bytes
...

The important point to note is that the crawler's encode: false parameter must be set; otherwise, the image bytes would likely get corrupted. This is a good rule of thumb for any non-HTML crawl performed using Wgit.
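
Finally, for completeness, here is one way the save_image helper hinted at in the earlier comment might look. It's only a sketch: this version takes the whole image Wgit::Document (so a filename can be derived from its URL), and the images/ directory and filename logic are illustrative rather than part of Wgit's API:

require 'fileutils'

def save_image(image)
  FileUtils.mkdir_p('images')
  # Wgit::Url is a String subclass, so File.basename works on the image URL directly.
  filepath = File.join('images', File.basename(image.url))
  # Write in binary mode so the unencoded image bytes are stored exactly as crawled.
  File.binwrite(filepath, image.content)
end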