How To Crawl More Than Just HTML
Although Wgit is primarily designed for crawling and serialising HTML documents, other MIME types will work. This article details how to crawl more than just HTML documents.
When we use Wgit::Crawler#crawl_site, we crawl all of the linked-to HTML pages on that domain. This is achieved under the hood by crawling each webpage and recording its internal links (to HTML pages), which are crawled in later iterations. This continues until all HTML links on the domain have been found and crawled.
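A minimal sketch of such a crawl, which simply prints each crawled page's URL (the vlang.io domain is the same one used later in this article):
require 'wgit'
require 'wgit/core_ext'

crawler = Wgit::Crawler.new
crawler.crawl_site('https://vlang.io'.to_url) do |doc|
  # Each yielded doc is a Wgit::Document for an internal HTML page of the site.
  puts doc.url
end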
But what if we want to crawl for more than just HTML files on a site? Say we want all of the images on that site instead?
The solution is to extend the API. We do this by calling Wgit::Document.define_extractor to obtain each image's src attribute URL; once we have it, we can crawl and save each image as we find it.
Let's start by defining an extractor for the images we're interested in:
require 'wgit'
require 'wgit/core_ext'

Wgit::Document.define_extractor(
  :images,                 # The name of our new instance var available on each crawled Wgit::Document.
  '//img/@src',            # The xpath to "extract" the image URLs out of the domain's HTML as we crawl.
  singleton: false,        # We want all images on the page, not just the first one found.
  text_content_only: true  # We want the `src` attribute URL, not the underlying Nokogiri object.
) do |links, source|
  # Map each URL String into an absolute Wgit::Url e.g. /mascot.png becomes https://vlang.io/mascot.png,
  # making it crawlable. The result of this block gets set as #images.
  links.map { |link| link.to_url.prefix_base(source) }
end
Now every crawled Wgit::Document will respond to #images, returning the image src Wgit::Urls.
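For example, we can sanity check the extractor by crawling a single page and printing its #images (a quick sketch; the exact URLs returned depend on the page's HTML):
crawler = Wgit::Crawler.new
doc = crawler.crawl('https://vlang.io'.to_url)
doc.images.each { |src| puts src } # Each src is an absolute Wgit::Url thanks to the block above.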
Next, let's crawl the site we're interested in. Whether we crawl a single webpage or an entire site doesn't matter; the principle is the same, as is the extractor we defined above.
# Our process_image function could save the image (Wgit::Document) or pass it to a queue etc.
def process_image(image)
  puts "Image #{image.url} is #{image.size} bytes"
  # save_image(image.content) etc.
end

url     = 'https://vlang.io'.to_url
crawler = Wgit::Crawler.new encode: false # Turn off encoding to keep the image bytes intact.

# Crawl each page, extracting and processing the images.
crawler.crawl_site(url) do |doc|
  next if doc.images.empty?
  crawler.crawl(*doc.images) { |image| process_image(image) }
end
Running the above prints:
Image https://vlang.io/img/v-logo.png is 29026 bytes
Image https://vlang.io/img/patreon.png is 9063 bytes
Image https://vlang.io/img/down.png is 610 bytes
Image https://vlang.io/img/vscode.png is 6179 bytes
Image https://vlang.io/img/vim.png is 50335 bytes
...
The important point to note is that it's necessary to set the crawler's encode: false parameter; otherwise, the image bytes would likely get corrupted. This is a good rule of thumb for any non-HTML crawl performed using Wgit.
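For completeness, here's one possible process_image implementation that writes the raw bytes to disk; the ./images directory and the filename derivation below are illustrative choices, not part of Wgit's API:
require 'fileutils'

def process_image(image)
  FileUtils.mkdir_p('images')
  filename = File.basename(image.url.to_s) # e.g. 'v-logo.png' (naming scheme is illustrative only).
  File.binwrite(File.join('images', filename), image.content)
  puts "Saved #{filename} (#{image.size} bytes)"
end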