How To Crawl A Website

When we use Wgit::Crawler#crawl_site, we crawl all of the linked-to HTML pages on that host. This is achieved under the hood by crawling each webpage and recording its internal links (to other HTML pages), which are then crawled in later iterations. This continues until all HTML links on the host have been found and crawled.

Keep in mind that only links pointing to the same host will be followed and crawled. For example, https://vlang.io and https://docs.vlang.io are two different hosts/websites and are treated as such by Wgit when calling #crawl_site.

The block passed to #crawl_site will yield each crawled Wgit::Document instance in turn, offering the full capability of that class. Any failed crawl (a non 2xx response) will result in an empty? document, so you can use next if doc.empty? in your block to only process successfully crawled documents.

The following code snippet crawls a single website's pages, storing their HTML in an Array:

require 'wgit'
require 'wgit/core_ext'

crawler = Wgit::Crawler.new
url     = 'https://vlang.io'.to_url # Returns a Wgit::Url instance.
crawled = []

crawler.crawl_site(url) do |doc|
  crawled << doc unless doc.empty?
end

crawled.first.class # => Wgit::Document

htmls = crawled.map(&:html)

# Do something with the crawled pages' HTML...

When passing a url to #crawl_site, it's recommended that you use a URL which has the greatest chance of finding HTML links to other pages on the site e.g. an index page. By doing so, we ensure the best chance of a complete and thorough crawl.

Using The Block Effectively

The above is simple and effective, but it has a potential problem. We store each crawled web document in an Array, which is fine for a small site like vlang.io (made up of 4 webpages at the time of writing). But what about larger sites, where we don't know how many pages might be crawled? One solution, of course, is to make better use of #crawl_site's block.

The following code snippet uses the block to process each web document as it gets crawled, releasing any reference to it before crawling the next. This is much more memory efficient. Just remember, the crawl waits for the block to complete before continuing on to the next page of the site, which will affect your overall crawl time. Basically, it's a trade-off. Keep your block processing code short or pass it off to an asynchronous worker etc.

crawler.crawl_site(url) do |doc|
  next if doc.empty?
  # Use `doc.html` here or `queue_job(doc)` etc.
end
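
For example, here's a minimal sketch of the "asynchronous worker" approach (the queue_job(doc) hinted at above) using a plain Ruby Thread and Queue. The worker below is illustrative and not part of Wgit:

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://vlang.io')
queue   = Queue.new # Thread safe FIFO from Ruby's stdlib.

# A single background worker which processes documents off the crawl thread.
worker = Thread.new do
  while (doc = queue.pop)
    # Do the heavy lifting here e.g. parse, persist, index etc.
    puts "Processed #{doc.url}"
  end
end

crawler.crawl_site(url) do |doc|
  queue << doc unless doc.empty? # Hand off quickly and keep crawling.
end

queue.close # Tells the worker that no more documents are coming.
worker.join # Wait for the worker to drain the queue.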

Remember that the block passed to #crawl_site is the only way to interact with each crawled web document. The return value is an Array of unique external URLs (to other sites) extracted while crawling the site; these are typically used by Wgit::Indexer to crawl other sites later on.
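
For example, a quick sketch of capturing that return value alongside the block:

ext_urls = crawler.crawl_site(url) do |doc|
  # Process each crawled document as usual...
end

# The external URLs (to other hosts) found during the crawl.
puts "Found #{ext_urls.size} external URLs."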

As with any Ruby block, you can break out of it, cancelling the crawl, or call next to start the next iteration of the crawl early.
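
For example, a short sketch that cancels the crawl once enough pages have been seen (the max_pages: param covered below achieves something similar):

seen = 0

crawler.crawl_site(url) do |doc|
  next if doc.empty? # Skip failed crawls and move on to the next page.

  seen += 1
  break if seen >= 10 # Cancels the remainder of the site crawl.
end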

Filtering URLs By Path

Another solution to a more efficient crawl is to only partially crawl a site.

To do so, we pass additional path parameters to #crawl_site. We can blacklist and/or whitelist URL paths as necessary. For example, we can whitelist certain paths using allow_paths:

crawled = []

crawler.crawl_site(url, allow_paths: ['docs', 'compare']) do |doc|
  crawled << doc.url
end

crawled # => ["https://vlang.io", "https://vlang.io/docs", "https://vlang.io/compare"]

Notice how the url (in this case "https://vlang.io") always gets crawled regardless of its path, because it's the starting point of the crawl. We can combine both allow_paths and disallow_paths as necessary. The given paths are processed using File.fnmatch?, meaning that the full glob syntax is supported e.g. 'wiki/*'.
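
You can try the matching logic directly with File.fnmatch? in irb (note that the exact path string Wgit compares against is derived from each URL's path):

File.fnmatch?('wiki/*', 'wiki/How-To-Crawl-A-Website') # => true
File.fnmatch?('wiki/*', 'blog/some-post')              # => false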

A larger, more realistic example of partially crawling a website is... Crawling this wiki! But we definitely don't want to crawl all of the github.com site along with it (or we'll be here a while). The following code snippet will do it:

wiki    = 'https://github.com/michaeltelford/wgit/wiki'.to_url
crawled = []

allow    = 'michaeltelford/wgit/wiki/*'          # Allow all of this wiki's pages.
disallow = 'michaeltelford/wgit/wiki/*/_history' # But not archived wiki pages.

crawler.crawl_site(wiki, allow_paths: allow, disallow_paths: disallow) do |doc|
  crawled << doc.url
end

# Only the pages we specified got crawled.
crawled.all?  { |url| url.start_with?(wiki) }     # => true
crawled.none? { |url| url.end_with?('_history') } # => true

Handling Duplicate URLs

Wgit::Crawler#crawl_site won't crawl the same URL twice unless a redirect is involved. For example, let's say /location gets crawled, then /redirect, which redirects back to /location. In this case /location will be crawled twice because only the original URL (in this case /redirect) gets checked. The exception to this rule is external redirects (to a different host), which are not allowed and will not be followed by #crawl_site.

If you want to record the unique URLs on a host, then you can use a Set:

require 'set' # Not needed on Ruby 3.2+, where Set is available by default.

# Let's record all the unique pages on a host to determine the size of the site.
crawled = Set.new # A Set guarantees unique elements.

crawler.crawl_site(url) do |doc|
  # The doc.url here is the final redirected-to url, not the original.
  crawled << doc.url
end

size = crawled.size < 50 ? 'small' : 'large'
puts "#{url} is a #{size} site."

Specifying URLs To Follow

All of the code examples above use the default method of extracting URLs, in order to keep the crawl of the site going. But what if we want to be explicit and only follow certain links/URLs? Simply pass a follow: param:

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')

crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
  puts doc.url
end

which will print:

http://quotes.toscrape.com/tag/humor/
http://quotes.toscrape.com/tag/humor/page/2/

The crawl will continue for as long as the follow: xpath returns URLs. It's up to you to ensure that the provided xpath is correct. Also remember that any URL pointing to a different host will not be crawled - because this would be outside the remit of #crawl_site. The URLs extracted by your xpath will be subject to the allow/disallow_paths params if provided.
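
As an illustrative sketch, combining follow: with a path filter might look like the below (the glob is an assumption about this site's URL structure):

crawler.crawl_site(
  url,
  follow:      "//li[@class='next']/a/@href",
  allow_paths: 'tag/humor/*' # Assumed glob; only follow pages under this path.
) do |doc|
  puts doc.url
end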

Specifying Max Page Limit

If you're unsure about the size of the site you're about to crawl, you can pass a max_pages: parameter to limit the scope of the crawl. For example, setting max_pages: 3 will only allow #crawl_site to crawl the first 3 documents it encounters.
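
For example:

crawled = []

# Stop after the first 3 pages, regardless of the site's size.
crawler.crawl_site(url, max_pages: 3) do |doc|
  crawled << doc.url
end

crawled.size # => 3 (at most)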

Parsing Javascript

By default, Wgit doesn't parse any crawled Javascript. But you can enable this feature if desired. See 'How To Parse Javascript' for more information.
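
As a rough sketch only (check the linked page for the exact, up-to-date API), the feature is enabled when constructing the crawler; the parse_javascript: option below is an assumption based on Wgit's docs:

# Assumes Wgit::Crawler accepts a parse_javascript: option - see the linked wiki page.
js_crawler = Wgit::Crawler.new(parse_javascript: true)

js_crawler.crawl_site(url) do |doc|
  # doc.html now includes content rendered by Javascript.
end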

Saving To Disk

If you're crawling a site in order to save its HTML to disk, then take a look at this script for an already-built example which Wgit uses to save test fixtures.
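
If you only need something quick and dirty, here's a minimal sketch (not the script linked above) which writes each page's HTML to a local directory:

require 'wgit'
require 'wgit/core_ext'
require 'fileutils'

crawler = Wgit::Crawler.new
url     = 'https://vlang.io'.to_url
dir     = './crawled_html'

FileUtils.mkdir_p(dir)

crawler.crawl_site(url) do |doc|
  next if doc.empty?

  # Derive a simple, file system safe name from the URL.
  filename = doc.url.to_s.gsub(/[^a-z0-9]+/i, '_') + '.html'
  File.write(File.join(dir, filename), doc.html)
end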