How To Crawl A Website
When we use Wgit::Crawler#crawl_site, we crawl all of the linked-to HTML pages on that host. This is achieved under the hood by crawling each webpage and recording its internal links (to other HTML pages), which are then crawled in later iterations. This continues until all HTML links on the host have been found and crawled.
Keep in mind that only links pointing to the same host will be followed and crawled. For example, https://vlang.io and https://docs.vlang.io are two different hosts/websites and are treated as such by Wgit when calling #crawl_site.
The block passed to #crawl_site will yield each crawled Wgit::Document instance in turn, offering the full capability of that class. Any failed crawl (a non-2xx response) will result in an empty? document, so you can use next if doc.empty? in your block to only process successfully crawled documents.
The following code snippet crawls a single website's pages, storing their HTML in an Array:
require 'wgit'
require 'wgit/core_ext'
crawler = Wgit::Crawler.new
url = 'https://vlang.io'.to_url # Returns a Wgit::Url instance.
crawled = []
crawler.crawl_site(url) do |doc|
crawled << doc unless doc.empty?
end
crawled.first.class # => Wgit::Document
htmls = crawled.map(&:html)
# Do something with the crawled pages' HTML...
When passing a url to #crawl_site, it's recommended that you use a URL which has the greatest chance of finding HTML links to other pages on the site, e.g. an index page. By doing so, we ensure the best chance of a complete and thorough crawl.
The above is simple and effective, but it has a potential problem. We store each crawled web document in an Array, which is fine for a small site like vlang.io (made up of 4 webpages at the time of writing). But what about larger sites, where we don't know how many pages might be crawled? One solution, of course, is to make better use of #crawl_site's block.
The following code snippet uses the block to process each web document as it gets crawled, releasing any reference to it before crawling the next. This is much more memory efficient. Just remember that the crawl waits for the block to complete before continuing on to the next page of the site, which will affect your overall crawl time. Basically, it's a trade-off: keep your block processing code short or pass it off to an asynchronous worker etc.
crawler.crawl_site(url) do |doc|
next if doc.empty?
# Use `doc.html` here or `queue_job(doc)` etc.
end
Remember that the block passed to #crawl_site is the only way to interact with each crawled web document. The return value is an Array of unique external URLs (to other sites) extracted while crawling the site; these are typically used by Wgit::Indexer to crawl other sites later on. As with any Ruby block, you can break out of it, cancelling the crawl, or call next to start the next iteration of the crawl early.
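For example, here's a minimal sketch (reusing the crawler and url from the snippet above) that captures those external URLs. Note that using break inside the block causes #crawl_site to return early (with nil by default) instead of the Array of external links.
crawled = []
external_links = crawler.crawl_site(url) do |doc|
  crawled << doc.url unless doc.empty?
end
external_links # => Array of unique Wgit::Url objects pointing to other hosts.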
Another solution to a more efficient crawl is to only partially crawl a site. To do so, we pass additional path parameters to #crawl_site. We can blacklist and/or whitelist URL paths as necessary. For example, we can whitelist certain paths using allow_paths:
crawled = []
crawler.crawl_site(url, allow_paths: ['docs', 'compare']) do |doc|
crawled << doc.url
end
crawled # => ["https://vlang.io", "https://vlang.io/docs", "https://vlang.io/compare"]
Notice how the url (in this case "https://vlang.io") always gets crawled regardless of its path, because it's the starting point of the crawl. We can combine both allow_paths and disallow_paths as necessary. The given paths are processed using File.fnmatch?, meaning that the full glob syntax is supported e.g. 'wiki/*'.
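As a quick illustration of the underlying glob matching (plain Ruby, independent of Wgit):
File.fnmatch?('wiki/*', 'wiki/How-To-Crawl-A-Website') # => true
File.fnmatch?('wiki/*', 'blog/some-post')              # => false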
A larger, more realistic example of partially crawling a website is... crawling this wiki! But we definitely don't want to crawl all of the github.com site along with it (or we'll be here a while). The following code snippet will do it:
wiki = 'https://github.com/michaeltelford/wgit/wiki'.to_url
crawled = []
allow = 'michaeltelford/wgit/wiki/*' # Allow all of this wiki's pages.
disallow = 'michaeltelford/wgit/wiki/*/_history' # But not archived wiki pages.
crawler.crawl_site(wiki, allow_paths: allow, disallow_paths: disallow) do |doc|
crawled << doc.url
end
# Only the pages we specified got crawled.
crawled.all? { |url| url.start_with?(wiki) } # => true
crawled.none? { |url| url.end_with?('_history') } # => true
Wgit::Crawler#crawl_site won't crawl the same URL twice unless a redirect is involved. For example, let's say /location gets crawled, then /redirect, which redirects back to /location. In this case, /location will be crawled twice, because only the original URL (in this case /redirect) gets checked. The exception to this rule is external redirects (to a different host), which are not allowed and will not be followed by #crawl_site.
If you want to record the unique URLs on a host, then you can use a Set:
# Let's record all the unique pages on a host to determine the size of the site.
crawled = Set.new # Sets are guaranteed to be unique.
crawler.crawl_site(url) do |doc|
# The doc.url here is the final redirected-to url, not the original.
crawled << doc.url
end
size = crawled.size < 50 ? 'small' : 'large'
puts "#{url} is a #{size} site."
All of the code examples above use the default method of extracting URLs, in order to keep the crawl of the site going. But what if we want to be explicit and only follow certain links/URLs? Simply pass a follow: param:
crawler = Wgit::Crawler.new
url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
puts doc.url
end
which will print:
http://quotes.toscrape.com/tag/humor/
http://quotes.toscrape.com/tag/humor/page/2/
The crawl will continue for as long as the follow: xpath returns URLs. It's up to you to ensure that the provided xpath is correct. Also remember that any URL pointing to a different host will not be crawled, because this would be outside the remit of #crawl_site. The URLs extracted by your xpath will be subject to the allow/disallow_paths params, if provided.
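As a hedged sketch of combining the two (assuming follow: and the path params can be passed together, as the text above implies), we could restrict the pagination crawl to the 'tag' path:
crawler.crawl_site(url, follow: "//li[@class='next']/a/@href", allow_paths: 'tag/*') do |doc|
  puts doc.url
end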
If you're unsure about the size of the site you're about to crawl, you can pass a max_pages: parameter to limit the scope of the crawl. For example, setting max_pages: 3 will only allow #crawl_site to crawl the first 3 documents it encounters.
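A minimal sketch, reusing the crawler and url from the earlier examples:
crawled = []
crawler.crawl_site(url, max_pages: 3) do |doc|
  crawled << doc.url
end
crawled.size # => 3 (at most).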
By default, Wgit doesn't parse any crawled Javascript. But you can enable this feature if desired. See 'How To Parse Javascript' for more information.
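As a rough sketch (assuming the parse_javascript: crawler option described in that article):
js_crawler = Wgit::Crawler.new(parse_javascript: true)
js_crawler.crawl_site(url) do |doc|
  # doc.html should now reflect any DOM changes made by the page's Javascript.
end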
If you're crawling a site in order to save its HTML to disk, then take a look at this script for an already built example which Wgit uses to save test fixtures.
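As a rough, hedged sketch of the idea (not the script referenced above), the following writes each crawled page's HTML to the current directory, deriving a file name from its URL:
require 'wgit'
require 'wgit/core_ext'

crawler = Wgit::Crawler.new
url = 'https://vlang.io'.to_url

crawler.crawl_site(url) do |doc|
  next if doc.empty?

  # Build a file name from the URL e.g. 'vlang.io-docs.html'.
  file_name = doc.url.to_s.sub(%r{\Ahttps?://}, '').gsub('/', '-').chomp('-') + '.html'
  File.write(file_name, doc.html)
end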