
How To Use The DSL


The Wgit::DSL provides wrapper methods around the API for convenience, and its use is optional.

require 'wgit'
require 'json'

include Wgit::DSL

start  'http://quotes.toscrape.com/tag/humor/'
follow "//li[@class='next']/a/@href"

extract :quotes,  "//div[@class='quote']/span[@class='text']", singleton: false
extract :authors, "//div[@class='quote']/span/small",          singleton: false

quotes = []

crawl_site do |doc|
  doc.quotes.zip(doc.authors).each do |arr|
    quotes << {
      quote:  arr.first,
      author: arr.last
    }
  end
end

puts JSON.generate(quotes)

The DSL can be quicker and simpler to use than the API because it abstracts away some of the boilerplate code, e.g. instantiating classes. Using the DSL, you can crawl, index and search the web. Some functionality however - such as URL parsing - is only available via the API.
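
For example, a site can be indexed into a database and then searched with only a few DSL calls. The sketch below assumes a MongoDB instance reachable via the connection string shown (replace it with your own):

require 'wgit'

include Wgit::DSL

# Assumed connection string - point this at your own MongoDB instance.
ENV['WGIT_CONNECTION_STRING'] = 'mongodb://rubyapp:abcdef@localhost/crawler'

start  'http://quotes.toscrape.com/tag/humor/'
follow "//li[@class='next']/a/@href"

index_site         # Crawl the site and save each page to the database.
search 'prejudice' # Query the saved pages for matching documents.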

The Wgit DSL is typically used for quickly writing scripts that extract data from the web, either as quick experiments or by non-technical users. Anything that's possible with the DSL is also possible using Wgit's API classes. Often, when using Wgit inside another library or application, it's cleaner and more flexible to use the API, but the choice is yours.
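
As an illustration, the quotes example above can be rewritten against the API classes directly, which is roughly what the DSL does for you under the hood. This is a sketch, assuming a recent Wgit version where Wgit::Document.define_extractor and Wgit::Crawler#crawl_site's follow: keyword are available:

require 'wgit'
require 'json'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')

Wgit::Document.define_extractor(:quotes,  "//div[@class='quote']/span[@class='text']", singleton: false)
Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small",          singleton: false)

quotes = []

# The follow: XPath plays the same role as the follow DSL method above.
crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
  doc.quotes.zip(doc.authors).each do |arr|
    quotes << { quote: arr.first, author: arr.last }
  end
end

puts JSON.generate(quotes)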

When you include Wgit::DSL, you include its defined methods and instance variables. All DSL instance variables and constants are prefixed with dsl_ to avoid conflicts. However, it's up to you to ensure the DSL methods don't override other definitions in your code. If in doubt, use the Wgit API instead, which is namespaced under Wgit::.

Check out the DSL's yardocs for the full list of available methods.


An alternative way of using the DSL is to subclass Wgit::Base, which extends Wgit::DSL underneath. Its syntax is similar to that of the Kimurai framework. This approach provides an additional layer of abstraction over the typical DSL usage shown above.

require 'wgit'
require 'json'

class QuotesCrawler < Wgit::Base
  mode   :crawl_site
  start  'http://quotes.toscrape.com/tag/humor/'
  follow "//li[@class='next']/a/@href"

  extract :quotes,  "//div[@class='quote']/span[@class='text']", singleton: false
  extract :authors, "//div[@class='quote']/span/small",          singleton: false

  def parse(doc)
    doc.quotes.zip(doc.authors).each do |arr|
      yield({
        quote:  arr.first,
        author: arr.last
      })
    end
  end
end

if __FILE__ == $0
  quotes = []
  QuotesCrawler.run { |quote| quotes << quote }
  puts JSON.generate(quotes)
end

How it works:

  • You must call the start DSL method to define the URLs to crawl.
  • Your crawler class must define a #parse(doc) method which can optionally yield some data. You then have access to this data via a block when you run your crawler.
  • Any defined extractors will be callable on the doc passed to parse - which is called for every crawled URL/page.
  • The mode DSL method specifies which Wgit::Crawler/Indexer method to call, defaulting to crawl - which crawls a single URL/page (see the minimal sketch after this list).
  • When crawling a site, all internal <a> href URLs are followed by default - you can override this by passing an XPath to the follow DSL method.
  • If indexing a site (crawling and then saving it to a database), don't forget to set ENV['WGIT_CONNECTION_STRING'].
  • Define #initialize, #setup and #teardown methods as needed inside your class. #initialize and #setup are called before the crawl, and #teardown after it.
  • Call self.class.<dsl_method> as needed from inside your class's instance methods, e.g. self.class.last_response.
  • The run method returns the created instance of your class for convenience. You can use this to query your class after the crawl has completed.
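
For instance, a minimal crawler relying on the default :crawl mode (a single page crawl) might look like the sketch below; the class name and the yielded data are illustrative only:

require 'wgit'

class PageCrawler < Wgit::Base
  # No mode call needed - it defaults to :crawl, which crawls only the start URL.
  start 'http://quotes.toscrape.com/'

  # Called once for the crawled page; anything yielded here is passed to run's block.
  def parse(doc)
    yield doc.title
  end
end

PageCrawler.run { |title| puts title }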

Here's another example of a class-based DSL crawler (using some of the points above):

require "wgit"

# Suppress the index logging.
Wgit.logger.level = Logger::WARN

# Set your database's connection string.
ENV['WGIT_CONNECTION_STRING'] = "mongodb://rubyapp:abcdef@localhost/crawler"

$url = "https://txti.es"

class WebsiteIndexer < Wgit::Base
  mode :index_site
  start $url

  attr_reader :page_count, :total_time

  def initialize
    @page_count = 0
    @total_time = 0
  end

  def setup
    puts "Starting to index #{$url}..."
    puts
  end

  def parse(doc)
    @total_time += self.class.last_response.total_time
    return if doc.empty?

    puts_info(doc)
    @page_count += 1
  end

  def teardown
    puts "Finished indexing #{$url} (#{@page_count} pages in #{@total_time})"
  end

  private

  def puts_info(doc)
    puts doc.title || "No title"
    puts doc.description&.[](0..100) || "No description"
    puts doc.stats
    puts doc.url
    puts
  end
end

if __FILE__ == $0
  indexer = WebsiteIndexer.run
  puts "On average, one page was indexed every #{indexer.total_time / indexer.page_count} ms"
end

Running this will insert the crawled pages into the database and output:

Starting to index https://txti.es...

txti - Fast web pages for everybody
No description
{:url=>15, :html=>3706, :text=>9, :text_bytes=>192, :links=>4, :title=>35, :author=>14}
https://txti.es

About txti
No description
{:url=>21, :html=>1834, :text=>14, :text_bytes=>826, :links=>4, :title=>10}
https://txti.es/about

How to use txti
No description
{:url=>19, :html=>3804, :text=>49, :text_bytes=>2456, :links=>7, :title=>15}
https://txti.es/how

txti - Terms of Service
No description
{:url=>21, :html=>11589, :text=>42, :text_bytes=>10481, :links=>1, :title=>23}
https://txti.es/terms

Made via txti.es:
Images in txti
All images will be centered and start on a new line (so text doesn't flow around them.
{:url=>22, :html=>2335, :text=>8, :text_bytes=>658, :links=>3, :description=>203, :title=>17, :author=>7}
https://txti.es/images

Made via txti.es:
Images in txti
All images will be centered and start on a new line (so text doesn't flow around them.
{:url=>29, :html=>2155, :text=>4, :text_bytes=>536, :links=>1, :description=>203, :title=>17, :author=>7}
https://txti.es/images/images

Finished indexing https://txti.es (6 pages in 0.913512)
On average, one page was indexed every 0.152252 seconds