How To Index

Michael Telford edited this page Jul 9, 2024 · 24 revisions

The Wgit::Indexer class provides a means to index the web. Indexing means crawling one or more web pages and inserting their contents into a Database. Once stored in a Database, each document's content can be searched.

The rest of this article assumes you're using the default database adapter class, which is for MongoDB.
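If you want to confirm or change which adapter is in use, the Wgit::Database.adapter_class accessor (referenced later in this article) controls it. A hedged config sketch, assuming the MongoDB adapter class name from the Wgit docs:

```ruby
require 'wgit'

# Assumed defaults - check the Wgit::Database docs for your gem version.
Wgit::Database.adapter_class # => Wgit::Database::MongoDB by default.

# Swap in a different adapter class (e.g. one you've written) like so:
# Wgit::Database.adapter_class = MyDatabaseAdapter
```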

This article demonstrates this process. Consider the following:

require 'wgit'
require 'wgit/core_ext'

# Omit the connection string param to default to ENV['WGIT_CONNECTION_STRING'].
db      = Wgit::Database.new '<connection_string>'
indexer = Wgit::Indexer.new db

# It's a single method call to index a page, site or even the WWW.
#
# Each crawled Wgit::Document is yielded to the given block for inspection.
# Return :skip from the block to prevent that doc being inserted into the Database.
indexer.index_site 'http://txti.es/'.to_url do |doc|
  :skip if doc.empty? # For example, skip documents with no content.
end

# Search the indexed documents for something we're interested in.
q = 'Fast web pages for everybody'
results = db.search q

# Get the top matching result.
doc = results.first

doc.class    # => Wgit::Document
doc.score    # => 7.3 - The query match score set by the DB.
doc.search q # => ["Fast web pages for everybody.", ...] etc.
  • Wgit::Database#search returns us the matching Documents from the Database.
  • Wgit::Document#search returns us the matching fields from that Document.

Note that the fields to be searched are defined in Wgit::Model.search_fields. Here's an example of getting and setting them:

# Wgit's default Model search fields, which all #search methods abide by.
Wgit::Model.search_fields
# => {:title=>2, :description=>2, :keywords=>2, :text=>1}

# Passing an initialised database param will in turn call db.search_fields = {...} underneath.
Wgit::Model.set_search_fields({my_field: 2}, db)
# => {:my_field=>2}
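The integer values are relative weights used when ranking results: a match in a field with weight 2 contributes twice as much to a document's score as a match in a weight-1 field. Wgit delegates the real scoring to MongoDB's text index, but the idea can be sketched in plain Ruby (a hypothetical toy scorer, not Wgit's actual implementation):

```ruby
# A toy relevance scorer illustrating weighted search fields.
# This is NOT Wgit's implementation - MongoDB's text index does the
# real scoring - it just shows how field weights bias the result.
SEARCH_FIELDS = { title: 2, description: 2, keywords: 2, text: 1 }

def score(doc, query)
  SEARCH_FIELDS.sum do |field, weight|
    matches = doc.fetch(field, '').scan(/#{Regexp.escape(query)}/i).size
    matches * weight
  end
end

doc = { title: 'Ruby rocks', text: 'Ruby is a dynamic language. Ruby!' }
score(doc, 'ruby') # => 4, i.e. (1 title match * 2) + (2 text matches * 1)
```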

Check out this demo search engine - built using Wgit, Sinatra and MongoDB - deployed to fly.io. Try searching for something that's Ruby related like "Matz" or "Rails".


With Wgit::Indexer you can index a page, a site or the entire World Wide Web, each via its own method:

  • Wgit::Indexer#index_url - Crawls and stores a single HTML page.
  • Wgit::Indexer#index_site - Crawls and stores an entire site's HTML pages.
  • Wgit::Indexer#index_www - Crawls one site at a time, storing its external URLs to be crawled in later iterations. This continues until no more external URLs are left to crawl (which will likely never happen), the Database runs out of space or you manually kill the crawl.

The Wgit::DSL module provides convenience methods that call these Indexer methods underneath, without you having to initialise your own Database (an instance of Wgit::Database.adapter_class) or Wgit::Indexer. See the Wgit::DSL module docs for more information.


So far we've indexed and searched the visual text on a page. But what if we want to index and search the content of a specific page element? No problem, you simply define an extractor, re-index and search.

The Document content that gets inserted into the Database depends on the Wgit::Model module; but as a rule of thumb, most Wgit::Document instance variables will be inserted as key-value pairs. Because of this, your defined extractors (which define instance variables on the Wgit::Document) will also be inserted into the Database when indexed, meaning they too can be searched.
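As a rough illustration of that rule of thumb (with hypothetical field names; the real serialisation logic lives in Wgit::Model, not here), a document's instance variables become key-value pairs in the stored record:

```ruby
# A toy sketch of how a document's instance variables map to the
# stored record - NOT Wgit's actual serialisation code.
class ToyDocument
  def initialize(url, title, code)
    @url, @title, @code = url, title, code
  end

  # Serialise every instance variable into a Hash for storage.
  def to_h
    instance_variables.to_h do |ivar|
      [ivar.to_s.delete('@').to_sym, instance_variable_get(ivar)]
    end
  end
end

doc = ToyDocument.new('https://vlang.io/', 'The V Programming Language', ['println(arr)'])
doc.to_h
# => {:url=>"https://vlang.io/", :title=>"The V Programming Language", :code=>["println(arr)"]}
```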

Let's take another example. This time, we specifically want to index the <code> snippets of the site. Note that while the code snippets will also appear in the text field, we want to store the code in its own field so we can search against it exclusively.

# Let's start by defining an extractor to extract and store the <code> content.
Wgit::Document.define_extractor(
  :code, '//code', singleton: false, text_content_only: true
)

# Now let's index a site whose code snippets we're interested in.
indexer.index_site 'https://vlang.io/'.to_url

This will index the site vlang.io, storing each crawled HTML document into the Database. The stored documents will each contain our defined code field; consisting of an array of strings representing the code snippets extracted from that page.

Don't forget that, in order to search the Database's code snippets, we need to add the code field to our text search index (described in Configuring MongoDB). Setting the search fields to only the code field means we'll be searching solely against that field. We can update the text index programmatically using Wgit with:

Wgit::Model.set_search_fields([:code], db)

With the text search index updated, we can search the Database in the usual way:

q = 'println'
results = db.search q

results.size      # => 2
results.map &:url # => ["https://vlang.io/docs", "https://vlang.io/compare"]

doc = results.first

# Let's search the code snippets of this Document directly for matches.
doc.code.size # => 129 code snippets on this page.
doc.code.select { |snippet| snippet.include? q }
# => ["println", "println(arr)"]

In the above examples, we've crawled an entire site and inserted its content into a Database before searching the stored results - the basis of any search engine, all within ~10 lines of code.

Powerful stuff!