How To Multi Thread

Michael Telford edited this page May 26, 2020 · 4 revisions

If we want to crawl individual URLs and know the URLs ahead of time, then we can crawl in parallel using Ruby's built-in Thread class. For example:

main.rb

```ruby
require 'wgit'
require 'wgit/core_ext'

urls = %w[
  https://daveceddia.com/tutorial-trap/
  https://daveceddia.com/how-i-learn-things/
].to_urls

crawler = Wgit::Crawler.new
handler = lambda { |doc| puts "#{doc.title} - #{doc.description}" }

threads = urls.map do |url|
  Thread.new { crawler.crawl url, &handler }
end

threads.each(&:join)
```

Run the script with:

```
$ ruby main.rb
How I Learn New Things - Someone asked recently what my learning strategy was… how do I learn new things?
The Tutorial Trap - Sometimes it's better to venture out on your own.
```

Notice how we create a single handler lambda that is passed to each thread to handle its crawled URL/document. We then call join on each thread, waiting for them all to finish before the script exits.
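In the example above each thread only prints to stdout. If your handler mutates shared state instead (appending to an array, say), guard that state with a Mutex. Here's a minimal sketch using only the Ruby standard library; the `results` array and worker threads are illustrative, not part of Wgit's API:

```ruby
results = []
mutex   = Mutex.new

# Four worker threads, each producing a value (a stand-in for a crawled document).
workers = 1.upto(4).map do |i|
  Thread.new do
    value = "result-#{i}"
    # Array#<< isn't guaranteed to be atomic, so synchronize the append.
    mutex.synchronize { results << value }
  end
end

workers.each(&:join)
puts results.sort
```

Without the mutex this will usually still appear to work on MRI because of the GVL, but that's an implementation detail best not relied upon.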

We can also use threads when crawling a site. For example, using the same handler as before:

```ruby
threads = []

crawler.crawl_site url do |doc|
  threads << Thread.new { handler.call doc }
end

threads.each(&:join)
```

This won't crawl each page in parallel, but it will handle each page in parallel, speeding up the overall execution. This is particularly effective when you're doing a lot of processing per page.
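One way to collect per-page results without an explicit lock is Ruby's thread-safe Queue. A sketch of the pattern, with a trivial word count standing in for the real per-page processing (the `pages` array and counting logic are made up for illustration):

```ruby
queue = Queue.new # Queue is thread-safe, so no Mutex is needed here

pages = ['<p>one two</p>', '<p>three</p>'] # stand-ins for crawled documents

threads = pages.map do |html|
  Thread.new do
    # The expensive per-page work goes here; a word count keeps the sketch short.
    words = html.gsub(/<[^>]+>/, ' ').split.size
    queue << words
  end
end

threads.each(&:join)

totals = []
totals << queue.pop until queue.empty?
puts totals.sum # => 3
```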

This is how the broken_link_finder gem uses Wgit under the hood: each crawled page on a site is passed to a thread which checks that document's links, returning those which are broken. The resulting speed increase on a large site is massive.
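That per-page pattern can be sketched without the gem itself: each page's links are handed to a thread that filters out the broken ones. The `broken` predicate, page hash, and link lists below are all stand-ins for illustration; the real gem issues HTTP requests to verify each link.

```ruby
# Stand-in for a real HTTP check of each link.
broken = ->(link) { link.include?('missing') }

# Stand-ins for a site's crawled pages and the links found on each.
pages = {
  '/index' => ['/about', '/missing-page'],
  '/about' => ['/contact']
}

mutex        = Mutex.new
broken_links = []

threads = pages.map do |_page, links|
  Thread.new do
    bad = links.select(&broken)
    # Guard the shared results array when a thread finds broken links.
    mutex.synchronize { broken_links.concat(bad) } unless bad.empty?
  end
end

threads.each(&:join)
puts broken_links # => /missing-page
```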