How To Multi-Thread
If we want to crawl individual URLs that we know ahead of time, we can crawl them in parallel using Ruby's built-in Thread class. For example:
main.rb
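require 'wgit'
require 'wgit/core_ext' # Provides the to_urls helper used below.

urls = %w[
  https://daveceddia.com/tutorial-trap/
  https://daveceddia.com/how-i-learn-things/
].to_urls

crawler = Wgit::Crawler.new
handler = lambda { |doc| puts "#{doc.title} - #{doc.description}" }

# Start one thread per URL; each thread crawls its URL and passes the document to the handler.
threads = urls.map do |url|
  Thread.new { crawler.crawl(url, &handler) }
end

# Wait for every thread to finish before the script exits.
threads.each(&:join)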
Run the script with:
$ ruby main.rb
How I Learn New Things - Someone asked recently what my learning strategy was… how do I learn new things?
The Tutorial Trap - Sometimes it's better to venture out on your own.
Notice how we create a single handler that gets passed to each thread to handle its crawled URL/document. We then call join on each thread in the array to wait for them all to finish. Because the threads run concurrently, the order of the output lines can vary between runs.
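Because one handler is shared by every thread, any state it touches should be safe to use concurrently. The following is a minimal sketch of that idea, not Wgit code: `fake_crawl` is a hypothetical stand-in for the crawler (it makes no network requests and yields a plain hash instead of a document), and the example.com URLs are placeholders.

```ruby
# Hypothetical stand-in for a crawl call: yields a fake "document" hash.
# No network access is performed.
def fake_crawl(url)
  sleep(rand * 0.01) # simulate variable response times
  yield({ url: url, title: "Title of #{url}" })
end

urls = %w[
  https://example.com/a
  https://example.com/b
  https://example.com/c
]

titles = []
lock   = Mutex.new

# A single handler shared by all threads; the mutex guards the shared array.
handler = lambda do |doc|
  lock.synchronize { titles << doc[:title] }
end

# One thread per URL, each passing its result to the shared handler.
threads = urls.map do |url|
  Thread.new { fake_crawl(url, &handler) }
end

# Block until every thread has finished.
threads.each(&:join)

puts titles.sort
```

The titles are sorted before printing because the order in which threads finish is not guaranteed.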
We can also use threads when crawling a site. For example, using the same handler as before:
# Assumes crawler and handler are defined as before, and url is the Wgit::Url of the site.
threads = []

crawler.crawl_site(url) do |doc|
  threads << Thread.new { handler.call(doc) }
end

threads.each(&:join)
This won't crawl the pages in parallel, but it will handle each page in parallel, speeding up the overall execution. This is particularly effective when you're doing a lot of processing per page.
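To see why handing each page to a thread helps, here is a rough timing sketch under stated assumptions. `each_page` is a hypothetical stand-in for a sequential site crawl, and `slow_handler` uses `sleep` to mimic per-page work that waits on I/O; neither is part of Wgit.

```ruby
require 'benchmark'

# Hypothetical stand-in for a site crawl: yields pages one after another.
def each_page
  5.times { |i| yield "page-#{i}" }
end

# Placeholder for heavy per-page work; sleep mimics waiting on I/O.
slow_handler = lambda do |page|
  sleep 0.05
  page.upcase
end

# Handle each page inline, so the "crawl" waits for every handler call.
sequential = Benchmark.realtime do
  each_page { |page| slow_handler.call(page) }
end

# Hand each page to its own thread, then wait for all of them at the end.
threaded = Benchmark.realtime do
  threads = []
  each_page { |page| threads << Thread.new { slow_handler.call(page) } }
  threads.each(&:join)
end

puts format('sequential: %.2fs, threaded: %.2fs', sequential, threaded)
```

In standard Ruby (MRI), the gain comes mainly from handlers that spend their time waiting, such as on HTTP requests; purely CPU-bound handlers are limited by the global VM lock and are likely to see much less benefit.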
This is how the broken_link_finder gem uses Wgit under the hood: each crawled page on a site is passed to a thread, which checks that document's links and returns those that are broken. On a large site, the resulting speed increase can be substantial.
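The shape of that pattern can be sketched with `Thread#value`, which joins a thread and returns the result of its block. This is an illustration only, not broken_link_finder's actual code: the `pages` hash, the `broken_links` helper, and the rule that a link is "broken" when it points to no known page are all assumptions made for the example.

```ruby
# Hypothetical data: page path => links found on that page.
pages = {
  '/home'    => ['/about', '/missing'],
  '/about'   => ['/home', '/gone', '/contact'],
  '/contact' => ['/home']
}

# Assumption for this sketch: a link is broken if it matches no known page.
known_pages = pages.keys

# Hypothetical helper: returns the links that point to no known page.
def broken_links(links, known)
  links.reject { |link| known.include?(link) }
end

# One thread per page; each returns a [page, broken_links] pair.
threads = pages.map do |page, links|
  Thread.new(page, links) do |pg, ls|
    [pg, broken_links(ls, known_pages)]
  end
end

# Thread#value waits for the thread and returns its block's result.
report = threads.map(&:value).to_h

report.each { |page, broken| puts "#{page}: #{broken.inspect}" }
```

Returning results through `Thread#value` avoids sharing a mutable collection between threads, so no mutex is needed here.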