
How To Parse JavaScript


This article describes how to crawl and parse JavaScript using Wgit. You must have Chrome/Chromium installed and available on your $PATH.

Wgit doesn't parse the JavaScript on a crawled page by default; you have to enable it explicitly:

require 'wgit'

crawler = Wgit::Crawler.new parse_javascript: true
# OR
crawler.parse_javascript = true

When enabled, the crawler will first resolve the URL (handling redirects etc.) with an HTTP client before browsing to the final URL in a headless (no UI) web browser. The browser will load the page and execute any JavaScript it finds, allowing the page's HTML to be updated dynamically - think single-page apps built with frameworks like React. After the JavaScript has finished rendering, the final HTML is returned to Wgit and normal service resumes.

The following script demonstrates the effect parsing JavaScript can have on a crawl:

require 'wgit'
require 'wgit/core_ext'

# DuckDuckGo uses JavaScript to dynamically render search results.
# Let's crawl and extract the top 10 search results for "aid workers".
url   = 'https://duckduckgo.com/?q=aid+workers'.to_url
xpath = '//article[contains(@data-testid, "result")]'

# Print the HTML size, number of search results and crawl duration.
benchmark = lambda do |doc|
  print "Bytes: #{doc.size}\t"
  print "Results: #{doc.xpath(xpath).size}\t"
  puts  "Duration: #{doc.url.crawl_duration.truncate(2)}"
end

# First, let's crawl without parsing the page's Javascript.
crawler = Wgit::Crawler.new
doc = crawler.crawl(url)
benchmark.call(doc)

# Now let's parse the page's Javascript and compare the effects.
crawler.parse_javascript = true
doc = crawler.crawl(url)
benchmark.call(doc)

Which outputs:

Bytes: 14951	Results: 0	Duration: 0.24
Bytes: 47916	Results: 10	Duration: 3.64

Notice how the HTML grows more than 3x because the JavaScript is dynamically updating the page. As a result, the search results we want are only accessible after the JavaScript has been parsed. Parsing the JavaScript takes a significant toll on the crawl speed too. It should be noted, however, that the first crawl with parse_javascript enabled is the slowest because it requires the browser to be initialised; subsequent crawls will be faster.
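
Because the browser is reused across crawls, crawling several pages with one crawler instance amortises that start-up cost. Here's a minimal sketch (the query URLs are just placeholders) that times consecutive JavaScript-parsed crawls:

require 'wgit'
require 'wgit/core_ext'

crawler = Wgit::Crawler.new parse_javascript: true

# The first crawl pays the browser start-up cost; the rest reuse it.
%w[
  https://duckduckgo.com/?q=aid+workers
  https://duckduckgo.com/?q=relief+efforts
].each do |url|
  doc = crawler.crawl(url.to_url)
  puts "#{doc.url} took #{doc.url.crawl_duration.truncate(2)}s"
end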


So why doesn't Wgit parse JavaScript by default?

  • Design - Wgit was primarily designed for building search engines. While JavaScript-rendered HTML is becoming more commonplace, it generally isn't the best way of achieving good SEO. Semantic, well-formed HTML served straight from the server is what many search engines look for, removing JavaScript parsing from the equation (at least by default).
  • Speed - JavaScript engines might be fast, but they're not faster than skipping the JavaScript altogether. Another overhead is knowing when the page has finished being dynamically updated. Wgit periodically checks whether the browser's HTML is still growing in size and waits until it has stopped before returning the final HTML. Wgit::Crawler#parse_javascript_delay can be configured to strike a balance between faster crawls and giving the JavaScript enough time to do its thing (see the sketch after this list) - but it's a gamble. All of this affects the overall crawl speed.
  • Weight - Browsers (especially Chrome) aren't exactly lightweight in terms of the hardware resources they need to run. An HTTP client is a featherweight by comparison.
  • Security - Parsing JavaScript essentially means executing arbitrary code on your system. When you crawl a URL, you'd want confidence that the site is legitimate before parsing its JavaScript, which is why it's an opt-in feature.
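
As a rough sketch of tuning that speed trade-off, the delay can be set per crawler. This assumes parse_javascript_delay accepts the wait time in seconds - check the Wgit::Crawler docs for the exact semantics and default:

require 'wgit'
require 'wgit/core_ext'

crawler = Wgit::Crawler.new parse_javascript: true

# Assumed to be in seconds; a higher value gives slow JavaScript more
# time to render, at the cost of a slower overall crawl.
crawler.parse_javascript_delay = 2

doc = crawler.crawl('https://duckduckgo.com/?q=aid+workers'.to_url)
puts doc.size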

If you're going to be crawling JavaScript-intensive webpages all day, every day, then you'd probably be better off using a purely browser-based crawler such as Vessel.