How does the Grapher work?
Inside grapher.js lives the Grapher class. This class expects a URL, which is used as the entry point to the graph, along with various options including:

- `strict` (boolean): if set to `true`, the `toJSON` method will not render unverified links.
- `crawlLimit` (integer): the maximum depth to which a chain of links will be crawled without finding a verified link before the chain is abandoned and its crawling ceases.
- `stripDeeperLinks` (boolean): if set to `true`, only the joint shallowest paths of each domain are rendered when `toJSON` is called.
The only option above that actually affects the construction of the graph is `crawlLimit`. Too low a number and the whole graph may not be rendered; too high and the Grapher may do a great deal of unnecessary crawling that leads nowhere.
An example initialization is shown below:

```javascript
var url = "http://premasagar.com",
    options, grapher;

options = {
    strict: true,
    crawlLimit: 3
};

grapher = new Grapher(url, options);
```
Once the Grapher has been initialized, call its `build` method to begin the crawling process. Depending on the size of the graph, this may take a minute or so.
This process begins with the `build` method using the graph's `rootUrl` (set at the initialization of the Grapher) to initialize a Page object. This Page has its `verified` property manually set to `true` by the `build` method, as the entry point into the graph is treated as inherently valid. Once this is done, the Page is added to the Grapher's `pages` array and the `fetchPages` method is called for the first time.
Side note: when the `build` method is first called, you pass it a callback function that will be executed upon completion of the graph's construction. This callback is passed to the `fetchPages` method after the initial Page has been added to the `pages` array.
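Putting the above together, `build` would look roughly like the sketch below. This is a hedged reconstruction based purely on this description, not the actual source: the `Page` constructor signature and the exact property names are assumptions.

```javascript
// Hypothetical sketch of build, inferred from the description above.
// The Page constructor signature (url, grapher, depth) is an assumption.
Grapher.prototype.build = function (callback) {
    var rootPage = new Page(this.rootUrl, this, 0);

    // The entry point into the graph is treated as inherently valid
    rootPage.verified = true;

    this.pages.push(rootPage);

    // Hand the callback straight on to fetchPages, which executes it
    // once the whole graph has been constructed
    this.fetchPages(callback);
};
```

With that in place, usage would be along the lines of `grapher.build(function () { console.log(grapher.toJSON()); });`, assuming the callback receives no arguments.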
At the core of the `fetchPages` method is an `each` loop that goes through every page in the `pages` array and checks whether any have a `status` of "unfetched". Any that do are checked to see whether their depth exceeds the Grapher's `crawlLimit`, and provided it doesn't, their `fetch` method is called. If by the end of the loop every Page has been fetched, each is verified one last time before the callback originally passed in by `build` is executed.
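In code, that loop could look something like the following sketch. The property names (`status`, `depth`) and the handling of abandoned chains are assumptions drawn from this description rather than the real implementation.

```javascript
// Hypothetical sketch of fetchPages; names and details are inferred
// from the description above, not taken from the source.
Grapher.prototype.fetchPages = function (whenGraphIsBuilt) {
    var grapher = this,
        everyPageFetched = true;

    this.pages.forEach(function (page) {
        if (page.status === "unfetched") {
            // Chains that have passed the crawlLimit without a verified
            // link are abandoned (assumption: they are simply skipped)
            if (page.depth <= grapher.options.crawlLimit) {
                everyPageFetched = false;
                page.fetch(function () {
                    grapher.whenPageIsFetched(whenGraphIsBuilt);
                });
            }
        } else if (page.status === "fetching") {
            everyPageFetched = false;
        }
    });

    if (everyPageFetched) {
        this.verifyPages();   // verify each page one last time
        whenGraphIsBuilt();   // the callback originally passed to build
    }
};
```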
The first thing the `fetch` method does when called is update the Page's `status` to "fetching". This informs the Grapher's `fetchPages` method that the graph is still being built and that it should not yet execute the callback passed in by `build`.
Next, the method checks the cache to see if the Page's URL has already been crawled; if it hasn't, the scraper is used to acquire it. The Scraper then caches any data it scrapes. Regardless of whether the data comes from the cache or the scraper, the same callback is then executed.
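A sketch of `fetch` under those rules might read as follows; the `cache` and `scraper` interfaces shown here are hypothetical stand-ins for whatever the library actually uses.

```javascript
// Hypothetical sketch of fetch. The cache and scraper APIs are
// invented stand-ins inferred from the description above.
Page.prototype.fetch = function (whenPageIsFetched) {
    var page = this;

    // Tell fetchPages that the graph is still being built
    this.status = "fetching";

    cache.get(this.url, function (cachedData) {
        if (cachedData) {
            // Already crawled: reuse the cached data
            page.populate(cachedData, whenPageIsFetched);
        } else {
            // Not cached: scrape it (the scraper caches what it finds)
            scraper.scrape(page.url, function (scrapedData) {
                page.populate(scrapedData, whenPageIsFetched);
            });
        }
    });
};
```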
This callback (`populate`) does a few very important things, sketched after the list below:

- Firstly, it populates the Page object with the data retrieved from the scraper/cache.
- Secondly, it calls the Grapher's `verifyPages` method, which will attempt to verify the page itself as well as any other pages within the Grapher's `pages` array.
- Next, it takes any URLs found on the page that haven't already been crawled and uses them to initialize new Page objects, which are then added to the Grapher's `pages` array ready for crawling.
- Next, it updates its status from "fetching" to "fetched".
- Finally, it executes the `whenPageIsFetched` callback passed to it from the Grapher's `fetchPages` method.
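Those five steps translate into a sketch like the one below. The data fields (`title`, `links`) and the `alreadyCrawled` helper are hypothetical; they stand in for whatever the real scraper output and duplicate check look like.

```javascript
// Hypothetical sketch of populate, following the five steps above.
// The data fields and the alreadyCrawled helper are assumptions.
Page.prototype.populate = function (data, whenPageIsFetched) {
    var page = this,
        grapher = this.grapher;

    // 1. Populate the Page with the scraped/cached data
    this.title = data.title;
    this.links = data.links;

    // 2. Attempt to verify this page and every other page in the graph
    grapher.verifyPages();

    // 3. Turn any uncrawled URLs into new Pages, one level deeper
    this.links.forEach(function (url) {
        if (!grapher.alreadyCrawled(url)) {
            grapher.pages.push(new Page(url, grapher, page.depth + 1));
        }
    });

    // 4. Mark this page as done
    this.status = "fetched";

    // 5. Hand control back to the Grapher
    whenPageIsFetched();
};
```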
The `whenPageIsFetched` callback checks whether every page in the graph has been fetched. If every page has, the callback originally passed in by the `build` method is executed. If any haven't, it calls `fetchPages` again, starting the crawling-verification loop once more.
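As a final hedged sketch, `whenPageIsFetched` would then amount to little more than the following; again, the property names are inferred from this description.

```javascript
// Hypothetical sketch of whenPageIsFetched, closing the loop
Grapher.prototype.whenPageIsFetched = function (whenGraphIsBuilt) {
    var everyPageFetched = this.pages.every(function (page) {
        return page.status === "fetched";
    });

    if (everyPageFetched) {
        // Graph complete: run the callback originally passed to build
        whenGraphIsBuilt();
    } else {
        // populate may have added new unfetched pages: go around again
        this.fetchPages(whenGraphIsBuilt);
    }
};
```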