Support a limit on the crawl depth #24
Supporting this will take some work: at the moment Node Socialgraph has no concept of a crawl limit and will simply crawl any page it hasn't already crawled. If a URL hasn't been crawled, Node Socialgraph checks whether it is at the same level or deeper within a domain that has already been crawled and, if so, skips it. These are the only two pieces of logic used when deciding whether or not to crawl a page. A prime example of where this logic falls down is Flickr photo page links. Here's the process:
As a result, just about everyone's Flickr profile page links are considered invalid. A crawl count would fix this and, provided you have no objections, I'll start a new branch and begin working on this as my next big feature.
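For reference, here is a minimal sketch of the two checks described above, assuming a `crawledUrls` set and a path-segment notion of depth (both names are hypothetical, not the actual Node Socialgraph source):

```js
// Hypothetical sketch only -- not the actual Node Socialgraph code.
// crawledUrls holds every URL that has already been fetched.
const crawledUrls = new Set();

// Treat "depth" as the number of path segments, e.g.
// "http://flickr.com/photos/someone/12345/" -> 3
function pathDepth(pageUrl) {
  return new URL(pageUrl).pathname.split('/').filter(Boolean).length;
}

// The only two pieces of decision logic currently used.
function shouldCrawl(candidate) {
  if (crawledUrls.has(candidate)) {
    return false; // 1. already crawled
  }
  const host = new URL(candidate).hostname;
  for (const crawled of crawledUrls) {
    // 2. same domain, and at the same level or deeper than a page
    //    already crawled: skip it.
    if (new URL(crawled).hostname === host &&
        pathDepth(candidate) >= pathDepth(crawled)) {
      return false;
    }
  }
  return true;
}
```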
Whoops, closed it accidentally!
Sure. New feature branch. Nice.
Premasagar Rose, Dharmafly
http://dharmafly.com
This feature should be implemented, and the current same-domain limit on the 'feature/crawl-count' branch should be removed.
E.g. by setting a crawl depth of 10, the spider will only follow a maximum chain of 10 URLs before it gives up on finding a valid reciprocal `rel=me` URL and abandons the chain. The crawl counter increments whenever a link is followed. When the counter reaches the crawl limit, that branch of the network is abandoned. Each time a valid reciprocal link in the requested network is found, the crawl counter is reset to 0. When all branches have been explored to the extent of the crawl limit, the response data is cached and served.
The crawl limit prevents an unreciprocated link from triggering a huge search chain, while still permitting a reasonable amount of crawling of the network in order to capture its spread.
E.g. example.com may link to foo.com, and foo.com to bar.com, and bar.com back to example.com - here, a crawl limit of 3 will be sufficient to discover the network.
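A minimal sketch of that counter behaviour, assuming hypothetical fetchRelMeLinks and isReciprocal helpers (neither is existing Node Socialgraph API):

```js
// Hypothetical sketch of the crawl-counter idea described above.
const CRAWL_LIMIT = 3; // enough to discover example.com -> foo.com -> bar.com -> example.com

async function followChain(pageUrl, network, counter) {
  if (counter >= CRAWL_LIMIT) {
    return; // limit reached: abandon this branch of the network
  }
  const links = await fetchRelMeLinks(pageUrl); // hypothetical: outbound rel=me URLs on the page
  for (const link of links) {
    if (isReciprocal(link, network)) {          // hypothetical: does the link point back into the network?
      if (!network.has(link)) {
        network.add(link);
        await followChain(link, network, 0);    // valid reciprocal link found: reset the counter
      }
    } else {
      await followChain(link, network, counter + 1); // following a link increments the counter
    }
  }
}

// Usage: start from the requested URL with the counter at 0.
// await followChain('http://example.com/', new Set(['http://example.com/']), 0);
```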
The parameter should be adjustable via a config.json file on the server.
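For example (the `crawlLimit` key name below is only a guess, not an existing setting):

```js
// config.json might gain an entry along these lines (key name hypothetical):
// {
//   "crawlLimit": 3
// }
const config = require('./config.json');
const crawlLimit = config.crawlLimit || 3; // fall back to a sensible default
```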