Support a limit on the crawl depth #24

Closed
premasagar opened this issue Aug 15, 2012 · 4 comments

@premasagar
Member

For example, with a crawl depth of 10, the spider will follow a chain of at most 10 URLs before it gives up on finding a valid reciprocal rel=me URL and abandons the chain.

The crawl counter increments whenever a link is followed. When the counter reaches the crawl limit, that branch of the network is abandoned. Each time a valid reciprocal link in the requested network is found, the counter is reset to 0. When all branches have been explored to the extent of the crawl limit, the response data is cached and served.

The crawl limit prevents an unreciprocated link from triggering a huge search chain, while still permitting a reasonable amount of crawling of the network in order to capture its spread.

For example, example.com may link to foo.com, foo.com to bar.com, and bar.com back to example.com; here, a crawl limit of 3 is sufficient to discover the network.
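
A rough sketch of how that counter could behave, in plain Node-style JavaScript (the function names, the `network`/`crawled` sets and the `fetchRelMeLinks` helper are all hypothetical, not the existing crawler API):

```js
// Hypothetical sketch of the proposed depth-limited crawl.
// `fetchRelMeLinks(url, callback)` stands in for however the crawler
// fetches a page and extracts its rel=me links; `network` is the set of
// URLs already confirmed as part of the requested network.
function crawlBranch(pageUrl, depth, crawlLimit, network, crawled, fetchRelMeLinks) {
  if (depth >= crawlLimit || crawled.has(pageUrl)) {
    return; // limit reached (or page already seen): abandon this branch
  }
  crawled.add(pageUrl);
  fetchRelMeLinks(pageUrl, function (err, links) {
    if (err) { return; }
    links.forEach(function (link) {
      // A valid reciprocal link resets the counter to 0;
      // any other followed link counts as one more hop.
      var nextDepth = network.has(link) ? 0 : depth + 1;
      crawlBranch(link, nextDepth, crawlLimit, network, crawled, fetchRelMeLinks);
    });
  });
}
```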

The parameter should be adjustable via a config.json file on the server.
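
For example, something along these lines would do for the config side (the `crawlLimit` key name is only a suggestion):

```js
// Read the crawl limit from config.json, falling back to a default.
var config = require('./config.json');
var crawlLimit = config.crawlLimit || 10;
```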

@chrisnewtn
Member

Supporting this will take some work, as Node Socialgraph currently has no concept of a crawl limit and will simply crawl any page it hasn't already crawled.

If a URL hasn't been crawled, Node Socialgraph checks whether it sits at the same level or deeper on a domain that has already been crawled, and if it does, it doesn't crawl it.

These are the only two pieces of logic used when deciding whether or not to crawl a page. A prime example of where this logic falls down is Flickr photo page links. Here's the process:

1. http://www.flickr.com/photos/newt42/ hasn't been crawled, so crawl it.
2. I found this link on that page: http://www.flickr.com/people/newt42/. Should I crawl it?
3. It's at the same level, on the same domain as one you already have, so no.

As a result, just about everyone's Flickr profile page links are considered invalid. A crawl count would fix this, and provided you have no objections, I'll start a new branch and begin working on it as my next big feature.
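
Roughly, the two existing checks and the crawl-count alternative could be compared like this (the function names and the shape of `crawledUrls` are illustrative, not the actual Node Socialgraph code):

```js
var url = require('url');

// Current behaviour, roughly as described above: skip anything already
// crawled, and skip anything at the same level or deeper on a domain
// that has already been crawled.
function shouldCrawlNow(link, crawledUrls) {
  if (crawledUrls.indexOf(link) !== -1) { return false; }
  return !crawledUrls.some(function (crawled) {
    var a = url.parse(link);
    var b = url.parse(crawled);
    var aDepth = (a.pathname || '/').split('/').length;
    var bDepth = (b.pathname || '/').split('/').length;
    return a.hostname === b.hostname && aDepth >= bDepth;
  });
}

// With a per-branch crawl count, the same-domain/same-level rule can go:
// crawl anything not yet crawled while the counter is under the limit.
function shouldCrawlWithCount(link, crawledUrls, depth, crawlLimit) {
  return crawledUrls.indexOf(link) === -1 && depth < crawlLimit;
}
```

With these, the http://www.flickr.com/people/newt42/ link from the example is rejected by `shouldCrawlNow` but accepted by `shouldCrawlWithCount` while the counter is under the limit.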

@chrisnewtn
Member

Whoops, closed it accidentally!

@premasagar
Member Author

Sure. New feature branch. Nice.

@premasagar
Member Author

This feature should be implemented, and the current same-domain limit on the 'feature/crawl-count' branch should be removed.
