Support a limit on the crawl depth #24

Closed
premasagar opened this issue Aug 15, 2012 · 4 comments

@premasagar
Member

For example, with a crawl depth of 10, the spider will follow a chain of at most 10 URLs before it gives up on finding a valid reciprocal rel=me URL and abandons the chain.

The crawl counter increments whenever a link is followed. When the counter reaches the crawl limit, that branch of the network is abandoned. Each time a valid reciprocal link in the requested network is found, the counter is reset to 0. When all branches have been explored to the extent of the crawl limit, the response data is cached and served.

The crawl limit prevents an unreciprocated link from triggering a huge search chain, while still permitting a reasonable amount of crawling of the network in order to capture its spread.

For example, example.com may link to foo.com, foo.com to bar.com, and bar.com back to example.com; here, a crawl limit of 3 is sufficient to discover the network.
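
A rough sketch of how that counter could behave, in plain Node-style JavaScript (the function names, the `network`/`crawled` sets and the `fetchRelMeLinks` helper are all hypothetical, not the existing crawler API):

```js
// Hypothetical sketch of the proposed depth-limited crawl.
// `fetchRelMeLinks(url, callback)` stands in for however the crawler
// fetches a page and extracts its rel=me links; `network` is the set of
// URLs already confirmed as part of the requested network.
function crawlBranch(pageUrl, depth, crawlLimit, network, crawled, fetchRelMeLinks) {
  if (depth >= crawlLimit || crawled.has(pageUrl)) {
    return; // limit reached (or page already seen): abandon this branch
  }
  crawled.add(pageUrl);
  fetchRelMeLinks(pageUrl, function (err, links) {
    if (err) { return; }
    links.forEach(function (link) {
      // A valid reciprocal link resets the counter to 0;
      // any other followed link counts as one more hop.
      var nextDepth = network.has(link) ? 0 : depth + 1;
      crawlBranch(link, nextDepth, crawlLimit, network, crawled, fetchRelMeLinks);
    });
  });
}
```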

The parameter should be adjustable via a config.json file on the server.
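
For example, something along these lines would do for the config side (the `crawlLimit` key name is only a suggestion):

```js
// Read the crawl limit from config.json, falling back to a default.
var config = require('./config.json');
var crawlLimit = config.crawlLimit || 10;
```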

@chrisnewtn
Member

Supporting this will take some work, as Node Socialgraph currently has no concept of a crawl limit and will simply crawl any page it hasn't already crawled.

If a URL hasn't been crawled, Node Socialgraph checks whether it sits at the same level or deeper on a domain that has already been crawled, and if it does, it doesn't crawl it.

These are the only two pieces of logic used when deciding whether or not to crawl a page. A prime example of where this logic falls down is Flickr photo page links. Here's the process:

1. http://www.flickr.com/photos/newt42/ hasn't been crawled, so crawl it.
2. I found this link on that page: http://www.flickr.com/people/newt42/. Should I crawl it?
3. It's at the same level, on the same domain as one you already have, so no.

As a result, just about everyone's Flickr profile page links are considered invalid. A crawl count would fix this, and provided you have no objections, I'll start a new branch and begin working on it as my next big feature.
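
Roughly, the two existing checks and the crawl-count alternative could be compared like this (the function names and the shape of `crawledUrls` are illustrative, not the actual Node Socialgraph code):

```js
var url = require('url');

// Current behaviour, roughly as described above: skip anything already
// crawled, and skip anything at the same level or deeper on a domain
// that has already been crawled.
function shouldCrawlNow(link, crawledUrls) {
  if (crawledUrls.indexOf(link) !== -1) { return false; }
  return !crawledUrls.some(function (crawled) {
    var a = url.parse(link);
    var b = url.parse(crawled);
    var aDepth = (a.pathname || '/').split('/').length;
    var bDepth = (b.pathname || '/').split('/').length;
    return a.hostname === b.hostname && aDepth >= bDepth;
  });
}

// With a per-branch crawl count, the same-domain/same-level rule can go:
// crawl anything not yet crawled while the counter is under the limit.
function shouldCrawlWithCount(link, crawledUrls, depth, crawlLimit) {
  return crawledUrls.indexOf(link) === -1 && depth < crawlLimit;
}
```

With these, the http://www.flickr.com/people/newt42/ link from the example is rejected by `shouldCrawlNow` but accepted by `shouldCrawlWithCount` while the counter is under the limit.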

@chrisnewtn
Member

Whoops, closed it accidentally!

@premasagar
Member Author

Sure. New feature branch. Nice.

@premasagar
Member Author

This feature should be implemented, and the current same-domain limit on the 'feature/crawl-count' branch should be removed.
