Support cached responses for any URL within a crawled network #23
Or, is this a bad idea, since this could lead to, for example, a domain that is 10 crawl steps away from the original request's domain? (See the issue on crawl depth, #24.) If the secondary domain had been crawled, it would have 10 more crawl steps in which to attempt to find other domains that were not in the previously cached network list.

The issue I raised in my previous comment is not actually a problem, since the second URL, which was 10 crawl steps away from the first, would reset the counter to zero when found, and it would then become the starting point for a new crawl. Hence, the resultant network will always be the same, no matter which URL in the network was originally requested. So, the whole network should indeed be cached, with a request for any of its URLs able to retrieve the cached data.
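A minimal sketch of the symmetry argument above: if the crawl resets its depth counter at every newly discovered URL, it behaves like a plain graph traversal, so the set of URLs reached is the same from any starting point. The `links` map below is a hypothetical stand-in for scraped `rel=me` links, not project code.

```js
var links = {
  'example.com': ['foo.com'],
  'foo.com': ['example.com', 'bar.com'],
  'bar.com': ['foo.com']
};

function crawlNetwork(start) {
  var seen = {};
  var queue = [start];
  while (queue.length) {
    var url = queue.shift();
    if (seen[url]) continue;
    seen[url] = true;
    (links[url] || []).forEach(function (next) {
      if (!seen[next]) queue.push(next);
    });
  }
  return Object.keys(seen).sort();
}

// Both calls yield ['bar.com', 'example.com', 'foo.com'], so one cached
// network could serve a request for any of its member URLs.
console.log(crawlNetwork('example.com'));
console.log(crawlNetwork('bar.com'));
```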
Sorry Prem, I realize my lack of documentation is a major problem here. At the moment, the cacher works by taking the place of the scraper if it contains data pertaining to a URL. Here's the code in question in the Page class:

```js
if (cache.has(self.url)) {
  cache.fetch(self.url, populate);
} else {
  scraper.scrape(self.url, populate);
}
```

I should stress that graphs are not cached; pages which contain links are. I.e., Node Socialgraph caches not the graph generated by the requested domain, but the individual pages which constitute the graph as generated by the requested domain. The cache in its current incarnation is purely concerned with preventing the creation of JSDOM instances, that's it. The graph is still built from scratch on every request, but it is constructed using cached data. To save on this processing I could cache the graph itself, as well as its constituent pages. I gather this is how you thought it worked anyway?
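A hedged sketch of what caching the graph itself might look like, on top of the per-page cache shown above. `graphCache`, `getGraph`, `buildGraph` and `respond` are illustrative names rather than the project's actual API; only the idea of checking a cache before doing the expensive work mirrors the snippet above.

```js
var graphCache = {}; // requested URL -> previously built graph

function getGraph(url, buildGraph, respond) {
  if (graphCache[url]) {
    // A graph was already built for this URL: skip rebuilding it entirely.
    respond(graphCache[url]);
  } else {
    // Build from scratch; individual pages may still come from the page cache.
    buildGraph(url, function (graph) {
      graphCache[url] = graph;
      respond(graph);
    });
  }
}

// Example usage with a stand-in builder:
getGraph('example.com', function (url, done) {
  done({ root: url, links: [] }); // pretend this was an expensive crawl
}, function (graph) {
  console.log(graph);
});
```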
Ok, well in the interests of faster response times and reduced CPU usage, I can imagine caching like this (initially just in memory, although in
How does that sound?
I think I get what you're saying. I'll try and figure out the best way of doing this within the application's architecture.
Add link cache timer. Add graph caching. #23
Ok, I've built in a caching timer; now anything older than an hour is discarded. It's probably not perfect, but it's simple and effective. I've also mostly rewritten the server code to support caching of whole graphs, which uses the same module as the link caching. I've also renamed the
The changes have been merged into the master branch, and as soon as I've figured out how to get Jitsu onto my new Ubuntu laptop, I'll put them live there too. Now's probably a good time for me to document this thing, eh?
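A minimal sketch of the one-hour discard rule described above, assuming each cache entry records when it was stored. This is an illustration, not the project's actual caching module.

```js
var MAX_AGE_MS = 60 * 60 * 1000; // one hour

var entries = {}; // url -> { data: ..., storedAt: timestamp }

function store(url, data) {
  entries[url] = { data: data, storedAt: Date.now() };
}

function fetch(url) {
  var entry = entries[url];
  if (!entry) return null;
  if (Date.now() - entry.storedAt > MAX_AGE_MS) {
    delete entries[url]; // too old: discard and force a fresh scrape
    return null;
  }
  return entry.data;
}

// store('example.com', { title: 'Example' });
// fetch('example.com'); // -> { title: 'Example' } for the next hour, then null
```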
Great! jitsu:
Aspirational feature, i.e. not too important right now.
Summary: Support cached responses to a URL's network when the URL is already known as a secondary URL in a previously cached response.
The crawling required to discover a `rel=me` network of URLs is intensive and takes some time, so we want to avoid excessive processing as much as possible.

Currently, we cache the network based only on the originally requested URL, e.g. example.com. However, we should also index the data by every other URL in the network. E.g. if a request is later made to foo.com, and foo.com is already known to be part of example.com's network, then the previously cached data should be served (although, this time, foo.com will be considered the original URL, and so foo.com will not be included in the URL list, but example.com will be).
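A hedged sketch of the indexing proposed above: store one network record under every URL it contains, so a later request for foo.com can reuse the crawl that originally started at example.com. All names here are illustrative, not part of the existing codebase.

```js
var networkCache = {}; // any member URL -> shared network record

function cacheNetwork(urls) {
  var record = { urls: urls, cachedAt: Date.now() };
  urls.forEach(function (url) {
    networkCache[url] = record; // every member points at the same record
  });
}

function lookup(requestedUrl) {
  var record = networkCache[requestedUrl];
  if (!record) return null;
  // Treat the requested URL as the "original" and return the rest of the
  // network, as described above.
  return record.urls.filter(function (url) {
    return url !== requestedUrl;
  });
}

cacheNetwork(['example.com', 'foo.com', 'bar.com']);
console.log(lookup('foo.com')); // -> ['example.com', 'bar.com']
```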