Support cached responses for any URL within a crawled network #23
Or, is this a bad idea, since this could lead to, for example, a domain that is 10 crawl steps away from the original request's domain? (See the issue on crawl depth, #24.) If the secondary domain had been crawled, it would have 10 more crawl steps in which to attempt to find other domains that were not in the previously cached network list.

The issue I raised in my previous comment is not actually a problem, since the second URL, which was 10 crawl steps away from the first, would reset the counter to zero when found, and it would then become the starting point for a new crawl. Hence, the resultant network will always be the same, no matter which URL in the network was originally requested. So, the whole network should indeed be cached, with a request for any of its URLs able to retrieve the cached data.
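A minimal sketch of the symmetry argument above: if the crawl resets its depth counter at every newly discovered URL, it behaves like a plain graph traversal, so the set of URLs reached is the same from any starting point. The `links` map below is a hypothetical stand-in for scraped `rel=me` links, not project code.

```js
var links = {
  'example.com': ['foo.com'],
  'foo.com': ['example.com', 'bar.com'],
  'bar.com': ['foo.com']
};

function crawlNetwork(start) {
  var seen = {};
  var queue = [start];
  while (queue.length) {
    var url = queue.shift();
    if (seen[url]) continue;
    seen[url] = true;
    (links[url] || []).forEach(function (next) {
      if (!seen[next]) queue.push(next);
    });
  }
  return Object.keys(seen).sort();
}

// Both calls yield ['bar.com', 'example.com', 'foo.com'], so one cached
// network could serve a request for any of its member URLs.
console.log(crawlNetwork('example.com'));
console.log(crawlNetwork('bar.com'));
```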
Sorry Prem, I realize my lack of documentation is a major problem here. At the moment, the cacher works by taking the place of the scraper if it contains data pertaining to a URL. Here's the code in question in the Page class:

```js
if (cache.has(self.url)) {
  cache.fetch(self.url, populate);
} else {
  scraper.scrape(self.url, populate);
}
```

I should stress that graphs are not cached; pages which contain links are. I.e., Node Socialgraph caches not the graph generated by the requested domain, but the individual pages which constitute the graph as generated by the requested domain. The cache in its current incarnation is purely concerned with preventing the creation of JSDOM instances, that's it. The graph is still built from scratch on every request, but it is constructed using cached data. To save on this processing I could cache the graph itself, as well as its constituent pages. I gather this is how you thought it worked anyway?
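A hedged sketch of what caching the graph itself might look like, on top of the per-page cache shown above. `graphCache`, `getGraph`, `buildGraph` and `respond` are illustrative names rather than the project's actual API; only the idea of checking a cache before doing the expensive work mirrors the snippet above.

```js
var graphCache = {}; // requested URL -> previously built graph

function getGraph(url, buildGraph, respond) {
  if (graphCache[url]) {
    // A graph was already built for this URL: skip rebuilding it entirely.
    respond(graphCache[url]);
  } else {
    // Build from scratch; individual pages may still come from the page cache.
    buildGraph(url, function (graph) {
      graphCache[url] = graph;
      respond(graph);
    });
  }
}

// Example usage with a stand-in builder:
getGraph('example.com', function (url, done) {
  done({ root: url, links: [] }); // pretend this was an expensive crawl
}, function (graph) {
  console.log(graph);
});
```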
Ok, well in the interests of faster response times and reduced CPU usage, I can imagine caching like this (initially just in memory, although in
How does that sound?
I think I get what you're saying. I'll try and figure out the best way of doing this within the application's architecture.
Add link cache timer. Add graph caching. #23
Ok, I've built in a caching timer; now anything older than an hour is discarded. It's probably not perfect, but it's simple and effective. I've also mostly rewritten the server code to support caching of whole graphs, which uses the same module as the link caching. I've also renamed the
The changes have been merged into the master branch, and as soon as I've figured out how to get Jitsu onto my new Ubuntu laptop, I'll put them live there too. Now's probably a good time for me to document this thing, eh?
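A minimal sketch of the one-hour discard rule described above, assuming each cache entry records when it was stored. This is an illustration, not the project's actual caching module.

```js
var MAX_AGE_MS = 60 * 60 * 1000; // one hour

var entries = {}; // url -> { data: ..., storedAt: timestamp }

function store(url, data) {
  entries[url] = { data: data, storedAt: Date.now() };
}

function fetch(url) {
  var entry = entries[url];
  if (!entry) return null;
  if (Date.now() - entry.storedAt > MAX_AGE_MS) {
    delete entries[url]; // too old: discard and force a fresh scrape
    return null;
  }
  return entry.data;
}

// store('example.com', { title: 'Example' });
// fetch('example.com'); // -> { title: 'Example' } for the next hour, then null
```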
Great! jitsu:
Aspirational feature, i.e. not too important right now.
Summary: Support cached responses to a URL's network when the URL is already known as a secondary URL in a previously cached response.
The crawling required to discover a `rel=me` network of URLs is intensive and takes some time, so we want to avoid excessive processing as much as possible.

Currently, we cache the network based only on the originally requested URL, e.g. example.com. However, we should also index the data by every other URL in the network. E.g. if a request is later made to foo.com, and foo.com is already known to be part of example.com's network, then the previously cached data should be served (although, this time, foo.com will be considered the original URL, and so foo.com will not be included in the URL list, but example.com will be).
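A hedged sketch of the indexing proposed above: store one network record under every URL it contains, so a later request for foo.com can reuse the crawl that originally started at example.com. All names here are illustrative, not part of the existing codebase.

```js
var networkCache = {}; // any member URL -> shared network record

function cacheNetwork(urls) {
  var record = { urls: urls, cachedAt: Date.now() };
  urls.forEach(function (url) {
    networkCache[url] = record; // every member points at the same record
  });
}

function lookup(requestedUrl) {
  var record = networkCache[requestedUrl];
  if (!record) return null;
  // Treat the requested URL as the "original" and return the rest of the
  // network, as described above.
  return record.urls.filter(function (url) {
    return url !== requestedUrl;
  });
}

cacheNetwork(['example.com', 'foo.com', 'bar.com']);
console.log(lookup('foo.com')); // -> ['example.com', 'bar.com']
```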