Caching
Archiver data is generally a good candidate for caching because it rarely changes once recorded. Caching is particularly useful for large processed results, such as data that has been integrated and downsampled. The myquery application attempts to set the proper HTTP cache headers so that clients (generally web browsers) cache aggressively, yet appropriately.
A reverse proxy server can provide a shared cache for all clients. A simple way to accomplish this is with Apache HTTPD mod_cache. On version 2.2.x the configuration (in the ssl.conf VirtualHost) might look like:
CacheRoot "/var/cache/apache/"
CacheEnable disk /myquery
Make sure to set up a directory for the cache:
cd /var/cache
mkdir apache
chown apache:apache apache
Then restart Apache.
Some data, such as ordinary historical raw (not downsampled) events, probably should not be stored in the shared cache, because the shared cache is a finite resource. Instead, raw archive data should be cached only by the client that requested it. The myquery application uses the HTTP header "Cache-Control: private" for all cacheable data except downsampled data, so that the shared cache contains only downsampled data.
Responses to queries that request data in the future generally cannot be cached, since the data may change. A common example is requesting the "current year" of a PV, where the end boundary of the interval is in the future. In this scenario the myquery application sets the HTTP cache header to no-store. Since the Archiver always returns zero results for future events, there is no reason to request an interval that overlaps the future (other than perhaps simplicity of query construction), and the cost of doing so can be significant for repeated queries.
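The header rules described above can be summarized in a short Python sketch. This is not the myquery implementation: the function name, its parameters, and the use of "public" for downsampled responses are assumptions made for illustration. The source only states that non-downsampled cacheable data is marked private, that future-overlapping queries get no-store, and that a one-year expiration is used.

```python
from datetime import datetime, timezone

# One year in seconds, matching the one-year expiration mentioned on this page.
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def cache_control(end: datetime, downsampled: bool, now: datetime) -> str:
    """Return a Cache-Control value for a query whose interval ends at `end`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if end > now:
        # The interval overlaps the future, so the response may still change.
        return "no-store"
    if downsampled:
        # Assumption: downsampled responses are allowed into the shared cache.
        return f"public, max-age={ONE_YEAR_SECONDS}"
    # Raw data is cached only by the requesting client.
    return f"private, max-age={ONE_YEAR_SECONDS}"
```

For example, with `now` fixed at 2024-06-01, a downsampled query ending 2024-01-01 would be eligible for the shared cache, while the same query ending 2025-01-01 would not be cached at all.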
One strategy to avoid future requests is to limit every query end boundary to no later than midnight at the start of "today" (that is, last night). Using a consistent end boundary increases the chance that a cached response can be reused (anyone requesting "current year" for a PV at any time "today" can receive the same cached response). If going several hours with stale (missing) data is unacceptable to the client, the most recent hour could be used instead of midnight, though the cached response is then much less likely to be reused.
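A client could apply the end-boundary strategy above before building the query. The following is a minimal sketch; the helper name `clamp_end` and its `granularity` parameter are hypothetical and not part of myquery.

```python
from datetime import datetime


def clamp_end(end: datetime, now: datetime, granularity: str = "day") -> datetime:
    """Limit `end` to the most recent midnight (or top of the hour) before `now`.

    Hypothetical client-side helper; names are illustrative only.
    """
    if granularity == "day":
        # Midnight at the start of "today".
        boundary = now.replace(hour=0, minute=0, second=0, microsecond=0)
    elif granularity == "hour":
        # Start of the current hour: fresher data, but less cache reuse.
        boundary = now.replace(minute=0, second=0, microsecond=0)
    else:
        raise ValueError("granularity must be 'day' or 'hour'")
    # An end boundary that is already in the past is left unchanged.
    return min(end, boundary)
```

Every client that calls `clamp_end` during the same day (or hour) produces the same end boundary, so identical queries map to the same cached response.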
An even more complicated strategy is to partition requests along a consistent "cache-line", say a week, which would make cached responses even more reusable. This is not possible for every query, because some results cannot simply be "glued" together; integrated data, for example, depends on prior events within the requested interval bounds.
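As a rough illustration of the cache-line idea, the sketch below splits a requested interval at week boundaries. The choice of Monday 00:00 as the boundary and the function name `week_chunks` are assumptions; any consistent alignment would serve. It applies only to queries whose results can be concatenated, such as raw events.

```python
from datetime import datetime, timedelta


def week_chunks(begin: datetime, end: datetime) -> list[tuple[datetime, datetime]]:
    """Split [begin, end) into pieces that break at Monday 00:00 boundaries.

    Hypothetical helper; the weekly alignment is an illustrative assumption.
    """
    chunks = []
    cursor = begin
    while cursor < end:
        # Monday 00:00 of the week that contains `cursor`.
        week_start = (cursor - timedelta(days=cursor.weekday())).replace(
            hour=0, minute=0, second=0, microsecond=0
        )
        next_boundary = week_start + timedelta(weeks=1)
        piece_end = min(next_boundary, end)
        chunks.append((cursor, piece_end))
        cursor = piece_end
    return chunks
```

Only the pieces that span a full week are shared between different requests; the partial pieces at either edge depend on the caller's exact boundaries and are less likely to be reused.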
Web browser caches generally have reasonable size limits and an eviction policy (first in, first out, or something more clever). Apache mod_disk_cache does NOT set limits and only removes expired content. Once the OS disk partition is full (and the OS is throwing errors), Apache will ignore the cache altogether. A cron job should be configured to run the Apache htcacheclean application periodically.
Currently the maximum HTTP expiration duration of one year is used whenever caching is enabled. Perhaps a custom client-cache-hint header could advise the server how reusable a generated response is likely to be, so that the server could choose a variable expiration duration? Or perhaps the server should simply use a smaller value than the one-year maximum?