[Spike] Use Flask-Caching for shared caching - prototype 2 #133
Conversation
Still trying to think through everything discussed here. One early idea related to a statement above:
Can we pass a per-dataset timeout config into the decorator, which then overrides the default timeout? That way we could set a timeout of 360s for one dataset to override the default 60s in the global config.
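Flask-Caching's `cache.memoize` does accept a per-decorator `timeout` argument, so a per-dataset override layered on a global default is plausible. The mechanism can be sketched with a stdlib-only stand-in (all names here are illustrative, not the actual Vizro API):

```python
import time
from functools import wraps

def memoize_with_timeout(timeout=60):
    """Cache a function's results for `timeout` seconds, keyed on its args."""
    def decorator(func):
        store = {}  # args -> (expires_at, value)
        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # cache hit: skip the expensive data loading query
            value = func(*args)
            store[args] = (now + timeout, value)
            return value
        return wrapper
    return decorator

calls = []

@memoize_with_timeout(timeout=360)  # hypothetical per-dataset override of the 60s default
def _load_lazy_data(name):
    calls.append(name)  # stands in for re-running the data loading query
    return f"data for {name}"

_load_lazy_data("iris")
result = _load_lazy_data("iris")  # second call within 360s is served from cache
```

With Flask-Caching itself this would look like `@cache.memoize(timeout=360)` on the loader, with `CACHE_DEFAULT_TIMEOUT` supplying the default.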
@lingyielia Yes, that's exactly what I was thinking. We can also set …
This reverts commit 8b7aeb8.
for more information, see https://pre-commit.ci
Superseded by #398. For the record:
Description
A couple of schemes that could simplify our caching solution. There are two commits here showing slightly different things:

Idea 1: 843c813

If we have lazy data then it bypasses `original_data` entirely. This removes the original behaviour that's in `main` that lazy data is only loaded on first access. This means that there's only one caching mechanism, which is the "proper" flask-caching one. Hence there's no need to clear up original data afterwards. If a dataframe is originally set as `original_data` then there's no caching and there's a copy of the data on each worker still - it's just lazy data that is treated differently here.

Idea 2: 8b7aeb8

Building on the last commit: can we simplify further and just treat both `original_data` and `lazy_data` as a single paradigm? I convert something that is a `pd.DataFrame` into a callable that returns a `pd.DataFrame` just by sticking `lambda` in front of it. Hence after that point there's no distinction between lazy vs. original data, which simplifies things even more.

At a glance, idea 1 here seems like an improvement on #116. Not sure whether/how idea 2 would work but it may be better still - needs some more thought though.
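The lambda trick in idea 2 can be sketched like this (a plain dict stands in for the `pd.DataFrame` so the sketch has no dependencies; `ensure_callable` is an illustrative name, not actual Vizro code):

```python
def ensure_callable(data):
    """Normalise original vs. lazy data into a single paradigm: callables."""
    if callable(data):
        return data          # lazy data: already a callable returning a DataFrame
    return lambda: data      # original data: wrap it so it looks lazy too

eager = ensure_callable({"x": [1, 2]})           # stand-in for an eager pd.DataFrame
lazy = ensure_callable(lambda: {"x": [1, 2]})    # stand-in for a lazy loader

# Downstream code no longer needs to distinguish the two cases:
frames = [eager(), lazy()]
```

Note the wrapped eager case still holds one copy of the data per worker, as described above; only the calling convention is unified.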
As it stands, there's no way to configure the caching on a per-dataset basis. So by default there are two options:

- `NullCache`: means that the `_load_lazy_data` call is effectively not cached at all, so every data loading query will be re-run
- `SimpleCache`: means that the `_load_lazy_data` call has some kind of caching, so reloading the same data does not re-run the query every time

There are actually two different scenarios where we care about caching:
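For reference, the two defaults above map onto Flask-Caching config like this (the `CACHE_TYPE` values and `CACHE_DEFAULT_TIMEOUT` key are from Flask-Caching's documented config; the variable names are just for illustration):

```python
# No caching at all: every request re-runs the data loading query.
null_cache_config = {"CACHE_TYPE": "NullCache"}

# In-memory cache, local to each worker process, with a cachelib-style
# default timeout in seconds.
simple_cache_config = {
    "CACHE_TYPE": "SimpleCache",
    "CACHE_DEFAULT_TIMEOUT": 300,
}
```

In a real app these dicts would be passed to `Cache(config=...)`; `SimpleCache` being per-process is exactly why consistency across workers is a concern below.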
In our discussion earlier I forgot all about scenario 1, which was stupid because this was actually the original motivation for the whole `original_data` thing 😅 You run the query once for the first figure to populate `original_data` from `lazy_data`, and then all subsequent requests for component data take from `original_data` rather than re-running the query. The problem with this is that it becomes impossible to ever re-run the query and refresh `original_data` consistently across workers.

Now in theory we could handle these two above scenarios differently by identifying whether the call for data is in the same HTTP request or not (you could tell this e.g. with the `X-Request-ID` header in the request, though I don't know whether this is really good practice). This would be fine from a consistency point of view because one HTTP request will always be processed on one worker. But it might be an unnecessary complication because we can maybe handle both scenarios in exactly the same way: by setting a sensible default timeout for the cache.

e.g. let's say we have the cachelib default timeout of 5 minutes. The proposed solution here means that a single HTTP request will only ever call `_load_lazy_data` once because multiple calls to it will be easily within 5 minutes of each other. And subsequent HTTP requests will also not re-run `_load_lazy_data` so long as they're within 5 minutes of the cache being populated originally.

In fact, I think that just being able to set timeouts differently for different datasets will get 90% towards answering the question of how we do data refreshes also: `_load_lazy_data` will be re-run on the next request after that.

The only thing this proposal misses compared to previous ideas is that in theory I think it could lead to inconsistent data on the screen, e.g. what happens if you have 10 figures on screen, and during one HTTP request after 5 calls to `_load_lazy_data` the cache expires, so the remaining 5 calls get different data? No idea whether this is likely to be a problem in reality, but maybe it suggests that we do need to somehow handle the two above scenarios slightly differently after all? In which case I think we do need the two layers of caching that #116 provides.

I know I'm having a conversation with myself here, but my latest thinking is:

- use `flask.g` or similar + our own cache, or e.g. `lru_cache`, or in fact just cachelib to achieve 1, and cachelib for 2
- `SimpleCache` as the default so that lazy data doesn't get fetched twice (at build time and at runtime immediately afterwards)
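The `flask.g` idea for scenario 1 (all figures in one HTTP request seeing identical data, regardless of cache expiry) can be sketched with a stdlib stand-in for the per-request scratch space; `RequestScope` and `load_once` are illustrative names, not Flask or Vizro API:

```python
class RequestScope:
    """Stand-in for flask.g: a fresh scratch space created per HTTP request."""
    def __init__(self):
        self._data = {}

    def load_once(self, name, loader):
        # The first call in this request runs the query; every later call in
        # the same request reuses the result, so e.g. 10 figures on screen
        # all see identical data even if the shared cache expires mid-request.
        if name not in self._data:
            self._data[name] = loader()
        return self._data[name]

queries = []

def run_query():
    queries.append(1)  # stands in for hitting the shared cachelib cache / database
    return "rows"

g = RequestScope()  # one per request, like flask.g
a = g.load_once("iris", run_query)
b = g.load_once("iris", run_query)
```

This is safe consistency-wise because, as noted above, one HTTP request is always processed on one worker; the second layer (cachelib with a timeout) then handles cross-request reuse.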
as the default so that lazy data doesn't get fetched twice (at build time and runtime immediately afterwards)Options for configuring per-dataset arguments:
Screenshot
Checklist
Enable feature XXX ([#1](https://github.com/mckinsey/vizro/pull/1))
(if applicable)
Types of changes
Notice
I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":