
[Spike] Use Flask-Caching for shared caching - prototype 2 #133

Closed
wants to merge 18 commits into main from dev/flask_caching-AM

Conversation

antonymilne
Contributor

@antonymilne antonymilne commented Oct 26, 2023

Description

A couple of schemes that could simplify our caching solution. There are two commits here showing slightly different things:

Idea 1: 843c813

If we have lazy data then it bypasses original_data entirely. This removes the behaviour currently in main where lazy data is only loaded on first access. This means that there's only one caching mechanism, which is the "proper" flask-caching one, and hence no need to clean up original data afterwards. If a dataframe is set directly as original_data then there's no caching and there's still a copy of the data on each worker - only lazy data is treated differently here.
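A minimal sketch of what Idea 1 amounts to (the registry name _lazy_loaders is a hypothetical illustration, not the actual code in the commit):

    from typing import Callable, Dict

    import pandas as pd
    from flask_caching import Cache

    _cache = Cache()  # to be initialised against the Flask app elsewhere

    # Hypothetical registry of user-supplied loading functions; stands in for
    # whatever the data manager stores internally.
    _lazy_loaders: Dict[str, Callable[[], pd.DataFrame]] = {}

    @_cache.memoize()
    def _load_lazy_data(dataset_name: str) -> pd.DataFrame:
        # The single caching layer: flask-caching memoizes on dataset_name, so
        # with a shared backend (e.g. Redis) every worker reuses the same result.
        return _lazy_loaders[dataset_name]()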

Idea 2: 8b7aeb8

Building on the last commit: can we simplify further and treat both original_data and lazy_data as a single paradigm? I convert something that is a pd.DataFrame into a callable that returns a pd.DataFrame just by sticking a lambda in front of it. After that point there's no distinction between lazy and original data, which simplifies things even more.
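In code, the normalisation is essentially this (a sketch; the helper name is made up):

    import pandas as pd

    def _as_callable(data):
        """Hypothetical helper: wrap a static DataFrame so it looks like lazy data."""
        if isinstance(data, pd.DataFrame):
            # Stick a lambda in front of it: from here on, both kinds of data
            # are just callables that return a pd.DataFrame.
            return lambda: data
        return data  # already a callable returning a DataFrame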


At a glance, Idea 1 here seems like an improvement on #116. Not sure whether/how Idea 2 would work, but it may be better still - it needs some more thought though.

As it stands, there's no way to configure the caching on a per-dataset basis. So by default there are two options (illustrated in the sketch below the list):

  • NullCache: means that the _load_lazy_data call is effectively not cached at all, so every data loading query will be re-run
  • SimpleCache: means that the _load_lazy_data call has some kind of caching, so reloading the same data does not re-run the query every time
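For illustration, switching between the two could look like this, in the style of the config snippets later in this thread:

    from vizro.managers import data_manager  # assuming the usual Vizro import path

    # NullCache: every call to _load_lazy_data re-runs the data loading query.
    data_manager._cache.config = {"CACHE_TYPE": "NullCache"}

    # SimpleCache: results are cached in-process, so repeated loads are served
    # from memory until the timeout expires.
    data_manager._cache.config = {"CACHE_TYPE": "SimpleCache"}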

There are actually two different scenarios where we care about caching:

  1. A single HTTP request for a page that contains 10 figures, all of which refer to the same underlying dataset.
  2. Subsequent HTTP requests (from the same user or different ones - we don't distinguish) that want to load the same data.

In our discussion earlier I forgot all about scenario 1, which was stupid because it was actually the original motivation for the whole original_data thing 😅 You run the query once for the first figure to populate original_data from lazy_data, and then all subsequent requests for component data read from original_data rather than re-running the query. The problem with this is that it becomes impossible to ever re-run the query and refresh original_data consistently across workers.

Now in theory we could handle these two scenarios differently by identifying whether the call for data is made within the same HTTP request or not (you could tell this e.g. with an X-Request-ID header in the request, though I don't know whether this is really good practice; see the sketch below). This would be fine from a consistency point of view because one HTTP request will always be processed on one worker. But it might be an unnecessary complication because we can maybe handle both scenarios in exactly the same way: by setting a sensible default timeout for the cache.
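A sketch of what that identification might look like (this assumes a reverse proxy injects X-Request-ID; Flask itself doesn't guarantee the header exists):

    import uuid

    from flask import request

    def _request_id() -> str:
        # Group data loads belonging to one HTTP request. Falling back to a
        # fresh UUID when the header is missing means no grouping happens,
        # which is the safe default.
        return request.headers.get("X-Request-ID") or str(uuid.uuid4())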

e.g. let's say we have the cachelib default timeout of 5 minutes. The proposed solution here means that a single HTTP request will only ever call _load_lazy_data once, because multiple calls to it will easily fall within 5 minutes of each other. And subsequent HTTP requests will also not re-run _load_lazy_data so long as they're within 5 minutes of the cache being populated originally.

In fact, I think that just being able to set timeouts differently for different datasets gets us 90% of the way towards answering the question of how we do data refreshes too:

  • if you set timeout = 5 minutes then naturally the cache expires after 5 minutes, and hence _load_lazy_data will be re-run on the next request after that
  • if you set timeout = 0 then the cache never expires
  • I don't know whether we'll need an option to disable the cache entirely for a dataset, because you can achieve essentially the same thing just by setting a very low timeout. But it is possible to disable caching for particular arguments entirely in flask-caching, which is probably cleaner - see the sketch after this list
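For illustration, the three behaviours with flask-caching's memoize (the function names here are hypothetical):

    from flask_caching import Cache

    cache = Cache(config={"CACHE_TYPE": "SimpleCache"})

    @cache.memoize(timeout=300)  # cache expires after 5 minutes, query re-runs
    def load_refreshing_data():
        ...

    @cache.memoize(timeout=0)  # cache never expires
    def load_static_data():
        ...

    @cache.memoize(unless=lambda: True)  # caching bypassed entirely
    def load_uncached_data():
        ...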

The only thing this proposal misses compared to previous ideas is that in theory I think it could lead to inconsistent data on screen. e.g. what happens if you have 10 figures on screen and the cache expires during one HTTP request, after 5 calls to _load_lazy_data? The remaining 5 calls would get different data. No idea whether this is likely to be a problem in reality, but maybe it suggests that we do need to somehow handle the two above scenarios slightly differently after all? In which case I think we do need the two layers of caching that #116 provides.


I know I'm having a conversation with myself here, but my latest thinking is:

  • we do need two layers of caching: 1. to ensure consistency on the same worker within one HTTP request and 2. to improve performance and share memory over multiple requests
  • let's try using flask.g or similar plus our own cache (e.g. lru_cache, or in fact just cachelib) to achieve 1, and cachelib for 2 - see the sketch after this list
  • we need to think about how to handle the application building process (outside any requests) and whether to warm up the cache - probably not worth doing manually, but this points towards SimpleCache as the default so that lazy data doesn't get fetched twice (at build time and again at runtime immediately afterwards)
  • I'll do a new prototype of this when I get a chance. It will be a bit more complex than the code here, but I think it could well be the best solution
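A rough sketch of the two-layer idea, under the assumption that flask.g (which lives for exactly one request) gives us layer 1 for free; all names here are made up for illustration:

    import pandas as pd
    from cachelib import SimpleCache
    from flask import g

    _shared_cache = SimpleCache(default_timeout=300)  # layer 2: shared across requests

    def _run_query(dataset_name: str) -> pd.DataFrame:
        # Stand-in for the user-supplied, potentially expensive loading function.
        return pd.DataFrame({"dataset": [dataset_name]})

    def _load_data(dataset_name: str) -> pd.DataFrame:
        # Layer 1: flask.g is scoped to the current request, so all figures in
        # that request see identical data even if layer 2 expires mid-request.
        request_cache = g.setdefault("vizro_data", {})  # hypothetical key
        if dataset_name not in request_cache:
            data = _shared_cache.get(dataset_name)
            if data is None:
                data = _run_query(dataset_name)
                _shared_cache.set(dataset_name, data)
            request_cache[dataset_name] = data
        return request_cache[dataset_name]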

Options for configuring per-dataset arguments:

data_manager["gapminder"] = retrieve_gapminder
data_manager["gapminder"]._cache_arguments = {"timeout": 600}

data_manager["gapminder2"] = retrieve_gapminder
data_manager["gapminder2"]._cache_arguments = {
    "timeout": 0,
    "unless": (lambda: True)
}

Screenshot

Checklist

  • I have not referenced individuals, products or companies in any commits, directly or indirectly
  • I have not added data or restricted code in any commits, directly or indirectly
  • I have updated the docstring of any public function/class/model changed
  • I have added the PR number to the change description in the changelog fragment, e.g. Enable feature XXX ([#1](https://github.com/mckinsey/vizro/pull/1)) (if applicable)
  • I have added tests to cover my changes (if applicable)

Types of changes

  • Docs/refactoring (non-breaking change which improves codebase)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

    • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
    • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorized to submit this contribution on behalf of the original creator(s) or their licensees.
    • I certify that the use of this contribution as authorized by the Apache 2.0 license does not violate the intellectual property rights of anyone else.

@antonymilne antonymilne changed the title Dev/flask caching am [Spike] using Flask-Caching for shared caching - prototype 2 Oct 26, 2023
@lingyielia
Contributor

Still trying to think through everything discussed here. One early idea, related to a statement above:

As it stands, there's no way to configure the caching on a per-dataset basis.

Can we pass a per-dataset timeout config into the decorator, which then overwrites the default timeout? For example, we can do this to set the timeout to 360s, overwriting the default of 60s:

    @_cache.memoize(timeout=360)
    def _load_lazy_data(self, dataset_name: DatasetName) -> pd.DataFrame:
        ...

while the default config is:

    data_manager._cache.config = {
        "CACHE_TYPE": "RedisCache",
        "CACHE_REDIS_URL": "redis://localhost:6379/0",
        "CACHE_DEFAULT_TIMEOUT": 60,
    }

@antonymilne
Contributor Author

@lingyielia Yes, that's exactly what I was thinking. We can also set unless to disable caching for particular datasets altogether. This shouldn't be hard to do - I just didn't want to complicate the above any further by considering it, since it should be easy to add afterwards.

@antonymilne antonymilne mentioned this pull request Nov 9, 2023
@huong-li-nguyen huong-li-nguyen changed the title [Spike] using Flask-Caching for shared caching - prototype 2 [Spike] Use Flask-Caching for shared caching - prototype 2 Feb 23, 2024
@antonymilne
Contributor Author

Superseded by #398.

For the record:

  • went with Idea 1. Tried out Idea 2 but decided it complicated things because it subjects static datasets to the same cache as reloadable ones. Instead created _StaticDataset (in contrast to _Dataset)
  • there's no built-in behaviour that stops the cache from expiring within a given request, and no good way to extend the timeout. I was correct in thinking that accessing the cache doesn't change its expiry time - see https://github.com/pallets-eco/flask-caching/blob/master/src/flask_caching/__init__.py
  • hence in theory we could have inconsistent data on the screen if the cache expiry is unfortunately timed. Note that even if one action is one HTTP request, one action chain can span multiple requests. We can associate a unique ID with an action chain, but ultimately we will need a user session to make sure there's consistency?

@antonymilne antonymilne closed this Apr 3, 2024
@lingyielia lingyielia deleted the dev/flask_caching-AM branch July 26, 2024 06:09