
[Spike] Use Flask-Caching for shared caching - prototype 2 #133

Closed
wants to merge 18 commits into main from dev/flask_caching-AM

Conversation

antonymilne
Contributor

@antonymilne antonymilne commented Oct 26, 2023

Description

A couple of schemes that could simplify our caching solution. There are two commits here showing slightly different things:

Idea 1: 843c813

If we have lazy data then it bypasses original_data entirely. This removes the behaviour currently in main where lazy data is only loaded on first access. This means that there's only one caching mechanism, which is the "proper" flask-caching one, and hence no need to clean up original data afterwards. If a dataframe is set directly as original_data then there's no caching and there's still a copy of the data on each worker - only lazy data is treated differently here.
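A minimal sketch of what Idea 1 amounts to (the registry name _lazy_loaders is a hypothetical illustration, not the actual code in the commit):

    from typing import Callable, Dict

    import pandas as pd
    from flask_caching import Cache

    _cache = Cache()  # to be initialised against the Flask app elsewhere

    # Hypothetical registry of user-supplied loading functions; stands in for
    # whatever the data manager stores internally.
    _lazy_loaders: Dict[str, Callable[[], pd.DataFrame]] = {}

    @_cache.memoize()
    def _load_lazy_data(dataset_name: str) -> pd.DataFrame:
        # The single caching layer: flask-caching memoizes on dataset_name, so
        # with a shared backend (e.g. Redis) every worker reuses the same result.
        return _lazy_loaders[dataset_name]()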

Idea 2: 8b7aeb8

Building on the last commit: can we simplify further and treat both original_data and lazy_data as a single paradigm? I convert something that is a pd.DataFrame into a callable that returns a pd.DataFrame just by sticking a lambda in front of it. After that point there's no distinction between lazy and original data, which simplifies things even more.
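In code, the normalisation is essentially this (a sketch; the helper name is made up):

    import pandas as pd

    def _as_callable(data):
        """Hypothetical helper: wrap a static DataFrame so it looks like lazy data."""
        if isinstance(data, pd.DataFrame):
            # Stick a lambda in front of it: from here on, both kinds of data
            # are just callables that return a pd.DataFrame.
            return lambda: data
        return data  # already a callable returning a DataFrame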


At a glance, Idea 1 here seems like an improvement on #116. Not sure whether/how Idea 2 would work, but it may be better still - it needs some more thought though.

As it stands, there's no way to configure the caching on a per-dataset basis. So by default there are two options (illustrated in the sketch below the list):

  • NullCache: means that the _load_lazy_data call is effectively not cached at all, so every data loading query will be re-run
  • SimpleCache: means that the _load_lazy_data call has some kind of caching, so reloading the same data does not re-run the query every time
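For illustration, switching between the two could look like this, in the style of the config snippets later in this thread:

    from vizro.managers import data_manager  # assuming the usual Vizro import path

    # NullCache: every call to _load_lazy_data re-runs the data loading query.
    data_manager._cache.config = {"CACHE_TYPE": "NullCache"}

    # SimpleCache: results are cached in-process, so repeated loads are served
    # from memory until the timeout expires.
    data_manager._cache.config = {"CACHE_TYPE": "SimpleCache"}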

There are actually two different scenarios where we care about caching:

  1. A single HTTP request for a page that contains 10 figures, all of which refer to the same underlying dataset.
  2. Subsequent HTTP requests (from the same user or different ones - we don't distinguish) that want to load the same data.

In our discussion earlier I forgot all about scenario 1, which was stupid because it was actually the original motivation for the whole original_data thing 😅 You run the query once for the first figure to populate original_data from lazy_data, and then all subsequent requests for component data read from original_data rather than re-running the query. The problem with this is that it becomes impossible to ever re-run the query and refresh original_data consistently across workers.

Now in theory we could handle these two scenarios differently by identifying whether the call for data is made within the same HTTP request or not (you could tell this e.g. with an X-Request-ID header in the request, though I don't know whether this is really good practice; see the sketch below). This would be fine from a consistency point of view because one HTTP request will always be processed on one worker. But it might be an unnecessary complication because we can maybe handle both scenarios in exactly the same way: by setting a sensible default timeout for the cache.
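A sketch of what that identification might look like (this assumes a reverse proxy injects X-Request-ID; Flask itself doesn't guarantee the header exists):

    import uuid

    from flask import request

    def _request_id() -> str:
        # Group data loads belonging to one HTTP request. Falling back to a
        # fresh UUID when the header is missing means no grouping happens,
        # which is the safe default.
        return request.headers.get("X-Request-ID") or str(uuid.uuid4())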

e.g. let's say we have the cachelib default timeout of 5 minutes. The proposed solution here means that a single HTTP request will only ever call _load_lazy_data once, because multiple calls to it will easily fall within 5 minutes of each other. And subsequent HTTP requests will also not re-run _load_lazy_data so long as they're within 5 minutes of the cache being populated originally.

In fact, I think that just being able to set timeouts differently for different datasets gets us 90% of the way towards answering the question of how we do data refreshes too:

  • if you set timeout = 5 minutes then naturally the cache expires after 5 minutes, and hence _load_lazy_data will be re-run on the next request after that
  • if you set timeout = 0 then the cache never expires
  • I don't know whether we'll need an option to disable the cache entirely for a dataset, because you can achieve essentially the same thing just by setting a very low timeout. But it is possible to disable caching for particular arguments entirely in flask-caching, which is probably cleaner - see the sketch after this list
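For illustration, the three behaviours with flask-caching's memoize (the function names here are hypothetical):

    from flask_caching import Cache

    cache = Cache(config={"CACHE_TYPE": "SimpleCache"})

    @cache.memoize(timeout=300)  # cache expires after 5 minutes, query re-runs
    def load_refreshing_data():
        ...

    @cache.memoize(timeout=0)  # cache never expires
    def load_static_data():
        ...

    @cache.memoize(unless=lambda: True)  # caching bypassed entirely
    def load_uncached_data():
        ...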

The only thing this proposal misses compared to previous ideas is that in theory I think it could lead to inconsistent data on screen. e.g. what happens if you have 10 figures on screen and the cache expires during one HTTP request, after 5 calls to _load_lazy_data? The remaining 5 calls would get different data. No idea whether this is likely to be a problem in reality, but maybe it suggests that we do need to somehow handle the two above scenarios slightly differently after all? In which case I think we do need the two layers of caching that #116 provides.


I know I'm having a conversation with myself here, but my latest thinking is:

  • we do need two layers of caching: 1. to ensure consistency on the same worker within one HTTP request and 2. to improve performance and share memory over multiple requests
  • let's try using flask.g or similar plus our own cache (e.g. lru_cache, or in fact just cachelib) to achieve 1, and cachelib for 2 - see the sketch after this list
  • we need to think about how to handle the application building process (outside any requests) and whether to warm up the cache - probably not worth doing manually, but this points towards SimpleCache as the default so that lazy data doesn't get fetched twice (at build time and again at runtime immediately afterwards)
  • I'll do a new prototype of this when I get a chance. It will be a bit more complex than the code here, but I think it could well be the best solution
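A rough sketch of the two-layer idea, under the assumption that flask.g (which lives for exactly one request) gives us layer 1 for free; all names here are made up for illustration:

    import pandas as pd
    from cachelib import SimpleCache
    from flask import g

    _shared_cache = SimpleCache(default_timeout=300)  # layer 2: shared across requests

    def _run_query(dataset_name: str) -> pd.DataFrame:
        # Stand-in for the user-supplied, potentially expensive loading function.
        return pd.DataFrame({"dataset": [dataset_name]})

    def _load_data(dataset_name: str) -> pd.DataFrame:
        # Layer 1: flask.g is scoped to the current request, so all figures in
        # that request see identical data even if layer 2 expires mid-request.
        request_cache = g.setdefault("vizro_data", {})  # hypothetical key
        if dataset_name not in request_cache:
            data = _shared_cache.get(dataset_name)
            if data is None:
                data = _run_query(dataset_name)
                _shared_cache.set(dataset_name, data)
            request_cache[dataset_name] = data
        return request_cache[dataset_name]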

Options for configuring per-dataset arguments:

data_manager["gapminder"] = retrieve_gapminder
data_manager["gapminder"]._cache_arguments = {"timeout": 600}

data_manager["gapminder2"] = retrieve_gapminder
data_manager["gapminder2"]._cache_arguments = {
    "timeout": 0,
    "unless": (lambda: True)
}

Screenshot

Checklist

  • I have not referenced individuals, products or companies in any commits, directly or indirectly
  • I have not added data or restricted code in any commits, directly or indirectly
  • I have updated the docstring of any public function/class/model changed
  • I have added the PR number to the change description in the changelog fragment, e.g. Enable feature XXX ([#1](https://github.com/mckinsey/vizro/pull/1)) (if applicable)
  • I have added tests to cover my changes (if applicable)

Types of changes

  • Docs/refactoring (non-breaking change which improves codebase)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

    • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
    • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorized to submit this contribution on behalf of the original creator(s) or their licensees.
    • I certify that the use of this contribution as authorized by the Apache 2.0 license does not violate the intellectual property rights of anyone else.

@antonymilne antonymilne changed the title Dev/flask caching am [Spike] using Flask-Caching for shared caching - prototype 2 Oct 26, 2023
@lingyielia
Contributor

Still trying to think through everything discussed here. One early idea, related to a statement above:

As it stands, there's no way to configure the caching on a per-dataset basis.

Can we pass a per-dataset timeout config into the decorator, which then overwrites the default timeout? For example, we can do this to set the timeout to 360s, overwriting the default of 60s:

    @_cache.memoize(timeout=360)
    def _load_lazy_data(self, dataset_name: DatasetName) -> pd.DataFrame:
        ...

while the default config is:

    data_manager._cache.config = {
        "CACHE_TYPE": "RedisCache",
        "CACHE_REDIS_URL": "redis://localhost:6379/0",
        "CACHE_DEFAULT_TIMEOUT": 60,
    }

@antonymilne
Contributor Author

@lingyielia Yes, that's exactly what I was thinking. We can also set unless to disable caching for particular datasets altogether. This shouldn't be hard to do - I just didn't want to complicate the above any further by considering it, since it should be easy to add afterwards.

@antonymilne antonymilne mentioned this pull request Nov 9, 2023
@huong-li-nguyen huong-li-nguyen changed the title [Spike] using Flask-Caching for shared caching - prototype 2 [Spike] Use Flask-Caching for shared caching - prototype 2 Feb 23, 2024
@antonymilne
Contributor Author

Superseded by #398.

For the record:

  • went with Idea 1. Tried out Idea 2 but decided it complicated things because it subjects static datasets to the same cache as reloadable ones. Instead created _StaticDataset (in contrast to _Dataset)
  • there's no built-in behaviour that stops the cache from expiring within a given request, and no good way to extend the timeout. I was correct in thinking that accessing the cache doesn't change its expiry time - see https://github.com/pallets-eco/flask-caching/blob/master/src/flask_caching/__init__.py
  • hence in theory we could have inconsistent data on the screen if the cache expiry is unfortunately timed. Note that even if one action is one HTTP request, one action chain can span multiple requests. We can associate a unique ID with an action chain, but ultimately we will need a user session to make sure there's consistency?

@antonymilne antonymilne closed this Apr 3, 2024
@lingyielia lingyielia deleted the dev/flask_caching-AM branch July 26, 2024 06:09