[cache] Using the query as the basis of the cache key #4016
@@ -203,8 +203,6 @@ def query_obj(self):
    @property
    def cache_timeout(self):
        if self.form_data.get('cache_timeout'):
Given that the cache is explorer view agnostic, the cache timeout is only associated with the query and not the explorer form_data.
b09ca5e to 88c341e
If using the query as the cache key, then we should use the query results as the payload.
@mistercrunch shouldn't the payload which includes annotations etc. remain unchanged per https://github.com/apache/incubator-superset/blob/master/superset/views/core.py#L1082. Wouldn't an augmented response to the
My point is there are numerous controls that don't affect the query but do affect the pandas logic. Take the rolling averages on line charts for instance: same SQL, different backend pandas logic... So there are two potential approaches here:
88c341e to fbe50c6
@mistercrunch sorry for the delay in responding to your comment. The PR is doing what you suggest as the first potential approach, i.e., the whole payload is not cached, only the data which is fetched from the database. The cache key is the deterministic database query.
fbe50c6 to 1ca47ea
superset/viz.py
Outdated
        merge_extra_filters(form_data)
        s = str([(k, form_data[k]) for k in sorted(form_data.keys())])
    def cache_key(self, query_obj):
        s = self.datasource.get_query_str(query_obj)
This makes it such that get_query_str needs to be deterministic. I think that's the case. Let's note that if we see unexpected cache misses in the future, it may be related to get_query_str not being fully deterministic.
Correct. I'll add a docstring to provide more context.
👍 This looks good to me. I'll let @graceguo-supercat push the merge button so that you can fit it best in your internal release schedule.
5e4940a to 96542e1
superset/viz.py
Outdated
        form_data = self.form_data.copy()
        merge_extra_filters(form_data)
        s = str([(k, form_data[k]) for k in sorted(form_data.keys())])
    def cache_key(self, query_obj):
We probably need to include the datasource itself => you can have 2 databases with the same tables, which as far as I can tell would cause collisions
Good catch, I think so too.
@fabianmenges/@mistercrunch I agree. I made the cache key a tuple of the database name and query string. PTAL.
96542e1 to f4a8c6b
LGTM
superset/viz.py
Outdated
        return hashlib.md5(
            json.dumps((
                self.datasource.database.database_name,
I think just using the Database name as part of the cache key can cause authorization issues. I think we need to include the superset Datasource unique id.
E.g. I'm authorized to only query a single schema in a Database but data for a schema that I don't have access to is cached.
Also the database name can still collide with other database names on a different server.
@fabianmenges I've made the change to use the datasource ID (which indirectly has the database information encoded in it).
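As a rough illustration of the key described above, the sketch below hashes the pair of datasource ID and query string. The helper name `make_cache_key` is hypothetical, and the string is encoded after `json.dumps` so the example runs on Python 3; the PR's own code may differ in these details.

```python
import hashlib
import json


def make_cache_key(datasource_id, query_str):
    """Hash the (datasource id, query string) pair into a hex digest."""
    # json.dumps gives a stable text form of the pair; md5 is used only
    # to shorten the key, not for any security purpose.
    payload = json.dumps([datasource_id, query_str])
    return hashlib.md5(payload.encode('utf-8')).hexdigest()


query = 'SELECT COUNT(*) FROM events'
key_a = make_cache_key(1, query)   # datasource 1
key_b = make_cache_key(2, query)   # same SQL, different datasource
```

The same SQL issued against two datasources yields two distinct keys, which avoids the collision raised in the review.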
f4a8c6b to f3771a1
superset/viz.py
Outdated
        return hashlib.md5(
            json.dumps((
                self.datasource.id,
                self.datasource.get_query_str(query_obj).encode('utf-8'),
I think this is probably already an issue with the existing setup, but you may also want to consider cases where different users may see different results for the same query and datasource based on out of band information, such as a user that gets passed through in the query context (e.g. row-level permission in a view).
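One possible way to address this concern, which is not something the PR itself does, would be to fold any out-of-band context into the key. The sketch below assumes a hypothetical `user_scope` value (for example, the username passed through to a row-level-permission view); both the parameter and the helper name are illustrative only.

```python
import hashlib
import json


def make_scoped_cache_key(datasource_id, query_str, user_scope=None):
    """Hash datasource id, query string, and an optional per-user scope."""
    # user_scope is a hypothetical value; leaving it as None keeps the
    # entry shared by every user who runs the same query.
    payload = json.dumps([datasource_id, query_str, user_scope])
    return hashlib.md5(payload.encode('utf-8')).hexdigest()


query = 'SELECT * FROM orders_view'
shared_key = make_scoped_cache_key(7, query)
alice_key = make_scoped_cache_key(7, query, user_scope='alice')
bob_key = make_scoped_cache_key(7, query, user_scope='bob')
```

The trade-off is a lower hit rate: each distinct scope gets its own cache entry even when the underlying rows happen to be identical.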
ddf1a92 to 932cd27
932cd27 to 3c9c69a
closes apache#4222 Related to: apache#4016
* Working polygon layer for deckGL
* add js controls
* add thumbnail
* better description
* refactor to leverage line_column controls
* templates: open code and documentation on a new tab (#4217) As they are external resources.
* Fix tutorial doesn't match the current interface #4138 (#4215)
* [bugfix] markup and iframe viz raise 'Empty query' (#4225) closes #4222 Related to: #4016
* [bugfix] time_pivot entry got missing in merge conflict (#4221) PR here #3518 missed a line of code while merging conflicts with time_pivot viz
* Improve deck.gl GeoJSON visualization (#4220)
* Improve geoJSON
* Addressing comments
* lint
* refactor to leverage line_column controls
* refactor to use DeckPathViz
* oops
This PR resolves issue #3840, where the cache key for the visualization data was previously a hashed form of the form_data. Per the description in the issue, the form_data stored in the DB, which is what the /warm_up_cache/ endpoint leverages, is not the same as the form data in the explorer view (which is augmented by both the front- and back-ends). Given that the cached data is the query response, it makes more sense to simply use the hashed query string as the cache key. In addition to resolving this issue, the cache key now depends only on the portions of the form_data which are used to generate the query, thus UI changes to the color palette etc. will not invalidate the cache.
Note there is no way to determine via the cache API when the cache key was created, and thus it is necessary to cache the DTTM alongside the data.
Note the cache is now explorer view agnostic, i.e., a cache key may be shared by multiple explorer views/slices.
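The note about caching the DTTM alongside the data could look roughly like the sketch below. The dict-based `_cache` and the `cache_set`/`cache_get` helpers are hypothetical stand-ins for the real cache backend; the point is only that the stored value bundles a timestamp with the rows, since the backend is assumed not to expose a creation time.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for the real cache backend.
_cache = {}


def cache_set(key, data):
    """Store the data together with the time it was cached."""
    _cache[key] = {
        'dttm': datetime.now(timezone.utc).isoformat(),
        'data': data,
    }


def cache_get(key):
    """Return (data, cached_dttm), or (None, None) on a miss."""
    entry = _cache.get(key)
    if entry is None:
        return None, None
    return entry['data'], entry['dttm']


cache_set('some-key', [1, 2, 3])
data, cached_dttm = cache_get('some-key')
```

Because the key no longer depends on the explorer view, every slice that issues the same query would read the same entry and report the same cached DTTM.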
to: @graceguo-supercat @michellethomas @mistercrunch @timifasubaa