Athena: metadata caching #461
Labels: enhancement (New feature or request), minor release (Will be addressed in the next minor release), ready to release
**Is your idea related to a problem? Please describe.**

We rely on Athena caching quite a bit - we currently run our own caching mechanism, but moving to AWS Data Wrangler would be desirable. What's stopping us a bit is the way caching is implemented. If I want to extend the cache to the last, say, 300 queries, `_read.py` will query `list_query_executions` and `batch_get_query_execution` six times each, both of which are fairly slow.

**Describe the solution you'd like**
Since the logic is written in composable functions, we can't quite rely on some private caching in a class, so a global would have to do - not a super pretty pattern, but it would probably do. Caching `batch_get_query_execution` is rather simple: we could create a `_cache = defaultdict(dict)` and save all SUCCEEDED query metadata in `_cache[workgroup][queryId]`.

Then when listing past queries, we could check the cache and only do `batch_get_query_execution` for those not in our cache. Ideally we'd want to skip the listing as well, but that would only be doable if we allowed for stale results - e.g. imagine you first check the cache and, if a query is there, return it directly. But what if the same query ran again after you last filled the cache and returned different results than the run before (e.g. a `count(*)` after a new partition was added)? We'd get quite flaky results.
So at least some listing stays. Say you request 150 past queries. You list once and find 3 new queries and 47 cached ones. If your cache contains at least 103 other entries, you don't need to do any more listing or batch getting - that's a nice speedup. It's even compatible with something more complex than a `defaultdict` - we'd probably prefer some sorted data structure that would start deleting old queries.

There are a few edge cases here and there, but overall, I think this could be done without a huge amount of complexity and could benefit interactive workloads, where you keep calling `read_sql_query` repeatedly and build your cache as you go along.

So that's my initial brain dump when it comes to client-side caching of metadata. Let me know if you have thoughts on this topic.
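One way the counting in the 150-query example above could work, as a sketch with hypothetical names (here the fresh queries on the page count toward the requested total, so the exact threshold depends on the bookkeeping you choose):

```python
def after_first_page(want, page, cache):
    """Split one listing page (a newest-first list of query IDs) into fresh
    IDs, which must still be batch-fetched, and cache hits; then compute how
    many more IDs would have to be listed once older cached entries are
    counted in. A shortfall of 0 means no further listing is needed."""
    fresh = [qid for qid in page if qid not in cache]
    hits = len(page) - len(fresh)
    older_cached = len(cache) - hits  # cached entries not on this page
    shortfall = want - len(page) - older_cached
    return fresh, max(shortfall, 0)
```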
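And the "sorted data structure that would start deleting old queries" could be as simple as an `OrderedDict` capped at a maximum size - a sketch, with the class name and default limit made up:

```python
from collections import OrderedDict


class BoundedQueryCache:
    """Keeps at most max_size query metadata entries, evicting the oldest."""

    def __init__(self, max_size=300):
        self.max_size = max_size
        self._data = OrderedDict()

    def put(self, query_id, metadata):
        self._data[query_id] = metadata
        self._data.move_to_end(query_id)   # most recently seen goes last
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # drop the oldest entry

    def get(self, query_id):
        return self._data.get(query_id)
```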