
Athena: metadata caching #461

Closed
kokes opened this issue Nov 25, 2020 · 4 comments
Labels: enhancement (New feature or request), minor release (Will be addressed in the next minor release), ready to release

kokes commented Nov 25, 2020

Is your idea related to a problem? Please describe.

We rely on Athena caching quite a bit. We currently run our own caching mechanism, but moving to the AWS Data Wrangler would be desirable. What's holding us back is how caching is implemented: if I want to extend the cache to the last, say, 300 queries, _read.py will call list_query_executions and batch_get_query_execution six times each, and both calls are fairly slow.

Describe the solution you'd like

Since the logic is written as composable functions, we can't rely on private caching inside a class, so a global would have to do. It's not a pretty pattern, but it would probably work. Caching batch_get_query_execution is rather simple: we could create a _cache = defaultdict(dict) and save the metadata of all SUCCEEDED queries in _cache[workgroup][queryId].

Then when listing past queries, we could check the cache and only call batch_get_query_execution for queries not in it. Ideally we'd skip the listing as well, but that would only be doable if we allowed stale results. Imagine you check the cache first and, if a query is there, return it directly. But what if the same query ran again after you last filled the cache and returned different results (e.g. a count(*) after a new partition was added)? We'd get quite flaky results.

So at least some listing stays. Say you request 150 past queries. You list once and find 3 new queries and 47 cached ones. If your cache contains at least another 103 entries, you don't need any further listing or batch getting - a nice speedup. It's even compatible with something more complex than a defaultdict - we'd probably prefer some sorted data structure that starts deleting old queries.

There are a few edge cases here and there, but overall, I think this could be done without a huge amount of complexity and could benefit interactive workloads, where you keep calling read_sql_query repeatedly and build your cache as you go along.


So that's my initial brain dump on client-side caching of metadata. Let me know if you have any thoughts on this topic.

@kokes kokes added the enhancement New feature or request label Nov 25, 2020
@igorborgest (Contributor)

Hi @kokes ! It's a great idea for sure.

The whole Wrangler architecture was designed to keep things as simple as possible without holding state, so the decision to use only "raw functions" was conscious. But of course it has some drawbacks and makes our lives harder in situations like this one.

I really like your idea because it is not invasive at all. I think the only point where some refactoring would be required is in this for statement - we just need to replace the current logic with something smarter, like what you've proposed.

Like you said, a class holding this state would be more elegant, but a global variable solves the problem too, so I don't see any serious reason not to do it. To be honest, we already have a global variable in this code base.

It's definitely something we would be happy to implement in the future. But please, feel free to send a PR if you want.

@maxispeicher (Contributor)

@kokes If you want, you can have a look at whether the implementation in the PR fits your needs 😃

@igorborgest (Contributor)

@maxispeicher your proposal looks perfect to me.

@kokes if you want, you can give it a try directly from the development branch:

pip install git+https://github.com/maxispeicher/aws-data-wrangler.git@athena-metadata-caching

@igorborgest igorborgest added this to the 2.3.0 milestone Jan 5, 2021
@igorborgest igorborgest added minor release Will be addressed in the next minor release ready to release labels Jan 8, 2021
@igorborgest (Contributor)

Released on version 2.3.0 🚀
