
Athena: metadata caching #461

Closed
kokes opened this issue Nov 25, 2020 · 4 comments
Labels: enhancement (New feature or request), minor release (Will be addressed in the next minor release), ready to release

kokes commented Nov 25, 2020

Is your idea related to a problem? Please describe.

We rely on Athena caching quite a bit. We currently run our own caching mechanism, but moving to the AWS Data Wrangler would be desirable. What's holding us back is how caching is implemented: if I want to extend the cache to the last, say, 300 queries, _read.py will call list_query_executions and batch_get_query_execution six times each, and both calls are fairly slow.

Describe the solution you'd like

Since the logic is written as composable functions, we can't rely on private caching inside a class, so a global would have to do. It's not a pretty pattern, but it would probably work. Caching batch_get_query_execution is rather simple: we could create a _cache = defaultdict(dict) and save the metadata of all SUCCEEDED queries in _cache[workgroup][queryId].

Then when listing past queries, we could check the cache and only call batch_get_query_execution for queries not in it. Ideally we'd skip the listing as well, but that would only be doable if we allowed stale results. Imagine you check the cache first and, if a query is there, return it directly. But what if the same query ran again after you last filled the cache and returned different results (e.g. a count(*) after a new partition was added)? We'd get quite flaky results.

So at least some listing stays. Say you request 150 past queries. You list once and find 3 new queries and 47 cached ones. If your cache contains at least another 103 entries, you don't need any further listing or batch getting - a nice speedup. It's even compatible with something more complex than a defaultdict - we'd probably prefer some sorted data structure that starts deleting old queries.

There are a few edge cases here and there, but overall, I think this could be done without a huge amount of complexity and could benefit interactive workloads, where you keep calling read_sql_query repeatedly and build your cache as you go along.


So that's my initial brain dump on client-side caching of metadata. Let me know if you have any thoughts on this topic.

@kokes kokes added the enhancement New feature or request label Nov 25, 2020
@igorborgest (Contributor)

Hi @kokes ! It's a great idea for sure.

The whole Wrangler architecture was designed to keep things as simple as possible without holding state, so the decision to use only "raw functions" was conscious. But of course it has some drawbacks and makes our lives harder in situations like this one.

I really like your idea because it is not invasive at all. I think the only point where some refactoring would be required is in this for statement - we just need to replace the current logic with something smarter, like what you've proposed.

Like you said, a class holding this state would be more elegant, but a global variable solves the problem too, so I don't see any serious reason not to do it. To be honest, we already have a global variable in this code base.

It's definitely something we would be happy to implement in the future. But please, feel free to send a PR if you want.

@maxispeicher (Contributor)

@kokes If you want, you can have a look at whether the implementation in the PR fits your needs 😃

@igorborgest (Contributor)

@maxispeicher your proposal looks perfect to me.

@kokes if you want, you can give it a try directly from the development branch:

pip install git+https://github.com/maxispeicher/aws-data-wrangler.git@athena-metadata-caching

@igorborgest igorborgest added this to the 2.3.0 milestone Jan 5, 2021
@igorborgest igorborgest added minor release Will be addressed in the next minor release ready to release labels Jan 8, 2021
@igorborgest (Contributor)

Released on version 2.3.0 🚀
