Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: ordered read_postgres tables #2596

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

tychoish
Copy link
Collaborator

@tychoish tychoish commented Feb 4, 2024

massive WIP;

want to see if we can get something that resembles pushdown sorts into
data sources that can sort on their own, without rearchitecting
datafusion.

if this works, relevant group-by and sorts can stay streaming in DF
and take advantage of indexes in the upstream datasource.

@tychoish
Copy link
Collaborator Author

tychoish commented Feb 4, 2024

Interestingly enough, this seems to have the desired behavior (returning results in the correct order), which is really cool.

Since we don't have a good way to assert about explain in SLTs (or anywhere else, I think?) I don't know if we have actually successfully exposed the fact that the the results will be ordered from the data source. If we don't do this DF will buffer all of the results to sort the (already sorted) data, which will be faster than having to sort the data, but still less than ideal.

The other big question that I think we should make sure to decide on before landing this (and duplicating it in other relevant data sources/table providers and table funcs) is the syntax.

Copy link
Contributor

@universalmind303 universalmind303 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't expose a new dsl for datasource specific ordering. All pushdowns need to be handled in the optimizer.

@tychoish
Copy link
Collaborator Author

tychoish commented Feb 5, 2024

We shouldn't expose a new dsl for datasource specific ordering. All pushdowns need to be handled in the optimizer.

So the optimizer, as I'm sure you know, doesn't pass sort information to table providers/data sources, so while I'm not wed to this particular approach, you're proposing something that isn't possible in datafusion at the moment. If you have a better idea for this I'd love to hear it.

@universalmind303
Copy link
Contributor

So the optimizer, as I'm sure you know, doesn't pass sort information to table providers/data sources, so while I'm not wed to this particular approach, you're proposing something that isn't possible in datafusion at the moment. If you have a better idea for this I'd love to hear it.

It doesn't natively have any mechanisms to push down anything beyond slice, predicate and projections. This doesn't stop us for writing custom optimizer rules for pushing these down.

This is how cube does sort pushdowns (among others), and how datafusion advises handling these kind of specialized optimizations

@tychoish
Copy link
Collaborator Author

tychoish commented Feb 5, 2024

So I think the syntax in the table function is probably wrong, but "just write an optimizer plan to solve this generically" is definitely at the very least "making the perfect the enemy of the good enough." I definitely think as an option in CREATE EXTERNAL TABLE (perhaps) it's more reasonable.

From an implementation perspective, implementing a physical optimizer pass, would (via it's own path) {maybe} end up constructing a table provider with roughly exactly the same code as the one here.

From an ergonomics perspective this isn't all together that different from the way that transactional systems require users to create indexes on their tables to support fast sort operations, and it wouldn't be reasonable for those systems to say "users shouldn't need an explicit syntax, the optimizer should be able to handle it." Similarly, such systems have some kind of HINT functionality, that let users explicitly indicate which indexes to use, in the case when the optimizer selects the wrong index (usually because optimizers don't have a great view into workload or data distribution,) and doing something like this (creating an ordered table) is a lot like a hint.

Thinking about it in these terms, pushing down sorts is the kind of thing that you'd really only want to do in some situations. While it'd be faster for some queries and some tables, if the tables weren't properly indexed, users might really not want the sort pushed down. This ends up not being so much about "how can we best optimize things" and more "how do users want [glaredb/etc] to interact with their databases," and the second part can't really be solved with optimizer abstractions.

Apart from all of this I think the core use case here is pretty valid, and I don't think we should just walk away from any user who wants to sort their data without table scanning their operational databases.

@universalmind303
Copy link
Contributor

marking as draft as it's not actively waiting on review

@universalmind303 universalmind303 marked this pull request as draft March 5, 2024 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants