Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: TopNRank optimization #11554

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

feat: TopNRank optimization #11554

wants to merge 1 commit into from

Conversation

aditi-pandit
Copy link
Collaborator

@aditi-pandit aditi-pandit commented Nov 15, 2024

Design doc : https://docs.google.com/document/d/1WQfNigR9bVrbM-PqY7F0mswcetN_tdNahzD9ENye-Q0/edit?usp=sharing

#9404

The refactoring in #11440 reduces the diffs in that part of the code.

e2e Presto PR (with changes in the Presto optimizer as well) prestodb/presto#24138

Latency for SF1K TPC-DS Q67 fell from 399s to 146s with this change.

(I also started working on a fuzzer in #12103 which I will enhance for the rank and dense_rank functions added here).

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 15, 2024
Copy link

netlify bot commented Nov 15, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 67f644d
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/678068753a11b00008487921

public:
MultiTopNRowNumberTest() : TopNRowNumberTest(GetParam()) {}
};

Copy link
Contributor

@liujiayi771 liujiayi771 Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a test case for the logic in fixTopRank.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liujiayi771 : The fixTopRank logic is tested very thoroughly in the fewPartitions test. Have added a comment there.

@JkSelf
Copy link
Collaborator

JkSelf commented Nov 27, 2024

@aditi-pandit
For TopNRowNumber, there is an issue similar to Window, which is that before TopNRowNumber, Spark will insert an OrderBy operator to sort the data as following.

image

So, do we need to make some abstractions in addInput here as well, to facilitate the addition of TopNStreamingRowNumber later on?

@aditi-pandit
Copy link
Collaborator Author

aditi-pandit commented Nov 27, 2024

@aditi-pandit For TopNRowNumber, there is an issue similar to Window, which is that before TopNRowNumber, Spark will insert an OrderBy operator to sort the data as following.

image

So, do we need to make some abstractions in addInput here as well, to facilitate the addition of TopNStreamingRowNumber later on?

@JkSelf : TopNRowNumber is a somewhat streaming operator in its current implementation ... It uses HashTable internally to map the input row to a partition and each partition has an accumulator that maintains the ordered rows (as many required for limit) in a priority queue.

Window accumulates all the input rows and does a full sort of the input rows to demarcate into partitions and sort by order-by. So the preceding Sort was useful and we abstracted the streaming window.

With TopNRowNumber, doing a full sort and then making TopNRowNumber limit to only a partition at a time, the tradeoffs are different. Have you considered removing the global sort and checking if TopNRowNumber suffices ?

If we decide eventually that having a full streaming operator for topNRowNumber is useful, then it might be worth it to write a new operator itself (rather than enhance this current one). Offcourse, we can try to reuse some of the ranking logic pieces.

@JkSelf
Copy link
Collaborator

JkSelf commented Nov 27, 2024

@aditi-pandit For TopNRowNumber, there is an issue similar to Window, which is that before TopNRowNumber, Spark will insert an OrderBy operator to sort the data as following.
image
So, do we need to make some abstractions in addInput here as well, to facilitate the addition of TopNStreamingRowNumber later on?

@JkSelf : TopNRowNumber is a streaming operator... It uses HashTable internally to map the input row to a partition and each partition has an accumulator that maintains the ordered rows in a priority queue.

So the OrderBy is wasteful. It's not required. Can you consider removing the OrderBy before TopNRowNumber ?

@aditi-pandit I see. Thanks for your explanations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants