Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement delimiter join #626

Closed
wants to merge 3 commits into from

Conversation

EpsilonPrime
Copy link
Member

This defines the physical delimiter join relation. It utilizes duplicate elimination to reduce the size of one side and then performs multiple scans on the other side to reduce the work of the actual join.

Comment on lines 729 to 733
// A delimiter join performs duplicate elimination on one side and then
// pushes the rows with the duplicates eliminated into an arbitrary
// number of scans on the opposing side to enact the join. The keys are
// used to implement the join condition.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once again, I think I don't understand. Why would there be more than one scan? How do scans enact a join? Is this for a case where a single input is sent to two different join relations? Or do we somehow need multiple scans to satisfy a single join?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my observations there is only one delimiter scan to generate the delimiter for matching. The unclear language here is what probably should be considered an internal implementation detail (how to break up the task into smaller workloads). I've updated this text.

Copy link

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've left a comment below:

// pushes the rows with the duplicates eliminated into an arbitrary
// number of scans on the opposing side to enact the join. The keys are
// used to implement the join condition.
message DelimiterJoinRel {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this looks correct to me as far as the join goes, this also needs to have a corresponding set of DelimScan nodes that can be referenced from the delim join. Since these can be nested, there needs to be some way of figuring out which DelimScan nodes belong to which DelimJoin, so likely the delim scan nodes need to have an index and the DelimJoin needs to have a list of indexes/references to the delim scans.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between a DelimScan node and a projection? It seems like both construct some value for later consumption. If a DelimScan's output is merely a new field we can provide the field references of interest to the DelimJoin. I've looked at the explain results for a number of queries and it is not as simple as a DelimScan directly providing its results directly to the DelimJoin (for instance TPC-H queries number 17 and 21).

@EpsilonPrime
Copy link
Member Author

Will be addressed in an alternative PR.

@EpsilonPrime EpsilonPrime deleted the delimiter_join branch September 26, 2024 04:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants