-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement delimiter join #626
Conversation
proto/substrait/algebra.proto
Outdated
// A delimiter join performs duplicate elimination on one side and then | ||
// pushes the rows with the duplicates eliminated into an arbitrary | ||
// number of scans on the opposing side to enact the join. The keys are | ||
// used to implement the join condition. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once again, I think I don't understand. Why would there be more than one scan? How do scans enact a join? Is this for a case where a single input is sent to two different join relations? Or do we somehow need multiple scans to satisfy a single join?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my observations there is only one delimiter scan to generate the delimiter for matching. The unclear language here is what probably should be considered an internal implementation detail (how to break up the task into smaller workloads). I've updated this text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I've left a comment below:
// pushes the rows with the duplicates eliminated into an arbitrary | ||
// number of scans on the opposing side to enact the join. The keys are | ||
// used to implement the join condition. | ||
message DelimiterJoinRel { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this looks correct to me as far as the join goes, this also needs to have a corresponding set of DelimScan
nodes that can be referenced from the delim join. Since these can be nested, there needs to be some way of figuring out which DelimScan
nodes belong to which DelimJoin
, so likely the delim scan nodes need to have an index and the DelimJoin
needs to have a list of indexes/references to the delim scans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between a DelimScan node and a projection? It seems like both construct some value for later consumption. If a DelimScan's output is merely a new field we can provide the field references of interest to the DelimJoin. I've looked at the explain results for a number of queries and it is not as simple as a DelimScan directly providing its results directly to the DelimJoin (for instance TPC-H queries number 17 and 21).
6cda4f8
to
af2e1d4
Compare
Will be addressed in an alternative PR. |
This defines the physical delimiter join relation. It utilizes duplicate elimination to reduce the size of one side and then performs multiple scans on the other side to reduce the work of the actual join.