feat: implement delimiter join #626

EpsilonPrime · 2024-04-12T03:33:02Z

This defines the physical delimiter join relation. It utilizes duplicate elimination to reduce the size of one side and then performs multiple scans on the other side to reduce the work of the actual join.
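To make the description above concrete, here is a minimal in-memory sketch of the idea: deduplicate the join keys from one side, then let the opposing side consume only those distinct keys before the join itself runs. This is an illustration under simplifying assumptions, not the relation defined in this PR. The function name `delimiter_join` and its parameters are hypothetical, and a real engine would push the distinct keys into scans inside the opposing subtree rather than filter a list.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def delimiter_join(left: List[Row], right: List[Row],
                   left_key: str, right_key: str) -> List[Row]:
    """Hypothetical sketch of an inner delimiter join on a single key."""
    # Step 1: duplicate elimination on the left side's join keys.
    distinct_keys = {row[left_key] for row in left}

    # Step 2: the opposing side only keeps rows whose key appears in the
    # deduplicated set (a stand-in for the delimiter scan).
    right_by_key: Dict[Any, List[Row]] = {}
    for row in right:
        key = row[right_key]
        if key in distinct_keys:
            right_by_key.setdefault(key, []).append(row)

    # Step 3: the actual join, now probing a reduced right side.
    return [{**l, **r} for l in left for r in right_by_key.get(l[left_key], [])]


left_rows = [{"id": 1, "a": "x"}, {"id": 1, "a": "y"}, {"id": 2, "a": "z"}]
right_rows = [{"ref": 1, "b": 10}, {"ref": 3, "b": 30}]
joined = delimiter_join(left_rows, right_rows, "id", "ref")
```

Both left rows with `id` 1 match the single right row with `ref` 1, while the right row with `ref` 3 is discarded before the join because its key never appears on the left.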

westonpace · 2024-04-12T21:20:36Z

proto/substrait/algebra.proto

+// A delimiter join performs duplicate elimination on one side and then
+// pushes the rows with the duplicates eliminated into an arbitrary
+// number of scans on the opposing side to enact the join.  The keys are
+// used to implement the join condition.


Once again, I think I don't understand. Why would there be more than one scan? How do scans enact a join? Is this for a case where a single input is sent to two different join relations? Or do we somehow need multiple scans to satisfy a single join?

EpsilonPrime · 2024-04-30

From what I've observed, there is only one delimiter scan, which generates the delimiter used for matching. The unclear language here probably describes what should be considered an internal implementation detail (how to break the task up into smaller workloads). I've updated this text.

Mytherin

Thanks! I've left a comment below:

Mytherin · 2024-04-15T11:13:40Z

proto/substrait/algebra.proto

+// pushes the rows with the duplicates eliminated into an arbitrary
+// number of scans on the opposing side to enact the join.  The keys are
+// used to implement the join condition.
+message DelimiterJoinRel {


While this looks correct to me as far as the join goes, this also needs to have a corresponding set of DelimScan nodes that can be referenced from the delim join. Since these can be nested, there needs to be some way of figuring out which DelimScan nodes belong to which DelimJoin, so likely the delim scan nodes need to have an index and the DelimJoin needs to have a list of indexes/references to the delim scans.
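The index-based linkage suggested here can be sketched as a small plan-tree walk. This is only one possible reading of the suggestion: the class names `DelimScan` and `DelimJoin`, the fields `scan_id` and `delim_scan_ids`, and the function `resolve_scans` are all hypothetical and are not part of the Substrait proto in this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class DelimScan:
    scan_id: int  # hypothetical index identifying this scan


@dataclass
class DelimJoin:
    name: str
    delim_scan_ids: List[int]  # hypothetical references to the scans this join owns
    children: List[Union["DelimJoin", DelimScan]] = field(default_factory=list)


def resolve_scans(node: Union[DelimJoin, DelimScan],
                  in_scope: Optional[Dict[int, str]] = None,
                  result: Optional[Dict[int, str]] = None) -> Dict[int, str]:
    """Map each DelimScan id to the name of the enclosing DelimJoin that declares it."""
    in_scope = dict(in_scope or {})  # copy so sibling subtrees do not leak ids
    result = {} if result is None else result
    if isinstance(node, DelimScan):
        if node.scan_id not in in_scope:
            raise ValueError(f"DelimScan {node.scan_id} has no enclosing DelimJoin")
        result[node.scan_id] = in_scope[node.scan_id]
        return result
    # A join brings the scan ids it declares into scope for its whole subtree.
    for scan_id in node.delim_scan_ids:
        in_scope[scan_id] = node.name
    for child in node.children:
        resolve_scans(child, in_scope, result)
    return result


# Nested case: the inner join's subtree also contains a scan owned by the outer join.
inner = DelimJoin("inner", [2], [DelimScan(2), DelimScan(1)])
outer = DelimJoin("outer", [1], [DelimScan(1), inner])
owners = resolve_scans(outer)
```

With explicit ids, a scan nested under an inner join can still be attributed to the outer join, which is the ambiguity that nesting would otherwise create.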

EpsilonPrime · 2024-04-30

What's the difference between a DelimScan node and a projection? Both seem to construct some value for later consumption. If a DelimScan's output is merely a new field, we can provide the field references of interest to the DelimJoin. I've looked at the explain output for a number of queries, and it is not as simple as a DelimScan providing its results directly to the DelimJoin (see, for instance, TPC-H queries 17 and 21).

EpsilonPrime · 2024-08-08T03:11:14Z

Will be addressed in an alternative PR.

westonpace reviewed Apr 12, 2024


Mytherin reviewed Apr 15, 2024


EpsilonPrime added 2 commits April 29, 2024 14:52

Merge branch 'main' of github.com:substrait-io/substrait

6ca533f

feat: implement delimiter join

af2e1d4

EpsilonPrime force-pushed the delimiter_join branch from 6cda4f8 to af2e1d4 on April 30, 2024 01:46

Updated based on review feedback.

6b65e4f

EpsilonPrime closed this Aug 8, 2024

EpsilonPrime deleted the delimiter_join branch September 26, 2024 04:15


EpsilonPrime commented Apr 12, 2024

westonpace Apr 12, 2024

EpsilonPrime Apr 30, 2024

Mytherin left a comment

Mytherin Apr 15, 2024

EpsilonPrime Apr 30, 2024

EpsilonPrime commented Aug 8, 2024
