Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add single match join #623

Closed
wants to merge 1 commit into from

Conversation

EpsilonPrime
Copy link
Member

@EpsilonPrime EpsilonPrime commented Apr 11, 2024

This establishes the physical single join relation. Similar to other joins it uses an expression to compare the left and right inputs. What sets it apart from other joins is that it requires that each row be involved in at most one match. This kind of join is commonly used in unnesting arbitrary subqueries.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this task on. I've got a first round of questions up.

@@ -725,6 +726,18 @@ message NestedLoopJoinRel {
substrait.extensions.AdvancedExtension advanced_extension = 10;
}

// A single join enforces that each row is matched at once most counterpart.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// A single join enforces that each row is matched at once most counterpart.
// A single join enforces that each row is matched with at most one counterpart.

Rel right = 3;
// optional, defaults to true (a cartesian join)
Expression expression = 4;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No post_join_filter?


## Single Join Operator

The single match join works by enforcing that each row is involved in only a single match. If this violated, an error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The single match join works by enforcing that each row is involved in only a single match. If this violated, an error
The single match join works by enforcing that each row is involved in only a single match. If this is violated, an error


## Single Join Operator

The single match join works by enforcing that each row is involved in only a single match. If this violated, an error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you say "only a single match" In the protobuf you say "at most one match". Which is it? At most one? Or exactly one?

If it "at most one" then what happens if there is no match? Is this an inner join (no row is emitted) or a left join (a row is emitted with the values from the left side and nulls on the right side)?

Comment on lines +79 to +80
is raised. Typical use would be to have one side's duplicates eliminated and then be pushed into a column scan on the
other side. This operator is useful when converting subqueries into joins.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably dense but I don't understand how this is helpful. This doesn't seem like it would be useful in eliminating duplicates (an error would be raised). Are you saying this is somehow used on the deduplicated values? How so? What does "pushed into a column scan on the other side" mean?

@@ -444,6 +444,7 @@ message Rel {
HashJoinRel hash_join = 13;
MergeJoinRel merge_join = 14;
NestedLoopJoinRel nested_loop_join = 18;
SingleJoinRel single_join = 22;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this was previously attempted as a separate join type:

JOIN_TYPE_SINGLE = 7;

We should either deprecate that join type or collapse this PR into better documenting that join type. I'm uncertain which approach I'd prefer and think it hinges on the "at most one" or "exactly one" question

@EpsilonPrime
Copy link
Member Author

As noted above we already have a single join, closing this one as unneeded.

@EpsilonPrime EpsilonPrime deleted the delim_join branch September 26, 2024 04:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants