-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement mark join #625
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -445,6 +445,7 @@ message Rel { | |
HashJoinRel hash_join = 13; | ||
MergeJoinRel merge_join = 14; | ||
NestedLoopJoinRel nested_loop_join = 18; | ||
MarkJoinRel mark_join = 23; | ||
ConsistentPartitionWindowRel window = 17; | ||
ExchangeRel exchange = 15; | ||
ExpandRel expand = 16; | ||
|
@@ -726,6 +727,20 @@ message NestedLoopJoinRel { | |
substrait.extensions.AdvancedExtension advanced_extension = 10; | ||
} | ||
|
||
// A mark join internally scans the left side, constructing a hash table that | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This claims that the hash table is built from the left side and then the right side is probed. This is the reverse of the |
||
// is used to mark the right side as having a join partner on the left side. A | ||
// mark is a nullable boolean field. The mark join operator is used to | ||
// implement semi-joins, anti-joins, and other join types that are not equijoins. | ||
message MarkJoinRel { | ||
RelCommon common = 1; | ||
Rel left = 2; | ||
Rel right = 3; | ||
// optional, defaults to true (a cartesian join) | ||
Expression expression = 4; | ||
|
||
substrait.extensions.AdvancedExtension advanced_extension = 10; | ||
} | ||
|
||
// The argument of a function | ||
message FunctionArgument { | ||
oneof arg_type { | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -71,6 +71,29 @@ The merge equijoin does a join by taking advantage of two sets that are sorted o | |
| Post Join Predicate | An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. | Optional, defaults true. | | ||
| Join Type | One of the join types defined in the Join operator. | Required | | ||
|
||
|
||
|
||
## Mark Join Operator | ||
|
||
A mark join internally scans the left side, constructing a hash table that is used to mark the right side as having a join partner on the left side. This mark can end up being True, False, or NULL. The NULL mark is used to indicate that the right side does not have a join partner on the left side. The mark join operator is used to implement semi-joins, anti-joins, and other join types that are not equijoins. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given the mark might be null then it sounds like this is a right join. In other words, there is one output row for each row on the right side. Is that true? Can we describe that more clearly? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A regular right/outer join can still emit multiple output rows per input row, a mark join preserves the cardinality of the left hand side exactly. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This is not correct. From the paper:
|
||
|
||
| Signature | Value | | ||
| -------------------- | ------------------------------------------------------------ | | ||
| Inputs | 2 | | ||
| Outputs | 1 | | ||
| Property Maintenance | Distribution is maintained. Orderedness is eliminated. | | ||
| Direct Output Order | Same as the [Join](logical_relations.md#join-operator) operator. | | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does the mark join actually join two tables or does it just add the mark column? In other words, if I apply mark join to a left table From my read of the mark join algorithm in the paper it appears that a mark join does not actually join two tables, but only adds the mark (the algorithm scans There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You are correct - the mark join only adds the mark column, the output is |
||
|
||
### Mark Join Properties | ||
|
||
| Property | Description | Required | | ||
|-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------| | ||
| Left Input | A relational input. | Required | | ||
| Right Input | A relational input. | Required | | ||
| Join Expression | A boolean condition that describes whether each record from the left set "match" the record from the right set. Field references correspond to the direct output order of the data. | Required. Can be (but not expected to be) the literal True. | | ||
|
||
|
||
|
||
## Exchange Operator | ||
|
||
The exchange operator will redistribute data based on an exchange type definition. Applying this operation will lead to an output that presents the desired distribution. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does internally here mean?