Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add single match join #623

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions proto/substrait/algebra.proto
Original file line number Diff line number Diff line change
Expand Up @@ -444,6 +444,7 @@ message Rel {
HashJoinRel hash_join = 13;
MergeJoinRel merge_join = 14;
NestedLoopJoinRel nested_loop_join = 18;
SingleJoinRel single_join = 22;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this was previously attempted as a separate join type:

JOIN_TYPE_SINGLE = 7;

We should either deprecate that join type or collapse this PR into better documenting that join type. I'm uncertain which approach I'd prefer and think it hinges on the "at most one" or "exactly one" question

ConsistentPartitionWindowRel window = 17;
ExchangeRel exchange = 15;
ExpandRel expand = 16;
Expand Down Expand Up @@ -725,6 +726,18 @@ message NestedLoopJoinRel {
substrait.extensions.AdvancedExtension advanced_extension = 10;
}

// A single join enforces that each row is matched at once most counterpart.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// A single join enforces that each row is matched at once most counterpart.
// A single join enforces that each row is matched with at most one counterpart.

// If this constraint is violated, an error is raised.
message SingleJoinRel {
RelCommon common = 1;
Rel left = 2;
Rel right = 3;
// optional, defaults to true (a cartesian join)
Expression expression = 4;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No post_join_filter?

substrait.extensions.AdvancedExtension advanced_extension = 10;
}

// The argument of a function
message FunctionArgument {
oneof arg_type {
Expand Down
27 changes: 26 additions & 1 deletion site/docs/relations/physical_relations.md
Original file line number Diff line number Diff line change
Expand Up @@ -67,10 +67,35 @@ The merge equijoin does a join by taking advantage of two sets that are sorted o
| Left Input | A relational input. | Required |
| Right Input | A relational input. | Required |
| Left Keys | References to the fields to join on in the left input. | Required |
| Right Keys | References to the fields to join on in the right input. | Reauired |
| Right Keys | References to the fields to join on in the right input. | Required |
| Post Join Predicate | An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. | Optional, defaults true. |
| Join Type | One of the join types defined in the Join operator. | Required |



## Single Join Operator

The single match join works by enforcing that each row is involved in only a single match. If this violated, an error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The single match join works by enforcing that each row is involved in only a single match. If this violated, an error
The single match join works by enforcing that each row is involved in only a single match. If this is violated, an error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you say "only a single match" In the protobuf you say "at most one match". Which is it? At most one? Or exactly one?

If it "at most one" then what happens if there is no match? Is this an inner join (no row is emitted) or a left join (a row is emitted with the values from the left side and nulls on the right side)?

is raised. Typical use would be to have one side's duplicates eliminated and then be pushed into a column scan on the
other side. This operator is useful when converting subqueries into joins.
Comment on lines +79 to +80
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably dense but I don't understand how this is helpful. This doesn't seem like it would be useful in eliminating duplicates (an error would be raised). Are you saying this is somehow used on the deduplicated values? How so? What does "pushed into a column scan on the other side" mean?


| Signature | Value |
| -------------------- | ------------------------------------------------------------ |
| Inputs | 2 |
| Outputs | 1 |
| Property Maintenance | Distribution is maintained. Orderedness is eliminated. |
| Direct Output Order | Same as the [Join](logical_relations.md#join-operator) operator. |

### Single Join Properties

| Property | Description | Required |
|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------|
| Left Input | A relational input. | Required |
| Right Input | A relational input. | Required |
| Join Expression | A boolean condition that describes whether each record from the left set "match" the record from the right set. Field references correspond to the direct output order of the data. | Required. Can be (but not expected to be) the literal True. |



## Exchange Operator

The exchange operator will redistribute data based on an exchange type definition. Applying this operation will lead to an output that presents the desired distribution.
Expand Down
Loading