Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement mark join #625

Closed
wants to merge 3 commits into from

Conversation

EpsilonPrime
Copy link
Member

This establishes the physical mark join relation. It utilizes a mark (a nullable boolean field reference) to reduce the amount of data that needs to be processed.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm probably missing something.

|-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------|
| Left Input | A relational input. | Required |
| Right Input | A relational input. | Required |
| Mark Reference | A nullable boolean field reference that is used to filter the right input. If the mark is null, the row is not included in the join. | Required. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's it? Why not just use a FilterRel?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description is not correct - a mark join does not change the cardinality of the left table (much like a single join). The marker is either true, false or NULL depending on if a match is found and/or if any NULL values were encountered. Often the marker is used in a filter later on - but this is not required. If the marker is exclusively used in a filter the join can also be converted into a semi join and the marker is not necessary. It's described in this paper.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the documentation. It hopefully is more clear what the mark join is trying to accomplish.

I've also removed the mark reference as that appears to be an internal detail for the mark join (an internal hash table is constructed instead of the right join input having a new column introduced).

Copy link

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I'm not that well versed in Substrait but the implementation looks good to me. There just seems to be a mistake in the description of what the join does.

|-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------|
| Left Input | A relational input. | Required |
| Right Input | A relational input. | Required |
| Mark Reference | A nullable boolean field reference that is used to filter the right input. If the mark is null, the row is not included in the join. | Required. |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description is not correct - a mark join does not change the cardinality of the left table (much like a single join). The marker is either true, false or NULL depending on if a match is found and/or if any NULL values were encountered. Often the marker is used in a filter later on - but this is not required. If the marker is exclusively used in a filter the join can also be converted into a semi join and the marker is not necessary. It's described in this paper.

… the act of marking is internal to the join.
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a scan of the paper and I'm not sure it agrees with what is proposed. Note, the paper also has some mention / notion of "compare type" and suggests the join might actually do a cast of sorts on the key columns. If this is true, do we need to include this compare type concept in order for these relations to be usable by duckdb?

@@ -726,6 +727,20 @@ message NestedLoopJoinRel {
substrait.extensions.AdvancedExtension advanced_extension = 10;
}

// A mark join internally scans the left side, constructing a hash table that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does internally here mean?

@@ -726,6 +727,20 @@ message NestedLoopJoinRel {
substrait.extensions.AdvancedExtension advanced_extension = 10;
}

// A mark join internally scans the left side, constructing a hash table that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This claims that the hash table is built from the left side and then the right side is probed. This is the reverse of the HashJoinRel. If this intentional, then we should point this out (e.g. "Note that the table is built from the left side which is the opposite of the approach taken in the HashJoinRel)

| Inputs | 2 |
| Outputs | 1 |
| Property Maintenance | Distribution is maintained. Orderedness is eliminated. |
| Direct Output Order | Same as the [Join](logical_relations.md#join-operator) operator. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the mark join actually join two tables or does it just add the mark column? In other words, if I apply mark join to a left table l_key, l_payload and a right table r_key, r_payload then is the output l_key, l_payload, r_key, r_payload, mark or is it l_key, l_payload, mark (which would be very strange if the left table is the build side, see previous comment)

From my read of the mark join algorithm in the paper it appears that a mark join does not actually join two tables, but only adds the mark (the algorithm scans R and S. Only rows from R are inserted into the hash table H. The emit at the end is a scan of H which only contains rows from R. This is reinforced by the notation of the emit (emit r, marker of r) Compare this to the full outer hash join algorithm which contains lines like emit r,s and emit r,NULL)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct - the mark join only adds the mark column, the output is l_key, l_payload, mark (where left is the probe side). The cardinality of the left side (probe side) is preserved - i.e. the mark join neither removes nor adds additional rows.


## Mark Join Operator

A mark join internally scans the left side, constructing a hash table that is used to mark the right side as having a join partner on the left side. This mark can end up being True, False, or NULL. The NULL mark is used to indicate that the right side does not have a join partner on the left side. The mark join operator is used to implement semi-joins, anti-joins, and other join types that are not equijoins.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the mark might be null then it sounds like this is a right join. In other words, there is one output row for each row on the right side. Is that true? Can we describe that more clearly?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A regular right/outer join can still emit multiple output rows per input row, a mark join preserves the cardinality of the left hand side exactly.


## Mark Join Operator

A mark join internally scans the left side, constructing a hash table that is used to mark the right side as having a join partner on the left side. This mark can end up being True, False, or NULL. The NULL mark is used to indicate that the right side does not have a join partner on the left side. The mark join operator is used to implement semi-joins, anti-joins, and other join types that are not equijoins.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NULL mark is used to indicate that the right side does not have a join partner on the left side.

This is not correct. From the paper:

the marker may not only be TRUE (had a join partner) or FALSE (had no
join partner), but also NULL (had a join partner where the comparison result is NULL, but
none where the comparison is TRUE)

@EpsilonPrime
Copy link
Member Author

Will be addressed in an alternative PR.

@EpsilonPrime EpsilonPrime deleted the mark_join branch September 26, 2024 04:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants