Support existence join type for broadcast nested loop join #5301

Merged (7 commits) on Apr 27, 2022

@res-life (Collaborator) commented Apr 22, 2022

Closes #5034
Support existence join type for broadcast nested loop join

Signed-off-by: Chong Gao <res_life@163.com>

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life
Collaborator Author

JoinGatherer for existence join
An existence join generates a boolean `exists` column (true or false per left row)
and appends it to the output columns. It neither shrinks nor expands the left table;
it only provides the `exists` column for the following operator (usually a filter).

e.g.:
select * from left_table where
  left_table.column_0 >= 3
  or
  exists (select * from right_table where left_table.column_1 < right_table.column_1)

The plan for this SQL is:

Filter(left_table.column_0 >= 3 or `exists`)
  Existence_join // generate `exists` column, do not shrink or expand the rows of left table
    left_table
    right_table

This PR adds a new JoinGatherer to handle the existence join.
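
As a rough illustration of the semantics described in this comment, here is a minimal Python sketch. It is not the plugin's Scala/GPU code: the `existence_join` helper, the sample tables, and the two-column row layout are all hypothetical, and a plain nested loop stands in for the real join.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, int]  # hypothetical row layout: (column_0, column_1)


def existence_join(
    left: List[Row],
    right: List[Row],
    condition: Callable[[Row, Row], bool],
) -> List[Tuple[int, int, bool]]:
    """Append an `exists` flag to every left row; rows are never added or dropped."""
    return [row + (any(condition(row, r) for r in right),) for row in left]


# Hypothetical sample data.
left_table = [(1, 10), (5, 99), (2, 70)]
right_table = [(0, 20), (0, 60)]

# Existence_join: left_table.column_1 < right_table.column_1
joined = existence_join(left_table, right_table, lambda l, r: l[1] < r[1])

# Filter(left_table.column_0 >= 3 or `exists`)
result = [row[:2] for row in joined if row[0] >= 3 or row[2]]
```

Note that `joined` has exactly as many rows as `left_table`; only the later filter step removes rows.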

@firestarman (Collaborator) left a comment

I did a quick review and have some comments.



@ignore_order
@pytest.mark.parametrize('aqeEnabled', [pytest.param(True, id='aqe:on'), pytest.param(False, id='aqe:off')])
Collaborator

Suggested change
- @pytest.mark.parametrize('aqeEnabled', [pytest.param(True, id='aqe:on'), pytest.param(False, id='aqe:off')])
+ @pytest.mark.parametrize('aqeEnabled', [True, False], ids=['AQE_ON', 'AQE_OFF'])

Collaborator

Whichever form we choose, we should use a consistent id scheme across the pytest code base.

Collaborator Author

done

-      outOfBoundsPolicy: OutOfBoundsPolicy): JoinGatherer =
-    new JoinGathererImpl(gatherMap, inputData, outOfBoundsPolicy)
+      outOfBoundsPolicy: OutOfBoundsPolicy, isExistenceJoin: Boolean = false): JoinGatherer = {
+    if (!isExistenceJoin) {
Collaborator

NIT:

    if (isExistenceJoin) {
      new JoinGathererForExistenceJoin(gatherMap, inputData, outOfBoundsPolicy)
    } else {
      new JoinGathererImpl(gatherMap, inputData, outOfBoundsPolicy)
    }

Collaborator Author

done

* </code>
*/
class JoinGathererForExistenceJoin(
Collaborator

NIT

Suggested change
- class JoinGathererForExistenceJoin(
+ class ExistenceJoinGathererImpl(

Collaborator Author

No need now.

// `exists.numRows` == `batch.numRows`,
// with true or false in it indicating if the row is gathered
withResource(genExistsColumn(gatherView, batch.numRows())) { exists =>
Collaborator

Personally I prefer to generate the existence column from subTableCbTmp rather than from the whole batch. Then there is no need to split it afterwards; you can append the generated 'existence' column to the sub table directly.

Collaborator

As @firestarman mentioned, this may lead to many duplicated runs of a full-scale scatter. IIUC, if we split the gather map the same way as the left table, we need to translate global offsets to local offsets before scattering. Another approach might be to cache the exists column and reuse it across mini batches.

Collaborator Author

The batch is now emitted directly, because the exists boolean column is relatively small.
This is the same as what GpuHashJoin does, so there is no need to split now.
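
To make the options raised in this thread concrete, here is a minimal Python sketch. It is not the plugin's Scala code: `exists_column` and `exists_for_slice` are hypothetical names, and Python lists stand in for GPU columns. The first function scatters `True` into an all-false column for the whole batch; the second builds the column for one slice of the left table, which requires translating global row offsets to local ones first.

```python
from typing import List


def exists_column(matched_left_rows: List[int], num_rows: int) -> List[bool]:
    """Start from an all-false column and scatter True at each matched row index."""
    exists = [False] * num_rows
    for idx in matched_left_rows:
        exists[idx] = True
    return exists


def exists_for_slice(matched_left_rows: List[int], start: int, end: int) -> List[bool]:
    """Exists column for left rows [start, end): global offsets become local ones."""
    local = [idx - start for idx in matched_left_rows if start <= idx < end]
    return exists_column(local, end - start)


# Hypothetical global left-row indices that found a match (repeats are allowed).
matched = [1, 4, 4, 6]

whole = exists_column(matched, 8)            # one scatter over the whole batch
first = exists_for_slice(matched, 0, 4)      # per-slice scatter, rows 0..3
second = exists_for_slice(matched, 4, 8)     # per-slice scatter, rows 4..7
```

Both routes produce the same flags; the whole-batch version is simpler because no offset translation is needed, which matters little when the boolean column is small.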

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life
Collaborator Author

build

@sameerz added the "feature request" label on Apr 25, 2022
@sameerz added this to the Apr 18 - Apr 29 milestone on Apr 25, 2022
@jlowe (Contributor) left a comment

Minor nits on the class names used but otherwise looking good.

Chong Gao added 2 commits April 26, 2022 09:16
Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

jlowe previously approved these changes Apr 26, 2022
@gerashegalov (Collaborator) left a comment

LGTM, some nits

@@ -601,7 +572,7 @@ class ExistenceJoinIterator(
* the value "false", scattering "true" into column FC will produce the "exists"
* column of ExistenceJoin
*/
- private def existsScatterMap(leftColumnarBatch: ColumnarBatch): GatherMap = {
+ def existsScatterMap(leftColumnarBatch: ColumnarBatch): GatherMap = {
Collaborator

nit:

Suggested change
- def existsScatterMap(leftColumnarBatch: ColumnarBatch): GatherMap = {
+ override def existsScatterMap(leftColumnarBatch: ColumnarBatch): GatherMap = {

Collaborator Author

done


use(condition)

def existsScatterMap(leftColumnarBatch: ColumnarBatch): GatherMap = {
Collaborator

nit:

Suggested change
- def existsScatterMap(leftColumnarBatch: ColumnarBatch): GatherMap = {
+ override def existsScatterMap(leftColumnarBatch: ColumnarBatch): GatherMap = {

Collaborator Author

done


use(condition)

def existsScatterMap(leftColumnarBatch: ColumnarBatch): GatherMap = {
Collaborator

nit:

Suggested change
- def existsScatterMap(leftColumnarBatch: ColumnarBatch): GatherMap = {
+ override def existsScatterMap(leftColumnarBatch: ColumnarBatch): GatherMap = {

Collaborator Author

done

@@ -601,7 +572,7 @@ class ExistenceJoinIterator(
* the value "false", scattering "true" into column FC will produce the "exists"
* column of ExistenceJoin
*/
Collaborator

the comment should be moved to the abstract parent class method declaration

Collaborator Author

done

@res-life
Collaborator Author

build

@res-life
Collaborator Author

build

@res-life merged commit 45d6fcc into NVIDIA:branch-22.06 on Apr 27, 2022
@res-life deleted the existence-join branch on April 27, 2022 11:34
Linked issue: [FEA] Implement ExistenceJoin for BroadcastNestedLoopJoin Exec