Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect results for NOT IN subqueries with nulls #8271

Closed
wants to merge 5 commits into from

Conversation

msirek
Copy link
Contributor

@msirek msirek commented Nov 20, 2023

Which issue does this PR close?

Closes #8241.

Rationale for this change

NOT IN queries executed as antijoin did not honor three-valued boolean logic.
The equality predicate built for antijoin, such as 1 = my_tab.a for the following
query is never true, so would previously make the filter appear to be qualified:

select 'qualified filter' where 1 NOT IN (select * from (values (3), (null)) my_tab(a));

The above NOT IN expression should evaluate the same as the expression 1 != 3 AND 1 != null,
which evaluates to null (which effectively is the same as false when used as a filter or join predicate).

What changes are included in this PR?

The code which decorrelates a NOT IN subquery into an antijoin with an equality condition
between the left and right expressions of the NOT IN predicate is updated so that where
previously the condition left_expression = right_expression was always built, we now build
left_expression = right_expression IS NOT FALSE whenever the left expression or right
expression is nullable.

Are these changes tested?

  • unit tests for nullable left/right expressions of a NOT IN subquery predicate
  • sqllogictests for nullable left/right expressions of a NOT IN subquery predicate

Are there any user-facing changes?

No

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Nov 20, 2023
@msirek msirek marked this pull request as ready for review November 20, 2023 06:17
Copy link
Contributor

@yukkit yukkit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works for me.

@@ -1077,6 +1116,148 @@ mod tests {
Ok(())
}

/// Test for single NOT IN subquery filter and nullable subquery column
#[test]
fn not_in_nullable_subquery_simple() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 beautiful unit tests

.query
.subquery
.schema()
.field_from_column(&unqualified_right_col)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use field_with_unqualified_name here? The unqualified name can be cloned from right_col.name.
So that there is no need to create unqualified_right_col, and we do not need to modify the create_col_from_scalar_expr function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've made the suggested change.

@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Nov 21, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @msirek -- I was going through old PRs and found this one that somehow slipped through.

I took the liberty of merging the code in the PR up from main.

The code is quite elegant (as all your previous PRs) -- but mind bending (as are many NULL related PRs). Some of the plans look like they have regressed and are no longer using HashJoin with equality predicates.

Perhaps we can use https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.null_equals_null or similar to mark the joins correctly and still use eqi-joins but with different null matching semantics

cc @jackwener

@@ -78,31 +78,25 @@ GlobalLimitExec: skip=0, fetch=10
------------------CoalesceBatchesExec: target_batch_size=8192
--------------------RepartitionExec: partitioning=Hash([p_brand@0, p_type@1, p_size@2, alias1@3], 4), input_partitions=4
----------------------AggregateExec: mode=Partial, gby=[p_brand@1 as p_brand, p_type@2 as p_type, p_size@3 as p_size, ps_suppkey@0 as alias1], aggr=[]
------------------------CoalesceBatchesExec: target_batch_size=8192
--------------------------HashJoinExec: mode=Partitioned, join_type=LeftAnti, on=[(ps_suppkey@0, s_suppkey@0)]
------------------------NestedLoopJoinExec: join_type=LeftAnti, filter=ps_suppkey@0 = s_suppkey@1 IS DISTINCT FROM false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this is likely a regression in the query plan now because somehow the join can't use an eqijoin. Maybe we need to teach the join operator about this subtely, or set null_equal_null perhaps

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should identify is distinct from, and then set the null_equal_null according to it.

@msirek
Copy link
Contributor Author

msirek commented Feb 2, 2024

I took the liberty of merging the code in the PR up from main.

Thank you!

Some of the plans look like they have regressed and are no longer using HashJoin with equality predicates.
Perhaps we can use https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.null_equals_null or similar to mark the joins correctly and still use eqi-joins but with different null matching semantics

The semantics of null_equals_null can't really be re-used here because it only works for comparing 2 nulls, as I understand it.
https://github.com/apache/arrow-datafusion/blob/d594e6257b34a5ad47112e26d41516aaeb19e6dd/datafusion/physical-plan/src/joins/hash_join.rs#L1081-L1084
https://arrow.apache.org/rust/arrow/compute/kernels/cmp/fn.not_distinct.html

For example, in this test case:

SELECT 'Found' WHERE 1 NOT IN (NULL, 2, 3);
0 rows in set.

The result of evaluating "1 IS NOT DISTINCT FROM NULL" is FALSE, the same as when comparing 1 with 2 and 3, so the NOT IN result would be TRUE (but should be NULL).

There is a good description of the needed semantics in null-aware antijoin, referenced in #8241.

For the general case:

SELECT * FROM t1 WHERE nullable_column1 NOT IN (SELECT nullable_column2 FROM t2);

The presence of any NULL in the subquery means the predicate should always evaluate to NULL for every outer row, so the antijoin would need to be taught to detect the presence of a NULL in the subquery and return no rows.
Similarly, the presence of NULL in an outer row could short circuit and avoid joining that row with the subquery (always evaluates to NULL).
For multicolumn NOT IN it would be more complicated.

If we don't want any plan regressions, a null-aware flavor of hash antijoin seems to be needed.

@jackwener
Copy link
Member

jackwener commented Feb 3, 2024

Get it, I thought wrong in the past, thanks @msirek.
We indeed need NAAJ.

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Apr 21, 2024
@github-actions github-actions bot closed this Apr 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect results for NOT IN subqueries with nulls
5 participants