Fix incorrect results for NOT IN subqueries with nulls #8271

msirek · 2023-11-20T03:54:57Z

Which issue does this PR close?

Closes #8241.

Rationale for this change

NOT IN queries executed as antijoin did not honor three-valued boolean logic.
The equality predicate built for antijoin, such as 1 = my_tab.a for the following
query is never true, so would previously make the filter appear to be qualified:

select 'qualified filter' where 1 NOT IN (select * from (values (3), (null)) my_tab(a));

The above NOT IN expression should evaluate the same as the expression 1 != 3 AND 1 != null,
which evaluates to null (which effectively is the same as false when used as a filter or join predicate).

What changes are included in this PR?

The code which decorrelates a NOT IN subquery into an antijoin with an equality condition
between the left and right expressions of the NOT IN predicate is updated so that where
previously the condition left_expression = right_expression was always built, we now build
left_expression = right_expression IS NOT FALSE whenever the left expression or right
expression is nullable.

Are these changes tested?

unit tests for nullable left/right expressions of a NOT IN subquery predicate
sqllogictests for nullable left/right expressions of a NOT IN subquery predicate

Are there any user-facing changes?

No

yukkit

It works for me.

yukkit · 2023-11-21T02:24:59Z

datafusion/optimizer/src/decorrelate_predicate_subquery.rs

@@ -1077,6 +1116,148 @@ mod tests {
        Ok(())
    }

+    /// Test for single NOT IN subquery filter and nullable subquery column
+    #[test]
+    fn not_in_nullable_subquery_simple() -> Result<()> {


👍 beautiful unit tests

jonahgao · 2023-11-21T03:10:49Z

datafusion/optimizer/src/decorrelate_predicate_subquery.rs

+            .query
+            .subquery
+            .schema()
+            .field_from_column(&unqualified_right_col)?;


Could we use field_with_unqualified_name here? The unqualified name can be cloned from right_col.name.
So that there is no need to create unqualified_right_col, and we do not need to modify the create_col_from_scalar_expr function.

Thanks! I've made the suggested change.

alamb

Thank you @msirek -- I was going through old PRs and found this one that somehow slipped through.

I took the liberty of merging the code in the PR up from main.

The code is quite elegant (as all your previous PRs) -- but mind bending (as are many NULL related PRs). Some of the plans look like they have regressed and are no longer using HashJoin with equality predicates.

Perhaps we can use https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.null_equals_null or similar to mark the joins correctly and still use eqi-joins but with different null matching semantics

cc @jackwener

alamb · 2024-01-31T20:55:45Z

datafusion/sqllogictest/test_files/tpch/q16.slt.part

@@ -78,31 +78,25 @@ GlobalLimitExec: skip=0, fetch=10
 ------------------CoalesceBatchesExec: target_batch_size=8192
 --------------------RepartitionExec: partitioning=Hash([p_brand@0, p_type@1, p_size@2, alias1@3], 4), input_partitions=4
 ----------------------AggregateExec: mode=Partial, gby=[p_brand@1 as p_brand, p_type@2 as p_type, p_size@3 as p_size, ps_suppkey@0 as alias1], aggr=[]
------------------------CoalesceBatchesExec: target_batch_size=8192
--------------------------HashJoinExec: mode=Partitioned, join_type=LeftAnti, on=[(ps_suppkey@0, s_suppkey@0)]
+------------------------NestedLoopJoinExec: join_type=LeftAnti, filter=ps_suppkey@0 = s_suppkey@1 IS DISTINCT FROM false


It seems like this is likely a regression in the query plan now because somehow the join can't use an eqijoin. Maybe we need to teach the join operator about this subtely, or set null_equal_null perhaps

Maybe we should identify is distinct from, and then set the null_equal_null according to it.

datafusion/optimizer/src/decorrelate_predicate_subquery.rs

msirek · 2024-02-02T20:24:28Z

I took the liberty of merging the code in the PR up from main.

Thank you!

Some of the plans look like they have regressed and are no longer using HashJoin with equality predicates.
Perhaps we can use https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.null_equals_null or similar to mark the joins correctly and still use eqi-joins but with different null matching semantics

The semantics of null_equals_null can't really be re-used here because it only works for comparing 2 nulls, as I understand it.
https://github.com/apache/arrow-datafusion/blob/d594e6257b34a5ad47112e26d41516aaeb19e6dd/datafusion/physical-plan/src/joins/hash_join.rs#L1081-L1084
https://arrow.apache.org/rust/arrow/compute/kernels/cmp/fn.not_distinct.html

For example, in this test case:

SELECT 'Found' WHERE 1 NOT IN (NULL, 2, 3);
0 rows in set.

The result of evaluating "1 IS NOT DISTINCT FROM NULL" is FALSE, the same as when comparing 1 with 2 and 3, so the NOT IN result would be TRUE (but should be NULL).

There is a good description of the needed semantics in null-aware antijoin, referenced in #8241.

For the general case:

SELECT * FROM t1 WHERE nullable_column1 NOT IN (SELECT nullable_column2 FROM t2);

The presence of any NULL in the subquery means the predicate should always evaluate to NULL for every outer row, so the antijoin would need to be taught to detect the presence of a NULL in the subquery and return no rows.
Similarly, the presence of NULL in an outer row could short circuit and avoid joining that row with the subquery (always evaluates to NULL).
For multicolumn NOT IN it would be more complicated.

If we don't want any plan regressions, a null-aware flavor of hash antijoin seems to be needed.

jackwener · 2024-02-03T06:00:11Z

Get it, I thought wrong in the past, thanks @msirek.
We indeed need NAAJ.

github-actions · 2024-04-21T01:46:44Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

Fix incorrect results for NOT IN subqueries with nulls

abac256

github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Nov 20, 2023

update sqllogictest tpch Q16 query plan

04dad99

msirek marked this pull request as ready for review November 20, 2023 06:17

alamb mentioned this pull request Nov 20, 2023

DataFusion weekly project plan (Andrew Lamb) - Nov 20, 2023 #8274

Closed

7 tasks

yukkit approved these changes Nov 21, 2023

View reviewed changes

jonahgao reviewed Nov 21, 2023

View reviewed changes

Address review comments

8e4903f

github-actions bot removed the logical-expr Logical plan and expressions label Nov 21, 2023

alamb mentioned this pull request Nov 27, 2023

DataFusion weekly project plan (Andrew Lamb) - Nov 27, 2023 #8329

Closed

8 tasks

alamb added 2 commits January 31, 2024 15:49

Merge remote-tracking branch 'apache/main' into not_in_values

863c3a7

cleanup tests

de49fbc

alamb mentioned this pull request Jan 31, 2024

DataFusion weekly project plan (Andrew Lamb) - Jan 29, 2024 #9030

Closed

6 tasks

alamb reviewed Jan 31, 2024

View reviewed changes

jackwener reviewed Feb 1, 2024

View reviewed changes

datafusion/optimizer/src/decorrelate_predicate_subquery.rs Show resolved Hide resolved

datafusion/optimizer/src/decorrelate_predicate_subquery.rs Show resolved Hide resolved

alamb mentioned this pull request Feb 4, 2024

DataFusion weekly project plan (Andrew Lamb) - Feb 5, 2024 #9121

Closed

6 tasks

alamb mentioned this pull request Feb 12, 2024

DataFusion weekly project plan (Andrew Lamb) - Feb 12, 2024 #9200

Closed

8 tasks

github-actions bot added the Stale PR has not had any activity for some time label Apr 21, 2024

github-actions bot closed this Apr 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix incorrect results for NOT IN subqueries with nulls #8271

Fix incorrect results for NOT IN subqueries with nulls #8271

msirek commented Nov 20, 2023 •

edited

Loading

yukkit left a comment

yukkit Nov 21, 2023

jonahgao Nov 21, 2023

msirek Nov 21, 2023

alamb left a comment

alamb Jan 31, 2024

jackwener Feb 1, 2024

msirek commented Feb 2, 2024

jackwener commented Feb 3, 2024 •

edited

Loading

github-actions bot commented Apr 21, 2024

Fix incorrect results for NOT IN subqueries with nulls #8271

Fix incorrect results for NOT IN subqueries with nulls #8271

Conversation

msirek commented Nov 20, 2023 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

yukkit left a comment

Choose a reason for hiding this comment

yukkit Nov 21, 2023

Choose a reason for hiding this comment

jonahgao Nov 21, 2023

Choose a reason for hiding this comment

msirek Nov 21, 2023

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

alamb Jan 31, 2024

Choose a reason for hiding this comment

jackwener Feb 1, 2024

Choose a reason for hiding this comment

msirek commented Feb 2, 2024

jackwener commented Feb 3, 2024 • edited Loading

github-actions bot commented Apr 21, 2024

msirek commented Nov 20, 2023 •

edited

Loading

jackwener commented Feb 3, 2024 •

edited

Loading