-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix incorrect results for NOT IN subqueries with nulls #8271
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works for me.
@@ -1077,6 +1116,148 @@ mod tests { | |||
Ok(()) | |||
} | |||
|
|||
/// Test for single NOT IN subquery filter and nullable subquery column | |||
#[test] | |||
fn not_in_nullable_subquery_simple() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 beautiful unit tests
.query | ||
.subquery | ||
.schema() | ||
.field_from_column(&unqualified_right_col)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use field_with_unqualified_name
here? The unqualified name can be cloned from right_col.name
.
So that there is no need to create unqualified_right_col
, and we do not need to modify the create_col_from_scalar_expr
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I've made the suggested change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @msirek -- I was going through old PRs and found this one that somehow slipped through.
I took the liberty of merging the code in the PR up from main.
The code is quite elegant (as all your previous PRs) -- but mind bending (as are many NULL related PRs). Some of the plans look like they have regressed and are no longer using HashJoin with equality predicates.
Perhaps we can use https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.null_equals_null or similar to mark the joins correctly and still use eqi-joins but with different null matching semantics
cc @jackwener
@@ -78,31 +78,25 @@ GlobalLimitExec: skip=0, fetch=10 | |||
------------------CoalesceBatchesExec: target_batch_size=8192 | |||
--------------------RepartitionExec: partitioning=Hash([p_brand@0, p_type@1, p_size@2, alias1@3], 4), input_partitions=4 | |||
----------------------AggregateExec: mode=Partial, gby=[p_brand@1 as p_brand, p_type@2 as p_type, p_size@3 as p_size, ps_suppkey@0 as alias1], aggr=[] | |||
------------------------CoalesceBatchesExec: target_batch_size=8192 | |||
--------------------------HashJoinExec: mode=Partitioned, join_type=LeftAnti, on=[(ps_suppkey@0, s_suppkey@0)] | |||
------------------------NestedLoopJoinExec: join_type=LeftAnti, filter=ps_suppkey@0 = s_suppkey@1 IS DISTINCT FROM false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this is likely a regression in the query plan now because somehow the join can't use an eqijoin. Maybe we need to teach the join operator about this subtely, or set null_equal_null
perhaps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should identify is distinct from
, and then set the null_equal_null
according to it.
Thank you!
The semantics of For example, in this test case: SELECT 'Found' WHERE 1 NOT IN (NULL, 2, 3);
0 rows in set. The result of evaluating "1 IS NOT DISTINCT FROM NULL" is FALSE, the same as when comparing 1 with 2 and 3, so the NOT IN result would be TRUE (but should be NULL). There is a good description of the needed semantics in null-aware antijoin, referenced in #8241. For the general case: SELECT * FROM t1 WHERE nullable_column1 NOT IN (SELECT nullable_column2 FROM t2); The presence of any NULL in the subquery means the predicate should always evaluate to NULL for every outer row, so the antijoin would need to be taught to detect the presence of a NULL in the subquery and return no rows. If we don't want any plan regressions, a null-aware flavor of hash antijoin seems to be needed. |
Get it, I thought wrong in the past, thanks @msirek. |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Closes #8241.
Rationale for this change
NOT IN queries executed as antijoin did not honor three-valued boolean logic.
The equality predicate built for antijoin, such as 1 = my_tab.a for the following
query is never
true
, so would previously make the filter appear to be qualified:The above NOT IN expression should evaluate the same as the expression 1 != 3 AND 1 != null,
which evaluates to null (which effectively is the same as false when used as a filter or join predicate).
What changes are included in this PR?
The code which decorrelates a NOT IN subquery into an antijoin with an equality condition
between the left and right expressions of the NOT IN predicate is updated so that where
previously the condition left_expression = right_expression was always built, we now build
left_expression = right_expression IS NOT FALSE whenever the left expression or right
expression is nullable.
Are these changes tested?
Are there any user-facing changes?
No