-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
plan/executor: handle null
for anti join properly
#8706
Conversation
null
for AntiSemiJoin properlynull
for anti join properly
/run-all-tests |
/run-all-tests tidb-test=pr/702 |
/run-all-tests tidb-test=pr/702 |
/run-unit-test |
/run-all-tests tidb-test=pr/702 |
@winoros comments addressed, the failed unit-test is caused by goroutine leak in ddl package. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
/run-unit-test |
@zz-jason comments addressed, PTAL |
- enhance joiner to be aware of whether the result of join condition is false or null, and act correspondingly for AntiSemiJoin and LeftOuterAntiSemiJoin; - differentiate `not exists` from `not in / != all`, since they have various behaviors regarding null input; - for `not in (subq)`, there are 2 challenges: * executor of index join / hash join / merge join filters out null inner key before the real join operation; * we don't differentiate `in` condition and equal conditions pulled up later in decorrelation now, but they should have various behaviors either regarding null input; so if the inner column of `not in` condition is possible to be null, we cannot apply decorrelate rule for it, otherwise we would generate wrong results.
- rename isnull to hasNull - add comments
30a2728
to
a5c4f95
Compare
After some investigation from the behavior of MySQL, the join result of
Here the OuterRow is a record from the left hand side table. 1. LeftOuterSemiJoin
2. Anti LeftOuterSemiJoin
@eurekaka Am I right? |
@zz-jason yes, you are right. This is the output of MySQL:
while this is the output after applying this PR:
For |
@zz-jason
|
/rebuild |
} | ||
|
||
for inner := inners.Current(); inner != inners.End(); inner = inners.Next() { | ||
j.makeShallowJoinRow(j.outerIsRight, inner, outer) | ||
|
||
matched, err = expression.EvalBool(j.ctx, j.conditions, j.shallowRow.ToRow()) | ||
matched, _, err = expression.EvalBool(j.ctx, j.conditions, j.shallowRow.ToRow()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also maintain the correctness of hasNull
for semi join? If not, how about adding some comment about why we don't care whether there is any NULL
value in the inner side and make the return value hasNull
to false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For semi join, false and null are same because in both cases the outer row would not be outputted. I will add some comments.
} | ||
|
||
for inner := inners.Current(); inner != inners.End(); inner = inners.Next() { | ||
j.makeShallowJoinRow(false, inner, outer) | ||
|
||
matched, err = expression.EvalBool(j.ctx, j.conditions, j.shallowRow.ToRow()) | ||
matched, _, err = expression.EvalBool(j.ctx, j.conditions, j.shallowRow.ToRow()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For "left outer semi join", I think we should make the return value hasNull
be correct. Because in this join method, what we really needed is the join result: matched or is NULL. For example, if we don't consider hasNull
, we won't be able to produce the correct result for this query:
TiDB(root@127.0.0.1:test) > desc select a in (select b from t) from t;
+-------------------------+-------+------+----------------------------------------------------------------------------+
| id | count | task | operator info |
+-------------------------+-------+------+----------------------------------------------------------------------------+
| Projection_6 | 4.00 | root | 5_aux_0 |
| └─HashLeftJoin_7 | 4.00 | root | left outer semi join, inner:TableReader_11, equal:[eq(test.t.a, test.t.b)] |
| ├─TableReader_9 | 4.00 | root | data:TableScan_8 |
| │ └─TableScan_8 | 4.00 | cop | table:t, range:[-inf,+inf], keep order:false, stats:pseudo |
| └─TableReader_11 | 4.00 | root | data:TableScan_10 |
| └─TableScan_10 | 4.00 | cop | table:t, range:[-inf,+inf], keep order:false, stats:pseudo |
+-------------------------+-------+------+----------------------------------------------------------------------------+
6 rows in set (0.00 sec)
TiDB(root@127.0.0.1:test) > select a in (select b from t) from t;
+------------------------+
| a in (select b from t) |
+------------------------+
| 1 |
| 0 |
| 0 |
| 0 |
+------------------------+
4 rows in set (0.00 sec)
the table was generated by:
create table t(a bigint, b bigint, c bigint);
insert into t values(1, 1, 1), (null, 2, 2), (3, null, 3), (4, 400, 400);
the correct result is:
MySQL(root@localhost:test) > select a in (select b from t) from t;
+------------------------+
| a in (select b from t) |
+------------------------+
| 1 |
| NULL |
| NULL |
| NULL |
+------------------------+
4 rows in set (0.00 sec)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be done after we differentiate whether the join key is from in
or not, for example, here is the output from MySQL:
mysql> select a in (select b from t) from t;
+------------------------+
| a in (select b from t) |
+------------------------+
| 1 |
| NULL |
| NULL |
| NULL |
+------------------------+
4 rows in set (0.00 sec)
mysql> select exists(select * from t t2 where t2.b = t1.a) from t t1;
+----------------------------------------------+
| exists(select * from t t2 where t2.b = t1.a) |
+----------------------------------------------+
| 1 |
| 0 |
| 0 |
| 0 |
+----------------------------------------------+
4 rows in set (0.00 sec)
mysql> select a in (select a from t t2 where t2.b = t1.a) from t t1;
+---------------------------------------------+
| a in (select a from t t2 where t2.b = t1.a) |
+---------------------------------------------+
| 1 |
| 0 |
| 0 |
| 0 |
+---------------------------------------------+
4 rows in set (0.00 sec)
} | ||
|
||
func (j *leftOuterSemiJoiner) onMatch(outer chunk.Row, chk *chunk.Chunk) { | ||
chk.AppendPartialRow(0, outer) | ||
chk.AppendInt64(outer.Len(), 1) | ||
} | ||
|
||
func (j *leftOuterSemiJoiner) onMissMatch(outer chunk.Row, chk *chunk.Chunk) { | ||
func (j *leftOuterSemiJoiner) onMissMatch(hasNull bool, outer chunk.Row, chk *chunk.Chunk) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After maintaining the correct hasNull
value, if hasNull
is true, we should append a NULL value to make the correct join result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
The issue is fixed by another new PR: #9051, so we can close this now. |
What problem does this PR solve?
Fix #8642
What is changed and how it works?
not exists
fromnot in / != all
, since they have various behaviors regarding null input;not in (subq)
, there are 2 challenges:in
condition and equal conditions pulled up later in decorrelation now, but they should have various behaviors regarding null input as well;not in
condition is possible to be null, we cannot apply decorrelate rule for it, otherwise we would generate wrong results.Check List
Tests
Code changes
Side effects
not in
queries unable to decorrelate now, but correctness should be much more important than performance.Related changes
This change is