Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semi Join should be NULL-Aware #8844

Closed
zz-jason opened this issue Dec 27, 2018 · 7 comments · Fixed by #19620
Closed

Semi Join should be NULL-Aware #8844

zz-jason opened this issue Dec 27, 2018 · 7 comments · Fixed by #19620
Assignees
Labels
priority/P2 The issue has P2 priority. severity/critical sig/execution SIG execution sig/planner SIG: Planner type/bug The issue is confirmed as a bug.
Milestone

Comments

@zz-jason
Copy link
Member

zz-jason commented Dec 27, 2018

Bug Report

Take this as an example:

drop table if exists t;
create table t(a bigint, b bigint, c bigint);
insert into t values(null, 1, 1), (2, 2, 2), (3, null, 3), (4, 4, 3);

LeftOuterSemiJoin

The join result should be OuterRow + NULL if:

  1. the outer join key is NULL(row count from inner is not empty), or:
  2. there is no inner join key has the same value with the outer side, but there is at least a NULL value in the inner side join key.

MySQL:

MySQL(root@localhost:test) > select a, b, a in (select b from t) from t;
+------+------+------------------------+
| a    | b    | a in (select b from t) |
+------+------+------------------------+
| NULL |    1 |                   NULL | -- satisfies the first condition
|    2 |    2 |                      1 |
|    3 | NULL |                   NULL | -- satisfies the second condition
|    4 |    4 |                      1 |
+------+------+------------------------+
3 rows in set (0.00 sec)

While in TiDB the result is:

TiDB(root@127.0.0.1:test) > select a, b, a in (select b from t) from t;
+------+------+------------------------+
| a    | b    | a in (select b from t) |
+------+------+------------------------+
| NULL |    1 |                      0 | -- incorrect
|    3 | NULL |                      0 | -- incorrect
|    2 |    2 |                      1 |
|    4 |    4 |                      1 |
+------+------+------------------------+
4 rows in set (0.00 sec)

The join result should be OuterRow + 1 if:

  1. the outer join key is not NULL and there is at least one inner join key has the same value with the outer side.

MySQL:

MySQL(root@localhost:test) > select a, b, a in (select b from t) from t;
+------+------+------------------------+
| a    | b    | a in (select b from t) |
+------+------+------------------------+
| NULL |    1 |                   NULL |
|    2 |    2 |                      1 | -- satisfies the first condition
|    3 | NULL |                   NULL |
|    4 |    4 |                      1 | -- satisfies the first condition
+------+------+------------------------+
4 rows in set (0.00 sec)

While in TiDB, the result is:

TiDB(root@127.0.0.1:test) > select a, b, a in (select b from t) from t;
+------+------+------------------------+
| a    | b    | a in (select b from t) |
+------+------+------------------------+
| NULL |    1 |                      0 | -- incorrect
|    3 | NULL |                      0 | -- incorrect
|    4 |    4 |                      1 |
|    2 |    2 |                      1 |
+------+------+------------------------+
4 rows in set (0.00 sec)

The join result should be OuterRow + 0 if:

  1. the outer join key is not NULL and there is no inner join key has the same value with the outer side, and there is no NULL value in the inner side join key.

MySQL:

MySQL(root@localhost:test) > select a, c, a in (select c from t) from t;
+------+------+------------------------+
| a    | c    | a in (select c from t) |
+------+------+------------------------+
| NULL |    1 |                   NULL |
|    2 |    2 |                      1 |
|    3 |    3 |                      1 |
|    4 |    3 |                      0 | -- satisfies the first condition
+------+------+------------------------+
4 rows in set (0.00 sec)

While in TiDB, the result is:

TiDB(root@127.0.0.1:test) > select a, c, a in (select c from t) from t;
+------+------+------------------------+
| a    | c    | a in (select c from t) |
+------+------+------------------------+
| NULL |    1 |                      0 | -- incorrect
|    2 |    2 |                      1 |
|    3 |    3 |                      1 |
|    4 |    3 |                      0 | -- correct
+------+------+------------------------+
4 rows in set (0.00 sec)

Anti LeftOuterSemiJoin

The join result should be OuterRow + NULL if:

  1. the outer join key is NULL, or:
  2. there is no inner join key has the same value with the outer side, but there is at least a NULL value in the inner side join key.

MySQL:

MySQL(root@localhost:test) > select a, b, a not in (select b from t) from t;
+------+------+----------------------------+
| a    | b    | a not in (select b from t) |
+------+------+----------------------------+
| NULL |    1 |                       NULL | -- satisfies the first condition
|    2 |    2 |                          0 |
|    3 | NULL |                       NULL | -- satisfies the second condition
|    4 |    4 |                          0 |
+------+------+----------------------------+
4 rows in set (0.00 sec)

While in TiDB:

TiDB(root@127.0.0.1:test) > select a, b, a not in (select b from t) from t;
+------+------+----------------------------+
| a    | b    | a not in (select b from t) |
+------+------+----------------------------+
|    2 |    2 |                          0 |
| NULL |    1 |                          1 | -- incorrect, should be NULL
|    3 | NULL |                          1 | -- incorrect, should be NULL
|    4 |    4 |                          0 |
+------+------+----------------------------+
4 rows in set (0.00 sec)

The join result should be OuterRow + 0 if:

  1. the outer join key is not NULL and there is at least one inner join key has the same value with the outer side.

MySQL:

MySQL(root@localhost:test) > select a, b, a not in (select b from t) from t;
+------+------+----------------------------+
| a    | b    | a not in (select b from t) |
+------+------+----------------------------+
| NULL |    1 |                       NULL |
|    2 |    2 |                          0 | -- satisfies the first condition
|    3 | NULL |                       NULL |
|    4 |    4 |                          0 | -- satisfies the first condition
+------+------+----------------------------+
4 rows in set (0.00 sec)

While in TiDB:

TiDB(root@127.0.0.1:test) > select a, b, a not in (select b from t) from t;
+------+------+----------------------------+
| a    | b    | a not in (select b from t) |
+------+------+----------------------------+
| NULL |    1 |                          1 |
|    3 | NULL |                          1 |
|    4 |    4 |                          0 | -- this part is correct, satisfies the first condition
|    2 |    2 |                          0 | -- this part is correct, satisfies the first condition
+------+------+----------------------------+
4 rows in set (0.00 sec)

The join result should be OuterRow + 0 if:

  1. the outer join key is not NULL and there is no inner join key has the same value with the outer side, and there is no NULL value in the inner side join key.

MySQL:

MySQL(root@localhost:test) > select a, c, a not in (select c from t) from t;
+------+------+----------------------------+
| a    | c    | a not in (select c from t) |
+------+------+----------------------------+
| NULL |    1 |                       NULL |
|    2 |    2 |                          0 | -- satisfies the first condition
|    3 |    3 |                          0 | -- satisfies the first condition
|    4 |    3 |                          1 |
+------+------+----------------------------+
4 rows in set (0.01 sec)

While in TiDB:

TiDB(root@127.0.0.1:test) > select a, c, a not in (select c from t) from t;
+------+------+----------------------------+
| a    | c    | a not in (select c from t) |
+------+------+----------------------------+
| NULL |    1 |                          1 |
|    3 |    3 |                          0 | -- this part is correct, satisfies the first condition
|    4 |    3 |                          1 |
|    2 |    2 |                          0 | -- this part is correct, satisfies the first condition
+------+------+----------------------------+
4 rows in set (0.00 sec)

Summary

For LeftOuterSemiJoin and Anti LeftOuterSemiJoin, TiDB can not correctly produce the OuterRow + NULL.

There is another two semi join types in TiDB:

  • SemiJoin: to make its result correct, it should only returns the OuterRow if the LeftOuterSemiJoin result is OuterRow + 1, not OuterRow + 0 and OuterRow + NULL
  • Anti SemiJoin: to make its result correct, it should only returns the OuterRow if the Anti LeftOuterSemiJoin result is OuterRow + 1, not OuterRow + 0 and OuterRow + NULL

Suggestions

In Planner

  1. do not infer NOT NULL filters on the inner side of the join key if the join type is the four semi joins.
  2. do not use Merge Join and Index Join for these four semi joins.

If possible, we can remove SemiJoin and Anti SemiJoin. LeftOuterSemiJoin and Anti LeftOuterSemiJoin has the whole message of the join result, we only need these two semi joins actually:

  1. If the semi join appears in the where condition and we need to return the matched outer row, the plan can be generated to:
LeftOuterSemiJoin -> Filter -> Project

LeftOuterSemiJoin: Do the LeftOuterSemiJoin on t1 and t2, generate result of "OuterRow + result"
Filter: filters out the result using this expression: IF(ISNULL(result), 0, result)
Project: remove unused result column, return only the OuterRow
  1. If the semi join appears in the select list, we should return the join result, the plan can be generated to:
LeftOuterSemiJoin -> Project

LeftOuterSemiJoin: Do the LeftOuterSemiJoin on t1 and t2, generate result of "OuterRow + result"
Project: remove unused OuterRow, return only one column contain the result

In Executor

For Hash Join and NestLoopedApply, we should:

  1. specially handle NULL outer join keys for semi joins, return OuterRow + NULL
  2. record whether there is a NULL value in the inner join key, return OuterRow + NULL if the outer row is not NULL and there is no matched inner join key.
@zz-jason zz-jason added type/bug The issue is confirmed as a bug. sig/planner SIG: Planner labels Dec 27, 2018
@eurekaka
Copy link
Contributor

For hash join, we cannot use it for semi-joins at least now, because for queries like select *, a in (select a from t t2 where t2.b = t1.b) from t t1, we would put both t2.a and t2.b into join keys and do not differentiate them, so even if we specially hand null input, we may generate wrong results using hash join.

@eurekaka
Copy link
Contributor

AntiLeftOuterSemiJoin is fixed by #8706

@fzhedu
Copy link
Contributor

fzhedu commented Oct 10, 2019

For hash join, we cannot use it for semi-joins at least now, because for queries like select *, a in (select a from t t2 where t2.b = t1.b) from t t1, we would put both t2.a and t2.b into join keys and do not differentiate them, so even if we specially hand null input, we may generate wrong results using hash join.

This means that the hash join should be null aware.
image
the results from TiDB are wrong in contrast to the results from Mysql.
image

To correct this error, we should change the current hash join

  • building hash tables should not get rid of tuples with null keys;
  • differing equal conditions from semi joins or where clauses.

@fzhedu fzhedu reopened this Oct 10, 2019
@fzhedu fzhedu self-assigned this Nov 8, 2019
@ghost ghost added the correctness label Aug 7, 2020
@ghost
Copy link

ghost commented Aug 7, 2020

I believe the original cases are all fixed, but the hash-join case is not fixed.

Original testcase:

drop table if exists t;
create table t(a bigint, b bigint, c bigint);
insert into t values(null, 1, 1), (2, 2, 2), (3, null, 3), (4, 4, 3);
select a, b, a in (select b from t) from t;
select a, c, a in (select c from t) from t;
select a, b, a not in (select b from t) from t;
select a, c, a not in (select c from t) from t;

select tidb_version()\G

..

mysql> select a, b, a in (select b from t) from t; # correct
+------+------+------------------------+
| a    | b    | a in (select b from t) |
+------+------+------------------------+
| NULL |    1 |                   NULL |
|    2 |    2 |                      1 |
|    3 | NULL |                   NULL |
|    4 |    4 |                      1 |
+------+------+------------------------+
4 rows in set (0.00 sec)

mysql> select a, c, a in (select c from t) from t; # correct
+------+------+------------------------+
| a    | c    | a in (select c from t) |
+------+------+------------------------+
| NULL |    1 |                   NULL |
|    2 |    2 |                      1 |
|    3 |    3 |                      1 |
|    4 |    3 |                      0 |
+------+------+------------------------+
4 rows in set (0.00 sec)

mysql> select a, b, a not in (select b from t) from t; # correct
+------+------+----------------------------+
| a    | b    | a not in (select b from t) |
+------+------+----------------------------+
| NULL |    1 |                       NULL |
|    2 |    2 |                          0 |
|    3 | NULL |                       NULL |
|    4 |    4 |                          0 |
+------+------+----------------------------+
4 rows in set (0.00 sec)

mysql> select a, c, a not in (select c from t) from t; # correct
+------+------+----------------------------+
| a    | c    | a not in (select c from t) |
+------+------+----------------------------+
| NULL |    1 |                       NULL |
|    2 |    2 |                          0 |
|    3 |    3 |                          0 |
|    4 |    3 |                          1 |
+------+------+----------------------------+
4 rows in set (0.00 sec)

mysql> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v4.0.0-beta.2-893-g4e829aaee
Edition: Community
Git Commit Hash: 4e829aaee7b656aa807814708ae05af5233302af
Git Branch: master
UTC Build Time: 2020-08-04 12:40:52
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)

Hash join case:

DROP TABLE IF EXISTS ss, tt;

create table ss (
 a bigint,
 b bigint
);
create table tt (
 a bigint,
 b bigint
);
INSERT INTO ss VALUES (1,NULL),(2,NULL),(2,2);
INSERT INTO tt VALUES (1,1),(1,NULL),(2,NULL);
SELECT tt.a, tt.b, (tt.a, tt.b) in (select a,b from ss) from tt;

..

mysql> SELECT tt.a, tt.b, (tt.a, tt.b) in (select a,b from ss) from tt;
+------+------+--------------------------------------+
| a    | b    | (tt.a, tt.b) in (select a,b from ss) |
+------+------+--------------------------------------+
|    1 |    1 |                                    0 |
|    1 | NULL |                                    0 |
|    2 | NULL |                                    0 |
+------+------+--------------------------------------+
3 rows in set (0.01 sec)

The third column should be NULL and not zero.

@eurekaka
Copy link
Contributor

@fzhedu PTAL

@qw4990 qw4990 added priority/P1 The issue has P1 priority. sig/execution SIG execution labels Aug 27, 2020
@fzhedu fzhedu added priority/P2 The issue has P2 priority. and removed priority/P1 The issue has P1 priority. labels Aug 28, 2020
@scsldb scsldb added this to the v4.0.13 milestone Oct 29, 2020
@lzmhhh123 lzmhhh123 linked a pull request Nov 2, 2020 that will close this issue
@scsldb scsldb modified the milestones: v4.0.13, v4.0.9 Nov 4, 2020
@ti-srebot
Copy link
Contributor

ti-srebot commented Nov 12, 2020

Please edit this comment or add a new comment to complete the following information

Not a bug

  1. Remove the 'type/bug' label
  2. Add notes to indicate why it is not a bug

Duplicate bug

  1. Add the 'type/duplicate' label
  2. Add the link to the original bug

Bug

Note: Make Sure that 'component', and 'severity' labels are added
Example for how to fill out the template: #20100

1. Root Cause Analysis (RCA) (optional)

2. Symptom (optional)

3. All Trigger Conditions (optional)

4. Workaround (optional)

5. Affected versions

6. Fixed versions

v5.0.0-rc

@ti-srebot
Copy link
Contributor

( AffectedVersions ) fields are empty.
The values in ( FixedVersions ) fields are incorrect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/P2 The issue has P2 priority. severity/critical sig/execution SIG execution sig/planner SIG: Planner type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants