[Fix](multi-catalog) Fix string dictionary filtering when using null related functions in parquet and orc reader by disabling dictionary filtering when predicates contain functions. #35335

Merged on May 27, 2024 (1 commit)

kaka11chen
Contributor

Proposed changes

Issue

The following SQL queries return incorrect results when a NULL-related function is applied to a dictionary-encoded string column:

select * from ( select IF(o_orderpriority IS NULL, 'null', o_orderpriority) AS o_orderpriority from test_string_dict_filter_orc ) as A where o_orderpriority = 'null';
select * from ( select IFNULL(o_orderpriority, 'null') AS o_orderpriority from test_string_dict_filter_parquet ) as A where o_orderpriority = 'null';
select * from ( select COALESCE(o_orderpriority, 'null') AS o_orderpriority from test_string_dict_filter_parquet ) as A where o_orderpriority = 'null';

Root cause:

The current implementation of dictionary filtering does not account for NULL values, because the dictionary itself has no encoding for NULL. As a result, many NULL-related functions and expressions, such as IS NULL, IS NOT NULL, and COALESCE, are evaluated incorrectly.
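
The root cause can be illustrated with a small toy model. This is only a sketch, not the Doris reader code: the helper names are hypothetical, and it assumes that a naive dictionary filter evaluates the predicate once per dictionary entry and treats rows without a dictionary code (NULL rows) as non-matching. The sample rows mirror the `o_orderpriority` values used in the follow-up to this PR.

```python
from typing import Callable, List, Optional

Predicate = Callable[[Optional[str]], bool]


def ifnull_equals_null(value: Optional[str]) -> bool:
    """Row-level meaning of: IFNULL(value, 'null') = 'null'."""
    return (value if value is not None else "null") == "null"


def filter_row_by_row(rows: List[Optional[str]], predicate: Predicate) -> List[int]:
    """Reference behaviour: evaluate the predicate on every row, NULLs included."""
    return [i for i, value in enumerate(rows) if predicate(value)]


def filter_by_dictionary(rows: List[Optional[str]], predicate: Predicate) -> List[int]:
    """Naive dictionary filter (assumed model, not the real implementation)."""
    # The dictionary stores distinct non-NULL values only; NULL has no entry.
    dictionary = sorted({value for value in rows if value is not None})
    # The predicate is evaluated once per dictionary entry.
    matching = {value for value in dictionary if predicate(value)}
    # Assumption: a NULL row has no dictionary code, so it can never match.
    return [i for i, value in enumerate(rows) if value is not None and value in matching]


rows = ["5-LOW", "1-URGENT", "5-LOW", None, "5-LOW"]
expected_rows = filter_row_by_row(rows, ifnull_equals_null)
dictionary_rows = filter_by_dictionary(rows, ifnull_equals_null)
print(expected_rows, dictionary_rows)
```

In this model the row-by-row evaluation selects the NULL row, while the dictionary-based evaluation never sees a NULL input and therefore selects nothing.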

Solution

As a first step, this PR disables dictionary filtering when a predicate contains functions. Dictionary filtering that handles NULL values correctly will be implemented later.
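
The idea of the guard can be sketched as a walk over the predicate's expression tree. This is a hypothetical illustration only: the `Expr` class, the node kinds, and `can_use_dictionary_filter` are invented for this example and do not correspond to the actual expression classes in the Doris backend.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Expr:
    """Hypothetical expression node: kind is 'slot', 'literal', 'binary_pred' or 'function'."""
    kind: str
    name: str = ""
    children: List["Expr"] = field(default_factory=list)


def contains_function(expr: Expr) -> bool:
    """Return True if the node or any of its descendants is a function call."""
    if expr.kind == "function":
        return True
    return any(contains_function(child) for child in expr.children)


def can_use_dictionary_filter(predicate: Expr) -> bool:
    """Conservative rule: skip dictionary filtering whenever a function appears."""
    return not contains_function(predicate)


slot = Expr("slot", "o_orderpriority")
literal = Expr("literal", "'null'")

# o_orderpriority = 'null'
plain_predicate = Expr("binary_pred", "eq", [slot, literal])
# IFNULL(o_orderpriority, 'null') = 'null'
wrapped_predicate = Expr(
    "binary_pred", "eq", [Expr("function", "ifnull", [slot, literal]), literal]
)

print(can_use_dictionary_filter(plain_predicate))
print(can_use_dictionary_filter(wrapped_predicate))
```

The rule is deliberately conservative: it also gives up dictionary filtering for functions that would be safe, in exchange for correct results until NULL-aware dictionary filtering exists.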

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@kaka11chen changed the title from "[Fix](multi-catalog) Fix string dict filtering when use null related functions in parquet and orc reader." to "[Fix](multi-catalog) Fix string dictionary filtering when using null related functions in parquet and orc reader by disabling dictionary filtering when predicates contain functions." on May 24, 2024
@kaka11chen
Contributor Author

run buildall

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 39799 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 5efc1d5f7942574acdcd03c4d6c972e0d2801dac, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17896	4485	4279	4279
q2	2700	212	189	189
q3	11609	1171	1156	1156
q4	10603	771	882	771
q5	7616	2729	2740	2729
q6	220	132	136	132
q7	945	608	603	603
q8	9564	2058	2044	2044
q9	8902	6452	6429	6429
q10	8925	3684	3682	3682
q11	449	248	242	242
q12	427	219	214	214
q13	18119	2982	2965	2965
q14	256	218	232	218
q15	503	468	474	468
q16	518	381	377	377
q17	951	626	750	626
q18	8077	7452	7475	7452
q19	4060	1549	1440	1440
q20	641	295	312	295
q21	4955	3209	3809	3209
q22	328	279	279	279
Total cold run time: 118264 ms
Total hot run time: 39799 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	4334	4207	4189	4189
q2	364	264	274	264
q3	3004	2794	2735	2735
q4	1850	1605	1596	1596
q5	5221	5252	5262	5252
q6	212	124	127	124
q7	2072	1766	1748	1748
q8	3148	3280	3256	3256
q9	8309	8330	8316	8316
q10	3855	3657	3641	3641
q11	582	474	482	474
q12	758	560	575	560
q13	17447	2982	3003	2982
q14	295	265	264	264
q15	519	471	465	465
q16	469	424	411	411
q17	1754	1488	1468	1468
q18	7616	7588	7464	7464
q19	2799	1559	1545	1545
q20	1969	1786	1771	1771
q21	4843	4652	4697	4652
q22	565	474	499	474
Total cold run time: 71985 ms
Total hot run time: 53651 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 35.66% (9019/25295)
Line Coverage: 27.32% (74583/273043)
Region Coverage: 26.54% (38601/145432)
Branch Coverage: 23.40% (19690/84134)
Coverage Report: http://coverage.selectdb-in.cc/coverage/5efc1d5f7942574acdcd03c4d6c972e0d2801dac_5efc1d5f7942574acdcd03c4d6c972e0d2801dac/report/index.html

@doris-robot

TPC-DS: Total hot run time: 169494 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 5efc1d5f7942574acdcd03c4d6c972e0d2801dac, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query1	911	379	378	378
query2	6452	2292	2373	2292
query3	6653	206	221	206
query4	19179	17217	17134	17134
query5	4162	411	411	411
query6	256	157	158	157
query7	4582	311	291	291
query8	242	188	184	184
query9	8615	2369	2360	2360
query10	452	271	277	271
query11	10511	10028	9922	9922
query12	138	86	88	86
query13	1650	354	351	351
query14	10208	7486	7611	7486
query15	212	172	168	168
query16	7882	267	259	259
query17	1838	535	519	519
query18	1973	272	281	272
query19	208	169	184	169
query20	89	86	81	81
query21	195	135	128	128
query22	4072	3900	3857	3857
query23	33523	33259	33125	33125
query24	12024	2815	2872	2815
query25	694	344	355	344
query26	1791	159	156	156
query27	2936	315	328	315
query28	7386	2009	2007	2007
query29	1121	597	593	593
query30	311	171	173	171
query31	953	760	750	750
query32	97	52	52	52
query33	771	265	286	265
query34	1016	459	483	459
query35	715	579	582	579
query36	1053	919	888	888
query37	272	69	75	69
query38	2904	2792	2779	2779
query39	857	790	775	775
query40	279	122	125	122
query41	46	45	43	43
query42	102	94	97	94
query43	567	538	542	538
query44	1201	714	719	714
query45	182	166	165	165
query46	1072	734	725	725
query47	1846	1749	1777	1749
query48	364	290	286	286
query49	1187	396	383	383
query50	770	377	415	377
query51	6862	6737	6774	6737
query52	98	93	92	92
query53	349	281	286	281
query54	989	421	423	421
query55	72	72	72	72
query56	267	239	234	234
query57	1145	1047	1060	1047
query58	241	215	227	215
query59	3199	3189	3164	3164
query60	278	260	257	257
query61	96	91	111	91
query62	631	445	471	445
query63	308	282	280	280
query64	9772	2231	1742	1742
query65	3168	3120	3126	3120
query66	1382	349	321	321
query67	15474	15308	14723	14723
query68	4545	541	564	541
query69	438	275	258	258
query70	1127	1073	1097	1073
query71	406	268	269	268
query72	7567	5349	2732	2732
query73	713	323	319	319
query74	5998	5692	5633	5633
query75	3405	2626	2608	2608
query76	2856	967	920	920
query77	438	269	264	264
query78	10184	9788	9965	9788
query79	2415	510	516	510
query80	998	436	425	425
query81	523	244	246	244
query82	662	91	95	91
query83	236	170	170	170
query84	240	90	85	85
query85	1639	365	267	267
query86	484	276	294	276
query87	3321	3183	3151	3151
query88	4084	2326	2339	2326
query89	480	412	388	388
query90	2002	189	191	189
query91	127	100	95	95
query92	61	46	48	46
query93	1565	507	485	485
query94	1208	182	184	182
query95	407	303	310	303
query96	571	259	266	259
query97	3222	2973	3068	2973
query98	240	213	216	213
query99	1103	850	870	850
Total cold run time: 274116 ms
Total hot run time: 169494 ms

@doris-robot

ClickBench: Total hot run time: 30.57 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 5efc1d5f7942574acdcd03c4d6c972e0d2801dac, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.04	0.04	0.04
query2	0.08	0.05	0.04
query3	0.23	0.05	0.05
query4	1.69	0.07	0.06
query5	0.48	0.47	0.51
query6	1.11	0.73	0.72
query7	0.01	0.01	0.02
query8	0.05	0.04	0.04
query9	0.52	0.49	0.50
query10	0.54	0.53	0.55
query11	0.14	0.10	0.11
query12	0.15	0.12	0.11
query13	0.61	0.59	0.59
query14	0.77	0.79	0.77
query15	0.82	0.80	0.80
query16	0.34	0.37	0.37
query17	1.02	1.01	1.01
query18	0.22	0.25	0.25
query19	1.74	1.64	1.70
query20	0.02	0.01	0.01
query21	15.71	0.67	0.65
query22	4.46	7.11	2.11
query23	18.30	1.34	1.19
query24	1.62	0.37	0.19
query25	0.13	0.08	0.08
query26	0.25	0.17	0.17
query27	0.08	0.07	0.07
query28	13.27	1.01	1.07
query29	12.76	3.31	3.30
query30	0.24	0.07	0.05
query31	2.87	0.37	0.37
query32	3.31	0.46	0.47
query33	2.84	2.91	2.86
query34	16.99	4.43	4.42
query35	4.48	4.51	4.49
query36	0.64	0.48	0.46
query37	0.18	0.15	0.15
query38	0.15	0.15	0.14
query39	0.04	0.04	0.03
query40	0.17	0.13	0.14
query41	0.09	0.05	0.06
query42	0.06	0.05	0.05
query43	0.04	0.04	0.04
Total cold run time: 109.26 s
Total hot run time: 30.57 s

Contributor

@morningman morningman left a comment

LGTM

@github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) on May 26, 2024
Contributor

PR approved by at least one committer and no changes requested.

Contributor

PR approved by anyone and no changes requested.

@morningman morningman merged commit a93eafe into apache:master May 27, 2024
28 of 30 checks passed
dataroaring pushed a commit that referenced this pull request May 27, 2024
…function in parquet and orc reader. (#35335)

The following SQL queries return incorrect results when a NULL-related function is applied to a dictionary-encoded string column:
```
select * from ( select IF(o_orderpriority IS NULL, 'null', o_orderpriority) AS o_orderpriority from test_string_dict_filter_orc ) as A where o_orderpriority = 'null';
```
```
select * from ( select IFNULL(o_orderpriority, 'null') AS o_orderpriority from test_string_dict_filter_parquet ) as A where o_orderpriority = 'null';
```
```
select * from ( select COALESCE(o_orderpriority, 'null') AS o_orderpriority from test_string_dict_filter_parquet ) as A where o_orderpriority = 'null';
```
yiguolei pushed a commit that referenced this pull request May 27, 2024
…function in parquet and orc reader. (#35335)

seawinde pushed a commit to seawinde/doris that referenced this pull request May 27, 2024
…function in parquet and orc reader. (apache#35335)

kaka11chen added a commit to kaka11chen/doris that referenced this pull request May 28, 2024
…function in parquet and orc reader. (apache#35335)

kaka11chen added a commit to kaka11chen/doris that referenced this pull request May 28, 2024
…function in parquet and orc reader. (apache#35335)

xiaokang pushed a commit that referenced this pull request May 28, 2024
…related functions in parquet and orc reader by disabling dictionary filtering when predicates contain functions #35335 (#35514)
@morningman morningman mentioned this pull request Jun 1, 2024
mongo360 pushed a commit to mongo360/doris that referenced this pull request Aug 16, 2024
…related functions in parquet and orc reader by disabling dictionary filtering when predicates contain functions apache#35335 (apache#35514)
morningman pushed a commit that referenced this pull request Oct 21, 2024
…te express is not slot (#42113)

## Proposed changes
follow up #35335
When a `case when ... then ... when ... then ... else` expression occurs,
function_expr may not exist in the pushed-down predicate, but NULL values
are still handled incorrectly.

table data:
```text
mysql> select o_orderpriority from test_string_dict_filter_orc;
+-----------------+
| o_orderpriority |
+-----------------+
| 5-LOW           |
| 1-URGENT        |
| 5-LOW           |
| NULL            |
| 5-LOW           |
+-----------------+
```

before:
```text
mysql> select count(o_orderpriority) from ( select (case when o_orderpriority = 'x' then '1' when o_orderpriority = 'y' then '2' else '0' end) as o_orderpriority from test_string_dict_filter_orc ) as A where o_orderpriority = '0';
+------------------------+
| count(o_orderpriority) |
+------------------------+
|                      4 |
+------------------------+
```

after:
```text
mysql> select count(o_orderpriority) from ( select (case when o_orderpriority = 'x' then '1' when o_orderpriority = 'y' then '2' else '0' end) as o_orderpriority from test_string_dict_filter_orc ) as A where o_orderpriority = '0';
+------------------------+
| count(o_orderpriority) |
+------------------------+
|                      5 |
+------------------------+
```
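
The before and after counts can be reproduced with a toy model. This is a sketch under assumptions, not the actual reader logic: it assumes that the naive dictionary filter evaluates the `case when` expression once per dictionary entry and treats NULL rows, which have no dictionary code, as non-matching. The function names are hypothetical.

```python
from typing import List, Optional


def case_expr(value: Optional[str]) -> str:
    """case when v = 'x' then '1' when v = 'y' then '2' else '0' end.

    A NULL input matches neither branch, so it falls through to the else arm.
    """
    if value == "x":
        return "1"
    if value == "y":
        return "2"
    return "0"


def count_row_by_row(rows: List[Optional[str]]) -> int:
    """Reference behaviour: evaluate the expression on every row, NULLs included."""
    return sum(1 for value in rows if case_expr(value) == "0")


def count_with_naive_dictionary_filter(rows: List[Optional[str]]) -> int:
    """Assumed model of the old behaviour: only dictionary entries are evaluated."""
    dictionary = {value for value in rows if value is not None}
    matching = {value for value in dictionary if case_expr(value) == "0"}
    # Assumption: NULL rows have no dictionary code and are dropped by the filter.
    return sum(1 for value in rows if value is not None and value in matching)


rows = ["5-LOW", "1-URGENT", "5-LOW", None, "5-LOW"]
count_after = count_row_by_row(rows)
count_before = count_with_naive_dictionary_filter(rows)
print(count_before, count_after)
```

Under these assumptions the dictionary-based count misses the NULL row, which is consistent with the difference between the two results shown above.
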
morningman pushed a commit to morningman/doris that referenced this pull request Oct 21, 2024
…te express is not slot (apache#42113)

morningman pushed a commit to morningman/doris that referenced this pull request Oct 21, 2024
…te express is not slot (apache#42113)

Labels: approved (Indicates a PR has been approved by one committer.), dev/2.0.11-merged, dev/2.1.4-merged, dev/3.0.0-merged, reviewed, usercase (Important user case type label)