This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

Incorrect execution result caused by join operation #1168

Open
ziyangRen opened this issue Nov 10, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@ziyangRen

Describe the bug
Running the SQL below returns more rows than expected. A full join between two identical tables, each with 10,000 rows, returns 13,200 rows instead of 10,000. With an inner join, the result has fewer rows than expected. With a left join, the query fails with a core dump. We think the bug is triggered by applying the max aggregate function to a string field before SortMergeJoin: the error does not reproduce when aggregating non-string fields, or when using sum or other aggregate functions on strings.

To Reproduce
Here's the sql:
select t1.value as value1, t2.value as value2, t1.data1a as data1a
from (select value, MAX(data1) as data1a from gy_orc.test_smj3 t1 group by value) t1
FULL JOIN (select value, MAX(data1) as data1b from gy_orc.test_smj4 t2 group by value) t2
on t1.value = t2.value
Note: The data type of data1 is string.

@ziyangRen added the bug label on Nov 10, 2022
@ziyangRen
Author

image
The execution result is shown above. The rows containing null values are the extra rows.

@ziyangRen
Author

@zhouyuan We have found the cause of this problem. We had configured spark.oap.sql.columnar.hashagg.supportString=true, so aggregation on String-type fields is converted to the ColumnarHashAggregate operator. That conversion removes the Sort subtree, which leads to incorrect SortMergeJoin results. How can we correctly insert a Sort operator before SortMergeJoin?
image
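
To make the failure mode concrete, here is a minimal Python sketch of a full outer merge join. It is not Gazelle's or Spark's code. The function name and the four-key sample data are made up for illustration, and keys are assumed to be unique on each side (as they are after a group by). The merge step trusts that both inputs arrive sorted by the join key; when one side arrives in hash order instead, equal keys are missed and emitted as null-padded rows.

```python
def merge_full_outer_join(left_keys, right_keys):
    """Full outer join on unique keys, ASSUMING both inputs are sorted.

    Like the merge phase of a sort-merge join, this never checks the
    ordering; it just advances whichever side has the smaller key.
    """
    out = []
    i = j = 0
    while i < len(left_keys) and j < len(right_keys):
        lk, rk = left_keys[i], right_keys[j]
        if lk == rk:
            out.append((lk, rk))      # matched row
            i += 1
            j += 1
        elif lk < rk:
            out.append((lk, None))    # left row with no partner seen
            i += 1
        else:
            out.append((None, rk))    # right row with no partner seen
            j += 1
    # Whatever remains on either side is emitted unmatched.
    out.extend((k, None) for k in left_keys[i:])
    out.extend((None, k) for k in right_keys[j:])
    return out


keys = ["a", "b", "c", "d"]

# Both sides sorted: every key finds its partner.
sorted_result = merge_full_outer_join(keys, keys)

# Right side in arbitrary (hash-like) order: the Sort was dropped.
unsorted_result = merge_full_outer_join(keys, ["b", "a", "d", "c"])

matched = [row for row in unsorted_result if None not in row]
print(len(sorted_result), len(unsorted_result), len(matched))
```

With sorted inputs the join returns 4 rows. With the unsorted right side it returns 6 rows, of which only 2 are matched, which mirrors the symptoms reported here: a full join gains null-padded rows, and an inner join (matched rows only) loses rows.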

@zhouyuan
Collaborator

Hi @ziyangRen,

Thanks for the detailed and clear log. This is a bug in the "hashagg for string" feature: the implementation does not fit the "sortagg + SMJ" case.
A quick fix is to disallow hashagg in the "sortagg + SMJ" case. However, Gazelle would then need to fall back to vanilla Spark, as Gazelle does not have a "SortAgg" implementation. I'll generate a quick patch for you.

Thanks, -yuan
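
The quick fix described above can be pictured with a toy plan-rewrite rule. This is only a sketch under stated assumptions, not the actual Gazelle patch: the `Node` class, the `rewrite` function, and the plan shape are hypothetical, and the operator names are plain strings. The rule replaces a SortAggregate and its Sort child with a hash aggregate only when the parent does not rely on sorted input; under a SortMergeJoin it leaves the subtree untouched, which corresponds to falling back to vanilla Spark.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical physical-plan node: an operator name plus children."""
    name: str
    children: List["Node"] = field(default_factory=list)


def rewrite(node, parent_needs_order=False):
    """Swap SortAggregate(Sort(x)) for ColumnarHashAggregate(x), unless
    the parent depends on the ordering that the Sort provides."""
    # Assumption: only SortMergeJoin relies on its children's ordering.
    needs_order = node.name == "SortMergeJoin"
    is_sort_agg = (
        node.name == "SortAggregate"
        and node.children
        and node.children[0].name == "Sort"
    )
    if is_sort_agg and not parent_needs_order:
        sort = node.children[0]
        # Safe to drop the Sort: nothing above depends on the order.
        return Node("ColumnarHashAggregate", [rewrite(c) for c in sort.children])
    # Otherwise keep this operator and pass the ordering requirement down.
    return Node(node.name, [rewrite(c, needs_order) for c in node.children])


def names(node):
    """Pre-order list of operator names, for easy comparison."""
    return [node.name] + [n for c in node.children for n in names(c)]


def agg_branch():
    return Node("SortAggregate", [Node("Sort", [Node("Scan")])])


join_plan = Node("SortMergeJoin", [agg_branch(), agg_branch()])
print(names(rewrite(join_plan)))     # Sort subtrees are preserved
print(names(rewrite(agg_branch())))  # standalone aggregate is converted
```

In this sketch the aggregate under the join keeps its Sort, so the join still sees ordered input, while an aggregate with no ordering-sensitive parent is still converted to the hash-based operator.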
