[GLUTEN-4652][VL] Fix min_by/max_by result mismatch #5544

Merged Apr 29, 2024 (2 commits)

Contributor

@yma11 yma11 commented Apr 26, 2024

What changes were proposed in this pull request?

Fix the min_by/max_by result mismatch. Taking max_by as an example: an intermediate result row such as <null, 11> must be kept so that, when it is compared with another intermediate result such as <5, 8>, the final result is <null, 11>.
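The merge rule described above can be illustrated with a minimal C++ sketch. It is not the Velox or Gluten implementation; `MaxByPartial` and `mergeMaxBy` are hypothetical names, and the state is assumed to be a plain (value, ordering) pair of nullable integers.

```cpp
#include <optional>

// Hypothetical partial state for max_by(x, y): the x seen at the largest y so far.
struct MaxByPartial {
  std::optional<int> value;     // x; may legitimately be null
  std::optional<int> ordering;  // y; null means "no row seen yet"
};

// Merge two partial states, keeping the one with the larger ordering.
// A null value does not disqualify a state; only a null ordering does.
MaxByPartial mergeMaxBy(const MaxByPartial& a, const MaxByPartial& b) {
  if (!a.ordering.has_value()) return b;
  if (!b.ordering.has_value()) return a;
  return (*b.ordering > *a.ordering) ? b : a;
}
```

Merging `{std::nullopt, 11}` with `{5, 8}` keeps the first state, so the value stays null. If the <null, 11> row were dropped before the merge, <5, 8> would win instead.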

How was this patch tested?

Added a new unit test.

#4652

@yma11 yma11 force-pushed the max-by branch 6 times, most recently from 6065078 to 789591b, April 28, 2024 07:34
Contributor Author

yma11 commented Apr 28, 2024

@rui-mo Can you help review this PR? Thanks.

case _: Average | _: Sum if aggFunc.dataType.isInstanceOf[DecimalType] =>
  "row_constructor"
case _: MaxMinBy =>
  "row_constructor_with_all_null"
Contributor

@yma11 Is this change due to a semantic difference between Velox and Spark? I wonder whether it is possible to adjust the semantics of the implementation in Velox and avoid introducing more hacks in Gluten.

Contributor Author

I think that, by default, the upstream row_constructor doesn't mark the struct itself as null; it just takes the input children and wraps them as a row. These issues are mainly caused by the additional projects we added, so it's reasonable to handle them on our side.
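To make the distinction concrete, here is a small C++ sketch of the two behaviors. The plain wrapper follows the description above. The "all null" variant is an assumption inferred only from the name row_constructor_with_all_null (the row is treated as null only when every child is null). The types and function names are hypothetical and are not the Velox or Gluten API.

```cpp
#include <optional>
#include <vector>

using Field = std::optional<int>;

// Hypothetical model of a struct (row) value: std::nullopt stands for a null row.
using Row = std::optional<std::vector<Field>>;

// Plain wrapping: the row itself is never null, whatever its children hold.
Row rowConstructor(const std::vector<Field>& children) {
  return children;
}

// Assumed "all null" variant: the row is null only when every child is null.
Row rowConstructorWithAllNull(const std::vector<Field>& children) {
  for (const Field& child : children) {
    if (child.has_value()) return children;
  }
  return std::nullopt;
}
```

Under this assumption a row such as <null, 11> stays non-null and can still take part in the max_by/min_by comparison, while <null, null> becomes a null row.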

Contributor

I see your point. Could we add a comment to explain why row_constructor_with_all_null is needed by MaxMinBy?

Contributor

@rui-mo rui-mo left a comment

Thanks!

#include "velox/expression/SpecialForm.h"

namespace gluten {
class RowConstructorWithAllNullCallToSpecialForm : public facebook::velox::exec::FunctionCallToSpecialForm {
Contributor

Is it possible to extend RowConstructorWithNullCallToSpecialForm and reuse most of the code?

Contributor Author

updated.

@yma11 yma11 force-pushed the max-by branch 2 times, most recently from d522a35 to bd5ebc7, April 29, 2024 04:30
Contributor Author

yma11 commented Apr 29, 2024

@rui-mo your comments are addressed. Please help review again. Thanks.

Contributor

@rui-mo rui-mo left a comment

Thanks.

@zhouyuan zhouyuan changed the title from "[GLUTEN-4652] Fix min_by/max_by result mismatch" to "[GLUTEN-4652][VL] Fix min_by/max_by result mismatch" Apr 29, 2024
@zhouyuan zhouyuan merged commit 049a477 into apache:main Apr 29, 2024
42 checks passed
@yma11 yma11 deleted the max-by branch May 6, 2024 05:30
@zhouyifan279
Contributor

I found that the result is still not right after applying this patch.
I ran this SQL:

set spark.sql.leafNodeDefaultParallelism=2;
select min_by(a, b), max_by(a, b) from values (5, 6), (null, 11), (null, 5) test(a, b);

Expected result is

NULL    NULL

Actual result is

5      5
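As a cross-check of the expected result, the following C++ sketch models min_by/max_by as a per-partition fold followed by a merge. It is an illustration only, not the Gluten or Velox code path; `ByState`, `aggregate`, and `mergeStates` are hypothetical names, and the split of the three rows into two partitions is an assumption, since the report does not say how the rows are distributed.

```cpp
#include <optional>
#include <utility>
#include <vector>

using Value = std::optional<int>;
using InputRow = std::pair<Value, int>;  // (a, b); b is non-null in this example

// Hypothetical partial state: the a seen at the best b so far.
struct ByState {
  Value value;
  std::optional<int> ordering;  // null means "no row seen yet"
};

// Fold the rows of one partition; preferLarger selects max_by, otherwise min_by.
ByState aggregate(const std::vector<InputRow>& rows, bool preferLarger) {
  ByState s;
  for (const auto& [a, b] : rows) {
    bool better = !s.ordering || (preferLarger ? b > *s.ordering : b < *s.ordering);
    if (better) s = {a, b};
  }
  return s;
}

// Merge two partial states; a null value is kept if its ordering wins.
ByState mergeStates(const ByState& l, const ByState& r, bool preferLarger) {
  if (!l.ordering) return r;
  if (!r.ordering) return l;
  bool takeRight = preferLarger ? *r.ordering > *l.ordering : *r.ordering < *l.ordering;
  return takeRight ? r : l;
}
```

With the rows (5, 6), (null, 11), (null, 5) the smallest b is 5 and the largest is 11, and a is null for both, so min_by and max_by both come out null under this model, whichever way the rows are partitioned.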

@zhouyifan279
Contributor

@yma11 @rui-mo I created a follow-up PR, #5711. Can you help review it?

@zhouyifan279
Contributor

Also cc @ulysses-you
