[SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow #27627
Conversation
Please see some notes in this JIRA for the two approaches. This is an implementation of the approach 2 fix. I'm marking this as WIP in order to get comments.
if (child.nullable) {
  Seq(
    /* sum = */
    coalesce(coalesce(sum, zero) + child.cast(sumDataType), sum),
The changes in Sum are mostly to check whether overflow has occurred when we do the different additions in the updateExpressions and mergeExpressions. The actual addition operations are all the same. Reading the diff may not show that easily, so I wanted to make a note of it here.
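To make the idea concrete, here is a rough plain-Scala model of an overflow-checked decimal addition (illustrative only, not the actual catalyst expressions; the decimal(38,18) bound is assumed for the example):

```scala
import java.math.{BigDecimal => JBigDecimal}

// Illustrative model: an addition that reports overflow instead of silently
// producing a value that no longer fits decimal(38, 18).
object CheckedDecimalAdd {
  private val MaxPrecision = 38
  private val Scale = 18
  // decimal(38, 18) can only hold absolute values strictly below 10^20.
  private val Bound: JBigDecimal = JBigDecimal.TEN.pow(MaxPrecision - Scale)

  /** Returns None when the exact sum can no longer be represented. */
  def add(a: JBigDecimal, b: JBigDecimal): Option[JBigDecimal] = {
    val sum = a.add(b)
    if (sum.abs.compareTo(Bound) >= 0) None else Some(sum)
  }
}
```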
override lazy val initialValues: Seq[Expression] = Seq(
-  /* sum = */ Literal.create(null, sumDataType)
+  /* sum = */ Literal.create(null, sumDataType),
+  /* overflow = */ Literal.create(false, BooleanType)
We keep track of overflow using this aggBufferAttributes entry, overflow, to know whether any of the intermediate add operations in updateExpressions and/or mergeExpressions overflowed. If overflow is true and the spark.sql.ansi.enabled flag is false, then we return null for the sum operation in evaluateExpression.
cc @mgaido91
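As a reading aid, here is a plain-Scala sketch of the buffer semantics described above (this models the idea only; the actual implementation builds catalyst expressions, and the evaluate behavior follows the comment):

```scala
// Illustrative model of the (sum, overflow) aggregation buffer described above.
object SumOverflowBufferModel {
  final case class SumBuffer(sum: Option[BigDecimal], overflow: Boolean)

  // initialValues analogue: sum starts as null, overflow starts as false.
  val initial: SumBuffer = SumBuffer(sum = None, overflow = false)

  // evaluateExpression analogue: honor spark.sql.ansi.enabled on overflow.
  def evaluate(buf: SumBuffer, ansiEnabled: Boolean): Option[BigDecimal] =
    if (!buf.overflow) buf.sum
    else if (ansiEnabled) throw new ArithmeticException("Decimal overflow in sum aggregation")
    else None // non-ANSI mode: return null rather than a wrong result
}
```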
I'm a little confused. These expressions are used in non-whole-stage-codegen as well, so why does only whole-stage codegen have the problem?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/decimalExpressions.scala
override def dataType: DataType = BooleanType

override def nullable: Boolean = true
Can we override nullable with false, as doGenCode() does?
Since the child can be nullable, the input value can be null. Making nullable false in that case will not work, as it may result in an NPE. We could change doGenCode() here to add the null check, but since nullSafeCodeGen in UnaryExpression already takes care of the nullable checks, there seems to be no need to add null checks here.
Thanks @cloud-fan and @kiszk for your comments.
In the case where whole-stage codegen is disabled, the exception comes from here: AggregationIterator -> .. -> JoinedRow -> UnsafeRow -> Decimal.set, where it checks whether the value can fit within the precision and scale.
In the case of whole-stage codegen, here is the codegen that is generated for it (Spark today, so no changes from this PR): a) codegen agg sum - all defaults (ANSI mode false). With whole-stage codegen, you can see the decimal '+' expressions will at some point produce a value larger than what can be contained in decimal(38,18), but it gets written out as null. This messes up the end result of the sum and you get wrong results. Here is a snippet highlighting the behavior observed for the use case in this issue: https://github.com/skambha/notes/blob/master/UnsafeRowWriterTestSnippet
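To make the two code paths easier to try, here is a hedged spark-shell sketch (the decimal literal and range are illustrative, not the exact repro from the JIRA; the expected behaviors are the ones described in this thread):

```scala
import org.apache.spark.sql.functions.{expr, sum}

// Each row holds 9e19 as decimal(38,18); adding twelve of them exceeds the
// decimal(38,18) range (absolute values must stay below 1e20).
val df = spark.range(0, 12, 1, 1)
  .select(expr("cast('90000000000000000000' as decimal(38,18)) as d"))

// Whole-stage codegen disabled: the overflowed value reaches Decimal.set via
// UnsafeRow, which rejects it ("Decimal precision 39 exceeds max precision 38").
spark.conf.set("spark.sql.codegen.wholeStage", "false")
df.agg(sum("d")).show()

// Whole-stage codegen enabled: the overflow check only runs at the end here,
// so the sum is written out as null (see the single-partition analysis below).
spark.conf.set("spark.sql.codegen.wholeStage", "true")
df.agg(sum("d")).show()
```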
Thanks for checking. I was hoping to get some feedback on the preferred approach on the JIRA, so in the meantime I have these two PRs open. The other PR throws an exception if there is overflow. This PR honors the ANSI flag/null behavior and follows the approach in SPARK-28224. Once we make a decision I can close one out.
OK to test
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Sum.scala
Still a bit confused. If it gets written out as null, then null + decimal always returns null and the final result is null?
So if we look into the aggregate Sum, we have coalesce in updateExpressions and mergeExpressions, so it is not purely a null + decimal expression. For example, in updateExpressions, if the intermediate sum becomes null because of overflow, then in the next iteration of updateExpressions, coalesce turns that sum into 0 and the sum gets updated to 0 + decimal.
Please let me know if this helps.
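A small plain-Scala model of the failure mode described above (illustrative only; the real code builds catalyst expressions, and the bound is the assumed decimal(38,18) limit):

```scala
// Models how the coalesce-based update can silently reset an overflowed (null)
// running sum back to 0 and keep accumulating, producing a wrong final result.
object CoalesceResetModel {
  type Sum = Option[BigDecimal]              // None models a null buffer value

  private val Bound = BigDecimal(10).pow(20) // decimal(38,18) upper bound

  private def checkedAdd(a: BigDecimal, b: BigDecimal): Sum = {
    val s = a + b
    if (s.abs >= Bound) None else Some(s)    // overflow is written out as null
  }

  // updateExpressions analogue: coalesce(coalesce(sum, zero) + input, sum)
  def update(sum: Sum, input: BigDecimal): Sum = {
    val base = sum.getOrElse(BigDecimal(0))  // an overflow-null quietly becomes 0 here
    checkedAdd(base, input).orElse(sum)
  }
}
```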
I tried some queries locally but can't reproduce. Is it only a problem with a join?
find a way to reproduce without join
I think the root cause is: we should create a
cc @viirya @maropu @dongjoon-hyun as well
Thank you for the example and analysis, @cloud-fan.
For integral sum (e.g., int/long), overflow can happen on the partial aggregate side (via
Ideally, under ANSI mode, decimal sum should also fail if it overflows. It's hard to fail at the partial aggregate side, as we don't have a chance to check overflow before shuffling the aggregate buffer row. We can fail at the final aggregate side: if the value is null and
Based on the previous explanation, it looks like the overflow happens during partial aggregation (updateExpressions and mergeExpressions)? At final aggregation, is a check for null useful for the case reported here?
partial aggregate does
Sounds like not just a problem of
Note: the semantics of most aggregate functions is to skip nulls. So sum will skip null inputs, and overflow doesn't return null (also true for decimal), so
In the repro (the one with the join and then the agg), we also see the null being written out via the UnsafeRowWriter, as mentioned in my earlier comment with the codegen details: #27627 (comment). Codegen and plan details from #27627 (comment):
The relevant code lines of interest are in the codegen output (Codegen id1 and Codegen id2 snippets in the linked comment).
@cloud-fan, given the above pieces of code, why would the problem not exist for the partial aggregate (updateExpressions) case?
You can try In my repro,
Yes. I agree. E.g.
Yes. I will look into this. If someone else has suggestions, please do share.
I looked into why spark.range(0,12,1,1) works and the other scenario returns wrong results.

Case 1, which used spark.range(0,12,1,1) - this returns null. There is only one partition and one whole-stage-codegen subtree, so there is no intermediate write via UnsafeRowWriter. Hence all the values are added up, which ends up producing an overflow value, and then the check-overflow code kicks in, sees that it is an overflow value, and turns it into a null. The null then gets written out. This explains why this scenario works fine.

Case 2, which used spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1)), got wrong results.

(Sidenote: If you split the range into 2 and 11 elements, you can see the result is

We already saw that when we write an overflowed decimal value out to UnsafeRow, it writes a null value. So, whether the wrong result reproduces or not is basically
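A hedged sketch of the two cases being contrasted, using the ranges from the comment above and an assumed decimal(38,18) literal large enough to overflow:

```scala
import org.apache.spark.sql.functions.{expr, sum}

// Assumed literal: 9e19 as decimal(38,18), so a dozen rows overflow the 1e20 bound.
val asDec = expr("cast('90000000000000000000' as decimal(38,18)) as d")

// Case 1: one partition, one whole-stage-codegen subtree. The overflow is only
// detected at the very end, so the sum is written out (and returned) as null.
spark.range(0, 12, 1, 1).select(asDec).agg(sum("d")).show()

// Case 2: the union forces an intermediate UnsafeRow write between codegen
// subtrees. The overflowed partial sum is written out as null there, later
// treated as 0 by coalesce, and the query returns a wrong (non-null) result.
spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1)).select(asDec).agg(sum("d")).show()
```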
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Sum.scala
LGTM except a few minor comments. This changes the aggregate buffer of the sum function, which means the streaming state store format is also changed. We need to mention it in the migration guide and suggest users restart the stream if their query contains sum of decimals. cc @marmbrus @zsxwing @HeartSaVioR @xuanyuanking @dongjoon-hyun Usually we need to backport correctness fixes, but this breaks the streaming checkpoint and I don't know if it affects the backport policy.
This PR looks to "selectively" change the aggregation buffer (not entirely). I don't know the likelihood/ratio of actual usage of decimal input for sum, so it's not easy to say, but we let streaming queries which contain a stream-stream outer join "fail" if they start with a previous version of state, whereas we allow stream-stream inner joins to run with a previous version of state. https://spark.apache.org/docs/3.0.0-preview2/ss-migration-guide.html
It was decided to put SPARK-26154 only into Spark 3.0.0 because of backward incompatibility. IMHO this sounds like a similar case - it should land in Spark 3.0.0; I'm not sure we want to make a breaking change in Spark 2.4.x. Btw, that was only possible because we added "versioning" of the state, but we have been applying the versioning to the operator rather than to a "specific" aggregate function, so this hasn't been the case yet. If we feel it is beneficial to construct a way to apply versioning to a specific aggregate function (at least built-in ones) and someone has a straightforward way to deal with this, it would be ideal to do that. Otherwise, I'd rather say we should make it fail, but not with an "arbitrary" error message. SPARK-27237 (#24173) will make it possible to show the schema incompatibility and why it's considered incompatible. I'd also propose to introduce a state data source for batch queries via SPARK-28190 so that end users can easily read and even rewrite the state, which eventually makes it possible for end users to migrate from the old state format to the new one. (It's available as a separate project for Spark 2.4.x, but given there were some relevant requests on the user mailing list - e.g. pre-loading initial state, schema evolution - it's easier to work on it within the Spark project.)
Test build #123389 has finished for PR 27627 at commit
retest this, please
Thanks @cloud-fan for the reminder.
How about we merge it to master only first, and wait for the schema incompatibility check to be done? This is a long-standing issue and we don't need to block 3.0. We can still backport to branch-3.0 with more discussion.
Personally I'm OK with skipping this for Spark 3.0.0, preferably mentioning it as a "known issue". It would be ideal if we could maintain a page about known "correctness"/"data-loss" issues and the fixed version(s) once an issue is resolved in a specific version.
retest this please |
Test build #123398 has finished for PR 27627 at commit
Agree.
@HeartSaVioR Thanks for the reminder. I also looked into #24173 before. My approach is to check the underlying unsafe row format instead of adding a new schema file in the checkpoint. That is driven by the requirement of detecting format changes during migration, where the user has no chance to create a schema file. But I think our approaches can complement each other. Let's discuss in my newly created PR; I'll submit it later today.
Test build #123414 has finished for PR 27627 at commit
retest this please |
Tests already pass: #27627 (comment). I'm merging it to master first, thanks!
Test build #123431 has finished for PR 27627 at commit
Thanks @cloud-fan and all for the merge into master. Thanks for the discussion related to backward compatibility for streaming use cases.
Initially, it sounded like both @xuanyuanking and @HeartSaVioR were OK with it going into v3.0 and not into v2.4. But later we decided not to put it into v3.0. Could you point out the PR/JIRA that we are waiting for before this can go into a v3.x branch? I agree, it seems like if there is a compatibility issue, it would be best to go into a major release rather than a minor one. Just trying to understand: what is different in this use case from what we have in SPARK-26154, which went into v3.0 and was pointed out here: #27627 (comment)
Actually, Spark 3.0.0 is the better place to land this if we are only concerned about backward compatibility, but even for a major version update we don't want to scare end users. SPARK-26154 introduced "versioning" of the state of stream-stream joins, so that Spark 3.0.0 can recognize the "old" state and fail the query with a "proper" error message. There's no such thing for this patch; that's why I asked about "versioning" of state per streaming aggregation "function", but I'm not sure it's the preferred approach, and even if we agree on that, I'm not sure we have enough time to deal with it in Spark 3.0.0. (It's actually delayed quite a bit already.) My personal feeling is that we should bring in the essential functionality (say, schema information of state, #24173) ASAP, so that in the future we can at least guide end users in such cases, like: "if you didn't touch your query but encounter a schema-incompatibility error on state, please check the migration guide for your Spark version to see whether there's a backward-incompatible change". Unfortunately, even if we adopt #24173 into Spark 3.0.0 (and I'm not sure that would happen), it doesn't apply to migration from Spark 2.x to 3.0.0, as existing state from Spark 2.x won't have a schema as of now. We could craft a tool to create schema files for the states of Spark 2.x structured streaming queries so that end users can adopt it before migrating to Spark 3.0.0, but that's only a sketched idea so far and I would have to get some support from the community to move on. #28707 would help determine the issue at least for this case (as the number of fields will not match), so #28707 might unblock this patch from being included in branch-3.0, but the error message would be a bit unfriendly because we won't have detailed information about the schema of the state.
I think so too, since this is a correctness issue. If I summarize my understanding from the discussion with @cloud-fan on #28707, we are waiting on that PR, which is probably going into v3.1 - it will help detect compatibility issues and throw a better error message, which is a good thing. That said, if we wait for that and both of these go into v3.1, say, I wonder if it is more of an unexpected breaking change for streaming use cases when they go from v3.0 to v3.1. @cloud-fan, wdyt? Thanks.
### What changes were proposed in this pull request?
This is a followup of #27627 to fix the remaining issues. There are 2 issues fixed in this PR:
1. `UnsafeRow.setDecimal` can set an overflowed decimal and causes an error when reading it. The expected behavior is to return null.
2. The update/merge expression for decimal type in `Sum` is wrong. We shouldn't turn the `sum` value back to 0 after it becomes null due to overflow. This issue was hidden because:
2.1 for hash aggregate, the buffer is unsafe row. Due to the first bug, we fail when overflow happens, so there is no chance to mistakenly turn null back to 0.
2.2 for sort-based aggregate, the buffer is generic row. The decimal can overflow (the Decimal class has unlimited precision) and we don't have the null problem.

If we only fix the first bug, then the second bug is exposed and test fails. If we only fix the second bug, there is no way to test it. This PR fixes these 2 bugs together.

### Why are the changes needed?
Fix issues during decimal sum when overflow happens

### Does this PR introduce _any_ user-facing change?
Yes. Now decimal sum can return null correctly for overflow under non-ansi mode.

### How was this patch tested?
new test and updated test

Closes #29026 from cloud-fan/decimal.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
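A plain-Scala model of the corrected update semantics described in this follow-up (illustrative only; the real change is to Sum's catalyst expressions and to `UnsafeRow.setDecimal`): once the running sum becomes null due to overflow, it must stay null instead of being coalesced back to 0.

```scala
// Contrast with the coalesce-based model earlier in the thread: null is sticky.
object FixedSumModel {
  type Sum = Option[BigDecimal]              // None models a null (overflowed) sum

  private val Bound = BigDecimal(10).pow(20) // assumed decimal(38,18) upper bound

  private def checkedAdd(a: BigDecimal, b: BigDecimal): Sum = {
    val s = a + b
    if (s.abs >= Bound) None else Some(s)
  }

  // Simplified: start at 0 for illustration (the real buffer also tracks emptiness).
  val initial: Sum = Some(BigDecimal(0))

  // Fixed update: once the sum has overflowed to null, it stays null.
  def update(sum: Sum, input: BigDecimal): Sum =
    sum.flatMap(checkedAdd(_, input))
}
```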
…erflow of sum aggregation

### What changes were proposed in this pull request?
This is a followup of #29125.

In branch 3.0:
1. for hash aggregation, before #29125 there was a runtime exception on decimal overflow of sum aggregation; after #29125, there could be a wrong result.
2. for sort aggregation, with/without #29125, there could be a wrong result on decimal overflow.

In the master branch (the future 3.1 release), the problem doesn't exist, since #27627 added a flag for marking whether overflow happened in the aggregation buffer. However, the aggregation buffer is written into streaming checkpoints, so we can't change the aggregation buffer to resolve the issue.

As there is no easy solution for returning null/throwing an exception based on `spark.sql.ansi.enabled` on overflow in branch 3.0, we have to make a choice here: always throw an exception on decimal value overflow of sum aggregation.

### Why are the changes needed?
Avoid returning a wrong result in decimal value sum aggregation.

### Does this PR introduce _any_ user-facing change?
Yes, there is always an exception on decimal value overflow of sum aggregation, instead of a possible wrong result.

### How was this patch tested?
Unit test case

Closes #29404 from gengliangwang/fixSum.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
* KE-24858 [SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow
* [SPARK-28067][SPARK-32018] Fix decimal overflow issues (apache#29026; same description as the #29026 commit quoted above)
* KE-24858 fix error: java.lang.IllegalArgumentException: Can not interpolate java.lang.Boolean into code block.
* KE-24858 fix ci error
* KE-24858 update pom version

Co-authored-by: Sunitha Kambhampati <skambha@us.ibm.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: longfei.jiang <longfei.jiang@kyligence.io>
What changes were proposed in this pull request?
JIRA SPARK-28067: Wrong results are returned for aggregate sum with decimals with whole stage codegen enabled
Repro:
WholeStage enabled -> Wrong results
WholeStage disabled -> Returns exception Decimal precision 39 exceeds max precision 38
Issues:
Cause:
Sum does not take care of the possibility of overflow in the intermediate steps, i.e. the updateExpressions and mergeExpressions.
This PR makes the following changes:
Before the fix, Scenario 1: WRONG RESULTS
Before the fix, Scenario 2: setting spark.sql.ansi.enabled to true - WRONG RESULTS
After the fix, Scenario 1:
After the fix, Scenario 2: setting spark.sql.ansi.enabled to true:
Why are the changes needed?
The changes are needed in order to fix the wrong results that are returned for decimal aggregate sum.
Does this PR introduce any user-facing change?
Users would see wrong results on aggregate sum that involved decimal overflow prior to this change, but now they will see null. If the user sets the spark.sql.ansi.enabled flag to true, then they will see an exception and not incorrect results.
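For reference, a minimal sketch of toggling the flag mentioned above (the config key is the real spark.sql.ansi.enabled setting; the behaviors noted are the ones this PR introduces):

```scala
// Non-ANSI (default): overflow in sum(decimal) now yields null instead of a wrong result.
spark.conf.set("spark.sql.ansi.enabled", "false")

// ANSI mode: overflow in sum(decimal) raises an exception instead of returning null.
spark.conf.set("spark.sql.ansi.enabled", "true")
```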
How was this patch tested?
A new test has been added, and the existing tests for the sql, catalyst and hive suites pass.