-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48719][SQL] Fix the calculation bug of RegrSlope
& RegrIntercept
when the first parameter is null
#47105
Conversation
@@ -1,7 +1,7 @@ | |||
-- Automatically generated by SQLQueryTestSuite | |||
-- !query | |||
CREATE OR REPLACE TEMPORARY VIEW testRegression AS SELECT * FROM VALUES | |||
(1, 10, null), (2, 10, 11), (2, 20, 22), (2, 25, null), (2, 30, 35) | |||
(1, 10, null), (2, 10, 11), (2, 20, 22), (2, 25, null), (2, 30, 35), (2, null, 40) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A tuple is added with the value of y
is null, it should be filtered out during calculation, so the output related to RegrSlope
& RegrIntercept
in the output remains unchanged.
RegrSlope
&RegrIntercept` when the first parameter is nullRegrSlope
& RegrIntercept
when the first parameter is null
cc @beliefer |
Gentle ping @HyukjinKwon, when you have time. |
seems ok, cc @beliefer and @cloud-fan |
@@ -311,7 +312,8 @@ case class RegrIntercept(left: Expression, right: Expression) extends Declarativ | |||
|
|||
private val covarPop = new CovPopulation(right, left) | |||
|
|||
private val varPop = new VariancePop(right) | |||
private val varPop = new VariancePop(If(And(IsNotNull(left), IsNotNull(right)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of adding null check to the underlying VariancePop
, shall we add it to RegrIntercept
? e.g. we can add null check to updateExpressions
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried changing it locally and related test cases passed. It seems that the logic is more complicated. Is it worth updating like this? WDYT @cloud-fan
override lazy val updateExpressions: Seq[Expression] = {
val isNull = left.isNull || right.isNull
val updateResult = covarPop.updateExpressions ++ varPop.updateExpressions
aggBufferAttributes.zip(updateResult).map { case (oldValue, newValue) =>
If(isNull, oldValue, newValue)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's more efficient, as we do null check earlier, before we execute VariancePop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the value is calculated regardless of whether it is null with code val updateResult = covarPop.updateExpressions ++ varPop.updateExpressions
.
It's similar to the updateExpressionsDef
func in Covariance
. The value is calculated first, and finally different values are returned depending on whether it is null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't found a better way to implement this logic. If can use if else
directly, it may not be calculated in advance, but it cannot be expressed directly here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated it, would you like to take another look? Thank you. @cloud-fan
val isNull = left.isNull || right.isNull | ||
val updateResult = covarPop.updateExpressions ++ varPop.updateExpressions | ||
aggBufferAttributes.zip(updateResult).map { case (oldValue, newValue) => | ||
If(isNull, oldValue, newValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we only need to add this for varPop.updateExpressions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code should be something like
covarPop.updateExpressions ++ varPop.updateExpressions.zip(varPop.aggBufferAttributes)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW to add a bit more color to my proposal: the update expressions of VariancePop
is
trimHigherOrder(Seq(
If(child.isNull, n, newN),
If(child.isNull, avg, newAvg),
If(child.isNull, m2, newM2),
If(child.isNull, m3, newM3),
If(child.isNull, m4, newM4)
))
The child
appears not only in the if conditions, but also newAvg
, newM2
, etc. It's better to not add the extra null check to the child
, as it will be duplicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good insight.
override lazy val updateExpressions: Seq[Expression] = { | ||
val isNull = left.isNull || right.isNull | ||
covarPop.updateExpressions ++ varPop.updateExpressions.zip(varPop.aggBufferAttributes).map { | ||
case (newValue, oldValue) => If(isNull, oldValue, newValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add some code comments to explain the null check
override lazy val updateExpressions: Seq[Expression] = { | ||
val isNull = left.isNull || right.isNull | ||
covarPop.updateExpressions ++ varPop.updateExpressions.zip(varPop.aggBufferAttributes).map { | ||
case (newValue, oldValue) => If(isNull, oldValue, newValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
thanks, merging to master! |
can you open a backport PR for 3.5? |
Ok, let me do it. Thank you for review. @cloud-fan |
…rcept` when the first parameter is null ### What changes were proposed in this pull request? This PR aims to fix the calculation bug of `RegrSlope`&`RegrIntercept` when the first parameter is null. Regardless of whether the first parameter(y) or the second parameter(x) is null, this tuple should be filtered out. ### Why are the changes needed? Fix bug. ### Does this PR introduce _any_ user-facing change? Yes, the calculation changes when the first value of a tuple is null, but the value is truly correct. ### How was this patch tested? Pass GA and test with `build/sbt "~sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z linear-regression.sql"` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#47105 from wayneguow/SPARK-48719. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…rcept` when the first parameter is null ### What changes were proposed in this pull request? This PR aims to fix the calculation bug of `RegrSlope`&`RegrIntercept` when the first parameter is null. Regardless of whether the first parameter(y) or the second parameter(x) is null, this tuple should be filtered out. ### Why are the changes needed? Fix bug. ### Does this PR introduce _any_ user-facing change? Yes, the calculation changes when the first value of a tuple is null, but the value is truly correct. ### How was this patch tested? Pass GA and test with `build/sbt "~sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z linear-regression.sql"` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#47105 from wayneguow/SPARK-48719. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This PR aims to fix the calculation bug of
RegrSlope
&RegrIntercept
when the first parameter is null. Regardless of whether the first parameter(y) or the second parameter(x) is null, this tuple should be filtered out.Why are the changes needed?
Fix bug.
Does this PR introduce any user-facing change?
Yes, the calculation changes when the first value of a tuple is null, but the value is truly correct.
How was this patch tested?
Pass GA and test with
build/sbt "~sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z linear-regression.sql"
Was this patch authored or co-authored using generative AI tooling?
No.