Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48719][SQL] Fix the calculation bug of RegrSlope & RegrIntercept when the first parameter is null #47105

Closed
wants to merge 5 commits into from

Conversation

wayneguow
Copy link
Contributor

@wayneguow wayneguow commented Jun 26, 2024

What changes were proposed in this pull request?

This PR aims to fix the calculation bug of RegrSlope&RegrIntercept when the first parameter is null. Regardless of whether the first parameter(y) or the second parameter(x) is null, this tuple should be filtered out.

Why are the changes needed?

Fix bug.

Does this PR introduce any user-facing change?

Yes, the calculation changes when the first value of a tuple is null, but the value is truly correct.

How was this patch tested?

Pass GA and test with build/sbt "~sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z linear-regression.sql"

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 26, 2024
@@ -1,7 +1,7 @@
-- Automatically generated by SQLQueryTestSuite
-- !query
CREATE OR REPLACE TEMPORARY VIEW testRegression AS SELECT * FROM VALUES
(1, 10, null), (2, 10, 11), (2, 20, 22), (2, 25, null), (2, 30, 35)
(1, 10, null), (2, 10, 11), (2, 20, 22), (2, 25, null), (2, 30, 35), (2, null, 40)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A tuple is added with the value of y is null, it should be filtered out during calculation, so the output related to RegrSlope & RegrIntercept in the output remains unchanged.

@wayneguow wayneguow changed the title [SPARK-48719][SQL] Fix the calculation bug of RegrSlope&RegrIntercept` when the first parameter is null [SPARK-48719][SQL] Fix the calculation bug of RegrSlope & RegrIntercept when the first parameter is null Jun 26, 2024
@wayneguow wayneguow marked this pull request as ready for review June 27, 2024 02:42
@wayneguow
Copy link
Contributor Author

cc @beliefer

@wayneguow
Copy link
Contributor Author

Gentle ping @HyukjinKwon, when you have time.

@LuciferYang
Copy link
Contributor

seems ok, cc @beliefer and @cloud-fan

@@ -311,7 +312,8 @@ case class RegrIntercept(left: Expression, right: Expression) extends Declarativ

private val covarPop = new CovPopulation(right, left)

private val varPop = new VariancePop(right)
private val varPop = new VariancePop(If(And(IsNotNull(left), IsNotNull(right)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of adding null check to the underlying VariancePop, shall we add it to RegrIntercept? e.g. we can add null check to updateExpressions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried changing it locally and related test cases passed. It seems that the logic is more complicated. Is it worth updating like this? WDYT @cloud-fan

  override lazy val updateExpressions: Seq[Expression] = {
    val isNull = left.isNull || right.isNull
    val updateResult = covarPop.updateExpressions ++ varPop.updateExpressions
    aggBufferAttributes.zip(updateResult).map { case (oldValue, newValue) =>
      If(isNull, oldValue, newValue)
    }
  }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more efficient, as we do null check earlier, before we execute VariancePop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the value is calculated regardless of whether it is null with code val updateResult = covarPop.updateExpressions ++ varPop.updateExpressions.

It's similar to the updateExpressionsDef func in Covariance. The value is calculated first, and finally different values ​​are returned depending on whether it is null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't found a better way to implement this logic. If can use if else directly, it may not be calculated in advance, but it cannot be expressed directly here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated it, would you like to take another look? Thank you. @cloud-fan

val isNull = left.isNull || right.isNull
val updateResult = covarPop.updateExpressions ++ varPop.updateExpressions
aggBufferAttributes.zip(updateResult).map { case (oldValue, newValue) =>
If(isNull, oldValue, newValue)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we only need to add this for varPop.updateExpressions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code should be something like

covarPop.updateExpressions ++ varPop.updateExpressions.zip(varPop.aggBufferAttributes)...

Copy link
Contributor

@cloud-fan cloud-fan Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW to add a bit more color to my proposal: the update expressions of VariancePop is

    trimHigherOrder(Seq(
      If(child.isNull, n, newN),
      If(child.isNull, avg, newAvg),
      If(child.isNull, m2, newM2),
      If(child.isNull, m3, newM3),
      If(child.isNull, m4, newM4)
    ))

The child appears not only in the if conditions, but also newAvg, newM2, etc. It's better to not add the extra null check to the child, as it will be duplicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good insight.

override lazy val updateExpressions: Seq[Expression] = {
val isNull = left.isNull || right.isNull
covarPop.updateExpressions ++ varPop.updateExpressions.zip(varPop.aggBufferAttributes).map {
case (newValue, oldValue) => If(isNull, oldValue, newValue)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add some code comments to explain the null check

override lazy val updateExpressions: Seq[Expression] = {
val isNull = left.isNull || right.isNull
covarPop.updateExpressions ++ varPop.updateExpressions.zip(varPop.aggBufferAttributes).map {
case (newValue, oldValue) => If(isNull, oldValue, newValue)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@cloud-fan
Copy link
Contributor

cloud-fan commented Jul 5, 2024

thanks, merging to master!

@cloud-fan cloud-fan closed this in f1eca90 Jul 5, 2024
@cloud-fan
Copy link
Contributor

can you open a backport PR for 3.5?

@wayneguow
Copy link
Contributor Author

can you open a backport PR for 3.5?

Ok, let me do it. Thank you for review. @cloud-fan

ericm-db pushed a commit to ericm-db/spark that referenced this pull request Jul 10, 2024
…rcept` when the first parameter is null

### What changes were proposed in this pull request?

This PR aims to fix the calculation bug of `RegrSlope`&`RegrIntercept` when the first parameter is null. Regardless of whether the first parameter(y) or the second parameter(x) is null, this tuple should be filtered out.

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

Yes, the calculation changes when the first value of a tuple is null, but the value is truly correct.

### How was this patch tested?

Pass GA and test with `build/sbt "~sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z linear-regression.sql"`

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47105 from wayneguow/SPARK-48719.

Authored-by: Wei Guo <guow93@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…rcept` when the first parameter is null

### What changes were proposed in this pull request?

This PR aims to fix the calculation bug of `RegrSlope`&`RegrIntercept` when the first parameter is null. Regardless of whether the first parameter(y) or the second parameter(x) is null, this tuple should be filtered out.

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

Yes, the calculation changes when the first value of a tuple is null, but the value is truly correct.

### How was this patch tested?

Pass GA and test with `build/sbt "~sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z linear-regression.sql"`

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47105 from wayneguow/SPARK-48719.

Authored-by: Wei Guo <guow93@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants