
[SPARK-26448][SQL] retain the difference between 0.0 and -0.0 #23388

Closed
cloud-fan wants to merge 7 commits into master from cloud-fan/minor

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

In #23043, we introduced a behavior change: Spark users are no longer able to distinguish 0.0 and -0.0.

This PR proposes an alternative fix to the original bug, to retain the difference between 0.0 and -0.0 inside Spark.

The idea is that we can rewrite the window partition keys, join keys and grouping keys during the logical planning phase to normalize the special floating-point numbers. Thus only the operators that care about special floating-point numbers pay the performance overhead, and end users can still distinguish -0.0.
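A rough value-level illustration of what "normalize the special floating-point numbers" means (a minimal sketch in plain Scala, not the Catalyst rule added by this PR):

// Map every NaN bit pattern to the canonical NaN and -0.0 to +0.0, so that hashing
// or bit-level comparison of the normalized value matches semantic equality.
def normalizeDouble(d: Double): Double = {
  if (d.isNaN) Double.NaN   // collapse all NaN bit patterns into one
  else if (d == 0.0d) 0.0d  // -0.0 == 0.0 is true, so this maps -0.0 (and 0.0) to +0.0
  else d
}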

How was this patch tested?

existing test

@@ -295,16 +295,4 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext {
df.join(df, df("id") <=> df("id")).queryExecution.optimizedPlan
}
}

test("NaN and -0.0 in join keys") {
Contributor Author

moved to JoinSuite.

@@ -124,24 +124,6 @@ abstract class QueryTest extends PlanTest {
}
}

private def compare(obj1: Any, obj2: Any): Boolean = (obj1, obj2) match {
Contributor Author

moved to the object, so that we can reuse it.

@cloud-fan
Contributor Author

@@ -198,46 +198,11 @@ protected final void writeLong(long offset, long value) {
Platform.putLong(getBuffer(), offset, value);
}

// We need to take care of NaN and -0.0 in several places:
Contributor Author

The comments are moved to the new rule.

@SparkQA

SparkQA commented Dec 26, 2018

Test build #100458 has finished for PR 23388 at commit b97c091.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class NormalizeNaNAndZero(child: Expression) extends UnaryExpression with ExpectsInputTypes

@dongjoon-hyun
Member

The error looks legitimate. Side effects on the decimal logic?

@SparkQA

SparkQA commented Dec 26, 2018

Test build #100459 has finished for PR 23388 at commit 67c694f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class NormalizeNaNAndZero(child: Expression) extends UnaryExpression with ExpectsInputTypes

@SparkQA

SparkQA commented Dec 26, 2018

Test build #100460 has finished for PR 23388 at commit 0cd1bcb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class NormalizeNaNAndZero(child: Expression) extends UnaryExpression with ExpectsInputTypes

intercept[TestFailedException] {
checkDataset(Seq(-0.0d).toDS(), -0.0d)
checkDataset(Seq(-0.0d).toDS(), 0.0d)
Contributor Author

This is to prove that the test framework can distinguish -0.0 and 0.0.
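For reference, 0.0 and -0.0 compare equal but carry different bit patterns, which is what a bit-exact comparison in the test framework can pick up (plain Scala, independent of Spark):

// 0.0 and -0.0 are == to each other but differ in their raw bits.
0.0d == -0.0d                                 // true
java.lang.Double.doubleToRawLongBits(0.0d)    // 0L
java.lang.Double.doubleToRawLongBits(-0.0d)   // Long.MinValue (only the sign bit is set)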

@@ -594,7 +594,7 @@ abstract class AggregationQuerySuite extends QueryTest with SQLTestUtils with Te
| max(distinct value1)
|FROM agg2
""".stripMargin),
Row(-60, 70.0, 101.0/9.0, 5.6, 100))
Row(-60, 70, 101.0/9.0, 5.6, 100))
Contributor Author

The sum of an int column is a long, so we shouldn't use a double here.
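A quick way to see this (a sketch assuming an existing SparkSession named `spark`):

// sum over an integer column has LongType, so the expected Row should hold a Long.
import spark.implicits._
val sumField = Seq(1, 2, 3).toDF("i").selectExpr("sum(i)").schema.head
// sumField.dataType == LongType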

Member

uh... checkAnswer was unable to detect this

sql("SELECT max(key) FROM src").collect().toSeq)
checkAnswer(
sql("SELECT percentile(key, 1) FROM src LIMIT 1"),
sql("SELECT double(max(key)) FROM src"))
Contributor Author

percentile always returns a double, so we need to cast max to double to compare the results.
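In other words (a sketch reusing the test's Hive `src` table and an existing SparkSession `spark`):

// percentile(...) produces DoubleType while max(key) on the int column does not,
// so the expected side is wrapped in double(...) to make the result types comparable.
val actual   = spark.sql("SELECT percentile(key, 1) FROM src LIMIT 1")
val expected = spark.sql("SELECT double(max(key)) FROM src")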

@@ -1757,7 +1757,7 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
sql("insert into tbl values ('3', '2.3')")
checkAnswer(
sql("select (cast (99 as decimal(19,6)) + cast('3' as decimal)) * cast('2.3' as decimal)"),
Row(204.0)
Row(BigDecimal(204.0))
Contributor Author

The result is of decimal type, not double.

Member

It's good to fix this.
Do you think this PR also affects Spark's behavior for existing apps?

Contributor Author

It does not.

This test case was wrong in the first place; my change to checkAnswer exposed it.

@@ -25,8 +25,6 @@ displayTitle: Spark SQL Upgrading Guide

- In Spark version 2.4 and earlier, `Dataset.groupByKey` results to a grouped dataset with key attribute wrongly named as "value", if the key is non-struct type, e.g. int, string, array, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute to "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a default value of `false`.

- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can't distinguish them any more.
Member

The motivation of this fix is to avoid this behavior change?

Contributor Author

yea

Member

But for join keys and GROUP BY groups, the previous difference between 0.0 and -0.0 is treated as a bug, so we don't need to mention it in the migration guide?

Contributor Author

Check out the test case: "distinguish -0.0" is not about agg or join.

Member

Aren't 0.0 and -0.0 treated as distinct groups for agg before the recent fix?

Contributor Author

Yes, and it's a bug. But if -0.0 is not used in grouping keys (and other similar places), users should still be able to distinguish it.

Contributor Author

Ah, I see what you mean. Are you saying we should add a migration guide entry for the behavior change of grouping keys/window partition keys?

Member

Yes, sorry for the confusion. I'm not sure whether a migration guide entry is needed, since it is a bug.

@SparkQA

SparkQA commented Dec 27, 2018

Test build #100465 has finished for PR 23388 at commit 12886ea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 27, 2018

Test build #100474 has finished for PR 23388 at commit 12efbf5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -25,7 +25,7 @@ displayTitle: Spark SQL Upgrading Guide

- In Spark version 2.4 and earlier, `Dataset.groupByKey` results to a grouped dataset with key attribute wrongly named as "value", if the key is non-struct type, e.g. int, string, array, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute to "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a default value of `false`.

- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can't distinguish them any more.
- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
Contributor Author

I keep this migration guide entry because this bug is not very intuitive: literally, -0.0 is not 0.0.

Member

Is it better to explicitly state that outputs still distinguish 0.0 and -0.0? For example, `Seq(-0.0d).toDS().show()` returns -0.0 in any version.

Contributor Author

I think we only need to mention the difference between new and old versions.

@SparkQA

SparkQA commented Dec 27, 2018

Test build #100477 has finished for PR 23388 at commit 8b92191.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 28, 2018

Test build #100485 has finished for PR 23388 at commit 7810f7b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sql("SELECT percentile(key, 1) FROM src LIMIT 1"),
sql("SELECT double(max(key)) FROM src"))
checkAnswer(sql("SELECT percentile(key, 1) FROM src LIMIT 1"),
sql("SELECT max(key) FROM src").collect().toSeq)
Member

It seems to be a big behavior change in Spark testing.
So this PR is going to force us to use an explicit collect().toSeq for checkAnswer in some cases?

@@ -1757,7 +1757,7 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
sql("insert into tbl values ('3', '2.3')")
checkAnswer(
sql("select (cast (99 as decimal(19,6)) + cast('3' as decimal)) * cast('2.3' as decimal)"),
Row(BigDecimal(204.0))
Row(204.0)
Member

@dongjoon-hyun commented Dec 28, 2018

The result type of the above SQL statement is decimal(31,6). Can we use decimal type here?

}
}

case class NormalizeNaNAndZero(child: Expression) extends UnaryExpression with ExpectsInputTypes {
Member

If NormalizeFloatingNumbers is an optimizer rule, NormalizeNaNAndZero only goes through the optimizer, so does it need to extend ExpectsInputTypes?

Contributor Author

It doesn't, but I do it for safety. IIUC the test framework will throw an exception if a plan becomes unresolved after a rule.

@cloud-fan
Contributor Author

@dongjoon-hyun I've implemented a safer way to let the test framework distinguish -0.0, so now we don't need to change a lot of existing test cases.

@dongjoon-hyun
Member

That's a big relief. Thank you, @cloud-fan !


case _ => plan transform {
case w: Window if w.partitionSpec.exists(p => needNormalize(p.dataType)) =>
w.copy(partitionSpec = w.partitionSpec.map(normalize))
Contributor

Nit. All the window expressions in the project list also refer to the partitionSpec. Should we also normalize these?

Contributor Author

Assume the query is `select a, a + sum(a) over (partition by a) ...`.

Since the project list is evaluated for each input row, I think the `a` in the project list should retain the -0.0 difference. Thus I think only the partitionSpec needs to be normalized.
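A hedged illustration of that point (assumes a SparkSession `spark`; the expected output is my reading of the intended behavior, not output copied from the PR):

// Only the partition key is normalized, so both rows land in the same window
// partition, while the projected column still shows -0.0 for the -0.0 input row.
import spark.implicits._
Seq(-0.0d, 0.0d).toDF("a").createOrReplaceTempView("t")
spark.sql("SELECT a, sum(a) OVER (PARTITION BY a) AS s FROM t").show()
// expected: rows (-0.0, 0.0) and (0.0, 0.0)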


Then make this clear by writing it up in a comment please? If the answer to this question is not obvious to the reviewer then it may also not be obvious to a later reader of the code, so in general it is advisable to answer misguided reviewer questions by adding comments. :)

* This rule normalizes NaN and -0.0 in Window partition keys, Join keys and Aggregate grouping
* expressions.
*/
object NormalizeFloatingNumbers extends Rule[LogicalPlan] {
Contributor

Technically an optimizer rule should not change the result of a query. This rule does exactly that. Perhaps we should add a little bit of documentation for this.

Contributor Author

The major reason is that we create Joins during optimization (for subqueries), and I'm also worried that join reordering may break it. I'll add a comment for it.

Member

Also add it to nonExcludableRules?

Contributor Author

ah good catch!

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100892 has finished for PR 23388 at commit c228ad9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* treated as same.
* 2. In GROUP BY, different NaNs should belong to the same group, -0.0 and 0.0 should belong
* to the same group.
* 3. In join keys, different NaNs should be treated as same, `-0.0` and `0.0` should be


NaNs are never equal to anything, including other NaNs, so there is no reason to normalize them for join keys. It is fine to do it anyway for simplicity, but it should be made clear in the comments that this is not because we have to, but just because it is easier.
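For reference, this is plain IEEE 754 / JVM behavior:

// NaN is not equal to itself under ==, so a cross product followed by a filter
// on the join condition would never match NaN keys anyway.
Double.NaN == Double.NaN                           // false
java.lang.Double.compare(Double.NaN, Double.NaN)   // 0: the total ordering treats NaNs as equal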

Contributor Author

That's a good point. In Spark SQL, the EQUAL operator treats 0.0 and -0.0 as the same, so we have to follow that in join keys. I'm not sure how the SQL standard defines it, but changing the equality semantics of Spark SQL is another topic.

But you are right that we don't have to do it for all joins; we only need to do the normalization for the types of join that do binary comparison.

* expressions.
*
* Note that, this rule should be an analyzer rule, as it must be applied to make the query result
* corrected. Currently it's executed as an optimizer rule, because the optimizer may create new


"correct"

* This rule normalizes NaN and -0.0 in Window partition keys, Join keys and Aggregate grouping
* expressions.
*
* Note that, this rule should be an analyzer rule, as it must be applied to make the query result


This reads as if the code is wrong. But it is not. The fact that we have to do this normalization for joins at least is not something that needs to be an analyzer rule for query correctness. Without the normalization the join query is perfectly fine if we execute it as a cross product with a filter applied as a post-join condition. In this case the requirement for normalization is an artifact of the fact that we use a shortcut for executing the join (binary comparison, sometimes hashing) which doesn't have the correct semantics for comparison. On the other hand for aggregation and window function partitioning the normalization is required for correctness.
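To restate the point as a sketch (hypothetical tables `t1`/`t2` with a double column `d`, assuming a SparkSession `spark`):

// Semantically, the equi-join is equivalent to a cross join plus a filter using Spark's
// EQUAL semantics (under which -0.0 = 0.0 is true). Normalization is only needed because
// the fast join paths compare or hash the raw bits instead of using that comparison.
val viaJoin   = spark.sql("SELECT * FROM t1 JOIN t2 ON t1.d = t2.d")
val viaFilter = spark.sql("SELECT * FROM t1 CROSS JOIN t2 WHERE t1.d = t2.d")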



w.copy(partitionSpec = w.partitionSpec.map(normalize))

case j @ ExtractEquiJoinKeys(_, leftKeys, rightKeys, condition, _, _)
if leftKeys.exists(k => needNormalize(k.dataType)) =>


Should this also check the right keys? Or is that implied by the fact that the keys of both sides have the same type? If so, please leave a comment to make it clear why the right keys are not checked.

Contributor Author

The analyzer makes sure the left and right join keys have the same data type. I'll add a comment to explain it, thanks!

case FloatType | DoubleType => true
case StructType(fields) => fields.exists(f => needNormalize(f.dataType))
case ArrayType(et, _) => needNormalize(et)
// We don't need to handle MapType here, as it's not comparable.


This code is not future-proof against map types becoming comparable in the future. It could be made future-proof by throwing an exception if a map type is encountered here. If I understand correctly, this code should never encounter a map type unless map types are comparable.
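A sketch of that future-proofing idea (a standalone rewrite of the type check shown in the diff, not the PR's exact code):

import org.apache.spark.sql.types._

// Fail loudly if a map type ever reaches this check (e.g. if map types become
// comparable in a future version), instead of silently skipping normalization.
def needNormalize(dt: DataType): Boolean = dt match {
  case FloatType | DoubleType => true
  case StructType(fields) => fields.exists(f => needNormalize(f.dataType))
  case ArrayType(et, _) => needNormalize(et)
  case _: MapType =>
    throw new IllegalStateException("grouping/join on map type is not expected here")
  case _ => false
}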

Contributor Author

good point! I'll update soon.

assert(floats(0).getLong(1) == 3)
test("SPARK-26021: NaN and -0.0 in grouping expressions") {
checkAnswer(
Seq(0.0f, -0.0f, 0.0f/0.0f, Float.NaN).toDF("f").groupBy("f").count(),


This test relies on the specific property that dividing zero by zero returns a different kind of NaN than Float.NaN. That is subtle and needs to be documented with a comment. You could also test with floatToRawIntBits that the values actually have different bits, because if they do not, then you are not actually testing what the test purports to test.
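A sketch of the suggested guard (plain JVM calls, placed alongside the test inputs):

// Verify that the two NaN inputs really have different raw bit patterns; if they do
// not, the test could pass without exercising NaN normalization at all.
val bits1 = java.lang.Float.floatToRawIntBits(Float.NaN)
val bits2 = java.lang.Float.floatToRawIntBits(0.0f / 0.0f)
assert(bits1 != bits2, "test inputs do not cover two distinct NaN bit patterns")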

Contributor Author

fixed.


// test with complicated type grouping expressions
checkAnswer(
Seq(0.0f, -0.0f, 0.0f/0.0f, Float.NaN).toDF("f").groupBy(array("f"), struct("f")).count(),


And maybe also arrays of structs and structs of arrays?

checkAnswer(
sql(
"""
|SELECT v1.f, v1.d


Also select the v2 columns for clarity on why the result is what it is?

Seq(Array(-0.0f, 0.0f) -> Tuple2(-0.0d, Double.NaN)).toDF("arr", "stru").createTempView("v3")
Seq(Array(0.0f, -0.0f) -> Tuple2(0.0d, Double.NaN)).toDF("arr", "stru").createTempView("v4")
checkAnswer(
sql("SELECT v3.arr, v3.stru FROM v3 JOIN v4 ON v3.arr = v4.arr AND v3.stru = v4.stru"),


Why the style difference compared to the previous test cases?

* 2. In GROUP BY, different NaNs should belong to the same group, -0.0 and 0.0 should belong
* to the same group.
* 2. In aggregate grouping keys, different NaNs should belong to the same group, -0.0 and 0.0
* should belong to the same group.
* 3. In join keys, different NaNs should be treated as same, `-0.0` and `0.0` should be


Still remove "different NaNs should be treated as same" here?

@bart-samwel

bart-samwel commented Jan 8, 2019 via email

@cloud-fan
Contributor Author

Hi @bart-samwel, in Spark SQL different NaN values are treated as the same, and -0.0 and 0.0 are treated as the same, so the normalization proposed by this PR keeps the current behavior and semantics unchanged.

If later on we find that NaN values should not be treated as the same, we basically need to change 2 places:

  1. The EQUAL operator should drop the special handling of NaNs and always return false for two NaNs.
  2. The normalization here should not apply to NaNs.

Furthermore, even the same NaN value would not equal itself, so binary comparison wouldn't work; as you proposed, we should normalize NaN to null at that point.

Anyway, I think we should think about NaN later, as it will be a behavior change. What do you think?

@bart-samwel

bart-samwel commented Jan 8, 2019 via email

@SparkQA

SparkQA commented Jan 8, 2019

Test build #100927 has finished for PR 23388 at commit f420820.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 8, 2019

Test build #100928 has finished for PR 23388 at commit 3e8c171.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@cloud-fan shall we open a JIRA and revisit NaN handling before Spark 3.0?

@gatorsmile
Member

gatorsmile commented Jan 9, 2019

Will make one pass tonight. Thanks!

@cloud-fan
Contributor Author

@gatorsmile I have created https://issues.apache.org/jira/browse/SPARK-26575 to track the followup.

@gatorsmile
Member

LGTM

Thanks! Merged to master.

@asfgit asfgit closed this in e853afb Jan 9, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

In apache#23043, we introduced a behavior change: Spark users are no longer able to distinguish 0.0 and -0.0.

This PR proposes an alternative fix to the original bug, to retain the difference between 0.0 and -0.0 inside Spark.

The idea is that we can rewrite the window partition keys, join keys and grouping keys during the logical planning phase to normalize the special floating-point numbers. Thus only the operators that care about special floating-point numbers pay the performance overhead, and end users can still distinguish -0.0.

## How was this patch tested?

existing test

Closes apache#23388 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…s for final aggregate

## What changes were proposed in this pull request?

A followup of apache#23388.

`AggUtils.createAggregate` is not the right place to normalize the grouping expressions, as the final aggregate is also created by it. The grouping expressions of the final aggregate should be attributes that refer to the grouping expressions in the partial aggregate.

This PR moves the normalization to the caller side of `AggUtils`.

## How was this patch tested?

existing tests

Closes apache#23692 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Jun 11, 2020
### What changes were proposed in this pull request?

This is a followup of #23388 .

#23388 has an issue: it doesn't handle subquery expressions and assumes they will be turned into joins. However, this is not true for non-correlated subquery expressions.

This PR fixes this issue. It now doesn't skip `Subquery`, and subquery expressions will be handled by `OptimizeSubqueries`, which runs the optimizer with the subquery.

Note that correlated subquery expressions will be handled twice: once in `OptimizeSubqueries`, and once later when they become joins. This is OK as `NormalizeFloatingNumbers` is idempotent now.

### Why are the changes needed?

fix a bug

### Does this PR introduce _any_ user-facing change?

yes, see the newly added test.

### How was this patch tested?

new test

Closes #28785 from cloud-fan/normalize.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Jun 11, 2020
holdenk pushed a commit to holdenk/spark that referenced this pull request Jun 25, 2020