[SPARK-26448][SQL] retain the difference between 0.0 and -0.0 #23388
Conversation
@@ -295,16 +295,4 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext {
    df.join(df, df("id") <=> df("id")).queryExecution.optimizedPlan
  }
}

test("NaN and -0.0 in join keys") {
moved to JoinSuite
@@ -124,24 +124,6 @@ abstract class QueryTest extends PlanTest {
  }
}

private def compare(obj1: Any, obj2: Any): Boolean = (obj1, obj2) match {
moved to the object, so that we can reuse it.
@@ -198,46 +198,11 @@ protected final void writeLong(long offset, long value) {
  Platform.putLong(getBuffer(), offset, value);
}

// We need to take care of NaN and -0.0 in several places:
The comments are moved to the new rule.
Test build #100458 has finished for PR 23388 at commit
The error looks legitimate. A side effect on the decimal logic?
Force-pushed 67c694f to 0cd1bcb
Test build #100459 has finished for PR 23388 at commit
Test build #100460 has finished for PR 23388 at commit
intercept[TestFailedException] {
  checkDataset(Seq(-0.0d).toDS(), -0.0d)
  checkDataset(Seq(-0.0d).toDS(), 0.0d)
This is to prove that the test framework can distinguish -0.0 and 0.0.
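To illustrate the point with plain JVM code (this is not Spark's test framework, just a sketch of the underlying mechanism): `==` follows IEEE 754 and equates the two zeros, while bit-level comparisons such as `Double.compare` and `Double.equals` tell them apart, which is what lets a test framework distinguish them.

```java
public class NegativeZeroDemo {
    public static void main(String[] args) {
        double pos = 0.0d;
        double neg = -0.0d;

        // IEEE 754 equality treats the two zeros as equal.
        System.out.println(pos == neg);                            // true

        // Their bit patterns differ: -0.0 has the sign bit set.
        System.out.println(Double.doubleToRawLongBits(pos));       // 0
        System.out.println(Double.doubleToRawLongBits(neg) != 0L); // true

        // Bit-based comparisons can therefore distinguish them.
        System.out.println(Double.compare(pos, neg));              // 1 (0.0 > -0.0)
        System.out.println(Double.valueOf(pos).equals(neg));       // false
    }
}
```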
@@ -594,7 +594,7 @@ abstract class AggregationQuerySuite extends QueryTest with SQLTestUtils with Te
        | max(distinct value1)
        |FROM agg2
      """.stripMargin),
    Row(-60, 70.0, 101.0/9.0, 5.6, 100))
    Row(-60, 70, 101.0/9.0, 5.6, 100))
sum of int is long, we shouldn't use double here.
uh... checkAnswer was unable to detect this
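As a JVM-level illustration of how such a mismatch can slip through (this is a sketch, not Spark's `checkAnswer` code): numeric promotion makes a long and a double compare equal, while comparing the boxed values does not.

```java
public class LooseComparisonDemo {
    public static void main(String[] args) {
        // Under numeric promotion, 70L is widened to 70.0 before comparing,
        // so a comparison that coerces types cannot see the type difference.
        System.out.println(70L == 70.0d);                                   // true

        // Boxed values carry their type, so equals() sees the difference.
        System.out.println(Long.valueOf(70L).equals(Double.valueOf(70.0))); // false
    }
}
```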
  sql("SELECT max(key) FROM src").collect().toSeq)
checkAnswer(
  sql("SELECT percentile(key, 1) FROM src LIMIT 1"),
  sql("SELECT double(max(key)) FROM src"))
`percentile` always returns double; we need to cast `max` to double so that we can compare the results.
@@ -1757,7 +1757,7 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
    sql("insert into tbl values ('3', '2.3')")
    checkAnswer(
      sql("select (cast (99 as decimal(19,6)) + cast('3' as decimal)) * cast('2.3' as decimal)"),
      Row(204.0)
      Row(BigDecimal(204.0))
the result is decimal type, not double.
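A small JVM illustration of why the expected value must be a decimal rather than a double (the scale of 6 below mirrors the `decimal(19,6)` in the query and is illustrative, not taken from Spark's code):

```java
import java.math.BigDecimal;

public class DecimalResultDemo {
    public static void main(String[] args) {
        // A decimal result such as 204.000000 carries an explicit scale.
        BigDecimal dec = new BigDecimal("204.000000"); // scale 6

        // BigDecimal.equals compares value AND scale, so 204.0 (scale 1)
        // does not match even though the numeric values agree.
        System.out.println(dec.equals(BigDecimal.valueOf(204.0)));         // false

        // compareTo ignores scale and sees the same numeric value.
        System.out.println(dec.compareTo(BigDecimal.valueOf(204.0)) == 0); // true
    }
}
```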
It's good to fix this.
Do you think this PR also affects Spark's behavior for existing apps?
It does not. This test case was wrong in the first place; my change to `checkAnswer` exposed it.
@@ -25,8 +25,6 @@ displayTitle: Spark SQL Upgrading Guide

- In Spark version 2.4 and earlier, `Dataset.groupByKey` results to a grouped dataset with key attribute wrongly named as "value", if the key is non-struct type, e.g. int, string, array, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute to "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a default value of `false`.

- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can't distinguish them any more.
The motivation of this fix is to avoid this behavior change?
yea
But for join keys and GROUP BY groups, the previous difference between 0.0 and -0.0 is treated as a bug, so we don't need to mention it in migration guide?
Check out the test case; "distinguish -0.0" is not about agg or join.
Aren't 0.0 and -0.0 treated as distinct groups for agg before the recent fix?
Yes, and it's a bug. But if -0.0 is not used in grouping keys (and other similar places), users should still be able to distinguish it.
Ah, I see what you mean. Are you saying we should add a migration guide for the behavior changes of grouping keys/window partition keys?
Yes, sorry for the confusion. I'm not sure whether a migration guide is needed, since it is a bug.
Test build #100465 has finished for PR 23388 at commit
Test build #100474 has finished for PR 23388 at commit
@@ -25,7 +25,7 @@ displayTitle: Spark SQL Upgrading Guide

- In Spark version 2.4 and earlier, `Dataset.groupByKey` results to a grouped dataset with key attribute wrongly named as "value", if the key is non-struct type, e.g. int, string, array, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute to "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a default value of `false`.

- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can't distinguish them any more.
- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
I keep this migration guide because this bug is not very intuitive: literally -0.0 is not 0.0.
Is it better to explicitly state that outputs still distinguish 0.0 and -0.0? For example, `Seq(-0.0d).toDS().show()` returns `-0.0` in any version.
I think we only need to mention the difference between new and old versions.
Test build #100477 has finished for PR 23388 at commit
Test build #100485 has finished for PR 23388 at commit
  sql("SELECT percentile(key, 1) FROM src LIMIT 1"),
  sql("SELECT double(max(key)) FROM src"))
checkAnswer(sql("SELECT percentile(key, 1) FROM src LIMIT 1"),
  sql("SELECT max(key) FROM src").collect().toSeq)
It seems to be a big behavior change in Spark testing. So, this PR is going to force us to use an explicit `collect().toSeq` for `checkAnswer` in some cases?
@@ -1757,7 +1757,7 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
    sql("insert into tbl values ('3', '2.3')")
    checkAnswer(
      sql("select (cast (99 as decimal(19,6)) + cast('3' as decimal)) * cast('2.3' as decimal)"),
      Row(BigDecimal(204.0))
      Row(204.0)
The result type of the above SQL statement is `decimal(31,6)`. Can we use decimal type here?
  }
}

case class NormalizeNaNAndZero(child: Expression) extends UnaryExpression with ExpectsInputTypes {
If `NormalizeFloatingNumbers` is an optimizer rule, `NormalizeNaNAndZero` should only go through the optimizer, so does it need to extend `ExpectsInputTypes`?
It doesn't, but I do it for safety. IIUC the test framework will throw exception if a plan becomes unresolved after a rule.
@dongjoon-hyun I've implemented a safer way to let the test framework distinguish -0.0; now we don't need to change a lot of existing test cases.
That's a big relief. Thank you, @cloud-fan!
case _ => plan transform {
  case w: Window if w.partitionSpec.exists(p => needNormalize(p.dataType)) =>
    w.copy(partitionSpec = w.partitionSpec.map(normalize))
Nit. All the window expressions in the project list also refer to the partitionSpec. Should we also normalize these?
Assume the query is `select a, a + sum(a) over (partition by a) ...`. Since the project list is evaluated for each input row, I think the `a` in the project list should retain the difference of -0.0. Thus I think only `partitionSpec` needs to be normalized.
Then make this clear by writing it up in a comment please? If the answer to this question is not obvious to the reviewer then it may also not be obvious to a later reader of the code, so in general it is advisable to answer misguided reviewer questions by adding comments. :)
 * This rule normalizes NaN and -0.0 in Window partition keys, Join keys and Aggregate grouping
 * expressions.
 */
object NormalizeFloatingNumbers extends Rule[LogicalPlan] {
Technically an optimizer rule should not change the result of a query. This rule does exactly that. Perhaps we should add a little bit of documentation for this.
The major reason is we create `Join`s during optimization (for subqueries), and I'm also worried that join reorder may break it. I'll add a comment for it.
Also add it to nonExcludableRules?
ah good catch!
retest this please
Test build #100892 has finished for PR 23388 at commit
 * treated as same.
 * 2. In GROUP BY, different NaNs should belong to the same group, -0.0 and 0.0 should belong
 *    to the same group.
 * 3. In join keys, different NaNs should be treated as same, `-0.0` and `0.0` should be
NaNs are never equal to anything including other NaNs, so there is no reason to normalize them for join keys. It is fine to do it anyway for simplicity, but it should be made clear in the comments that this is not because we have to but just because it is easier.
That's a good point. In Spark SQL, the EQUAL operator thinks 0.0 and -0.0 are the same, so we have to follow it in join keys. I'm not sure how the SQL standard defines it, but it's another topic if we want to change the equality semantics of Spark SQL.
But you are right that we don't have to do it for joins; we only need to do normalization for the types of join that do binary comparison.
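A JVM-level sketch of why joins that do binary comparison or hashing need this (illustrative code, not Spark's): IEEE equality says the keys match, but the raw bits a hash join would hash or compare do not.

```java
public class JoinKeyBitsDemo {
    public static void main(String[] args) {
        // The equality operator matches 0.0 and -0.0...
        System.out.println(0.0d == -0.0d);  // true

        // ...but their bit patterns, which a binary comparison or hash
        // function would consume, are different.
        long posBits = Double.doubleToLongBits(0.0d);
        long negBits = Double.doubleToLongBits(-0.0d);
        System.out.println(posBits == negBits);  // false

        // So without normalization, "equal" keys land in different
        // hash buckets and never meet.
        System.out.println(Double.hashCode(0.0d) == Double.hashCode(-0.0d)); // false
    }
}
```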
 * expressions.
 *
 * Note that, this rule should be an analyzer rule, as it must be applied to make the query result
 * corrected. Currently it's executed as an optimizer rule, because the optimizer may create new
"correct"
 * This rule normalizes NaN and -0.0 in Window partition keys, Join keys and Aggregate grouping
 * expressions.
 *
 * Note that, this rule should be an analyzer rule, as it must be applied to make the query result
This reads as if the code is wrong. But it is not. The fact that we have to do this normalization for joins at least is not something that needs to be an analyzer rule for query correctness. Without the normalization the join query is perfectly fine if we execute it as a cross product with a filter applied as a post-join condition. In this case the requirement for normalization is an artifact of the fact that we use a shortcut for executing the join (binary comparison, sometimes hashing) which doesn't have the correct semantics for comparison. On the other hand for aggregation and window function partitioning the normalization is required for correctness.
    w.copy(partitionSpec = w.partitionSpec.map(normalize))

case j @ ExtractEquiJoinKeys(_, leftKeys, rightKeys, condition, _, _)
    if leftKeys.exists(k => needNormalize(k.dataType)) =>
Should this also check the right keys? Or is that implied by the fact that the keys of both sides have the same type? If so, please leave a comment to make it clear why the right keys are not checked.
the analyzer will make sure the left and right join keys are of the same data type. I'll add a comment to explain it, thanks!
case FloatType | DoubleType => true
case StructType(fields) => fields.exists(f => needNormalize(f.dataType))
case ArrayType(et, _) => needNormalize(et)
// We don't need to handle MapType here, as it's not comparable.
This code is not future proof against the situation where map types are comparable in the future. It could be made future proof by throwing an exception if a map type is encountered here. If I understand correctly this code should never encounter a map type unless map types are comparable.
good point! I'll update soon.
assert(floats(0).getLong(1) == 3)

test("SPARK-26021: NaN and -0.0 in grouping expressions") {
  checkAnswer(
    Seq(0.0f, -0.0f, 0.0f/0.0f, Float.NaN).toDF("f").groupBy("f").count(),
This test relies on the specific property that division of zero by zero returns a different kind of NaN than Float.NaN. That is subtle and needs to be documented with a comment. You could also test with floatToRawIntBits that the values actually have different bits. Because if they do not, then you are actually not testing what the test is purportedly testing.
fixed.
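For reference, a plain-JVM sketch of the reviewer's point (the bit patterns below are illustrative, not the test's values): IEEE 754 admits many NaN encodings, and the non-raw `floatToIntBits` already collapses them to the canonical one, which is the same idea as the normalization in this PR.

```java
public class NaNBitsDemo {
    public static void main(String[] args) {
        // Two distinct NaN bit patterns (different payloads).
        float canonical = Float.intBitsToFloat(0x7fc00000);
        float other     = Float.intBitsToFloat(0x7fc00001);

        System.out.println(Float.isNaN(canonical) && Float.isNaN(other)); // true
        System.out.println(canonical == other);  // false: NaN equals nothing

        // The raw bits differ (on typical JVMs that preserve NaN payloads)...
        System.out.println(Float.floatToRawIntBits(canonical)
            == Float.floatToRawIntBits(other));  // false
        // ...but the non-raw conversion normalizes every NaN to the
        // canonical pattern.
        System.out.println(Float.floatToIntBits(other) == 0x7fc00000);    // true
    }
}
```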
// test with complicated type grouping expressions
checkAnswer(
  Seq(0.0f, -0.0f, 0.0f/0.0f, Float.NaN).toDF("f").groupBy(array("f"), struct("f")).count(),
And maybe also arrays of structs and structs of arrays?
checkAnswer(
  sql(
    """
      |SELECT v1.f, v1.d
Also select the v2 columns for clarity on why the result is what it is?
Seq(Array(-0.0f, 0.0f) -> Tuple2(-0.0d, Double.NaN)).toDF("arr", "stru").createTempView("v3")
Seq(Array(0.0f, -0.0f) -> Tuple2(0.0d, Double.NaN)).toDF("arr", "stru").createTempView("v4")
checkAnswer(
  sql("SELECT v3.arr, v3.stru FROM v3 JOIN v4 ON v3.arr = v4.arr AND v3.stru = v4.stru"),
Why the style difference compared to the previous test cases?
 * 2. In GROUP BY, different NaNs should belong to the same group, -0.0 and 0.0 should belong
 *    to the same group.
 * 2. In aggregate grouping keys, different NaNs should belong to the same group, -0.0 and 0.0
 *    should belong to the same group.
 * 3. In join keys, different NaNs should be treated as same, `-0.0` and `0.0` should be
Still remove "different NaNs should be treated as same" here?
Wait. This isn't right. NaNs in joins should actually be treated as *not*
equal. That can't be done with binary value comparison. Well, maybe we can
normalize NaNs to nulls? That would do the right thing.
Hi @bart-samwel, in Spark SQL, different NaN values are treated as the same, and -0.0 and 0.0 are treated as the same. So the normalization proposed by this PR keeps the current behavior and semantics unchanged. If later on we find that NaN values should not be treated as the same, we basically need to change 2 places:
1. the EQUAL operator should drop the special handling of NaNs, and always return false for 2 NaNs.
2. the normalization here should not apply to NaNs.
Furthermore, even the same NaN value should not be equal to itself, so the binary comparison won't work. As you proposed, we should normalize NaN to null at that time. Anyway, I think we should think about NaN later, as it will be a behavior change. What do you think?
Let's be consistent with the equals operator now in this PR, but then we may want to consider changing that to be consistent with the IEEE floating point standard before Spark 3.0 as well.
Test build #100927 has finished for PR 23388 at commit
Test build #100928 has finished for PR 23388 at commit
@cloud-fan open a JIRA and revisit NaN handling before Spark 3.0?
Will make one pass tonight. Thanks!
@gatorsmile I have created https://issues.apache.org/jira/browse/SPARK-26575 to track the followup.
LGTM. Thanks! Merged to master.
## What changes were proposed in this pull request?

In apache#23043 , we introduced a behavior change: Spark users are not able to distinguish 0.0 and -0.0 anymore. This PR proposes an alternative fix to the original bug, to retain the difference between 0.0 and -0.0 inside Spark. The idea is, we can rewrite the window partition key, join key and grouping key during logical phase, to normalize the special floating numbers. Thus only operators care about special floating numbers need to pay the perf overhead, and end users can distinguish -0.0.

## How was this patch tested?

existing test

Closes apache#23388 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
…s for final aggregate

## What changes were proposed in this pull request?

A followup of apache#23388 . `AggUtils.createAggregate` is not the right place to normalize the grouping expressions, as final aggregate is also created by it. The grouping expressions of final aggregate should be attributes which refer to the grouping expressions in partial aggregate. This PR moves the normalization to the caller side of `AggUtils`.

## How was this patch tested?

existing tests

Closes apache#23692 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This is a followup of #23388 . #23388 has an issue: it doesn't handle subquery expressions and assumes they will be turned into joins. However, this is not true for non-correlated subquery expressions. This PR fixes this issue. It now doesn't skip `Subquery`, and subquery expressions will be handled by `OptimizeSubqueries`, which runs the optimizer with the subquery. Note that, correlated subquery expressions will be handled twice: once in `OptimizeSubqueries`, once later when it becomes join. This is OK as `NormalizeFloatingNumbers` is idempotent now.

### Why are the changes needed?

fix a bug

### Does this PR introduce _any_ user-facing change?

yes, see the newly added test.

### How was this patch tested?

new test

Closes #28785 from cloud-fan/normalize.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
In #23043 , we introduced a behavior change: Spark users are not able to distinguish 0.0 and -0.0 anymore.
This PR proposes an alternative fix to the original bug, to retain the difference between 0.0 and -0.0 inside Spark.
The idea is, we can rewrite the window partition key, join key and grouping key during the logical phase to normalize the special floating numbers. Thus only operators that care about special floating numbers need to pay the perf overhead, and end users can distinguish -0.0.
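A minimal JVM-level sketch of the normalization described above (an illustrative helper, not Spark's actual `NormalizeFloatingNumbers` implementation):

```java
public class NormalizeSketch {
    // Map -0.0 to 0.0 and every NaN to the canonical NaN, so that bit-level
    // hashing/comparison in grouping, join and window partition keys agrees
    // with IEEE equality semantics.
    static double normalize(double d) {
        if (Double.isNaN(d)) {
            return Double.NaN; // canonical NaN bit pattern
        }
        if (d == 0.0d) {       // matches both 0.0 and -0.0
            return 0.0d;
        }
        return d;
    }

    public static void main(String[] args) {
        // After normalization the two zeros share one bit pattern.
        System.out.println(Double.doubleToLongBits(normalize(-0.0d))
            == Double.doubleToLongBits(0.0d));   // true
        // Other values pass through untouched.
        System.out.println(normalize(1.5d));     // 1.5
    }
}
```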
How was this patch tested?
existing test