[SPARK-26021][SQL][followup] add test for special floating point values #23141
Conversation
- In Spark version 2.4 and earlier, `Dataset.groupByKey` results in a grouped dataset whose key attribute is wrongly named "value" if the key is of a non-struct type, e.g. int, string, array, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a default value of `false`.
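A minimal sketch of the `groupByKey` naming change described in the note above (assuming a Spark 3.0+ `SparkSession` named `spark`; the sample data is illustrative):

```scala
import spark.implicits._

val ds = Seq("a", "b", "a").toDS()

// Since 3.0 the grouping column is named "key"; in 2.4 and earlier it was "value".
ds.groupByKey(identity).count().printSchema()

// The legacy naming can be restored with the configuration mentioned above:
spark.conf.set("spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue", "true")
```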
- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can't distinguish them any more.
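A minimal sketch of the -0.0 behavior described in the note above (assuming a `SparkSession` named `spark`; in 2.4 the first row would display as -0.0):

```scala
import spark.implicits._

val df = Seq(-0.0d, 0.0d).toDF("d")

// Since 3.0, -0.0 is replaced by 0.0 when the value is written into Spark's
// internal row format, so show()/collect() can no longer tell the two apart.
df.show()

// Grouping consequently treats the two values as one group of size 2.
df.groupBy("d").count().show()
```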
I checked Presto and Postgres; the behaviors are the same. Hive distinguishes -0.0 and 0.0, but it has the group-by bug.
What version of Hive did you test?
It was fixed in https://issues.apache.org/jira/browse/HIVE-11174
I ran a few simple queries on Hive 2.1.
A simple comparison seems OK:
hive> select 1 where 0.0=-0.0;
OK
1
Time taken: 0.047 seconds, Fetched: 1 row(s)
hive> select 1 where -0.0<0.0;
OK
Time taken: 0.053 seconds
But the group-by behavior seems incorrect:
hive> select * from test;
OK
0.0
-0.0
0.0
Time taken: 0.11 seconds, Fetched: 3 row(s)
hive> select a, count(*) from test group by a;
-0.0 3
Time taken: 1.308 seconds, Fetched: 1 row(s)
Test build #99263 has finished for PR 23141 at commit
retest this please.
Test build #99269 has finished for PR 23141 at commit
LGTM
Test build #99279 has finished for PR 23141 at commit
thanks, merging to master!
byte[] doubleBytes2 = new byte[Double.BYTES];
byte[] floatBytes2 = new byte[Float.BYTES];
Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, 0.0d);
Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, 0.0f);
Unfortunately, these should be `doubleBytes2` and `floatBytes2`.
I'll comment on the follow-up PR #23239 too; we can fix it there.
ah good catch! I'm surprised this test passed before...
What changes were proposed in this pull request?
A followup of #23043. Add a test to show the minor behavior change introduced by #23043, and add a migration guide.
How was this patch tested?
a new test
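A hedged sketch of the kind of end-to-end check this change is about (illustrative only, not the actual test added by this PR; the object and app names are made up):

```scala
import org.apache.spark.sql.SparkSession

object NegativeZeroSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("neg-zero-sketch").getOrCreate()
    import spark.implicits._

    // Mirrors the Hive example discussed above: 0.0, -0.0, 0.0.
    val df = Seq(0.0d, -0.0d, 0.0d).toDF("d")

    // With -0.0 normalized to 0.0, grouping yields a single row with count 3.
    df.groupBy("d").count().show()

    spark.stop()
  }
}
```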