[SPARK-26021][SQL][followup] add test for special floating point values #23141
Conversation
- In Spark version 2.4 and earlier, `Dataset.groupByKey` results in a grouped dataset whose key attribute is wrongly named "value" if the key is of a non-struct type, e.g. int, string, array, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a default value of `false`.
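A minimal sketch of the `groupByKey` naming change described in the note above (assuming a Spark 3.0+ `SparkSession` named `spark`; the sample data is illustrative):

```scala
import spark.implicits._

val ds = Seq("a", "b", "a").toDS()

// Since 3.0 the grouping column is named "key"; in 2.4 and earlier it was "value".
ds.groupByKey(identity).count().printSchema()

// The legacy naming can be restored with the configuration mentioned above:
spark.conf.set("spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue", "true")
```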
- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can't distinguish them any more.
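A minimal sketch of the -0.0 behavior described in the note above (assuming a `SparkSession` named `spark`; in 2.4 the first row would display as -0.0):

```scala
import spark.implicits._

val df = Seq(-0.0d, 0.0d).toDF("d")

// Since 3.0, -0.0 is replaced by 0.0 when the value is written into Spark's
// internal row format, so show()/collect() can no longer tell the two apart.
df.show()

// Grouping consequently treats the two values as one group of size 2.
df.groupBy("d").count().show()
```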
I checked Presto and Postgres; the behaviors are the same. Hive distinguishes -0.0 and 0.0, but it has the group-by bug.
What version of Hive did you test?
It was fixed in https://issues.apache.org/jira/browse/HIVE-11174
I ran a few simple queries on Hive 2.1.
A simple comparison seems OK:
hive> select 1 where 0.0=-0.0;
OK
1
Time taken: 0.047 seconds, Fetched: 1 row(s)
hive> select 1 where -0.0<0.0;
OK
Time taken: 0.053 seconds
But the group-by behavior seems incorrect:
hive> select * from test;
OK
0.0
-0.0
0.0
Time taken: 0.11 seconds, Fetched: 3 row(s)
hive> select a, count(*) from test group by a;
-0.0 3
Time taken: 1.308 seconds, Fetched: 1 row(s)
Test build #99263 has finished for PR 23141 at commit
retest this please.
Test build #99269 has finished for PR 23141 at commit
LGTM
Test build #99279 has finished for PR 23141 at commit
thanks, merging to master!
byte[] doubleBytes2 = new byte[Double.BYTES];
byte[] floatBytes2 = new byte[Float.BYTES];
Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, 0.0d);
Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, 0.0f);
Unfortunately, these should be `doubleBytes2` and `floatBytes2`.
I'll comment on the follow-up PR #23239 too; we can fix it there.
ah good catch! I'm surprised this test passed before...
What changes were proposed in this pull request?
A followup of #23043. Add a test to show the minor behavior change introduced by #23043, and add a migration guide.
How was this patch tested?
a new test
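A hedged sketch of the kind of end-to-end check this change is about (illustrative only, not the actual test added by this PR; the object and app names are made up):

```scala
import org.apache.spark.sql.SparkSession

object NegativeZeroSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("neg-zero-sketch").getOrCreate()
    import spark.implicits._

    // Mirrors the Hive example discussed above: 0.0, -0.0, 0.0.
    val df = Seq(0.0d, -0.0d, 0.0d).toDF("d")

    // With -0.0 normalized to 0.0, grouping yields a single row with count 3.
    df.groupBy("d").count().show()

    spark.stop()
  }
}
```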