Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix -NaN ordering in spec #2891

Closed

Conversation

findepi
Copy link
Member

@findepi findepi commented Jul 29, 2021

In Java, -NaN is not distinguishable from NaN (is the same value, as
introspectable with Float.floatToIntBits or
Double.doubleToLongBits).

Moreover, Java sorts all possible NaN values within single peer group.
The different NaN values compare equal, even if they are not the same.
They always compare as "greater" than positive infinity.

cc @yyanyy @rdblue @electrum

In Java, -NaN is not distinguishable from NaN (is the same value, as
introspectable with `Float.floatToIntBits` or
`Double.doubleToLongBits`).

Moreover, Java sorts all possible `NaN` values within single peer group.
The different NaN values compare equal, even if they are not the same.
They always compare as "greater" than positive infinity.
Copy link
Contributor

@yyanyy yyanyy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I'm hoping Ryan could also take a look in case I miss anything. For posterity, this PR comes from this thread

@@ -267,7 +267,8 @@ A sort order is defined by an sort order id and a list of sort fields. The order

Order id `0` is reserved for the unsorted order.

Sorting floating-point numbers should produce the following behavior: `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. This aligns with the implementation of Java floating-point types comparisons.
Sorting floating-point numbers should produce the following behavior: `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`.
The different `NaN` representation can be in an arbitrary order. This aligns with the implementation of Java floating-point types comparisons.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you specify more on what "order" means here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"order" was supposed to refer to sorting from the previous sentence. can i ask you for help on better wording?

@rdblue
Copy link
Contributor

rdblue commented Jul 30, 2021

Is this specific to Java? Are -NaN values ordered in other languages?

@electrum
Copy link
Contributor

This brings up the question of how different NaN representations should be handled in Iceberg. Should writers canonicalize them? What do ORC and Parquet do for non-canonical values?

@findepi
Copy link
Member Author

findepi commented Jul 30, 2021

Is this specific to Java? Are -NaN values ordered in other languages?

@rdblue this i do not know. Since the spec points at Java sorting as the 'reference', so i focused on that.

This brings up the question of how different NaN representations should be handled in Iceberg. Should writers canonicalize them?

@electrum this is a good question, and i was thinking about this too.
it seems that, from Trino perspective, it doesn't matter much, because we treat all NaN values as indistinguishable. The canonicalzation is applied at comparison time, in the engine, so storage is not required to canonicalize. Of course, it would be better to have writers canonicalize, but I am concerned we will be never able to assume that at read time, because of pre-existing data.

However, even if we follow this path, we still could want to define how NaNs interact with distinct_counts in manifest.
Or, we would ignore distinct_counts whenever nan_value_counts > 0.
(I don't know yet, whether this is important. We may or may not use distinct_counts.)

What do ORC and Parquet do for non-canonical values?

@electrum you mean the reference writer implementations? i don't know.

@rdblue
Copy link
Contributor

rdblue commented Aug 1, 2021

@electrum, I was leaving this vague. My inclination is to not to require canonicalization. It's more work and there may be legitimate uses of signalling NaNs. Plus, I just looked into signalling NaN values and they are evidently not part of the IEEE 754 spec and differ between processors:

Encodings of qNaN and sNaN are not specified in IEEE 754 and implemented differently on different processors. (wikipedia)

@findepi, in IEEE 754, the sign bit is independent of whether a value is NaN (exponent bits are all set and significand is non-zero). That means there are legitimate -NaN values and it would be reasonable for another language to sort them that way. I think that there is no need to change the handling of -NaN values in the spec.

It's also reasonable for Java to not distinguish between -NaN and NaN values and to only produce positive NaNs. The only update I would make is to ensure that the Java implementation only writes positive NaN values to align with its sort. In most cases, no changes are needed because we track NaN values through counts after identifying them with Float.isNaN or Double.isNaN. The only cases that would need to be updated are when engines write -NaN values into files and are sorting them. Those engines would be responsible for converting -NaN to NaN or to implement the sort order according to the spec rather than using the default Java comparison.

@findepi
Copy link
Member Author

findepi commented Aug 2, 2021

It's also reasonable for Java to not distinguish between -NaN and NaN values and to only produce positive NaNs.

It would be more accurate to say that java doesn't distinguish NaN values with sign bit set and those without.
In Java, you can construct a NaN value with a bit sign set, e.g. Double.longBitsToDouble(0xfff8000000000000L).

If your intention is to expect Java writers to sort these as before everything, we still need to update the spec (and Java writers).
The spec points at Java sorting as the reference implementation.

If your intention is to expect Java writers not to write NaN values with sign bit set, this is in fact the canonicalization that you wanted to avoid.

As was observed, signalling NaNs are not portable between processors, so we should exclude them from the picture.
While NaN values can carry an additional payload (fraction bits, the sign), such use is not portable as well.
There might be legitimate use-cases for these in some applications, but it seems we cannot expect them to be respected by the engines, which do not distinguish them.

Thus, if e.g. Rust-based application writes -NaN values and Java application rewrites the data file (update, compaction, format change), we should expect the NaN attributes to be lost.
That can be confusing to users and users would consider this a bug, which we probably wouldn't be able to fix.

It seems it would be safer to have just one NaN concept in the Iceberg spec and treat all NaN values as indistinguishable.

@findepi
Copy link
Member Author

findepi commented Oct 9, 2021

@rdblue i am not sure what the conclusion here is. Please help me understand.

@RussellSpitzer
Copy link
Member

I'm not sure I see the problem with just saying all "NaN" values come last. Doesn't -NaN not actually exist as a defined constant? From what I know of the standard the sign bit doesn't change the state of the NaN so we should treat them all the same. Since Java already does this seems like we would be fine just saying all NAN are > positive infinity. No need to qualify it imho.

So I'm +1 on just removing it from the spec. No need to add a detail about NaN ordering after that I think.

@findepi
Copy link
Member Author

findepi commented Dec 3, 2021

Doesn't -NaN not actually exist as a defined constant?

it doesn't exist in Java, yet you can calculate -NaN as an expression (produces a NaN value)

From what I know of the standard the sign bit doesn't change the state of the NaN so we should treat them all the same.

Well, arguably you can dissect bits of a NaN value (via Double.doubleToRawLongBits and inspecting the various bits).
Yet, this is not what reference implementation is doing, right?

Thus, one could argue to treat NaN values with sign bit set differently from NaN values with sign bit not set, but I don't believe this is what is desired, nor implemented.

@findepi
Copy link
Member Author

findepi commented Feb 5, 2022

What should happen with this PR?

@findepi findepi requested review from rdblue and RussellSpitzer and removed request for rdblue and RussellSpitzer February 13, 2022 17:11
@findepi findepi closed this Apr 11, 2022
@findepi findepi deleted the findepi/fix-nan-ordering-in-spec-494876 branch April 11, 2022 10:06
@osscm
Copy link

osscm commented May 19, 2022

@RussellSpitzer thanks for the review.
is it possible we can finalize this PR, please?

@RussellSpitzer
Copy link
Member

I think i'm good with merging the patch as it was. According to IEEE we actually have a huge number of possible byte representations that are nan,

Any Sign bit : 2 Options
All 1 for Exponent : 1 Option
Anything but all 0 for the fraction : 2^23 - 1 options (or 2 ^ 53 -1) options

so there are 2^24 -1 or 2 ^ 54 -1 possible NAN's for floats and doubles respectively.

I don't think it's very important for us to define a concrete ordering of all different byte representations. If we want to be really strict we could just say all Nan's must be sorted in lexical byte order but I don't think it makes sense to put them on either end of the Infinity Spectrum. I think they should all just be grouped together regardless of if we strictly decide how they are ordered within that group.

@rdblue Did you have strong opinions on this?

@findepi
Copy link
Member Author

findepi commented Jan 16, 2023

I think they should all just be grouped together

yes, that's practical

regardless of if we strictly decide how they are ordered within that group.

in Java it's hard (but not impossible) to discern between them, but that could come at a cost, especially if I want to have a sort key of some sort.

i'd prefer not to mandate any particular ordering among NaN values

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants