-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix -NaN ordering in spec #2891
Fix -NaN ordering in spec #2891
Conversation
In Java, -NaN is not distinguishable from NaN (is the same value, as introspectable with `Float.floatToIntBits` or `Double.doubleToLongBits`). Moreover, Java sorts all possible `NaN` values within single peer group. The different NaN values compare equal, even if they are not the same. They always compare as "greater" than positive infinity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I'm hoping Ryan could also take a look in case I miss anything. For posterity, this PR comes from this thread
@@ -267,7 +267,8 @@ A sort order is defined by an sort order id and a list of sort fields. The order | |||
|
|||
Order id `0` is reserved for the unsorted order. | |||
|
|||
Sorting floating-point numbers should produce the following behavior: `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. This aligns with the implementation of Java floating-point types comparisons. | |||
Sorting floating-point numbers should produce the following behavior: `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. | |||
The different `NaN` representation can be in an arbitrary order. This aligns with the implementation of Java floating-point types comparisons. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you specify more on what "order" means here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"order" was supposed to refer to sorting from the previous sentence. can i ask you for help on better wording?
Is this specific to Java? Are -NaN values ordered in other languages? |
This brings up the question of how different NaN representations should be handled in Iceberg. Should writers canonicalize them? What do ORC and Parquet do for non-canonical values? |
@rdblue this i do not know. Since the spec points at Java sorting as the 'reference', so i focused on that.
@electrum this is a good question, and i was thinking about this too. However, even if we follow this path, we still could want to define how NaNs interact with
@electrum you mean the reference writer implementations? i don't know. |
@electrum, I was leaving this vague. My inclination is to not to require canonicalization. It's more work and there may be legitimate uses of signalling NaNs. Plus, I just looked into signalling NaN values and they are evidently not part of the IEEE 754 spec and differ between processors:
@findepi, in IEEE 754, the sign bit is independent of whether a value is NaN (exponent bits are all set and significand is non-zero). That means there are legitimate -NaN values and it would be reasonable for another language to sort them that way. I think that there is no need to change the handling of -NaN values in the spec. It's also reasonable for Java to not distinguish between -NaN and NaN values and to only produce positive NaNs. The only update I would make is to ensure that the Java implementation only writes positive NaN values to align with its sort. In most cases, no changes are needed because we track NaN values through counts after identifying them with |
It would be more accurate to say that java doesn't distinguish NaN values with sign bit set and those without. If your intention is to expect Java writers to sort these as before everything, we still need to update the spec (and Java writers). If your intention is to expect Java writers not to write NaN values with sign bit set, this is in fact the canonicalization that you wanted to avoid. As was observed, signalling NaNs are not portable between processors, so we should exclude them from the picture. Thus, if e.g. Rust-based application writes -NaN values and Java application rewrites the data file (update, compaction, format change), we should expect the NaN attributes to be lost. It seems it would be safer to have just one NaN concept in the Iceberg spec and treat all NaN values as indistinguishable. |
@rdblue i am not sure what the conclusion here is. Please help me understand. |
I'm not sure I see the problem with just saying all "NaN" values come last. Doesn't -NaN not actually exist as a defined constant? From what I know of the standard the sign bit doesn't change the state of the NaN so we should treat them all the same. Since Java already does this seems like we would be fine just saying all NAN are > positive infinity. No need to qualify it imho. So I'm +1 on just removing it from the spec. No need to add a detail about NaN ordering after that I think. |
it doesn't exist in Java, yet you can calculate
Well, arguably you can dissect bits of a NaN value (via Thus, one could argue to treat NaN values with sign bit set differently from NaN values with sign bit not set, but I don't believe this is what is desired, nor implemented. |
What should happen with this PR? |
@RussellSpitzer thanks for the review. |
I think i'm good with merging the patch as it was. According to IEEE we actually have a huge number of possible byte representations that are nan,
so there are 2^24 -1 or 2 ^ 54 -1 possible NAN's for floats and doubles respectively. I don't think it's very important for us to define a concrete ordering of all different byte representations. If we want to be really strict we could just say all Nan's must be sorted in lexical byte order but I don't think it makes sense to put them on either end of the Infinity Spectrum. I think they should all just be grouped together regardless of if we strictly decide how they are ordered within that group. @rdblue Did you have strong opinions on this? |
yes, that's practical
in Java it's hard (but not impossible) to discern between them, but that could come at a cost, especially if I want to have a sort key of some sort. i'd prefer not to mandate any particular ordering among NaN values |
In Java, -NaN is not distinguishable from NaN (is the same value, as
introspectable with
Float.floatToIntBits
orDouble.doubleToLongBits
).Moreover, Java sorts all possible
NaN
values within single peer group.The different NaN values compare equal, even if they are not the same.
They always compare as "greater" than positive infinity.
cc @yyanyy @rdblue @electrum