perfect_float feature for deserialization accuracy #541
Conversation
Benchmarks were run using https://github.com/serde-rs/json-benchmark, comparing master against the perfect-float branch.
Thanks! I haven't read the implementation yet, but some comments about the tests:
("18446744073709552000.0", 18446744073709552000.0), | ||
]); | ||
|
||
#[cfg(not(feature = "perfect_float"))] |
How come these don't pass with perfect_float?
@dtolnay The issue is that this value isn't the exact representation of the correctly rounded float. It is a close-to-halfway decimal, and the correct bit pattern is actually:
0100001111110000000000000000000000000000000000000000000000000000
The annotated bit pattern is:

| Sign | Exponent | Significant Digits |
|---|---|---|
| 0 | 10000111111 | 0000000000000000000000000000000000000000000000000000 |
As you can see, the significant digits have a 0 in the least significant bit, so anything at or below the halfway point (a bit set just past the significant digits) rounds down under the default IEEE 754 rounding mode, round-nearest-ties-to-even. The correct value is 18446744073709551616.0.
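To see this with the standard library (a quick sketch of my own, not code from this PR, relying on std's correctly rounding float parsing), parsing the close-to-halfway decimal rounds down to 2^64, which has exactly the bit pattern above:

fn main() {
    // "18446744073709552000.0" lies below the halfway point between 2^64 and the
    // next representable f64, so a correctly rounding parser returns 2^64 exactly.
    let parsed: f64 = "18446744073709552000.0".parse().unwrap();
    assert_eq!(parsed, 18446744073709551616.0); // 2^64
    // Sign 0, exponent 10000111111, significant digits all zero.
    assert_eq!(parsed.to_bits(), 0x43F0_0000_0000_0000);
}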
For simple code to experiment with halfway representations, you can use the following Python:
import numpy as np

class FloatMixin:
    '''Mixin for floating-point methods.'''

    def __init__(self, value):
        self.value = self.float_type(value)

    def to_bits(self):
        '''Extract the bitwise representation of the float.'''
        return np.frombuffer(self.value.tobytes(), dtype=self.int_type)[0]

    @classmethod
    def from_bits(cls, value):
        '''Create a float from its bitwise representation.

        `value` must be a NumPy unsigned integer of the matching width.
        '''
        return cls(np.frombuffer(value.tobytes(), dtype=cls.float_type)[0])

    def to_hex(self):
        '''Convert the float to a hex string.'''
        return '{0:0{1}x}'.format(self.to_bits(), self.value.itemsize * 2)

    def is_denormal(self):
        '''Returns true if the float is denormal.'''
        return self.to_bits() & self.EXPONENT_MASK == 0

    def is_special(self):
        '''Returns true if the float is NaN or Infinite.'''
        return self.to_bits() & self.EXPONENT_MASK == self.EXPONENT_MASK

    def is_nan(self):
        '''Returns true if the float is NaN.'''
        return self.is_special() and self.to_bits() & self.MANTISSA_MASK != 0

    def is_inf(self):
        '''Returns true if the float is Infinite.'''
        return self.is_special() and self.to_bits() & self.MANTISSA_MASK == 0

    def exponent(self):
        '''Get the exponent component from the float.'''
        if self.is_denormal():
            return self.DENORMAL_EXPONENT
        bits = self.to_bits()
        exp_bits = bits & self.EXPONENT_MASK
        biased_e = np.int32(exp_bits >> self.int_type(self.MANTISSA_SIZE))
        return biased_e - self.EXPONENT_BIAS

    def mantissa(self):
        '''Get the mantissa component from the float.'''
        bits = self.to_bits()
        s = bits & self.MANTISSA_MASK
        if not self.is_denormal():
            return s + self.HIDDEN_BIT_MASK
        return s

class Float32(FloatMixin):
    '''Wrapper around a 32-bit floating point value.'''

    SIGN_MASK = np.uint32(0x80000000)
    EXPONENT_MASK = np.uint32(0x7F800000)
    HIDDEN_BIT_MASK = np.uint32(0x00800000)
    MANTISSA_MASK = np.uint32(0x007FFFFF)
    MANTISSA_SIZE = np.int32(23)
    EXPONENT_BIAS = np.int32(127 + MANTISSA_SIZE)
    DENORMAL_EXPONENT = np.int32(1 - EXPONENT_BIAS)
    float_type = np.float32
    int_type = np.uint32

class Float64(FloatMixin):
    '''Wrapper around a 64-bit floating point value.'''

    SIGN_MASK = np.uint64(0x8000000000000000)
    EXPONENT_MASK = np.uint64(0x7FF0000000000000)
    HIDDEN_BIT_MASK = np.uint64(0x0010000000000000)
    MANTISSA_MASK = np.uint64(0x000FFFFFFFFFFFFF)
    MANTISSA_SIZE = np.int32(52)
    EXPONENT_BIAS = np.int32(1023 + MANTISSA_SIZE)
    DENORMAL_EXPONENT = np.int32(1 - EXPONENT_BIAS)
    float_type = np.float64
    int_type = np.uint64
Now, to calculate b+1, the next float up from a given float b, and b+h, the halfway point, do:
b = Float64(18446744073709551616.0)
b1 = Float64.from_bits(b.to_bits() + Float64.int_type(1))
bh = (int(b.value) + int(b1.value)) // 2
In this case, for b of 18446744073709551616.0, b+1 is 18446744073709555712.0, and b+h is 18446744073709553664.0. b+h cannot be represented as an exact floating-point number, but we can confirm the rounding is accurate with a few tests:
>>> float("18446744073709553664.0") # b+h
18446744073709551616.0
>>> float("18446744073709553665.0") # above b+h
18446744073709555712.0
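The same check can be done in Rust with the standard library (a rough sketch of mine, not from this PR, again relying on std's correctly rounding parser):

fn main() {
    let b: f64 = 18446744073709551616.0; // 2^64
    let b1 = f64::from_bits(b.to_bits() + 1); // next representable f64 above b
    // Both floats are exact integers here, so the halfway point is exact in u128.
    let bh = (b as u128 + b1 as u128) / 2;
    assert_eq!(bh, 18446744073709553664);
    // At the halfway point, ties-to-even rounds down to b (even mantissa);
    // anything above rounds up to b1.
    assert_eq!("18446744073709553664.0".parse::<f64>().unwrap(), b);
    assert_eq!("18446744073709553665.0".parse::<f64>().unwrap(), b1);
}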
@dtolnay The accurate float parser does not mean every decimal representation can be exactly represented as a floating-point number; it does mean, however, that the closest representation will always be used, and that these conversions are stable.
I hope my explanation makes sense?
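As an illustration of "closest representation" (my example, not from the PR): a correctly rounding parser maps any decimal string to the nearest f64, so a short decimal and the exact decimal expansion of its nearest float parse to the same value:

fn main() {
    let short: f64 = "0.1".parse().unwrap();
    // The exact decimal value of the f64 nearest to 0.1.
    let exact: f64 = "0.1000000000000000055511151231257827021181583404541015625"
        .parse()
        .unwrap();
    assert_eq!(short.to_bits(), exact.to_bits());
}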
I understand that in general not every decimal representation can be exactly represented as a floating-point number, that is obvious to me. The part I don't think you answered is why these particular values in the test no longer round trip.
My expectation for f64::to_string is that it produces a decimal representation (hopefully with as few significant digits as possible, but this is not necessary) which is mathematically closer to the original input float than to any other float.
My expectation for f64::from_str is that it produces the float which is mathematically closest to the input digits.
This test was supposed to test that pair of expectations. Of the two expectations, is one or both untrue of the standard library? Is one or both untrue of the implementation in this PR?
For the standard library my understanding is that the following line succeeds for any input float n, and that's certainly true for the two specific floats in this test.
assert_eq!(n.to_string().parse::<f64>().unwrap(), n);
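For instance, wrapping that assertion in a quick standalone test (my sketch, using two of the values that appear in the test snippet later in this thread) passes with the standard library:

#[test]
fn round_trip_to_string_parse() {
    // to_string() prints enough digits to round-trip, and parse() is
    // correctly rounding, so the original value is recovered exactly.
    for &n in &[31.245270191439438_f64, 121.48791951161945_f64] {
        assert_eq!(n.to_string().parse::<f64>().unwrap(), n);
    }
}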
@dtolnay Oh I see what you mean, my mistake. However, the tests actually pass after adding them back in anyway:
#[cfg(not(feature = "arbitrary_precision"))]
#[cfg(feature = "perfect_float")]
test_parse_ok(vec![
("18446744073709552000.0", 18446744073709552000.0),
("31.245270191439438", 31.245270191439438),
("121.48791951161945", 121.48791951161945),
]);
I'm not sure why it originally failed; I have extensive tests that ensure this works in my codebase (and I test on ~20 different architectures for each release).
@ijl is it going to be possible to fix the cases where perfect_float successfully parses something that is not valid JSON?
@dtolnay I would need to look when I have more time. I can close this as stale until then if you prefer.
Let me know if I can be of any help integrating these changes into serde_json. There are a few issues: specifically, the format feature will need to be enabled and
Looks good to me. We might need to update the unit tests, but the logic looks sound internally. Let me clone the repository, test locally, and see if I can help.
# implementation is slower than the default and contains a lookup table that
# increases artifact size. It does not affect serialization. This
# feature has no effect if arbitrary_precision is enabled.
perfect_float = ["lexical-core"]
I would recommend using perfect_float by default. @dtolnay will ultimately decide, however. Correctness should always be favored by default, and performance at the cost of correctness should be opt-in, not opt-out.
This parses floats and integers using lexical-core from https://github.com/AlexHuszagh/rust-lexical. Finding the slice that represents a number could be done in readers, in the same way as parse_str_bytes() for String. This would allow SliceRead to avoid copying.
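Not the PR's implementation, but a rough sketch of the idea: scan ahead to find the slice spanning one JSON number, then hand the whole slice to a correctly rounding parser. The lexical_core::parse call matches recent lexical-core releases and is an assumption here; the version used in this PR may expose a different API.

fn number_slice(input: &[u8]) -> &[u8] {
    // Bytes that may appear inside a JSON number: digits, signs, '.', 'e'/'E'.
    let end = input
        .iter()
        .position(|&b| !matches!(b, b'0'..=b'9' | b'-' | b'+' | b'.' | b'e' | b'E'))
        .unwrap_or(input.len());
    &input[..end]
}

fn main() {
    let input = b"18446744073709552000.0, \"next\"";
    let num = number_slice(input);
    // Parse the entire number slice in one call instead of byte-by-byte.
    let value: f64 = lexical_core::parse(num).unwrap();
    assert_eq!(value, 18446744073709551616.0);
}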
I've pushed what I had been using, fixed up for stable and for an incompatibility with Rust 1.31. This could be reviewed again. As I added in the amended commit message, avoiding copying to scratch is an obvious thing to do. It's a larger change, though.
@ijl I might recommend using lexical-core v0.6 rather than v0.7, since v0.7 makes a breaking change and removes support for Rustc <
In #536 it looks like @Alexhuszagh will create a minimal float parser from lexical for serde_json, so I'm happy to close this.
This parses floats and integers using lexical-core from https://github.com/AlexHuszagh/rust-lexical. This fixes #536.