perfect_float feature for deserialization accuracy #541
Conversation
Benchmarks were run using https://github.com/serde-rs/json-benchmark, comparing master against the perfect-float branch.
Thanks! I haven't read the implementation yet, but some comments about the tests:
("18446744073709552000.0", 18446744073709552000.0), | ||
]); | ||
|
||
#[cfg(not(feature = "perfect_float"))] |
How come these don't pass with perfect_float?
@dtolnay The issue is that this value isn't the exact representation of the correctly rounded float. It is a close-to-halfway decimal, and the correct bit pattern is actually:
0100001111110000000000000000000000000000000000000000000000000000
The annotated bit pattern is:

| Sign | Exponent | Significant Digits |
|---|---|---|
| 0 | 10000111111 | 0000000000000000000000000000000000000000000000000000 |
As you can see, the significant digits have a 0 in the least significant bit, so anything at or below the halfway point (a bit set just past the significant digits) rounds down under the default IEEE 754 rounding mode, round-nearest-ties-to-even. The correct value is 18446744073709551616.0.
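To see this with the standard library (a quick sketch of my own, not code from this PR, relying on std's correctly rounding float parsing), parsing the close-to-halfway decimal rounds down to 2^64, which has exactly the bit pattern above:

fn main() {
    // "18446744073709552000.0" lies below the halfway point between 2^64 and the
    // next representable f64, so a correctly rounding parser returns 2^64 exactly.
    let parsed: f64 = "18446744073709552000.0".parse().unwrap();
    assert_eq!(parsed, 18446744073709551616.0); // 2^64
    // Sign 0, exponent 10000111111, significant digits all zero.
    assert_eq!(parsed.to_bits(), 0x43F0_0000_0000_0000);
}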
For simple code to experiment with halfway representations, you can use the following Python:
import numpy as np

class FloatMixin:
    '''Mixin for floating-point methods.'''

    def __init__(self, value):
        self.value = self.float_type(value)

    def to_bits(self):
        '''Extract the bitwise representation of the float.'''
        return np.frombuffer(self.value.tobytes(), dtype=self.int_type)[0]

    @classmethod
    def from_bits(cls, value):
        '''Create a float from its bitwise representation.

        `value` must be a NumPy unsigned integer of the matching width.
        '''
        return cls(np.frombuffer(value.tobytes(), dtype=cls.float_type)[0])

    def to_hex(self):
        '''Convert the float to a hex string.'''
        return '{0:0{1}x}'.format(self.to_bits(), self.value.itemsize * 2)

    def is_denormal(self):
        '''Returns true if the float is denormal.'''
        return self.to_bits() & self.EXPONENT_MASK == 0

    def is_special(self):
        '''Returns true if the float is NaN or Infinite.'''
        return self.to_bits() & self.EXPONENT_MASK == self.EXPONENT_MASK

    def is_nan(self):
        '''Returns true if the float is NaN.'''
        return self.is_special() and self.to_bits() & self.MANTISSA_MASK != 0

    def is_inf(self):
        '''Returns true if the float is Infinite.'''
        return self.is_special() and self.to_bits() & self.MANTISSA_MASK == 0

    def exponent(self):
        '''Get the exponent component from the float.'''
        if self.is_denormal():
            return self.DENORMAL_EXPONENT
        bits = self.to_bits()
        exp_bits = bits & self.EXPONENT_MASK
        biased_e = np.int32(exp_bits >> self.int_type(self.MANTISSA_SIZE))
        return biased_e - self.EXPONENT_BIAS

    def mantissa(self):
        '''Get the mantissa component from the float.'''
        bits = self.to_bits()
        s = bits & self.MANTISSA_MASK
        if not self.is_denormal():
            return s + self.HIDDEN_BIT_MASK
        return s

class Float32(FloatMixin):
    '''Wrapper around a 32-bit floating point value.'''

    SIGN_MASK = np.uint32(0x80000000)
    EXPONENT_MASK = np.uint32(0x7F800000)
    HIDDEN_BIT_MASK = np.uint32(0x00800000)
    MANTISSA_MASK = np.uint32(0x007FFFFF)
    MANTISSA_SIZE = np.int32(23)
    EXPONENT_BIAS = np.int32(127 + MANTISSA_SIZE)
    DENORMAL_EXPONENT = np.int32(1 - EXPONENT_BIAS)
    float_type = np.float32
    int_type = np.uint32

class Float64(FloatMixin):
    '''Wrapper around a 64-bit floating point value.'''

    SIGN_MASK = np.uint64(0x8000000000000000)
    EXPONENT_MASK = np.uint64(0x7FF0000000000000)
    HIDDEN_BIT_MASK = np.uint64(0x0010000000000000)
    MANTISSA_MASK = np.uint64(0x000FFFFFFFFFFFFF)
    MANTISSA_SIZE = np.int32(52)
    EXPONENT_BIAS = np.int32(1023 + MANTISSA_SIZE)
    DENORMAL_EXPONENT = np.int32(1 - EXPONENT_BIAS)
    float_type = np.float64
    int_type = np.uint64
Now, to calculate b+1, the next float up from a given float b, and b+h, the halfway point, do:
b = Float64(18446744073709551616.0)
b1 = Float64.from_bits(b.to_bits() + Float64.int_type(1))
bh = (int(b.value) + int(b1.value)) // 2
In this case, for b of 18446744073709551616.0, b+1 is 18446744073709555712.0, and b+h is 18446744073709553664.0. b+h cannot be represented as an exact floating-point number, but we can confirm the rounding is accurate with a few tests:
>>> float("18446744073709553664.0") # b+h
18446744073709551616.0
>>> float("18446744073709553665.0") # above b+h
18446744073709555712.0
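The same check can be done in Rust with the standard library (a rough sketch of mine, not from this PR, again relying on std's correctly rounding parser):

fn main() {
    let b: f64 = 18446744073709551616.0; // 2^64
    let b1 = f64::from_bits(b.to_bits() + 1); // next representable f64 above b
    // Both floats are exact integers here, so the halfway point is exact in u128.
    let bh = (b as u128 + b1 as u128) / 2;
    assert_eq!(bh, 18446744073709553664);
    // At the halfway point, ties-to-even rounds down to b (even mantissa);
    // anything above rounds up to b1.
    assert_eq!("18446744073709553664.0".parse::<f64>().unwrap(), b);
    assert_eq!("18446744073709553665.0".parse::<f64>().unwrap(), b1);
}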
@dtolnay The accurate float parser does not mean every decimal representation can be exactly represented as a floating-point number; it does mean, however, that the closest representation will always be used, and that these conversions are stable.
I hope my explanation makes sense?
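As an illustration of "closest representation" (my example, not from the PR): a correctly rounding parser maps any decimal string to the nearest f64, so a short decimal and the exact decimal expansion of its nearest float parse to the same value:

fn main() {
    let short: f64 = "0.1".parse().unwrap();
    // The exact decimal value of the f64 nearest to 0.1.
    let exact: f64 = "0.1000000000000000055511151231257827021181583404541015625"
        .parse()
        .unwrap();
    assert_eq!(short.to_bits(), exact.to_bits());
}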
I understand that in general not every decimal representation can be exactly represented as a floating-point number, that is obvious to me. The part I don't think you answered is why these particular values in the test no longer round trip.
My expectation for f64::to_string is that it produces a decimal representation (hopefully with as few significant digits as possible, but this is not necessary) which is mathematically closer to the original input float than to any other float.
My expectation for f64::from_str is that it produces the float which is mathematically closest to the input digits.
This test was supposed to test that pair of expectations. Of the two expectations, is one or both untrue of the standard library? Is one or both untrue of the implementation in this PR?
For the standard library my understanding is that the following line succeeds for any input float n, and that's certainly true for the two specific floats in this test.
assert_eq!(n.to_string().parse::<f64>().unwrap(), n);
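For instance, wrapping that assertion in a quick standalone test (my sketch, using two of the values that appear in the test snippet later in this thread) passes with the standard library:

#[test]
fn round_trip_to_string_parse() {
    // to_string() prints enough digits to round-trip, and parse() is
    // correctly rounding, so the original value is recovered exactly.
    for &n in &[31.245270191439438_f64, 121.48791951161945_f64] {
        assert_eq!(n.to_string().parse::<f64>().unwrap(), n);
    }
}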
@dtolnay Oh I see what you mean, my mistake. However, the tests actually pass after adding them back in anyway:
#[cfg(not(feature = "arbitrary_precision"))]
#[cfg(feature = "perfect_float")]
test_parse_ok(vec![
("18446744073709552000.0", 18446744073709552000.0),
("31.245270191439438", 31.245270191439438),
("121.48791951161945", 121.48791951161945),
]);
I'm not sure why it originally failed; I have extensive tests that ensure this works in my codebase (and I test on ~20 different architectures for each release).
@ijl is it going to be possible to fix the cases where perfect_float successfully parses something that is not valid JSON?
@dtolnay I would need to look when I have more time. I can close this as stale until then if you prefer.
Let me know if I can be of any help integrating these changes into serde_json. There are a few issues: specifically, the format feature will need to be enabled and
Looks good to me. We might need to update the unit tests, but the logic looks sound internally. Let me clone the repository, test locally, and see if I can help.
# implementation is slower than the default and contains a lookup table that
# increases artifact size. It does not affect serialization. This
# feature has no effect if arbitrary_precision is enabled.
perfect_float = ["lexical-core"]
I would recommend using perfect_float by default. @dtolnay will ultimately decide, however. Correctness should always be favored by default, and performance at the cost of correctness should be opt-in, not opt-out.
This parses floats and integers using lexical-core from https://github.com/AlexHuszagh/rust-lexical. Finding the slice that represents a number could be done in readers, in the same way as parse_str_bytes() for String. This would allow SliceRead to avoid copying.
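Not the PR's implementation, but a rough sketch of the idea: scan ahead to find the slice spanning one JSON number, then hand the whole slice to a correctly rounding parser. The lexical_core::parse call matches recent lexical-core releases and is an assumption here; the version used in this PR may expose a different API.

fn number_slice(input: &[u8]) -> &[u8] {
    // Bytes that may appear inside a JSON number: digits, signs, '.', 'e'/'E'.
    let end = input
        .iter()
        .position(|&b| !matches!(b, b'0'..=b'9' | b'-' | b'+' | b'.' | b'e' | b'E'))
        .unwrap_or(input.len());
    &input[..end]
}

fn main() {
    let input = b"18446744073709552000.0, \"next\"";
    let num = number_slice(input);
    // Parse the entire number slice in one call instead of byte-by-byte.
    let value: f64 = lexical_core::parse(num).unwrap();
    assert_eq!(value, 18446744073709551616.0);
}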
I've pushed what I had been using, fixed up for stable and for an incompatibility with Rust 1.31. This could be reviewed again. As I added in the amended commit message, avoiding copying to scratch is an obvious thing to do. It's a larger change, though.
@ijl I might recommend using lexical-core v0.6 rather than v0.7, since v0.7 makes a breaking change and removes support for Rustc <
In #536 it looks like @Alexhuszagh will create a minimal float parser from lexical for serde_json, so I'm happy to close this.
This parses floats and integers using lexical-core from https://github.com/AlexHuszagh/rust-lexical. This fixes #536.