
perfect_float feature for deserialization accuracy #541

Closed
wants to merge 1 commit

Conversation


@ijl ijl commented May 24, 2019

This parses floats and integers using lexical-core from
https://github.com/AlexHuszagh/rust-lexical.

This fixes #536.


ijl commented May 24, 2019

Benchmarks using https://github.com/serde-rs/json-benchmark:

master:

                                DOM                  STRUCT
======= serde_json ======= parse|stringify ===== parse|stringify ====
data/canada.json         290 MB/s   540 MB/s   700 MB/s   430 MB/s
data/citm_catalog.json   450 MB/s   720 MB/s  1040 MB/s  1030 MB/s
data/twitter.json        320 MB/s  1000 MB/s   680 MB/s  1050 MB/s

perfect-float:

                                DOM                  STRUCT
======= serde_json ======= parse|stringify ===== parse|stringify ====
data/canada.json         150 MB/s   490 MB/s   220 MB/s   370 MB/s
data/citm_catalog.json   420 MB/s   490 MB/s   860 MB/s   650 MB/s
data/twitter.json        310 MB/s   720 MB/s   650 MB/s   780 MB/s

@dtolnay dtolnay left a comment

Thanks! I haven't read the implementation yet, but some comments about the tests:

tests/test.rs
("18446744073709552000.0", 18446744073709552000.0),
]);

#[cfg(not(feature = "perfect_float"))]

@dtolnay (Member)

How come these don't pass with perfect_float?


@dtolnay The issue is that this value isn't actually the correctly rounded result. It is a close-to-halfway representation, and the correct bit pattern is actually:

0100001111110000000000000000000000000000000000000000000000000000

The annotated bit-pattern is:

Sign Exponent Significant Digits
0 10000111111 0000000000000000000000000000000000000000000000000000

As you can see, the significand's least significant bit is 0, so for anything at or below the halfway point (a bit set just past the significand), the default IEEE 754 rounding mode of round-nearest, ties-to-even rounds down. The correct value is 18446744073709551616.0.
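This bit pattern can be verified with a short standard-library snippet (a sketch; the helper name f64_bits is mine, not from the PR):

```python
import struct

def f64_bits(x: float) -> str:
    """Return the raw IEEE 754 bit pattern of a double as a 64-character binary string."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return format(bits, "064b")

# "18446744073709552000.0" sits below the halfway point between 2**64 and the
# next representable double, so ties-to-even rounding maps it down to exactly
# 2**64 = 18446744073709551616.0, producing the bit pattern quoted above.
x = float("18446744073709552000.0")
assert x == 18446744073709551616.0
assert f64_bits(x) == "0" + "10000111111" + "0" * 52
```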


To experiment with computing halfway representations, the following code can be used:

import numpy as np

class FloatMixin:
    '''Mixin for floating-point methods.'''

    def __init__(self, value):
        self.value = self.float_type(value)

    def to_bits(self):
        '''Extract bitwise representation of float.'''
        return np.frombuffer(self.value.tobytes(), dtype=self.int_type)[0]

    @classmethod
    def from_bits(cls, value):
        '''Create a float from its bitwise representation.'''
        return cls(np.frombuffer(value.tobytes(), dtype=cls.float_type)[0])

    def to_hex(self):
        '''Convert double to hex.'''
        return '{0:0{1}x}'.format(self.to_bits(), self.value.itemsize * 2)

    def is_denormal(self):
        '''Returns true if the float is a denormal.'''
        return self.to_bits() & self.EXPONENT_MASK == 0

    def is_special(self):
        '''Returns true if the float is NaN or Infinite.'''
        return self.to_bits() & self.EXPONENT_MASK == self.EXPONENT_MASK

    def is_nan(self):
        '''Returns true if the float is NaN.'''
        return self.is_special() and self.to_bits() & self.MANTISSA_MASK != 0

    def is_inf(self):
        '''Returns true if the float is Infinite.'''
        return self.is_special() and self.to_bits() & self.MANTISSA_MASK == 0

    def exponent(self):
        '''Get exponent component from the float.'''

        if self.is_denormal():
            return self.DENORMAL_EXPONENT
        bits = self.to_bits()
        exp_bits = bits & self.EXPONENT_MASK
        biased_e = np.int32(exp_bits >> self.int_type(self.MANTISSA_SIZE))
        return biased_e - self.EXPONENT_BIAS

    def mantissa(self):
        '''Get mantissa component from the float.'''

        bits = self.to_bits()
        s = bits & self.MANTISSA_MASK
        if not self.is_denormal():
            return s + self.HIDDEN_BIT_MASK
        return s


class Float32(FloatMixin):
    '''Wrapper around a 32-bit floating point value.'''

    SIGN_MASK           = np.uint32(0x80000000)
    EXPONENT_MASK       = np.uint32(0x7F800000)
    HIDDEN_BIT_MASK     = np.uint32(0x00800000)
    MANTISSA_MASK       = np.uint32(0x007FFFFF)
    MANTISSA_SIZE       = np.int32(23)
    EXPONENT_BIAS       = np.int32(127 + MANTISSA_SIZE)
    DENORMAL_EXPONENT   = np.int32(1 - EXPONENT_BIAS)
    float_type = np.float32
    int_type = np.uint32


class Float64(FloatMixin):
    '''Wrapper around a 64-bit floating point value.'''

    SIGN_MASK           = np.uint64(0x8000000000000000)
    EXPONENT_MASK       = np.uint64(0x7FF0000000000000)
    HIDDEN_BIT_MASK     = np.uint64(0x0010000000000000)
    MANTISSA_MASK       = np.uint64(0x000FFFFFFFFFFFFF)
    MANTISSA_SIZE       = np.int32(52)
    EXPONENT_BIAS       = np.int32(1023 + MANTISSA_SIZE)
    DENORMAL_EXPONENT   = np.int32(1 - EXPONENT_BIAS)
    float_type = np.float64
    int_type = np.uint64

Now, to calculate b+1, the next float after a given float b, and b+h, the halfway point between b and b+1, do:

b = Float64(18446744073709551616.0)
b1 = Float64.from_bits(b.to_bits() + Float64.int_type(1))
bh = (int(b.value) + int(b1.value)) // 2

In this case, for b of 18446744073709551616.0, b+1 is 18446744073709555712.0, and b+h is 18446744073709553664.0. b+h cannot be represented as an exact floating-point number, but we can confirm it is accurate with a few tests:

>>> float("18446744073709553664.0")  # b+h
18446744073709551616.0
>>> float("18446744073709553665.0")  # above b+h
18446744073709555712.0
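The same values can be cross-checked with only the standard library (assuming Python 3.9+ for math.nextafter; the variable names mirror the snippet above):

```python
import math

b = 18446744073709551616.0            # 2**64
b1 = math.nextafter(b, math.inf)      # next representable double above b
bh = (int(b) + int(b1)) // 2          # exact halfway point, as an integer

assert b1 == 18446744073709555712.0
assert bh == 18446744073709553664
# Strings at or below the halfway point round down (ties-to-even, and b's
# significand is even); just above it, they round up to b+1.
assert float(str(bh)) == b
assert float(str(bh + 1)) == b1
```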


@dtolnay An accurate float parser does not mean every decimal representation can be exactly represented as a floating-point number; it does mean that the closest representation will always be used, and that these conversions are stable.

I hope my explanation makes sense?

@dtolnay (Member)

I understand that in general not every decimal representation can be exactly represented as a floating-point number, that is obvious to me. The part I don't think you answered is why these particular values in the test no longer round trip.

My expectation for f64::to_string is that it produces a decimal representation (hopefully with as few significant digits as possible, but this is not necessary) which is mathematically closer to the original input float than to any other float.

My expectation for f64::from_str is that it produces the float which is mathematically closest to the input digits.

This test was supposed to test that pair of expectations. Of the two expectations, is one or both untrue of the standard library? Is one or both untrue of the implementation in this PR?

For the standard library my understanding is that the following line succeeds for any input float n, and that's certainly true for the two specific floats in this test.

assert_eq!(n.to_string().parse::<f64>().unwrap(), n);
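The analogous round-trip property can be demonstrated in Python, whose repr() has produced shortest correctly rounded decimals since 3.1 (a sketch for illustration, not part of the PR):

```python
# Round-trip check analogous to the Rust assertion above: formatting a double
# and parsing it back must reproduce the original value exactly. The first
# three values are the ones from the test in this thread.
for n in [18446744073709552000.0, 31.245270191439438, 121.48791951161945,
          0.1 + 0.2, 2.0 ** -1074]:
    assert float(repr(n)) == n
```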

@Alexhuszagh Alexhuszagh commented Feb 7, 2020

@dtolnay Oh I see what you mean, my mistake. However, the actual results seem to work after adding the test back in anyway:

    #[cfg(not(feature = "arbitrary_precision"))]
    #[cfg(feature = "perfect_float")]
    test_parse_ok(vec![
        ("18446744073709552000.0", 18446744073709552000.0),
        ("31.245270191439438", 31.245270191439438),
        ("121.48791951161945", 121.48791951161945),
    ]);

I'm not sure why it originally failed; I have extensive tests that ensure this works in my codebase (and I test on ~20 different architectures for each release).


dtolnay commented Jun 22, 2019

@ijl is it going to be possible to fix the cases where perfect_float successfully parses something that is not valid JSON?


ijl commented Jun 22, 2019

@dtolnay I would need to look when I have more time. I can close this as stale until then if you prefer.


Alexhuszagh commented Feb 6, 2020

Let me know if I can be of any help integrating these changes into serde-json. There are a few issues: specifically, the format feature will need to be enabled, and let res = lexical_core::parse_format::<f64>(&self.scratch[..], lexical_core::NumberFormat::JSON); should be used instead.

@Alexhuszagh Alexhuszagh left a comment

Looks good to me. We might need to update the unit tests, but the logic looks sound internally. Let me clone the repository, test locally, and see if I can help.

Cargo.toml
# implementation is slower than the default and contains a lookup table that
# increases artifact size. It does not affect serialization. This
# feature has no effect if arbitrary_precision is enabled.
perfect_float = ["lexical-core"]


I would recommend using perfect_float by default. @dtolnay will ultimately decide, however. Correctness should always be favored by default, and performance at the cost of correctness should be opt-in, not opt-out.

This parses floats and integers using lexical-core from
https://github.com/AlexHuszagh/rust-lexical.

Finding the slice that represents a number could be done
in readers, in the same way as parse_str_bytes() for String.
This would allow SliceRead to not copy.

ijl commented Feb 6, 2020

I've pushed what I had been using, fixed up for stable and for an incompatibility with Rust 1.31. This could be reviewed again.

As I added in the amended commit message, avoiding copying to scratch is an obvious thing to do. It's a larger change, though.


Alexhuszagh commented Feb 6, 2020

@ijl I might recommend using lexical-core v0.6 rather than v0.7, since v0.7 makes a breaking change and removes support for Rustc < 1.37. v0.6 supports Rustc >= 1.24, which should be fully compatible with serde.


ijl commented Feb 7, 2020

In #536 it looks like @Alexhuszagh will create a minimal float parser from lexical for serde_json, so I'm happy to close this.

@ijl ijl closed this Feb 7, 2020
@dtolnay dtolnay removed the wip label Mar 15, 2020
Successfully merging this pull request may close these issues.

Perfect accuracy float parsing
3 participants