Improved Performance for Disguised Fast-Path Cases in Float Parsing #85198

Closed
Alexhuszagh opened this issue May 11, 2021 · 2 comments · Fixed by #86761
Labels
A-floating-point Area: Floating point numbers and arithmetic C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@Alexhuszagh
Contributor

Summary

Rust's float-parsing algorithm dec2flt uses a slower algorithm than necessary to parse numbers like "1.2345e30", which can slow down parsing by nearly 300%. Trivial changes to dec2flt dramatically improve parsing times for these inputs, without increasing binary sizes or slowing down other parse cases. Please see the "Sample Repository" section below for the exact specifics, or to replicate these changes. This is an initial step in an ongoing effort to speed up float parsing in Rust, and aims to integrate algorithms I've implemented (currently used in nom and serde-json) back into the core library.

Issue

When parsing floating-point numbers, there is a fast-path algorithm that uses native floats to parse the float if applicable. This only occurs if:

  • The significant digits of the float, or mantissa, can be represented in mantissa_size+1 bits.
  • The power of 10 given by the exponent can be exactly represented, that is, the absolute value of the exponent is at most ⌊(mantissa_size+1) / log2(5)⌋.

Please note that this is the exponent relative to the significant digits, for example, for "1.2345e5", this exponent would be 1, but for "12345e5" this exponent would be 5.

The reason we use mantissa_size+1 is the implicit, hidden bit of the float. A longer post detailing the attempts to improve float parsing on rust-internals can be found here. The exact values for f32 and f64 are as follows:

f32:

  • significant digit bits: 24
  • exponent range: [-10, 10]

f64:

  • significant digit bits: 53
  • exponent range: [-22, 22]
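
As a quick sanity check of the exponent limits listed above, the following sketch evaluates ⌊(mantissa_size+1) / log2(5)⌋ for both types. The helper name `max_fast_path_exponent` is made up for illustration and is not part of dec2flt. Floating-point `log2` is accurate enough here because neither quotient is close to an integer.

```rust
/// Largest decimal exponent whose power of five still fits in `sig_bits` bits,
/// i.e. floor(sig_bits / log2(5)). Hypothetical helper, not from dec2flt.
fn max_fast_path_exponent(sig_bits: u32) -> i32 {
    (f64::from(sig_bits) / 5f64.log2()).floor() as i32
}

fn main() {
    // 24 significant bits for f32, 53 for f64 (including the hidden bit).
    println!("f32: {}", max_fast_path_exponent(24));
    println!("f64: {}", max_fast_path_exponent(53));
}
```

Only the power of five matters because 10^e = 2^e × 5^e, and the factor 2^e is absorbed by the binary exponent of the float.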

However, there is an exception: if the value has fewer significant bits than the maximum, but has an exponent larger than our range, we can shift powers-of-10 from the exponent to the significant digits. For example, "1.2345e30" would have significant digits of 12345 and an exponent of 26, which is outside our range of [-22, 22]. However, if we shift 10^4 from the exponent to the significant digits, we get significant digits of 123450000 and an exponent of 22, which is a valid fast-path case. This leads to a massive performance improvement for a large number of real-world float cases, and has an insignificant impact on other cases.
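
The shift described above can be sketched for f64 as follows. This is a minimal illustration rather than the dec2flt code: the function name `fast_path_f64`, the constant names, and the choice of 2^53 as the mantissa bound are assumptions made for the example.

```rust
/// Exactly representable powers of ten for f64 (10^0 through 10^22).
const POW10: [f64; 23] = [
    1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22,
];

/// Hypothetical sketch of the f64 fast path including the disguised case.
/// `mantissa` holds the significant digits as an integer and `exp10` is the
/// decimal exponent relative to those digits.
fn fast_path_f64(mantissa: u64, exp10: i32) -> Option<f64> {
    const MAX_SIG: u64 = 1 << 53; // mantissa bound used here (assumption)
    const MAX_EXP: i32 = 22; // floor(53 / log2(5))
    const MAX_SHIFT: i32 = 15; // floor(log10(2^53))

    if mantissa > MAX_SIG || exp10 < -MAX_EXP || exp10 > MAX_EXP + MAX_SHIFT {
        return None;
    }
    let (mut m, mut e) = (mantissa, exp10);
    if e > MAX_EXP {
        // Disguised case: move the excess powers of ten into the mantissa.
        let shift = (e - MAX_EXP) as u32;
        m = m.checked_mul(10u64.pow(shift))?;
        if m > MAX_SIG {
            return None;
        }
        e = MAX_EXP;
    }
    // Both operands are exact, so one multiplication or division rounds once.
    let value = m as f64;
    let pow = POW10[e.unsigned_abs() as usize];
    Some(if e < 0 { value / pow } else { value * pow })
}

fn main() {
    // "1.2345e30" has significant digits 12345 and a relative exponent of 26.
    println!("{:?}", fast_path_f64(12345, 26));
}
```

Inputs that fail any of the checks return `None`, and a real parser would then fall through to its slower algorithms.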

Binary Sizes

These were compiled on a target of x86_64-unknown-linux-gnu, running kernel version 5.11.16-100, on a Rust version of rustc 1.53.0-nightly (132b4e5d1 2021-04-13). The sizes reflect the binary sizes reported by ls -sh, both before and after running the strip command. The debug profile was used for opt-levels 0 and 1, and was as follows:

[profile.dev]
opt-level = "..."
debug = true
lto = false

The release profile was used for opt-levels 2, 3, s and z and was as follows:

[profile.release]
opt-level = "..."
debug = false
debug-assertions = false
lto = true

core

These are the binary sizes prior to making changes.

opt-level|size|size(stripped)
|:-:|:-:|:-:|
0|3.6M|360K
1|3.5M|316K
2|1.3M|236K
3|1.3M|248K
s|1.3M|244K
z|1.3M|248K

disguised

These are the binary sizes after making changes to speed up disguised fast-path cases.

opt-level|size|size(stripped)
|:-:|:-:|:-:|
0|3.6M|360K
1|3.5M|316K
2|1.3M|236K
3|1.3M|248K
s|1.3M|252K
z|1.3M|248K

Performance

Overall, the changes reduced parse time for disguised fast-path cases by roughly 75% relative to core (129.86ns to 33.813ns), without materially impacting the other benchmarks.

These benchmarks were run on an i7-6560U CPU @ 2.20GHz, on a target of x86_64-unknown-linux-gnu, running kernel version 5.11.16-100, on a Rust version of rustc 1.53.0-nightly (132b4e5d1 2021-04-13). The performance CPU governor was used for all benchmarks, which were run consecutively on A/C power with only tmux and Sublime Text open. The floats that were parsed are as follows:

// Example fast-path value.
const FAST: &str = "1.2345e22";
// Example disguised fast-path value.
const DISGUISED: &str = "1.2345e30";
// Example moderate path value: clearly not halfway `1 << 53`.
const MODERATE: &str = "9007199254740992.0";
// Example exactly-halfway value `(1<<53) + 1`.
const HALFWAY: &str = "9007199254740993.0";
// Example large, near-halfway value.
const LARGE: &str = "8.988465674311580536566680e307";
// Example denormal, near-halfway value.
const DENORMAL: &str = "8.442911973260991817129021e-309";
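
To make the distinction between the moderate and halfway inputs concrete, the sketch below parses values around 2^53 with `str::parse`. Between 2^53 and 2^54 an f64 can only represent even integers, so odd inputs are exact ties and must round to the neighbor with an even significand. The helper `parse` is just a local convenience for this example.

```rust
/// Local convenience wrapper around `str::parse` for this example.
fn parse(s: &str) -> f64 {
    s.parse().expect("valid float literal")
}

fn main() {
    // 2^53 is exactly representable, so no rounding decision is needed.
    println!("{}", parse("9007199254740992.0"));
    // 2^53 + 1 lies exactly halfway between 2^53 and 2^53 + 2.
    println!("{}", parse("9007199254740993.0"));
    // 2^53 + 3 lies exactly halfway between 2^53 + 2 and 2^53 + 4.
    println!("{}", parse("9007199254740995.0"));
}
```

Resolving such ties correctly requires examining every input digit, which is why halfway cases are the slowest to parse.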

core

These are the benchmarks prior to making changes.

float|speed
|:-:|:-:|
fast|32.952ns
disguised|129.86ns
moderate|237.08ns
halfway|371.21ns
large|287.81us
denormal|122.36us

disguised

These are the benchmarks after making changes to speed up disguised fast-path cases.

float|speed
|:-:|:-:|
fast|32.572ns
disguised|33.813ns
moderate|233.03ns
halfway|350.99ns
large|300.29us
denormal|129.36us

Correctness Concerns

None, since this merely transfers powers-of-10 from the exponent to the significant digits using integer multiplication, and can therefore trivially be verified for correctness.
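
One way to spot-check this claim is to compare the shifted computation against `str::parse` for a handful of inputs. The sketch below is an assumption-laden illustration, not a test from the repository: the names `disguised` and `count_mismatches`, the mantissa bound of 2^53, and the chosen sample mantissas are all made up for the example.

```rust
/// Hypothetical check of the disguised case for f64: shift the excess powers
/// of ten into the mantissa with integer multiplication, then multiply by 1e22.
fn disguised(mantissa: u64, exp10: u32) -> Option<f64> {
    const MAX_SIG: u64 = 1 << 53; // mantissa bound used here (assumption)
    // Only exponents of at least 22 are handled by this helper.
    let shift = exp10.checked_sub(22)?;
    let m = mantissa.checked_mul(10u64.checked_pow(shift)?)?;
    if m > MAX_SIG {
        return None;
    }
    Some(m as f64 * 1e22)
}

/// Count inputs where the shifted result differs from `str::parse`.
fn count_mismatches() -> usize {
    let mut mismatches = 0;
    for mantissa in [1u64, 7, 12345, 999_999, 123_456_789] {
        for exp10 in 23..=37u32 {
            if let Some(fast) = disguised(mantissa, exp10) {
                let text = format!("{}e{}", mantissa, exp10);
                let reference: f64 = text.parse().expect("valid float literal");
                if fast != reference {
                    mismatches += 1;
                }
            }
        }
    }
    mismatches
}

fn main() {
    println!("mismatches: {}", count_mismatches());
}
```

A handful of samples is of course no proof. The actual argument is that the shifted mantissa and 10^22 are both exactly representable, so the final multiplication rounds only once.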

Changes

The diff, which would be relative to library/core/src/num, is as follows:

diff --git a/src/dec2flt/algorithm.rs b/src/dec2flt/algorithm.rs
index 2b0b4cb..76d8105 100644
--- a/src/dec2flt/algorithm.rs
+++ b/src/dec2flt/algorithm.rs
@@ -110,7 +110,7 @@ mod fpu_precision {
 ///
 /// This is extracted into a separate function so that it can be attempted before constructing
 /// a bignum.
-pub fn fast_path<T: RawFloat>(integral: &[u8], fractional: &[u8], e: i64) -> Option<T> {
+pub fn fast_path<T: RawFloat>(integral: &[u8], fractional: &[u8], mut e: i64) -> Option<T> {
     let num_digits = integral.len() + fractional.len();
     // log_10(f64::MAX_SIG) ~ 15.95. We compare the exact value to MAX_SIG near the end,
     // this is just a quick, cheap rejection (and also frees the rest of the code from
@@ -118,14 +118,29 @@ pub fn fast_path<T: RawFloat>(integral: &[u8], fractional: &[u8], e: i64) -> Opt
     if num_digits > 16 {
         return None;
     }
-    if e.abs() >= T::CEIL_LOG5_OF_MAX_SIG as i64 {
+    let max_exp = T::FLOOR_LOG5_OF_MAX_SIG as i64;
+    let min_exp = -max_exp;
+    let shift_exp = T::FLOOR_LOG10_OF_MAX_SIG as i64;
+    let disguised_exp = max_exp + shift_exp;
+    if e < min_exp || e > disguised_exp {
         return None;
     }
-    let f = num::from_str_unchecked(integral.iter().chain(fractional.iter()));
+    let mut f = num::from_str_unchecked(integral.iter().chain(fractional.iter()));
     if f > T::MAX_SIG {
         return None;
     }
 
+    // Handle a disguised fast path case here.
+    if e > max_exp {
+        let shift = e - max_exp;
+        let value = f.checked_mul(T::short_int_pow10(shift as usize))?;
+        if value > T::MAX_SIG {
+            return None;
+        }
+        f = value;
+        e = max_exp;
+    }
+
     // The fast path crucially depends on arithmetic being rounded to the correct number of bits
     // without any intermediate rounding. On x86 (without SSE or SSE2) this requires the precision
     // of the x87 FPU stack to be changed so that it directly rounds to 64/32 bit.
diff --git a/src/dec2flt/rawfp.rs b/src/dec2flt/rawfp.rs
index a3acf3d..15a5839 100644
--- a/src/dec2flt/rawfp.rs
+++ b/src/dec2flt/rawfp.rs
@@ -73,13 +73,21 @@ pub trait RawFloat:
     /// represented, the other code in this module makes sure to never let that happen.
     fn from_int(x: u64) -> Self;
 
+    fn short_int_pow10(e: usize) -> u64 {
+        table::SHORT_POWERS[e]
+    }
+
     /// Gets the value 10<sup>e</sup> from a pre-computed table.
-    /// Panics for `e >= CEIL_LOG5_OF_MAX_SIG`.
+    /// Panics for `e >= FLOOR_LOG5_OF_MAX_SIG`.
     fn short_fast_pow10(e: usize) -> Self;
 
     /// What the name says. It's easier to hard code than juggling intrinsics and
     /// hoping LLVM constant folds it.
-    const CEIL_LOG5_OF_MAX_SIG: i16;
+    const FLOOR_LOG5_OF_MAX_SIG: i16;
+
+    /// What the name says. It's easier to hard code than juggling intrinsics and
+    /// hoping LLVM constant folds it.
+    const FLOOR_LOG10_OF_MAX_SIG: i16;
 
     // A conservative bound on the decimal digits of inputs that can't produce overflow or zero or
     /// subnormals. Probably the decimal exponent of the maximum normal value, hence the name.
@@ -147,7 +155,8 @@ impl RawFloat for f32 {
 
     const SIG_BITS: u8 = 24;
     const EXP_BITS: u8 = 8;
-    const CEIL_LOG5_OF_MAX_SIG: i16 = 11;
+    const FLOOR_LOG5_OF_MAX_SIG: i16 = 10;
+    const FLOOR_LOG10_OF_MAX_SIG: i16 = 7;
     const MAX_NORMAL_DIGITS: usize = 35;
     const INF_CUTOFF: i64 = 40;
     const ZERO_CUTOFF: i64 = -48;
@@ -196,7 +205,8 @@ impl RawFloat for f64 {
 
     const SIG_BITS: u8 = 53;
     const EXP_BITS: u8 = 11;
-    const CEIL_LOG5_OF_MAX_SIG: i16 = 23;
+    const FLOOR_LOG5_OF_MAX_SIG: i16 = 22;
+    const FLOOR_LOG10_OF_MAX_SIG: i16 = 15;
     const MAX_NORMAL_DIGITS: usize = 305;
     const INF_CUTOFF: i64 = 310;
     const ZERO_CUTOFF: i64 = -326;
diff --git a/src/dec2flt/table.rs b/src/dec2flt/table.rs
index 97b497e..bd9e53d 100644
--- a/src/dec2flt/table.rs
+++ b/src/dec2flt/table.rs
@@ -1234,6 +1234,30 @@ pub static POWERS: ([u64; 611], [i16; 611]) = (
     ],
 );
 
+#[rustfmt::skip]
+pub const SHORT_POWERS: [u64; 20] = [
+    1,
+    10,
+    100,
+    1000,
+    10000,
+    100000,
+    1000000,
+    10000000,
+    100000000,
+    1000000000,
+    10000000000,
+    100000000000,
+    1000000000000,
+    10000000000000,
+    100000000000000,
+    1000000000000000,
+    10000000000000000,
+    100000000000000000,
+    1000000000000000000,
+    10000000000000000000,
+];
+
 #[rustfmt::skip]
 pub const F32_SHORT_POWERS: [f32; 11] = [
     1e0,

I'd be happy to submit a pull request with these changes, if they are satisfactory to you.

Sample Repository

I've created a simple, minimal repository tracking these changes on rust-dec2flt, which has a core branch that is identical to Rust's current implementation in the core library. The disguised branch contains the changes to improve parsing speeds for disguised fast-path cases. I will also, if there is interest, gradually be making changes for the moderate and slow-path algorithms.

@aldanor

aldanor commented May 12, 2021

@Alexhuszagh Just out of curiosity, I've run comparison benches of the 'disguised' branch here with fast-float, if it's of any interest (fast path performance, that is):

float|dec2flt/disguised|fast-float
|:-:|:-:|:-:|
fast|22.8ns|14.1ns
disguised|24.1ns|21.5ns
moderate|181.0ns|28.3ns
halfway|284.2ns|27.7ns
large|202.8us|11.7us
denormal|87276.0ns|49.8ns

(I know there's other concerns here like binary size etc; this is just from the perspective of pure speed)

@Alexhuszagh
Contributor Author

@aldanor Yes, this doesn't cover other cases, this is merely a fix for the disguised fast-path. I'm committing other fixes addressing these other scenarios as well later. This is merely an attempt to correct that single issue, and not other cases. I'm attempting to patch, without major changes to binary size, the changes in minimal-lexical, which is a minimal fork of lexical-core.

The goal is incremental changes, so we can get float parsing fixed, although the benchmarks from fast-float look very convincing too.

@bjorn3 bjorn3 added A-floating-point Area: Floating point numbers and arithmetic C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels May 12, 2021
bors added a commit to rust-lang-ci/rust that referenced this issue Jul 17, 2021
Update Rust Float-Parsing Algorithms to use the Eisel-Lemire algorithm.

# Summary

Rust, although it implements a correct float parser, has major performance issues in float parsing. Even for common floats, the performance can be 3-10x [slower](https://arxiv.org/pdf/2101.11408.pdf) than external libraries such as [lexical](https://github.com/Alexhuszagh/rust-lexical) and [fast-float-rust](https://github.com/aldanor/fast-float-rust).

Recently, Daniel Lemire and others have developed major advances in float-parsing algorithms, yielding a fast and correct float parser with speeds up to 1200 MiB/s on Apple's M1 architecture for the [canada](https://github.com/lemire/simple_fastfloat_benchmark/blob/0e2b5d163d4074cc0bde2acdaae78546d6e5c5f1/data/canada.txt) dataset, 10x faster than Rust's 130 MiB/s.

In addition, [edge-cases](rust-lang#85234) in Rust's [dec2flt](https://github.com/rust-lang/rust/tree/868c702d0c9a471a28fb55f0148eb1e3e8b1dcc5/library/core/src/num/dec2flt) algorithm can lead to over a 1600x slowdown relative to efficient algorithms. This is due to the use of Clinger's correct, but slow [AlgorithmM and Bellerophon](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45.4152&rep=rep1&type=pdf), which have been improved by faster big-integer algorithms and the Eisel-Lemire algorithm, respectively.

Finally, this algorithm provides substantial improvements in the number of floats the Rust core library can parse. Denormal floats with a large number of digits cannot be parsed, due to use of the `Big32x40`, which simply does not have enough digits to round a float correctly. Using a custom decimal class, with much simpler logic, we can parse all valid decimal strings of any digit count.

```rust
// Issue in Rust's dec2flt.
"2.47032822920623272088284396434110686182e-324".parse::<f64>();   // Err(ParseFloatError { kind: Invalid })
```

# Solution

This pull request implements the Eisel-Lemire algorithm, modified from [fast-float-rust](https://github.com/aldanor/fast-float-rust) (which is licensed under Apache 2.0/MIT), along with numerous modifications to make it more amenable to inclusion in the Rust core library. The following describes both features inherited from fast-float-rust and improvements made to it for inclusion in core.

**Documentation**

Extensive documentation has been added to ensure the code base may be maintained by others, which explains the algorithms as well as various associated constants and routines. For example, two seemingly magical constants include documentation to describe how they were derived as follows:

```rust
    // Round-to-even only happens for negative values of q
    // when q ≥ −4 in the 64-bit case and when q ≥ −17 in
    // the 32-bit case.
    //
    // When q ≥ 0, we have that 5^q ≤ 2m+1. In the 64-bit case, we
    // have 5^q ≤ 2m+1 ≤ 2^54 or q ≤ 23. In the 32-bit case, we have
    // 5^q ≤ 2m+1 ≤ 2^25 or q ≤ 10.
    //
    // When q < 0, we have w ≥ (2m+1)×5^−q. We must have that w < 2^64
    // so (2m+1)×5^−q < 2^64. We have that 2m+1 > 2^53 (64-bit case)
    // or 2m+1 > 2^24 (32-bit case). Hence, we must have 2^53×5^−q < 2^64
    // (64-bit) and 2^24×5^−q < 2^64 (32-bit). Hence we have 5^−q < 2^11
    // or q ≥ −4 (64-bit case) and 5^−q < 2^40 or q ≥ −17 (32-bit case).
    //
    // Thus we have that we only need to round ties to even when
    // we have that q ∈ [−4,23] (in the 64-bit case) or q ∈ [−17,10]
    // (in the 32-bit case). In both cases, the power of five (5^|q|)
    // fits in a 64-bit word.
    const MIN_EXPONENT_ROUND_TO_EVEN: i32;
    const MAX_EXPONENT_ROUND_TO_EVEN: i32;
```

This ensures maintainability of the code base.

**Improvements for Disguised Fast-Path Cases**

The fast path in float parsing algorithms attempts to use native, machine floats to represent both the significant digits and the exponent, which is only possible if both can be exactly represented without rounding. In practice, this means that the significant digits must fit in 53 bits or fewer and the exponent must be in the range `[-22, 22]` (for an f64). This is similar to the existing dec2flt implementation.

However, disguised fast-path cases exist, where there are few significant digits and an exponent above the valid range, such as `1.23e25`. In this case, powers-of-10 may be shifted from the exponent to the significant digits, as discussed at length in rust-lang#85198.

**Digit Parsing Improvements**

Typically, integers are parsed from a string one digit at a time, requiring unnecessary multiplications which can slow down parsing. An approach to parse 8 digits at a time using only 3 multiplications is described at length [here](https://johnnylee-sde.github.io/Fast-numeric-string-to-int/). This leads to significant performance improvements, and is implemented for both big and little-endian systems.
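
A minimal sketch of the 8-digits-at-a-time technique is shown below. It is an illustration written for this description rather than the code in the pull request: the function name, the `[u8; 8]` signature, and the up-front digit validation are assumptions made to keep the example self-contained and safe.

```rust
/// Parse exactly eight ASCII digits using three multiplications (SWAR).
/// Returns `None` if any byte is not an ASCII digit.
fn parse_8digits(bytes: [u8; 8]) -> Option<u64> {
    if !bytes.iter().all(u8::is_ascii_digit) {
        return None;
    }
    const MASK: u64 = 0x0000_00FF_0000_00FF;
    const MUL1: u64 = 0x000F_4240_0000_0064; // 100 + (1_000_000 << 32)
    const MUL2: u64 = 0x0000_2710_0000_0001; // 1 + (10_000 << 32)

    // A little-endian load puts the first digit in the lowest byte,
    // regardless of the endianness of the host.
    let mut v = u64::from_le_bytes(bytes);
    // Convert each byte from ASCII to its digit value (0..=9).
    v -= 0x3030_3030_3030_3030;
    // Multiplication 1: every even byte now holds a two-digit pair (0..=99).
    v = v * 10 + (v >> 8);
    // Multiplications 2 and 3: weight the four pairs and sum them in the
    // upper 32 bits of the result.
    let v1 = (v & MASK).wrapping_mul(MUL1);
    let v2 = ((v >> 16) & MASK).wrapping_mul(MUL2);
    Some(u64::from((v1.wrapping_add(v2) >> 32) as u32))
}

fn main() {
    println!("{:?}", parse_8digits(*b"12345678"));
}
```

Longer digit runs can be handled by calling this repeatedly and combining chunks as `acc * 100_000_000 + chunk`, falling back to one digit at a time for the remainder.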

**Unsafe Changes**

Relative to fast-float-rust, this library makes less use of unsafe functionality and clearly documents it. This includes the refactoring and documentation of numerous unsafe methods undesirably marked as safe. The original code looked something like this, with unsafe functionality deceptively marked as safe.

```rust
impl AsciiStr {
    #[inline]
    pub fn step_by(&mut self, n: usize) -> &mut Self {
        unsafe { self.ptr = self.ptr.add(n) };
        self
    }
}

...

#[inline]
fn parse_scientific(s: &mut AsciiStr<'_>) -> i64 {
    // the first character is 'e'/'E' and scientific mode is enabled
    let start = *s;
    s.step();
    ...
}
```

The new code clearly documents safety concerns, and does not mark unsafe functionality as safe, leading to better safety guarantees.

```rust
impl AsciiStr {
    /// Advance the view by n, advancing it in-place to (n..).
    pub unsafe fn step_by(&mut self, n: usize) -> &mut Self {
        // SAFETY: same as step_by, safe as long n is less than the buffer length
        self.ptr = unsafe { self.ptr.add(n) };
        self
    }
}

...

/// Parse the scientific notation component of a float.
fn parse_scientific(s: &mut AsciiStr<'_>) -> i64 {
    let start = *s;
    // SAFETY: the first character is 'e'/'E' and scientific mode is enabled
    unsafe {
        s.step();
    }
    ...
}
```

This allows us to trivially demonstrate the new implementation of dec2flt is safe.

**Inline Annotations Have Been Removed**

In the previous implementation of dec2flt, inline annotations existed practically nowhere in the entire module. Therefore, these annotations have been removed, which mostly does not impact [performance](aldanor/fast-float-rust#15 (comment)).

**Fixed Correctness Tests**

Numerous compile errors in `src/etc/test-float-parse` were present, due to deprecation of `time.clock()`, as well as the crate dependencies with `rand`. The tests have therefore been reworked as a [crate](https://github.com/Alexhuszagh/rust/tree/master/src/etc/test-float-parse), and any errors in `runtests.py` have been patched.

**Undefined Behavior**

An implementation of `check_len` which relied on undefined behavior (in fast-float-rust) has been refactored, to ensure that the behavior is well-defined. The original code is as follows:

```rust
    #[inline]
    pub fn check_len(&self, n: usize) -> bool {
        unsafe { self.ptr.add(n) <= self.end }
    }
```

And the new implementation is as follows:

```rust
    /// Check if the slice at least `n` length.
    fn check_len(&self, n: usize) -> bool {
        n <= self.as_ref().len()
    }
```

Note that this has since been fixed in [fast-float-rust](aldanor/fast-float-rust#29).

**Inferring Binary Exponents**

Rather than explicitly store binary exponents, this new implementation infers them from the decimal exponent, reducing the amount of static storage required. This removes the requirement to store [611 i16s](https://github.com/rust-lang/rust/blob/868c702d0c9a471a28fb55f0148eb1e3e8b1dcc5/library/core/src/num/dec2flt/table.rs#L8).
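
The text above does not spell out how the binary exponent is inferred. One common approach in Eisel-Lemire style implementations is a fixed-point multiplication by an approximation of log2(10), sketched below. The function name is hypothetical, and the assumption is that this approximation only needs to hold over the decimal exponent range an f64 parser uses (roughly -342 to 308).

```rust
/// Approximate floor(log2(10^q)) with a fixed-point multiply:
/// 217706 / 2^16 ≈ 3.32193 ≈ log2(10). Hypothetical helper for illustration.
fn floor_log2_pow10(q: i32) -> i32 {
    // `>>` on a signed integer is an arithmetic shift, so negative
    // products round toward negative infinity like `floor`.
    (q * 217_706) >> 16
}

fn main() {
    for q in [-342, -1, 0, 1, 22, 308] {
        println!("q = {:4}: binary exponent {}", q, floor_log2_pow10(q));
    }
}
```

Computing the exponent this way costs one multiplication and one shift, which is what allows the table of stored binary exponents to be dropped.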

# Code Size

The code size, for all optimizations, does not considerably change relative to before for stripped builds, however it is **significantly** smaller prior to stripping the resulting binaries. These binary sizes were calculated on x86_64-unknown-linux-gnu.

**new**

Using rustc version 1.55.0-dev.

opt-level|size|size(stripped)
|:-:|:-:|:-:|
0|400k|300K
1|396k|292K
2|392k|292K
3|392k|296K
s|396k|292K
z|396k|292K

**old**

Using rustc version 1.53.0-nightly.

opt-level|size|size(stripped)
|:-:|:-:|:-:|
0|3.2M|304K
1|3.2M|292K
2|3.1M|284K
3|3.1M|284K
s|3.1M|284K
z|3.1M|284K

# Correctness

The dec2flt implementation passes all of Rust's unit tests and comprehensive float parsing tests, along with numerous other tests such as Nigel Tao's comprehensive float [tests](https://github.com/nigeltao/parse-number-fxx-test-data) and Hrvoje Abraham's [strtod_tests](https://github.com/ahrvoje/numerics/blob/master/strtod/strtod_tests.toml). Therefore, it is unlikely that this algorithm will incorrectly round parsed floats.

# Issues Addressed

This will fix and close the following issues:

- resolves rust-lang#85198
- resolves rust-lang#85214
- resolves rust-lang#85234
- fixes rust-lang#31407
- fixes rust-lang#31109
- fixes rust-lang#53015
- resolves rust-lang#68396
- closes aldanor/fast-float-rust#15
@bors bors closed this as completed in 8752b40 Jul 17, 2021

3 participants