
Implement the big-integer arithmetic algorithm. #96

Merged
merged 1 commit into fastfloat:dlemire/bigint on Sep 14, 2021

Conversation

Alexhuszagh
Contributor

@Alexhuszagh Alexhuszagh commented Aug 22, 2021

Fixes #93. This implements the big-integer arithmetic algorithms that replace the decimal algorithm, for major performance improvements.

Other Changes/Anti-Patterns

  • 8-digit parsing optimizations use a while loop and apply to both integer and fraction digits.

On my end, this didn't lead to any performance hits for common floats, but dramatically improved performance for floats with a large number of digits. However, this could be a pessimization, so feel free to request this being removed.

  • Store the integer and fraction digits.

Slices (pointer and length, see below) containing the parsed integer and fraction digits are added to parsed_number_string, so the new definition is:

struct parsed_number_string {
  int64_t exponent{0};
  uint64_t mantissa{0};
  const char *lastmatch{nullptr};
  bool negative{false};
  bool valid{false};
  bool too_many_digits{false};
  // contains the range of the significant digits
  byte_slice integer{};  // non-nullable
  byte_slice fraction{}; // nullable
};

This is effectively free, since the work is already done by the parsing algorithm. It also means that, on overflow or when parsing the significant digits for the slow algorithm, we don't have to re-check that the characters are valid digits, since they were validated during the initial parse. This simplifies internal logic, especially for the big-integer algorithm.

  • A few Rust-isms have been added, such as slices and memory safety guarantees for the big-integer code.

A slice type has been added, as follows, with two typedefs for byte and limb slices:

// a pointer and a length to a contiguous block of memory
template <typename T>
struct slice {
  const T* ptr;
  size_t length;
};
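
For illustration, the two instantiations mentioned above would look roughly like this (the element types are assumptions; limb is the architecture-dependent limb type described in the implementation notes below):

typedef slice<char> byte_slice; // parsed integer/fraction digit characters
typedef slice<limb> limb_slice; // big-integer limbs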

Array bounds checking is done for big-integer arithmetic when extending the stack vector. For example, to shift the limbs left in a big integer, we use the following code:

  // move the limbs left by `n` limbs.
  bool shl_limbs(size_t n) noexcept {
    FASTFLOAT_DEBUG_ASSERT(n != 0);
    if (n + vec.len() > vec.capacity()) {
      return false;
    } else {
      ...
      return true;
    }
  }

If the return value is false, the failure propagates; the calling function, shl, looks like this:

  // move the limbs left by `n` bits.
  bool shl(size_t n) noexcept {
    size_t rem = n % limb_bits;
    size_t div = n / limb_bits;
    if (rem != 0) {
      FASTFLOAT_TRY(shl_bits(rem));
    }
    if (div != 0) {
      FASTFLOAT_TRY(shl_limbs(div));
    }
    return true;
  }

Any function that calls shl then uses a run-time assertion (even in release builds) to ensure there is no buffer overwrite:

#ifndef FASTFLOAT_ASSERT
#define FASTFLOAT_ASSERT(x)  { if (!(x)) abort(); }
#endif

#ifndef FASTFLOAT_DEBUG_ASSERT
#include <cassert>
#define FASTFLOAT_DEBUG_ASSERT(x) assert(x)
#endif

// rust style `try!()` macro, or `?` operator
#define FASTFLOAT_TRY(x) { if (!(x)) return false; }

FASTFLOAT_ASSERT(theor_digits.shl(uint32_t(pow2_exp)));
  • Out-values are used extensively.

This can be considered an anti-pattern in modern C++, but some existing functions such as parse_infnan already use it, and using std::pair seemed to lead to worse assembly generation. So many of the functions use out-values, as in the following example, where truncated is meant to be set by the function and the return value is the high 64 bits of the big integer:

fastfloat_really_inline
uint64_t uint64_hi64(uint64_t r0, bool& truncated) noexcept {
  truncated = false;
  int shl = leading_zeroes(r0);
  return r0 << shl;
}
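
The same pattern extends to multiple limbs: the out-parameter records whether any bits below the returned 64 were nonzero. A hedged sketch of the two-limb case (not necessarily the literal code in the patch):

fastfloat_really_inline
uint64_t uint64_hi64(uint64_t r0, uint64_t r1, bool &truncated) noexcept {
  // r0 is the most-significant limb; assumes r0 != 0 (the big integer is normalized).
  int shl = leading_zeroes(r0);
  if (shl == 0) {
    truncated = r1 != 0;
    return r0;
  } else {
    int shr = 64 - shl;
    truncated = (r1 << shl) != 0;      // any bits shifted out of the low limb?
    return (r0 << shl) | (r1 >> shr);
  }
}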

Implementation Notes

Replaces the existing decimal implementation, for substantial performance improvements with near-halfway cases. This is especially fast with a large number of digits.

Big Integer Implementation

A small subset of big-integer arithmetic has been added, with the bigint struct. It uses a stack-allocated vector with enough bits to store the float with the largest number of significant digits. This is log2(10^(769 + 342)) bits, to account for the largest possible magnitude exponent and number of digits (~3600 bits), and then rounded up to 4k bits.

The limb size is determined by the architecture: most 64-bit architectures have efficient 128-bit multiplication, either by a single hardware instruction or 2 native multiplications for the high and low bits. This includes x86_64, mips64, s390x, aarch64, powerpc64, and riscv64; the only known exceptions are sparcv8 and sparcv9. Therefore, we define a limb size of 64 bits on 64-bit architectures except SPARC, and otherwise fall back to 32-bit limbs.
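
As a sketch, the limb selection boils down to a preprocessor check along these lines (the exact macros tested in the patch may differ; this is illustrative):

#include <cstddef>
#include <cstdint>

#if defined(__SIZEOF_INT128__) && !defined(__sparc__)
// 64-bit limbs: these targets have cheap 64x64 -> 128-bit multiplication.
typedef uint64_t limb;
constexpr size_t limb_bits = 64;
#else
// 32-bit limbs: 64-bit intermediate products are always available.
typedef uint32_t limb;
constexpr size_t limb_bits = 32;
#endif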

A simple stackvector is used, which just has operations to add elements, index, and truncate the vector.

bigint is then just a wrapper around this, with methods for big-integer arithmetic. For our algorithms, we just need multiplication by a power (x * b^N), multiplication by a bigint or scalar value, and addition by a bigint or scalar value. Scalar addition and multiplication use compiler extensions when possible (__builtin_add_overflow and __uint128_t); if unavailable, we implement simple logic shown to optimize well on MSVC. Big-integer multiplication is done via grade-school multiplication, which at these sizes is more efficient than asymptotically faster algorithms. Multiplication by a power is done via bit shifts for powers of two, and by iterative multiplications by a large power and then a scalar value for powers of 5.
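
As an example of the scalar path, here is a hedged sketch of a 64-bit-limb multiply with carry propagation (the patch also has a 32-bit-limb variant and MSVC-friendly code; the function name here is illustrative):

#include <cstdint>

// Multiply one 64-bit limb by a scalar and add the incoming carry.
// Returns the low 64 bits of the result and writes the high 64 bits to `carry`.
inline uint64_t scalar_mul(uint64_t x, uint64_t y, uint64_t &carry) noexcept {
#ifdef __SIZEOF_INT128__
  __uint128_t z = __uint128_t(x) * y + carry;
  carry = uint64_t(z >> 64);
  return uint64_t(z);
#else
  // Portable 64x64 -> 128-bit multiply, split into 32-bit halves.
  uint64_t x_lo = x & 0xFFFFFFFF, x_hi = x >> 32;
  uint64_t y_lo = y & 0xFFFFFFFF, y_hi = y >> 32;
  uint64_t lo_lo = x_lo * y_lo;
  uint64_t t = x_hi * y_lo + (lo_lo >> 32);
  uint64_t u = x_lo * y_hi + (t & 0xFFFFFFFF);
  uint64_t hi = x_hi * y_hi + (t >> 32) + (u >> 32);
  uint64_t lo = (u << 32) | (lo_lo & 0xFFFFFFFF);
  // Add the incoming carry and propagate any overflow into the high word.
  lo += carry;
  if (lo < carry) { hi++; }
  carry = hi;
  return lo;
#endif
}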

compute_float

compute_float has been slightly modified: if the algorithm cannot round correctly, it returns a normalized, extended-precision adjusted mantissa with power2 shifted by INT16_MIN, so the exponent is always negative. compute_error and compute_error_scaled have been added.
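
Concretely, a caller can detect the could-not-round case just from the sign of the returned exponent. A minimal sketch, with illustrative variable names (pns is a parsed_number_string); the exact call signatures may differ from the patch:

adjusted_mantissa am = compute_float<binary_format<double>>(pns.exponent, pns.mantissa);
if (am.power2 < 0) {
  // A negative power2 marks "could not round correctly": fall back to the
  // big-integer digit-comparison algorithm described below.
  am = digit_comp<double>(pns, am);
}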

Digit Optimizations

To improve performance for numbers with many digits, parse_eight_digits_unrolled is used for both the integer and fraction digits, inside a while loop rather than two nested if statements. This adds no noticeable performance cost for common floats, but dramatically improves performance for numbers with many digits (without these optimizations, ~65% of the total runtime cost is in parse_number_string).
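
A hedged sketch of the while-loop form for the integer digits (the fraction loop is analogous; the helper names follow existing fast_float routines, and consume_digits itself is an illustrative name, not the patch's):

// Accumulate decimal digits into `i`, eight at a time while at least eight
// remain, then one at a time; returns the position after the last digit.
fastfloat_really_inline
const char *consume_digits(const char *p, const char *pend, uint64_t &i) noexcept {
  while ((size_t)(pend - p) >= 8 && is_made_of_eight_digits_fast(p)) {
    i = i * 100000000 + parse_eight_digits_unrolled(p); // in rare cases, this "overflows", but that's ok
    p += 8;
  }
  while (p != pend && is_integer(*p)) {
    i = 10 * i + uint64_t(*p - '0');
    ++p;
  }
  return p;
}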

Parsed Number

Two fields have been added to parsed_number_string, which contain slices of the integer and fraction digits. This is extremely cheap, since the work is already done, and the strings are pre-tokenized during parsing. This allows us, on overflow, to re-parse these tokenized strings without checking whether each character is a digit. Likewise, for the big-integer algorithms, we can merely re-parse the pre-tokenized strings.
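
For example, re-accumulating up to 19 significant digits from the stored slices needs no digit checks at all. A sketch, where parse_truncated_mantissa is an illustrative name rather than the patch's:

// Re-accumulate up to 19 significant digits (the most that fit in a uint64_t)
// from the pre-validated integer and fraction slices.
fastfloat_really_inline
uint64_t parse_truncated_mantissa(const parsed_number_string &num) noexcept {
  uint64_t mantissa = 0;
  size_t consumed = 0;
  for (size_t i = 0; i < num.integer.length && consumed < 19; i++, consumed++) {
    mantissa = mantissa * 10 + uint64_t(num.integer.ptr[i] - '0');
  }
  for (size_t i = 0; i < num.fraction.length && consumed < 19; i++, consumed++) {
    mantissa = mantissa * 10 + uint64_t(num.fraction.ptr[i] - '0');
  }
  return mantissa;
}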

Slow Algorithm

The new algorithm is digit_comp, which takes the parsed number string and the adjusted_mantissa from compute_float. The significant digits are parsed into a big integer, and the exponent relative to the significant digits is calculated. If the exponent is >= 0, we use positive_digit_comp; otherwise, we use negative_digit_comp. positive_digit_comp is quite simple: we scale the significant digits by 10^exponent, then take the high 64 bits for the native float, determine whether any lower bits were truncated, and use that to direct rounding.
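
In outline (the method and helper names below follow the description above and are assumptions about the final code, not the literal patch):

// positive_digit_comp, roughly:
bigint big;
parse_mantissa(big, num);                    // load the significant digits into the big integer
FASTFLOAT_ASSERT(big.pow10(uint32_t(exp)));  // scale: value = digits * 10^exp
bool truncated;
uint64_t hi = big.hi64(truncated);           // top 64 bits; truncated = any lower bits nonzero
// hi and truncated then direct how the final mantissa is rounded.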

negative_digit_comp is a little more complex, but still fairly simple: we use the parsed significant digits as the real digits, and calculate the theoretical digits from b+h, the halfway point between b and b+u, the next float above b. To get b, we round the adjusted mantissa down, create an extended-precision representation, and calculate the halfway point. We now have a base-10 exponent for the real digits, and a base-2 exponent for the theoretical digits. We scale these two to the same exponent by multiplying the theoretical digits by 5**-real_exp. We then get the base-2 exponent as theor_exp - real_exp; if this is positive, we multiply the theoretical digits by 2 raised to it, otherwise we multiply the real digits by 2 raised to its absolute value. Now, both are scaled to the same magnitude, and we simply compare the digits in the big integer, and use that to direct rounding.
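
The scale-and-compare step can be outlined as follows (helper names are assumptions; the real code wraps the growth operations in FASTFLOAT_ASSERT as described earlier):

// negative_digit_comp, roughly: bring both values to the same scale, then compare.
theor_digits.pow5(uint32_t(-real_exp));    // theoretical *= 5^(-real_exp); real_exp < 0 here
int32_t pow2_exp = int32_t(theor_exp - real_exp);
if (pow2_exp > 0) {
  theor_digits.pow2(uint32_t(pow2_exp));   // theoretical *= 2^pow2_exp
} else if (pow2_exp < 0) {
  real_digits.pow2(uint32_t(-pow2_exp));   // real *= 2^(-pow2_exp)
}
int ord = real_digits.compare(theor_digits);
// ord > 0: the real value is above the halfway point, round up;
// ord < 0: below the halfway point, round down;
// ord == 0: exactly halfway, round to even.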

@Alexhuszagh
Contributor Author

For the slices, these could trivially be replaced with a pointer range, which is typically the more standard C++ way to do things. For the bounds assertions, these should be safe to remove, since they've been battle tested in comparable code for millions of downloads in numerous production systems. However, I'm generally a decently defensive programmer by nature, so I left them in currently. I'm willing to make any changes, as you see fit.

@Alexhuszagh
Contributor Author

It seems we have an off-by-one error for MSVC (weird, because I thought I explicitly checked this case). I'll run it through a full conformance test, including on MSVC, and update the PR, because it clearly is not ready yet.

@Alexhuszagh
Contributor Author

Alexhuszagh commented Aug 22, 2021

Ok the bugs were relatively trivial: I'll run a full conformance check tomorrow just as an extra precaution. The full list of issues was:

  1. Forgetting to normalize bigint after using the bigint(uint64_t) constructor.
  2. Forgetting to assign the addition of the carry for scalar_mult on 64-bit limbs.
  3. Fix uint32_hi64, which incorrectly shifted the most-significant limb when there were 3 32-bit limbs.
  4. Fix compare, which was comparing in big-endian order, not little-endian order (which the limbs are stored in).

@lemire
Member

lemire commented Aug 22, 2021

That's fantastic. I trust you and I trust your code, but I will read it all carefully before merging. This could be a few days at least. Please note that if there is a delay, it is not due to a lack of interest or a disregard for the importance of your work.

@Alexhuszagh
Contributor Author

Alexhuszagh commented Aug 23, 2021

The exhaustive conformance tests just finished running, and everything passed, so I'm much more content with the state of the PR.

@lemire
Member

lemire commented Aug 24, 2021

Thank you.

@lemire
Member

lemire commented Sep 1, 2021

Still on my radar. This is important.

@Alexhuszagh
Contributor Author

Alexhuszagh commented Sep 1, 2021

No worries, this is a large amount of code that needs to be meticulously reviewed. I'm happy to help out in any way I can, and I realize that between research, classes, and open-source work, you're likely extremely busy.

} // namespace
// a pointer and a length to a contiguous block of memory
template <typename T>
struct slice {
Member

This is fine but there is already a standard std::slice type, see https://www.cplusplus.com/reference/valarray/slice/slice/

I'd like to have your thoughts on this... are we concerned that the name overload could be confusing?

Furthermore, why should it be a template if it is always instantiated with the same type? Generality is great when it is actually put to good use, but if we never use the extra freedom, then it is not so beneficial.

Contributor Author

I'll rename it. There are numerous other possible names. And I do actually use it with one other type: it's used for both a limb_slice and a byte_slice.

Member

And I do actually use it for one other template: it's for both a limb_slice and byte_slice

Yes. I noticed that later.

Contributor Author

@Alexhuszagh Alexhuszagh Sep 10, 2021

Does span sound good? C++20 has the same concept, and uses the same name for it. They're implemented practically the same way, as a pointer and a size, so it would essentially be a minimal backport of span for C++11. If not, I can think of another name. chunk could also work, since it's descriptive, and not in the standard library at all.

Member

Span seems much better and less confusing.

I am less concerned about us using the same name than about the conceptual confusion. When you read code, it helps if the same name refers to the same usage. Otherwise it makes the code harder to read than it needs to be.

(I got stuck on your code because I was assuming that you had used std::slice... and I could not understand the code at first.)

@@ -0,0 +1,420 @@
#ifndef FASTFLOAT_SLOW_H
Member

I guess 'slow' is fine, but it is not the best name. I do not mind having something called 'slow', but I'd like to have your thoughts on whether there could be a better name for your approach...? Note that we may need to refer to it in other contexts, and saying "well, we use the slow approach" is not... always great.

Contributor Author

Sure, I can rename it. It's a digit-comparison algorithm, so maybe digitcmp is better?

Member

@Alexhuszagh It is your approach so you get to name it. I personally like things that I can pronounce so, digit-comparison or digit_comparison sounds good?

Think about this... you are in a meeting and you want to refer to "this"... now, it helps if you can just speak it out.

(I realize that this is a minor point.)

Contributor Author

@Alexhuszagh It is your approach so you get to name it. I personally like things that I can pronounce so, digit-comparison or digit_comparison sounds good?

Think about this... you are in a meeting and you want to refer to "this"... now, it helps if you can just speak it out.

(I realize that this is a minor point.)

That sounds great, I'll implement all these recommendations then.

}

template <typename T>
fastfloat_really_inline adjusted_mantissa to_b(T value) noexcept {
Member

to_b could probably be made more expressive.


// get the extended precision value of the halfway point between b and b+u
template <typename T>
fastfloat_really_inline adjusted_mantissa to_bh(T value) noexcept {
Member

same comment here, I think to_bh could be made more expressive.

@lemire
Member

lemire commented Sep 10, 2021

Out-values are used extensively. (...) This can be considered an anti-pattern in modern C++

I am not sure that's true.

There is nothing wrong with it in principle. It does not offend me.

@lemire
Member

lemire commented Sep 10, 2021

This code looks very good to me.

@Alexhuszagh I would invite you to check out my comments and see whether you want to address them. They are minor comments and I found nothing significant to object to. However, I would like you to consider my proposals.

Replaces the existing decimal implementation, for substantial
performance improvements with near-halfway cases. This is especially
fast with a large number of digits.

**Big Integer Implementation**

A small subset of big-integer arithmetic has been added, with the
`bigint` struct. It uses a stack-allocated vector with enough bits to
store the float with the large number of significant digits. This is
log2(10^(769 + 342)), to account for the largest possible magnitude
exponent, and number of digits (3600 bits), and then rounded up to 4k bits.

The limb size is determined by the architecture: most 64-bit
architectures have efficient 128-bit multiplication, either by a single
hardware instruction or 2 native multiplications for the high and low
bits. This includes x86_64, mips64, s390x, aarch64, powerpc64, riscv64,
and the only known exception is sparcv8 and sparcv9. Therefore, we
define a limb size of 64-bits on 64-bit architectures except SPARC,
otherwise we fallback to 32-bit limbs.

A simple stackvector is used, which just has operations to add elements,
index, and truncate the vector.

`bigint` is then just a wrapper around this, with methods for
big-integer arithmetic. For our algorithms, we just need multiplication
by a power (x * b^N), multiplication by a bigint or scalar value, and
addition by a bigint or scalar value. Scalar addition and multiplication
uses compiler extensions when possible (__builtin_add_overflow and
__uint128_t), if not, then we implement simple logic shown to optimize
well on MSVC. Big-integer multiplication is done via grade school
multiplication, which is more efficient than any asymptotically faster
algorithms. Multiplication by a power is then done via bitshifts for
powers-of-two, and by iterative multiplications of a large and then
scalar value for powers-of-5.

**compute_float**

Compute float has been slightly modified so if the algorithm cannot
round correctly, it returns a normalized, extended-precision adjusted
mantissa with the power2 shifted by INT16_MIN so the exponent is always
negative. `compute_error` and `compute_error_scaled` have been added.

**Digit Optimizations**

To improve performance for numbers with many digits,
`parse_eight_digits_unrolled` is used for both integers and fractions,
inside a while loop rather than two nested if statements. This adds no
noticeable performance cost for common floats, but dramatically improves
performance for numbers with many digits (without these optimizations,
~65% of the total runtime cost is in parse_number_string).

**Parsed Number**

Two fields have been added to `parsed_number_string`, which contains a
slice of the integer and fraction digits. This is extremely cheap, since
the work is already done, and the strings are pre-tokenized during
parsing. This allows us on overflow to re-parse these tokenized strings,
without checking if each character is an integer. Likewise, for the
big-integer algorithms, we can merely re-parse the pre-tokenized
strings.

**Slow Algorithm**

The new algorithm is `digit_comp`, which takes the parsed number string
and the `adjusted_mantissa` from `compute_float`. The significant digits
are parsed into a big integer, and the exponent relative to the
significant digits is calculated. If the exponent is >= 0, we use
`positive_digit_comp`, otherwise, we use `negative_digit_comp`.

`positive_digit_comp` is quite simple: we scale the significant digits
to the exponent, and then we get the high 64-bits for the native float,
determine if any lower bits were truncated, and use that to direct
rounding.

`negative_digit_comp` is a little more complex, but still fairly simple:
we use the parsed significant digits as the real digits, and calculate
the theoretical digits from `b+h`, the halfway point between `b` and
`b+u`, the next float above `b`. To get `b`, we round the adjusted
mantissa down, create an extended-precision representation, and
calculate the halfway point. We now have a base-10 exponent for the real
digits, and a base-2 exponent for the theoretical digits. We scale these
two to the same exponent by multiplying the theoretical digits by
`5**-real_exp`. We then get the base-2 exponent as `theor_exp -
real_exp`; if this is positive, we multiply the theoretical digits by 2
raised to it, otherwise we multiply the real digits by 2 raised to its
absolute value. Now, both are scaled to the same magnitude, and we
simply compare the digits in the big integer, and use that to direct
rounding.

**Rust-Isms**

A few Rust-isms have been added, since they simplify logic and
assertions. These can be trivially removed or reworked, as needed.

- a `slice` type has been added, which is a pointer and length.
- `FASTFLOAT_ASSERT`, `FASTFLOAT_DEBUG_ASSERT`, and `FASTFLOAT_TRY` have
  been added
  - `FASTFLOAT_ASSERT` aborts, even in release builds, if the condition
    fails.
  - `FASTFLOAT_DEBUG_ASSERT` defaults to `assert`, for logic errors.
  - `FASTFLOAT_TRY` is like Rust's `try!()` macro or `?` operator, which
    propagates errors.

Specifically, `FASTFLOAT_TRY` is useful in combination with
`FASTFLOAT_ASSERT` to ensure there are no memory corruption errors
possible in the big-integer arithmetic. Although the `bigint` type
ensures we have enough storage for all valid floats, memory issues are
quite a severe class of vulnerabilities, and due to the low performance
cost of checks, we abort if we would have out-of-bounds writes. This can
only occur when we are adding items to the vector, which is a very small
number of steps. Therefore, we abort if our memory safety guarantees
ever fail. lexical has never aborted, so it's unlikely we will ever fail
these guarantees.
@Alexhuszagh
Contributor Author

Alexhuszagh commented Sep 10, 2021

I've implemented the above proposals (thanks for the feedback), and rebased for automatic merging.

@lemire
Member

lemire commented Sep 11, 2021

Thanks. I will make a few more checks. Expect me to merge early next week.

It is great work.

@lemire lemire changed the base branch from main to dlemire/bigint September 14, 2021 01:22
@lemire lemire merged commit 3f0ba09 into fastfloat:dlemire/bigint Sep 14, 2021