
Implement the big-integer arithmetic algorithm. #96

Merged
merged 1 commit into fastfloat:dlemire/bigint on Sep 14, 2021

Conversation

Alexhuszagh
Contributor

@Alexhuszagh Alexhuszagh commented Aug 22, 2021

Fixes #93. This implements the big-integer arithmetic algorithms that replace the decimal algorithm, for major performance improvements.

Other Changes/Anti-Patterns

  • 8-digit parsing optimizations use a while loop and apply to both integer and fraction digits.

On my end, this didn't lead to any performance hits for common floats, but dramatically improved performance for floats with a large number of digits. However, this could be a pessimization, so feel free to request this being removed.

  • Store the integer and fraction digits.

Slices (pointer and length, see below) containing the parsed integer and fraction digits are added to parsed_number_string, so the new definition is:

struct parsed_number_string {
  int64_t exponent{0};
  uint64_t mantissa{0};
  const char *lastmatch{nullptr};
  bool negative{false};
  bool valid{false};
  bool too_many_digits{false};
  // contains the range of the significant digits
  byte_slice integer{};  // non-nullable
  byte_slice fraction{}; // nullable
};

This is effectively free, since the work is already done by the parsing algorithm. It also means that, on overflow or when parsing the significant digits for the slow algorithm, we don't have to re-check that the characters are valid digits, since they were validated during the initial parse. This simplifies internal logic, especially for the big-integer algorithm.

  • A few Rust-isms have been added, such as slices and memory safety guarantees for the big-integer code.

A slice type has been added, as follows, with two typedefs for byte and limb slices:

// a pointer and a length to a contiguous block of memory
template <typename T>
struct slice {
  const T* ptr;
  size_t length;
};
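
For illustration, the two instantiations mentioned above would look roughly like this (the element types are assumptions; limb is the architecture-dependent limb type described in the implementation notes below):

typedef slice<char> byte_slice; // parsed integer/fraction digit characters
typedef slice<limb> limb_slice; // big-integer limbs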

Array bounds checking is done for big-integer arithmetic when extending the stack vector. For example, to shift the limbs left in a big integer, we use the following code:

  // move the limbs left by `n` limbs.
  bool shl_limbs(size_t n) noexcept {
    FASTFLOAT_DEBUG_ASSERT(n != 0);
    if (n + vec.len() > vec.capacity()) {
      return false;
    } else {
      ...
      return true;
    }
  }

If the return value is false, the failure propagates; the calling function, shl, looks like this:

  // move the limbs left by `n` bits.
  bool shl(size_t n) noexcept {
    size_t rem = n % limb_bits;
    size_t div = n / limb_bits;
    if (rem != 0) {
      FASTFLOAT_TRY(shl_bits(rem));
    }
    if (div != 0) {
      FASTFLOAT_TRY(shl_limbs(div));
    }
    return true;
  }

Any function that calls shl then uses a run-time assertion (even in release builds) to ensure there is no buffer overwrite:

#ifndef FASTFLOAT_ASSERT
#define FASTFLOAT_ASSERT(x)  { if (!(x)) abort(); }
#endif

#ifndef FASTFLOAT_DEBUG_ASSERT
#include <cassert>
#define FASTFLOAT_DEBUG_ASSERT(x) assert(x)
#endif

// rust style `try!()` macro, or `?` operator
#define FASTFLOAT_TRY(x) { if (!(x)) return false; }

FASTFLOAT_ASSERT(theor_digits.shl(uint32_t(pow2_exp)));
  • Out-values are used extensively.

This can be considered an anti-pattern in modern C++, but some existing functions such as parse_infnan already use it, and using std::pair seemed to lead to worse assembly generation. So many of the functions use out-values, as in the following example, where truncated is meant to be set by the function and the return value is the high 64 bits of the big integer:

fastfloat_really_inline
uint64_t uint64_hi64(uint64_t r0, bool& truncated) noexcept {
  truncated = false;
  int shl = leading_zeroes(r0);
  return r0 << shl;
}
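
The same pattern extends to multiple limbs: the out-parameter records whether any bits below the returned 64 were nonzero. A hedged sketch of the two-limb case (not necessarily the literal code in the patch):

fastfloat_really_inline
uint64_t uint64_hi64(uint64_t r0, uint64_t r1, bool &truncated) noexcept {
  // r0 is the most-significant limb; assumes r0 != 0 (the big integer is normalized).
  int shl = leading_zeroes(r0);
  if (shl == 0) {
    truncated = r1 != 0;
    return r0;
  } else {
    int shr = 64 - shl;
    truncated = (r1 << shl) != 0;      // any bits shifted out of the low limb?
    return (r0 << shl) | (r1 >> shr);
  }
}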

Implementation Notes

Replaces the existing decimal implementation, for substantial performance improvements with near-halfway cases. This is especially fast with a large number of digits.

Big Integer Implementation

A small subset of big-integer arithmetic has been added, with the bigint struct. It uses a stack-allocated vector with enough bits to store the float with the largest number of significant digits. This is log2(10^(769 + 342)) bits, to account for the largest possible magnitude exponent and number of digits (~3600 bits), and then rounded up to 4k bits.

The limb size is determined by the architecture: most 64-bit architectures have efficient 128-bit multiplication, either by a single hardware instruction or 2 native multiplications for the high and low bits. This includes x86_64, mips64, s390x, aarch64, powerpc64, and riscv64; the only known exceptions are sparcv8 and sparcv9. Therefore, we define a limb size of 64 bits on 64-bit architectures except SPARC, and otherwise fall back to 32-bit limbs.
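
As a sketch, the limb selection boils down to a preprocessor check along these lines (the exact macros tested in the patch may differ; this is illustrative):

#include <cstddef>
#include <cstdint>

#if defined(__SIZEOF_INT128__) && !defined(__sparc__)
// 64-bit limbs: these targets have cheap 64x64 -> 128-bit multiplication.
typedef uint64_t limb;
constexpr size_t limb_bits = 64;
#else
// 32-bit limbs: 64-bit intermediate products are always available.
typedef uint32_t limb;
constexpr size_t limb_bits = 32;
#endif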

A simple stackvector is used, which just has operations to add elements, index, and truncate the vector.

bigint is then just a wrapper around this, with methods for big-integer arithmetic. For our algorithms, we just need multiplication by a power (x * b^N), multiplication by a bigint or scalar value, and addition by a bigint or scalar value. Scalar addition and multiplication use compiler extensions when possible (__builtin_add_overflow and __uint128_t); if unavailable, we implement simple logic shown to optimize well on MSVC. Big-integer multiplication is done via grade-school multiplication, which at these sizes is more efficient than asymptotically faster algorithms. Multiplication by a power is done via bit shifts for powers of two, and by iterative multiplications by a large power and then a scalar value for powers of 5.
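
As an example of the scalar path, here is a hedged sketch of a 64-bit-limb multiply with carry propagation (the patch also has a 32-bit-limb variant and MSVC-friendly code; the function name here is illustrative):

#include <cstdint>

// Multiply one 64-bit limb by a scalar and add the incoming carry.
// Returns the low 64 bits of the result and writes the high 64 bits to `carry`.
inline uint64_t scalar_mul(uint64_t x, uint64_t y, uint64_t &carry) noexcept {
#ifdef __SIZEOF_INT128__
  __uint128_t z = __uint128_t(x) * y + carry;
  carry = uint64_t(z >> 64);
  return uint64_t(z);
#else
  // Portable 64x64 -> 128-bit multiply, split into 32-bit halves.
  uint64_t x_lo = x & 0xFFFFFFFF, x_hi = x >> 32;
  uint64_t y_lo = y & 0xFFFFFFFF, y_hi = y >> 32;
  uint64_t lo_lo = x_lo * y_lo;
  uint64_t t = x_hi * y_lo + (lo_lo >> 32);
  uint64_t u = x_lo * y_hi + (t & 0xFFFFFFFF);
  uint64_t hi = x_hi * y_hi + (t >> 32) + (u >> 32);
  uint64_t lo = (u << 32) | (lo_lo & 0xFFFFFFFF);
  // Add the incoming carry and propagate any overflow into the high word.
  lo += carry;
  if (lo < carry) { hi++; }
  carry = hi;
  return lo;
#endif
}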

compute_float

compute_float has been slightly modified: if the algorithm cannot round correctly, it returns a normalized, extended-precision adjusted mantissa with power2 shifted by INT16_MIN, so the exponent is always negative. compute_error and compute_error_scaled have been added.
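
Concretely, a caller can detect the could-not-round case just from the sign of the returned exponent. A minimal sketch, with illustrative variable names (pns is a parsed_number_string); the exact call signatures may differ from the patch:

adjusted_mantissa am = compute_float<binary_format<double>>(pns.exponent, pns.mantissa);
if (am.power2 < 0) {
  // A negative power2 marks "could not round correctly": fall back to the
  // big-integer digit-comparison algorithm described below.
  am = digit_comp<double>(pns, am);
}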

Digit Optimizations

To improve performance for numbers with many digits, parse_eight_digits_unrolled is used for both the integer and fraction digits, inside a while loop rather than two nested if statements. This adds no noticeable performance cost for common floats, but dramatically improves performance for numbers with many digits (without these optimizations, ~65% of the total runtime cost is in parse_number_string).
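
A hedged sketch of the while-loop form for the integer digits (the fraction loop is analogous; the helper names follow existing fast_float routines, and consume_digits itself is an illustrative name, not the patch's):

// Accumulate decimal digits into `i`, eight at a time while at least eight
// remain, then one at a time; returns the position after the last digit.
fastfloat_really_inline
const char *consume_digits(const char *p, const char *pend, uint64_t &i) noexcept {
  while ((size_t)(pend - p) >= 8 && is_made_of_eight_digits_fast(p)) {
    i = i * 100000000 + parse_eight_digits_unrolled(p); // in rare cases, this "overflows", but that's ok
    p += 8;
  }
  while (p != pend && is_integer(*p)) {
    i = 10 * i + uint64_t(*p - '0');
    ++p;
  }
  return p;
}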

Parsed Number

Two fields have been added to parsed_number_string, which contain slices of the integer and fraction digits. This is extremely cheap, since the work is already done, and the strings are pre-tokenized during parsing. This allows us, on overflow, to re-parse these tokenized strings without checking whether each character is a digit. Likewise, for the big-integer algorithms, we can merely re-parse the pre-tokenized strings.
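
For example, re-accumulating up to 19 significant digits from the stored slices needs no digit checks at all. A sketch, where parse_truncated_mantissa is an illustrative name rather than the patch's:

// Re-accumulate up to 19 significant digits (the most that fit in a uint64_t)
// from the pre-validated integer and fraction slices.
fastfloat_really_inline
uint64_t parse_truncated_mantissa(const parsed_number_string &num) noexcept {
  uint64_t mantissa = 0;
  size_t consumed = 0;
  for (size_t i = 0; i < num.integer.length && consumed < 19; i++, consumed++) {
    mantissa = mantissa * 10 + uint64_t(num.integer.ptr[i] - '0');
  }
  for (size_t i = 0; i < num.fraction.length && consumed < 19; i++, consumed++) {
    mantissa = mantissa * 10 + uint64_t(num.fraction.ptr[i] - '0');
  }
  return mantissa;
}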

Slow Algorithm

The new algorithm is digit_comp, which takes the parsed number string and the adjusted_mantissa from compute_float. The significant digits are parsed into a big integer, and the exponent relative to the significant digits is calculated. If the exponent is >= 0, we use positive_digit_comp; otherwise, we use negative_digit_comp. positive_digit_comp is quite simple: we scale the significant digits by 10^exponent, then take the high 64 bits for the native float, determine whether any lower bits were truncated, and use that to direct rounding.
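
In outline (the method and helper names below follow the description above and are assumptions about the final code, not the literal patch):

// positive_digit_comp, roughly:
bigint big;
parse_mantissa(big, num);                    // load the significant digits into the big integer
FASTFLOAT_ASSERT(big.pow10(uint32_t(exp)));  // scale: value = digits * 10^exp
bool truncated;
uint64_t hi = big.hi64(truncated);           // top 64 bits; truncated = any lower bits nonzero
// hi and truncated then direct how the final mantissa is rounded.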

negative_digit_comp is a little more complex, but still fairly simple: we use the parsed significant digits as the real digits, and calculate the theoretical digits from b+h, the halfway point between b and b+u, the next float above b. To get b, we round the adjusted mantissa down, create an extended-precision representation, and calculate the halfway point. We now have a base-10 exponent for the real digits, and a base-2 exponent for the theoretical digits. We scale these two to the same exponent by multiplying the theoretical digits by 5**-real_exp. We then get the base-2 exponent as theor_exp - real_exp; if this is positive, we multiply the theoretical digits by 2 raised to it, otherwise we multiply the real digits by 2 raised to its absolute value. Now, both are scaled to the same magnitude, and we simply compare the digits in the big integer, and use that to direct rounding.
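
The scale-and-compare step can be outlined as follows (helper names are assumptions; the real code wraps the growth operations in FASTFLOAT_ASSERT as described earlier):

// negative_digit_comp, roughly: bring both values to the same scale, then compare.
theor_digits.pow5(uint32_t(-real_exp));    // theoretical *= 5^(-real_exp); real_exp < 0 here
int32_t pow2_exp = int32_t(theor_exp - real_exp);
if (pow2_exp > 0) {
  theor_digits.pow2(uint32_t(pow2_exp));   // theoretical *= 2^pow2_exp
} else if (pow2_exp < 0) {
  real_digits.pow2(uint32_t(-pow2_exp));   // real *= 2^(-pow2_exp)
}
int ord = real_digits.compare(theor_digits);
// ord > 0: the real value is above the halfway point, round up;
// ord < 0: below the halfway point, round down;
// ord == 0: exactly halfway, round to even.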

@Alexhuszagh
Contributor Author

For the slices, these could trivially be replaced with a pointer range, which is typically the more standard C++ way to do things. For the bounds assertions, these should be safe to remove, since they've been battle tested in comparable code for millions of downloads in numerous production systems. However, I'm generally a decently defensive programmer by nature, so I left them in currently. I'm willing to make any changes, as you see fit.

@Alexhuszagh
Contributor Author

It seems we have an off-by-one error for MSVC (weird, because I thought I explicitly checked this case). I'll run it through a full conformance test, including on MSVC, and update the PR, because it clearly is not ready yet.

@Alexhuszagh
Contributor Author

Alexhuszagh commented Aug 22, 2021

Ok the bugs were relatively trivial: I'll run a full conformance check tomorrow just as an extra precaution. The full list of issues was:

  1. Forgetting to normalize bigint after using the bigint(uint64_t) constructor.
  2. Forgetting to assign the addition of the carry for scalar_mult on 64-bit limbs.
  3. Fix uint32_hi64, which incorrectly shifted the most-significant limb when there were 3 32-bit limbs.
  4. Fix compare, which was comparing in big-endian order, not little-endian order (which the limbs are stored in).

@lemire
Member

lemire commented Aug 22, 2021

That's fantastic. I trust you and I trust your code, but I will read it all carefully before merging. This could be a few days at least. Please note that if there is a delay, it is not due to a lack of interest or a disregard for the importance of your work.

@Alexhuszagh
Contributor Author

Alexhuszagh commented Aug 23, 2021

The exhaustive conformance tests just finished running, and everything passed, so I'm much more content with the state of the PR.

@lemire
Member

lemire commented Aug 24, 2021

Thank you.

@lemire
Member

lemire commented Sep 1, 2021

Still on my radar. This is important.

@Alexhuszagh
Contributor Author

Alexhuszagh commented Sep 1, 2021

No worries, this is a large amount of code that needs to be meticulously reviewed. I'm happy to help out in any way I can, and I realize that between research, classes, and open-source work, you're likely extremely busy.

} // namespace
// a pointer and a length to a contiguous block of memory
template <typename T>
struct slice {
Member

This is fine but there is already a standard std::slice type, see https://www.cplusplus.com/reference/valarray/slice/slice/

I'd like to have your thoughts on this... are we concerned that the name overload could be confusing?

Furthermore, why should it be a template if it is always instantiated with the same type? Generality is great when it is actually put to good use, but if we never use the extra freedom, then it is not so beneficial.

Contributor Author

I'll rename it. There are numerous other possible names. And I do actually use it with one other type: it's used for both a limb_slice and a byte_slice.

Member

And I do actually use it for one other template: it's for both a limb_slice and byte_slice

Yes. I noticed that later.

Contributor Author

@Alexhuszagh Alexhuszagh Sep 10, 2021

Does span sound good? C++20 has the same concept, and uses the same name for it. They're implemented practically the same way, as a pointer and a size, so it would essentially be a minimal backport of span for C++11. If not, I can think of another name. chunk could also work, since it's descriptive, and not in the standard library at all.

Member

Span seems much better and less confusing.

I am less concerned about us using the same name than about the conceptual confusion. When you read code, it helps if the same name refers to the same usage. Otherwise it makes the code harder to read than it needs to be.

(I got stuck on your code because I was assuming that you had used std::slice... and I could not understand the code at first.)

@@ -0,0 +1,420 @@
#ifndef FASTFLOAT_SLOW_H
Member

I guess 'slow' is fine, but it is not the best name. I do not mind having something called 'slow', but I'd like to have your thoughts on whether there could be a better name for your approach...? Note that we may need to refer to it in other contexts, and saying "well, we use the slow approach" is not... always great.

Contributor Author

Sure, I can rename it. It's a digit-comparison algorithm, so maybe digitcmp is better?

Member

@Alexhuszagh It is your approach so you get to name it. I personally like things that I can pronounce so, digit-comparison or digit_comparison sounds good?

Think about this... you are in a meeting and you want to refer to "this"... now, it helps if you can just speak it out.

(I realize that this is a minor point.)

Contributor Author

@Alexhuszagh It is your approach so you get to name it. I personally like things that I can pronounce so, digit-comparison or digit_comparison sounds good?

Think about this... you are in a meeting and you want to refer to "this"... now, it helps if you can just speak it out.

(I realize that this is a minor point.)

That sounds great, I'll implement all these recommendations then.

}

template <typename T>
fastfloat_really_inline adjusted_mantissa to_b(T value) noexcept {
Member

to_b could probably be made more expressive.


// get the extended precision value of the halfway point between b and b+u
template <typename T>
fastfloat_really_inline adjusted_mantissa to_bh(T value) noexcept {
Member

same comment here, I think to_bh could be made more expressive.

@lemire
Member

lemire commented Sep 10, 2021

Out-values are used extensively. (...) This can be considered an anti-pattern in modern C++

I am not sure that's true.

There is nothing wrong with it in principle. It does not offend me.

@lemire
Member

lemire commented Sep 10, 2021

This code looks very good to me.

@Alexhuszagh I would invite you to check out my comments and see whether you want to address them. They are minor comments and I found nothing significant to object to. However, I would like you to consider my proposals.

Replaces the existing decimal implementation, for substantial
performance improvements with near-halfway cases. This is especially
fast with a large number of digits.

**Big Integer Implementation**

A small subset of big-integer arithmetic has been added, with the
`bigint` struct. It uses a stack-allocated vector with enough bits to
store the float with the large number of significant digits. This is
log2(10^(769 + 342)), to account for the largest possible magnitude
exponent, and number of digits (3600 bits), and then rounded up to 4k bits.

The limb size is determined by the architecture: most 64-bit
architectures have efficient 128-bit multiplication, either by a single
hardware instruction or 2 native multiplications for the high and low
bits. This includes x86_64, mips64, s390x, aarch64, powerpc64, riscv64,
and the only known exception is sparcv8 and sparcv9. Therefore, we
define a limb size of 64-bits on 64-bit architectures except SPARC,
otherwise we fallback to 32-bit limbs.

A simple stackvector is used, which just has operations to add elements,
index, and truncate the vector.

`bigint` is then just a wrapper around this, with methods for
big-integer arithmetic. For our algorithms, we just need multiplication
by a power (x * b^N), multiplication by a bigint or scalar value, and
addition by a bigint or scalar value. Scalar addition and multiplication
uses compiler extensions when possible (__builtin_add_overflow and
__uint128_t), if not, then we implement simple logic shown to optimize
well on MSVC. Big-integer multiplication is done via grade school
multiplication, which is more efficient than any asymptotically faster
algorithms. Multiplication by a power is then done via bitshifts for
powers-of-two, and by iterative multiplications of a large and then
scalar value for powers-of-5.

**compute_float**

Compute float has been slightly modified so if the algorithm cannot
round correctly, it returns a normalized, extended-precision adjusted
mantissa with the power2 shifted by INT16_MIN so the exponent is always
negative. `compute_error` and `compute_error_scaled` have been added.

**Digit Optimizations**

To improve performance for numbers with many digits,
`parse_eight_digits_unrolled` is used for both integers and fractions,
inside a while loop rather than two nested if statements. This adds no
noticeable performance cost for common floats, but dramatically improves
performance for numbers with many digits (without these optimizations,
~65% of the total runtime cost is in parse_number_string).

**Parsed Number**

Two fields have been added to `parsed_number_string`, which contains a
slice of the integer and fraction digits. This is extremely cheap, since
the work is already done, and the strings are pre-tokenized during
parsing. This allows us on overflow to re-parse these tokenized strings,
without checking if each character is an integer. Likewise, for the
big-integer algorithms, we can merely re-parse the pre-tokenized
strings.

**Slow Algorithm**

The new algorithm is `digit_comp`, which takes the parsed number string
and the `adjusted_mantissa` from `compute_float`. The significant digits
are parsed into a big integer, and the exponent relative to the
significant digits is calculated. If the exponent is >= 0, we use
`positive_digit_comp`, otherwise, we use `negative_digit_comp`.

`positive_digit_comp` is quite simple: we scale the significant digits
to the exponent, and then we get the high 64-bits for the native float,
determine if any lower bits were truncated, and use that to direct
rounding.

`negative_digit_comp` is a little more complex, but still fairly simple:
we use the parsed significant digits as the real digits, and calculate
the theoretical digits from `b+h`, the halfway point between `b` and
`b+u`, the next float above `b`. To get `b`, we round the adjusted
mantissa down, create an extended-precision representation, and
calculate the halfway point. We now have a base-10 exponent for the real
digits, and a base-2 exponent for the theoretical digits. We scale these
two to the same exponent by multiplying the theoretical digits by
`5**-real_exp`. We then get the base-2 exponent as `theor_exp -
real_exp`; if this is positive, we multiply the theoretical digits by 2
raised to it, otherwise we multiply the real digits by 2 raised to its
absolute value. Now, both are scaled to the same magnitude, and we
simply compare the digits in the big integer, and use that to direct
rounding.

**Rust-Isms**

A few Rust-isms have been added, since they simplify logic and
assertions. These can be trivially removed or reworked, as needed.

- a `slice` type has been added, which is a pointer and length.
- `FASTFLOAT_ASSERT`, `FASTFLOAT_DEBUG_ASSERT`, and `FASTFLOAT_TRY` have
  been added
  - `FASTFLOAT_ASSERT` aborts, even in release builds, if the condition
    fails.
  - `FASTFLOAT_DEBUG_ASSERT` defaults to `assert`, for logic errors.
  - `FASTFLOAT_TRY` is like Rust's `try!()` macro or `?` operator, which
    propagates errors.

Specifically, `FASTFLOAT_TRY` is useful in combination with
`FASTFLOAT_ASSERT` to ensure there are no memory corruption errors
possible in the big-integer arithmetic. Although the `bigint` type
ensures we have enough storage for all valid floats, memory issues are
quite a severe class of vulnerabilities, and due to the low performance
cost of checks, we abort if we would have out-of-bounds writes. This can
only occur when we are adding items to the vector, which is a very small
number of steps. Therefore, we abort if our memory safety guarantees
ever fail. lexical has never aborted, so it's unlikely we will ever fail
these guarantees.
@Alexhuszagh
Contributor Author

Alexhuszagh commented Sep 10, 2021

I've implemented the above proposals (thanks for the feedback), and rebased for automatic merging.

@lemire
Member

lemire commented Sep 11, 2021

Thanks. I will make a few more checks. Expect me to merge early next week.

It is great work.

@lemire lemire changed the base branch from main to dlemire/bigint September 14, 2021 01:22
@lemire lemire merged commit 3f0ba09 into fastfloat:dlemire/bigint Sep 14, 2021