Implement the big-integer arithmetic algorithm. #96
Conversation
For the slices, these could trivially be replaced with a pointer range, which is typically the more standard C++ way to do things. For the bounds assertions, these should be safe to remove, since they've been battle-tested in comparable code across millions of downloads in numerous production systems. However, I'm generally a fairly defensive programmer by nature, so I've left them in for now. I'm willing to make any changes, as you see fit.
It seems we have an off-by-one error for MSVC (weird, because I thought I explicitly checked this case). I'll run it through a full conformance test, including on MSVC, and update the PR, because it clearly is not ready yet.
OK, the bugs were relatively trivial. I'll run a full conformance check tomorrow just as an extra precaution. The full list of issues was:
That's fantastic. I trust you and I trust your code, but I will read it all carefully before merging. This could take a few days at least. Please note that if there is a delay, it is not due to a lack of interest or a disregard for the importance of your work.

The exhaustive conformance tests just finished running, and everything passed, so I'm much more content about the state of the PR.

Thank you.

Still on my radar. This is important.

No worries, this is a large amount of code that needs to be meticulously reviewed. I'm happy to help out in any way I can, and I realize that between research, classes, and open-source work, you're likely extremely busy.
include/fast_float/float_common.h
Outdated
} // namespace

// a pointer and a length to a contiguous block of memory
template <typename T>
struct slice {
This is fine, but there is already a standard std::slice type; see https://www.cplusplus.com/reference/valarray/slice/slice/

I'd like to have your thoughts on this... are we concerned that the name collision could be confusing?

Furthermore, why should it be a template if it is always instantiated with the same type? Generality is great when it is actually put to good use, but if we never use the extra freedom, then it is not so beneficial.
I'll rename it; there are numerous other possible names. And I do actually use it for one other template: it's for both a `limb_slice` and a `byte_slice`.
> And I do actually use it for one other template: it's for both a `limb_slice` and a `byte_slice`
Yes. I noticed that later.
Does `span` sound good? C++20 has the same concept and uses the same name for it; it's implemented in practically the same way, as a pointer and a size. It would essentially be a minimal backport to C++11. If not, I can think of another name. `chunk` could also work, since it's descriptive and not in the standard library at all.
Span seems much better and less confusing.
I am less concerned about us using the same name, and more about the conceptual confusion. When you read code, it helps if the same name refers to the same usage. Otherwise it makes the code harder to read than it needs to be.
(I got stuck on your code because I was assuming that you had used std::slice... and I could not understand the code at first.)
include/fast_float/slow.h
Outdated
@@ -0,0 +1,420 @@
#ifndef FASTFLOAT_SLOW_H
I guess 'slow' is fine, but it is not the best name. I do not mind having something called 'slow', but I'd like to have your thoughts on whether there could be a better name for your approach...? Note that we may need to refer to it in other contexts, and saying "well, we use the slow approach" is not... always great.
Sure, I can rename it. It's a digit-comparison algorithm, so maybe `digitcmp` is better?
@Alexhuszagh It is your approach so you get to name it. I personally like things that I can pronounce, so digit-comparison or `digit_comparison` sounds good?
Think about this... you are in a meeting and you want to refer to "this"... now, it helps if you can just speak it out.
(I realize that this is a minor point.)
> @Alexhuszagh It is your approach so you get to name it. I personally like things that I can pronounce, so digit-comparison or `digit_comparison` sounds good?
> Think about this... you are in a meeting and you want to refer to "this"... now, it helps if you can just speak it out.
> (I realize that this is a minor point.)
That sounds great, I'll implement all these recommendations then.
include/fast_float/slow.h
Outdated
}

template <typename T>
fastfloat_really_inline adjusted_mantissa to_b(T value) noexcept {
`to_b` could probably be made more expressive.
include/fast_float/slow.h
Outdated
// get the extended precision value of the halfway point between b and b+u
template <typename T>
fastfloat_really_inline adjusted_mantissa to_bh(T value) noexcept {
Same comment here: I think `to_bh` could be made more expressive.
I am not sure that's true. There is nothing wrong with it in principle. It does not offend me.
This code looks very good to me. @Alexhuszagh I would invite you to check out my comments and see whether you want to address them. They are minor comments and I found nothing significant to object to. However, I would like you to consider my proposals.
Rust-Isms

A few Rust-isms have been added, since they simplify logic assertions. These can be trivially removed or reworked, as needed.

- a `slice` type has been added, which is a pointer and length.
- `FASTFLOAT_ASSERT`, `FASTFLOAT_DEBUG_ASSERT`, and `FASTFLOAT_TRY` have been added:
  - `FASTFLOAT_ASSERT` aborts, even in release builds, if the condition fails.
  - `FASTFLOAT_DEBUG_ASSERT` defaults to `assert`, for logic errors.
  - `FASTFLOAT_TRY` is like a Rust `Option` type, which propagates errors.

Specifically, `FASTFLOAT_TRY` is useful in combination with `FASTFLOAT_ASSERT` to ensure there are no possible memory-corruption errors in the big-integer arithmetic. Although the `bigint` type ensures we have enough storage for all valid floats, memory issues are quite a severe class of vulnerabilities, and due to the low performance cost of the checks, we abort if we would have an out-of-bounds write. This can only occur when we are adding items to the vector, which happens in a very small number of places, so we abort if our memory-safety guarantees ever fail. lexical has never aborted, so it's unlikely we will ever fail these guarantees.
I've implemented the above proposals (thanks for the feedback), and rebased for automatic merging.
Thanks. I will make a few more checks. Expect me to merge early next week. It is great work.
Fixes #93. This implements the big-integer arithmetic algorithms that replace the decimal algorithm, for major performance improvements.
Other Changes/Anti-Patterns
The digit-parsing changes (see Digit Optimizations below) didn't lead to any performance hits for common floats on my end, but dramatically improved performance for floats with a large number of digits. However, this could be a pessimization, so feel free to request that they be removed.
A slice (pointer and length, see below) containing the parsed integer and fraction digits is added to `parsed_number_string`; a sketch of the new definition is shown below. This is effectively free, since this work is already done by the parsing algorithm, and it means that on overflow, and when parsing the significant digits for the slow algorithm, we don't have to check whether each character is a valid digit, since the characters have previously been validated. This simplifies internal logic, especially for the big-integer algorithm.
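A minimal sketch of the extended struct, assuming the field names `integer` and `fraction` (the PR's exact definition may differ):

```cpp
// Sketch only: the two trailing byte_slice fields are the new additions.
struct parsed_number_string {
  int64_t exponent{0};
  uint64_t mantissa{0};
  const char *lastmatch{nullptr};
  bool negative{false};
  bool valid{false};
  bool too_many_digits{false};
  // pre-tokenized ranges of the significant digits (assumed names)
  byte_slice integer{};  // the integer digits
  byte_slice fraction{}; // the fraction digits
};
```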
A slice type has been added, as follows, with two typedefs for byte and limb slices:
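A minimal sketch of such a type, assuming the member names mentioned in the review discussion:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#ifndef FASTFLOAT_DEBUG_ASSERT
#define FASTFLOAT_DEBUG_ASSERT(x) assert(x) // see the Rust-isms above
#endif

typedef uint64_t limb; // see the limb-size discussion below

// Sketch only: a non-owning pointer + length pair, analogous to a Rust slice.
template <typename T>
struct slice {
  const T *ptr{nullptr};
  size_t length{0};

  constexpr size_t len() const noexcept { return length; }
  const T &operator[](size_t index) const noexcept {
    FASTFLOAT_DEBUG_ASSERT(index < length); // bounds check in debug builds
    return ptr[index];
  }
};

typedef slice<char> byte_slice; // typedef names from the review thread
typedef slice<limb> limb_slice;
```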
Array bounds checking is done for big-integer arithmetic when extending the stack vector. For example, for shifting the limbs left in a big integer, we use code along the lines of the sketch below.
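A minimal sketch of such a shift, assuming a fixed-capacity `stackvec` with `len`/`set_len`/`is_empty`/`data` members; the function fails, rather than writing out of bounds, when the buffer is too small:

```cpp
#include <algorithm>

// Sketch only: move the limbs left by `n` limbs, returning false if the
// fixed-capacity stack vector cannot grow to hold the result.
template <uint16_t size>
bool shl_limbs(stackvec<size> &vec, size_t n) noexcept {
  FASTFLOAT_DEBUG_ASSERT(n != 0);
  if (n + vec.len() > size) {
    return false; // would overflow the stack buffer: report failure
  } else if (!vec.is_empty()) {
    // copy backward so the source limbs are not clobbered mid-move
    limb *dst = vec.data + n;
    const limb *src = vec.data;
    std::copy_backward(src, src + vec.len(), dst + vec.len());
    // zero-fill the vacated low limbs
    std::fill(vec.data, vec.data + n, limb(0));
    vec.set_len(n + vec.len());
  }
  return true;
}
```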
If the return value is `false`, this propagates, so the function that calls it, `shl`, looks like the sketch below.
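A minimal sketch, assuming the shift splits into a sub-limb bit shift (`shl_bits`) and the whole-limb shift above, with `FASTFLOAT_TRY` propagating any failure:

```cpp
// Sketch only: shift the big integer left by `n` bits.
bool shl(size_t n) noexcept {
  size_t rem = n % limb_bits; // bits within a limb
  size_t div = n / limb_bits; // whole limbs
  if (rem != 0) {
    FASTFLOAT_TRY(shl_bits(rem)); // propagates false upward on failure
  }
  if (div != 0) {
    FASTFLOAT_TRY(shl_limbs(div));
  }
  return true;
}
```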
And then any function that calls `shl` will use a run-time assertion (even in release builds) that ensures there is not a buffer overwrite.
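For example (the variable names here are illustrative):

```cpp
// Sketch only: abort, even in release builds, rather than risk an
// out-of-bounds write if the shift exceeds the bigint's fixed storage.
FASTFLOAT_ASSERT(bigmant.shl(shift));
```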
This can be considered an anti-pattern in modern C++, but some functions such as `parse_infnan` seemed to use it already, and using `std::pair` seemed to lead to worse assembly generation, so a lot of the functions use out-values (sketched below), where `truncated` is meant to be set by the function, which returns the high 64 bits of the big integer.
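A sketch of the out-value style (the exact signature is an assumption):

```cpp
// Sketch only: returns the high 64 bits of the big integer; `truncated` is
// set to true if any lower set bits were discarded by the truncation.
uint64_t hi64(bool &truncated) const noexcept;
```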
Implementation Notes

Replaces the existing decimal implementation, for substantial performance improvements with near-halfway cases. This is especially fast with a large number of digits.
Big Integer Implementation
A small subset of big-integer arithmetic has been added, with the `bigint` struct. It uses a stack-allocated vector with enough bits to store a float with the largest possible number of significant digits. This is log2(10^(769 + 342)) bits, to account for the largest possible magnitude exponent and number of digits (~3600 bits), rounded up to 4k bits.

The limb size is determined by the architecture: most 64-bit architectures have efficient 128-bit multiplication, either via a single hardware instruction or two native multiplications for the high and low bits. This includes x86_64, mips64, s390x, aarch64, powerpc64, and riscv64; the only known exceptions are sparcv8 and sparcv9. Therefore, we define a limb size of 64 bits on 64-bit architectures other than SPARC, and otherwise fall back to 32-bit limbs.
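A minimal sketch of the limb selection; the PR's real detection macros are more thorough than this simplified condition:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch only: 64-bit limbs where 128-bit products are cheap, else 32-bit.
#if defined(__SIZEOF_INT128__) && !defined(__sparc__)
typedef uint64_t limb;
constexpr size_t limb_bits = 64;
typedef __uint128_t double_limb; // holds a full limb-by-limb product
#else
typedef uint32_t limb;
constexpr size_t limb_bits = 32;
typedef uint64_t double_limb;
#endif
```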
A simple stackvector is used, which just has operations to add elements, index, and truncate the vector. `bigint` is then just a wrapper around this, with methods for big-integer arithmetic. For our algorithms, we just need multiplication by a power (x * b^N), multiplication by a bigint or scalar value, and addition of a bigint or scalar value. Scalar addition and multiplication use compiler extensions when possible (__builtin_add_overflow and __uint128_t); when these are unavailable, we implement simple logic shown to optimize well on MSVC. Big-integer multiplication is done via grade-school multiplication, which at these sizes is more efficient than asymptotically faster algorithms. Multiplication by a power is then done via bit shifts for powers of two, and via iterative multiplications by a large power and then a scalar value for powers of 5.
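A minimal sketch of the overflow-checked scalar addition (the function name is assumed):

```cpp
#include <cstdint>

typedef uint64_t limb; // assuming 64-bit limbs, as above

#ifdef __has_builtin
#if __has_builtin(__builtin_add_overflow)
#define FASTFLOAT_HAS_ADD_OVERFLOW 1
#endif
#endif

// Sketch only: add two limbs, reporting the carry-out through `overflow`.
inline limb scalar_add(limb x, limb y, bool &overflow) noexcept {
#ifdef FASTFLOAT_HAS_ADD_OVERFLOW
  limb z;
  overflow = __builtin_add_overflow(x, y, &z);
  return z;
#else
  limb z = x + y;   // unsigned wraparound is well-defined
  overflow = z < x; // wrapped iff the sum is smaller than an operand
  return z;
#endif
}
```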
compute_float

`compute_float` has been slightly modified: if the algorithm cannot round correctly, it returns a normalized, extended-precision adjusted mantissa with the power2 shifted by INT16_MIN, so the exponent is always negative. `compute_error` and `compute_error_scaled` have been added.
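A minimal sketch of the sentinel convention, with assumed names (`invalid_am_bias`, `rounding_failed`):

```cpp
#include <cstdint>

struct adjusted_mantissa {
  uint64_t mantissa{0};
  int32_t power2{0}; // a legitimate result here is non-negative
};

// Sketch only: the bias added to power2 when rounding cannot be decided.
constexpr int32_t invalid_am_bias = INT16_MIN;

inline bool rounding_failed(const adjusted_mantissa &am) noexcept {
  return am.power2 < 0; // biased exponents are always negative
}

inline void remove_bias(adjusted_mantissa &am) noexcept {
  am.power2 -= invalid_am_bias; // recover the real exponent on the slow path
}
```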
Digit Optimizations

To improve performance for numbers with many digits, `parse_eight_digits_unrolled` is used for both the integer and fraction digits, via a while loop rather than two nested if statements. This adds no noticeable performance cost for common floats, but dramatically improves performance for numbers with many digits (without these optimizations, ~65% of the total runtime cost is in `parse_number_string`).
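A minimal sketch of the loop form; `is_made_of_eight_digits_fast` and `parse_eight_digits_unrolled` are the library's existing helpers, while the wrapper itself is illustrative:

```cpp
#include <cstdint>

// Sketch only: accumulate a run of ASCII digits into `i`, eight at a time,
// then one at a time; the same loop runs over the integer and the fraction.
inline void parse_digits(const char *&p, const char *pend, uint64_t &i) {
  while ((pend - p) >= 8 && is_made_of_eight_digits_fast(p)) {
    i = i * 100000000 + parse_eight_digits_unrolled(p); // 8 digits per pass
    p += 8;
  }
  while (p != pend && is_integer(*p)) {
    i = i * 10 + uint64_t(*p - '0');
    ++p;
  }
}
```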
Parsed Number

Two fields have been added to `parsed_number_string`, containing slices of the integer and fraction digits. This is extremely cheap, since the work is already done and the strings are pre-tokenized during parsing. This allows us, on overflow, to re-parse these tokenized strings without checking whether each character is a digit. Likewise, for the big-integer algorithms, we can merely re-parse the pre-tokenized strings.
Slow Algorithm

The new algorithm is `digit_comp`, which takes the parsed number string and the `adjusted_mantissa` from `compute_float`. The significant digits are parsed into a big integer, and the exponent relative to the significant digits is calculated. If the exponent is >= 0, we use `positive_digit_comp`; otherwise, we use `negative_digit_comp`.
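A hedged sketch of the dispatch; helper names such as `scientific_exponent` and `parse_mantissa` are assumptions:

```cpp
// Sketch only: parse the pre-tokenized digits into a bigint, compute the
// exponent relative to those digits, and branch on its sign.
template <typename T>
adjusted_mantissa digit_comp(const parsed_number_string &num,
                             adjusted_mantissa am) noexcept {
  int32_t sci_exp = scientific_exponent(num); // exponent in scientific form
  size_t max_digits = binary_format<T>::max_digits();
  size_t digits = 0;
  bigint bigmant;
  parse_mantissa(bigmant, num, max_digits, digits); // re-parse tokenized digits
  // exponent of the parsed digits relative to the decimal point
  int32_t exponent = sci_exp + 1 - int32_t(digits);
  if (exponent >= 0) {
    am = positive_digit_comp<T>(bigmant, exponent);
  } else {
    am = negative_digit_comp<T>(bigmant, am, exponent);
  }
  return am;
}
```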
`positive_digit_comp` is quite simple: we scale the significant digits to the exponent, then get the high 64 bits for the native float, determine whether any lower bits were truncated, and use that to direct rounding.

`negative_digit_comp` is a little more complex, but still straightforward: we use the parsed significant digits as the real digits, and calculate the theoretical digits from `b+h`, the halfway point between `b` and `b+u`, the next-positive float. To get `b`, we round the adjusted mantissa down, create an extended-precision representation, and calculate the halfway point. We now have a base-10 exponent for the real digits and a base-2 exponent for the theoretical digits. We scale the two to the same exponent by multiplying the theoretical digits by `5**-real_exp`. We then get the base-2 exponent as `theor_exp - real_exp`; if this is positive, we multiply the theoretical digits by it, otherwise we multiply the real digits by it. Now both are scaled to the same magnitude, and we simply compare the digits in the big integers and use that to direct rounding.
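A sketch of the scaling step just described; the variable names follow the prose, and `pow5`/`pow2` are the power-multiplication methods from the Big Integer Implementation section:

```cpp
// Sketch only: bring the real and theoretical digits to the same scale.
// real_exp < 0 on this path, so -real_exp fits in a uint32_t.
int32_t pow2_exp = theor_exp - real_exp; // base-2 exponent difference
uint32_t pow5_exp = uint32_t(-real_exp);
if (pow5_exp != 0) {
  FASTFLOAT_ASSERT(theor_digits.pow5(pow5_exp)); // theoretical *= 5^-real_exp
}
if (pow2_exp > 0) {
  FASTFLOAT_ASSERT(theor_digits.pow2(uint32_t(pow2_exp)));
} else if (pow2_exp < 0) {
  FASTFLOAT_ASSERT(real_digits.pow2(uint32_t(-pow2_exp)));
}
// both bigints are now at the same magnitude; their ordering directs rounding
```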