Reflect the new paper on Dragonbox #2750

Merged
merged 15 commits into from
Feb 13, 2022

@jk-jeon jk-jeon commented Feb 9, 2022

The new paper is here: https://github.com/jk-jeon/dragonbox/blob/master/other_files/Dragonbox.pdf
This PR mostly reflects the parts of the new paper that the previous PR did not cover.

  1. Implemented the "divisibility check and divide" trick based on Lemire's paper mentioned in the previous PR.
  2. Changed the magic numbers for the log / division computations. I removed the binary expansions of log10(2) and friends because the magic numbers now a priori do not need to have anything to do with the binary expansions. A more comprehensive explanation can be found in the paper.
  3. In the new paper, beta is defined to be one less than the beta of the previous paper, so I renamed beta_minus_1 to beta.
  4. The branch that removes trailing zeros is the one taken for most short inputs, so it is better to check for that branch first.
  5. I did some experiments on trailing zero removal and concluded that the implementation in this PR is the fastest. Separating the divisibility checks for powers of 2 and powers of 5 does not seem to be a good idea, so I removed the usage of ctz and friends and introduced rotr instead. I think Dragonbox is the only user of ctz in fmt, so you can probably remove ctz and friends from fmt now. Also, interestingly, a simple loop seems to outperform binary search. This makes a lot of sense for binary64, because 64-bit constants in general must first be loaded into a register and cannot be used as immediates, but it is quite surprising that the loop is still better even for binary32. Looking at the assembly, the reason seems to be that for the binary search the compiler tries too hard to get rid of branches and ends up generating more instructions. Note, however, that remove_trailing_zeros now falls into an infinite loop if the input is 0. I believe 0 cannot be fed into remove_trailing_zeros in the current implementation, but you should make sure nobody feeds 0 into Dragonbox.
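
As a rough illustration of item 5, the loop below sketches how a rotr-based "divisibility check and divide" can strip trailing decimal zeros from a 32-bit value. The names rotr_sketch and remove_trailing_zeros_sketch are hypothetical and the code is not copied from this PR. The constant 0xcccccccd is the multiplicative inverse of 5 modulo 2^32, and the nonzero precondition mirrors the caveat above.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: rotate a 32-bit value right by r bits.
inline std::uint32_t rotr_sketch(std::uint32_t n, std::uint32_t r) noexcept {
  r &= 31;
  return (n >> r) | (n << ((32 - r) & 31));
}

// Removes trailing decimal zeros from n and returns how many were removed.
// Precondition: n != 0 (with n == 0 the loop below would never terminate).
inline int remove_trailing_zeros_sketch(std::uint32_t& n) noexcept {
  assert(n != 0);
  const std::uint32_t mod_inv_5 = 0xcccccccdu;             // 5^-1 mod 2^32
  const std::uint32_t mod_inv_25 = mod_inv_5 * mod_inv_5;  // 25^-1 mod 2^32
  const std::uint32_t umax = std::numeric_limits<std::uint32_t>::max();

  int s = 0;
  while (true) {
    // n is a multiple of 100 exactly when q <= umax / 100; then q == n / 100.
    std::uint32_t q = rotr_sketch(n * mod_inv_25, 2);
    if (q > umax / 100) break;
    n = q;
    s += 2;
  }
  // One more check for a single remaining factor of 10.
  std::uint32_t q = rotr_sketch(n * mod_inv_5, 1);
  if (q <= umax / 10) {
    n = q;
    s += 1;
  }
  return s;
}
```

Multiplying by the inverse of 25 undoes the odd part of a division by 100, and the rotation by 2 both divides by 4 and moves any nonzero low bits to the top, so a single comparison covers the power-of-2 and power-of-5 checks together.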

@vitaut vitaut left a comment

Thanks a lot for the PR! Looks great, just a few minor comments inline.

@@ -819,6 +817,16 @@ struct uint128_wrapper {
}
};

// Compilers should be able to optimize this into the ror instruction.
inline std::uint32_t rotr(uint32_t n, uint32_t r) noexcept {
Contributor

nit: I suggest dropping std:: in the return type for consistency (and the same in the overload below).

Any reason this is not constexpr?

Contributor Author

@jk-jeon jk-jeon Feb 13, 2022

@vitaut Sorry, I think I was in a rush for some reason 😅 corrected now.
@miscco Added, thanks for pointing it out.

EDIT: @miscco actually, I reverted the change to keep the code C++11-compatible. There should be ways to work around that of course, but I don't think doing so is a good idea because it would likely make compilers less likely to recognize the pattern and reduce it to ror.

Contributor

@jk-jeon
FMT_CONSTEXPR - macro for C++14 constexpr.

Contributor Author

@phprus oooohhh didn't know that, thanks for letting me know=)

Ah sorry, I thought this should have been possible with C++11. Thanks @phprus for pointing out the workaround
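
To make the constexpr discussion above concrete, here is a minimal sketch of a rotr that is constexpr only when C++14 is available. SKETCH_CONSTEXPR14 is a hypothetical stand-in for FMT_CONSTEXPR, not fmt's actual definition, and rotr32_sketch is an illustrative name. The body uses a statement before the return, which C++11 constexpr functions do not allow; that is why a macro of this kind is needed.

```cpp
#include <cstdint>

// Hypothetical stand-in for FMT_CONSTEXPR: expands to constexpr from C++14 on
// and to nothing otherwise. (Some compilers report an old __cplusplus value
// unless configured otherwise; the function is then simply not constexpr.)
#if __cplusplus >= 201402L
#  define SKETCH_CONSTEXPR14 constexpr
#else
#  define SKETCH_CONSTEXPR14
#endif

// Rotates n right by r bits. Masking both shift counts keeps r == 0 and
// r >= 32 well defined; this is the kind of pattern compilers are expected to
// reduce to a single ror instruction.
SKETCH_CONSTEXPR14 inline std::uint32_t rotr32_sketch(std::uint32_t n,
                                                      std::uint32_t r) noexcept {
  r &= 31;  // a statement before the return: not allowed in C++11 constexpr
  return (n >> r) | (n << ((32 - r) & 31));
}
```

Folding everything into one return expression would satisfy C++11 as well, at the cost of a less recognizable rotate idiom, which is the trade-off raised in the thread.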

@@ -895,86 +903,72 @@ inline uint64_t umul96_lower64(uint32_t x, uint64_t y) noexcept {
// Computes floor(log10(pow(2, e))) for e in [-1700, 1700] using the method from
Contributor

[-1700, 1700] -> [-2620, 2620]

Contributor Author

Corrected.
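
For readers who want to see what such a magic-number computation looks like, here is a small sketch of floor(log10(2^e)) over [-2620, 2620] using one multiplication and one shift. The multiplier 315653 and the shift 20 are assumptions of this sketch rather than values quoted from the diff shown here, and both function names are hypothetical. The second function compares the sketch against a floating-point reference.

```cpp
#include <cassert>
#include <cmath>

// Sketch of floor(log10(2^e)) for e in [-2620, 2620]. 315653 / 2^20 is
// slightly above log10(2); the constants are assumptions of this sketch.
inline int floor_log10_pow2_sketch(int e) noexcept {
  assert(e >= -2620 && e <= 2620);
  static_assert((-1 >> 1) == -1, "right shift is expected to be arithmetic");
  return (e * 315653) >> 20;
}

// Checks the sketch against a floating-point reference over [lo, hi].
inline bool matches_reference(int lo, int hi) {
  const double log10_2 = std::log10(2.0);
  for (int e = lo; e <= hi; ++e) {
    const int expected = static_cast<int>(std::floor(e * log10_2));
    if (floor_log10_pow2_sketch(e) != expected) return false;
  }
  return true;
}
```

Because the multiplier only has to give the right floor on a bounded range, it does not need to be a truncated binary expansion of log10(2); any constant that passes a check like matches_reference over the whole range will do.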

@@ -895,86 +903,72 @@ inline uint64_t umul96_lower64(uint32_t x, uint64_t y) noexcept {
// Computes floor(log10(pow(2, e))) for e in [-1700, 1700] using the method from
// https://fmt.dev/papers/Grisu-Exact.pdf#page=5, section 3.4.
Contributor

I guess we should link to the Dragonbox paper now.

Contributor Author

Yeah, I linked to your copy of the paper (fmt.dev/papers/Dragonbox.pdf), assuming you will replace the copy in the near future.

Comment on lines 1846 to 1851
      if (q <= std::numeric_limits<uint32_t>::max() / 100) {
        n = q;
        s += 2;
      } else {
        break;
      }
Contributor

Why not use early exit, i.e.

      if (q > std::numeric_limits<uint32_t>::max() / 100) break;
      n = q;
      s += 2;

?

Contributor Author

Great suggestion, I modified the code accordingly.

@vitaut vitaut merged commit 083510f into fmtlib:master Feb 13, 2022

vitaut commented Feb 13, 2022

Merged, thank you!

vitaut commented Feb 17, 2022

@jk-jeon, have you thought about handling fixed precision in Dragonbox? Is it even possible, at least for small precision?

jk-jeon commented Feb 18, 2022

> @jk-jeon, have you thought about handling fixed precision in Dragonbox?

I haven't, at least recently.

> Is it even possible, at least for small precision?

Might be, but I'm not sure. I don't even know how you do that with Grisu 😋.

jk-jeon commented Feb 18, 2022

> @jk-jeon, have you thought about handling fixed precision in Dragonbox? Is it even possible, at least for small precision?

Okay, so I did a thought experiment in the meantime, and I think something like that should be possible for small precision.

It seems quite simple actually. Just find an appropriate exponent k so that when we multiply our floating-point number by 10^k, the integer part of the result is representable in uint32_t or uint64_t. Find the minimum possible integer part of the result; that gives a lower bound on how many decimal digits we can compute precisely. To keep this lower bound from being too small, we may need to normalize subnormal numbers.

To compute the integer part of the product, we do what we have been doing in Dragonbox: multiply by the cached 10^k. Then we measure the decimal length of the resulting number, cut as many digits as needed, and perform proper rounding. To do that we may need to know whether our number is an integer or not, which we can already figure out from the computation of the integer part (which is what the previous PR was mainly about).

We may need to rerun the correctness analysis, but I'm pretty sure it will be alright.

Once this is materialized (which I have no plan for doing right now, but I can assist/discuss things with anyone who wants to try it or something similar), I guess it even should not be really called Dragonbox anymore, as it doesn't do anything similar. Except for the multiplication with 10^k, but that's basically something that all the modern floating-point parsing/formatting algorithms developed recently are doing.
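
The "measure the decimal length, cut digits, and round" step described above could look roughly like the sketch below. It covers only that step, not the multiplication by the cached 10^k, and every name in it is hypothetical. Rounding to nearest with ties to even is an assumption of the sketch; the is_integer flag stands for the integer check mentioned above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: the kept digits and how many digits were cut.
struct rounded_sketch {
  std::uint64_t digits;
  int dropped;
};

// Number of decimal digits of n (1 for n == 0).
inline int decimal_length_sketch(std::uint64_t n) noexcept {
  int len = 1;
  while (n >= 10) {
    n /= 10;
    ++len;
  }
  return len;
}

// Keeps the leading `precision` digits of the integer part n, rounding to
// nearest with ties to even (an assumption of this sketch). `is_integer` is
// false when the exact value is n plus a nonzero fraction, which turns an
// apparent tie into a round-up.
inline rounded_sketch cut_and_round_sketch(std::uint64_t n, bool is_integer,
                                           int precision) noexcept {
  assert(precision >= 1);
  int drop = decimal_length_sketch(n) - precision;
  if (drop <= 0) return {n, 0};  // nothing to cut; more digits are out of scope

  std::uint64_t pow10 = 1;
  for (int i = 0; i < drop; ++i) pow10 *= 10;  // at most 10^19, fits in 64 bits

  std::uint64_t q = n / pow10;
  const std::uint64_t r = n % pow10;
  const std::uint64_t half = pow10 / 2;
  if (r > half || (r == half && (!is_integer || (q & 1) != 0))) ++q;

  // Rounding up can carry into a new digit, e.g. 999|61 -> 1000.
  if (decimal_length_sketch(q) > precision) {
    q /= 10;
    ++drop;
  }
  return {q, drop};
}
```

For instance, under these assumptions 12250 cut to three digits stays at 122 when the value is an exact integer, but becomes 123 when a nonzero fraction follows, which is why the integer check matters for correct rounding.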

jk-jeon commented Aug 9, 2022

Hi Victor,

I've recently been taking a shot at fixed-precision formatting in my spare time. The good news is that I think it is possible to come up with a configurable method of trading off cache table size against performance. What do you think is the maximum size of the table (which would be there in addition to the Dragonbox table) that fmt can afford?

vitaut commented Aug 9, 2022

Ideally the smaller the better but I think we can live with a few more kiB of data. There is no exact budget.

jk-jeon commented Jan 2, 2023

So I've been thinking about what I can do with the fixed-precision case. Here is what I came up with.

The algorithm I recently advertised is composed of two parts, one for the first few digits (covered by the Dragonbox table) and another for further digits (covered by an additional table). I think it is probably too early to adopt the second part, but meanwhile we can discuss the adoption of the first part. (The "garbage digits" case is less important anyway.)

Assuming double, for normal inputs the first part can generate at least 18 digits (up to 19 digits, which is the largest number of digits that always fits in a 64-bit integer), and for subnormal inputs it can generate at least 3 digits. It is actually possible to make it generate 18 digits even for subnormal inputs if we augment the Dragonbox table a little bit. (I think the compressed version of the table only needs one more entry. Or maybe two more.)

I don't know if this adoption will make the performance better or worse though. Grisu uses lower precision, so it might be faster. But we no longer need to rely on the Dragon4 fallback if the number of digits is at most 18. Also, we can get rid of the Grisu table.

What do you think?

Thanks.

vitaut commented Jan 2, 2023

This sounds like a good plan. The performance will be more predictable without the fallback for the common case and we could get rid of Grisu tables which is an improvement. A regression for the non-fallback case is OK if it's not too big.

vitaut pushed a commit that referenced this pull request Jan 14, 2023
Implement the formatting algorithm for small given precision discussed in #3262 and #2750