Reflect the new paper on Dragonbox #2750

jk-jeon · 2022-02-09T02:55:31Z

The new paper is here: https://github.com/jk-jeon/dragonbox/blob/master/other_files/Dragonbox.pdf
This PR mostly applies the changes from the new paper that were not covered by the previous PR.

Implemented the "divisibility check and divide" trick based on Lemire's paper mentioned in the previous PR.
Magic numbers for log / division computations are changed. I removed the binary expansions of log10(2) and friends because the magic numbers no longer need to have anything to do with the binary expansions a priori. A more comprehensive explanation can be found in the paper.
In the new paper, we define beta to be one less than the beta from the previous paper. So I renamed beta_minus_1 to beta.
The branch that removes the trailing zeros is the branch chosen for most short inputs, so it would be better to check for that branch first.
I did some experiments on trailing zero removal, and concluded that the implementation in this PR is the fastest. Separating the divisibility checks for powers of 2 and powers of 5 does not seem to be a good idea, so I removed the usage of ctz and friends, and introduced rotr instead. I think Dragonbox is the only user of ctz in fmt, so you can probably remove ctz and friends from fmt now. Also, interestingly, a simple loop seems to outperform binary search. I think this makes a lot of sense for binary64, because 64-bit constants in general must first be loaded into a register and cannot be used as immediates, but it is quite surprising that the loop is still better even for binary32. Looking at the assembly, the reason seems to be that for the binary search the compiler tries too hard to get rid of branches and ends up generating more instructions. However, it should be noted that remove_trailing_zeros now falls into an infinite loop if the input is 0. I believe 0 cannot be fed into remove_trailing_zeros in the current implementation, but you should make sure nobody feeds 0 into Dragonbox.
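As a rough illustration of the combined "divisibility check and divide" idea with rotr, here is a minimal sketch for 32-bit inputs. It is not a copy of the code in this PR: the names rotr32 and remove_trailing_zeros_sketch are made up for this example, and the assert stands in for whatever guard the caller uses against a zero input. Multiplying by the modular inverse of 25 and rotating right by 2 yields n / 100 exactly when n is divisible by 100, and a value above UINT32_MAX / 100 otherwise, so a single comparison both tests divisibility and produces the quotient.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Rotate right by r bits; r is assumed to be in [1, 31] in this sketch.
inline std::uint32_t rotr32(std::uint32_t n, unsigned r) noexcept {
  return (n >> r) | (n << (32 - r));
}

// Strips trailing decimal zeros from n in place and returns how many
// were removed. Hypothetical name, for illustration only.
inline int remove_trailing_zeros_sketch(std::uint32_t& n) noexcept {
  assert(n != 0);  // n == 0 would make the loop below spin forever
  const std::uint32_t mod_inv_5 = 0xcccccccdu;   // 5^-1 mod 2^32
  const std::uint32_t mod_inv_25 = 0xc28f5c29u;  // 25^-1 mod 2^32
  const std::uint32_t max = (std::numeric_limits<std::uint32_t>::max)();

  int s = 0;
  while (true) {
    // q == n / 100 if 100 divides n; otherwise q > max / 100.
    std::uint32_t q = rotr32(n * mod_inv_25, 2);
    if (q > max / 100) break;
    n = q;
    s += 2;
  }
  // At most one more factor of 10 can remain.
  std::uint32_t q = rotr32(n * mod_inv_5, 1);
  if (q <= max / 10) {
    n = q;
    s |= 1;
  }
  return s;
}
```

Calling it on 1200 should leave 12 and report two removed zeros; calling it on 0 trips the assert instead of looping forever.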

vitaut

Thanks a lot for the PR! Looks great, just a few minor comments inline.

vitaut · 2022-02-12T16:20:40Z

include/fmt/format-inl.h

@@ -819,6 +817,16 @@ struct uint128_wrapper {
  }
 };

+// Compilers should be able to optimize this into the ror instruction.
+inline std::uint32_t rotr(uint32_t n, uint32_t r) noexcept {


nit: I suggest dropping std:: in the return type for consistency (and the same in the overload below).

Any reason this is not constexpr?

@vitaut Sorry, I think I was in a rush for some reason 😅 corrected now.
@miscco Added, thanks for pointing out.

EDIT: @miscco actually, I reverted the change to keep the code C++11-compatible. There should be ways to work around that of course, but I don't think doing so is a good idea because it would likely make compilers less likely to recognize the pattern and reduce it down to ror.

@jk-jeon
FMT_CONSTEXPR - macro for C++14 constexpr.

@phprus oooohhh didn't know that, thanks for letting me know=)

Ah sorry, I thought this should have been possible with C++11. Thanks @phprus for pointing out the workaround.
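For context on the constexpr discussion above: C++11 restricts a constexpr function body to essentially a single return statement, so a rotate that first masks its shift count in a separate statement only compiles as constexpr from C++14 on. The sketch below shows one possible single-expression form that is valid C++11 constexpr. It is an illustration under that assumption, not the code that was merged, and rotr32_cx is a made-up name. Whether compilers still lower this form to a single ror instruction is exactly the concern raised in the thread and would need to be checked in the generated assembly.

```cpp
#include <cstdint>

// Single-expression rotate right, usable as C++11 constexpr.
// Both shift counts are masked to [0, 31], so r == 0 and r == 32 are safe.
constexpr std::uint32_t rotr32_cx(std::uint32_t n, std::uint32_t r) noexcept {
  return (n >> (r & 31)) | (n << ((32 - (r & 31)) & 31));
}

// Compile-time checks of the sketch.
static_assert(rotr32_cx(0x12345678u, 8) == 0x78123456u, "rotate by 8");
static_assert(rotr32_cx(1u, 1) == 0x80000000u, "wrap around");
static_assert(rotr32_cx(0x12345678u, 0) == 0x12345678u, "rotate by 0");
```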

vitaut · 2022-02-12T16:23:42Z

include/fmt/format-inl.h

@@ -895,86 +903,72 @@ inline uint64_t umul96_lower64(uint32_t x, uint64_t y) noexcept {
 // Computes floor(log10(pow(2, e))) for e in [-1700, 1700] using the method from


[-1700, 1700] -> [-2620, 2620]
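To make the range remark concrete, here is a small sketch of the multiply-and-shift computation of floor(log10(2^e)). The multiplier 315653 and the shift by 20 are one constant pair that is valid on [-2620, 2620]; they are used here as an assumption for illustration, and the names floor_log10_pow2_sketch and floor_log10_pow2_reference are invented for this example. The reference function uses floating point and exists only to cross-check the integer version over the whole range.

```cpp
#include <cassert>
#include <cmath>

// Integer-only floor(log10(2^e)) for e in [-2620, 2620].
// 315653 / 2^20 is slightly above log10(2); the error stays small
// enough on this range that the floor is unaffected.
inline int floor_log10_pow2_sketch(int e) noexcept {
  assert(e >= -2620 && e <= 2620);
  // Negative e relies on arithmetic right shift, which is
  // implementation-defined before C++20.
  static_assert((-1 >> 1) == -1, "right shift is assumed to be arithmetic");
  return (e * 315653) >> 20;
}

// Slow floating-point reference, used only for cross-checking.
inline int floor_log10_pow2_reference(int e) {
  return static_cast<int>(std::floor(e * std::log10(2.0)));
}
```

Looping e over [-2620, 2620] and comparing the two functions is a cheap way to confirm the documented range whenever the constants change.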

vitaut · 2022-02-12T16:24:42Z

include/fmt/format-inl.h

@@ -895,86 +903,72 @@ inline uint64_t umul96_lower64(uint32_t x, uint64_t y) noexcept {
 // Computes floor(log10(pow(2, e))) for e in [-1700, 1700] using the method from
 // https://fmt.dev/papers/Grisu-Exact.pdf#page=5, section 3.4.


I guess we should link to the Dragonbox paper now.

Yeah, I linked to your copy of the paper (fmt.dev/papers/Dragonbox.pdf), assuming you will replace the copy in the near future.

vitaut · 2022-02-12T16:29:55Z

include/fmt/format-inl.h

+    if (q <= std::numeric_limits<uint32_t>::max() / 100) {
+      n = q;
+      s += 2;
+    } else {
+      break;
+    }


Why not use early exit, i.e.

if (q > std::numeric_limits<uint32_t>::max() / 100) break;
n = q;
s += 2;

?

Great suggestion, modified the code according to it.

vitaut · 2022-02-13T14:15:27Z

Merged, thank you!

vitaut · 2022-02-17T23:12:57Z

@jk-jeon, have you thought about handling fixed precision in Dragonbox? Is it even possible, at least for small precision?

jk-jeon · 2022-02-18T01:14:48Z

@jk-jeon, have you thought about handling fixed precision in Dragonbox?

I haven't, at least recently.

Is it even possible, at least for small precision?

Might be, but I'm not sure. I don't even know how you do that with Grisu 😋.

jk-jeon · 2022-02-18T12:05:33Z

@jk-jeon, have you thought about handling fixed precision in Dragonbox? Is it even possible, at least for small precision?

Okay, so I did a thought experiment in the meantime, and I think something like that should be possible for small precision.

It seems quite simple actually. Just find an appropriate exponent k so that when we multiply our floating-point number by 10^k, the integer part of the result is representable in uint32_t or uint64_t. Find the minimum possible integer part of the result, and we get a lower bound on how many decimal digits we can compute precisely. To keep this lower bound from being too small, we may need to normalize subnormal numbers.

To compute the integer part of the result of the multiplication, we do what we have been doing in Dragonbox: multiply by the cached 10^k. Then we measure the decimal length of the resulting number, cut as many digits as needed, and perform proper rounding. In order to do that we may need to know whether our number is an integer or not, which we can already figure out from the computation of the integer part (which is what the previous PR was mainly about).
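The "measure the length, cut digits, round" step described above could look roughly like the following sketch. Everything here is hypothetical and only illustrates the idea: the names decimal_length, rounded and cut_and_round are invented, round-half-to-even is assumed as the rounding rule, and the multiplication by the cached 10^k is taken as already done, with inexact telling us whether a nonzero fractional part was truncated.

```cpp
#include <cassert>
#include <cstdint>

// Number of decimal digits of n (0 counts as one digit).
inline int decimal_length(std::uint64_t n) noexcept {
  int len = 1;
  while (n >= 10) {
    n /= 10;
    ++len;
  }
  return len;
}

struct rounded {
  std::uint64_t digits;  // kept digits after rounding
  int dropped;           // number of decimal digits cut off
};

// Keeps the leading `precision` digits of n, rounding half to even.
// `inexact` is true if the real value is strictly greater than n.
// If rounding carries (e.g. 9996 -> 1000), `digits` has one extra digit
// and the caller has to adjust the exponent.
inline rounded cut_and_round(std::uint64_t n, bool inexact,
                             int precision) noexcept {
  assert(precision >= 1);
  int len = decimal_length(n);
  if (precision >= len) return {n, 0};  // nothing to cut

  std::uint64_t divisor = 1;
  for (int i = 0; i < len - precision; ++i) divisor *= 10;

  std::uint64_t q = n / divisor;
  std::uint64_t r = n % divisor;
  std::uint64_t half = divisor / 2;
  // An exact tie needs an exact input; a truncated fraction breaks it upward.
  bool round_up = r > half || (r == half && (inexact || (q & 1) != 0));
  if (round_up) ++q;
  return {q, len - precision};
}
```

This is where the integer check matters: 12250 cut to three digits gives 122 if the value is exactly an integer, but 123 if a fractional part was truncated.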

We may need to rerun the correctness analysis, but I'm pretty sure it will be alright.

Once this is materialized (which I have no plan for doing right now, but I can assist/discuss things with anyone who wants to try it or something similar), I guess it even should not be really called Dragonbox anymore, as it doesn't do anything similar. Except for the multiplication with 10^k, but that's basically something that all the modern floating-point parsing/formatting algorithms developed recently are doing.

jk-jeon · 2022-08-09T03:44:12Z

Hi Victor,

I've recently been taking a shot at fixed-precision formatting in my spare time. The good news is that I think it is possible to come up with a configurable method of trading off between the cache table size and performance. What do you think is the maximum size of the table (which will be there in addition to the Dragonbox table) that fmt can afford?

vitaut · 2022-08-09T16:09:42Z

Ideally the smaller the better but I think we can live with a few more kiB of data. There is no exact budget.

jk-jeon · 2023-01-02T01:41:37Z

So I've been thinking about what I can do with the fixed-precision case. Here is what I thought.

The algorithm I recently advertised is composed of two parts, one for the first few digits (covered by the Dragonbox table) and another for further digits (covered by an additional table). I think it is probably too early to adopt the second part, but meanwhile we can discuss the adoption of the first part. (The "garbage digits" case is less important anyway.)

Assuming double, for normal inputs the first part can generate at least 18 digits (up to 19 digits, which is the largest number of digits for 64-bit integers), and for subnormal inputs it can generate at least 3 digits. It is actually possible to make it generate 18 digits even for subnormal inputs if we augment the Dragonbox table a little bit. (I think the compressed version of the table only needs one more entry. Or maybe two more.)

I don't know if this adoption will make the performance better or worse though. Grisu uses less precision, so it might be faster. But we would no longer need to rely on the Dragon4 fallback if the number of digits is at most 18. Also, we can get rid of the Grisu table.

What do you think?

Thanks.

vitaut · 2023-01-02T15:19:43Z

This sounds like a good plan. The performance will be more predictable without the fallback for the common case and we could get rid of Grisu tables which is an improvement. A regression for the non-fallback case is OK if it's not too big.

jk-jeon added 8 commits February 8, 2022 18:21

Reflect the new paper

ddaac49

- Change constants appearing in log & division computations - Rename beta_minus_1 to beta

Check r < deltai first, because that is the major branch chosen for s…

fc6ceea

…hort inputs

Add rotr

3b803dc

Optimize remove_trailing_zeros

b2208fa

Recover log10_2_significand

cdb0d18

Remove literal separator to satisfy some compilers

2e53828

Fix typo

d758274

Fix some conversion issues

a662666

vitaut reviewed Feb 12, 2022

jk-jeon added 7 commits February 13, 2022 03:23

Remove std:: infront of uint32_t/64_t & add constexpr to rotr

a71005a

Fix wrong comment/refer to a correct reference

fc907d2

Simplify remove_trailing_zeros

cba730d

Remove some C-style casts for consistency

56139c3

Simplify remove_trailing_zeros

96017bc

Revert adding constexpr to rotr to satisfy C++11 compilers

b3b7b69

Add FMT_CONSTEXPR to rotr instead

ba752b9

vitaut merged commit 083510f into fmtlib:master Feb 13, 2022

This was referenced Jan 6, 2023

Try a new formatting algorithm for float/double with a small given precision #3262

Closed

Implement a new formatting algorithm for small given precision #3269

Merged

vitaut pushed a commit that referenced this pull request Jan 14, 2023

Implement a new formatting algorithm for small given precision (#3269)

0f42c17

Implement the formatting algorithm for small given precision discussed in #3262 and #2750

jk-jeon commented Feb 9, 2022

vitaut left a comment

vitaut Feb 12, 2022

miscco Feb 12, 2022

jk-jeon Feb 13, 2022 (edited)

phprus Feb 13, 2022

jk-jeon Feb 13, 2022

miscco Feb 13, 2022

vitaut Feb 12, 2022

jk-jeon Feb 13, 2022

vitaut Feb 12, 2022

jk-jeon Feb 13, 2022

vitaut Feb 12, 2022

jk-jeon Feb 13, 2022

vitaut commented Feb 13, 2022

vitaut commented Feb 17, 2022

jk-jeon commented Feb 18, 2022

jk-jeon commented Feb 18, 2022 (edited)

jk-jeon commented Aug 9, 2022

vitaut commented Aug 9, 2022

jk-jeon commented Jan 2, 2023 (edited)

vitaut commented Jan 2, 2023
