Simplify Dragonbox #2713

jk-jeon · 2022-01-13T01:29:22Z

Main differences:

1. Cache entries are slightly changed to allow simpler cache recovery and integer checks.
2. Cache recovery from the compressed cache is simplified a little bit, thanks to 1.
3. check_divisibility_and_divide_by_pow5() and related code segments are simplified. I expect them to now compile into fewer instructions.
4. Integer checks are simplified a lot; all subroutines for integer checks are now completely eliminated.

On my machine, performance does not seem to be affected much, though I expected a small improvement. (I ran Milo's benchmark several times, but the results were largely inconsistent.) The bottlenecks are probably elsewhere. The code size should be a bit smaller, though, because the divisibility check tables are gone.

vitaut

Thanks for the PR!

vitaut · 2022-01-13T22:10:27Z

include/fmt/format-inl.h

+#if FMT_HAS_BUILTIN(__builtin_addcll)
+    unsigned long long carry;
+    low_ = __builtin_addcll(low_, n, 0, &carry);
+    high_ = __builtin_addcll(high_, 0, carry, &carry);


Can

high_ = __builtin_addcll(high_, 0, carry, &carry);

be replaced with

high_ += carry;

or is there any benefit to using a builtin?

And similar comment in other conditionally compiled branches.

I did an experiment: https://godbolt.org/z/Yzf1T9ed8
They seem to generate the same code.
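To make the question above concrete, here is a minimal portable sketch of the two ways of propagating the carry. It does not use `__builtin_addcll` itself; `uint128_sketch`, `add_with_carry`, `add_plain`, and `add_chained` are hypothetical names introduced only for this illustration, and `add_with_carry` merely imitates the shape of an add-with-carry primitive.

```cpp
#include <cstdint>

// Hypothetical stand-in for the PR's 128-bit wrapper.
struct uint128_sketch {
  std::uint64_t high;
  std::uint64_t low;
};

// Variant A: detect the carry out of the low word, then `high += carry`.
constexpr uint128_sketch add_plain(uint128_sketch x, std::uint64_t n) {
  std::uint64_t sum = x.low + n;         // wraps modulo 2^64
  std::uint64_t carry = sum < n ? 1 : 0; // wrapped iff the sum is below n
  return {x.high + carry, sum};
}

// Imitation of an add-with-carry primitive: returns a + b + carry_in and
// reports the carry out of bit 63 through carry_out.
constexpr std::uint64_t add_with_carry(std::uint64_t a, std::uint64_t b,
                                       std::uint64_t carry_in,
                                       std::uint64_t& carry_out) {
  std::uint64_t s1 = a + b;
  std::uint64_t c1 = s1 < a ? 1 : 0;
  std::uint64_t s2 = s1 + carry_in;
  std::uint64_t c2 = s2 < s1 ? 1 : 0;
  carry_out = c1 | c2;
  return s2;
}

// Variant B: chain two add-with-carry steps, as in the diff under review.
constexpr uint128_sketch add_chained(uint128_sketch x, std::uint64_t n) {
  std::uint64_t carry = 0;
  std::uint64_t low = add_with_carry(x.low, n, 0, carry);
  std::uint64_t high = add_with_carry(x.high, 0, carry, carry);
  return {high, low};
}
```

Both variants compute the same 128-bit sum; whether a compiler emits identical instructions for them is what the godbolt experiment above looks at.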

vitaut · 2022-01-13T22:16:38Z

include/fmt/format-inl.h

+  constexpr uint64_t high() const noexcept { return high_; }
+  constexpr uint64_t low() const noexcept { return low_; }
+
+  uint128_wrapper& operator+=(uint64_t n) & noexcept {


& is probably not needed.
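For context, the trailing `&` is an lvalue ref-qualifier: it prevents `operator+=` from being called on a temporary. A minimal sketch of the difference, using hypothetical types (`with_ref_qualifier`, `without_ref_qualifier`) and a hypothetical detection trait (`accepts_rvalue_lhs`) that are not part of fmt:

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

// operator+= restricted to lvalues by the trailing `&`.
struct with_ref_qualifier {
  std::uint64_t value;
  with_ref_qualifier& operator+=(std::uint64_t n) & noexcept {
    value += n;
    return *this;
  }
};

// Same operator without the ref-qualifier: callable on temporaries too.
struct without_ref_qualifier {
  std::uint64_t value;
  without_ref_qualifier& operator+=(std::uint64_t n) noexcept {
    value += n;
    return *this;
  }
};

// Detects whether `T{} += n` compiles, i.e. whether an rvalue is accepted
// on the left-hand side. std::declval<T>() yields an rvalue of type T.
template <typename T, typename = void>
struct accepts_rvalue_lhs : std::false_type {};

template <typename T>
struct accepts_rvalue_lhs<
    T, decltype(void(std::declval<T>() += std::uint64_t{1}))>
    : std::true_type {};
```

For an internal helper type that is only ever used through named variables, the qualifier changes nothing in practice, which is consistent with the review remark.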

vitaut · 2022-01-13T22:20:53Z

include/fmt/format-inl.h

-  uint128_wrapper& operator+=(uint64_t n) FMT_NOEXCEPT {
-#  if defined(_MSC_VER) && defined(_M_X64)
-    unsigned char carry = _addcarry_u64(0, low_, n, &low_);
+  constexpr uint128_wrapper(uint64_t high, uint64_t low) noexcept


Here and elsewhere please keep FMT_NOEXCEPT.

Sorry, I made the change.

jk-jeon · 2022-01-14T00:19:30Z

Hmm. I probably need to run some experiments, but I suspect the cache recovery can be made even simpler, because the error it introduces might not affect the final result. If you are otherwise satisfied, please hold this for a while!

To be more precise, what I'm talking about is this:

As you already know, the gigantic table that Dragonbox relies on consists of approximations to 5^k (or equivalently 10^k) with the leading 1 aligned at the MSB. The table is compressed into a much smaller one by storing 5^k only for certain k. At runtime, a missing entry is recovered by multiplying an existing entry by an appropriate 5^a and then realigning the result so that the leading 1 is still at the MSB. Since the table entries are already approximations, this inevitably introduces some error, so the recovered value is slightly off from what is in the full table. What I did was to store those errors in a separate table so that we can add them back and recover the missing entry exactly. However, it is likely that this small error does not actually matter, as the computed value is probably still a sufficiently good approximation to the true value. So let me do some analysis to see if I can prove this claim.
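The recovery step described above can be illustrated with a toy model. This is only a sketch under loud assumptions: it uses 32-bit mantissas, exponents 14 to 27, and a stride of 4, none of which match the real Dragonbox tables, and every name in it (`pow5_approx`, `realign`, `full_entry`, `recovered_entry`, `recovery_error`, `min_k`, `stride`) is hypothetical.

```cpp
#include <cstdint>

// Approximates 5^k as mantissa * 2^exponent, leading 1 of mantissa at bit 31.
struct pow5_approx {
  std::uint32_t mantissa;
  int exponent;
};

// Exact 5^k; fits in 64 bits for k <= 27.
constexpr std::uint64_t pow5(int k) {
  std::uint64_t p = 1;
  for (int i = 0; i < k; ++i) p *= 5;
  return p;
}

// Truncates x (assumed >= 2^31) so that its leading 1 lands on bit 31.
constexpr pow5_approx realign(std::uint64_t x, int exponent) {
  while ((x >> 32) != 0) {
    x >>= 1;
    ++exponent;
  }
  return {static_cast<std::uint32_t>(x), exponent};
}

// What a full (uncompressed) toy table would hold for 14 <= k <= 27.
constexpr pow5_approx full_entry(int k) { return realign(pow5(k), 0); }

constexpr int min_k = 14;  // toy range start (assumption)
constexpr int stride = 4;  // toy compression stride (assumption)

// Recovers the entry for k from the nearest stored entry below it:
// multiply by 5^a, then realign. No error table is consulted.
constexpr pow5_approx recovered_entry(int k) {
  int a = (k - min_k) % stride;
  pow5_approx base = full_entry(k - a);  // stands in for a table lookup
  return realign(static_cast<std::uint64_t>(base.mantissa) * pow5(a),
                 base.exponent);
}

// Gap between the exact 5^k and the recovered approximation.
constexpr std::uint64_t recovery_error(int k) {
  pow5_approx r = recovered_entry(k);
  return pow5(k) - (static_cast<std::uint64_t>(r.mantissa) << r.exponent);
}
```

In this toy model the recovered value never exceeds 5^k and stays within a tiny relative distance of it. Whether that is good enough for the actual algorithm is exactly the claim the analysis has to establish; the sketch does not prove it.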

jk-jeon · 2022-01-15T13:12:05Z

Sorry for the delay. I found a severe mistake in my proof of correctness of the new integer checks, and to fix it I had to learn some more refined mathematical tools and come up with more sophisticated arguments. It was quite substantial, actually. But now I believe the proof is correct. And yeah, we indeed do not need to add the recovery error back, which is great :)

jk-jeon · 2022-01-18T22:38:16Z

Hold on, I need to verify several more cases.

jk-jeon · 2022-01-18T23:41:40Z

Okay, it turned out not to be a problem.

vitaut · 2022-01-20T00:08:13Z

include/fmt/format-inl.h

  static constexpr struct {
    uint32_t magic_number;
-    int bits_for_comparison;
-    uint32_t threshold;
-    int shift_amount;
-  } infos[] = {{0xcccd, 16, 0x3333, 18}, {0xa429, 8, 0x0a, 20}};
+    int margin_bits;
+    int divisibility_check_bits;
+  } infos[] = {{0x199a, 8, 8}, {0xa3d71, 10, 16}};
  constexpr auto info = infos[N - 1];
  n *= info.magic_number;
-  const uint32_t comparison_mask = (1u << info.bits_for_comparison) - 1;
-  bool result = (n & comparison_mask) <= info.threshold;
-  n >>= info.shift_amount;
+  n >>= info.margin_bits;
+  const uint32_t comparison_mask = (1u << info.divisibility_check_bits) - 1;
+  bool result = (n & comparison_mask) == 0;
+  n >>= info.divisibility_check_bits;


Is there an explanation of this logic somewhere, particularly what magic numbers are?

Not yet, I'm still working on rewriting the paper. But it's probably simple enough that I can explain it here. Roughly speaking, those magic numbers are still binary digits of 1/5 and 1/25.

What we are trying to do here is to find integers m, k, and l such that

1. floor(n/d) = floor(nm / 2^(k+l)) where d=10 or d=100,

2. floor(nm/2^k) mod 2^l = 0 if and only if n is divisible by d.

(Here, m is magic_number, k is margin_bits, and l is divisibility_check_bits.)

So if these are possible, we first multiply n by the magic number m and shift right by k. The lower l bits of the resulting number are zero if and only if n is divisible by d. We then shift right again by l to discard those bits, and the result is precisely floor(n/d).

The first item is nothing but what the classical paper by Granlund-Montgomery tells us, so let's take it for granted; we set m to be the ceiling of 2^(k+l)/d for large enough k+l. (To be a bit more specific, k+l >= 16 for float and k+l >= 26 for double are more than enough.)

The rationale behind the second item is that, since m is almost equal to 2^(k+l)/d, if n is divisible by d then nm should be almost a multiple of 2^(k+l). And 2^k bounds how far nm can deviate from being an exact multiple of 2^(k+l).

To make this idea precise, note that the first item is equivalent to

floor(n/d) <= nm/2^(k+l) < floor(n/d) + 1.

Now, let n = dq + r with 0 <= r < d, then this becomes

q <= nm/2^(k+l) < q + 1,

so multiplying 2^l yields

q * 2^l <= nm/2^k < (q+1) * 2^l.

Hence, floor(nm/2^k) mod 2^l is just floor(nm/2^k) - q * 2^l.

Now, we want two things:

floor(nm/2^k) - q * 2^l = 0 if r = 0, and

floor(nm/2^k) - q * 2^l != 0 if r != 0.

The first one can be rewritten as

q * 2^l <= nm/2^k < q * 2^l + 1,

and since we already have the first inequality, it is equivalent to

nm < q * 2^(k+l) + 2^k.

Using n = dq, we can rewrite it as

q(dm - 2^(k+l)) < 2^k,

and since m = ceil(2^(k+l)/d), we have 0 <= dm - 2^(k+l) < d, so it is sufficient to have

qd = n < 2^k.

Since we know the range of n (0 to 100 for float and 0 to 1000 for double), we can easily figure out that k >= 7 for float and k >= 10 for double are sufficient to ensure that r = 0 implies floor(nm/2^k) mod 2^l = 0.

On the other hand, the second one can be rewritten as

nm/2^k >= q * 2^l + 1,

or equivalently,

(dq + r) * m >= q * 2^(k+l) + 2^k,

or equivalently,

rm >= q(2^(k+l) - dm) + 2^k.

Since dm - 2^(k+l) > 0 (so the first term on the right-hand side is not positive) and r >= 1, we can see that having

m >= 2^k

is sufficient to ensure that r != 0 implies floor(nm/2^k) mod 2^l != 0. Or, to get a simpler condition: since m >= 2^(k+l)/d, it is sufficient to have

2^(k+l)/d >= 2^k,

or equivalently

2^l >= d.

Hence, l >= 4 for float and l >= 7 for double are enough. (Of course, we set l = 8 for float and l = 16 for double to make mod 2^l a trivial operation.)

I also did an exhaustive test of this here to make sure things are done correctly.
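As an illustration of the scheme derived above, here is a self-contained sketch modeled on the diff hunk under discussion, together with a brute-force comparison against plain division. The constants (0x199a, 8, 8) and (0xa3d71, 10, 16) are taken from the diff. The `div_info` struct and the `matches_plain_division` helper are hypothetical names added for this example, and this is not the exhaustive test linked in the comment.

```cpp
#include <cstdint>

struct div_info {
  std::uint32_t magic_number;   // m = ceil(2^(k+l) / d)
  int margin_bits;              // k
  int divisibility_check_bits;  // l
};

// N = 1 divides by 10 (n <= 100), N = 2 divides by 100 (n <= 1000).
// Replaces n with floor(n / 10^N) and returns whether the original n
// was divisible by 10^N.
template <int N>
constexpr bool check_divisibility_and_divide_by_pow10(std::uint32_t& n) {
  constexpr div_info infos[] = {{0x199a, 8, 8}, {0xa3d71, 10, 16}};
  constexpr div_info info = infos[N - 1];
  n *= info.magic_number;
  n >>= info.margin_bits;  // drop the k margin bits
  const std::uint32_t comparison_mask =
      (1u << info.divisibility_check_bits) - 1;
  const bool result = (n & comparison_mask) == 0;  // low l bits all zero?
  n >>= info.divisibility_check_bits;              // now n == floor(n / d)
  return result;
}

// Compares the trick against ordinary % and / for every n in [0, max_n].
template <int N>
constexpr bool matches_plain_division(std::uint32_t max_n, std::uint32_t d) {
  for (std::uint32_t n = 0; n <= max_n; ++n) {
    std::uint32_t q = n;
    const bool divisible = check_divisibility_and_divide_by_pow10<N>(q);
    if (divisible != (n % d == 0) || q != n / d) return false;
  }
  return true;
}
```

Both products stay well inside 32 bits for the stated ranges (100 * 0x199a and 1000 * 0xa3d71), so no overflow handling is needed in this sketch.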

Ah, by the way, I got this idea from Schubfach. I don't know where the author of Schubfach got this, but I guess this kind of idea is probably already common and can be found here and there. Things like this and this are possibly related.

Thanks a lot for the detailed explanation!

My pleasure! I'll get back to you when I finish writing things up. The paper will contain several sharper results generalizing/unifying this and several other tricks I have been relying on.

Also, if time permits, I will work on an alternative idea about fixed-precision printing, based on a generalization of Dr. Lemire's paper I linked. I hope it ends up with something I can contribute again.

So in this paper, a better method is introduced which allows us to save yet one more instruction:

    const uint32_t magic_number = (1u << 16) / divisor + 1;
    const uint32_t comparison_mask = (1u << 16) - 1;
    n *= magic_number;
    bool result = (n & comparison_mask) < magic_number;
    n >>= 16;

(This method, the one I explained, and the classical divisibility test by Granlund-Montgomery using modular inverse, are in fact all just special cases of a more general result that will appear in the paper I'm writing. Basically, what this method is doing is not very different from what I explained. Indeed, floor(nm/2^k) mod 2^l = 0 is equivalent to nm mod 2^(k+l) < 2^k, and the thing is that there is no terribly important reason to have a power of 2 on the right-hand side now. And it turns out that we can replace it by m.)

I don't think this code will perform measurably better, but it's shorter and presumably better, so why not.
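A runnable version of the shorter variant, wrapped in a function and checked against plain division over the ranges mentioned earlier in the thread (n up to 100 for divisor 10, n up to 1000 for divisor 100). The function names `check_and_divide` and `agrees_with_plain_division` are hypothetical, and the sketch assumes those input ranges; it makes no claim about larger n.

```cpp
#include <cstdint>

// Replaces n with floor(n / divisor) and returns whether the original n
// was divisible by divisor. Assumes n <= 100 for divisor 10 and
// n <= 1000 for divisor 100.
constexpr bool check_and_divide(std::uint32_t& n, std::uint32_t divisor) {
  const std::uint32_t magic_number = (1u << 16) / divisor + 1;
  const std::uint32_t comparison_mask = (1u << 16) - 1;
  n *= magic_number;
  // n * m mod 2^16 is compared against m itself rather than a power of 2.
  const bool result = (n & comparison_mask) < magic_number;
  n >>= 16;
  return result;
}

// Compares the trick against ordinary % and / for every n in [0, max_n].
constexpr bool agrees_with_plain_division(std::uint32_t max_n,
                                          std::uint32_t divisor) {
  for (std::uint32_t n = 0; n <= max_n; ++n) {
    std::uint32_t q = n;
    const bool divisible = check_and_divide(q, divisor);
    if (divisible != (n % divisor == 0) || q != n / divisor) return false;
  }
  return true;
}
```

Compared with the version in the diff, this drops the separate margin shift: one multiply, one mask and compare, one shift.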

vitaut · 2022-01-20T01:03:57Z

Merged, thanks!

jk-jeon added 8 commits January 12, 2022 15:00

Minimize the usage of built-in 128-bit ints

23eeacc

It usually generates slower code than manual handling.

Simplify Dragonbox Step 3.

c19fddf

Replace Dragonbox cache which allows simpler cache recovery & integer…

eb79cb4

… checks

Simplify integer checks

5be03aa

Remove now-unused stuffs

e79a9d7

Fix typo

5bd18bd

Remove now-unused stuffs

54a6b6b

Fix syntax errors

3da98e4

vitaut reviewed Jan 13, 2022

jk-jeon added 3 commits January 13, 2022 15:19

Replace noexcept with FMT_NOEXCEPT

33dc682

Simplify lines with __builtin_addcll and friends

238333d

Make a fallback path more compiler-friendly

b4f3892

jk-jeon added 2 commits January 15, 2022 04:57

Simplify cache recovery

365d081

Improve comments

15be3b0

Improve comments

6e3dd0c

vitaut approved these changes Jan 20, 2022

vitaut merged commit 6240d02 into fmtlib:master Jan 20, 2022

vitaut added a commit that referenced this pull request Jan 23, 2022

Clarify the choice of magic numbers and compute the most magic one

f7a809b

#2713 (comment)
