Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify Dragonbox #2713

Merged
merged 14 commits into from
Jan 20, 2022
Merged

Simplify Dragonbox #2713

merged 14 commits into from
Jan 20, 2022

Conversation

jk-jeon
Copy link
Contributor

@jk-jeon jk-jeon commented Jan 13, 2022

Main differences:

  1. Cache entries are slightly changed to allow simpler cache recovery and integer checks.
  2. Cache recovery from the compressed cache is simplified a little bit, thanks to 1.
  3. check_divisibility_and_divide_by_pow5() and related code segments are simplified. I expect them to be now compiled into fewer instructions.
  4. Integer checks are simplified a lot; all subroutines for integer checks are now completely eliminated.

In my machine, it seems that the performance is not very much affected, though I expected a little bit of improvement. (I ran Milo's benchmark several times, but the results were largely inconsistent.) Probably because bottlenecks are on other places. But the code size will be a bit smaller, because now we do not have the divisibility check tables.

Copy link
Contributor

@vitaut vitaut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

#if FMT_HAS_BUILTIN(__builtin_addcll)
unsigned long long carry;
low_ = __builtin_addcll(low_, n, 0, &carry);
high_ = __builtin_addcll(high_, 0, carry, &carry);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can

high_ = __builtin_addcll(high_, 0, carry, &carry);

be replaced with

high_ += carry;

or is there any benefit to using a builtin?

And similar comment in other conditionally compiled branches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some experiment: https://godbolt.org/z/Yzf1T9ed8
Seems like they generate the same code.

constexpr uint64_t high() const noexcept { return high_; }
constexpr uint64_t low() const noexcept { return low_; }

uint128_wrapper& operator+=(uint64_t n) & noexcept {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

& is probably not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

uint128_wrapper& operator+=(uint64_t n) FMT_NOEXCEPT {
# if defined(_MSC_VER) && defined(_M_X64)
unsigned char carry = _addcarry_u64(0, low_, n, &low_);
constexpr uint128_wrapper(uint64_t high, uint64_t low) noexcept
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere please keep FMT_NOEXCEPT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, did the change.

@jk-jeon
Copy link
Contributor Author

jk-jeon commented Jan 14, 2022

Hmm. Probably I have to do some experiment, but I guess the cache recovery can be made even simpler, because the error it introduces might not introduce actual error. Please hold this for a while if you are satisfied otherwise!

To be more precise, what I'm talking about is this:

As you already know, the gigantic table that Dragonbox relies on is consisting of approximations to 5^k's (or equivalently 10^k's) with the leading 1 aligned at the MSB. The way the table is compressed down to a much smaller one is that we just store 5^k's for certain k's, and in the runtime it tries to recover a possibly missing entry it needs by multiplying appropriate 5^a to an existing entry and then realign it so that the leading 1 is still at the MSB. Since the table entries are already approximations, this will inevitably introduce some amount of error, so the resulting value is slightly off compared to what's in the full table, so what I did is to store those errors into a separate table so that we can add those errors back to correctly recover the missing entry we are interested in. However, it is likely that this small amount of error actually doesn't matter as the computed value is probably still a sufficiently good approximation to the true value. So let me do some analysis to see if I can prove this claim.

@jk-jeon
Copy link
Contributor Author

jk-jeon commented Jan 15, 2022

Sorry for the delay, I found a severe mistake in my proof of correctness of the new integer checks, and in order to fix that I had to learn some more refined mathematical tools and come up with some more sophisticated arguments. It was quite substantial, actually. But now I believe the proof is correct. And yeah, we indeed do not need to add the recovery error back, which is great :)

@jk-jeon
Copy link
Contributor Author

jk-jeon commented Jan 18, 2022

Hold on, I need to verify several more cases.

@jk-jeon
Copy link
Contributor Author

jk-jeon commented Jan 18, 2022

Okay, turned out to be not a problem.

Comment on lines 931 to +941
static constexpr struct {
uint32_t magic_number;
int bits_for_comparison;
uint32_t threshold;
int shift_amount;
} infos[] = {{0xcccd, 16, 0x3333, 18}, {0xa429, 8, 0x0a, 20}};
int margin_bits;
int divisibility_check_bits;
} infos[] = {{0x199a, 8, 8}, {0xa3d71, 10, 16}};
constexpr auto info = infos[N - 1];
n *= info.magic_number;
const uint32_t comparison_mask = (1u << info.bits_for_comparison) - 1;
bool result = (n & comparison_mask) <= info.threshold;
n >>= info.shift_amount;
n >>= info.margin_bits;
const uint32_t comparison_mask = (1u << info.divisibility_check_bits) - 1;
bool result = (n & comparison_mask) == 0;
n >>= info.divisibility_check_bits;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an explanation of this logic somewhere, particularly what magic numbers are?

Copy link
Contributor Author

@jk-jeon jk-jeon Jan 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, I'm still working on rewriting the paper. But it's probably simple enough so that I can explain it here. Roughly speaking, those magic numbers are still binary digits of 1/5 and 1/25.

What we are trying to do here is to find integers m and k and l such that

  1. floor(n/d) = floor(nm / 2^(k+l)) where d=10 or d=100,
  2. floor(nm/2^k) mod 2^l = 0 if and only if n is divisible by d.

(Here, m is magic_number, k is margin_bits, and l is divisibility_check_bits.)

So if these are possible, then we first multiply n with the magic number m and shift to right by k. The lower l bits of the resulting number is zero if and only if n is divisible by d, and then we shift to right again by l to wipe out those bits, then the result must be precisely floor(n/d).

The first item is nothing but what the classical paper by Granlund-Montgomery tells us, so let's take it as granted; we set m to be the ceiling of 2^(k+l)/d for large enough k+l. (To be a bit more specific, k+l >= 16 for float and k+l >= 26 for double are more than enough.)

The rationale behind the second item is that, since m is almost equal to 2^(k+l)/d, so if n is divisible by d, then nm should be almost a multiple of 2^(k+l). And 2^k is how far nm can deviate from truly being a multiple of 2^(k+l).

To make this idea precise, note that 1 is equivalent to

  floor(n/d) <= nm/2^(k+l) < floor(n/d) + 1.

Now, let n = dq + r with 0 <= r < d, then this becomes

  q <= nm/2^(k+l) < q + 1,

so multiplying 2^l yields

  q * 2^l <= nm/2^k < (q+1) * 2^l.

Hence, floor(nm/2^k) mod 2^l is just floor(nm/2^k) - q * 2^l.

Now, we want two things:

  1. floor(nm/2^k) - q * 2^l = 0 if r = 0, and
  2. floor(nm/2^k) - q * 2^l != 0 if r != 0.

The first one can be rewritten as

  q * 2^l <= nm/2^k < q * 2^l + 1,

and since we already have the first inequality, it is equivalent to

  nm < q * 2^(k+l) + 2^k.

Using n = dq, we can rewrite it as

  q(dm - 2^(k+l)) < 2^k,

and since m = ceil(2^(k+l)/d), we have 0 <= dm - 2^(k+l) < d, so it is sufficient to have

  qd = n < 2^k.

Since we know the range of n (0~100 for float and 0~1000 for double), we can easily figure out that k >= 7 for float and k >= 10 for double are sufficient to ensure that r = 0 implies floor(nm/2^k) mod 2^l = 0.

On the other hand, the second one can be rewritten as

  nm/2^k >= q * 2^l + 1,

or equivalently,

  (dq + r) * m >= q * 2^(k+l) + 2^k,

or equivalently,

  rm >= q(2^(k+l) - dm) + 2^k.

Since dm - 2^(k+l) > 0, we can see that having

  m >= 2^k

is sufficient to ensure that r != 0 implies floor(nm/2^k) mod 2^l != 0. Or to be a bit more lenient, it is sufficient to have

  2^(k+l)/d >= 2^k,

or equivalently

  2^l >= d.

Hence, l >= 4 for float and l >= 7 for double are enough. (Of course, we set l = 8 for float and l = 16 for double to make mod 2^l a trivial operation.)

I also did an exhaustive test of this here to make sure things are done correctly.

Ah, by the way, I got this idea from Schubfach. I don't know where the author of Schubfach got this, but I guess this kind of idea is probably already common and can be found here and there. Things like this and this are possibly related.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the detailed explanation!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My pleasure! I'll get back to you when I finish writing things up. The paper will contain several sharper results generalizing/unifying this and several other tricks I have been relying on.

Also, if time permits, I will work on an alternative idea about fixed-precision printing, based on a generalization of Dr. Lemire's paper I linked. I hope it ends up with something I can contribute again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in this paper, a better method is introduced which allows us to save yet one more instruction:

  const uint32_t magic_number = (1u << 16) / divisor + 1;
  const uint32_t comparison_mask = (1u << 16) - 1;
  n *= magic_number;
  bool result = (n & comparison_mask) < magic_number;
  n >>= 16;

(This method, the one I explained, and the classical divisibility test by Granlund-Montgomery using modular inverse, are in fact all just special cases of a more general result that will appear in the paper I'm writing. Basically, what this method is doing is not very different from what I explained. Indeed, floor(nm/2^k) mod 2^l = 0 is equivalent to nm mod 2^(k+l) < 2^k, and the thing is that there is no terribly important reason to have a power of 2 on the right-hand side now. And it turns out that we can replace it by m.)

I don't think this code will perform measurably better, but it's shorter and presumably better, so why not.

@vitaut vitaut merged commit 6240d02 into fmtlib:master Jan 20, 2022
@vitaut
Copy link
Contributor

vitaut commented Jan 20, 2022

Merged, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants