An optimal algorithm for bounded random integers #39143
Conversation
Everyone knows that generating an unbiased random integer in a range 0 ..< upperBound, where upperBound is not a power of two, requires rejection sampling. What if I told you that Big Random Number has lied to us for decades, and we have been played for absolute fools?

Previously Swift used Lemire's "nearly divisionless" method (https://arxiv.org/abs/1805.10941) for this operation. This PR introduces a novel algorithm that:

- never divides
- avoids rejection sampling entirely
- achieves a theoretically optimal bound on the amount of randomness consumed to generate a sample
- delivers actual performance improvements for most real cases

Lemire interprets each word from the random source as a fixed-point number in [0, 1), multiplies by upperBound, and takes the floor. Up to this point, this is the algorithm suggested by Knuth in TAoCP vol 2, and as observed by Knuth, it is slightly biased. Lemire cleverly corrects this bias via rejection sampling, which requires one division in the general case (hence, "nearly divisionless").

Our new algorithm takes a different approach. Rather than using rejection sampling, we observe that the bias decreases exponentially in the number of bits used for the computation. In the limit we are interpreting the bitstream from the random source as a uniform real number r in [0, 1), and ⌊r * upperBound⌋ provides an unbiased sample in 0 ..< upperBound. The only challenge, then, is to know when we have computed enough bits of the product to know what the result is.

Observe that we can split the random stream at any bit position i, yielding r = r₀ + r₁ with r₀ a fixed-point number in [0, 1) and 0 ≤ r₁ < 2⁻ⁱ. Further observe that:

    result = ⌊r * upperBound⌋
           = ⌊r₀ * upperBound⌋ + ⌊frac(r₀ * upperBound) + r₁ * upperBound⌋

The first term of this expression is just Knuth's biased sample, which is computed with just a full-width multiply. If i > log₂(upperBound), both summands in the second term are smaller than 1, so the second term is either 0 or 1.
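As a quick concrete check on the first term (a toy Python enumeration, not the PR's Swift code), Knuth's multiply-shift sample for an 8-bit word and upperBound = 3 exhibits exactly the slight bias described:

```python
from collections import Counter

W = 8            # toy word size, small enough to enumerate every word
UPPER_BOUND = 3

def knuth_sample(word, upper_bound, w=W):
    # Interpret `word` as the fixed-point number r0 = word / 2^w in [0, 1);
    # floor(r0 * upper_bound) is the high part of the full-width product.
    return (word * upper_bound) >> w

counts = Counter(knuth_sample(word, UPPER_BOUND) for word in range(1 << W))
print(dict(counts))  # {0: 86, 1: 85, 2: 85} -- outcome 0 slightly favored
```

With 256 inputs and 3 outputs an exact split is impossible, so one outcome is over-represented; the bias shrinks as 2⁻ⁱ as more bits are used.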
Applying the bound on r₁, we see that if frac(r₀ * upperBound) ≤ 1 - upperBound * 2⁻ⁱ, the second term is necessarily zero, and the first term is the unbiased result. Happily, this is _also_ a trivial computation on the low-order part of the full-width multiply.

If the test fails, we do not reject the sample, throwing away the bits we have already consumed from the random source; instead we increase i by a convenient amount, computing more bits of the product. This is the critical improvement: while Lemire has a probability of 1/2 to reject for each word consumed in the worst case, we have a probability of terminating of 1/2 for each _bit_ consumed. This reduces the worst-case expected number of random bits required from O(log₂(upperBound)) to log₂(upperBound) + O(1), which is optimal[1].

Of more practical interest, this new algorithm opens an intriguing possibility: we can compute just 64 extra bits, and have a probability of 1 - 2⁻⁶⁴ of terminating. This is so close to certainty that we can simply stop unconditionally without introducing any measurable bias (detecting any difference would require about 2¹²⁸ samples, which is prohibitive). This is a significant performance improvement for slow random generators, since it asymptotically halves the number of bits required for bignums, while matching the expected number of bits required for smaller numbers. This is the algorithm implemented by this PR (the formally-uniform method is not much more complex to implement and is only a little slower, but there's no reason to use it).

More intriguing still, this algorithm can be made unconditional by removing the early out, so that every value computed requires word size + 64 bits from the stream. This breaks the loop-carried dependency for fast generators, unlocking vectorization and parallelization where they were previously impossible.
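The early-out loop above can be sketched as follows (a Python sketch of the formally-uniform variant for illustration; function and variable names are mine, and the PR's Swift implementation of the truncated variant differs in detail). If the low part of the product leaves no room for a carry, we return immediately; otherwise we compute one more word of the product rather than rejecting:

```python
import random

def bounded_random(upper_bound, w=64, rand_bits=random.getrandbits):
    """Unbiased sample in 0 ..< upper_bound (0 < upper_bound <= 2^w),
    with no division and no rejection: a failed early-out test just
    extends the product by w more bits."""
    full = 1 << w
    # First term: Knuth's biased sample (hi); lo / 2^w is frac(r0 * upper_bound).
    hi, lo = divmod(rand_bits(w) * upper_bound, full)
    while lo > full - upper_bound:           # a carry into hi is still possible
        carry, next_lo = divmod(rand_bits(w) * upper_bound, full)
        top = lo + carry                     # next fractional word plus carry-out
        if top >= full:                      # product crossed the next integer
            return hi + 1
        if top < full - 1:                   # a zero bit appeared: carry impossible
            return hi
        lo = next_lo                         # top word is all ones; keep extending
    return hi
```

Each extra word settles the result unless the new top word of the fraction is all ones, which happens with probability about upperBound/2^w, matching the "probability 1/2 of terminating per bit" claim.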
This is an especially powerful advantage when paired with bitstream generators that allow skip-ahead, such as counter-based generators or PCG.

Note that it is _possible_ to employ Lemire's tighter early-out check, which involves a division, with this algorithm as well; this is beneficial in some cases when upperBound is a constant and the generator is slow, but we do not think it necessary given the new algorithm and other planned improvements.

[1] We can actually achieve log₂(upperBound) + ε for any ε > 0 by generating multiple random samples at once, but that is only of theoretical interest. It is still interesting, however, since I don't think anyone has described how to attain it previously.
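In sketch form, the unconditional "64 extra bits" variant described above collapses to a single widening multiply and shift (Python for illustration; the parameter names are mine, not the PR's):

```python
import random

def bounded_random_unconditional(upper_bound, w=64, extra=64,
                                 rand_bits=random.getrandbits):
    """Consume w + extra bits no matter what and keep the high word of
    the product. The residual bias is below 2^-extra; with extra = 64 no
    feasible number of samples can detect it, and there is no
    data-dependent branch or loop-carried dependency."""
    r = rand_bits(w + extra)                   # fixed-point r in [0, 1)
    return (r * upper_bound) >> (w + extra)    # floor(r * upper_bound)
```

Because every call consumes exactly w + extra bits, batches of samples can be generated independently, which is what enables the vectorization and skip-ahead pairing mentioned above.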
Related idea at: https://github.com/KWillets/range_generator

The approach is an application of the concept of the short product. I recently made use of it for a fast number parser; see the paper at https://arxiv.org/pdf/2101.11408.pdf (section 7).
```swift
///
/// Requires T.bitWidth >= 64.
@_transparent @usableFromInline
internal func multiplyHigh<T: FixedWidthInteger & UnsignedInteger>(
```
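For readers following along, the helper's contract is just the high word of the double-width product; a Python model of that contract (my sketch, not the stdlib code, which presumably assembles the result from half-width multiplies):

```python
def multiply_high(a, b, bit_width=64):
    """Return the high `bit_width` bits of a * b, where a and b are
    unsigned `bit_width`-bit integers. Python's arbitrary-precision ints
    model the double-width product exactly."""
    mask = (1 << bit_width) - 1
    assert 0 <= a <= mask and 0 <= b <= mask
    return (a * b) >> bit_width
```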
Could delete the ABI footprint of this by making it `@_alwaysEmitIntoClient`, I think. (Would also underscore the name.)
Since this is "doing numbers," it's worth pointing out here that the versions of this algorithm with early-outs can leak timing information, even if implemented with constant-time primitives (because the distribution of values that require an additional word to be computed is not uniform in the set of outputs). For Swift's standard library that's not a major issue, but it would be problematic for some other places such algorithms can be deployed. (This does not apply to the unconditional "just use 64 extra bits no matter what" version.)
NB: isn't this the "Fast Dice Roller" presented in Lumbroso (2013), "Optimal discrete uniform generation from coin flips, and applications"? In any case, congrats on beating CPython, PHP, the C stdlib, OpenSSL, OpenJDK, Node, Rust, and Go to the punch. This thread helped me find the existence of this algorithm far faster than I would have otherwise.
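For comparison, the Fast Dice Roller from Lumbroso (2013) draws one coin flip per iteration and likewise avoids divisions and never discards entropy; a Python transcription of the well-known pseudocode (my sketch, not the paper's listing):

```python
import random

def fast_dice_roller(n, flip=lambda: random.getrandbits(1)):
    """Uniform sample in 0 ..< n built from single coin flips.
    v counts equally likely states, c is the current label; each flip
    doubles the state space, and a failed range check recycles the
    leftover v - n states instead of throwing them away."""
    v, c = 1, 0
    while True:
        v, c = 2 * v, 2 * c + flip()
        if v >= n:
            if c < n:
                return c
            v, c = v - n, c - n   # reuse the remainder's entropy
```

Its expected flip count is also log₂(n) + O(1), though it consumes bits one at a time rather than a word at a time.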
Refer: swiftlang/swift#39143 for a description of the algorithm. It is optimal in the sense of having:
* no divisions
* minimal number of blocks of random bits from the generator
Refer: swiftlang/swift#39143 for a description of the algorithm. It is optimal in the sense of having:
* no divisions
* minimal number of blocks of random bits from the generator

Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com>
Reviewed-by: Matthias St. Pierre <Matthias.St.Pierre@ncp-e.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #22499)
Implement optimal uniform random number generator using the method proposed in swiftlang/swift#39143, based on OpenSSL's implementation of it in https://github.com/openssl/openssl/blob/1d2cbd9b5a126189d5e9bc78a3bdb9709427d02b/crypto/rand/rand_uniform.c#L13-L99. This PR also fixes some bugs found while developing it.

This is a replacement for #50203 and fixes the issues found by @IanButterworth with both rngs.

C rng: ![benchmark](https://github.com/user-attachments/assets/0dd9d5f2-17ef-4a70-b275-1d12692be060)

New scheduler rng: ![benchmark](https://github.com/user-attachments/assets/4abd0a57-a1d9-46ec-99a5-535f366ecafa)

~On my benchmarks the Julia implementation seems to be almost 50% faster than the current implementation.~ With Oscar's suggestion of removing the debiasing, this is now almost 5x faster than the original implementation, and almost fully branchless.

We might want to backport the two previous commits since they technically fix bugs.

---------

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>