An optimal algorithm for bounded random integers #39143
Conversation
Everyone knows that generating an unbiased random integer in a range 0 ..< upperBound, where upperBound is not a power of two, requires rejection sampling. What if I told you that Big Random Number has lied to us for decades, and we have been played for absolute fools?

Previously Swift used Lemire's "nearly divisionless" method (https://arxiv.org/abs/1805.10941) for this operation. This PR introduces a novel algorithm that:

- never divides
- avoids rejection sampling entirely
- achieves a theoretically optimal bound on the amount of randomness consumed to generate a sample
- delivers actual performance improvements for most real cases

Lemire interprets each word from the random source as a fixed-point number in [0, 1), multiplies by upperBound, and takes the floor. Up to this point, this is the algorithm suggested by Knuth in TAoCP vol 2, and as observed by Knuth, it is slightly biased. Lemire cleverly corrects this bias via rejection sampling, which requires one division in the general case (hence, "nearly divisionless").

Our new algorithm takes a different approach. Rather than using rejection sampling, we observe that the bias decreases exponentially in the number of bits used for the computation. In the limit we are interpreting the bitstream from the random source as a uniform real number r in [0, 1), and ⌊r * upperBound⌋ provides an unbiased sample in 0 ..< upperBound. The only challenge, then, is to know when we have computed enough bits of the product to know what the result is.

Observe that we can split the random stream at any bit position i, yielding r = r₀ + r₁ with r₀ a fixed-point number in [0, 1) and 0 ≤ r₁ < 2⁻ⁱ. Further observe that:

    result = ⌊r * upperBound⌋
           = ⌊r₀ * upperBound⌋ + ⌊frac(r₀ * upperBound) + r₁ * upperBound⌋

The first term of this expression is just Knuth's biased sample, which is computed with just a full-width multiply. If i > log₂(upperBound), both summands in the second term are smaller than 1, so the second term is either 0 or 1.
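As a quick concrete check on the first term (a toy Python enumeration, not the PR's Swift code), Knuth's multiply-shift sample for an 8-bit word and upperBound = 3 exhibits exactly the slight bias described:

```python
from collections import Counter

W = 8            # toy word size, small enough to enumerate every word
UPPER_BOUND = 3

def knuth_sample(word, upper_bound, w=W):
    # Interpret `word` as the fixed-point number r0 = word / 2^w in [0, 1);
    # floor(r0 * upper_bound) is the high part of the full-width product.
    return (word * upper_bound) >> w

counts = Counter(knuth_sample(word, UPPER_BOUND) for word in range(1 << W))
print(dict(counts))  # {0: 86, 1: 85, 2: 85} -- outcome 0 slightly favored
```

With 256 inputs and 3 outputs an exact split is impossible, so one outcome is over-represented; the bias shrinks as 2⁻ⁱ as more bits are used.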
Applying the bound on r₁, we see that if frac(r₀ * upperBound) ≤ 1 - upperBound * 2⁻ⁱ, the second term is necessarily zero, and the first term is the unbiased result. Happily, this is _also_ a trivial computation on the low-order part of the full-width multiply.

If the test fails, we do not reject the sample, throwing away the bits we have already consumed from the random source; instead we increase i by a convenient amount, computing more bits of the product. This is the critical improvement: while Lemire has a probability of 1/2 to reject for each word consumed in the worst case, we have a probability of terminating of 1/2 for each _bit_ consumed. This reduces the worst-case expected number of random bits required from O(log₂(upperBound)) to log₂(upperBound) + O(1), which is optimal[1].

Of more practical interest, this new algorithm opens an intriguing possibility: we can compute just 64 extra bits, and have a probability of 1 - 2⁻⁶⁴ of terminating. This is so close to certainty that we can simply stop unconditionally without introducing any measurable bias (detecting any difference would require about 2¹²⁸ samples, which is prohibitive). This is a significant performance improvement for slow random generators, since it asymptotically halves the number of bits required for bignums, while matching the expected number of bits required for smaller numbers. This is the algorithm implemented by this PR (the formally-uniform method is not much more complex to implement and is only a little slower, but there's no reason to use it).

More intriguing still, this algorithm can be made unconditional by removing the early out, so that every value computed requires word size + 64 bits from the stream. This breaks the loop-carried dependency for fast generators, unlocking vectorization and parallelization where they were previously impossible.
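The early-out loop above can be sketched as follows (a Python sketch of the formally-uniform variant for illustration; function and variable names are mine, and the PR's Swift implementation of the truncated variant differs in detail). If the low part of the product leaves no room for a carry, we return immediately; otherwise we compute one more word of the product rather than rejecting:

```python
import random

def bounded_random(upper_bound, w=64, rand_bits=random.getrandbits):
    """Unbiased sample in 0 ..< upper_bound (0 < upper_bound <= 2^w),
    with no division and no rejection: a failed early-out test just
    extends the product by w more bits."""
    full = 1 << w
    # First term: Knuth's biased sample (hi); lo / 2^w is frac(r0 * upper_bound).
    hi, lo = divmod(rand_bits(w) * upper_bound, full)
    while lo > full - upper_bound:           # a carry into hi is still possible
        carry, next_lo = divmod(rand_bits(w) * upper_bound, full)
        top = lo + carry                     # next fractional word plus carry-out
        if top >= full:                      # product crossed the next integer
            return hi + 1
        if top < full - 1:                   # a zero bit appeared: carry impossible
            return hi
        lo = next_lo                         # top word is all ones; keep extending
    return hi
```

Each extra word settles the result unless the new top word of the fraction is all ones, which happens with probability about upperBound/2^w, matching the "probability 1/2 of terminating per bit" claim.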
This is an especially powerful advantage when paired with bitstream generators that allow skip-ahead, such as counter-based generators or PCG.

Note that it is _possible_ to employ Lemire's tighter early-out check, which involves a division, with this algorithm as well; this is beneficial in some cases when upperBound is a constant and the generator is slow, but we do not think it necessary given the new algorithm and other planned improvements.

[1] We can actually achieve log₂(upperBound) + ε for any ε > 0 by generating multiple random samples at once, but that is only of theoretical interest. It is still interesting, however, since I don't think anyone has described how to attain it previously.
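In sketch form, the unconditional "64 extra bits" variant described above collapses to a single widening multiply and shift (Python for illustration; the parameter names are mine, not the PR's):

```python
import random

def bounded_random_unconditional(upper_bound, w=64, extra=64,
                                 rand_bits=random.getrandbits):
    """Consume w + extra bits no matter what and keep the high word of
    the product. The residual bias is below 2^-extra; with extra = 64 no
    feasible number of samples can detect it, and there is no
    data-dependent branch or loop-carried dependency."""
    r = rand_bits(w + extra)                   # fixed-point r in [0, 1)
    return (r * upper_bound) >> (w + extra)    # floor(r * upper_bound)
```

Because every call consumes exactly w + extra bits, batches of samples can be generated independently, which is what enables the vectorization and skip-ahead pairing mentioned above.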
Related idea at: https://github.com/KWillets/range_generator

The approach is an application of the concept of the short product. I recently made use of it for a fast number parser; see the paper at https://arxiv.org/pdf/2101.11408.pdf (section 7).
```swift
///
/// Requires T.bitWidth >= 64.
@_transparent @usableFromInline
internal func multiplyHigh<T: FixedWidthInteger & UnsignedInteger>(
```
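For readers following along, the helper's contract is just the high word of the double-width product; a Python model of that contract (my sketch, not the stdlib code, which presumably assembles the result from half-width multiplies):

```python
def multiply_high(a, b, bit_width=64):
    """Return the high `bit_width` bits of a * b, where a and b are
    unsigned `bit_width`-bit integers. Python's arbitrary-precision ints
    model the double-width product exactly."""
    mask = (1 << bit_width) - 1
    assert 0 <= a <= mask and 0 <= b <= mask
    return (a * b) >> bit_width
```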
Could delete the ABI footprint of this by making it `@_alwaysEmitIntoClient`, I think. (Would also underscore the name.)
Since this is "doing numbers," it's worth pointing out here that the versions of this algorithm with early-outs can leak timing information, even if implemented with constant-time primitives (because the distribution of values that require an additional word to be computed is not uniform in the set of outputs). For Swift's standard library that's not a major issue, but it would be problematic for some other places such algorithms can be deployed. (This does not apply to the unconditional "just use 64 extra bits no matter what" version.)
NB: isn't this the "Fast Dice Roller" presented in Lumbroso (2013), "Optimal discrete uniform generation from coin flips, and applications"? In any case, congrats on beating CPython, PHP, the C stdlib, OpenSSL, OpenJDK, Node, Rust, and Go to the punch. This thread helped me find the existence of this algorithm far faster than I would have otherwise.
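For comparison, the Fast Dice Roller from Lumbroso (2013) draws one coin flip per iteration and likewise avoids divisions and never discards entropy; a Python transcription of the well-known pseudocode (my sketch, not the paper's listing):

```python
import random

def fast_dice_roller(n, flip=lambda: random.getrandbits(1)):
    """Uniform sample in 0 ..< n built from single coin flips.
    v counts equally likely states, c is the current label; each flip
    doubles the state space, and a failed range check recycles the
    leftover v - n states instead of throwing them away."""
    v, c = 1, 0
    while True:
        v, c = 2 * v, 2 * c + flip()
        if v >= n:
            if c < n:
                return c
            v, c = v - n, c - n   # reuse the remainder's entropy
```

Its expected flip count is also log₂(n) + O(1), though it consumes bits one at a time rather than a word at a time.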
Refer: swiftlang/swift#39143 for a description of the algorithm. It is optimal in the sense of having:
* no divisions
* minimal number of blocks of random bits from the generator
Refer: swiftlang/swift#39143 for a description of the algorithm. It is optimal in the sense of having:
* no divisions
* minimal number of blocks of random bits from the generator

Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com>
Reviewed-by: Matthias St. Pierre <Matthias.St.Pierre@ncp-e.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #22499)
Implement optimal uniform random number generator using the method proposed in swiftlang/swift#39143, based on OpenSSL's implementation of it in https://github.com/openssl/openssl/blob/1d2cbd9b5a126189d5e9bc78a3bdb9709427d02b/crypto/rand/rand_uniform.c#L13-L99. This PR also fixes some bugs found while developing it.

This is a replacement for #50203 and fixes the issues found by @IanButterworth with both rngs.

C rng: ![benchmark](https://github.com/user-attachments/assets/0dd9d5f2-17ef-4a70-b275-1d12692be060)

New scheduler rng: ![benchmark](https://github.com/user-attachments/assets/4abd0a57-a1d9-46ec-99a5-535f366ecafa)

~On my benchmarks the Julia implementation seems to be almost 50% faster than the current implementation.~ With Oscar's suggestion of removing the debiasing, this is now almost 5x faster than the original implementation, and almost fully branchless.

We might want to backport the two previous commits since they technically fix bugs.

---------

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>