
An optimal algorithm for bounded random integers #39143

Draft
wants to merge 1 commit into base: main

Commits on Sep 2, 2021

  1. A novel optimal algorithm for bounded random integers

    Everyone knows that generating an unbiased random integer in the range 0 ..< upperBound, where upperBound is not a power of two, requires rejection sampling. What if I told you that Big Random Number has lied to us for decades, and we have been played for absolute fools?
    
    Previously Swift used Lemire's "nearly divisionless" method (https://arxiv.org/abs/1805.10941) for this operation. This PR introduces a novel algorithm that:
    
    - never divides
    - avoids rejection sampling entirely
    - achieves a theoretically optimal bound on the amount of randomness consumed to generate a sample
    - delivers actual performance improvements for most real cases
    
    Lemire interprets each word from the random source as a fixed-point number in [0, 1), multiplies by upperBound, and takes the floor. Up to this point, this is the algorithm suggested by Knuth in TAoCP vol 2, and as observed by Knuth, it is slightly biased. Lemire cleverly corrects this bias via rejection sampling, which requires one division in the general case (hence, "nearly divisionless").
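
    For reference, here is a minimal sketch of Lemire's method in Swift, assuming a 64-bit upperBound and a placeholder `next()` that returns uniform random UInt64 words (illustrative only, not the stdlib's actual code):

        func lemireBounded(_ upperBound: UInt64, next: () -> UInt64) -> UInt64 {
            precondition(upperBound != 0)
            // Full-width multiply: the high word is Knuth's biased sample.
            var m = next().multipliedFullWidth(by: upperBound)
            if m.low < upperBound {
                // Slow path: one division computes the rejection threshold
                // 2⁶⁴ % upperBound (0 &- upperBound wraps to 2⁶⁴ - upperBound).
                let threshold = (0 &- upperBound) % upperBound
                while m.low < threshold {
                    m = next().multipliedFullWidth(by: upperBound)
                }
            }
            return m.high  // ⌊r₀ * upperBound⌋, now unbiased
        }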
    
    Our new algorithm takes a different approach. Rather than using rejection sampling, we observe that the bias decreases exponentially in the number of bits used for the computation. In the limit we are interpreting the bitstream from the random source as a uniform real number r in [0, 1), and ⌊r * upperBound⌋ provides an unbiased sample in 0 ..< upperBound. The only challenge, then, is to determine when we have computed enough bits of the product to know the result.
    
    Observe that we can split the random stream at any bit position i, yielding r = r₀ + r₁, where r₀ is the fixed-point number in [0, 1) formed by the first i bits and 0 ≤ r₁ < 2⁻ⁱ. Further observe that:
    
        result = ⌊r * upperBound⌋
               = ⌊r₀ * upperBound + r₁ * upperBound⌋
               = ⌊r₀ * upperBound⌋ + ⌊frac(r₀ * upperBound) + r₁ * upperBound⌋
    
    The first term of this expression is just Knuth's biased sample, which is computed with just a full-width multiply.
    
    If i > log₂(upperBound), both summands in the second term are smaller than 1, so the second term is either 0 or 1. Applying the bound on r₁, we see that if frac(r₀ * upperBound) ≤ 1 - upperBound * 2⁻ⁱ, the second term is necessarily zero, and the first term is the unbiased result. Happily, this is _also_ a trivial computation on the low-order part of the full-width multiply.
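
    For example, take upperBound = 5 and i = 4, and suppose the first four bits of the stream are 1011, so r₀ = 11/16. Then r₀ * upperBound = 55/16: the biased sample is ⌊55/16⌋ = 3 and frac(r₀ * upperBound) = 7/16. Since 7/16 ≤ 1 - 5 * 2⁻⁴ = 11/16, the tail contributes r₁ * upperBound < 5/16, which cannot push the fraction past 1, so 3 is the unbiased result no matter which bits follow.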
    
    If the test fails, we do not reject the sample and throw away the bits we have already consumed from the random source; instead we increase i by a convenient amount, computing more bits of the product. This is the critical improvement: whereas Lemire's method rejects with probability up to 1/2 for each word consumed in the worst case, we terminate with probability 1/2 for each _bit_ consumed. This reduces the worst-case expected number of random bits required from O(log₂(upperBound)) to log₂(upperBound) + O(1), which is optimal[1].
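
    As an illustration, here is one way the formally-uniform loop might look in Swift for a 64-bit upperBound, refining the product one 64-bit word at a time (a sketch under those assumptions, not this PR's actual implementation; `next()` stands in for the random source):

        func boundedUniform(_ upperBound: UInt64, next: () -> UInt64) -> UInt64 {
            precondition(upperBound != 0)
            // First word: m.high is Knuth's biased sample, m.low is the first
            // fractional word of r * upperBound.
            var m = next().multipliedFullWidth(by: upperBound)
            var frontier = m.low
            // Invariant: every fractional word before `frontier` is all-ones,
            // so a carry out of `frontier` propagates straight into m.high.
            while frontier > 0 &- upperBound {   // early-out test failed
                let t = next().multipliedFullWidth(by: upperBound)
                let (sum, carry) = frontier.addingReportingOverflow(t.high)
                if carry { return m.high &+ 1 }  // carry reached the integer part
                if sum != .max { return m.high } // no future carry can reach it
                frontier = t.low                 // still all-ones; keep refining
            }
            return m.high
        }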
    
    Of more practical interest, this new algorithm opens an intriguing possibility: we can compute just 64 extra bits, and have a probability of 1 - 2⁻⁶⁴ of terminating. This is so close to certainty that we can simply stop unconditionally without introducing any measurable bias (detecting any difference would require about 2¹²⁸ samples, which is prohibitive). This is a significant performance improvement for slow random generators, since it asymptotically halves the number of bits required for bignums, while matching the expected number of bits required for smaller numbers. This is the algorithm implemented by this PR (the formally-uniform method is not much more complex to implement and is only a little slower, but there is no compelling reason to prefer it).
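
    A sketch of that variant for a 64-bit upperBound, in the same hypothetical style as above (not the PR's exact code):

        func boundedRandom(_ upperBound: UInt64, next: () -> UInt64) -> UInt64 {
            precondition(upperBound != 0)
            var m = next().multipliedFullWidth(by: upperBound)
            if m.low > 0 &- upperBound {  // early out failed, p ≈ upperBound/2⁶⁴
                // Fold in 64 more bits of the product, then stop unconditionally;
                // only the carry into the integer part matters.
                let carry = next().multipliedFullWidth(by: upperBound).high
                if m.low.addingReportingOverflow(carry).overflow { m.high &+= 1 }
            }
            return m.high
        }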
    
    More intriguing still, this algorithm can be made unconditional by removing the early out, so that every value computed requires word size + 64 bits from the stream, which breaks the loop-carried dependency for fast generators, unlocking vectorization and parallelization where it was previously impossible. This is an especially powerful advantage when paired with bitstream generators that allow skip-ahead such as counter-based generators or PCG.
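
    Sketched in the same style, the unconditional form has no data-dependent early exit, so successive samples can be computed independently:

        func boundedUnconditional(_ upperBound: UInt64, next: () -> UInt64) -> UInt64 {
            // Always consumes exactly 128 bits; no branches on the sample value.
            let m = next().multipliedFullWidth(by: upperBound)
            let carry = next().multipliedFullWidth(by: upperBound).high
            let overflow = m.low.addingReportingOverflow(carry).overflow
            return overflow ? m.high &+ 1 : m.high
        }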
    
    Note that it is _possible_ to combine this algorithm with Lemire's tighter early-out check, which involves a division; this is beneficial in some cases when upperBound is a constant and the generator is slow, but we do not think it necessary given the new algorithm and other planned improvements.
    
    [1] We can actually achieve log₂(upperBound) + ε for any ε > 0 by generating multiple random samples at once, but that is mainly of theoretical interest. It is still noteworthy, however, since I don't think anyone has previously described how to attain it.
    stephentyrone committed Sep 2, 2021
    Commit 87b3f60