Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compromising accuracy and speed [distributions] #494

Open
dhardy opened this issue Jun 5, 2018 · 33 comments
Open

Compromising accuracy and speed [distributions] #494

dhardy opened this issue Jun 5, 2018 · 33 comments
Labels
E-question Participation: opinions wanted

Comments

@dhardy
Copy link
Member

dhardy commented Jun 5, 2018

This is an issue that has been discussed in part several times already in other contexts. But I think it's now worth making an issue specifically about this to get extra feedback.


There are several cases within distribution algorithms that we've had to make decisions about speed-vs-accuracy trade-offs:

So far, other than with the suggested addition of a HighPrecision distribution (#372) (and the other float sampling methods Open01 and OpenClosed01), we have not given users any choice over this speed-vs-accuracy trade-off but have simply made the choice for users.

  • Should we try to give users the choice, some how?
  • Which criteria should we use to choose default implementations?
@dhardy dhardy added E-question Participation: opinions wanted T-distributions labels Jun 5, 2018
@dhardy
Copy link
Member Author

dhardy commented Jun 5, 2018

Suggestion 1: we just choose defaults with reasonable precision.

Since computing power is inherently limited, sufficiently small biases are unlikely to be noticeable. But how small is good enough? Modern CPUs run up to around 2^32 cycles/second, and a year has around 2^25 seconds, for around 2^57 cycles/year/core. The top supercomputers now have in excess of a million cores, say around 2^24 with around 100 PFlop/s, so 10^14 × 3*10^7 = 3e21 (approx 2^72) floating-point operations per year.

What does this tell us? A bias as small as 1 in 2^128 is unlikely to be detectable, ever (at least without quantum computers). But a bias of 1 in 2^53 (i.e. f64) is potentially detectable even with a desktop PC — in a serious application running non-stop for months or years. On the other hand, many applications wouldn't even notice bias as large as 1 in 2^10. So perhaps a one-size-fits-all solution isn't a good choice, if high-precision comes at a significant performance cost.

Can we say anything else? Above around 1 in 2^20 bias is hard to detect without specifically looking for it, and above 2^40 doing so requires looking quite hard, hence biases this small may be good enough for most purposes, though certainly not all.

(Sorry @pitdicker for paraphrasing a lot of what you said I while back; I forget where.)

@dhardy
Copy link
Member Author

dhardy commented Jun 5, 2018

Suggestion 2: use feature flags to control precision

If around 2^20-2^40 bits of precision is enough for most users, then what if we accept such choices as defaults, but add a high-precision feature flag to enable higher-precision algorithms?

There is a potential catch here — activation of higher-precision becomes application-wide, and if any lib in the application requires Rand's high-precision mode, then it gets used everywhere. If we implement this option we should therefore recommend that only applications enable this flag (if required); unfortunately this doesn't play well with Cargo crates bundling both libraries and applications.

Further, this is only going to be a good choice if the high-precision alternatives are still reasonably fast, since users can't pick which algorithm to use at each use-site.

This would probably not affect compilation time or code size much, since the algorithms would be behind feature flags thus unused code can be eliminated early.

Summary: keeps the API clean and simple; Rand still gets a little more complex (multiple implementations); limited control

@dhardy
Copy link
Member Author

dhardy commented Jun 5, 2018

Suggestion 3: multiple implementations, as necessary

This is basically the current plan regarding Standard and HighPrecision (#372) — just have multiple implementations which can be selected at the use-site, e.g. rng.gen() vs rng.sample(HighPrecision::new(0.0, 1.0))

For e.g. gen_bool this would require multiple function names; alternatively it might be better to stick with only a single implementation (the current one rounds probabilities down to the next multiple of 2^-64, which is good enough for most users, though perhaps not everyone).

Summary: flexible but with extra API complexity and sometimes worse ergonomics.

@dhardy
Copy link
Member Author

dhardy commented Jun 5, 2018

Suggestion 4: generic, over some configuration type

What if we added dummy types High and Low both implementing a Precision trait, then templated over these? I.e. Uniform::<Low, f64>::new(0.0, 5.0), Bernoulli::<High>::new(1e-30), ...

This would probably be quite bad for ergonomics (if it even worked as designed) since presumably users could not omit generic parameters any more (let range = Uniform::new_inclusive(1, 6); would become let range = Uniform::<Low, _>::new_inclusive(1, 6);).

In my opinion the poor ergonomics and extra API complexity are unacceptable; as it is now Rand is already confusing some users.

@sicking
Copy link
Contributor

sicking commented Jun 6, 2018

There's a variant of Suggestion 3 which is that rather than make it a property of which API is used (gen_bool() vs gen_fast_bool()), that we make it a property of the Rng object.

I.e. that something like fast_rng().gen_range(...) would use slightly biased, but faster, algorithms, whereas secure_rng().gen_range(...) would not accept any bias.

Similarly vec.shuffle(fast_rng()) would used slightly biased generation, whereas vec.shuffle(secure_rng()) would not.

We might even be able to create wrappers which wrap Rng objects and which turns on/off bias. Such that fast_rng().unbiased().gen_range(...) would produce unbiased numbers.

I don't know if there's any good way to do this which doesn't introduce runtime cost though. It'd work to stick a fn allow_bias() -> bool function on RngCore and then check that where needed. But I don't know if the optimizer is good enough at optimizing away the check at all times?

Also this approach would make fast_rng().sample(Uniform::new(1, 10)) kind of wonky. The Uniform constructor will calculate the anti-bias zone, but would sample produce biased on unbiased values?

@sicking
Copy link
Contributor

sicking commented Jun 6, 2018

Ultimately I think Suggestion 1 is the most reasonable. Though it scares me a little to think that someone might use a CSPRNG together with a biased algorithm...

@pitdicker
Copy link
Contributor

That is a very good summary, @dhardy!

We seem to be in a bit of a difficult position, as Rand is used for some very different use cases:

  • randomness in common programming problems
  • reasonably secure randomness
  • long-running simulations

In my opinion, we should strongly optimize for the 'common programming problems' (security does not really apply here). Is seems reasonable to expect anyone who writes multiple-month simulations to spend an hour reading docs, or know the limitations of the algorithms he is using.

I like suggestion 1 most, but would also like to see 3 when reasonable.

Specifically:

  • gen_bool: return to the 32-bit accuracy, and have the Bernoulli distribution for 53~64 bit accuracy. One optimized for common use, one for long-running simulations.
  • Have both Standard and HighPrecision01.
  • Uniform range: have to think a bit more about this one. An imperfect method is a good choice when the numbers are relatively small. Maybe we can make it depend on the type, with the faster, but imperfect, method for u8 and u16?

Letting the accuracy depend on the RNG as @sicking suggest is also an interesting idea. But the distinction should not be between normal PRNGs and CSPRNGs. Accuracy is not really a security-related problem. PRNGs that are suitable for simulations are also the ones that are suitable for many common problems. I would say the suitability depends more on where the function is used, than the RNG.

@dhardy
Copy link
Member Author

dhardy commented Jun 6, 2018

I'm not sure how @sicking's suggestion could be implemented other than by adding a method like fn use_high_precision_sampling(&self) -> bool to RngCore (specialisation might allow a different approach based on whether an additional trait, HighPrecisionSamplingRng, was implemented, but not yet). Ultimately I'm not so keen on the idea anyway because as @pitdicker says, accuracy is more a use-case problem and less a security problem (though arguably enabling it globally within an application is a reasonable solution, hence suggestion 2).

We could also modify suggestion 2 by adding a low_precision feature in addition to (or instead of) the high_precision one.

Other than that, I just wanted to say that talking about Bernoulli as having 53/64-bit accuracy is a bit mis-leading: it currently supports down to 2^-64, but 53-bit (and even 24-bit) precise floats can represent far smaller positive numbers (MIN_POSITIVE is approx 10^-38 and 10^-308).

@pitdicker
Copy link
Contributor

pitdicker commented Jun 6, 2018 via email

@dhardy
Copy link
Member Author

dhardy commented Jun 6, 2018

Another thing I want to say is that I'd prefer to err on the side of caution (i.e. more precision than necessary) than on the side of performance (for default choices). That said, I think some trade-offs can still be made (e.g. the current Bernoulli should be good enough for the vast majority of users).

This is really more political than it is technical: people are far more likely to criticise incorrect and prematurely-optimised approaches than slow ones; also since in many cases performance will not be a real issue anyway, I don't think it makes sense to "over-optimise".

Because of this, I'm not happy to skip the zone checks for range sampling (e.g. sampling u8 from u32 with Uniform), even though quick tests show 30-50% better performance. If we were to sample e.g. i16 from a u64 backing range then arguably any bias would be too small to matter, but it appears faster to sample an i16 from a u32 backing range with the correct zone checks anyway, and in this case although bias is still small, it's not small enough that I'm happy to ignore it by default.

I therefore think that these range sampling optimisations (and perhaps also the 32-bit gen_bool solution) should either not be used at all, or only be enabled on an opt-in basis. There are probably some uses where the bias really doesn't matter, but in the interests of making users lives simpler I think the only good opt-in solution would be a feature flag (i.e. suggestion 2 inverted).

Good point @pitdicker that anyone writing serious simulations will likely look over the docs closely, so slightly less obvious/ergonomic solutions for people needing high precision is fine (i.e. solution 3), though suggestion 2 may still be easier.

I guess the wrap-up question is would any of these solutions make our lives easier (i.e. more obvious answers to speed-vs-quality compromises) vs harder (more code/complexity)?

No problem @pitdicker 😉

@vks
Copy link
Collaborator

vks commented Jun 6, 2018

gen_bool: return to the 32-bit accuracy, and have the Bernoulli distribution for 53~64 bit accuracy. One optimized for common use, one for long-running simulations.

I'm skeptical for the following reasons:

  • If you really care about performance, gen_bool is the wrong interface. Bernoulli::new and sample are potentially faster. The only reason to use 32-bit accuracy over 64-bit accuracy is performance, so a 32-bit gen_bool would be a weird trade-off between convenience and performance.
  • It makes the API more complicated. Right now, gen_bool is a convenient wrapper over Bernoulli, your suggestion would make it a different algorithm with different trade-offs to consider.

I would like to propose a variation of suggestion 3:

Suggestion 5: split up the distributions into two modules: fast and precise

Distributions optimized for speed go to distributions::fast, if they are optimized for accuracy they go to distribtuions::precise.

Only the distributions in precise could be reexported to provide a conservative default. Users can pick explicitly which one they want to use, or just pick the default. This would also get rid of weird names like HighPrecision01, which could be just precise::ClosedOpen01. We could also provide a 32-bit based fast::Bernoulli. (I don't think there should be agen_bool using fast::Bernoulli: convenient and less safe don't mix well.)

Summary: flexible but with extra API complexity that can be safely ignored by probably most users

@dhardy
Copy link
Member Author

dhardy commented Jun 6, 2018

Nice suggestion @vks! Two niggles:

  • the names are unwieldy
  • for maximum compatibility, it would make sense if the two modules are (roughly) equivalent, so users can just use rand::distributions::CHOICE as distributions;

With that in mind, perhaps rand::distributions should contain the default implementations, and rand::distributions::fast (?) contain the fast versions, re-binding all items from the main module which do not have faster equivalents.

One thing though: should we really make HighPrecision01 the default (and also a high-precision variant of Uniform, which we haven't implemented yet and may have large overhead)?

I guess we could have three variants of the module: the default, fast, and precise. Seems somewhat over-designed.

@vks
Copy link
Collaborator

vks commented Jun 6, 2018

the names are unwieldy

Not sure how they could be any shorter. prec instead of precise? To be honest, I would rather shorten distributions.

I think the alternative suggestions except suggestion 1 are more unwieldy.

for maximum compatibility, it would make sense if the two modules are (roughly) equivalent

Yes, it should be safe for fast to reexport precise if there is no faster implementation.

I guess we could have three variants of the module: the default, fast, and precise. Seems somewhat over-designed.

That's what I suggested. You would have to choose between distributions::Dist, distributions::fast::Dist and distributions::precise::Dist. We could decide on a case-by-case basis whether distributions::Dist is reexporting from fast or precise, but this would increase the mental burden for the user.

I don't think it is over-designed (unless providing more than one implementation is already over-designed), because it is backwards compatible and nicely allows to opt into the desired implementation.

@dhardy
Copy link
Member Author

dhardy commented Jun 6, 2018

I mean the combined names, e.g. distributions::fast. But no, I don't have a better idea. It'll do.

We could decide on a case-by-case basis whether distributions::Dist is reexporting from fast or precise

I see. Okay, sounds like a good plan, though it leaves two things out:

  • Short-cuts like Rng::gen, gen_range and gen_bool will not be configurable; users wanting something else must switch to distributions::XYZ::Uniform etc.; I guess this is acceptable though not ideal.
  • Do we want all of these things in Rand, vs some external library (Future of distributions in rand #290)? And if we do, do we want all implementations in Rand itself? Moving some distributions out to a different library shouldn't be difficult (should we decide to do that), but e.g. moving high-precision variants to an external lib wouldn't be so easy — but probably there is no reason to do that anyway.

@sicking
Copy link
Contributor

sicking commented Jun 6, 2018

But the distinction should not be between normal PRNGs and CSPRNGs. Accuracy is not really a security-related problem.

Yeah, I didn't mean to suggest otherwise. Mainly I was thinking that someone specifically requesting a cryptographically strong likely wouldn't want to use an algorithm with bias as a default. I.e. if you're generating a cryptographically strong security token for authentication, I was thinking you likely don't want more as showing up than zs. Though from a practical point of view, if we make the biases small enough, security doesn't seem affected in a meaningful way. I.e. if a 256-bit key ends up being just 255.99 bits, then that's likely fine.

But I'd want to check with someone that knows crypto better than me before being sure about that.

Suggestion 5: split up the distributions into two modules

I think the distributions are the relatively easy part to solve. Separate structs for fast vs. biased I think would make for both make for a clear and easy to implement API. If we do it by sticking the distributions in separate modules, or by using more expressive names, is not something I have an opinion about.

The bigger problem is all the other APIs. I.e. things like gen_range(), shuffle(), choose(), choose_multiple(), sample_indices(), etc. These are the ones that I was thinking we could do by using flags on the RngCore. But it's not clear to me that that's implementable. Or worth the implementation complexity.

@sicking
Copy link
Contributor

sicking commented Jun 6, 2018

I agree with @pitdicker that we should optimize for "common programming problems". Which I feel like it means that sufficiently small biases is ok.

Though I'm not sure what optimizing for "common programming problems" means for performance. Arguably it means that if making things faster comes at a cost of code complexity (which often is the case), we should only optimize the code if it's likely to show up in real-world profiles, rather than worry too much about squeezing everything out of our micro-benchmarks.

But I'm happy to defer to others. Especially in cases where implementation strategy doesn't affect the public API and so is something that we can easily change over time.

@vks
Copy link
Collaborator

vks commented Jun 7, 2018

Short-cuts like Rng::gen, gen_range and gen_bool will not be configurable; users wanting something else must switch to distributions::XYZ::Uniform etc.; I guess this is acceptable though not ideal.

They trade flexibility for convenience. Making them more flexible defeats their purpose, so I'm not sure what the alternative would be? I'm not a big fan of any of those methods TBH.

This is different for the algorithms operating on sequences (shuffle, choose etc.), which have no alternative. We could possible add shuffle_with_dist(&mut [T], &mut Rng, Fn(usize, usize) -> impl Distribution<usize>) etc., but maybe that is too flexible.

I'm not sure what "common programming problems" means, because it depends so strongly on the programs in question. Either of the alternatives @pitdicker mentioned may be "common" in certain domains.

Maybe it would be more actionable to look at typical implementations used in the popular randomness libraries that have been used a lot without anyone complaining about biases.

Please note that "long-running simulations" are not necessarily "long" in terms of real time but rather in CPU time. On most supercomputers there is an upper limit on the runtime, ranging from a day up to a week or two.

moving high-precision variants to an external lib wouldn't be so easy

Why would it be more difficult than for the others? In any case, I think they should live in the same crate. It might make sense to move the distributions to their own crate. However, at the moment they are tied to some methods of Rng. I think those could be all moved to the distributions or a module for the sequences though.

@dhardy
Copy link
Member Author

dhardy commented Jun 7, 2018

I'm not sure what "common programming problems" means, because it depends so strongly on the programs in question.

Agreed; in part also because trying to tie down an exact level of bias which is acceptable in general is difficult.

Two things about @vks suggestion still need finalising:

  • How do we document this? Obviously we can document the trade-offs each individual implementation makes, but e.g. given fast::Uniform and precise::Uniform, should distributions::Uniform document both implementations and link to them or copy the documentation from whichever it uses or even not talk about precision other than to briefly mention which variant is used?
  • How do we handle breaking changes where we decide to replace distributions::X with a different implementation of X? Do we tell users that they should explicitly use either fast::X or precise::X if they care about reproducibility or at all about which method gets used? (Related: Plan value-level stability guarantees (standards) #408)

@pitdicker
Copy link
Contributor

pitdicker commented Jun 7, 2018

I'm not sure what "common programming problems" means, because it depends so strongly on the programs in question. Either of the alternatives @pitdicker mentioned may be "common" in certain domains.

I don't really see it as difficult or vague. When we start to talk in terms of "this will take 1+ year of CPU-time before the tiniest bias becomes visible while doing nothing but calling this function", we are clearly outside the "common programming problems" territory.

Maybe it would be more actionable to look at typical implementations used in the popular randomness libraries that have been used a lot without anyone complaining about biases.

That is always a good idea 😄.


I really don't think suggestion 2 is a good idea. The choice of precision really depends on how something is used. Say a bit more accuracy is wanted in one place. A feature flag would effect every other function, also in every other dependency.

I don't know what to think of suggestion 5. If we only consider Uniform, maybe Standard for floats, and maybe Bernoulli, it makes some sense. But it doesn't really for all the other distributions.

moving high-precision variants to an external lib wouldn't be so easy

Why would it be more difficult than for the others?

Seems like a good alternative to keep in mind. We already had the same discussion around PRNG algorithms. In my opinion is is a good thing for Rand to be opinionated in the functionality it includes. If we can't make good choices, or offer multiple variants if really needed, how do we expect users to choose? Distribution is a trait, there is no harm in having alternative algorithms for some distributions in another crate.

@sicking
Copy link
Contributor

sicking commented Jun 9, 2018

Another approach occurred to me:

Suggestion 6: Use the type of the generated data to determine precision

I noticed that rand used to have a struct Closed01(f32) (or some such) type for generating values in the [0, 1] range.

We could use a similar approach for choosing between high-performance vs. low-bias algorithms. I.e. Uniform<Fast<u32>> would generate slightly biased but higher performance numbers. And Uniform<Precise<u32>> would generate unbiased numbers, but using a slower algorithm. Where Fast looks something like struct Fast<T: ...>(T).

This would unfortunately only work for APIs like Uniform, gen, gen_range, gen_ratio, and gen_bool, which are (or could be) generic over the type of the generated numbers. It does not work for SliceRandom.choose which has no such type parameter.

I'm not sure that I particularly like this solution, but I wanted to add it for completeness.

@dhardy
Copy link
Member Author

dhardy commented Jun 9, 2018

In some cases this "wrapping type" notation can be used easily, but to be honest I found Open01 etc. quite annoying to use a lot of the time. E.g. you might want to write this:

let Precise(x) = rng.gen();
// but the type of `x` must be specified somewhere, so you may require:
let Precise(x): Precise<f64> = rng.gen();

@dhardy
Copy link
Member Author

dhardy commented Jul 31, 2018

I now agree with @pitdicker's view the most: that in most cases we can just pick a sensible default, and in a few we can add extra implementations (i.e. Suggestion 3).

Proposal:

  • keep most things as-is
  • where necessary, add extra implementations whose type names are prefixed by Precise or Fast (i.e. rename HighPrecision01 to PreciseUniform01 and HighPrecision to PreciseUniform)

This seems the simplest solution, and also provides uniform names, without the confusion of shadowed names in sub-modules (I don't like the idea of having distributions::Uniform and distributions::precise::Uniform as in suggestion 5).

Is everyone happy enough with this proposal?

@vks
Copy link
Collaborator

vks commented Jul 31, 2018

I don't like the idea of having distributions::Uniform and distributions::precise::Uniform as in suggestion 5

Why do you think it is confusing? To me, PreciseUniform and precise::Uniform are the same, except that the former is less flexible.

@dhardy
Copy link
Member Author

dhardy commented Jul 31, 2018

Because the normal way to import the name is use rand::distributions::precise::Uniform, which doesn't make it clear in the rest of the code which implementation is used. It also requires two use statements with an explicit as NewName in order to use both in the same module.

@vks
Copy link
Collaborator

vks commented Jul 31, 2018

Because the normal way to import the name is use rand::distributions::precise::Uniform, which doesn't make it clear in the rest of the code which implementation is used.

You are free to prefer use rand::distributions::precise and precise::Uniform instead. No need to rename them. Note that this has precedence in std::io::Error vs. std::fmt::Error and ring's API.

@sicking
Copy link
Contributor

sicking commented Jul 31, 2018

FWIW, I think there's relatively little value in adding additional distributions. I think the real value is in if we can provide faster implementations of the "default" functionality like gen_range, shuffle, choose, etc. The same way we do with gen_bool and gen_ratio, where we use "good enough" precision in order to get better performance.

Which really comes down to having a faster, but "good enough", implementation of Uniform.

Defining "good enough" is hard, but we've been able to do it for gen_bool and gen_ratio, so I don't see a reason we couldn't for Uniform, shuffle and choose. Probably something like, "requires running load X for Y months in order to archive a confidence interval over Z% that Uniform doesn't generate evenly distributed numbers".

Additionally having a "perfectly unbiased" distribution like today's Uniform seems like a good idea. But I think the interesting question is what we do for the APIs above.

@dhardy
Copy link
Member Author

dhardy commented Jul 31, 2018

@sicking I am open to revisiting the question of bias when sampling small integer ranges, though I don't know exactly what "confidence" parameters we should have.

One of the issues though is allowing smart choices of bias/performance trade-offs. For example, allowing small bias in integer range sampling has a relatively small performance impact, vs allowing small bias in floating-point range sampling (IIUC). So does it make sense only to have distributions::Uniform with the biased optimisations and distributions::precise::Uniform with the maximal precision?

@sicking
Copy link
Contributor

sicking commented Jul 31, 2018

I'm also not sure what confidence interval to use. My thinking was to use 50% as in "more likely than not that there's bias". I think think the pick of load makes at least as big of a difference. I.e. the choice between "do nothing but generate random numbers on every core on a modern MacBook Pro" vs. "generate one random number every microsecond".

I was able to get about a 2x perf improvement on integer sampling by accepting a small bias for 8-bit, 16-bit integers and 32-bit integers. Smaller than the 6x difference between the current Uniform and the code in #531, but same order of magnitude.

So does it make sense only to have distributions::Uniform with the biased optimisations and distributions::precise::Uniform with the maximal precision?

If we do anything at all, then yeah, I think that's what I would do. Though I don't know if I would put "unbiased integer sampling" in the same distribution as "maximum precision float sampling". I feel like bias and precision are related but not the same. But I'm not sure.

@dhardy dhardy added this to the 0.7 release milestone Aug 23, 2018
@vks vks removed this from the 0.7 release milestone Jun 27, 2019
@dhardy
Copy link
Member Author

dhardy commented Sep 11, 2019

This problem has gone unanswered for a long time now. Since the above was written, many distributions have been moved to the rand_distr crate, and we have multiple implementations of the WeightedIndex distribution.

Related problems:

  • providing multiple algorithms differing by speed (WeightedIndex)
  • making old algorithms available (for value stability) while providing new implementations

Possible solution: add a new module, distributions::impls (or variants; not clear what is the best name). Under this, add a module for each distribution (e.g. distributions::impls::weighted), providing each implementation under a unique, invariant name. The re-export the preferred implementation under distributions.

@dhardy
Copy link
Member Author

dhardy commented Oct 18, 2021

By now we have found "nice defaults" (solution 1) for most cases and spun off many implementations to the rand_distr crate. There are still several cases to consider:

My suggestion now is that we stick with "solution 1" (reasonable defaults) for rand and rand_distr, but add a new crate, say rand_precise for high/infinite precision sampling algorithms, and just add new impls there (rand_precise::Bernoulli).

@dhardy
Copy link
Member Author

dhardy commented Feb 17, 2023

A lot of this discussion is based on HighPrecision floating-point generation. This is situational, and perhaps unnecessary. Our standard FP generation never produces a whole sub-set of values due to limited precision — which is fine for many uses where that extra precision would never be used anyway (e.g. x+1 would immediately lose it). So for this I agree that it should be a separate distribution (solution 3 or 5) or not at all.

Fast, potentially biased integer sampling is a distinct problem: potentially producing some values too often relative to others. It is behaviourally distinct: for small numbers of samples the bias is statistically non-detectable, while when using sufficiently large numbers of samples the bias could be visible for pretty much any usage. For this I think only solution 1 or 2 is appropriate: if you have a reason to need an unbiased Uniform distribution then exactly the same logic applies to rng.gen_range, slice.shuffle etc.

Having an exact Bernoulli distribution falls under the second case: bias.

The multiple impls of WeightedAlias and index::sample are neither about precision nor bias, therefore nothing to do with this issue.


I propose the following actions:

  • Reject solutions 3 / 5 and close HighPrecision<f32/64> #531. These are all neat ideas, but I simply haven't seen interest in using high-precision variants. (If there is interest, possibly it should be moved to rand_distr or a new rand_precise crate simply because I don't think most users of rand need it.)
  • Adopt solution 1 (we mostly have, but should add some guidelines for acceptable bias in distributions). Here I proposed accepting bias of 1 in 2^48 samples but not 2^32 samples. This should be good enough for almost everyone.
  • Optionally also add an unbiased feature flag which changes algorithms in a few places (Solution 2). I'm not sure about this one: few people would need it and enabling the flag breaks reproducibility (and thus several tests).
  • Investigate Bernoulli and Implement an exact Bernoulli distribution #1193: does the current implementation satisfy our new guidelines on bias?

@vks
Copy link
Collaborator

vks commented Feb 17, 2023

I agree that the high-precision variants could live in their own crate. I'm also not sure whether there are even use cases for them, and a common criticism of rand is too much complexity/flexibility, so it makes sense to split them off. (We should probably decrease our API surface instead of increasing it.1)

About unbiased: I think it's great to have a quantitative guideline! I would also prefer an extra crate over solution 2. I think people care more about portability/reproducibility than such small biases. Having an extra crate avoids portability and reproducibility problems, and keeps rand simpler.

Footnotes

  1. I would like to revisit rand::distributions::uniform and check whether people actually extend Uniform, or whether we can stop to expose those traits / types.

@dhardy
Copy link
Member Author

dhardy commented Feb 17, 2023

I don't see an extra crate as an alternative to solution 2: the biased/unbiased question comes down to "is our simulation likely to have noticeable bias due to the algorithms used" (given the 2^48 threshold used, anything involving less than ~1 year of CPU time is unaffected). For the few cases where the answer is yes, all uses of Uniform, list shuffling, etc. should use unbiased algorithms, hence this is about crate configuration not a new crate.

In my opinion the only relevant question is between solution 1 (one set of algorithms for everything) or 2 (crate configuration), and I don't see a strong rationale either way. (Also relevant: is 5-40% overhead on uniform sampling even an issue, even in things like games? Only if done a lot, as in the ridiculous micro-benchmarks I've been using.)

I would like to revisit rand::distributions::uniform and check whether people actually extend Uniform, or whether we can stop to expose those traits / types. leftwards_arrow_with_hook

The whole rand::distributions::uniform module is essentially an implementation detail people can skip over. If we wanted, we could move it to a new crate, rand_uniform, and import only Uniform into rand, just to hide it. Or just hide it in docs. But this hurts more than it helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E-question Participation: opinions wanted
Projects
None yet
Development

No branches or pull requests

4 participants