Improve doc on distributions and period #23

dhardy · 2019-11-14T13:06:40Z

Addresses @vigna's comments. Closes #18, #19, #20, #21.

@vigna would you please review? Note that the changes on distributions are more extensive; @vks may be a more appropriate reviewer here.


vigna · 2019-11-14T15:06:15Z

src/guide-rngs.md

-period of 2<sup>128</sup>).
-
+period. For some uses this may be a nice property to have, but it may suppress
+duplicates expected in a truely random generator;


I'm sorry to insist, and this will be long, but this is really typical O'Neill nonsense (not surprisingly, the paper was rejected). Non-equidistributed generators fail a strong collision test exactly like an equidistributed generator. Stating that you cannot prove that something does not happen is very different from proving that it happens.

Suppose you have a 32-bit generator with 16-bit outputs (so you can actually run the tests) that is 2-dimensionally equidistributed (the best), e.g., a Marsaglia xorshift generator, and one that is not, like a PCG generator.

What is true is that if you look at the sequences of pairs of outputs (that is, 32 bits at a time) you will have to wait 2^32 iterations before seeing a collision in the xorshift case, and less in the PCG case.

First of all, "less" will not be the right number. I invite you to try—at this size it is very easy to run the test. In theory, if you take 2^20 samples (pairs of 16-bit outputs) you should get 128 collisions. You won't. True, the result will be 0 for the xorshift case and 126 or something for the PCG case, but the statistic won't be exactly right.
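The figure of 128 quoted above follows from the usual birthday-problem estimate. A minimal sketch, assuming ideal uniform sampling with replacement; the function name `expected_collisions` is illustrative and not taken from any library:

```python
import math

def expected_collisions(samples: int, space: int) -> float:
    """Expected number of collisions (draws repeating an earlier value) when
    taking `samples` uniform draws, with replacement, from `space` values."""
    # Expected distinct values seen: space * (1 - (1 - 1/space)**samples).
    # expm1/log1p keep precision when samples is much smaller than space.
    distinct = -space * math.expm1(samples * math.log1p(-1.0 / space))
    return samples - distinct

if __name__ == "__main__":
    # 2^20 pairs of 16-bit outputs, i.e. draws from a space of 2^32 values.
    print(expected_collisions(2**20, 2**32))  # close to 2^40 / 2^33 = 128
```

This only gives the target statistic for an ideal source; it says nothing about what either generator actually produces.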

But the real problem is that this is not how you run a collision test. The maximum power of the test is when you enumerate 1.25 times the possible values: in our case, 1.25 * 2^32 values. Both generators will fail the test, because there's just not enough state. At that length, the collisions should be about 0.55 * 2^32, but you can easily check that neither generator will give that.
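A scaled-down sketch of this point for the equidistributed case only. It assumes a toy space of 2^16 blocks and a hypothetical full-period 16-bit LCG that emits its whole state, so each block appears exactly once per period. It does not model PCG or xorshift themselves. In the large-space limit the ideal collision fraction at 1.25 times the space comes out near 0.54, in line with the "about 0.55" above:

```python
import math

SPACE = 1 << 16  # toy block space: 2^16 values instead of 2^32

def lcg16(state: int) -> int:
    """One step of a 16-bit LCG with full period 2^16 (hypothetical toy generator)."""
    return (25173 * state + 13849) % SPACE

def count_collisions(draws: int, seed: int = 1) -> int:
    """Emit the full state `draws` times and count values that were already seen."""
    seen, collisions, state = set(), 0, seed
    for _ in range(draws):
        state = lcg16(state)
        if state in seen:
            collisions += 1
        else:
            seen.add(state)
    return collisions

def ideal_collision_fraction(ratio: float) -> float:
    """Expected collisions / SPACE for ratio * SPACE ideal uniform draws
    (large-SPACE limit)."""
    return ratio - (1.0 - math.exp(-ratio))

if __name__ == "__main__":
    draws = SPACE + SPACE // 4                       # 1.25 times the possible values
    print(count_collisions(draws) / SPACE)           # 0.25: only wrap-around repeats
    print(round(ideal_collision_fraction(1.25), 4))  # 0.5365
```

The toy generator yields exactly 0.25 of the space as collisions, far from the ideal value, which is the "not enough state" failure described above.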

Additionally, in exchange for a not-so-bad collision test in the short run, the PCG generator cannot produce all possible values. If you run a test based on Coupon's collector (after drawing O(n log n) elements uniformly out of n, the probability that you've seen all elements is high), the PCG generator will disastrously fail the test: there will always be missing elements. The xorshift generator will have a statistic a bit off, but not so horribly bad.

So: either you generate all possible outputs for your state, or not. In the first case, you might approximately win Coupon's collector and do horribly on collisions. In the second case, you might approximately win collisions, and do horribly on Coupon's collector. There's just not enough space to have both.
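The trade-off can be illustrated at toy scale. The sketch below compares two hypothetical sources over one full period of 2^16 blocks: a full-period LCG that emits every block once, and a fixed pseudo-random map of the state (modelled here with a seeded `random.Random` stream). The second source is only a stand-in, by assumption, for a generator whose blocks are not a bijection of its state. It is not PCG.

```python
import random

SPACE = 1 << 16  # toy space of blocks (2^16 rather than 2^32)

def permutation_blocks():
    """Blocks from a full-period 16-bit LCG: every block appears once per period."""
    state = 1
    for _ in range(SPACE):
        state = (25173 * state + 13849) % SPACE
        yield state

def random_map_blocks(seed: int = 0):
    """Blocks from a fixed pseudo-random map of the state (illustrative stand-in)."""
    rng = random.Random(seed)
    for _ in range(SPACE):
        yield rng.getrandbits(16)

def summarize(blocks):
    """Return (missing values, collisions) over one full period of SPACE blocks."""
    seen, collisions = set(), 0
    for block in blocks:
        if block in seen:
            collisions += 1
        else:
            seen.add(block)
    return SPACE - len(seen), collisions

if __name__ == "__main__":
    print(summarize(permutation_blocks()))  # (0, 0): nothing missing, no collisions
    print(summarize(random_map_blocks()))   # both counts near SPACE / e
```

Over a full period the first source has no collisions at all, while the second can never emit roughly a third of the blocks, however long it runs.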

Any statement that either choice is "better" is just bogus.

dhardy · 2019-11-14

> enumerate 1.25 times the possible values

I don't understand what the point is: so you know at least 1/5th of values are duplicates? (And, as below, it doesn't seem relevant to real usage.)

> If you run a test based on Coupon's collector

Is this relevant, given that this situation is only applicable after cycling the generator many times?

For the most part the (non-)equidistribution property doesn't appear important to typical usage either since one should have (at the very least) L^2 < P.

Do you have a better argument for why this bound (L^2 < P) is recommended? That was the main point of this paragraph.

vigna · 2019-11-14

The point is that the collision test is stronger in that way. See https://dl.acm.org/citation.cfm?id=979926

Note that the "missing collisions" can happen only if your state has kw bits, your output is w bits, you are k-dimensionally equidistributed, and you consider collisions on blocks of kw bits. That's a lot of ifs.

"Equidistribution" in general does not imply anything about collisions. It just implies that your source can generate all possible values of a certain size, and generates each of them the same number of times. This is why the statement is mathematically wrong. You can keep it there (it's your book), but it's wrong.

The argument for L^2 < P is this: if you have w bits of state and w bits of output, and you output all possible values (nobody would use a generator of this kind without that property), then once you use more than √P elements you will notice a lack of collisions. So you stay below that.
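To put numbers on the √P threshold: a generator that emits its whole state never repeats within a period, so the question is how surprising "no repeats" would be for an ideal source. A minimal sketch using the standard birthday approximation; the function name is illustrative and P = 2^64 is just an example value:

```python
import math

def prob_no_repeat(length: int, period: int) -> float:
    """Approximate probability that `length` ideal uniform draws from `period`
    values contain no repeated value (birthday approximation)."""
    return math.exp(-length * (length - 1) / (2.0 * period))

if __name__ == "__main__":
    period = 2**64  # example: w = 64 bits of state and of output
    for length in (2**24, 2**32, 2**34):
        # A small probability means that seeing zero repeats is detectable as
        # non-random behaviour; a value near 1 means it is unremarkable.
        print(length, prob_no_repeat(length, period))
```

At L = √P the probability is still about 0.61, so the absence of collisions is unremarkable; at L = 4√P it drops to about e^-8, which is why staying below √P hides the effect.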

The argument extends to larger sizes in the sense that if you have kw bits of state and w bits of output, you can potentially generate in sequence all possible blocks of kw bits. If you do so you're in the same game for collisions of blocks of kw bits.

Generating all kw-bit blocks when kw is large might not be so relevant, but then collisions not happening after √(2^kw) blocks is irrelevant too.

dhardy · 2019-11-19

Unfortunately, even with a university affiliation, I cannot access that article.

I get your point about equidistribution not implying correct distribution of repeats and have updated this paragraph. Please take a look.

vigna · 2019-11-19

Seriously? It takes 5s to get the public version 😂.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.6616&rep=rep1&type=pdf

vigna · 2019-11-19

OK, my final three remarks:

1. When you say 2^64 values, I don't think everybody realizes this is after centuries of output generation; maybe it would be useful to remind readers of that. That is, at that size the "bias" is entirely theoretical. It might not be if you have 32-bit outputs and 64 bits of state. Maybe that's a better example.

2. I realize you cannot be completely precise in discussing equidistribution, but I would replace "k-dimensional equidistribution" with "equidistribution in the maximum possible dimension". Otherwise, with an unspecified k every generator emitting each output the same number of times (which happens for almost all generators, including PCG) would fall into your claim (and this is quite obviously not true).

3. When you say "nice property", I would explain why: "For some uses this may be a nice property to have, because it means that there are no missing values in the output of the generator, ...". Otherwise, you're showing just one side of the coin.
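For a sense of scale on the remark about 2^64 values, a back-of-the-envelope sketch. The rate of 10^9 outputs per second is an assumption chosen for illustration, and both helper names are hypothetical:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def seconds_to_generate(values: int, rate_per_second: float = 1e9) -> float:
    """Seconds needed to emit `values` outputs at an assumed, fixed output rate."""
    return values / rate_per_second

def years_to_generate(values: int, rate_per_second: float = 1e9) -> float:
    """The same duration expressed in years."""
    return seconds_to_generate(values, rate_per_second) / SECONDS_PER_YEAR

if __name__ == "__main__":
    print(years_to_generate(2**64))    # several centuries at the assumed rate
    print(seconds_to_generate(2**32))  # a few seconds at the assumed rate
```

At that assumed rate, 2^64 outputs take several centuries, whereas 2^32 outputs (the square root of a 2^64 period) take only a few seconds, which is why the smaller example makes the effect tangible.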

dhardy · 2019-11-19

> When you say 2^64 values, I don't think everybody realizes this is after centuries of output generation

There were only two mentions of 2^64; one of them I forgot to delete (see @vks's comment above; I will fix), and the other does mention this in the same sentence.

> "equidistribution in the maximum possible dimension"

I thought it was clear, but will make this change.

> there are no missing values in the output of the generator

~~I see no reason why this property is relevant, given that we're talking about generators which take (at least) centuries to cycle.~~ Because certain values may be impossible to generate with any seed.

vks · 2019-11-19

> Because certain values may be impossible to generate with any seed.

Shuffling is the main application where this applies, I think. (Still not sure how relevant it is, because it wouldn't be feasible to generate all possible shuffles anyway.)

dhardy · 2019-11-19

Why do you care about shuffling? Most practical examples will have a relatively small sequence (e.g. 52), so what matters is not the values output by the generator but the values output by the Uniform distribution (which uses widening multiply + rejection, and thus should also be fairly resistant to bias in the low bits).
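For readers unfamiliar with "widening multiply + rejection", here is a Lemire-style sketch of the idea. It is an illustration under stated assumptions and not the rand crate's actual code; the names `map_word` and `uniform_below` and the `bits` parameter are hypothetical.

```python
from typing import Callable, Optional

def map_word(word: int, n: int, bits: int = 32) -> Optional[int]:
    """Map one `bits`-bit word to [0, n) by widening multiply; None means reject."""
    product = word * n                 # up to 2 * bits wide
    low = product & ((1 << bits) - 1)  # low half decides acceptance
    threshold = (1 << bits) % n        # number of words that must be rejected
    if low < threshold:
        return None
    return product >> bits             # high half is the result

def uniform_below(next_word: Callable[[], int], n: int, bits: int = 32) -> int:
    """Draw words until one is accepted; the result is uniform in [0, n)
    provided the words themselves are uniform."""
    while True:
        value = map_word(next_word(), n, bits)
        if value is not None:
            return value

if __name__ == "__main__":
    # Exhaustive check with 8-bit words and n = 52 (a deck of cards).
    counts = [0] * 52
    for word in range(256):
        value = map_word(word, 52, bits=8)
        if value is not None:
            counts[value] += 1
    print(set(counts))  # {4}: every index is produced by exactly 4 words
```

Because the result is taken from the high half of the product, it depends mostly on the high bits of each word, which is consistent with the remark about resistance to bias in the low bits.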

dhardy · 2019-11-19T14:31:55Z

Whoops, it appears I pushed upstream instead of to this PR accidentally (I should enable branch protections)! I think in any case we're done with this PR, so long as @vigna approves my last commit.

dhardy added 3 commits November 14, 2019 12:26

Improve documentation of value/distribution sampling

37f01c4

Addresses rust-random#21

Improve documentation on period and remove incorrect formula

fbc3d98

Addresses rust-random#18 and rust-random#19

Revise section on "obselecence of non-crypto rngs"

6a9e46b

Addresses rust-random#20

vks reviewed Nov 14, 2019

src/guide-dist.md

vks reviewed Nov 14, 2019

src/guide-rngs.md

vks approved these changes Nov 14, 2019

vigna reviewed Nov 14, 2019

dhardy mentioned this pull request Nov 19, 2019

Documentation fixes: conversion to float, PCG, all features rust-random/rand#909

Merged

dhardy merged commit 6a9e46b into rust-random:master Nov 19, 2019

dhardy mentioned this pull request Nov 19, 2019

More tweaks to "our rngs" #24

Closed
