Improve doc on distributions and period #23

Merged (3 commits) on Nov 19, 2019

Member

@dhardy dhardy commented Nov 14, 2019

Addresses @vigna's comments. Closes #18, #19, #20, #21.

@vigna would you please review? Note that the changes on distributions are more extensive; @vks may be a more appropriate reviewer here.

src/guide-dist.md (resolved)
src/guide-rngs.md (resolved)
period of 2<sup>128</sup>).

period. For some uses this may be a nice property to have, but it may suppress
duplicates expected in a truely random generator;

@vigna vigna Nov 14, 2019

I'm sorry to insist, and this will be long, but this is really typical O'Neill nonsense (not surprisingly, the paper was rejected). Non-equidistributed generators fail a strong collision test exactly like an equidistributed generator. Stating that you cannot prove that something does not happen is very different from proving that it happens.

Suppose you have a 32-bit generator with 16-bit outputs (so you can actually run the tests) that is 2-dimensionally equidistributed (the best), e.g., a Marsaglia xorshift generator, and one that is not, like a PCG generator.

What is true is that if you look at the sequences of pairs of outputs (that is, 32 bits at a time) you will have to wait 2^32 iterations before seeing a collision in the xorshift case, and less in the PCG case.

First of all, "less" will not be the right number. I invite you to try—at this size it is very easy to run the test. In theory, if you take 2^20 samples (pairs of 16-bit outputs) you should get 128 collisions. You won't. True, the result will be 0 for the xorshift case and 126 or something for the PCG case, but the statistic won't be exactly right.
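As a rough check of the 128 figure, here is a minimal sketch, assuming the samples are drawn uniformly from 2^32 possible pairs. It counts a collision as any sample that repeats an earlier one, so the expected number of collisions is the sample count minus the expected number of distinct values. The function name is illustrative and not taken from any library.

```python
import math

def expected_collisions(n: int, cells: int) -> float:
    """Expected collisions when n samples are drawn uniformly from `cells` values."""
    # Expected number of distinct values is cells * (1 - (1 - 1/cells)**n),
    # evaluated with log1p/expm1 to avoid rounding error for large `cells`.
    distinct = cells * -math.expm1(n * math.log1p(-1.0 / cells))
    return n - distinct

n, cells = 2**20, 2**32
print(expected_collisions(n, cells))  # close to the approximation below
print(n * n / (2 * cells))            # birthday approximation: 128.0
```

This only describes an ideal uniform source; it says nothing about what a particular xorshift or PCG variant actually produces.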

But the real problem is that this is not how you run a collision test. The maximum power of the test is reached when you enumerate 1.25 times the possible values; in our case, 1.25 * 2^32 values. Both generators will fail the test: there's just not enough state. At that length, the collisions should be about 0.55 * 2^32, but you can easily check that neither generator will give that.
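The expected collision count at this sample size can be estimated with a short calculation. This is a sketch under the assumption of an ideal uniform source, using the standard occupancy approximation; the function name is hypothetical. It gives roughly 0.54 times the number of cells, in line with the "about 0.55" figure quoted above.

```python
import math

def collision_fraction(c: float) -> float:
    """Approximate (expected collisions) / N when c*N uniform samples fall into N cells."""
    # Each cell stays empty with probability about exp(-c), so about
    # N * (1 - exp(-c)) cells are occupied; collisions = samples - occupied cells.
    return c - 1.0 + math.exp(-c)

print(round(collision_fraction(1.25), 3))  # 0.537
```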

Additionally, in exchange for a not-so-bad-collision-test-in-the-short-run, the PCG generator cannot produce all possible values. If you run a test based on the coupon collector's problem (after drawing O(n log n) elements uniformly out of n, the probability that you've seen all elements is high), the PCG generator will disastrously fail the test: there will always be missing elements. The xorshift generator will have a statistic a bit off, but not so horribly bad.
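For reference, the coupon collector quantities mentioned here can be sketched as follows, assuming an ideal uniform source over n values. Both function names are illustrative. The second function uses a Poisson approximation for the number of values not yet seen; a generator that can never emit some value has coverage probability zero no matter how many draws are made.

```python
import math

def expected_draws_to_see_all(n: int) -> float:
    """Expected uniform draws needed to see each of n values at least once: n * H_n."""
    return n * sum(1.0 / k for k in range(1, n + 1))

def prob_all_seen(n: int, draws: int) -> float:
    """Approximate probability that `draws` uniform draws cover all n values."""
    # Poisson approximation: the count of unseen values is roughly
    # Poisson-distributed with mean n * exp(-draws / n).
    return math.exp(-n * math.exp(-draws / n))

n = 2**16
print(expected_draws_to_see_all(n) / (n * math.log(n)))  # slightly above 1
print(prob_all_seen(n, 3 * n * 16))                      # very close to 1
```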

So: either you generate all possible outputs for your state, or not. In the first case, you might approximately win the coupon collector test and do horribly on collisions. In the second case, you might approximately win collisions and do horribly on the coupon collector test. There's just not enough space to have both.

Any statement that either choice is "better" is just bogus.

Member Author

enumerate 1.25 times the possible values

I don't understand what the point is: so you know at least 1/5th of values are duplicates? (And, as below, it doesn't seem relevant to real usage.)

If you run a test based on Coupon's collector

Is this relevant, given that this situation is only applicable after cycling the generator many times?

For the most part the (non-)equidistribution property doesn't appear important to typical usage either since one should have (at the very least) L^2 < P.

Do you have a better argument for why this bound (L^2 < P) is recommended? That was the main point of this paragraph.

The point is that the collision test is stronger in that way. See https://dl.acm.org/citation.cfm?id=979926

Note that the "missing collisions" can happen only if your state has kw bits, your output is w bits, you are k-dimensionally equidistributed, and you consider collisions on blocks of kw bits. That's a lot of ifs.

"Equidistribution" in general does not imply anything about collisions. It just implies that your source can generate all possible values of a certain size, and generates each of them the same number of times. This is why the statement is mathematically wrong. You can keep it there (it's your book), but it's wrong.

The argument for L^2 < P is that if you have w bits of state and w bits of outputs, and you output all possible values (nobody would use a generator of this kind without that property), if you use more than √P elements you will notice a lack of collisions. So you stay below that.
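A toy illustration of this argument, assuming a hypothetical 16-bit LCG that outputs its whole state (the multiplier and increment are picked only to satisfy the Hull-Dobell full-period conditions and do not come from this thread). Because every value appears exactly once per period, a run of L < P outputs contains no repeats, whereas an ideal random source would be expected to show about L^2 / (2P) of them.

```python
def toy_lcg(seed: int, count: int):
    """Yield `count` outputs of a 16-bit LCG that outputs its whole state."""
    # Hypothetical toy parameters: multiplier = 1 (mod 4) and odd increment,
    # which gives the full period P = 2**16 by the Hull-Dobell theorem.
    state = seed & 0xFFFF
    for _ in range(count):
        state = (25173 * state + 13849) & 0xFFFF
        yield state

def count_collisions(values) -> int:
    """Number of samples that repeat an earlier sample."""
    values = list(values)
    return len(values) - len(set(values))

P = 2**16
L = 2**12  # well above sqrt(P) = 256
observed = count_collisions(toy_lcg(seed=1, count=L))
expected_if_random = L * L / (2 * P)
print(observed, expected_if_random)  # 0 128.0
```

With L at or below sqrt(P) the expected count for a random source is under one, so the absence of repeats is not noticeable; this is the regime the L^2 < P recommendation keeps users in.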

The argument extends to larger sizes in the sense that if you have kw bits of state and w bits of output, you can potentially generate in sequence all possible blocks of kw bits. If you do so you're in the same game for collisions of blocks of kw bits.

Generating all kw-bit blocks when kw is large might not be so relevant, but then collisions not happening after √(2^kw) blocks is also irrelevant.

Member Author

Unfortunately, even with a university affiliation, I cannot access that article.

I get your point about equidistribution not implying correct distribution of repeats and have updated this paragraph. Please take a look.

Seriously? It takes 5s to get the public version 😂.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.6616&rep=rep1&type=pdf

OK, my final three remarks:

  • When you say 2^64 values, I don't think everybody realizes this is after centuries of output generation; maybe it would be useful to remind readers of that. That is, at that size the "bias" is entirely theoretical. It might not be if you have 32-bit outputs and 64 bits of state. Maybe that's a better example.
  • I realize you cannot be completely precise in discussing equidistribution, but I would replace "k-dimensional equidistribution" with "equidistribution in the maximum possible dimension". Otherwise, with an unspecified k, every generator emitting each output the same number of times (which happens for almost all generators, including PCG) would fall under your claim (and this is quite obviously not true).
  • When you say "nice property", I would explain why: "For some uses this may be a nice property to have, because it means that there are no missing values in the output of the generator, ...". Otherwise, you're showing just one side of the coin.
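To put the "centuries" remark in numbers, here is a back-of-the-envelope sketch. The rate of one output per nanosecond is an assumption made for illustration, not a measurement of any particular generator, and the function name is hypothetical.

```python
def years_to_generate(outputs: int, outputs_per_second: float) -> float:
    """Years needed to produce `outputs` values at a fixed rate."""
    seconds_per_year = 365.25 * 24 * 3600
    return outputs / outputs_per_second / seconds_per_year

# Assumed rate: one output per nanosecond on a single core (hypothetical).
print(years_to_generate(2**64, 1e9))  # several centuries
print(years_to_generate(2**32, 1e9))  # a few seconds, expressed in years
```

At the assumed rate, 2^32 outputs take only seconds, which is why a smaller example may make any such effect practically observable.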

Member Author

@dhardy dhardy Nov 19, 2019

When you say 2^64 values, I don't think everybody realizes this is after centuries of output generation

There were only two mentions of 2^64; one of them I forgot to delete (see @vks's comment above; I will fix), and the other does mention this in the same sentence.

"equidistribution in the maximum possible dimension

I thought it was clear, but will make this change.

there are no missing values in the output of the generator

I see no reason why this property is relevant, given that we're talking about generators which take (at least) centuries to cycle. Because certain values may be impossible to generate with any seed.

Contributor

Because certain values may be impossible to generate with any seed.

Shuffling is the main application where this applies, I think. (Still not sure how relevant it is, because it wouldn't be feasible to generate all possible shuffles anyway.)

Member Author

Why do you care about shuffling? Most practical examples will have a relatively small sequence (e.g. 52), thus it is not the values output by the generator which matter but the values output by the Uniform distribution (which uses widening multiply + rejection, thus should be fairly resistant to bias in the low bits also).
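For readers unfamiliar with the technique, a minimal sketch of widening multiply plus rejection follows. This is an illustration of the general idea (in the style of Lemire's method), not the actual implementation of the Uniform distribution in rand; the function names and the `bits` parameter are hypothetical, and `next_word` stands for any callable returning a `bits`-wide random integer.

```python
from collections import Counter

def map_word(x: int, n: int, bits: int = 32):
    """Map one `bits`-wide random word to [0, n), or return None to reject it."""
    m = x * n                           # widening multiply: result is up to 2*bits wide
    low = m & ((1 << bits) - 1)         # the low half decides acceptance
    threshold = (1 << bits) % n         # low halves below this are rejected
    return None if low < threshold else m >> bits

def uniform_below(n: int, next_word, bits: int = 32) -> int:
    """Draw from [0, n) by retrying until a word is accepted."""
    while True:
        result = map_word(next_word(), n, bits)
        if result is not None:
            return result

# Exhaustive check with 8-bit words and a 52-card deck.
counts = Counter(map_word(x, 52, bits=8) for x in range(256))
print(counts[None], set(counts[v] for v in range(52)))  # 48 {4}
```

In the exhaustive check every one of the 52 results is produced by exactly four accepted words, so the mapping is unbiased whenever the input words themselves are uniform; the result is taken from the high half of the product.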

Member Author

dhardy commented Nov 19, 2019

Whoops, it appears I pushed upstream instead of to this PR accidentally (I should enable branch protections)! I think in any case we're done with this PR, so long as @vigna approves my last commit.

@dhardy dhardy mentioned this pull request Nov 19, 2019

Successfully merging this pull request may close these issues.

"This is not a property of true randomness."
3 participants