Robert Floyd's combination algorithm. #144

burdges · 2017-04-06T15:06:26Z

I'm donno if you guys want to merge this since it depends on deprecated access to constants, but it's useful if you've a slow random number generator and want only small samples.

This gets rid of the 'experimental' level, removes the non-staged_api case (i.e. stability levels for out-of-tree crates), and lets the staged_api attributes use 'unstable' and 'deprecated' lints. This makes the transition period to the full feature staging design a bit nicer.

This adds the int_uint feature to *every* library, whether or not it needs it.

Conflicts: src/test/compile-fail/borrowck-move-out-of-overloaded-auto-deref.rs src/test/compile-fail/issue-2590.rs src/test/compile-fail/lint-stability.rs src/test/compile-fail/slice-mut-2.rs src/test/compile-fail/std-uncopyable-atomics.rs

Fixes #16072

@huonw

Fixes #16072 r? @huonw

Lets them build with the -dev, -nightly, or snapshot compiler

* `core` - for the core crate * `hash` - hashing * `io` - io * `path` - path * `alloc` - alloc crate * `rand` - rand crate * `collections` - collections crate * `std_misc` - other parts of std * `test` - test crate * `rustc_private` - everything else

Conflicts: mk/tests.mk src/liballoc/arc.rs src/liballoc/boxed.rs src/liballoc/rc.rs src/libcollections/bit.rs src/libcollections/btree/map.rs src/libcollections/btree/set.rs src/libcollections/dlist.rs src/libcollections/ring_buf.rs src/libcollections/slice.rs src/libcollections/str.rs src/libcollections/string.rs src/libcollections/vec.rs src/libcollections/vec_map.rs src/libcore/any.rs src/libcore/array.rs src/libcore/borrow.rs src/libcore/error.rs src/libcore/fmt/mod.rs src/libcore/iter.rs src/libcore/marker.rs src/libcore/ops.rs src/libcore/result.rs src/libcore/slice.rs src/libcore/str/mod.rs src/libregex/lib.rs src/libregex/re.rs src/librustc/lint/builtin.rs src/libstd/collections/hash/map.rs src/libstd/collections/hash/set.rs src/libstd/sync/mpsc/mod.rs src/libstd/sync/mutex.rs src/libstd/sync/poison.rs src/libstd/sync/rwlock.rs src/libsyntax/feature_gate.rs src/libsyntax/test.rs

…calls

Conflicts: src/libcore/cell.rs src/librustc_driver/test.rs src/libstd/old_io/net/tcp.rs src/libstd/old_io/process.rs

For example, `sample(r, b"ATGC", 4)` always returns `[b'A', b'T', b'G', b'C']`.

Clarify reservoir sampling docs

See : http://fermatslibrary.com/s/a-sample-of-brilliance https://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin

burdges · 2017-04-06T15:38:41Z

I see. One cannot even use features for access to deprecated stuff in stable. Any opinions on doing this without Zero and One from std?

dhardy · 2017-07-28T10:16:41Z

Since the dimension of a Vec or slice is always usize, why use the generic One and Zero at all?

burdges · 2017-07-28T15:18:38Z

It's running over a Range<T>, not a vector, so that you can pull a small sample from a large range efficiently. Yes, a map from some integer type to T suffices too, so maybe there is a better bound using From or TryFrom?

Also, you could support only usize expecting the user to either pull from an array, etc. or else cast themselves. I suppose this exists for smallish samples, so indexing is likely and conversion is cheap, making this perhaps the best solution. Is that what you meant?

TheIronBorn · 2017-08-31T19:39:13Z

The issue I opened, referenced above, produces a similar result and runs in O(amount) (worst case O(n/2)) but uses the full space: O(n) / O(range.len()).

burdges · 2017-08-31T21:48:34Z

I do think a Range based method makes sense for small samples from big numeric ranges. If one needs allocation anyways then maybe a HashSet makes more sense, or some heap property applied to the output. It's fast to shuffle the result if amount is that small.

pitdicker · 2018-01-05T14:27:59Z

This is a neat algorithm, and to me it seems like something nice to add to rand.

I have not looked closely, but the new seq module might be a good place? And @burdges, you where planning on using usize instead of generics?

@dhardy do you think the time is right to move this forward?

pitdicker · 2018-01-05T14:29:20Z

Does it make sense to make this a distribution instead? Edit: no.

dhardy · 2018-01-05T14:32:41Z

If it does the same job as sample but with different performance trade-offs, then IMO it should be called sample_floyd or some such, and also in the seq module, yes. I haven't looked at it in detail.

burdges · 2018-01-06T01:54:25Z

Yes usize makes sense.

pitdicker · 2018-01-06T15:05:19Z

Robert Floyd's algorithm works in the sense that a value is never picked twice. But it does not produce an all that random list of results. Example from picking 50 numbers from 1..100, and comparing them against a simple counter in the second column:

4	50	
35	51	
39	52	
25	53	
40	54	
55	55	matches prediction
6	56	
1	57	
58	58	matches prediction
5	59	
11	60	
61	61	matches prediction
56	62	
51	63	
27	64	
64	65	
66	66	matches prediction
3	67	
7	68	
69	69	matches prediction
68	70	
2	71	
72	72	matches prediction
38	73	
74	74	matches prediction
71	75	
59	76	
28	77	
78	78	matches prediction
29	79	
52	80	
65	81	
82	82	matches prediction
83	83	matches prediction
84	84	matches prediction
79	85	
86	86	matches prediction
62	87	
8	88	
89	89	matches prediction
32	90	
76	91	
9	92	
93	93	matches prediction
94	94	matches prediction
36	95	
13	96	
23	97	
98	98	matches prediction
99	99	matches prediction

And if I try to get 90 results out of a list of 100, it becomes almost a counter:

 9, 10,  7, 13,  4, 15, 16, 17, 14, 19,  5, 12, 22,  3,  2, 25, 26, 6, 28, 29,
30, 24, 32, 33, 23, 34, 31, 37, 38, 39, 18, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, 51, 52, 11,  1, 55, 56, 57,  8, 21, 60, 61, 62, 63, 64, 65, 66, 67, 59, 69,
70, 40, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 93, 94, 95, 96, 97, 98, 99

So we should document very clearly when it is a good idea to use this, or not include it in rand at all.

burdges · 2018-01-06T21:11:10Z

Yes, the function is called combination and documentation says "The order of elements is not random."

As an aside, sample would return an unordered set if it correspond to the mathematical definition, but since combination_and_permutation is too long, and rust eschews allocations, it's reasonable to slightly incorrectly abbreviate that by sample.

pitdicker · 2018-04-04T18:45:21Z

Sorry this PR got so suddenly closed. The cause was that we cleaned up the git history (#350) and forced-pushed it to master. Now there was a difference of about 30.000, and I suppose GitHub just gave up...

This PR, together with two others, stayed open for so long with little attention because it touches on an area of Rand that needs some design first. That is now slowly getting explored in dhardy#82.

dhardy · 2018-05-25T19:54:29Z

The algorithm here is interesting. Unfortunately @burdges implementation here has a bug. This is a simpler version which I think is correct:

fn combination<T>(rng, low, high, amount) -> Vec<T> {
    assert!(amount <= (high - low));
    let mut s = Vec::with_capacity(amount);
    let n = high - low;
    for j in n - amount .. n {
        let t = rng.gen_range(low, low + j + 1);
        let t = if s.contains(&t) { low + j } else { t };
        s.push( t );
    }
    s
}

@burdges implementation instead does let m = min(amount, n - 1). Okay, there's not much point allowing all elements to be sampled, though that is valid. Further, we cannot sample more elements than are available. But is it correct to automatically reduce the number of elements sampled instead of stopping with an error message? This isn't the bug however.

The bug is that @burdges simplified by instead setting n = high and adjusting low + j → j, however forgetting to account for this in min(amount, n - 1).

Note that sample_indices_inplace is similar to this, but less generic. sample_indices_cache is the same algorithm but with lazy representation of elements (optimised for large n). Floyd's algorithm should be able to replace both with a simpler, unified implementation (although for m close to n I'm not sure whether sample_indices_inplace would be faster).

pitdicker · 2018-05-26T07:05:06Z

@dhardy Maybe you can say what you think about the direction in dhardy#82, which includes this one but tries to get a bit of a wider perspective?

dhardy · 2018-05-26T07:08:39Z

I'm getting there @pitdicker ...

pitdicker · 2018-05-26T07:13:29Z

No hurry, but I had the idea there that we could provide a couple of 'low-level' algorithms, and the current sample_* methods could be more the more 'high-level' variants, using different algorithms depending on the situation, and only available with std.

alexcrichton and others added 30 commits January 7, 2015 19:27

Test fixes and rebase conflicts

e4cd8be

Test fixes and rebase conflicts

cab5165

Remove warning from the libraries.

90f7740

This adds the int_uint feature to *every* library, whether or not it needs it.

powerpc: Add libstd utils

4238544

Add note about TLS lookups in random()

08aad13

Fixes #16072

cleanup: &foo[0..a] -> &foo[..a]

fa2a456

rollup merge of #21059: steveklabnik/gh16072

1f7dc56

Fixes #16072 r? @huonw

Register new snapshots.

381fb8d

Set allow(unstable) in crates that use unstable features

7c8b5ad

Lets them build with the -dev, -nightly, or snapshot compiler

Fallout from stabilization.

f2ddc0e

Fallout from stabilization.

377dec6

Add 'feature' and 'since' to stability attributes

f2091cc

Add 'feature' and 'since' to stability attributes

8138085

Tie stability attributes to feature gates

464d0a5

Remove 'since' from unstable attributes

f585111

Remove 'since' from unstable attributes

0810e12

Put #[staged_api] behind the 'staged_api' gate

e87a7d9

grandfathered -> rust1

46fc24b

Fallout of io => old_io

bcb0314

cleanup depricated uint in rand mod and rand os

5c8578a

Don't use if we can avoid it

654b198

cleanup: s/v.slice*()/&v[a..b]/g + remove redundant as_slice() …

ce70ae7

…calls

Merge remote-tracking branch 'rust-lang/master'

98e5d88

Conflicts: src/libcore/cell.rs src/librustc_driver/test.rs src/libstd/old_io/net/tcp.rs src/libstd/old_io/process.rs

matklad and others added 4 commits March 11, 2017 17:47

Clarify reservoir sampling docs

e35c44c

For example, `sample(r, b"ATGC", 4)` always returns `[b'A', b'T', b'G', b'C']`.

Merge pull request rust-random#141 from matklad/sample-docs

a0de9bf

Clarify reservoir sampling docs

Make test tighter

2f25c76

Add combination function that used Robert Floyd's algorithm

9238471

See : http://fermatslibrary.com/s/a-sample-of-brilliance https://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin

burdges changed the title ~~Rober Floyd's combination algorithm.~~ Robert Floyd's combination algorithm. Apr 6, 2017

alexcrichton added the F-new-int Functionality: new, within Rand label Jun 14, 2017

burdges mentioned this pull request Aug 31, 2017

sample with random ordering #169

Closed

burdges mentioned this pull request Nov 22, 2017

rand::sample's API should be fallible #194

Closed

pitdicker mentioned this pull request Jan 6, 2018

Sequence sampling: seq, WeightedChoice dhardy/rand#82

Closed

5 tasks

dhardy added the P-low Priority: Low label Mar 12, 2018

pitdicker mentioned this pull request Mar 29, 2018

Clean up git history #350

Closed

pitdicker closed this Apr 4, 2018

pitdicker force-pushed the master branch from baac1ca to 73ea10f Compare April 4, 2018 17:30

dhardy mentioned this pull request May 25, 2018

seq: use Floyd's combination algorithm to sample indices #479

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Robert Floyd's combination algorithm. #144

Robert Floyd's combination algorithm. #144

burdges commented Apr 6, 2017

burdges commented Apr 6, 2017

dhardy commented Jul 28, 2017

burdges commented Jul 28, 2017 •

edited

Loading

TheIronBorn commented Aug 31, 2017 •

edited

Loading

burdges commented Aug 31, 2017

pitdicker commented Jan 5, 2018

pitdicker commented Jan 5, 2018 •

edited

Loading

dhardy commented Jan 5, 2018

burdges commented Jan 6, 2018

pitdicker commented Jan 6, 2018 •

edited

Loading

burdges commented Jan 6, 2018

pitdicker commented Apr 4, 2018

dhardy commented May 25, 2018

pitdicker commented May 26, 2018

dhardy commented May 26, 2018

pitdicker commented May 26, 2018

Robert Floyd's combination algorithm. #144

Robert Floyd's combination algorithm. #144

Conversation

burdges commented Apr 6, 2017

burdges commented Apr 6, 2017

dhardy commented Jul 28, 2017

burdges commented Jul 28, 2017 • edited Loading

TheIronBorn commented Aug 31, 2017 • edited Loading

burdges commented Aug 31, 2017

pitdicker commented Jan 5, 2018

pitdicker commented Jan 5, 2018 • edited Loading

dhardy commented Jan 5, 2018

burdges commented Jan 6, 2018

pitdicker commented Jan 6, 2018 • edited Loading

burdges commented Jan 6, 2018

pitdicker commented Apr 4, 2018

dhardy commented May 25, 2018

pitdicker commented May 26, 2018

dhardy commented May 26, 2018

pitdicker commented May 26, 2018

burdges commented Jul 28, 2017 •

edited

Loading

TheIronBorn commented Aug 31, 2017 •

edited

Loading

pitdicker commented Jan 5, 2018 •

edited

Loading

pitdicker commented Jan 6, 2018 •

edited

Loading