Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Robert Floyd's combination algorithm. #144

Closed
wants to merge 10,000 commits into from

Conversation

burdges
Copy link
Contributor

@burdges burdges commented Apr 6, 2017

I'm donno if you guys want to merge this since it depends on deprecated access to constants, but it's useful if you've a slow random number generator and want only small samples.

alexcrichton and others added 30 commits January 7, 2015 19:27
This gets rid of the 'experimental' level, removes the non-staged_api
case (i.e. stability levels for out-of-tree crates), and lets the
staged_api attributes use 'unstable' and 'deprecated' lints.

This makes the transition period to the full feature staging design
a bit nicer.
This gets rid of the 'experimental' level, removes the non-staged_api
case (i.e. stability levels for out-of-tree crates), and lets the
staged_api attributes use 'unstable' and 'deprecated' lints.

This makes the transition period to the full feature staging design
a bit nicer.
This adds the int_uint feature to *every* library, whether or not it
needs it.
Conflicts:
	src/test/compile-fail/borrowck-move-out-of-overloaded-auto-deref.rs
	src/test/compile-fail/issue-2590.rs
	src/test/compile-fail/lint-stability.rs
	src/test/compile-fail/slice-mut-2.rs
	src/test/compile-fail/std-uncopyable-atomics.rs
Lets them build with the -dev, -nightly, or snapshot compiler
* `core` - for the core crate
* `hash` - hashing
* `io` - io
* `path` - path
* `alloc` - alloc crate
* `rand` - rand crate
* `collections` - collections crate
* `std_misc` - other parts of std
* `test` - test crate
* `rustc_private` - everything else
* `core` - for the core crate
* `hash` - hashing
* `io` - io
* `path` - path
* `alloc` - alloc crate
* `rand` - rand crate
* `collections` - collections crate
* `std_misc` - other parts of std
* `test` - test crate
* `rustc_private` - everything else
Conflicts:
	mk/tests.mk
	src/liballoc/arc.rs
	src/liballoc/boxed.rs
	src/liballoc/rc.rs
	src/libcollections/bit.rs
	src/libcollections/btree/map.rs
	src/libcollections/btree/set.rs
	src/libcollections/dlist.rs
	src/libcollections/ring_buf.rs
	src/libcollections/slice.rs
	src/libcollections/str.rs
	src/libcollections/string.rs
	src/libcollections/vec.rs
	src/libcollections/vec_map.rs
	src/libcore/any.rs
	src/libcore/array.rs
	src/libcore/borrow.rs
	src/libcore/error.rs
	src/libcore/fmt/mod.rs
	src/libcore/iter.rs
	src/libcore/marker.rs
	src/libcore/ops.rs
	src/libcore/result.rs
	src/libcore/slice.rs
	src/libcore/str/mod.rs
	src/libregex/lib.rs
	src/libregex/re.rs
	src/librustc/lint/builtin.rs
	src/libstd/collections/hash/map.rs
	src/libstd/collections/hash/set.rs
	src/libstd/sync/mpsc/mod.rs
	src/libstd/sync/mutex.rs
	src/libstd/sync/poison.rs
	src/libstd/sync/rwlock.rs
	src/libsyntax/feature_gate.rs
	src/libsyntax/test.rs
Conflicts:
	mk/tests.mk
	src/liballoc/arc.rs
	src/liballoc/boxed.rs
	src/liballoc/rc.rs
	src/libcollections/bit.rs
	src/libcollections/btree/map.rs
	src/libcollections/btree/set.rs
	src/libcollections/dlist.rs
	src/libcollections/ring_buf.rs
	src/libcollections/slice.rs
	src/libcollections/str.rs
	src/libcollections/string.rs
	src/libcollections/vec.rs
	src/libcollections/vec_map.rs
	src/libcore/any.rs
	src/libcore/array.rs
	src/libcore/borrow.rs
	src/libcore/error.rs
	src/libcore/fmt/mod.rs
	src/libcore/iter.rs
	src/libcore/marker.rs
	src/libcore/ops.rs
	src/libcore/result.rs
	src/libcore/slice.rs
	src/libcore/str/mod.rs
	src/libregex/lib.rs
	src/libregex/re.rs
	src/librustc/lint/builtin.rs
	src/libstd/collections/hash/map.rs
	src/libstd/collections/hash/set.rs
	src/libstd/sync/mpsc/mod.rs
	src/libstd/sync/mutex.rs
	src/libstd/sync/poison.rs
	src/libstd/sync/rwlock.rs
	src/libsyntax/feature_gate.rs
	src/libsyntax/test.rs
Conflicts:
	src/libcore/cell.rs
	src/librustc_driver/test.rs
	src/libstd/old_io/net/tcp.rs
	src/libstd/old_io/process.rs
@burdges burdges changed the title Rober Floyd's combination algorithm. Robert Floyd's combination algorithm. Apr 6, 2017
@burdges
Copy link
Contributor Author

burdges commented Apr 6, 2017

I see. One cannot even use features for access to deprecated stuff in stable. Any opinions on doing this without Zero and One from std?

@alexcrichton alexcrichton added the F-new-int Functionality: new, within Rand label Jun 14, 2017
@dhardy
Copy link
Member

dhardy commented Jul 28, 2017

Since the dimension of a Vec or slice is always usize, why use the generic One and Zero at all?

@burdges
Copy link
Contributor Author

burdges commented Jul 28, 2017

It's running over a Range<T>, not a vector, so that you can pull a small sample from a large range efficiently. Yes, a map from some integer type to T suffices too, so maybe there is a better bound using From or TryFrom?

Also, you could support only usize expecting the user to either pull from an array, etc. or else cast themselves. I suppose this exists for smallish samples, so indexing is likely and conversion is cheap, making this perhaps the best solution. Is that what you meant?

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Aug 31, 2017

The issue I opened, referenced above, produces a similar result and runs in O(amount) (worst case O(n/2)) but uses the full space: O(n) / O(range.len()).

@burdges
Copy link
Contributor Author

burdges commented Aug 31, 2017

I do think a Range based method makes sense for small samples from big numeric ranges. If one needs allocation anyways then maybe a HashSet makes more sense, or some heap property applied to the output. It's fast to shuffle the result if amount is that small.

@pitdicker
Copy link
Contributor

This is a neat algorithm, and to me it seems like something nice to add to rand.

I have not looked closely, but the new seq module might be a good place? And @burdges, you where planning on using usize instead of generics?

@dhardy do you think the time is right to move this forward?

@pitdicker
Copy link
Contributor

pitdicker commented Jan 5, 2018

Does it make sense to make this a distribution instead? Edit: no.

@dhardy
Copy link
Member

dhardy commented Jan 5, 2018

If it does the same job as sample but with different performance trade-offs, then IMO it should be called sample_floyd or some such, and also in the seq module, yes. I haven't looked at it in detail.

@burdges
Copy link
Contributor Author

burdges commented Jan 6, 2018

Yes usize makes sense.

@pitdicker
Copy link
Contributor

pitdicker commented Jan 6, 2018

Robert Floyd's algorithm works in the sense that a value is never picked twice. But it does not produce an all that random list of results. Example from picking 50 numbers from 1..100, and comparing them against a simple counter in the second column:

4	50	
35	51	
39	52	
25	53	
40	54	
55	55	matches prediction
6	56	
1	57	
58	58	matches prediction
5	59	
11	60	
61	61	matches prediction
56	62	
51	63	
27	64	
64	65	
66	66	matches prediction
3	67	
7	68	
69	69	matches prediction
68	70	
2	71	
72	72	matches prediction
38	73	
74	74	matches prediction
71	75	
59	76	
28	77	
78	78	matches prediction
29	79	
52	80	
65	81	
82	82	matches prediction
83	83	matches prediction
84	84	matches prediction
79	85	
86	86	matches prediction
62	87	
8	88	
89	89	matches prediction
32	90	
76	91	
9	92	
93	93	matches prediction
94	94	matches prediction
36	95	
13	96	
23	97	
98	98	matches prediction
99	99	matches prediction

And if I try to get 90 results out of a list of 100, it becomes almost a counter:

 9, 10,  7, 13,  4, 15, 16, 17, 14, 19,  5, 12, 22,  3,  2, 25, 26, 6, 28, 29,
30, 24, 32, 33, 23, 34, 31, 37, 38, 39, 18, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, 51, 52, 11,  1, 55, 56, 57,  8, 21, 60, 61, 62, 63, 64, 65, 66, 67, 59, 69,
70, 40, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 93, 94, 95, 96, 97, 98, 99

So we should document very clearly when it is a good idea to use this, or not include it in rand at all.

@burdges
Copy link
Contributor Author

burdges commented Jan 6, 2018

Yes, the function is called combination and documentation says "The order of elements is not random."

As an aside, sample would return an unordered set if it correspond to the mathematical definition, but since combination_and_permutation is too long, and rust eschews allocations, it's reasonable to slightly incorrectly abbreviate that by sample.

@pitdicker
Copy link
Contributor

Sorry this PR got so suddenly closed. The cause was that we cleaned up the git history (#350) and forced-pushed it to master. Now there was a difference of about 30.000, and I suppose GitHub just gave up...

This PR, together with two others, stayed open for so long with little attention because it touches on an area of Rand that needs some design first. That is now slowly getting explored in dhardy#82.

@dhardy
Copy link
Member

dhardy commented May 25, 2018

The algorithm here is interesting. Unfortunately @burdges implementation here has a bug. This is a simpler version which I think is correct:

fn combination<T>(rng, low, high, amount) -> Vec<T> {
    assert!(amount <= (high - low));
    let mut s = Vec::with_capacity(amount);
    let n = high - low;
    for j in n - amount .. n {
        let t = rng.gen_range(low, low + j + 1);
        let t = if s.contains(&t) { low + j } else { t };
        s.push( t );
    }
    s
}

@burdges implementation instead does let m = min(amount, n - 1). Okay, there's not much point allowing all elements to be sampled, though that is valid. Further, we cannot sample more elements than are available. But is it correct to automatically reduce the number of elements sampled instead of stopping with an error message? This isn't the bug however.

The bug is that @burdges simplified by instead setting n = high and adjusting low + jj, however forgetting to account for this in min(amount, n - 1).

Note that sample_indices_inplace is similar to this, but less generic. sample_indices_cache is the same algorithm but with lazy representation of elements (optimised for large n). Floyd's algorithm should be able to replace both with a simpler, unified implementation (although for m close to n I'm not sure whether sample_indices_inplace would be faster).

@pitdicker
Copy link
Contributor

@dhardy Maybe you can say what you think about the direction in dhardy#82, which includes this one but tries to get a bit of a wider perspective?

@dhardy
Copy link
Member

dhardy commented May 26, 2018

I'm getting there @pitdicker ...

@pitdicker
Copy link
Contributor

No hurry, but I had the idea there that we could provide a couple of 'low-level' algorithms, and the current sample_* methods could be more the more 'high-level' variants, using different algorithms depending on the situation, and only available with std.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
F-new-int Functionality: new, within Rand P-low Priority: Low
Projects
None yet
Development

Successfully merging this pull request may close these issues.