Performance improvements for shuffle and partial_shuffle #1272
Conversation
Huh, I've always thought shuffling was memory bound. Impressive work.
If we did have a way of determining native output size (#1261), using 64-bit chunks would give significant performance improvements when shuffling longer sequences. Unfortunately this would lead to different values on 32-bit and 64-bit machines.
Only where a different RNG is used on these machines (e.g. …). There is a question of which chunk size we should use by default given that 64-bit CPUs are now the norm, though it would penalise performance for short lists with e.g. ChaCha. Actually, we should run benchmarks with both chunk sizes with several RNGs (e.g. Pcg32, Pcg64 and ChaCha12; we don't need two variants of ChaCha), as a hack that doesn't need to be committed. I'll do this for your other PR.
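To put rough numbers on that trade-off (an illustrative calculation, not code from either PR): Fisher-Yates step `i` needs a value in `0..=i`, so a chunk of `b` random bits can serve the longest run of steps whose bounds multiply to at most `2^b`. A hypothetical helper:

```rust
/// Roughly how many consecutive Fisher-Yates steps starting at index
/// `start` fit into one random chunk of `bits` bits, i.e. the largest
/// k with (start+1) * (start+2) * ... * (start+k) <= 2^bits.
/// (Illustrative helper only, not code from the PR.)
fn bounds_per_chunk(start: u128, bits: u32) -> u32 {
    let limit = 1u128 << bits;
    let mut product = 1u128;
    let mut next = start + 1;
    let mut k = 0;
    while product * next <= limit {
        product *= next;
        next += 1;
        k += 1;
    }
    k
}

fn main() {
    // Near the start of a shuffle: 12! < 2^32 < 13! and 20! < 2^64,
    // so a u32 chunk covers 12 swaps where a u64 chunk covers 20.
    assert_eq!(bounds_per_chunk(0, 32), 12);
    assert_eq!(bounds_per_chunk(0, 64), 20);
    // Near index 1000, a u64 chunk serves about twice as many swaps,
    // which is why 64-bit chunks mainly help on longer sequences.
    assert_eq!(bounds_per_chunk(999, 32), 3);
    assert_eq!(bounds_per_chunk(999, 64), 6);
}
```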
I ran the benchmarks comparing u32 and u64 versions.
Meanwhile ChaCha and Pcg32 also see gains despite not discarding random bits in the same way. We could use this to select a different shuffling implementation based on the slice length, regardless of RNG algorithm. But is there enough interest in shuffling large slices to justify the extra complexity? Probably better to stick with 32-bit only. Thanks for the extra benchmarks @wainwrightmark.
I agree.
Please rebase and add a copyright header to `increasing_uniform.rs` (also `coin_flipper.rs`, missed in the last PR). Please also run `rustfmt src/seq/increasing_uniform.rs` and wrap long comments. Fewer comments may be better; detailed explanations like this run the risk of becoming outdated (I mostly didn't read them).
```rust
        self.swap(i, index);
    }
} else {
    for i in m..self.len() {
```
You need to reverse the iterator (both loops). Your code can only "choose" the last element of the list with probability `1/len` when it should be `m/len`.
Ermm, I'm pretty sure I've got this right. The last element gets swapped to a random place in the list, so it has an `m/len` probability of being in the first `m` elements. Earlier elements are more likely to be chosen initially but can get booted out by later ones. The `test_shuffle` test checks this, and I've also tried similar tests with longer lists and more runs.

The reason I don't reverse the iterator is that `increasing_uniform` needs `i` to increase, and a decreasing version would be more complicated.
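A quick empirical check of this claim might look as follows (hypothetical test, not the crate's actual `test_shuffle`; assumes the rand 0.8-style `partial_shuffle` API): each original element should land among the `amount` chosen positions in roughly `RUNS * amount / len` runs.

```rust
use rand::seq::SliceRandom;

#[test]
fn partial_shuffle_choice_frequencies() {
    const LEN: usize = 10;
    const AMOUNT: usize = 3;
    const RUNS: usize = 100_000;
    let mut rng = rand::thread_rng();
    let mut counts = [0usize; LEN];
    for _ in 0..RUNS {
        let mut data: Vec<usize> = (0..LEN).collect();
        // partial_shuffle returns (chosen, rest).
        let (chosen, _) = data.partial_shuffle(&mut rng, AMOUNT);
        for &e in chosen.iter() {
            counts[e] += 1;
        }
    }
    let expected = RUNS * AMOUNT / LEN; // amount/len of all runs
    for &c in &counts {
        // Loose 5% tolerance (~10 standard deviations here); a real
        // test would pick a principled statistical bound.
        assert!((c as f64 - expected as f64).abs() < expected as f64 * 0.05);
    }
}
```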
Okay. We previously reversed since that way the proof by induction is easier, but we can also prove this algorithm works.

First, let's not use `m = len - amount`, since in the last PR we used `m = amount`. I'll use `end = len - amount` instead.

Say we have a list `elts = [e0, e1, .., ei, ..]` of length `len`. Elements are "chosen" if they appear in `elts[end..len]` after the algorithm; additionally, we need to show that this slice is fully shuffled.

The algorithm is:

```rust
for i in end..len {
    elts.swap(i, rng.sample_range(0..=i));
}
```

For any length, for `amount = 0` or `amount = 1`, this is clearly correct. We proceed by induction, assuming the algorithm is already proven correct for `amount-1` and `len-1` (so that `end` does not change and the algorithm only has one last swap to perform).

Thus, we assume:

- For any elt `ei`, we have `P(ei in elts[0..end]) = end/(len-1)` [here we say nothing about element order]
- For any elt `ei`, `P(ei in elts[end..len-1]) = (amount-1)/(len-1)`, and for any `k in end..(len-1)`, `P(elts[k] = ei) = 1/(len-1)` [fully shuffled]

We perform the last step of the algorithm: `let x = rng.sample_range(0..=len-1); elts.swap(len-1, x);`. Now:

- Any element in `elts[0..end]` is moved to `elts[len-1]` with probability `1/len`; thus for any elt `ei` except `e_last`, `P(ei in elts[0..end]) = end/(len-1) * (len-1)/len = end/len`
- For any elt `ei` previously in `elts[end..len-1]`, the chance it is not moved is `(len-1)/len`; thus for these `ei`, `P(ei in elts[end..len-1]) = (amount-1)/(len-1) * (len-1)/len = (amount-1)/len`, and for any `k in end..(len-1)`, `P(elts[k] = ei) = 1/(len-1) * (len-1)/len = 1/len`
- For any elt `ei` previously in `elts[end..len-1]`, `P(elts[len-1] = ei) = 1/len`
- The previous two points together imply that for any `ei` previously in `elts[end..len-1]`, `P(ei in elts[end..len]) = ((amount-1)+1)/len = amount/len`, with `P(elts[k] = ei) = 1/len` for any `k in end..len`
- Element `e_last` may appear in any position with probability `1/len`

Thus each element has chance `amount/len` to appear in `elts[end..len]`, and this slice is fully shuffled.
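The base cases are small enough to check exhaustively. A hypothetical standalone snippet for `len = 4, amount = 2` (so `end = 2`), enumerating all `3 * 4 = 12` equally likely draw pairs instead of sampling them:

```rust
fn main() {
    const LEN: usize = 4;
    const END: usize = 2; // len - amount, with amount = 2
    let mut counts = [[0u32; LEN]; LEN]; // counts[element][position]
    for x2 in 0..=2usize {
        for x3 in 0..=3usize {
            let mut elts = [0usize, 1, 2, 3];
            // The two loop steps with the draws enumerated:
            // i = 2 takes x2 in 0..=2, i = 3 takes x3 in 0..=3.
            elts.swap(2, x2);
            elts.swap(3, x3);
            for k in END..LEN {
                counts[elts[k]][k] += 1;
            }
        }
    }
    for (e, row) in counts.iter().enumerate() {
        println!("element {e}: counts per position = {row:?}");
        // Each element sits in each chosen position in 3 of the 12
        // outcomes, i.e. P(elts[k] = ei) = 1/len, and is chosen with
        // probability 2 * 3/12 = amount/len = 1/2, as the proof says.
        assert_eq!(row[2], 3);
        assert_eq!(row[3], 3);
    }
}
```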
```rust
// Take the next value in 0..next_n from the pre-generated chunk,
// then divide the chunk down so its remaining bits serve later calls.
let r = self.chunk % next_n;
self.chunk /= next_n;
r as usize
```
There's probably also room for further optimisation here: modulus is a slow operation (see https://www.pcg-random.org/posts/bounded-rands.html).
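For reference, the main alternative that article proposes is the widening-multiply ("Lemire") reduction, sketched below. This is not what the PR does, and as written it is slightly biased unless combined with a rejection step; note also that, unlike `%` with `/=`, it does not leave an easily reusable quotient for the chunk-splitting above.

```rust
/// Map a uniformly random `x: u32` to `0..n` with a widening multiply
/// instead of a modulus. Minimal sketch of the technique from the
/// linked article; biased without a rejection step.
fn bounded(x: u32, n: u32) -> u32 {
    (((x as u64) * (n as u64)) >> 32) as u32
}
```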
I did read that article and it helped me find some of the optimizations I used for this. I also tried using a method based on bitmask but it turned out about 50% slower than this. Obviously I could easily have missed something.
Force-pushed from 0cee068 to d2e939f.
👍
This is related to #1266 but completely orthogonal to #1268, which improves the performance of a different set of methods in a different way.
This improves the performance of `SliceRandom::shuffle()` and `SliceRandom::partial_shuffle()` by essentially batching the random number generation. It seems to be about 50-100% faster for most slice lengths, with less improvement for longer slices. It will use the old method for slices longer than 2^32.

This is a value-breaking change.
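A minimal sketch of the batching idea follows. This is a hypothetical reconstruction, not the PR's code: the names (`Batched`, `next_index`) are invented, and the PR's `IncreasingUniform` additionally handles the modulo bias that this sketch ignores.

```rust
use rand::RngCore;

/// One 32-bit draw is treated as a mixed-radix number and peeled apart
/// digit by digit, so several small bounded samples cost a single RNG
/// call. Sketch only: ignores modulo bias.
struct Batched<'a, R: RngCore> {
    rng: &'a mut R,
    chunk: u32, // leftover randomness from the last RNG call
    used: u32,  // product of the bounds already peeled off `chunk`
}

impl<'a, R: RngCore> Batched<'a, R> {
    fn new(rng: &'a mut R) -> Self {
        // `used` starts saturated so the first call draws a chunk.
        Batched { rng, chunk: 0, used: u32::MAX }
    }

    /// Return a value in 0..n (n >= 1).
    fn next_index(&mut self, n: u32) -> usize {
        match self.used.checked_mul(n) {
            Some(used) => self.used = used, // enough randomness remains
            None => {
                self.chunk = self.rng.next_u32(); // refill
                self.used = n;
            }
        }
        // The `chunk % next_n` / `chunk /= next_n` pair quoted above.
        let r = self.chunk % n;
        self.chunk /= n;
        r as usize
    }
}

/// Forward Fisher-Yates over the batched sampler. Iterating upwards is
/// what lets the bound grow one step at a time (cf. the comment above
/// about `increasing_uniform` needing `i` to increase). Assumes
/// len <= 2^32; the PR falls back to the old method beyond that.
fn shuffle_batched<T, R: RngCore>(data: &mut [T], rng: &mut R) {
    let mut batched = Batched::new(rng);
    for i in 1..data.len() {
        let j = batched.next_index(i as u32 + 1);
        data.swap(i, j);
    }
}
```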
Benchmark results (plots omitted):
- Partial Shuffle (half of the slice shuffled)
- Shuffle