Use Iterator::size_hint() to speed up IteratorRandom::choose #593
Conversation
❌ Build rand 1.0.69 failed (commit 9ee9124fba by @sicking)
❌ Build rand 1.0.70 failed (commit 8dd0f8477c by @sicking)
✅ Build rand 1.0.71 completed (commit 337045d12c by @sicking)
✅ Build rand 1.0.72 completed (commit 0e742eb193 by @sicking)
Well @sicking, this did make the logic a bit more complex than I'd expected. I think the logic is correct.

A 5-10% performance decrease over general iterators, in exchange for a big improvement in cases with known size, sounds like a win to me, even if the overhead is a bit higher than expected.

You could try instead only calling `size_hint` once or twice (i.e. removing the skip logic from the loop); does this improve un-skippable iterators? In my experience most iterators either have a known length or a completely unknown one, or in a few cases a known minimum. It seems unlikely that `size_hint` will be useful later on, although it's possible this could be useful somewhere I guess.
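For reference, the three hint shapes mentioned here (exact length, completely unknown, and known minimum only) can be illustrated with plain std iterators; this is just a demonstration, not code from the PR:

```rust
fn main() {
    let v = vec![1, 2, 3, 4];

    // Exact length: slice iterators are perfectly hinted.
    assert_eq!(v.iter().size_hint(), (4, Some(4)));

    // Completely unknown count: `filter` collapses the lower bound to 0,
    // since it cannot know how many items will pass the predicate.
    assert_eq!(v.iter().filter(|&&x| x > 1).size_hint(), (0, Some(4)));

    // Known minimum only: chaining an exactly-sized iterator with a
    // filtered one keeps a nonzero lower bound but an inexact upper bound.
    let it = (0..10).chain(v.into_iter().filter(|&x| x > 1));
    assert_eq!(it.size_hint(), (10, Some(14)));
}
```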
```rust
if upper == Some(lower) {
    // Remove this once we can specialize on ExactSizeIterator
    return if lower == 0 { None } else { self.nth(rng.gen_range(0, lower)) };
```
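As a self-contained sketch of what this fast path does: when the hint is exact, a single random index plus `nth` selects uniformly without walking the whole iterator. `DemoRng` and `choose_exact` below are hypothetical stand-ins for illustration (the modulo-based `gen_range` is slightly biased, which a real rand `Rng` avoids):

```rust
// Hypothetical minimal RNG standing in for rand's `Rng`.
struct DemoRng(u64);

impl DemoRng {
    // Simplified, slightly biased stand-in for `rng.gen_range(0, n)`.
    fn gen_range(&mut self, low: usize, high: usize) -> usize {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1);
        low + (self.0 >> 33) as usize % (high - low)
    }
}

// Sketch of the exact-size fast path discussed above.
fn choose_exact<I: Iterator>(mut iter: I, rng: &mut DemoRng) -> Option<I::Item> {
    let (lower, upper) = iter.size_hint();
    if upper == Some(lower) {
        // Exact size known: one draw, one `nth` call.
        return if lower == 0 { None } else { iter.nth(rng.gen_range(0, lower)) };
    }
    None // the fallback (sampling) path is elided in this sketch
}

fn main() {
    let mut rng = DemoRng(42);
    let picked = choose_exact(0..100, &mut rng);
    assert!(matches!(picked, Some(n) if n < 100));
    // An empty, exactly-hinted iterator yields None.
    assert_eq!(choose_exact(0..0, &mut rng), None);
}
```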
I don't know if we should remove later — it's valid to give an exact size hint without implementing the latter trait (e.g. because the size bounds are not always known).
Yeah, we can certainly go either way. The code should still compile to something quite similar since the `upper == Some(lower)` check further down is still there. But we do a few branches which the compiler likely won't be able to optimize away before getting there.
It seems there's a big difference when this is removed (around 6ms vs 3.7ms on the fully hinted `seq_iter_choose_from_1000` test), so worth keeping I think.
```rust
if upper == Some(lower) {
    // Remove this once we can specialize on ExactSizeIterator
    return if lower == 0 { None } else { self.nth(rng.gen_range(0, lower)) };
} else if lower <= 1 {
```
What's this path for? It seems to me that its only purpose is to eliminate a possible `None` result early, yet it doesn't always do that. Does it improve performance?
My goal was to ensure that the new code reduces down to the old code once the compiler does constant folding on an "unhinted" iterator, and similarly that constant folding on a perfectly hinted iterator reduces it to `SliceRandom::choose`. I.e. the first two categories discussed in the initial comment should reduce to optimal code once constant folding is done.

That's why both of these paths are there. Both could be removed without losing any correctness; in both cases we just end up adding a few branches to the runtime code.

(The difference for "perfectly hinted" is slightly bigger for iterators of size exactly 1, but that's a rare use case anyway.)

We definitely could remove these paths. I don't have a strong opinion either way. The performance difference will be small and will mainly depend on how slow other operations are, i.e. how many items the "unhinted" iterator has, and how fast the `RngCore` implementation in use is.
Aha, makes some sense, though I still don't really understand the reason to use `lower <= 1` here. You could simply drop the condition (equivalent of `true`) and would get the same behaviour on unhinted iterators.

My gut feeling is that the cases worth bothering with are (a) perfectly hinted iterators, (b) iterators where some lower bound is known, and (c) iterators with no hinting. So you could potentially insert a `lower > 0` case (instead of the loop case) and use this code for the `lower == 0` case, but I think only benchmarks can tell which one is best.

Are chunked iterators common? It's not a pattern I remember seeing with iterators, but I guess it may be relevant to buffered input.
> Aha, makes some sense, though I still don't really understand the reason to use `lower <= 1` here

When `lower > 1` it is better to use `gen_range` to pick one of the first `lower` items than to unconditionally grab the first item and then iterate.

> Are chunked iterators common?

No idea.

As mentioned in other comments, the idea is that for all other cases the code will reduce to the optimal approach. So the cost of supporting chunked iterators is only one of source complexity, not one of runtime cost or binary size.
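To make the chunked idea concrete, here is a rough, self-contained sketch (not the PR's actual code) of reservoir sampling that consumes a whole guaranteed chunk of `lower` items with a single `gen_range` draw, falling back to per-item draws when `lower <= 1`. `DemoRng` and `choose_hinted` are hypothetical names for illustration:

```rust
// Hypothetical minimal RNG standing in for rand's `Rng`.
struct DemoRng(u64);

impl DemoRng {
    // Simplified, slightly biased stand-in for `rng.gen_range(0, n)`.
    fn gen_range(&mut self, low: usize, high: usize) -> usize {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1);
        low + (self.0 >> 33) as usize % (high - low)
    }
}

// Sketch: uniform choice driven by repeated size_hint consultation.
fn choose_hinted<I: Iterator>(mut iter: I, rng: &mut DemoRng) -> Option<I::Item> {
    let mut result = None;
    let mut consumed: usize = 0;
    loop {
        let (lower, _) = iter.size_hint();
        if lower > 1 {
            // A chunk of `lower` guaranteed items: draw one index over
            // everything seen so far plus the chunk. If it lands in the
            // chunk, take that item; otherwise keep the old candidate.
            let ix = rng.gen_range(0, consumed + lower);
            if ix >= consumed {
                result = iter.nth(ix - consumed);
                let rest = lower - (ix - consumed) - 1;
                if rest > 0 {
                    iter.nth(rest - 1); // skip the remainder of the chunk
                }
            } else if iter.nth(lower - 1).is_none() {
                return result; // defensive: hint promised more items
            }
            consumed += lower;
        } else {
            // No useful lower bound: classic one-at-a-time reservoir.
            match iter.next() {
                None => return result,
                Some(item) => {
                    consumed += 1;
                    // Keep the new item with probability 1/consumed.
                    if rng.gen_range(0, consumed) == 0 {
                        result = Some(item);
                    }
                }
            }
        }
    }
}

fn main() {
    let mut rng = DemoRng(7);
    // Exactly-hinted input takes the chunk path.
    let x = choose_hinted(0..10, &mut rng).unwrap();
    assert!(x < 10);
    // Unhinted input (filter) takes the per-item path.
    let y = choose_hinted((0..10).filter(|&n| n % 2 == 0), &mut rng).unwrap();
    assert!(y % 2 == 0);
    // Empty input yields None.
    assert_eq!(choose_hinted(0..0, &mut rng), None);
}
```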
Removing this path improves benchmarks slightly. The only functional difference from the second path within the loop is that the `gen_bool` bit is skipped. The `Bernoulli` code actually has special handling for the `p == 1.0` case anyway and doesn't touch the RNG.

Simpler code and faster benchmarks implies we're probably better off without this code?
I don't have a strong preference either way. Generally it's a good idea to be careful about optimizing too much based on microbenchmarks alone, since they often don't reflect real-world performance.

In theory this code should help "unhinted" iterators, though, as you point out, only by very little since it mainly saves a few branches. In neither case will we call into the RNG to generate numbers. So yeah, probably worth removing since the win seems quite small.

I can see two reasons why removing this code would speed anything up:

- We might end up inlining things better since the function is smaller.
- The "window hinted" test might get faster since we do one branch fewer. However this benchmark seems less important to me than the "unhinted" and fully hinted ones.

Which benchmarks specifically got faster with this removed?
The unhinted one:

```
test seq_iter_choose_from_1000               ... bench:  3,756 ns/iter (+/- 108) = 2129 MB/s
test seq_iter_unhinted_choose_from_1000      ... bench:  5,311 ns/iter (+/- 98)
test seq_iter_window_hinted_choose_from_1000 ... bench:  1,789 ns/iter (+/- 44)

# after removing this branch:
test seq_iter_choose_from_1000               ... bench:  3,746 ns/iter (+/- 60) = 2135 MB/s
test seq_iter_unhinted_choose_from_1000      ... bench:  5,007 ns/iter (+/- 106)
test seq_iter_window_hinted_choose_from_1000 ... bench:  1,769 ns/iter (+/- 42)
```
I'm not sure I understand what is being asked here. I agree that most iterators will either have a known length or a completely unknown one; those are the first two categories in the initial comment above. For iterators with a known length, […]. For iterators with an unknown length, […]. The only two cases where we'll have multiple meaningful calls to […]:

```rust
fn myfunc(i: &mut dyn Iterator<Item = usize>) -> usize {
    i.choose(&mut thread_rng()).unwrap()
}
```

Performance in this case is going to be worse than with the current code. But I hope this is a very rare pattern, especially given how easy Rust makes it to avoid dynamic dispatch.

We could go the alternative route mentioned in the initial comment. That'll ensure that we never call […].
Ah, that's the design rationale I was missing 👍
This was likely my last actual code contribution to rand, so I especially appreciate the quick review and merge on this one! I also added #594, but that's just documentation changes so I'm not counting that one 😄

It's been great to get to contribute to rand, so I wanted to give thanks to @dhardy and @pitdicker for your time in discussions and doing reviews. Best of luck getting to 1.0. Looking forward to seeing what it'll look like.
Thank you so much for the contributions too. There have been a few differences of opinion but also a lot of high quality code! Without volunteers like yourself this project would probably still be sitting at 0.3 with only minor patches landing 😄
This fixes #511.

I've approached hinting as three different categories:

1. Unhinted iterators, where `size_hint` returns `(0, None)`, i.e. ones that use the default implementation of `size_hint`.
2. Perfectly hinted iterators, where `size_hint` always returns `(N, Some(N))`.
3. Everything else, i.e. iterators which provide only partial hinting.
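To make the categories concrete, here is an illustrative (not from the PR) demonstration of the `size_hint` values each category produces, using a hypothetical custom iterator for the unhinted case:

```rust
// A custom iterator that keeps the default size_hint, i.e. (0, None).
struct Countdown(u32);

impl Iterator for Countdown {
    type Item = u32;
    fn next(&mut self) -> Option<u32> {
        if self.0 == 0 {
            None
        } else {
            self.0 -= 1;
            Some(self.0)
        }
    }
    // No size_hint override: the default implementation returns (0, None).
}

fn main() {
    // Category 1: unhinted (default implementation).
    assert_eq!(Countdown(5).size_hint(), (0, None));

    // Category 2: perfectly hinted, (N, Some(N)).
    assert_eq!((0..7).size_hint(), (7, Some(7)));

    // Category 3: partial hinting, e.g. an upper bound only via `take`.
    assert_eq!(Countdown(5).take(3).size_hint(), (0, Some(3)));
}
```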
I suspect the first two categories are most common, and in theory the compiler should be able to optimize these two categories to an optimal implementation with this code.

Unfortunately I'm seeing about a 10% slowdown compared to the existing code for iterators that fall into the first category. There's no inherent reason for this slowdown; it just appears the optimizer isn't doing as good a job as it does on the current code.

For the second category I'm getting optimal performance, i.e. `vec.iter().choose(&mut r)` is as performant as `vec.choose(&mut r)`.

An alternative would be to not worry about the third category and just do: […]

With this I'm just seeing a 5% slowdown for unhinted iterators. Again, there's no reason the optimizer shouldn't be able to optimize away this slowdown, but it's not doing so for me.

Note that both the 10% and the 5% are within the `cargo bench`-reported noise.
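The snippet elided after "just do:" above is not recoverable from this page, but based on the surrounding discussion a simplified variant that ignores the third category would presumably consult `size_hint` once and otherwise fall back to plain reservoir sampling. A hypothetical sketch (with `DemoRng` standing in for a real rand RNG; names and details are assumptions):

```rust
// Hypothetical minimal RNG standing in for rand's `Rng`.
struct DemoRng(u64);

impl DemoRng {
    // Simplified, slightly biased stand-in for `rng.gen_range(0, n)`.
    fn gen_range(&mut self, low: usize, high: usize) -> usize {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1);
        low + (self.0 >> 33) as usize % (high - low)
    }
}

// Sketch of the "ignore the third category" variant: one size_hint
// check up front; exact hints take the fast path, everything else
// uses one-at-a-time reservoir sampling.
fn choose_simple<I: Iterator>(mut iter: I, rng: &mut DemoRng) -> Option<I::Item> {
    let (lower, upper) = iter.size_hint();
    if upper == Some(lower) {
        return if lower == 0 { None } else { iter.nth(rng.gen_range(0, lower)) };
    }
    let mut result = None;
    let mut consumed = 0;
    for item in iter {
        consumed += 1;
        // Keep the new item with probability 1/consumed.
        if rng.gen_range(0, consumed) == 0 {
            result = Some(item);
        }
    }
    result
}

fn main() {
    let mut rng = DemoRng(3);
    // Perfectly hinted: fast path.
    assert!(matches!(choose_simple(0..100, &mut rng), Some(n) if n < 100));
    // Unhinted: reservoir path.
    let odd = choose_simple((0..20).filter(|&n| n % 2 == 1), &mut rng);
    assert!(matches!(odd, Some(n) if n % 2 == 1));
    // Empty input yields None.
    assert_eq!(choose_simple(0..0, &mut rng), None);
}
```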