-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RngCore::bytes_per_round #396
Conversation
48d4d1e
to
1745287
Compare
/// RNG can generate without throwing away part of the generated value. | ||
/// | ||
/// `bytes_per_round` has a default implementation that returns `4` (bytes). | ||
fn bytes_per_round(&self) -> usize { 4 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be an associated constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried something like that in a not really thought through attempt: #377 (comment). The problem is that we then can't make RngCore
into a trait object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird; doesn't sound like associated constants should prevent a trait from becoming object-safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error[E0038]: the trait `rand_core::RngCore` cannot be made into an object
--> src\lib.rs:1189:21
|
1189 | let mut r = Box::new(rng) as Box<RngCore>;
| ^^^^^^^^^^^^^ the trait `rand_core::RngCore` cannot be made into an object
|
= note: the trait cannot contain associated consts like `BYTES_PER_ROUND`
= note: required because of the requirements on the impl of `std::ops::CoerceUnsized<std::boxed::Box<rand_core::RngCore>>` for `std::boxed::Box<test::TestRng<StdRng>>`
This makes me wonder whether we should make |
I was curious about this to for the last couple of days, and tried that before adding Edit: on reading your comment again you meant something else. But the idea is the same: what would happen in |
I like this idea a lot! I think `bytes_per_round` can be generalized to
represent the optimal number of bytes per round. For some RNGs this will be
`usize::MAX`. This should probably be the default value.
|
Ah, great 😄.
Than we have a different idea in mind of how the value works. Can you explain it a bit more? Then I'll try my thoughts: For example: With Xoroshiro128+ But it is difficult to describe the performance trade-offs with a single value... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting idea... but using this to optimise large generators vs the next_u32
/ next_u64
we use for smaller ones feels a strange mixture of methods. Still it may be the best compromise.
Do you think there will be many other uses of bytes_per_round
?
What if bytes_per_round
cannot be evaluated at compile time (e.g. if R
is unsized therefore cannot be inlined)?
/// RNG can generate without throwing away part of the generated value. | ||
/// | ||
/// `bytes_per_round` has a default implementation that returns `4` (bytes). | ||
fn bytes_per_round(&self) -> usize { 4 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird; doesn't sound like associated constants should prevent a trait from becoming object-safe.
/// RNG can generate without throwing away part of the generated value. | ||
/// | ||
/// `bytes_per_round` has a default implementation that returns `4` (bytes). | ||
fn bytes_per_round(&self) -> usize { 4 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we really give a default implementation here?
Did you forget to implement for Jitter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not sure. A disadvantage with a default implementation is that you can easily forget a wrapper.
Did you forget to implement for Jitter?
JitterRng::next_u32()
is about twice as fast as next_u64()
, so 4 bytes would be the best fit there.
|
||
// Implement `RngCore` for references to an `RngCore`. | ||
// Force inlining all functions, so that it is up to the `RngCore` | ||
// implementation and the optimizer to decide on inlining. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually if R
is unsized these cannot be inlined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, true.
But before fill_bytes
was basically never inlined because we always use the RNG through this implementation, through a reference (and for some reason LLVM really does not like our abstractions...). So now we can at least control things when it is not a trait object.
let b_ptr = &mut *(ptr as *mut u128 as *mut [u8; 16]); | ||
rng.fill_bytes(b_ptr); | ||
} | ||
val.to_le() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to see benchmarks for this on a BE platform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sometimes just change things to to_be()
and measure on x86_64. But it really only starts to show improvements for things like my SIMD experiment (twice as fast there).
In the case of Overall I wonder if you're over-estimating the utility of the extra method? Especially because it cannot always be evaluated at compile time I'm less convinced. |
The method seems fine but it cannot be a constant. For hardware generators
it may be necessary to talk to hardware before figuring out its best
performing (whichever of lowest latency or greatest throughput criteria is
used) block size.
…On Fri, Apr 13, 2018, 12:07 Paul Dicker ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In src/distributions/integer.rs
<#396 (comment)>
:
> - fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> i128 {
- rng.gen::<u128>() as i128
+ fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> u128 {
+ if rng.bytes_per_round() < 128 {
+ // Use LE; we explicitly generate one value before the next.
+ let x = rng.next_u64() as u128;
+ let y = rng.next_u64() as u128;
+ (y << 64) | x
+ } else {
+ let mut val = 0u128;
+ unsafe {
+ let ptr = &mut val;
+ let b_ptr = &mut *(ptr as *mut u128 as *mut [u8; 16]);
+ rng.fill_bytes(b_ptr);
+ }
+ val.to_le()
I sometimes just change things to to_be() and measure on x86_64. But it
really only starts to show improvements for things like my SIMD experiment
(twice as fast there).
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#396 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AApc0uPL4I9UTLRqdITE8jVjeI6y1bPsks5toGrngaJpZM4TSUO->
.
|
Maybe, hard to say without using it in various situations. But I think it can be pretty useful.
I thought about this too, especially in combination with unsized RNGs. But even if it can't be done at compile time, the branch predictor should be able to figure things out after a few rounds I think? |
I suppose so, but there's still an extra operation involved. It should be easy enough to add a few benchmarks using |
Benchmarked with HC-128 as a trait object. It normally performs at 78% of the normal speed, and with the change at 75%.
|
That's better than I expected. And what about a small generator like Xorshift? |
Also better than I expected. Xorshift a little less so:
After a few more days you get a bit more objective 😄.
So we could do fine without this method, but I believe it can enable optimizations that would otherwise not be possible. |
A bit more objective maybe — but I haven't had much time the last few days (travelling)!
So I'm still unconvinced on this one. |
I would like to see the first commit here in rand_core 0.1, but don't mind if the rest follows later, or maybe never. |
The first commit looks fine if you want to merge that separately. |
I'm going to close this PR, at least for now. |
If you like. I don't mind it remaining as a discussion item for now, though it is tidier to close. |
From the doc comment:
I have thought quite a few times over the last couple of months: It would be great to know if this is a 32 or 64-bit RNG. Then I could implement this algorithm more optimally. Now, when playing with SIMD, this became even more visible.
I added one example in this PR: generating an
u128
. For most RNGs the current method of combining tou64
s is optimal. But forOsRng
and SIMD RNGs it would be twice as fast two usefill_bytes
. Now it can make the choice.The same is true for
HighPrecision01
. The implementation forf32
now always usesnext_u32
. With a 64-bit RNG it always throws away half the generated bits. And when it finds out it needs more, generates another 32. It could easily be made more optimal, if only we knew whether the RNG is best at generating 32 or 64 bit at a time.And a bit more controversial:
gen_bool
now uses 32 bits to make its decision on. Fast and usually good enough. Using 64 bits could halve the performance for many RNGs. But if the RNG produces 64 bits at a time (and throws away half of it), it could just as well use them to increase precision at no extra cost.