Add rayon support #399
Conversation
That is, assuming we only have one fixed-size jump available. With a different scheme it could be possible. On the first split, we would jump the clone half a period. On the two second-level splits we jump the clones 1/4th of the period. On the four third-level splits the clones jump 1/8th of the period, etc. Still there is the disadvantage jumping has: for many LFSR RNGs it needs as many rounds of the RNG to do the jump as it has bits of state. Xorshift with 128 bits of state would need 128 rounds to calculate the jump. We had some discussion around streams and jumping/seeking in the RNG core RFC thread. Back then the conclusion was: probably not worth exposing, mostly useful to decrease the birthday problem when seeding RNGs. Now I think we are starting to see how they may be practical. We could add something like a split method to SeedableRng:

pub trait SeedableRng: Sized {
    /* ... */
    fn split(&mut self, _stream_id: u64) -> Self where Self: RngCore {
        // Alternative for custom implementations: use streams or jumping
        Self::from_rng(self).unwrap()
    }
}

Now PRNGs can choose how they want to be split: by seeding from itself, selecting another stream, or jumping. And it is possible with wrapper types to override, by forwarding all methods except |
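The halving-jump scheme described above can be sanity-checked with plain integer arithmetic. Below is a standalone sketch, not code from this PR: `PERIOD` is a hypothetical power-of-two period (real LFSR periods such as 2^128 − 1 are not), and the offsets stand in for positions in the RNG's output stream. It verifies that recursive splitting gives every leaf generator a distinct starting offset.

```rust
// Model of the recursive jump scheme: at split level `l` (0-based) the clone
// jumps `PERIOD / 2^(l+1)` ahead while the original keeps its position.
// `PERIOD` is a stand-in; this only checks the offset arithmetic.

const PERIOD: u64 = 1 << 32; // hypothetical power-of-two period

fn split_offsets(offset: u64, level: u32, max_level: u32, out: &mut Vec<u64>) {
    if level == max_level {
        out.push(offset);
        return;
    }
    let jump = PERIOD >> (level + 1);
    split_offsets(offset, level + 1, max_level, out); // original keeps its place
    split_offsets((offset + jump) % PERIOD, level + 1, max_level, out); // clone jumps
}

fn main() {
    let mut offsets = Vec::new();
    split_offsets(0, 0, 5, &mut offsets); // full binary tree, 2^5 = 32 leaves
    let n = offsets.len();
    offsets.sort();
    offsets.dedup();
    assert_eq!(offsets.len(), n); // every sub-generator starts at a unique offset
    println!("{} distinct offsets", n);
}
```

With a power-of-two period the leaves land exactly on the multiples of `PERIOD / 2^depth`; for a real period the offsets are merely distinct rather than evenly spaced.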
As I commented in your branch, we don't necessarily want the same PRNG algorithm for the sub-generators as for the source, because there is an extra requirement: that the sub-generators can be initialised fast (and probably also that they have low memory usage). |
Quite a cool idea, but also quite heavy (in terms of extra code, and all the extra PRNG initialisation), which makes me wonder how often this is useful over simply using thread_rng as suggested in #398 (obviously this is useless for reproducibility, thus the only possible advantage over thread_rng is performance).
///
/// [`Distribution`]: trait.Distribution.html
/// [`sample_iter`]: trait.Distribution.html#method.sample_iter
#[cfg(feature = "rayon")]
Unnecessary `cfg`, since the whole module is behind the feature flag.
fn split_at(mut self, index: usize) -> (Self, Self) {
    assert!(index <= self.amount);
    // Create a new PRNG of the same type, by seeding it with this PRNG.
    // `from_rng` should never fail.
Rather than say this in a comment, just do something like unwrap_or_else(|e| panic!("parallel distribution iterator failed to create new RNG: {}", e)). There's no contract saying from_rng can't panic; it's just unlikely.
I think we should document that ideally from_rng should only forward errors from the RNG used for seeding, and is not supposed to produce errors itself. But that is another issue...
That was the original idea, but we can't enforce a contract like that and I don't see any real point trying. Then again, there appear to be no notes on error handling on that function at all so far.
/// for simulations etc. While it will also work with cryptographic RNGs, that
/// is not optimal.
///
/// Every time `rayon` splits the work in two to create parallel tasks, one new
It seems like these warnings should be on sample_par_iter too/instead?
Yes, definitely. Actually this PR was not ready for review, just to have some code to look at to help exploring the issue 😄.
fn min_len(&self) -> usize {
    100
}
I wonder if we can make this configurable with a default?
Rayon already makes it configurable, but this sets a lower bound. Not sure what would be a good value yet, if we should set one at all. I don't know the internals of Rayon well enough. I'd like to ask for review from the contributors there when this is more ready.
}
}

/// FIXME
Fix what?
It does not compile without doc comments. Maybe a nice place to explain why we sample an amount. I believe we can't just keep generating values until the user decides it has enough; Rayon seems to (reasonably) not support much functionality when the number of values is unknown.
Another thing we could try is adding an equivalent of |
This sounds right in theory, but I don't see a way to pass around two RNGs without increasing the problems.
Yes, performance is the motivation here, and having the choice to easily use different algorithms. That results are not reproducible is an important issue to mention in the documentation. |
Yes, certainly. It does not give me some kind of intellectual satisfaction 😄. But if it is practical and faster, ... We can also offer both methods. Using It seems nice to me that this would finally be a use for all the theory behind RNGs with streams or jumping. I wonder if Rayon is designed to deal with situations like ours, where every split of the job is not a basically free operation. |
Sorry, I didn't understand how Rayon works. If we're dividing a master runner as needed this idea won't work.
Only thing is: can you somehow ensure that each thread gets a unique stream/position? |
On initialization we set |
I still need to dig through the questions in rayon-rs/rayon#566, but a quick note of caution here first: rayon-core depends on rand 0.4. If rand 0.5 depends on rayon, then I think we won't be able to ever update rayon to that because of a cyclic dependency! (this PR: rand-0.5 -> rayon -> rayon-core -> rand-0.4) |
Good point, definitely don't want that! I did not expect rayon to need some randomness somewhere 😄. Would it be okay for If I see it right, you only really need rand in But |
It could instead make sense for Rayon to provide support for Rand's distributions, I don't know. There's also #290 — we could move all distribution stuff outside Rand itself, though I don't think this is so likely. |
Came up with a simpler scheme that supports streams and jumping, and where

fn split_at(mut self, index: usize) -> (Self, Self) {
    let new = DistProducer {
        distr: self.distr,
        amount: self.amount - index,
        rng: R::split_from(&mut self.rng, self.split_level),
        split_level: self.split_level + 1,
        phantom: ::core::marker::PhantomData,
    };
    self.amount = index;
    self.split_level += 1;
    (self, new)
}
// PRNG supporting streams:
pub trait SeedableRng: Sized {
/* ... */
fn split_from(&mut self, split_level: usize) -> Self {
let mut new = self.clone();
// XOR the stream with bit nr `split_level`.
new.stream ^= (1 << split_level); // shift by one more if `stream` must be odd
new
}
}
// PRNG supporting jumps:
pub trait SeedableRng: Sized {
/* ... */
fn split_from(&mut self, split_level: usize) -> Self {
let mut new = self.clone();
// Jump by smaller amounts if we are at a deeper `split_level`.
new.jump(PERIOD / (2 << split_level));
new
}
}
// PRNG not supporting either
pub trait SeedableRng: Sized {
/* ... */
fn split_from(&mut self, split_level: usize) -> Self {
let mut new = Self::from_rng(self).unwrap();
// Adding `split_level` to the state may improve things a bit
new.state = new.state.wrapping_add(split_level as u64 + 1); // assuming a 64-bit state
new
}
} |
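The fallback variant (seeding from the parent) can be illustrated with a toy generator. This is a self-contained sketch; `XorShift64` and its methods are made up here for illustration and are not the rand crate's API.

```rust
// Toy xorshift64 with a `split_from` in the spirit of the fallback variant
// above: the child is seeded from the parent's next output, with `split_level`
// mixed in so sibling splits get different seeds. Illustrative only.

#[derive(Clone)]
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        XorShift64 { state: seed.max(1) } // state must be non-zero
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Fallback split: seed the child from the parent's output and mix in
    /// `split_level` so repeated splits at the same point still differ.
    fn split_from(&mut self, split_level: u64) -> Self {
        let seed = self.next_u64().wrapping_add(split_level + 1);
        XorShift64::new(seed)
    }
}

fn main() {
    let mut parent = XorShift64::new(0x2545_F491_4F6C_DD1D);
    let mut child = parent.split_from(0);
    let a: Vec<u64> = (0..4).map(|_| parent.next_u64()).collect();
    let b: Vec<u64> = (0..4).map(|_| child.next_u64()).collect();
    assert_ne!(a, b); // parent and child produce different streams
}
```

Since splitting also advances the parent, parent and child never share a state, so their subsequent outputs diverge immediately.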
And what should the default implementation be? Maybe a variant on the last:

fn split_from(&mut self, split_level: usize) -> Self {
    let mut seed = Self::Seed::default();
    self.fill_bytes(&mut seed);
    // somehow do cascading wrapping add to seed (cast to `[usize]` on x86?)
    Self::from_seed(seed)
}

I'm not sure; this is a bit niche but does have a certain utility. For use with Rayon it obviously doesn't need to be reproducible. An alternative could be to have Or maybe it just makes more sense using thread-local RNGs... |
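One possible reading of the "cascading wrapping add" idea, written as a standalone sketch (the helper name `add_to_seed` is hypothetical): treat the seed as little-endian bytes and add `split_level + 1` with carry propagation, so even identical seed bytes end up different at different split levels.

```rust
// Hypothetical helper for the default `split_from` sketched above: add
// `split_level + 1` into the seed bytes, little-endian, carrying overflow
// into the next byte.

fn add_to_seed(seed: &mut [u8], split_level: usize) {
    let mut carry = (split_level as u64).wrapping_add(1);
    for byte in seed.iter_mut() {
        let sum = *byte as u64 + (carry & 0xff);
        *byte = sum as u8;
        carry = (carry >> 8) + (sum >> 8); // propagate overflow to the next byte
    }
}

fn main() {
    let mut a = [0u8; 16];
    let mut b = [0u8; 16];
    add_to_seed(&mut a, 0); // level 0 adds 1
    add_to_seed(&mut b, 1); // level 1 adds 2
    assert_ne!(a, b); // identical seed bytes still diverge per split level
}
```

The carry walks through the whole seed, so large `split_level` values affect more than just the first byte.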
I feel like this should be implemented in rayon. I think they could even make it reproducible (with some performance cost). |
/// `SeedableRng::from_rng`. **Important**: Not all RNG algorithms support this!
/// Notably the low-quality plain Xorshift, the current default for `SmallRng`,
/// will simply clone itself using this method instead of seeding the split off
/// RNG well. Consider using something like PCG or Xoroshiro128+.
I think it should be possible to use different RNGs for the seeding and the sampling.
Seems like a good default to me. If we use
I am not so sure yet it is impossible to be reproducible. If you could control how often the job gets split, the worst that can happen is that the jobs execute in different order. But all the values would be the same.
I thought about requiring a different trait than A trait to support working with streams and a trait to support seeking can both be interesting. In a way they would be more low-level than what I proposed here. They would need quite some design work, and would probably be less easy to use. For streams you would want to have both a getter and setter, and figure out how to make good use of streams yourself. Supported values may be specific to the RNG. Seeking is more difficult, with some RNGs supporting absolute seeking, others only relative. Do you want to emulate the ability to seek backwards (by wrapping around the RNG's period)? Seeking by a variable amount should be possible for most RNGs, but can be very slow. For some RNGs you may want to cache the intermediate polynomial. And should seeking by some powers of two, or fractions of the period, need a special function? And I don't even know if it is always possible to calculate the difference between two points in a random number stream. But you may also want to expose a function to get that too. Even if we someday have traits for both functionalities, I think something simple like the
I assure you: I won't forget that option. I am not pushing this solution, just exploring how plausible it would be as an alternative. Because I think there are cases where we can do much better than using
Something to keep in mind. Yet this is mostly a problem related to how RNGs work.
@dhardy had the same comment, but I don't see a way to make that work. |
Yes, but I think the thread pool should manage the RNGs (see rayon-rs/rayon#566), which can only be implemented in rayon. |
Made a benchmark to compare the alternatives pitdicker@64c6985:
A few things to note. Both sequential methods do not optimize well. Xorshift should be able to do ~4000 MB/s on my machine. But this is kind of a known problem with iterators, not sure how to improve that... Using rayon and with The two parallel methods use the approach from this PR. I am not sure what is going on that makes them not suffer from the bad performance of the normal iterator methods. So something to figure out before drawing any conclusions. |
Yes, interesting, but as you say, a 10-fold speedup is more than expected, so something odd is going on. |
But in this case the trouble is with the normal iterator code; the parallel results are roughly as expected. You also investigated this before, in #275 (comment). |
Adjusted the benchmark to generate 100,000
I have inspected the results with perf. The parallel method in this PR with jumping overhead is 3.8× as fast as the fastest sequential version. For both the The The method using a thread_local Xorshift spends 25~30% of its time retrieving the RNG from thread-local storage (because that has to happen on every iteration). A bit more than 50% is spent in the actual loop, which is not as optimized as in the other variants. Not sure why. |
Simply implementing |
How many sub-generators are used by the Tight loops can benefit from unrolling but of course that can't happen when the same RNG gets used in each iteration. It would be interesting to try optimising with a native 128-bit generator or even a |
Not that many. In a previous test 8, could be a bit more in this benchmark. According to rayon-rs/rayon#566 (comment) the number of cores (I have 4), plus a few more near the end when it starts to steal work.
Not sure what you mean here. I didn't see any loop unrolling with Xorshift. But what I mostly looked for is if LLVM was able to cut through all the abstractions. And it is, leaving a simple loop with the by now familiar Xorshift instructions and float conversion code, basically the same instructions as in the |
I tried another approach, having I plan to do another test with a parallel iterator on I think it is more useful than the The problem from #398 would then hopefully look something like this:

fn normal(mu: f64, sigma: &Vec<f64>) -> Vec<f64> {
    let mut rng = SmallRng::new();
    rng.par_iter_map(|rng| {
        let mut g = Normal::new(mu, sigma[i]);
        g.sample(&mut rng)
    }).take(sigma.len()).collect()
}

Edit: it would also have to iterate over the |
I wrote a comment about how I now think going with a simple wrapper that splits an RNG on clone is the right approach, in combination with Advantages are:
I do like the part sketched here of adding some kind of As there is basically no code shared between that approach and the one in this PR, I am going to close this one. Thanks for all the good discussion 👍 and expect a new PR soon(ish). |
Hi, I'm interested in this exact problem, and I'm wondering what the current state-of-the-art is. In a toy simulation I wrote, I found that calling The easiest way to stop calling In short, I have this code:
I would like to write this code, which is 17% faster, almost entirely because it doesn't call rand::thread_rng() inside of
But IIUC, the second program is buggy, as each chunk is going to be seeded with an identical Rng. |
Oh, I think I could just use
|
A clone should be identical, right? You almost never want to clone an RNG, aside from certain tests. You could instead construct a new RNG, possibly from a seed passed into I'm a bit confused about that last example: I would imagine something more like this:

fn fast_sim(trials: usize) {
let sum = (0..trials)
.into_par_iter()
.map_init(|| StdRng::from_seed(rand::random()), |rng, _| do_one(rng))
.reduce(|| 0, |a, b| a + b);
}
fn do_one(rng: &mut StdRng) -> usize {
// ...
} Like this you can now replace If you want deterministic, you would have to do something like pass the same fixed
Since it's parametrised with constants, most of the set-up should effectively happen at compile time. Not that it's very complex anyway. |
I agree. Clone should give an identical value back. It's doing the correct thing.
I played around in the playground trying to create a deterministic threaded simulation. I think we can do what we need to do in map_init to deterministically construct a seed, but the issue is that the number of jobs Rayon will spawn, and the number of individual work items allocated to each Rayon batch, are not deterministic, due to work stealing. For example, this code outputs similar values every time, but if an item of work is stolen onto a different thread (and thus a different seed), it will produce something different: I think it's not currently possible to write a deterministically reproducible Monte Carlo sim using rayon. |
To make it deterministic, you should use a single seed (instead of constructing a new seed in
Some other generators support multiple streams somehow. A little warning: PCG has been criticized in this regard since its streams are more related than might be expected. |
I'm not sure if I follow you. I can't construct an rng one time and mutably share it between the par_iter threads. I can construct an rng one time per batch from an identical seed and choose the stream based on a work number, but for something like xoshiro, I'd have to call I tried to adapt the pi estimator example for rayon using chacha
I found that I have a pull request for this as an example, but of course I want to make sure it's correct, e.g., that every work item is generating different samples from the distribution. |
You're using the same RNG with the same
Because you are re-using the RNG across work units in non-deterministic order, and the "word position" is retained from one work unit to the next. Try this. The result appears to be deterministic, as expected.

use rand::distributions::{Distribution, Uniform};
use rand_chacha::rand_core::SeedableRng;
use rand_chacha::ChaCha8Rng;
use rayon::prelude::*;
static SEED: u64 = 0;
fn main() {
let range = Uniform::new(-1.0f64, 1.0);
let total = 1_000_000;
let in_circle: usize = (0..total)
.into_par_iter()
.map(|i| {
let mut rng = ChaCha8Rng::seed_from_u64(SEED);
rng.set_stream(i);
let a = range.sample(&mut rng);
let b = range.sample(&mut rng);
if a * a + b * b <= 1.0 {
1
} else {
0
}
})
.reduce(|| 0, |a, b| a + b);
// prints something close to 3.14159...
println!(
"π is approximately {}",
4. * (in_circle as f64) / (total as f64)
);
} |
Thanks for looking! I will update my PR to use this approach, and I don't mind and fully release this code for any purpose.
If you use the same code pattern and just vary the number of cores, it's expected to get a speedup factor close to your core count. The trickiness here is that you can use a different code pattern when running ST or when running MT but dropping determinism. No one would write the ST version of this program by building the Rng in the hot loop. I think there are only three plausible programs to consider:

#![cfg(all(feature = "std", feature = "std_rng"))]
use rand::distributions::{Distribution, Uniform};
use rand_chacha::{rand_core::SeedableRng, ChaCha8Rng};
use rayon::prelude::*;
static SEED: u64 = 0;
static TRIALS: u64 = 10_000_000;
fn main() {
time("deterministic st", deterministic_st);
time("nondeterministic mt", nondeterministic_mt);
time("deterministic mt", deterministic_mt);
}
fn time(name: &str, cb: fn(u64) -> f64) {
let start = std::time::Instant::now();
let answer = cb(TRIALS);
let end = start.elapsed();
println!("\n{}:", name);
println!("\telapsed: {:?}", end);
println!("\tπ approximation: {}", answer);
}
fn deterministic_st(total: u64) -> f64 {
let range = Uniform::new(-1.0f64, 1.0);
let mut rng = ChaCha8Rng::seed_from_u64(SEED);
let mut sum = 0u64;
for _ in 0..total {
let a = range.sample(&mut rng);
let b = range.sample(&mut rng);
if a * a + b * b <= 1.0 { sum += 1 }
}
4. * (sum as f64) / (total as f64)
}
fn nondeterministic_mt(total: u64) -> f64 {
let range = Uniform::new(-1.0f64, 1.0);
let sum = (0..total).into_par_iter().map_init(
|| ChaCha8Rng::from_entropy(),
|rng, _| {
let a = range.sample(rng);
let b = range.sample(rng);
if a * a + b * b <= 1.0 { 1 } else { 0 }
},
).reduce(|| 0, |a, b| a + b);
4. * (sum as f64) / (total as f64)
}
fn deterministic_mt(total: u64) -> f64 {
let range = Uniform::new(-1.0f64, 1.0);
let sum = (0..total)
.into_par_iter()
.map(|i| {
let mut rng = ChaCha8Rng::seed_from_u64(SEED);
rng.set_stream(i);
let a = range.sample(&mut rng);
let b = range.sample(&mut rng);
if a * a + b * b <= 1.0 {
1
} else {
0
}
})
.reduce(|| 0, |a, b| a + b);
4. * (sum as f64) / (total as f64)
}

Here's what I get when I run this with
So what we see is that getting determinism back into the mt version increases runtime by 7x using this pattern. I would say it's still not good, even though it does barely beat the st version. Do you get similar numbers? |
It's worse actually:
The point is, for determinism, you must fix the RNG to the work unit. The trick then is to make sure your work unit is large enough that the RNG initialisation isn't significant. And for this toy example, I don't think Rayon alone cuts it. This version actually beats your nondeterministic code:

fn deterministic_mt(total: u64) -> f64 {
let range = Uniform::new(-1.0f64, 1.0);
const JOB_SIZE: u64 = 1000;
let sum = (0..(total / JOB_SIZE))
.into_par_iter()
.map(|i| {
let mut rng = ChaCha8Rng::seed_from_u64(SEED);
rng.set_stream(i);
let mut count = 0;
for _ in 0..JOB_SIZE {
let a = range.sample(&mut rng);
let b = range.sample(&mut rng);
if a * a + b * b <= 1.0 {
count += 1;
}
}
count
})
.reduce(|| 0, |a, b| a + b);
4. * (sum as f64) / (total as f64)
} |
Awesome! That's what we should show as the example. |
I opened this PR for discussion, but it is not ready for merging. Above all it needs better documentation.

The idea is to create a parallel iterator that works with Rayon.

The difficult part is where to store/get the state of the RNG. Using thread_rng() with every sample is one possible solution. Then one RNG is initialized for every thread Rayon uses. This PR instead creates one new RNG at every point Rayon splits the job in two. And that can be quite a few times. How should this new RNG be created?

1. FromEntropy, but that would be too expensive for every split.
2. thread_rng, but that would require the initialization of one ThreadRng for every thread Rayon uses, which could also be expensive.
3. Seed from the current RNG, possibly selecting a different stream via a stream_id.

I think option 3 is the fastest, and a good solution for the scenarios where you also want to use rayon. But if we can somehow be flexible, that would be nicer.