-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate SIMD support #377
Comments
I’ve actually been playing with this a lot. Let me see about putting something worth scrutiny on github.
…Sent from my mobile device
On Apr 5, 2018, at 12:58, Paul Dicker ***@***.***> wrote:
Disclaimer: I don't really know what I am talking about here.
The first step, as far as Rand is concerned, is to generate a small array of values from an RNG. Some RNG's may lend itself well for this, like ChaCha, and possibly HC-128, although that one does not use SIMD. Other options are packing together multiple small RNGs, like Xorshift+ or Xoroshiro+. The result type of such an RNG should be some array, so maybe we can make use of the BlockRngCore trait?
Making use of the generated values can be a more interesting problem. This involves the distributions code.
Converting to floating point: the Standard uniform distribution should not cause any problems, and neither should the range code.
I suppose the other distributions are going to give trouble, at least anything that needs branches or a look-up table (i.e. the current ziggurat code) is not going to work well. And what should happen when an algorithm needs to do rejections sampling, and one of a couple of SIMD values get rejected? Return NaN? For most distributions there should exist an algorithm suitable for SIMD. The Box-Muller transform for example looks promising for the Normal distribution, although it would need a library to provide functions like sin and cos.
Integer ranges: I don't expect the widening multiply method to work for integer range reduction, and division or modulus also do not seem to be available. And I suppose we have no way to indicate a value is rejected because it falls outside the acceptable zone and can cause bias.
Now I don't know how far we should go, and if Rand should provide any implementations at all. But I think we should make sure the API does not prevent Rand from being used in this area.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
I think this might give us some direction https://github.com/lemire/SIMDxorshift, specifically the |
Also of note: for multi-rngs we could use several sets of constants at once. More or less alternating between Xorshift64(10, 13, 10), Xorshift64(8, 9, 22), Xorshift64(2, 7, 3), and Xorshift64(23, 3, 24) for example. Might help raise statistical quality. |
Great if you would want to help out a bit here! I believe @vks has also been exploring this, for example in https://github.com/vks/xoroshiro/blob/master/src/rng/xoroshiro128simd.rs
👍 I can't read those function calls yet, but he is doing two widening multplies, even and odd, and keeping the
We could, and that seems like a good idea. But just to make sure, that doesn't mean all four (/eight?) can use the same seed. @TheIronBorn Would you be interested in exploring what it would take to have a Converting it to a float, or a floating point range are the simplest methods. |
Are you asking if such an implementation would require each lane to have the same value upon seeding? |
Float gen is pretty simple. Here's a quick gist https://gist.github.com/TheIronBorn/1ba4da07d6e8e05ab20cb780c383dbec |
Added range sampling |
Speeding up block RNGs with SIMD sounds reasonable; anything else sounds application-specific IMO.
For anything else, e.g. attempting to produce One specific case might be worth a specific solution: |
Thank you for code @TheIronBorn, something to look into. Can I also ask you to create a branch in your fork of rand with the changes? Or did you have another way set up to test things?
When I was exploring papers around RNGs and related stuff GPU processing was the current trend, and exploring the possibilities of SIMD also seems to come up regularly. As it is a topic we did not give any attention until know, at least good to explore a bit... But I have no idea yet how useful it will really be. Especially for some distributions it may be hard for SIMD to provide any benefit over the traditional methods. One thing that seems like an obvious problem is that So our current Xorshift implementation natively generates |
Why would we need caching? To return one value at a time instead of a SIMD type? Or did you write this because the smaller number of sampling algorithms may need some mutable state? |
Caching would just be to return one value at a time, yes; the trouble with that approach is that there is significant overhead storing/retrieving the cached values. @pitdicker your branch looks interesting, though you could probably just do The There is a significant reason I don't want to add an associated type to But I hold to my previous point: other than a few cases where SIMD can fit into the existing API (e.g. ChaCha core) I see little point trying to develop the infrastructure without at least one (preferably several) application able to consume SIMD values. |
Here's my branch https://github.com/TheIronBorn/rand/tree/simd/src/simd |
Blending sets of constants for Xorshift seems to only optimize well for machines later than somewhere around Haswell: https://godbolt.org/g/G2x1E2. (Adapted from O'Neill's Xorshift library) I don't have such a machine so I can't confirm actual speed. Removing the multiply helps. |
I realized just after posting. So it was a bad idea from me, feel free to shoot it down 😄.
Nice, I have been playing with it. And SFC seems a pretty good choice here. I think I have found a way to support SIMD without extra traits like I don't think the I have a proof of concept in https://github.com/pitdicker/rand/commits/simd_support. The code in
I agree we shouldn't turn everything upside down only to support SIMD, especially when we are not sure how useful it will be. Does implementing |
Nice, I didn't expect Implementing existing traits/distributions is fine — well, we have to look at how much code gets added, but probably fine. |
@TheIronBorn I have tried to clean up my branch a bit, removed some unnecessary unsafe code etc. Do you want to have a look and maybe experiment from there? I have tried to make the One nice benchmark on my PC: |
You could probably use |
Recently I switch to SIMD enabled Fast Mersenne Twister and could see improvement compared to |
If you care about performance, you should probably not use the Mersenne Twister, it is a lot slower than more modern RNGs like PCG and xoroshiro, while having worse statistical quality. |
I just pushed an update to my repo which implements Box-Muller TheIronBorn@f0b2f38.
The performance probably suffers significantly as my machine doesn't have SIMD FMA instructions. The API could be better and the math could probably be optimized, but it seems to work fine. |
I also just pushed some helper functions which should make implementing a SIMD RNG easier TheIronBorn@8da1cd6 |
A nice example of what SIMD RNGs can do for you: Monte Carlo estimation of pi. https://gist.github.com/TheIronBorn/ba1f1e39b6466353b1eee6cf450fe247
|
Wow, respect for the math implementations there! Did you use methods similar to those in Cephes library, or another source? We really need something like that to be a crate in Rust. The performance is not super impressive (it has to approximate sin and cos after all)... |
I'm not sure what you are looking for, but there are bindings to cephes, special functions implemented in Rust and special functions implemented for statrs. |
Sorry I didn't read that right. Those libraries look useful thanks |
Most of the my math impls come from https://github.com/miloyip/normaldist-benchmark |
Also of note: the performance should be better on Haswell and above (I have only Sandy Bridge) as the math functions make heavy use of SIMD FMA operations. |
To showcase the portable packed SIMD vector facilities in The scalar version of the benchmark needs to generate random I've added some benchmarks (src here):
These numbers do not make much sense to me (feel free to investigate further), but they hint that explicitly vectorized PRNG might make sense in some cases. For example, if my intent was to populate a large vector of It would be cool if some of the pseudo-random number generators in the |
Functionalities
Possible?
I doubt large lookup tables (ziggurat) will ever be viable (though perhaps with (inlining throughout my branch is inconsistent and might be doing more harm than good) Edit: |
Quite a list already!
Yes, I think this one is a good start for a PR. I spend a lot of effort optimizing the scalar
This one will probably be difficult to merge, but can you describe it some more? And did you somewhere implement the endianness conversion functions? That is somewhat of a requirement for SIMD RNGs. |
The rejection sampling trait is available in my branch. It allows a user to pass a comparison function similar to I didn’t implement endianness conversion. Thinking on it now it shouldn’t be too difficult to write. |
I added |
|
I managed to get equal performance, but it seems like |
The newer the hardware the faster unaligned loads are. For modern hardware, they are pretty much as fast as aligned loads. For older hardware, unaligned loads are slower, but depending on how much work you do per load, the speed-up you might get from switching from unaligned loads to aligned ones might be minimal.
I think we should add these to |
Should we consider offering an SIMD PRNG? Concerns are portability, quality, correlation between streams/lanes, reproducibility(?). |
IMO yes we should but as you say, there are several considerations. Of course the naive approach is to use multiple simple PRNGs in parallel (and this may also be the fastest), but ideally there would be some mixing between the different parts of state (this should extend cycle length at least). Note that some generators compute with higher state sizes internally than they output, so SIMD may not always be appropriate (e.g. MCGs and LCGs may use 128-bit arithmetic to output 64-bit values). However, first, we might look at whether any of the existing generators can have SIMD optimisations, and then look for existing designs for SIMD generators. |
Most modern SIMD designs also use AES-NI. |
CorrelationThis paper is useful: Random Numbers for Parallel Computers: Requirements and Methods, With Emphasis on GPUs It lists a couple methods:
Other approaches
|
I published a crate with my PRNG tests: https://github.com/TheIronBorn/simd_prngs Most are just parallelized scalar algorithms. I'm not aware of a decent method to mix state. Pseudo-Random Number Generators for Vector Processors and Multicore Processors might have something. The only non-cryptographic PRNGs designed for SIMD I'm aware of are those of Random123.
A few other notes:
|
Shuffling the 16 bytes of let x = u8x16::from_bits(xorshift128plus_x2);
let indices = [9, 15, 0, 8, 3, 11, 6, 13, 4, 14, 10, 1, 7, 5, 2, 12]; // rng.shuffle(0..16)
let r: u8x16 = unsafe { simd_shuffle16(x, x, indices) };
return u64x2::from_bits(r); Mixing I don't know what effect this might have on correlation. |
You improved test performance just by shuffling the output? Interesting. I wonder if this just makes correlations harder to spot though? If PractRand is still seeing all the same output then literally the only change is the byte order. I think many of the tests are engineered to look for certain types of pattern, and some of those may not show up with the extra shuffle. On the other hand I think shuffling the state could significantly improve the quality (cycle length could in theory be increased by the power of the number of streams), though in effect this produces a totally different generator, hence there could be problems and it might be preferable to use different constants. I'm not an expert on this; it would be much better to have input from someone who actually designs PRNGs. |
Nice work @TheIronBorn assembling that collection of SIMD generators. What's XSM? A pseudo-LCG? It would be interesting to include 96-bit / 128-bit LCGs. Basically half of PCG, with a bit better performance:
I'm not quite sure where this is going, but publishing some of these at least as part of Rand would be nice. I guess we should get back to #431. |
Could you make that LCG code available? Perhaps just a pull request on https://github.com/TheIronBorn/simd_prngs. And here’s @pitdicker’s evaluation of XSM dhardy#60 (comment) |
|
I managed to get |
No, that's not impressive. It seems to me that multiplication or even addition by a constant would cause better bit-avalanche than bit-shift-and-XOR operations, but then Xorshift stays away from these for performance. I'm not sure whether it's worth continuing down this path. It's essentially building a new PRNG using only automated tests for evaluation. But I'll have a go at building a mixing LCG/MCG if you like. |
Intel has a mixing LCG here https://software.intel.com/en-us/articles/fast-random-number-generator-on-the-intel-pentiumr-4-processor/. (Perhaps I'm reading it incorrectly and it's not actually mixing state) My quick testing doesn't indicate it's very fast and they don't give many details on its quality. |
I was only looking into the state mixing to try to find some way of achieving
Probably not the best approach to test mixing with a poorer quality generator like |
I was just experimenting with a generator based on this. Unfortunately performance is abysmal using |
In all honesty, I think it might be time to close this issue. Lets recap. #523 landed support for SIMD types on the There has been some discussion of supporting SIMD types for other distributions, notably There has been some discussion on adapting the There has been a lot of discussion on parallel/SIMD RNG implementations and @TheIronBorn has assembled a collection of SIMD generators. I remain sceptical about simply bolting together multiple small RNGs since there are several existing PRNGs with 128-bit output at least, and larger generators tend to have better quality than smaller ones, although for some applications multiple small generators may be faster and have sufficient quality. At this point I think it makes sense to create smaller, more topical discussions on issues of continued interest. I also suggest that when considering SIMD PRNGs we should try benchmarking with something more complex than micro-benchmarks where possible. @gnzlbg's Ambient Occlusion ray casting benchmark may be useful for this (though multiple applications would be preferred). In fact, it may be worth opening an issue just to talk about benchmarking (SIMD as well as scalar PRNGs). |
FYI the purpose of However, ISPC only provides one PRNG, and if we can beat it conveniently in Rust, that benchmark might be a good place to try that out and even report it. Just keep in mind what the original intent of the benchmark is, and if you submit a PR with a better PRNGs, put them behind a feature flag so that we can still use the benchmark in the |
53: switch gen_biguint to fill_bytes r=cuviper a=TheIronBorn Changes `gen_biguint` from a `push(gen::<u32>)` method to rand's [`fill_bytes`](https://docs.rs/rand/0.5.0/rand/trait.RngCore.html#tymethod.fill_bytes). This should improve performance in most cases. - For small PRNGs which only natively generate 64 bits (like Xorshift64 or [`splitmix64.c`](http://prng.di.unimi.it/splitmix64.c)), this will no longer throw away half the bits generated. - For block PRNGs like `StdRng`, this should reduce overhead. - For an SIMD PRNG (rust-random/rand#377), this would be a significant improvement. ```diff,ignore name no_fill ns/iter fill ns/iter diff ns/iter diff % speedup +rand_1009 256 222 -34 -13.28% x 1.15 +rand_131072 27,366 14,715 -12,651 -46.23% x 1.86 +rand_2048 459 357 -102 -22.22% x 1.29 -rand_256 93 130 37 39.78% x 0.72 +rand_4096 842 557 -285 -33.85% x 1.51 -rand_64 69 92 23 33.33% x 0.75 +rand_65536 13,625 7,382 -6,243 -45.82% x 1.85 +rand_8192 1,836 869 -967 -52.67% x 2.11 ``` (i.e. `rand_1009` does `gen_biguint(1009)`. All benches are powers of two except `rand_1009`) (Let me know if you want the `rand_` benches added) Co-authored-by: TheIronBorn <> Co-authored-by: Josh Stone <cuviper@gmail.com>
Hi! This topic is interesting to me, I have this pet project where I'm using the monte carlo method to simulate football matches. I've used this project as an excuse to learn more about SIMD and different topics. I also wanted to learn Rust so naturally I ended up here. I noticed that the current I have some portable implementations which are using nightly portable SIMD, and some that are using SIMD intrinsics directly from I thought I'd mention it here in case there is any interest in this still |
Disclaimer: I don't really know what I am talking about here.
The first step, as far as Rand is concerned, is to generate a small array of values from an RNG. Some RNG's may lend itself well for this, like ChaCha, and possibly HC-128, although that one does not use SIMD. Other options are packing together multiple small RNGs, like Xorshift+ or Xoroshiro+. The result type of such an RNG should be some array, so maybe we can make use of the
BlockRngCore
trait?Making use of the generated values can be a more interesting problem. This involves the distributions code.
Converting to floating point: the
Standard
uniform distribution should not cause any problems, and neither should the range code.I suppose the other distributions are going to give trouble, at least anything that needs branches or a look-up table (i.e. the current ziggurat code) is not going to work well. And what should happen when an algorithm needs to do rejections sampling, and one of a couple of SIMD values get rejected? Return
NaN
? For most distributions there should exist an algorithm suitable for SIMD. The Box-Muller transform for example looks promising for theNormal
distribution, although it would need a library to provide functions likesin
andcos
.Integer ranges: I don't expect the widening multiply method to work for integer range reduction, and division or modulus also do not seem to be available. And I suppose we have no way to indicate a value is rejected because it falls outside the acceptable zone and can cause bias.
Now I don't know how far we should go, and if Rand should provide any implementations at all. But I think we should make sure the API does not prevent Rand from being used in this area.
The text was updated successfully, but these errors were encountered: