Investigate SIMD support #377

pitdicker · 2018-04-05T19:58:10Z

Disclaimer: I don't really know what I am talking about here.

The first step, as far as Rand is concerned, is to generate a small array of values from an RNG. Some RNG's may lend itself well for this, like ChaCha, and possibly HC-128, although that one does not use SIMD. Other options are packing together multiple small RNGs, like Xorshift+ or Xoroshiro+. The result type of such an RNG should be some array, so maybe we can make use of the BlockRngCore trait?

Making use of the generated values can be a more interesting problem. This involves the distributions code.

Converting to floating point: the Standard uniform distribution should not cause any problems, and neither should the range code.

I suppose the other distributions are going to give trouble, at least anything that needs branches or a look-up table (i.e. the current ziggurat code) is not going to work well. And what should happen when an algorithm needs to do rejections sampling, and one of a couple of SIMD values get rejected? Return NaN? For most distributions there should exist an algorithm suitable for SIMD. The Box-Muller transform for example looks promising for the Normal distribution, although it would need a library to provide functions like sin and cos.

Integer ranges: I don't expect the widening multiply method to work for integer range reduction, and division or modulus also do not seem to be available. And I suppose we have no way to indicate a value is rejected because it falls outside the acceptable zone and can cause bias.

Now I don't know how far we should go, and if Rand should provide any implementations at all. But I think we should make sure the API does not prevent Rand from being used in this area.

The text was updated successfully, but these errors were encountered:

TheIronBorn · 2018-04-05T22:17:19Z

I’ve actually been playing with this a lot. Let me see about putting something worth scrutiny on github.

…

Sent from my mobile device

On Apr 5, 2018, at 12:58, Paul Dicker ***@***.***> wrote: Disclaimer: I don't really know what I am talking about here. The first step, as far as Rand is concerned, is to generate a small array of values from an RNG. Some RNG's may lend itself well for this, like ChaCha, and possibly HC-128, although that one does not use SIMD. Other options are packing together multiple small RNGs, like Xorshift+ or Xoroshiro+. The result type of such an RNG should be some array, so maybe we can make use of the BlockRngCore trait? Making use of the generated values can be a more interesting problem. This involves the distributions code. Converting to floating point: the Standard uniform distribution should not cause any problems, and neither should the range code. I suppose the other distributions are going to give trouble, at least anything that needs branches or a look-up table (i.e. the current ziggurat code) is not going to work well. And what should happen when an algorithm needs to do rejections sampling, and one of a couple of SIMD values get rejected? Return NaN? For most distributions there should exist an algorithm suitable for SIMD. The Box-Muller transform for example looks promising for the Normal distribution, although it would need a library to provide functions like sin and cos. Integer ranges: I don't expect the widening multiply method to work for integer range reduction, and division or modulus also do not seem to be available. And I suppose we have no way to indicate a value is rejected because it falls outside the acceptable zone and can cause bias. Now I don't know how far we should go, and if Rand should provide any implementations at all. But I think we should make sure the API does not prevent Rand from being used in this area. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

TheIronBorn · 2018-04-05T22:20:34Z

I think this might give us some direction https://github.com/lemire/SIMDxorshift, specifically the avx_randombound_epu32 function: https://github.com/lemire/SIMDxorshift/blob/1766a98f4be5e1d2e78d9e2c3623800fbf92931f/src/simdxorshift128plus.c#L90

TheIronBorn · 2018-04-06T01:42:48Z

Also of note: for multi-rngs we could use several sets of constants at once. More or less alternating between Xorshift64(10, 13, 10), Xorshift64(8, 9, 22), Xorshift64(2, 7, 3), and Xorshift64(23, 3, 24) for example. Might help raise statistical quality.

pitdicker · 2018-04-06T05:35:14Z

Great if you would want to help out a bit here!

I believe @vks has also been exploring this, for example in https://github.com/vks/xoroshiro/blob/master/src/rng/xoroshiro128simd.rs

I think this might give us some direction lemire/SIMDxorshift

👍 I can't read those function calls yet, but he is doing two widening multplies, even and odd, and keeping the high parts? And without checking against low they will be biased, but this is not always a problem, especially if the range is small.

Also of note: for multi-rngs we could use several sets of constants at once.

We could, and that seems like a good idea. But just to make sure, that doesn't mean all four (/eight?) can use the same seed.

@TheIronBorn Would you be interested in exploring what it would take to have a Distribution take the output of such an RNG, and transforming it wile staying in the SIMD area?

Converting it to a float, or a floating point range are the simplest methods.

TheIronBorn · 2018-04-06T06:16:23Z

We could, and that seems like a good idea. But just to make sure, that doesn't mean all four (/eight?) can use the same seed.

Are you asking if such an implementation would require each lane to have the same value upon seeding?

TheIronBorn · 2018-04-06T07:24:43Z

Float gen is pretty simple. Here's a quick gist https://gist.github.com/TheIronBorn/1ba4da07d6e8e05ab20cb780c383dbec

TheIronBorn · 2018-04-06T07:40:23Z

Added range sampling

dhardy · 2018-04-06T10:25:20Z

Speeding up block RNGs with SIMD sounds reasonable; anything else sounds application-specific IMO.

Distribution::sample(..) -> T so in theory the trait can support SIMD distributions, given native consumers of the SIMD types. I guess it should be possible to make adapter implementations, i.e. automatically implement Distribution<(f64, f64)> for any type supporting Distribution<f64x2>.

For anything else, e.g. attempting to produce Normal samples in blocks for individual consumption, I guess something like the MutDistribution or CachedDistribution (i.e. equivalent to the old Sample trait) is needed to allow caching — but this significantly increases complexity and I'm not sure it would even be faster in many cases.

One specific case might be worth a specific solution: StandardBlockRng<T>. It may be possible to make a wrapper around any BlockRngCore which immediately applies the Standard distribution to convert its results buffer to an array of type T, then draws values from this buffer. In theory this might allow some optimisation between the RNG (generation) and the distribution (conversion) steps, although probably whatever loop the RNG step requires to fill the buffer would prevent this. This has the added advantages of a common design with & without SIMD and that it can simply implement Iterator<Item=T>.

pitdicker · 2018-04-06T14:08:10Z

Thank you for code @TheIronBorn, something to look into. Can I also ask you to create a branch in your fork of rand with the changes? Or did you have another way set up to test things?

Speeding up block RNGs with SIMD sounds reasonable; anything else sounds application-specific IMO.

When I was exploring papers around RNGs and related stuff GPU processing was the current trend, and exploring the possibilities of SIMD also seems to come up regularly. As it is a topic we did not give any attention until know, at least good to explore a bit...

But I have no idea yet how useful it will really be. Especially for some distributions it may be hard for SIMD to provide any benefit over the traditional methods.

One thing that seems like an obvious problem is that Distribution takes an Rng, but Rng exposes no methods to give something like next_u32x4. I think a lot can be solved by adding another method and associated type to RngCore. This type would indicate the native value produced by this RNG. It is something I thought that would be useful for months, but never spoke about...

So our current Xorshift implementation natively generates u32s, Xoroshiro128+ u64s, ChaCha [u32; 16], etc. With an addition like this is is possible to implement RNGs for specific types not covered by the trait. For example one specialized for u16 (maybe for embedded?), for SIMD with a type like next_u32x4, or one designed for floats with f64. I created an incomplete branch with changes. @dhardy Please don't shoot it down instantly...

pitdicker · 2018-04-06T14:54:32Z

For anything else, e.g. attempting to produce Normal samples in blocks for individual consumption, I guess something like the MutDistribution or CachedDistribution (i.e. equivalent to the old Sample trait) is needed to allow caching

Why would we need caching? To return one value at a time instead of a SIMD type? Or did you write this because the smaller number of sampling algorithms may need some mutable state?

dhardy · 2018-04-06T15:48:20Z

Caching would just be to return one value at a time, yes; the trouble with that approach is that there is significant overhead storing/retrieving the cached values.

@pitdicker your branch looks interesting, though you could probably just do RngCore: Iterator instead. Also what would the OsRng implementation look like?

The BlockRngCore::Results return values are probably going to be too big for many uses and there is no provided way of breaking the block up into smaller segments; this can be worked around however by choosing an RNG with Item type of the correct size.

There is a significant reason I don't want to add an associated type to RngCore: it is no longer object-safe without specifying the type (the same problem we hit in #307). It might however make sense to use a completely separate trait, possibly just Iterator, and then just implement SIMD distributions as simple functions usable via Iterator::map — this is not well integrated however.

But I hold to my previous point: other than a few cases where SIMD can fit into the existing API (e.g. ChaCha core) I see little point trying to develop the infrastructure without at least one (preferably several) application able to consume SIMD values.

TheIronBorn · 2018-04-06T18:31:37Z

Here's my branch https://github.com/TheIronBorn/rand/tree/simd/src/simd

TheIronBorn · 2018-04-07T01:24:25Z

Blending sets of constants for Xorshift seems to only optimize well for machines later than somewhere around Haswell: https://godbolt.org/g/G2x1E2. (Adapted from O'Neill's Xorshift library) I don't have such a machine so I can't confirm actual speed.

Removing the multiply helps.

pitdicker · 2018-04-07T18:49:12Z

There is a significant reason I don't want to add an associated type to RngCore: it is no longer object-safe without specifying the type

I realized just after posting. So it was a bad idea from me, feel free to shoot it down 😄.

Here's my branch TheIronBorn/rand:src/simd@simd

Nice, I have been playing with it. And SFC seems a pretty good choice here.

I think I have found a way to support SIMD without extra traits like SimdRng and SimdDistribution. The idea is that RngCore::fill_bytes does a simple copy of its result (like u32x4) to the byte slice if they have exactly the same length. And the integer distributions for Standard do the opposite: they initialize an empty u32x4, and pass it to fill_bytes as a byte slice. The optimizer is able to work things out.

I don't think the BlockRng wrapper is going to work well in combination with a simple RNG adapted for SIMD. I expect the bookkeeping to take as much time as just generating a new batch of values. So we may just as well always generate a new batch, and use as much as needed.

I have a proof of concept in https://github.com/pitdicker/rand/commits/simd_support. The code in sfc.rs is very bad and ugly. I am not sure using BlockRngCore in there is useful. But the trick with using fill_bytes seems to work.

But I hold to my previous point: other than a few cases where SIMD can fit into the existing API (e.g. ChaCha core) I see little point trying to develop the infrastructure without at least one (preferably several) application able to consume SIMD values.

I agree we shouldn't turn everything upside down only to support SIMD, especially when we are not sure how useful it will be.

Does implementing Standard and Range for SIMD types fall for you in the category 'develop infrastructure' or 'extension of existing functionality'? Without tests it takes about 125 extra loc: pitdicker/rand@c03deb6...pitdicker:6918e11. The first commit is a necessary change in rand_core to get the inlining right for fill_bytes, and I think that is something that should be in rand_core 0.1. The rest is definitely not interesting until after rand 0.5.

dhardy · 2018-04-08T07:16:58Z

Nice, I didn't expect fill_bytes to work so well!

Implementing existing traits/distributions is fine — well, we have to look at how much code gets added, but probably fine.

pitdicker · 2018-04-08T17:33:24Z

@TheIronBorn I have tried to clean up my branch a bit, removed some unnecessary unsafe code etc. Do you want to have a look and maybe experiment from there?

I have tried to make the fill_bytes method on Sfc32x* in such a way that just about all of it is optimized out when the target is a SIMD type, but to also support filling arbitrary slices, as it supposed to be able to do. One thing missing is byte-swapping on big-endian architectures. This can probably be done pretty efficiently with SIMD, do you know a method?

One nice benchmark on my PC: Sfc32x4Rng can generate f32 floating point values at 10GB/s, 4x as fast as the current Xorshift. And we would have to get very creative with supporting Distribution on tuples or slices for even Xoroshiro128+ to get 5GB/s (might be worth the experiment though).

TheIronBorn · 2018-04-10T05:57:32Z

You could probably use simd_shuffle for btye-swapping. It's not available via #![feature(stdsimd)] though. I suppose the better thing to do is implement it at stdsimd.

rohitjoshi · 2018-04-20T16:08:17Z

Recently I switch to SIMD enabled Fast Mersenne Twister and could see improvement compared to mersenne_twister

vks · 2018-04-22T02:28:01Z

If you care about performance, you should probably not use the Mersenne Twister, it is a lot slower than more modern RNGs like PCG and xoroshiro, while having worse statistical quality.

TheIronBorn · 2018-04-24T02:48:21Z

I just pushed an update to my repo which implements Box-Muller TheIronBorn@f0b2f38.

distr_normal          ... bench:  33,329 ns/iter (+/- 53,512) = 240 MB/s
distr_normal_Sfc32x4  ... bench:  92,156 ns/iter (+/- 58,657) = 347 MB/s

The performance probably suffers significantly as my machine doesn't have SIMD FMA instructions.

The API could be better and the math could probably be optimized, but it seems to work fine.

TheIronBorn · 2018-04-24T04:40:38Z

I also just pushed some helper functions which should make implementing a SIMD RNG easier TheIronBorn@8da1cd6

TheIronBorn · 2018-04-25T20:42:39Z

A nice example of what SIMD RNGs can do for you: Monte Carlo estimation of pi. https://gist.github.com/TheIronBorn/ba1f1e39b6466353b1eee6cf450fe247

test scalar ... bench:  26,424,886 ns/iter (+/- 10,327,348) = 605 MB/s
test simd   ... bench:  28,363,059 ns/iter (+/- 48,045,043) = 2256 MB/s

pitdicker · 2018-04-27T14:39:02Z

I just pushed an update to my repo which implements Box-Muller

Wow, respect for the math implementations there! Did you use methods similar to those in Cephes library, or another source? We really need something like that to be a crate in Rust.

The performance is not super impressive (it has to approximate sin and cos after all)...
I wonder if this is a situation where SIMD really makes sense. I'll have a closer look at your code later.

vks · 2018-04-30T22:35:04Z

Did you use methods similar to those in Cephes library, or another source? We really need something like that to be a crate in Rust.

I'm not sure what you are looking for, but there are bindings to cephes, special functions implemented in Rust and special functions implemented for statrs.

TheIronBorn · 2018-04-30T22:54:13Z

Implementing Box-Muller and other complex stuff with SIMD requires complex math like sin/cos/log. For most machines those instructions aren't available, but it is still possible to have fast functions modeled after something like Cephes library. I don't know of any such Rust implementations.

I don't think most of those intrinsics are even in stdsimd yet: rust-lang/stdarch#310

Sorry I didn't read that right. Those libraries look useful thanks

TheIronBorn · 2018-04-30T23:01:39Z

Most of the my math impls come from https://github.com/miloyip/normaldist-benchmark

TheIronBorn · 2018-04-30T23:12:58Z

Also of note: the performance should be better on Haswell and above (I have only Sandy Bridge) as the math functions make heavy use of SIMD FMA operations.

gnzlbg · 2018-06-06T13:32:21Z

To showcase the portable packed SIMD vector facilities in std::simd I was porting the Ambient Occlusion ray casting benchmark (aobench) from ISPC to Rust: https://github.com/gnzlbg/aobench where I noticed that pseudo-random number generation was a bottleneck of the Rust version (using the rand crate).

The scalar version of the benchmark needs to generate random f32s, and the vectorized version of the benchmark needs to generate pseudo-random SIMD vectors of f32s. I've switched from the rand crate to a hacked-in pseudo-random number generator for the scalar version (src here), and explicitly vectorizing it for generating SIMD vectors (src here). I've used the same algorithm for both versions since the point of the benchmark was to test the portable SIMD facilities in std::simd.

I've added some benchmarks (src here):

scalar (Xeon E5-2690 v4 @ 2.60GHz): throughput: 174*10^6 f32/s, 5.7ns per function call
vector (Xeon E5-2690 v4 @ 2.60GHz): throughput: 2072*10^6 f32/s (12x larger), 3.8ns per function call (generates one f32x8 per call)
scalar (Intel Core i5 @1.8 GHz): throughput: 190*10^6 f32/s, 5.2ns per function call
vector (Intel Core i5 @1.8 GHz)): throughput: 673*10^6 f32/s (3.5x larger), 11.9ns per function call (generates one f32x8 per call)

These numbers do not make much sense to me (feel free to investigate further), but they hint that explicitly vectorized PRNG might make sense in some cases. For example, if my intent was to populate a large vector of f32s with pseudo-random numbers, on my laptop the vector version has twice the latency but still 3.5x higher throughput.

It would be cool if some of the pseudo-random number generators in the rand crate could be explicitly vectorized to generate random SIMD vectors.

TheIronBorn · 2018-06-21T22:03:29Z

Functionalities

Helper functions for implementing RngCore functions via SIMD
- SIMD fill_bytes_via_simd optimizations (need to look into this more, could it help scalar block RNGs?)
SIMD rejection sampling trait
UniformInt for integer vectors (with rejection sampling)
- sample_single with leading zeros approx
- SIMD wmul (could probably be better optimized. there are mull instructions for some lane widths, mulh for fewer)
- optimizations for the gen_range(vec(0), vec(N)) case
shuffle (generates M indices at once for a uNxM vector, especially helpful for block rngs)
distributions
- box muller
- central limit theorem
- char & alphanumeric
- boolean, Bernoulli and Standard
lots of SIMD parallelized PRNGs (few are worthwhile)
various convenience stuff: SIMD rotate_left/right, AsByteSliceMut

Possible?

seq module (probably at least sample_indices_inplace)
WeightedChoice (via SIMD binary search, perhaps even linear?)
Cauchy (tangent might be slow on older hardware)
Binomial, Poisson, & Gamma via some complicated rejection sampling

I doubt large lookup tables (ziggurat) will ever be viable (though perhaps with VGATHER instructions?). Small tables (like Alphanumeric) are possible with a couple comparisons or SIMD vector shuffling (perfect for random hex characters).

(inlining throughout my branch is inconsistent and might be doing more harm than good)

Edit: ~~the gen_below optimizations~~, ~~bool & char distrs~~, other PRNGs are local and not available in my branch

pitdicker · 2018-06-22T16:40:17Z

Quite a list already!

SIMD fill_bytes_via_simd optimizations (need to look into this more, could it help scalar block RNGs?)

Yes, I think this one is a good start for a PR. I spend a lot of effort optimizing the scalar fill_bytes implementations, so that may not be worth the energy (but feel free to try 😄).

SIMD rejection sampling trait

This one will probably be difficult to merge, but can you describe it some more?

And did you somewhere implement the endianness conversion functions? That is somewhat of a requirement for SIMD RNGs.

TheIronBorn · 2018-06-22T16:58:25Z

The rejection sampling trait is available in my branch. It allows a user to pass a comparison function similar to sort_by, and a distribution to sample from. Returns only once each lane of the SIMD mask returned by the comparator is true.

I didn’t implement endianness conversion. Thinking on it now it shouldn’t be too difficult to write.

TheIronBorn · 2018-06-22T18:00:05Z

I added swap_bytes using stdsimd's intrinsics: TheIronBorn@3c6d9d4
We should switch to the safe portable shuffles when they land.

TheIronBorn · 2018-06-22T18:37:39Z

fill_bytes_via_simd for scalar RNGs doesn't seem worthwhile

test gen_bytes_hc128       ... bench:     976,078 ns/iter (+/- 717,525) = 1049 MB/s
test gen_bytes_hc128_simd  ... bench:   1,147,668 ns/iter (+/- 1,017,842) = 892 MB/s

TheIronBorn · 2018-06-22T20:08:51Z

I managed to get equal performance, but it seems like store_aligned doesn't give enough speedup (maybe for very new hardware?).

gnzlbg · 2018-06-23T10:11:11Z

it seems like store_aligned doesn't give enough speedup (maybe for very new hardware?).

The newer the hardware the faster unaligned loads are. For modern hardware, they are pretty much as fast as aligned loads. For older hardware, unaligned loads are slower, but depending on how much work you do per load, the speed-up you might get from switching from unaligned loads to aligned ones might be minimal.

various convenience stuff: SIMD rotate_left/right

I think we should add these to stdsimd, feel free to open a PR there and I'll review these.

TheIronBorn · 2018-06-30T00:00:54Z

Should we consider offering an SIMD PRNG? Concerns are portability, quality, correlation between streams/lanes, reproducibility(?).

dhardy · 2018-06-30T08:40:09Z

IMO yes we should but as you say, there are several considerations. Of course the naive approach is to use multiple simple PRNGs in parallel (and this may also be the fastest), but ideally there would be some mixing between the different parts of state (this should extend cycle length at least).

Note that some generators compute with higher state sizes internally than they output, so SIMD may not always be appropriate (e.g. MCGs and LCGs may use 128-bit arithmetic to output 64-bit values).

However, first, we might look at whether any of the existing generators can have SIMD optimisations, and then look for existing designs for SIMD generators.

vks · 2018-06-30T08:42:16Z

Most modern SIMD designs also use AES-NI.

TheIronBorn · 2018-06-30T19:05:59Z

Correlation

This paper is useful: Random Numbers for Parallel Computers: Requirements and Methods, With Emphasis on GPUs

It lists a couple methods:

A different generator for each stream (i.e. different parameters)
- viable unless the parameters are used in shifts/rotates (variable shift/rotates are only on newer hardware, mostly AVX512). In that case we could iterate through parameters. i.e. with 2-streams: (param1, param1), (param2, param2), ... (requires at least twice as much memory)
- We could even iterate through more parameter sets than vector lanes.
Splitting a single RNG into equally-spaced blocks (streams)
- requires a PRNG that can easily compute seed "distance"
- xorshift jumps, etc
A single RNG with a “random” seed for each stream
- probability that none of those streams overlap
  ≈ (1 - streams * stream_length / period) ^ (streams - 1)
Counter-based RNGs
- this is an excellent solution: jumping is trivial, we could even do it dynamically if needed.
- probability p that the s random keys are not all distinct:
  1 - p ≈ streams^2 / period
- AES or other hashes/ciphers. HighwayHash, CLHash I hear are fast with SIMD
  - Random123 has several: ARS-5/7 (non-cryptographic) might be the fastest around
- We could adapt some of these techniques to the counters of other PRNGs like SFC https://sourceforge.net/p/pracrand/discussion/366935/thread/a7761577

Other approaches

Combine multiple different PRNGs algorithms: Pseudo-Random Number Generators for Vector Processors and Multicore Processors
- with proper seeding the probability of overlaps
  ≈ streams^2 * stream_length / (period_1 * period_2)

TheIronBorn · 2018-06-30T20:02:09Z

I published a crate with my PRNG tests: https://github.com/TheIronBorn/simd_prngs

Most are just parallelized scalar algorithms. I'm not aware of a decent method to mix state. Pseudo-Random Number Generators for Vector Processors and Multicore Processors might have something.

The only non-cryptographic PRNGs designed for SIMD I'm aware of are those of Random123.

AES in counter mode
- small memory size (usually just 256 bits)
- ARS-5/7 (non-cryptographic) of Random123
  - decently fast
- some CSPRNGs which are faster than some scalar non-cryptographic PRNGs: Randen and fast-key-erasure AES
SFC
- decently to very fast (even without rotate instructions)
- PractRand has some SFC alternatives which may suit SIMD better than scalar (including VeryFast algs w/t counters)
JSF
- similar to faster speed than SFC (unlike scalar, interesting)
- lower period guarantees than SFC (2^20 for 32-bit, likely larger for 64-bit)
- larger statespace than SFC
Xorshift & related
- xoroshiro/xoshiro are merely decently fast on Sandy Bridge, likely because of their heavy use of rotates
- xorshift128+ & xorshift128 are very fast. (using all bits)
- which implies that when rotates are available xoroshiro/xoshiro could be very fast
XSM
- decently fast
- it has some nice jumping/transition features
- multiple multiply parameters?
PCG
- very slow without variable shift/rotate instructions
- multiple parameters are viable
- Pcg64 variants use 128-bit multiplication, difficult with SIMD
- built-in streams (some new details http://www.pcg-random.org/posts/on-vignas-pcg-critique.html)
lfsr113/258
- lfsr113 is used by Intel SPMD Program Compiler
- not very fast
Multiply-with-carry
- used by http://www.agner.org/random
- uses multiple sets of parameters (iterating through 8 sets, 2 at a time)
- they claim each parameter set individually passes BigCrush when a xorshift is added. (PractRand seems to disagree, but I may have implemented it incorrectly. Using @pitdicker's cat_rng.rs)
- decently fast
- small to large memory size

A few other notes:

This PRNG based on MumHash is pretty fast (might be good for counter-mode). Edit: 16 GB in PractRand.
Seeding an SIMD PRNG takes a lot of bits. We could try to speed it up or use fewer bits with things like SFC's seed_fast, xorshift jumps, or rand's seed_from_u64.

TheIronBorn · 2018-07-01T03:46:25Z

Shuffling the 16 bytes of xorshift128+x2 according to some randomly shuffled unique indices actually bumps it from 128MB in PractRand to 64GB (binary rank failure)

let x = u8x16::from_bits(xorshift128plus_x2);
let indices = [9, 15, 0, 8, 3, 11, 6, 13, 4, 14, 10, 1, 7, 5, 2, 12]; // rng.shuffle(0..16)
let r: u8x16 = unsafe { simd_shuffle16(x, x, indices) };
return u64x2::from_bits(r);

Mixing ~~state~~ output might allow us to use a poorer, faster PRNG. (I did have to try several indices though).

I don't know what effect this might have on correlation.

dhardy · 2018-07-01T09:14:22Z

You improved test performance just by shuffling the output? Interesting. I wonder if this just makes correlations harder to spot though? If PractRand is still seeing all the same output then literally the only change is the byte order. I think many of the tests are engineered to look for certain types of pattern, and some of those may not show up with the extra shuffle.

On the other hand I think shuffling the state could significantly improve the quality (cycle length could in theory be increased by the power of the number of streams), though in effect this produces a totally different generator, hence there could be problems and it might be preferable to use different constants. I'm not an expert on this; it would be much better to have input from someone who actually designs PRNGs.

dhardy · 2018-07-01T09:42:07Z

Nice work @TheIronBorn assembling that collection of SIMD generators.

What's XSM? A pseudo-LCG?

It would be interesting to include 96-bit / 128-bit LCGs. Basically half of PCG, with a bit better performance:

test lcg32x2                 ... bench:       3,152 ns/iter (+/- 60) = 2598 MB/s
test lcg32x4                 ... bench:       3,161 ns/iter (+/- 64) = 5183 MB/s
test lcg32x8                 ... bench:       3,841 ns/iter (+/- 150) = 8531 MB/s

I'm not quite sure where this is going, but publishing some of these at least as part of Rand would be nice. I guess we should get back to #431.

TheIronBorn · 2018-07-01T16:34:29Z

Could you make that LCG code available? Perhaps just a pull request on https://github.com/TheIronBorn/simd_prngs.

And here’s @pitdicker’s evaluation of XSM dhardy#60 (comment)

pitdicker · 2018-07-01T16:37:17Z

What's XSM? A pseudo-LCG?

dhardy#60 (comment)

TheIronBorn · 2018-07-01T21:46:38Z

I managed to get xorshift128+_x4 to 4GB by swizzling the full state before the regular xorshift, but it's got 512 bits of state so that isn't too impressive

dhardy · 2018-07-02T07:39:41Z

No, that's not impressive. It seems to me that multiplication or even addition by a constant would cause better bit-avalanche than bit-shift-and-XOR operations, but then Xorshift stays away from these for performance.

I'm not sure whether it's worth continuing down this path. It's essentially building a new PRNG using only automated tests for evaluation. But I'll have a go at building a mixing LCG/MCG if you like.

TheIronBorn · 2018-07-02T08:22:56Z

Intel has a mixing LCG here https://software.intel.com/en-us/articles/fast-random-number-generator-on-the-intel-pentiumr-4-processor/. (Perhaps I'm reading it incorrectly and it's not actually mixing state)

My quick testing doesn't indicate it's very fast and they don't give many details on its quality.

TheIronBorn · 2018-07-02T08:28:27Z

I was only looking into the state mixing to try to find some way of achieving

ideally there would be some mixing between the different parts of state

Probably not the best approach to test mixing with a poorer quality generator like xorshift128+. Also we still won't have the math guarantees of the period/etc of the mixed generator.

dhardy · 2018-07-02T09:12:26Z

I was just experimenting with a generator based on this. Unfortunately performance is abysmal using simd_shuffle16 etc (2-3GB/s) which means it's probably not worth further investigation.

dhardy · 2018-07-02T11:14:27Z

In all honesty, I think it might be time to close this issue. Lets recap.

#523 landed support for SIMD types on the Standard and Uniform distributions (thanks especially to @pitdicker and @TheIronBorn!).

There has been some discussion of supporting SIMD types for other distributions, notably Normal and the Ziggurat algorithm, and possible alternatives (Central Limit Theorem).

There has been some discussion on adapting the RngCore type, though for now it looks like the current API might be sufficient.

There has been a lot of discussion on parallel/SIMD RNG implementations and @TheIronBorn has assembled a collection of SIMD generators. I remain sceptical about simply bolting together multiple small RNGs since there are several existing PRNGs with 128-bit output at least, and larger generators tend to have better quality than smaller ones, although for some applications multiple small generators may be faster and have sufficient quality.

At this point I think it makes sense to create smaller, more topical discussions on issues of continued interest.

I also suggest that when considering SIMD PRNGs we should try benchmarking with something more complex than micro-benchmarks where possible. @gnzlbg's Ambient Occlusion ray casting benchmark may be useful for this (though multiple applications would be preferred).

In fact, it may be worth opening an issue just to talk about benchmarking (SIMD as well as scalar PRNGs).

gnzlbg · 2018-07-02T11:29:05Z

FYI the purpose of aobench is to compare Rust's code generation with that of ISPC, so I will keep the benchmark algorithmically/data-structure-wise as close as possible to the benchmark implementation in ISPC (see the volta subdirectory).

However, ISPC only provides one PRNG, and if we can beat it conveniently in Rust, that benchmark might be a good place to try that out and even report it. Just keep in mind what the original intent of the benchmark is, and if you submit a PR with a better PRNGs, put them behind a feature flag so that we can still use the benchmark in the lolbench suite to track Rust run-time performance regressions.

53: switch gen_biguint to fill_bytes r=cuviper a=TheIronBorn Changes `gen_biguint` from a `push(gen::<u32>)` method to rand's [`fill_bytes`](https://docs.rs/rand/0.5.0/rand/trait.RngCore.html#tymethod.fill_bytes). This should improve performance in most cases. - For small PRNGs which only natively generate 64 bits (like Xorshift64 or [`splitmix64.c`](http://prng.di.unimi.it/splitmix64.c)), this will no longer throw away half the bits generated. - For block PRNGs like `StdRng`, this should reduce overhead. - For an SIMD PRNG (rust-random/rand#377), this would be a significant improvement. ```diff,ignore name no_fill ns/iter fill ns/iter diff ns/iter diff % speedup +rand_1009 256 222 -34 -13.28% x 1.15 +rand_131072 27,366 14,715 -12,651 -46.23% x 1.86 +rand_2048 459 357 -102 -22.22% x 1.29 -rand_256 93 130 37 39.78% x 0.72 +rand_4096 842 557 -285 -33.85% x 1.51 -rand_64 69 92 23 33.33% x 0.75 +rand_65536 13,625 7,382 -6,243 -45.82% x 1.85 +rand_8192 1,836 869 -967 -52.67% x 2.11 ``` (i.e. `rand_1009` does `gen_biguint(1009)`. All benches are powers of two except `rand_1009`) (Let me know if you want the `rand_` benches added) Co-authored-by: TheIronBorn <> Co-authored-by: Josh Stone <cuviper@gmail.com>

martinothamar · 2023-07-26T13:49:08Z

Hi! This topic is interesting to me, I have this pet project where I'm using the monte carlo method to simulate football matches. I've used this project as an excuse to learn more about SIMD and different topics. I also wanted to learn Rust so naturally I ended up here. I noticed that the current rand SIMD feature doesn't really vectorize random number generation, so I've ended up implementing some vectorized PRNGs here: https://github.com/martinothamar/simd-rand

I have some portable implementations which are using nightly portable SIMD, and some that are using SIMD intrinsics directly from std::arch. The fastest one is Xoshiro256+ for AVX512 which has reached ~50 GiB/s on my laptop when generating u64x8 vectors.

I thought I'd mention it here in case there is any interest in this still

pitdicker mentioned this issue Apr 12, 2018

Add RngCore::bytes_per_round #396

Closed

gnzlbg mentioned this issue Jun 6, 2018

Random and pseudo-random generation of portable packed SIMD vector types #497

Closed

This was referenced Jun 25, 2018

add rotate_left/right rust-lang/stdarch#496

Closed

switch gen_biguint to fill_bytes rust-num/num-bigint#53

Merged

dhardy closed this as completed Jul 2, 2018

dhardy added this to the 0.6 release milestone Aug 23, 2018

Investigate SIMD support #377

Investigate SIMD support #377

Comments

pitdicker commented Apr 5, 2018

TheIronBorn commented Apr 5, 2018 via email

TheIronBorn commented Apr 5, 2018

TheIronBorn commented Apr 6, 2018 • edited Loading

pitdicker commented Apr 6, 2018

TheIronBorn commented Apr 6, 2018

TheIronBorn commented Apr 6, 2018

TheIronBorn commented Apr 6, 2018

dhardy commented Apr 6, 2018 • edited Loading

pitdicker commented Apr 6, 2018 • edited Loading

pitdicker commented Apr 6, 2018

dhardy commented Apr 6, 2018

TheIronBorn commented Apr 6, 2018 • edited Loading

TheIronBorn commented Apr 7, 2018 • edited Loading

pitdicker commented Apr 7, 2018 • edited Loading

dhardy commented Apr 8, 2018

pitdicker commented Apr 8, 2018

TheIronBorn commented Apr 10, 2018 • edited Loading

rohitjoshi commented Apr 20, 2018

vks commented Apr 22, 2018

TheIronBorn commented Apr 24, 2018 • edited Loading

TheIronBorn commented Apr 24, 2018

TheIronBorn commented Apr 25, 2018

pitdicker commented Apr 27, 2018

vks commented Apr 30, 2018

TheIronBorn commented Apr 30, 2018 • edited Loading

TheIronBorn commented Apr 30, 2018

TheIronBorn commented Apr 30, 2018

gnzlbg commented Jun 6, 2018 • edited Loading

TheIronBorn commented Jun 21, 2018 • edited Loading

Functionalities

Possible?

pitdicker commented Jun 22, 2018

TheIronBorn commented Jun 22, 2018

TheIronBorn commented Jun 22, 2018

TheIronBorn commented Jun 22, 2018 • edited Loading

TheIronBorn commented Jun 22, 2018 • edited Loading

gnzlbg commented Jun 23, 2018

TheIronBorn commented Jun 30, 2018 • edited Loading

dhardy commented Jun 30, 2018

vks commented Jun 30, 2018

TheIronBorn commented Jun 30, 2018 • edited Loading

Correlation

Other approaches

TheIronBorn commented Jun 30, 2018 • edited Loading

A few other notes:

TheIronBorn commented Jul 1, 2018 • edited Loading

dhardy commented Jul 1, 2018

dhardy commented Jul 1, 2018

TheIronBorn commented Jul 1, 2018 • edited Loading

pitdicker commented Jul 1, 2018

TheIronBorn commented Jul 1, 2018

dhardy commented Jul 2, 2018

TheIronBorn commented Jul 2, 2018 • edited Loading

TheIronBorn commented Jul 2, 2018

dhardy commented Jul 2, 2018

dhardy commented Jul 2, 2018

gnzlbg commented Jul 2, 2018

martinothamar commented Jul 26, 2023

TheIronBorn commented Apr 6, 2018 •

edited

Loading

dhardy commented Apr 6, 2018 •

edited

Loading

pitdicker commented Apr 6, 2018 •

edited

Loading

TheIronBorn commented Apr 6, 2018 •

edited

Loading

TheIronBorn commented Apr 7, 2018 •

edited

Loading

pitdicker commented Apr 7, 2018 •

edited

Loading

TheIronBorn commented Apr 10, 2018 •

edited

Loading

TheIronBorn commented Apr 24, 2018 •

edited

Loading

TheIronBorn commented Apr 30, 2018 •

edited

Loading

gnzlbg commented Jun 6, 2018 •

edited

Loading

TheIronBorn commented Jun 21, 2018 •

edited

Loading

TheIronBorn commented Jun 22, 2018 •

edited

Loading

TheIronBorn commented Jun 22, 2018 •

edited

Loading

TheIronBorn commented Jun 30, 2018 •

edited

Loading

TheIronBorn commented Jun 30, 2018 •

edited

Loading

TheIronBorn commented Jun 30, 2018 •

edited

Loading

TheIronBorn commented Jul 1, 2018 •

edited

Loading

TheIronBorn commented Jul 1, 2018 •

edited

Loading

TheIronBorn commented Jul 2, 2018 •

edited

Loading