Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate SIMD support #377

Closed
pitdicker opened this issue Apr 5, 2018 · 70 comments
Closed

Investigate SIMD support #377

pitdicker opened this issue Apr 5, 2018 · 70 comments
Milestone

Comments

@pitdicker
Copy link
Contributor

Disclaimer: I don't really know what I am talking about here.

The first step, as far as Rand is concerned, is to generate a small array of values from an RNG. Some RNG's may lend itself well for this, like ChaCha, and possibly HC-128, although that one does not use SIMD. Other options are packing together multiple small RNGs, like Xorshift+ or Xoroshiro+. The result type of such an RNG should be some array, so maybe we can make use of the BlockRngCore trait?

Making use of the generated values can be a more interesting problem. This involves the distributions code.

Converting to floating point: the Standard uniform distribution should not cause any problems, and neither should the range code.

I suppose the other distributions are going to give trouble, at least anything that needs branches or a look-up table (i.e. the current ziggurat code) is not going to work well. And what should happen when an algorithm needs to do rejections sampling, and one of a couple of SIMD values get rejected? Return NaN? For most distributions there should exist an algorithm suitable for SIMD. The Box-Muller transform for example looks promising for the Normal distribution, although it would need a library to provide functions like sin and cos.

Integer ranges: I don't expect the widening multiply method to work for integer range reduction, and division or modulus also do not seem to be available. And I suppose we have no way to indicate a value is rejected because it falls outside the acceptable zone and can cause bias.

Now I don't know how far we should go, and if Rand should provide any implementations at all. But I think we should make sure the API does not prevent Rand from being used in this area.

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Apr 5, 2018 via email

@TheIronBorn
Copy link
Collaborator

I think this might give us some direction https://github.com/lemire/SIMDxorshift, specifically the avx_randombound_epu32 function: https://github.com/lemire/SIMDxorshift/blob/1766a98f4be5e1d2e78d9e2c3623800fbf92931f/src/simdxorshift128plus.c#L90

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Apr 6, 2018

Also of note: for multi-rngs we could use several sets of constants at once. More or less alternating between Xorshift64(10, 13, 10), Xorshift64(8, 9, 22), Xorshift64(2, 7, 3), and Xorshift64(23, 3, 24) for example. Might help raise statistical quality.

@pitdicker
Copy link
Contributor Author

Great if you would want to help out a bit here!

I believe @vks has also been exploring this, for example in https://github.com/vks/xoroshiro/blob/master/src/rng/xoroshiro128simd.rs

I think this might give us some direction lemire/SIMDxorshift

👍 I can't read those function calls yet, but he is doing two widening multplies, even and odd, and keeping the high parts? And without checking against low they will be biased, but this is not always a problem, especially if the range is small.

Also of note: for multi-rngs we could use several sets of constants at once.

We could, and that seems like a good idea. But just to make sure, that doesn't mean all four (/eight?) can use the same seed.

@TheIronBorn Would you be interested in exploring what it would take to have a Distribution take the output of such an RNG, and transforming it wile staying in the SIMD area?

Converting it to a float, or a floating point range are the simplest methods.

@TheIronBorn
Copy link
Collaborator

We could, and that seems like a good idea. But just to make sure, that doesn't mean all four (/eight?) can use the same seed.

Are you asking if such an implementation would require each lane to have the same value upon seeding?

@TheIronBorn
Copy link
Collaborator

Float gen is pretty simple. Here's a quick gist https://gist.github.com/TheIronBorn/1ba4da07d6e8e05ab20cb780c383dbec

@TheIronBorn
Copy link
Collaborator

Added range sampling

@dhardy
Copy link
Member

dhardy commented Apr 6, 2018

Speeding up block RNGs with SIMD sounds reasonable; anything else sounds application-specific IMO.

Distribution::sample(..) -> T so in theory the trait can support SIMD distributions, given native consumers of the SIMD types. I guess it should be possible to make adapter implementations, i.e. automatically implement Distribution<(f64, f64)> for any type supporting Distribution<f64x2>.

For anything else, e.g. attempting to produce Normal samples in blocks for individual consumption, I guess something like the MutDistribution or CachedDistribution (i.e. equivalent to the old Sample trait) is needed to allow caching — but this significantly increases complexity and I'm not sure it would even be faster in many cases.

One specific case might be worth a specific solution: StandardBlockRng<T>. It may be possible to make a wrapper around any BlockRngCore which immediately applies the Standard distribution to convert its results buffer to an array of type T, then draws values from this buffer. In theory this might allow some optimisation between the RNG (generation) and the distribution (conversion) steps, although probably whatever loop the RNG step requires to fill the buffer would prevent this. This has the added advantages of a common design with & without SIMD and that it can simply implement Iterator<Item=T>.

@pitdicker
Copy link
Contributor Author

pitdicker commented Apr 6, 2018

Thank you for code @TheIronBorn, something to look into. Can I also ask you to create a branch in your fork of rand with the changes? Or did you have another way set up to test things?

Speeding up block RNGs with SIMD sounds reasonable; anything else sounds application-specific IMO.

When I was exploring papers around RNGs and related stuff GPU processing was the current trend, and exploring the possibilities of SIMD also seems to come up regularly. As it is a topic we did not give any attention until know, at least good to explore a bit...

But I have no idea yet how useful it will really be. Especially for some distributions it may be hard for SIMD to provide any benefit over the traditional methods.

One thing that seems like an obvious problem is that Distribution takes an Rng, but Rng exposes no methods to give something like next_u32x4. I think a lot can be solved by adding another method and associated type to RngCore. This type would indicate the native value produced by this RNG. It is something I thought that would be useful for months, but never spoke about...

So our current Xorshift implementation natively generates u32s, Xoroshiro128+ u64s, ChaCha [u32; 16], etc. With an addition like this is is possible to implement RNGs for specific types not covered by the trait. For example one specialized for u16 (maybe for embedded?), for SIMD with a type like next_u32x4, or one designed for floats with f64. I created an incomplete branch with changes. @dhardy Please don't shoot it down instantly...

@pitdicker
Copy link
Contributor Author

For anything else, e.g. attempting to produce Normal samples in blocks for individual consumption, I guess something like the MutDistribution or CachedDistribution (i.e. equivalent to the old Sample trait) is needed to allow caching

Why would we need caching? To return one value at a time instead of a SIMD type? Or did you write this because the smaller number of sampling algorithms may need some mutable state?

@dhardy
Copy link
Member

dhardy commented Apr 6, 2018

Caching would just be to return one value at a time, yes; the trouble with that approach is that there is significant overhead storing/retrieving the cached values.

@pitdicker your branch looks interesting, though you could probably just do RngCore: Iterator instead. Also what would the OsRng implementation look like?

The BlockRngCore::Results return values are probably going to be too big for many uses and there is no provided way of breaking the block up into smaller segments; this can be worked around however by choosing an RNG with Item type of the correct size.

There is a significant reason I don't want to add an associated type to RngCore: it is no longer object-safe without specifying the type (the same problem we hit in #307). It might however make sense to use a completely separate trait, possibly just Iterator, and then just implement SIMD distributions as simple functions usable via Iterator::map — this is not well integrated however.

But I hold to my previous point: other than a few cases where SIMD can fit into the existing API (e.g. ChaCha core) I see little point trying to develop the infrastructure without at least one (preferably several) application able to consume SIMD values.

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Apr 6, 2018

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Apr 7, 2018

Blending sets of constants for Xorshift seems to only optimize well for machines later than somewhere around Haswell: https://godbolt.org/g/G2x1E2. (Adapted from O'Neill's Xorshift library) I don't have such a machine so I can't confirm actual speed.

Removing the multiply helps.

@pitdicker
Copy link
Contributor Author

pitdicker commented Apr 7, 2018

There is a significant reason I don't want to add an associated type to RngCore: it is no longer object-safe without specifying the type

I realized just after posting. So it was a bad idea from me, feel free to shoot it down 😄.

Here's my branch TheIronBorn/rand:src/simd@simd

Nice, I have been playing with it. And SFC seems a pretty good choice here.

I think I have found a way to support SIMD without extra traits like SimdRng and SimdDistribution. The idea is that RngCore::fill_bytes does a simple copy of its result (like u32x4) to the byte slice if they have exactly the same length. And the integer distributions for Standard do the opposite: they initialize an empty u32x4, and pass it to fill_bytes as a byte slice. The optimizer is able to work things out.

I don't think the BlockRng wrapper is going to work well in combination with a simple RNG adapted for SIMD. I expect the bookkeeping to take as much time as just generating a new batch of values. So we may just as well always generate a new batch, and use as much as needed.

I have a proof of concept in https://github.com/pitdicker/rand/commits/simd_support. The code in sfc.rs is very bad and ugly. I am not sure using BlockRngCore in there is useful. But the trick with using fill_bytes seems to work.

But I hold to my previous point: other than a few cases where SIMD can fit into the existing API (e.g. ChaCha core) I see little point trying to develop the infrastructure without at least one (preferably several) application able to consume SIMD values.

I agree we shouldn't turn everything upside down only to support SIMD, especially when we are not sure how useful it will be.

Does implementing Standard and Range for SIMD types fall for you in the category 'develop infrastructure' or 'extension of existing functionality'? Without tests it takes about 125 extra loc: pitdicker/rand@c03deb6...pitdicker:6918e11. The first commit is a necessary change in rand_core to get the inlining right for fill_bytes, and I think that is something that should be in rand_core 0.1. The rest is definitely not interesting until after rand 0.5.

@dhardy
Copy link
Member

dhardy commented Apr 8, 2018

Nice, I didn't expect fill_bytes to work so well!

Implementing existing traits/distributions is fine — well, we have to look at how much code gets added, but probably fine.

@pitdicker
Copy link
Contributor Author

@TheIronBorn I have tried to clean up my branch a bit, removed some unnecessary unsafe code etc. Do you want to have a look and maybe experiment from there?

I have tried to make the fill_bytes method on Sfc32x* in such a way that just about all of it is optimized out when the target is a SIMD type, but to also support filling arbitrary slices, as it supposed to be able to do. One thing missing is byte-swapping on big-endian architectures. This can probably be done pretty efficiently with SIMD, do you know a method?

One nice benchmark on my PC: Sfc32x4Rng can generate f32 floating point values at 10GB/s, 4x as fast as the current Xorshift. And we would have to get very creative with supporting Distribution on tuples or slices for even Xoroshiro128+ to get 5GB/s (might be worth the experiment though).

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Apr 10, 2018

You could probably use simd_shuffle for btye-swapping. It's not available via #![feature(stdsimd)] though. I suppose the better thing to do is implement it at stdsimd.

@rohitjoshi
Copy link

Recently I switch to SIMD enabled Fast Mersenne Twister and could see improvement compared to mersenne_twister

@vks
Copy link
Collaborator

vks commented Apr 22, 2018

If you care about performance, you should probably not use the Mersenne Twister, it is a lot slower than more modern RNGs like PCG and xoroshiro, while having worse statistical quality.

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Apr 24, 2018

I just pushed an update to my repo which implements Box-Muller TheIronBorn@f0b2f38.

distr_normal          ... bench:  33,329 ns/iter (+/- 53,512) = 240 MB/s
distr_normal_Sfc32x4  ... bench:  92,156 ns/iter (+/- 58,657) = 347 MB/s

The performance probably suffers significantly as my machine doesn't have SIMD FMA instructions.

The API could be better and the math could probably be optimized, but it seems to work fine.

@TheIronBorn
Copy link
Collaborator

I also just pushed some helper functions which should make implementing a SIMD RNG easier TheIronBorn@8da1cd6

@TheIronBorn
Copy link
Collaborator

A nice example of what SIMD RNGs can do for you: Monte Carlo estimation of pi. https://gist.github.com/TheIronBorn/ba1f1e39b6466353b1eee6cf450fe247

test scalar ... bench:  26,424,886 ns/iter (+/- 10,327,348) = 605 MB/s
test simd   ... bench:  28,363,059 ns/iter (+/- 48,045,043) = 2256 MB/s

@pitdicker
Copy link
Contributor Author

I just pushed an update to my repo which implements Box-Muller

Wow, respect for the math implementations there! Did you use methods similar to those in Cephes library, or another source? We really need something like that to be a crate in Rust.

The performance is not super impressive (it has to approximate sin and cos after all)...
I wonder if this is a situation where SIMD really makes sense. I'll have a closer look at your code later.

@vks
Copy link
Collaborator

vks commented Apr 30, 2018

Did you use methods similar to those in Cephes library, or another source? We really need something like that to be a crate in Rust.

I'm not sure what you are looking for, but there are bindings to cephes, special functions implemented in Rust and special functions implemented for statrs.

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Apr 30, 2018

Implementing Box-Muller and other complex stuff with SIMD requires complex math like sin/cos/log. For most machines those instructions aren't available, but it is still possible to have fast functions modeled after something like Cephes library. I don't know of any such Rust implementations.

I don't think most of those intrinsics are even in stdsimd yet: rust-lang/stdarch#310

Sorry I didn't read that right. Those libraries look useful thanks

@TheIronBorn
Copy link
Collaborator

Most of the my math impls come from https://github.com/miloyip/normaldist-benchmark

@TheIronBorn
Copy link
Collaborator

Also of note: the performance should be better on Haswell and above (I have only Sandy Bridge) as the math functions make heavy use of SIMD FMA operations.

@gnzlbg
Copy link

gnzlbg commented Jun 6, 2018

To showcase the portable packed SIMD vector facilities in std::simd I was porting the Ambient Occlusion ray casting benchmark (aobench) from ISPC to Rust: https://github.com/gnzlbg/aobench where I noticed that pseudo-random number generation was a bottleneck of the Rust version (using the rand crate).

The scalar version of the benchmark needs to generate random f32s, and the vectorized version of the benchmark needs to generate pseudo-random SIMD vectors of f32s. I've switched from the rand crate to a hacked-in pseudo-random number generator for the scalar version (src here), and explicitly vectorizing it for generating SIMD vectors (src here). I've used the same algorithm for both versions since the point of the benchmark was to test the portable SIMD facilities in std::simd.

I've added some benchmarks (src here):

  • scalar (Xeon E5-2690 v4 @ 2.60GHz): throughput: 174*10^6 f32/s, 5.7ns per function call

  • vector (Xeon E5-2690 v4 @ 2.60GHz): throughput: 2072*10^6 f32/s (12x larger), 3.8ns per function call (generates one f32x8 per call)

  • scalar (Intel Core i5 @1.8 GHz): throughput: 190*10^6 f32/s, 5.2ns per function call

  • vector (Intel Core i5 @1.8 GHz)): throughput: 673*10^6 f32/s (3.5x larger), 11.9ns per function call (generates one f32x8 per call)

These numbers do not make much sense to me (feel free to investigate further), but they hint that explicitly vectorized PRNG might make sense in some cases. For example, if my intent was to populate a large vector of f32s with pseudo-random numbers, on my laptop the vector version has twice the latency but still 3.5x higher throughput.

It would be cool if some of the pseudo-random number generators in the rand crate could be explicitly vectorized to generate random SIMD vectors.

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Jun 21, 2018

Functionalities

  • Helper functions for implementing RngCore functions via SIMD
    • SIMD fill_bytes_via_simd optimizations (need to look into this more, could it help scalar block RNGs?)
  • SIMD rejection sampling trait
  • UniformInt for integer vectors (with rejection sampling)
    • sample_single with leading zeros approx
    • SIMD wmul (could probably be better optimized. there are mull instructions for some lane widths, mulh for fewer)
    • optimizations for the gen_range(vec(0), vec(N)) case
  • shuffle (generates M indices at once for a uNxM vector, especially helpful for block rngs)
  • distributions
    • box muller
    • central limit theorem
    • char & alphanumeric
    • boolean, Bernoulli and Standard
  • lots of SIMD parallelized PRNGs (few are worthwhile)
  • various convenience stuff: SIMD rotate_left/right, AsByteSliceMut

Possible?

  • seq module (probably at least sample_indices_inplace)
  • WeightedChoice (via SIMD binary search, perhaps even linear?)
  • Cauchy (tangent might be slow on older hardware)
  • Binomial, Poisson, & Gamma via some complicated rejection sampling

I doubt large lookup tables (ziggurat) will ever be viable (though perhaps with VGATHER instructions?). Small tables (like Alphanumeric) are possible with a couple comparisons or SIMD vector shuffling (perfect for random hex characters).

(inlining throughout my branch is inconsistent and might be doing more harm than good)

Edit: the gen_below optimizations, bool & char distrs, other PRNGs are local and not available in my branch

@pitdicker
Copy link
Contributor Author

Quite a list already!

  • SIMD fill_bytes_via_simd optimizations (need to look into this more, could it help scalar block RNGs?)

Yes, I think this one is a good start for a PR. I spend a lot of effort optimizing the scalar fill_bytes implementations, so that may not be worth the energy (but feel free to try 😄).

  • SIMD rejection sampling trait

This one will probably be difficult to merge, but can you describe it some more?

And did you somewhere implement the endianness conversion functions? That is somewhat of a requirement for SIMD RNGs.

@TheIronBorn
Copy link
Collaborator

The rejection sampling trait is available in my branch. It allows a user to pass a comparison function similar to sort_by, and a distribution to sample from. Returns only once each lane of the SIMD mask returned by the comparator is true.

I didn’t implement endianness conversion. Thinking on it now it shouldn’t be too difficult to write.

@TheIronBorn
Copy link
Collaborator

I added swap_bytes using stdsimd's intrinsics: TheIronBorn@3c6d9d4
We should switch to the safe portable shuffles when they land.

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Jun 22, 2018

fill_bytes_via_simd for scalar RNGs doesn't seem worthwhile

test gen_bytes_hc128       ... bench:     976,078 ns/iter (+/- 717,525) = 1049 MB/s
test gen_bytes_hc128_simd  ... bench:   1,147,668 ns/iter (+/- 1,017,842) = 892 MB/s

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Jun 22, 2018

I managed to get equal performance, but it seems like store_aligned doesn't give enough speedup (maybe for very new hardware?).

@gnzlbg
Copy link

gnzlbg commented Jun 23, 2018

it seems like store_aligned doesn't give enough speedup (maybe for very new hardware?).

The newer the hardware the faster unaligned loads are. For modern hardware, they are pretty much as fast as aligned loads. For older hardware, unaligned loads are slower, but depending on how much work you do per load, the speed-up you might get from switching from unaligned loads to aligned ones might be minimal.

various convenience stuff: SIMD rotate_left/right

I think we should add these to stdsimd, feel free to open a PR there and I'll review these.

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Jun 30, 2018

Should we consider offering an SIMD PRNG? Concerns are portability, quality, correlation between streams/lanes, reproducibility(?).

@dhardy
Copy link
Member

dhardy commented Jun 30, 2018

IMO yes we should but as you say, there are several considerations. Of course the naive approach is to use multiple simple PRNGs in parallel (and this may also be the fastest), but ideally there would be some mixing between the different parts of state (this should extend cycle length at least).

Note that some generators compute with higher state sizes internally than they output, so SIMD may not always be appropriate (e.g. MCGs and LCGs may use 128-bit arithmetic to output 64-bit values).

However, first, we might look at whether any of the existing generators can have SIMD optimisations, and then look for existing designs for SIMD generators.

@vks
Copy link
Collaborator

vks commented Jun 30, 2018

Most modern SIMD designs also use AES-NI.

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Jun 30, 2018

Correlation

This paper is useful: Random Numbers for Parallel Computers: Requirements and Methods, With Emphasis on GPUs

It lists a couple methods:

  1. A different generator for each stream (i.e. different parameters)
    • viable unless the parameters are used in shifts/rotates (variable shift/rotates are only on newer hardware, mostly AVX512). In that case we could iterate through parameters. i.e. with 2-streams: (param1, param1), (param2, param2), ... (requires at least twice as much memory)
    • We could even iterate through more parameter sets than vector lanes.
  2. Splitting a single RNG into equally-spaced blocks (streams)
    • requires a PRNG that can easily compute seed "distance"
    • xorshift jumps, etc
  3. A single RNG with a “random” seed for each stream
    • probability that none of those streams overlap
      ≈ (1 - streams * stream_length / period) ^ (streams - 1)
  4. Counter-based RNGs
    • this is an excellent solution: jumping is trivial, we could even do it dynamically if needed.
    • probability p that the s random keys are not all distinct:
      1 - p ≈ streams^2 / period
    • AES or other hashes/ciphers. HighwayHash, CLHash I hear are fast with SIMD
      • Random123 has several: ARS-5/7 (non-cryptographic) might be the fastest around
    • We could adapt some of these techniques to the counters of other PRNGs like SFC https://sourceforge.net/p/pracrand/discussion/366935/thread/a7761577

Other approaches

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Jun 30, 2018

I published a crate with my PRNG tests: https://github.com/TheIronBorn/simd_prngs

Most are just parallelized scalar algorithms. I'm not aware of a decent method to mix state. Pseudo-Random Number Generators for Vector Processors and Multicore Processors might have something.

The only non-cryptographic PRNGs designed for SIMD I'm aware of are those of Random123.

  • AES in counter mode
    • small memory size (usually just 256 bits)
    • ARS-5/7 (non-cryptographic) of Random123
      • decently fast
    • some CSPRNGs which are faster than some scalar non-cryptographic PRNGs: Randen and fast-key-erasure AES
  • SFC
    • decently to very fast (even without rotate instructions)
    • PractRand has some SFC alternatives which may suit SIMD better than scalar (including VeryFast algs w/t counters)
  • JSF
    • similar to faster speed than SFC (unlike scalar, interesting)
    • lower period guarantees than SFC (2^20 for 32-bit, likely larger for 64-bit)
    • larger statespace than SFC
  • Xorshift & related
    • xoroshiro/xoshiro are merely decently fast on Sandy Bridge, likely because of their heavy use of rotates
    • xorshift128+ & xorshift128 are very fast. (using all bits)
    • which implies that when rotates are available xoroshiro/xoshiro could be very fast
  • XSM
    • decently fast
    • it has some nice jumping/transition features
    • multiple multiply parameters?
  • PCG
  • lfsr113/258
  • Multiply-with-carry
    • used by http://www.agner.org/random
    • uses multiple sets of parameters (iterating through 8 sets, 2 at a time)
    • they claim each parameter set individually passes BigCrush when a xorshift is added. (PractRand seems to disagree, but I may have implemented it incorrectly. Using @pitdicker's cat_rng.rs)
    • decently fast
    • small to large memory size

A few other notes:

  • This PRNG based on MumHash is pretty fast (might be good for counter-mode). Edit: 16 GB in PractRand.
  • Seeding an SIMD PRNG takes a lot of bits. We could try to speed it up or use fewer bits with things like SFC's seed_fast, xorshift jumps, or rand's seed_from_u64.

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Jul 1, 2018

Shuffling the 16 bytes of xorshift128+x2 according to some randomly shuffled unique indices actually bumps it from 128MB in PractRand to 64GB (binary rank failure)

let x = u8x16::from_bits(xorshift128plus_x2);
let indices = [9, 15, 0, 8, 3, 11, 6, 13, 4, 14, 10, 1, 7, 5, 2, 12]; // rng.shuffle(0..16)
let r: u8x16 = unsafe { simd_shuffle16(x, x, indices) };
return u64x2::from_bits(r);

Mixing state output might allow us to use a poorer, faster PRNG. (I did have to try several indices though).

I don't know what effect this might have on correlation.

@dhardy
Copy link
Member

dhardy commented Jul 1, 2018

You improved test performance just by shuffling the output? Interesting. I wonder if this just makes correlations harder to spot though? If PractRand is still seeing all the same output then literally the only change is the byte order. I think many of the tests are engineered to look for certain types of pattern, and some of those may not show up with the extra shuffle.

On the other hand I think shuffling the state could significantly improve the quality (cycle length could in theory be increased by the power of the number of streams), though in effect this produces a totally different generator, hence there could be problems and it might be preferable to use different constants. I'm not an expert on this; it would be much better to have input from someone who actually designs PRNGs.

@dhardy
Copy link
Member

dhardy commented Jul 1, 2018

Nice work @TheIronBorn assembling that collection of SIMD generators.

What's XSM? A pseudo-LCG?

It would be interesting to include 96-bit / 128-bit LCGs. Basically half of PCG, with a bit better performance:

test lcg32x2                 ... bench:       3,152 ns/iter (+/- 60) = 2598 MB/s
test lcg32x4                 ... bench:       3,161 ns/iter (+/- 64) = 5183 MB/s
test lcg32x8                 ... bench:       3,841 ns/iter (+/- 150) = 8531 MB/s

I'm not quite sure where this is going, but publishing some of these at least as part of Rand would be nice. I guess we should get back to #431.

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Jul 1, 2018

Could you make that LCG code available? Perhaps just a pull request on https://github.com/TheIronBorn/simd_prngs.

And here’s @pitdicker’s evaluation of XSM dhardy#60 (comment)

@pitdicker
Copy link
Contributor Author

What's XSM? A pseudo-LCG?

dhardy#60 (comment)

@TheIronBorn
Copy link
Collaborator

I managed to get xorshift128+_x4 to 4GB by swizzling the full state before the regular xorshift, but it's got 512 bits of state so that isn't too impressive

@dhardy
Copy link
Member

dhardy commented Jul 2, 2018

No, that's not impressive. It seems to me that multiplication or even addition by a constant would cause better bit-avalanche than bit-shift-and-XOR operations, but then Xorshift stays away from these for performance.

I'm not sure whether it's worth continuing down this path. It's essentially building a new PRNG using only automated tests for evaluation. But I'll have a go at building a mixing LCG/MCG if you like.

@TheIronBorn
Copy link
Collaborator

TheIronBorn commented Jul 2, 2018

Intel has a mixing LCG here https://software.intel.com/en-us/articles/fast-random-number-generator-on-the-intel-pentiumr-4-processor/. (Perhaps I'm reading it incorrectly and it's not actually mixing state)

My quick testing doesn't indicate it's very fast and they don't give many details on its quality.

@TheIronBorn
Copy link
Collaborator

I was only looking into the state mixing to try to find some way of achieving

ideally there would be some mixing between the different parts of state

Probably not the best approach to test mixing with a poorer quality generator like xorshift128+. Also we still won't have the math guarantees of the period/etc of the mixed generator.

@dhardy
Copy link
Member

dhardy commented Jul 2, 2018

I was just experimenting with a generator based on this. Unfortunately performance is abysmal using simd_shuffle16 etc (2-3GB/s) which means it's probably not worth further investigation.

@dhardy
Copy link
Member

dhardy commented Jul 2, 2018

In all honesty, I think it might be time to close this issue. Lets recap.

#523 landed support for SIMD types on the Standard and Uniform distributions (thanks especially to @pitdicker and @TheIronBorn!).

There has been some discussion of supporting SIMD types for other distributions, notably Normal and the Ziggurat algorithm, and possible alternatives (Central Limit Theorem).

There has been some discussion on adapting the RngCore type, though for now it looks like the current API might be sufficient.

There has been a lot of discussion on parallel/SIMD RNG implementations and @TheIronBorn has assembled a collection of SIMD generators. I remain sceptical about simply bolting together multiple small RNGs since there are several existing PRNGs with 128-bit output at least, and larger generators tend to have better quality than smaller ones, although for some applications multiple small generators may be faster and have sufficient quality.


At this point I think it makes sense to create smaller, more topical discussions on issues of continued interest.

I also suggest that when considering SIMD PRNGs we should try benchmarking with something more complex than micro-benchmarks where possible. @gnzlbg's Ambient Occlusion ray casting benchmark may be useful for this (though multiple applications would be preferred).

In fact, it may be worth opening an issue just to talk about benchmarking (SIMD as well as scalar PRNGs).

@dhardy dhardy closed this as completed Jul 2, 2018
@gnzlbg
Copy link

gnzlbg commented Jul 2, 2018

FYI the purpose of aobench is to compare Rust's code generation with that of ISPC, so I will keep the benchmark algorithmically/data-structure-wise as close as possible to the benchmark implementation in ISPC (see the volta subdirectory).

However, ISPC only provides one PRNG, and if we can beat it conveniently in Rust, that benchmark might be a good place to try that out and even report it. Just keep in mind what the original intent of the benchmark is, and if you submit a PR with a better PRNGs, put them behind a feature flag so that we can still use the benchmark in the lolbench suite to track Rust run-time performance regressions.

bors bot added a commit to rust-num/num-bigint that referenced this issue Jul 5, 2018
53: switch gen_biguint to fill_bytes r=cuviper a=TheIronBorn

Changes `gen_biguint` from a `push(gen::<u32>)` method to rand's [`fill_bytes`](https://docs.rs/rand/0.5.0/rand/trait.RngCore.html#tymethod.fill_bytes). This should improve performance in most cases.

- For small PRNGs which only natively generate 64 bits (like Xorshift64 or [`splitmix64.c`](http://prng.di.unimi.it/splitmix64.c)), this will no longer throw away half the bits generated. 
- For block PRNGs like `StdRng`, this should reduce overhead.
- For an SIMD PRNG (rust-random/rand#377), this would be a significant improvement.

```diff,ignore
 name         no_fill ns/iter  fill ns/iter  diff ns/iter   diff %  speedup 
+rand_1009    256              222                    -34  -13.28%   x 1.15 
+rand_131072  27,366           14,715             -12,651  -46.23%   x 1.86 
+rand_2048    459              357                   -102  -22.22%   x 1.29 
-rand_256     93               130                     37   39.78%   x 0.72 
+rand_4096    842              557                   -285  -33.85%   x 1.51 
-rand_64      69               92                      23   33.33%   x 0.75 
+rand_65536   13,625           7,382               -6,243  -45.82%   x 1.85 
+rand_8192    1,836            869                   -967  -52.67%   x 2.11
```
(i.e. `rand_1009` does `gen_biguint(1009)`. All benches are powers of two except `rand_1009`)

(Let me know if you want the `rand_` benches added)

Co-authored-by: TheIronBorn <>
Co-authored-by: Josh Stone <cuviper@gmail.com>
@dhardy dhardy added this to the 0.6 release milestone Aug 23, 2018
@martinothamar
Copy link

Hi! This topic is interesting to me, I have this pet project where I'm using the monte carlo method to simulate football matches. I've used this project as an excuse to learn more about SIMD and different topics. I also wanted to learn Rust so naturally I ended up here. I noticed that the current rand SIMD feature doesn't really vectorize random number generation, so I've ended up implementing some vectorized PRNGs here: https://github.com/martinothamar/simd-rand

I have some portable implementations which are using nightly portable SIMD, and some that are using SIMD intrinsics directly from std::arch. The fastest one is Xoshiro256+ for AVX512 which has reached ~50 GiB/s on my laptop when generating u64x8 vectors.

I thought I'd mention it here in case there is any interest in this still

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants