switch to std::simd, expand SIMD & docs #1239
Conversation
Since std::simd uses the same LLVM API as packed_simd, I doubt this will change benchmarks at all, but I'll try to run some later.
Turns out we don't have any SIMD benchmarks.
Good riddance, fragile packed_simd! Let's hope std SIMD is better.
This isn't a full review. Thanks @TheIronBorn.
I agree that introducing seemingly unpredictable type inference is bad. An extra trait or something would also make documentation easier. The Bernoulli stuff should perhaps wait for #1227.
Force-pushed from acd5020 to d41a948:
- move __m128i to stable, expand documentation, add SIMD to Bernoulli, add maskNxM, add __m512i
- genericize simd uniform int
- remove some debug stuff
- remove bernoulli
- foo
- foo
Removed the Bernoulli stuff and squashed some commits. Ran some benchmarks for 128/256 bit vectors and they're mostly the same, except for small optimizations packed_simd had but std::simd isn't prioritizing. Maybe LLVM will optimize it better in the future.
src/distributions/other.rs (outdated)
fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> Mask<T, LANES> {
    rng.gen().lanes_lt(Simd::default())
}
Doesn't this compute to false in all lanes?
And, regarding the preceding doc... doesn't this make the Mask type inappropriate if normally only one bit from each lane is used? Ugh, guess we can't fix that.
BTW one can assume that a random boolean is true with 50% probability, but should one actually assume anything about what a random Mask is?
Maybe instead we should provide a higher-level API around select etc... though given that SIMD is a little niche, maybe just a few recipes in the book would be enough.
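For what it's worth, a "book recipe" along those lines might look roughly like this. This is only a sketch, not code from the PR: it assumes nightly Rust with portable SIMD, rand built with the simd_support feature, and the Distribution<Mask<..>> impl added here; the portable-SIMD method names have churned on nightly.

```rust
#![feature(portable_simd)]
use core::simd::{f32x4, mask32x4};
use rand::Rng;

/// Blend two vectors lane-wise at random: each output lane comes from
/// `a` or `b` with 50% probability, like a per-lane `rng.gen::<bool>()`.
fn random_blend<R: Rng + ?Sized>(rng: &mut R, a: f32x4, b: f32x4) -> f32x4 {
    // relies on the Distribution<Mask<..>> impl discussed in this PR
    let m: mask32x4 = rng.gen();
    m.select(a, b)
}

fn main() {
    let mut rng = rand::thread_rng();
    let a = f32x4::splat(1.0);
    let b = f32x4::splat(-1.0);
    println!("{:?}", random_blend(&mut rng, a, b));
}
```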
MaskElement must be a signed integer, so we have some form of iNxM.lt(0).
There are other SIMD operations which use all the bits of a mask, like an SSE2 select (x & m) | (y & !m), and std::simd doesn't specify layout or value representation, so we have to do this.
Correct, we can't assume anything about Mask value representation, but I don't know what users might need that for. Is there a specific use-case example you're thinking of?
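For readers following along, the point about signed elements and full-width masks can be illustrated with a scalar analogue. This is explanatory code, not code from the PR:

```rust
// Each lane of a uniformly random signed integer has a fair-coin sign bit,
// so comparing against zero yields a 50/50 result per lane.
fn scalar_mask(random_lane: i32) -> i32 {
    // all-ones (-1) if the sign bit is set, all-zeros otherwise:
    // the full-width lane value a bitwise select needs
    if random_lane < 0 { -1 } else { 0 }
}

// SSE2-style select: only correct when every bit of `m` within a lane agrees,
// which is why the value representation of a Mask matters here.
fn bitwise_select(x: i32, y: i32, m: i32) -> i32 {
    (x & m) | (y & !m)
}

fn main() {
    assert_eq!(bitwise_select(7, 9, scalar_mask(-5)), 7);
    assert_eq!(bitwise_select(7, 9, scalar_mask(5)), 9);
}
```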
No specific use-case; I was just going by what you mentioned (select). Probably this is unnecessary to get into, however.
Realized that the …
Any flags depending on …
Anything left to review here?
Oh there's a weird new failure. I don't understand why it seems to be enabling …
src/distributions/integer.rs (outdated)
#[cfg(feature = "simd_support")]
macro_rules! simd_impl {
    ($(($intrinsic:ident, $vec:ty),)+) => {$(
macro_rules! intrinsic_impl {
Since this is an x86 only impl, maybe the name should reflect that?
Agreed
src/distributions/other.rs (outdated)
/// Note that on some hardware like x86/64 mask operations like [`_mm_blendv_epi8`]
/// only care about a single bit. This means that you could use uniform random bits
/// directly:
///
/// ```ignore
/// // this may be faster...
/// let x = unsafe { _mm_blendv_epi8(a.into(), b.into(), rng.gen::<__m128i>()) };
///
/// // ...than this
/// let x = rng.gen::<mask8x16>().select(b, a);
/// ```
///
/// Since most bits are unused you could also generate only as many bits as you need.
///
/// [`_mm_blendv_epi8`]: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_blendv_epi8&ig_expand=514/
/// [`simd_support`]: https://github.com/rust-random/rand#crate-features
#[cfg(feature = "simd_support")]
impl<T, const LANES: usize> Distribution<Mask<T, LANES>> for Standard
where
    T: MaskElement + PartialOrd + SimdElement<Mask = T> + Default,
    LaneCount<LANES>: SupportedLaneCount,
    Standard: Distribution<Simd<T, LANES>>,
{
    #[inline]
    fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> Mask<T, LANES> {
        // `MaskElement` must be a signed integer, so this is equivalent
        // to the scalar `i32 < 0` method
        rng.gen().lanes_lt(Simd::default())
    }
}
From what I understand this is correct, but not necessarily the most efficient: only the high bit of each lane generated by rng.gen() is used. I'm not sure if this is actually worth worrying about, however.
I'm also just a little surprised it type-checks: it relies on only one type supporting lanes_lt(..) -> Mask<T, LANES>. Better to clarify with rng.gen::<T>().lanes_lt(T::default())?
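Presumably the intent is something like the following, with the intermediate vector type spelled out. A sketch only, using the same generic bounds and the lanes_lt method from the impl above, not a tested change:

```rust
#[inline]
fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> Mask<T, LANES> {
    // name the intermediate type instead of leaning on inference
    let v: Simd<T, LANES> = rng.gen();
    v.lanes_lt(Simd::default())
}
```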
The inefficiency can actually be worse in scalar code: mask8x16 wastes 7 bits per lane while bool wastes 31, and mask64x2 wastes 63. I'll look into more efficient methods but might just point readers to it in the docs.
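One possible "more efficient method" in that direction, purely as a sketch: draw 16 random bits and spread them across the 16 lanes, instead of drawing 128 bits and keeping only a sign bit per lane. The lanes_ne name follows the nightly API used in this PR, and whether this actually beats the simple approach would need benchmarking:

```rust
#![feature(portable_simd)]
use core::simd::u8x16;
use rand::Rng;

/// Build a mask8x16 from just 16 random bits instead of 128:
/// lane i of the result is true iff bit i of `bits` is set.
fn mask_from_u16<R: Rng + ?Sized>(rng: &mut R) -> core::simd::mask8x16 {
    let bits: u16 = rng.gen();
    // broadcast the low byte into lanes 0..8 and the high byte into lanes 8..16
    let mut bytes = [(bits & 0xff) as u8; 16];
    bytes[8..].fill((bits >> 8) as u8);
    let v = u8x16::from_array(bytes);
    // pick a different bit in each lane and test it
    let bit = u8x16::from_array([1, 2, 4, 8, 16, 32, 64, 128,
                                 1, 2, 4, 8, 16, 32, 64, 128]);
    (v & bit).lanes_ne(u8x16::splat(0))
}
```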
But the above comments aren't blockers, so I'll also approve this PR.
Ah, the tests are failing because the stdsimd API changed in a recent nightly. Shame, I hoped stdsimd would be more stable, though I suppose it's a different kind of instability. We could add some compilation conditions or wait until the changes reach stable.
I'm sure you know more than I do about the state of …
We could simply merge the packed_simd fix while we wait.
Can do. The only thing against it is that your migrations here will likely end up stale. Any idea on the time frame for stability or how much churn is likely to …
I don't know what I was thinking earlier, but since std::simd is only available on nightly we can just fix and merge right now.
Unfortunately there's no roadmap for stdsimd, but there is likely to be at least a little churn. Though we aren't using too many features, so I think we won't be affected much.
The only remaining failure is crossbeam.
Closes #1232. Should also fix #1162.
The new API is pretty easy and the new trait system made for some convenient generic implementations.
Changes:
- Moved __m128i & __m256i away from the simd_support feature
- Added __m512i and some internal AVX512 optimizations
- Removed the u8x2/i8x2 types due to lack of support in std::simd
- Added missing types (e.g. NonZero) to the list in the Standard type documentation
- Distribution<maskNxM> for Standard, behaves like rng.gen::<bool>
- Implemented Distribution<mask64xM> for Bernoulli, each lane uses the same prob

The trait system allows impls to adapt to future SIMD types and doesn't clutter the rendered documentation. Using it in more places would mean some code duplication though.
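To illustrate the kind of generic shape that last point refers to, here is a simplified sketch, not the PR's actual code (the real implementations can fill lanes more efficiently, and inside the crate this would be a trait impl rather than a free function):

```rust
#![feature(portable_simd)]
use core::simd::{LaneCount, Simd, SimdElement, SupportedLaneCount};
use rand::{distributions::{Distribution, Standard}, Rng};

// One generic function covers every element type and lane count that
// std::simd supports, so future vector widths need no new code and no
// extra documentation entries.
fn gen_simd<T, const LANES: usize, R>(rng: &mut R) -> Simd<T, LANES>
where
    T: SimdElement,
    LaneCount<LANES>: SupportedLaneCount,
    Standard: Distribution<T>,
    R: Rng + ?Sized,
{
    // fill each lane independently with a Standard sample
    Simd::from_array(core::array::from_fn(|_| rng.gen()))
}
```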
__m512i requires nightly Rust's stdsimd feature at the moment, so I put it under the simd_support feature, but nightly might make more sense.