Shl and Shr implementations should accept a (polymorphic?) scalar argument rather than a vector #225
If all lanes are the same then llvm does the expected thing and it "just works out".
If that's the only happy path, isn't it better to encode it in the typesystem?

```rust
struct RoundingDivide {
    to_add: u16x8,
    to_shift: u16x8,
}

impl RoundingDivide {
    // a conscientious programmer pre-splats their shift offsets, "for performance"...
    fn new(to_shift: u16) -> RoundingDivide {
        RoundingDivide {
            to_add: u16x8::splat(1 << (to_shift - 1)),
            to_shift: u16x8::splat(to_shift),
        }
    }

    // ...unknowingly risking a large, counterintuitive performance penalty here
    fn divide(&self, arg: u16x8) -> u16x8 {
        (arg + self.to_add) >> self.to_shift
    }
}
```
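For contrast, the "happy path" version keeps the shift amount scalar until the use site and splats it right at the shift. This is a minimal sketch on stable Rust, using a plain `[u16; 8]` in place of `u16x8` so it runs without nightly `std::simd`; the function name and array stand-in are illustrative, not the library's API:

```rust
// Sketch: keep the shift offset as a scalar and derive the rounding
// bias at the use site, so the compiler always sees a uniform shift
// amount across all lanes.
fn rounding_divide(arg: [u16; 8], to_shift: u16) -> [u16; 8] {
    let to_add = 1u16 << (to_shift - 1); // rounding bias, splatted implicitly below
    let mut out = [0u16; 8];
    for i in 0..8 {
        out[i] = (arg[i] + to_add) >> to_shift;
    }
    out
}
```

With a real vector type, the loop body would be a splat plus one vector add and one vector shift; the point is that the uniform shift amount is visible at the shift itself.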
Well my thoughts were about like this:
That said, now that you mention it, having a splatted simd value that llvm can't backtrack through to see that all the lanes hold the same value, thus missing the optimization, does really suck. So yeah, now I agree that shift by scalar should be provided. If only rustdoc were able to be better about trait impl bloat on pages.
It may be desirable from a usability perspective, but shifting by a vector is not strictly unsupported on other platforms--it's implemented with blends, such as pshufb on x86-64.
That's exactly what was just discussed. However, does llvm produce the good code when you didn't immediately splat the value? If instead you've passed in a simd value that was previously splatted elsewhere, perhaps during some long-ago constructor operation?
Sorry--what I am saying isn't just that LLVM can optimize it, it's that LLVM doesn't even have a concept of "shift a vector by a scalar". All vector shifts are done by other vectors, and optimized when possible.
Sure. That makes sense. However, if we offer a scalar shift to the programmer and the library just splats the value internally, that is much more likely to keep the user on the happy path where we're confident that llvm will do the optimization. Otherwise it's easy for the user to write code like the example: code where they think they've helpfully "pre-splatted" the value, saving work at the use site, when instead they've accidentally tricked llvm out of seeing an optimization. And the only cost to having scalar shifts that splat internally is a small docs cost (and we can maybe ask the rustdoc folks to help fix that somehow).
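The shape of such an internally-splatting scalar shift can be sketched on stable Rust with a toy newtype; `U16x8` and its loop body here are stand-ins, not the actual `std::simd` implementation:

```rust
use std::ops::Shl;

// Toy stand-in for a u16x8 vector type (illustrative only).
#[derive(Clone, Copy, PartialEq, Debug)]
struct U16x8([u16; 8]);

impl Shl<u16> for U16x8 {
    type Output = U16x8;
    // The library takes a scalar and "splats" it internally, right at
    // the shift, so the backend always sees a uniform shift amount
    // and the user never holds a pre-splatted vector of offsets.
    fn shl(self, amount: u16) -> U16x8 {
        let mut out = self.0;
        for lane in out.iter_mut() {
            *lane <<= amount;
        }
        U16x8(out)
    }
}
```

A user then writes `v << 3`, and the uniform-lanes fact is structural rather than something LLVM has to rediscover by tracing dataflow.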
@calebzulawski: You make a good point that removing the shift-by-vector API would make some types of shift more difficult to implement.

```rust
// this should compile to two shifts and a blend
fn interleaved_shift(x: u16x8, ev: u16, od: u16) -> u16x8 {
    x << u16x8::from_array([ev, od, ev, od, ev, od, ev, od])
}
```

On x86-64, is there any general implementation of …
Oh, to be clear, I don't support removing the vector shift at all. I just think we should add scalar shifting without removing anything.
Has there been any previous discussion of this library's portability goals? Speaking as a user, if …
Offering only operations which are sure to be fast on all devices is basically not an option; there would be gaps all over. Even if we pretend there are just three types of device (wasm, neon, and avx2, for example), taking only the intersection of those three would still leave plenty of unpleasant gaps. So we're resigned from the start to offering a good api first, with hardware considerations coming second.
To add onto that, we never make the assumption that any particular operation will compile to a single instruction. In some cases, like this one, there are operations which may not be a single instruction, but still benefit from the compiler's ability to produce the optimal implementation.
The goal is really that, when used naively for a program including a random assortment of mathematical operations, the resulting code still performs reasonably. But I think there is a generous assumption that "compiles to a single instruction" means "executes in O(1) time". That's not actually the case: there are SIMD instructions that execute in ~O(n) time on some (or even all) microarchitectures. Now, obviously, no one expects VGATHER*PS to be fast, but it is also particularly common for early AVX processors to have the trait that the VEX-prefixed op (so, using ymm registers) takes twice as much time, meaning no actual gain over xmm, even on common arithmetic ops. But, of course, if the compiler selects that "slow" operation for some code, the same code accelerates on later microarchitectures.

Meanwhile, a JIT compiler like the one used for dispatching WebAssembly instructions has intimate knowledge of the processor at runtime, so it can in theory always choose the code that is fastest now, but it has to make that instruction selection quickly. So you can see, even from that simple example, we have different design pressures for improving performance in most cases, and thus our definition of "portability" is sculpted to be different from WebAssembly SIMD's.
Is there any consensus here regarding whether scalar shifts will be added? Although, at a minimum, shouldn't only the vector's own element type be accepted as the shift amount?
The operation "shift with a different offset in each lane" is not portable: it's supported by NEON, unsupported by SSE 4.1 and Wasm, and only partially supported by AVX2.

It would probably make more sense for types like `Simd<i32, LANES>` to implement `Shl<i32>`, rather than `Shl<Simd<i32, LANES>>`.

Scalar integer types currently implement `Shl` and `Shr` for all other integer types, so that might be worth considering. One downside is that it would make the `struct Simd` rustdoc page even longer.