Shl and Shr implementations should accept a (polymorphic?) scalar argument rather than a vector #225

fleabitdev · 2022-01-14T11:17:40Z

The operation "shift with a different offset in each lane" is not portable: it's supported by NEON, unsupported by SSE 4.1 and Wasm, and only partially-supported by AVX2. It would probably make more sense for types like Simd<i32, LANES> to implement Shl<i32>, rather than Shl<Simd<i32, LANES>>.

Scalar integer types currently implement Shl and Shr for all other integer types, so that might be worth considering. One downside is that it would make the struct Simd rustdoc page even longer.

The text was updated successfully, but these errors were encountered:

Lokathor · 2022-01-14T15:21:28Z

If all lanes are the same then llvm does the expected thing and it "just works out".

fleabitdev · 2022-01-14T15:43:42Z

If that's the only happy path, isn't it better to encode it in the typesystem?

struct RoundingDivide {
    to_add: u16x8,
    to_shift: u16x8,
}

impl RoundingDivide {
    //a conscientious programmer pre-splats their shift offsets, "for performance"...
    fn new(to_shift: u16) -> RoundingDivide {
        RoundingDivide {
            to_add: u16x8::splat(1 << (to_shift - 1)),
            to_shift: u16x8::splat(to_shift)
        }
    }
    
    //...unknowingly risking a large, counterintuitive performance penalty here
    fn divide(&self, arg: u16x8) -> u16x8 {
        (arg + self.to_add) >> self.to_shift
    }
}

Lokathor · 2022-01-14T15:53:29Z

Well my thoughts were about like this:

If you need to shift each lane by the same amount, splatting that amount across a same-size simd value and then shifting by simd will work out right with llvm.
If you need to shift each lane by a separate value then you need to shift by a simd, and the fact that hardware doesn't support that does not actually matter because that's what you need to be doing so telling llvm "figure it out and do it for me already" is actually what you wanted anyway.

That said, now that you mention it, having a splatted simd value that llvm can't backtrack to understand that it's all lanes the same value, thus missing the optimization, does really suck.

So yeah, now I agree that shift by scalar should be provided.

If only rustdoc was able to be better about trait impl bloat on pages.

calebzulawski · 2022-01-14T16:50:22Z

It may be desirable from a usability perspective, but something like a << Simd::splat(b) is perfectly fine, and in fact this is the underlying LLVM that is produced. LLVM is capable of optimizing shifts when all lanes are the same value.

Shifting by a vector is not strictly unsupported on other platforms--it's implemented with blends, such as pshufb on x86-64.

Lokathor · 2022-01-14T16:59:38Z

That's exactly what was just discussed.

However, does llvm produce the good code when you didn't immediately splat the value? If instead you've passed in a simd value that was previously splatted elsewhere, perhaps during some long-ago constructor operation?

calebzulawski · 2022-01-14T17:03:41Z

Sorry--what I am saying isn't just that LLVM can optimize it, it's that LLVM doesn't even have a concept of "shift a vector by a scalar". All vector shifts are done by other vectors, and optimized when possible.

Lokathor · 2022-01-14T17:18:34Z

Sure. That makes sense.

However, if we offer a scalar shift to the programmer and then just internally the library splats the value anyway, that is much more likely to make the user stay on the happy path where we're confident that llvm will do the optimization.

Because otherwise it's easy for the user to write code like the example. Code where they think that they've helpfully "pre splatted" the value, thus saving work at the use site. However, instead they've just accidentally tricked llvm out of seeing an optimization.

And the only cost to having scalar shifts that splat internally is a small docs cost (and we can maybe ask the rustdoc folks to help fix that somehow).

fleabitdev · 2022-01-14T17:25:09Z

@calebzulawski: You make a good point that removing the shift-by-vector API would make some types of shift more difficult to implement.

//this should compile to two shifts and a blend
fn interleaved_shift(x: u16x8, ev: u16, od: u16) -> u16x8 {
    x << u16x8::from_array([ev, od, ev, od, ev, od, ev, od])
}

On x86-64, is there any general implementation of u16x8::shl(u16x8), accepting non-constant inputs, which you'd expect to be less than eight times slower than a constant shift?

Lokathor · 2022-01-14T17:27:41Z

Oh to be clear I don't support removing the vector shift at all.

I just think we should add scalar shifting without removing anything.

fleabitdev · 2022-01-14T17:48:17Z

Has there been any previous discussion of this library's portability goals?

Speaking as a user, if std::simd provides APIs which are an order of magnitude slower on x86-64, that seems like a serious footgun. However, I'm conscious that Wasm SIMD includes instructions like i8x16.bitmask which are quite slow on NEON (WebAssembly/simd#131). I'm not sure where the line is being drawn.

Lokathor · 2022-01-14T17:57:34Z

Offering operations which are only sure to be fast on all devices is basically not an option. There would be gaps all over. Even if we pretend there's just three types of device (wasm, neon, and avx2, for example), taking only the intersection of those three would still leave plenty of unpleasant gaps.

So we're resigned from the start to simply offering a good api first, and hardware consideration comes in second.

calebzulawski · 2022-01-14T18:14:15Z

To add onto that, we never make the assumption that any particular operation will compile to a single instruction. In some cases, like this one, there are operations which may not be a single instruction, but still benefit from the compiler's ability to produce the optimal implementation.

workingjubilee · 2022-01-17T06:42:14Z

The sense that std::simd is portable includes "portable to machines which don't implement vector operations at all".

The goal is really that when used naively for a program including a random assortment of mathematical operations, std::simd will offer a speedup when targeting a common vector-register-having architecture, and that ideally you can interlace it with architectural intrinsics or inline assembly to cover odd niches, and that sometimes it will scalarize an operation and That's Fine, Actually, for reasons Lokathor already noted (namely: you are stuck doing it). Making 80% of a calculation loop +300% as fast (numbers pulled out of a hat) still can be a +240% total speedup in the performance critical segment.

But I think there is a generous assumption that "compiles to a single instruction" means "executes in O(1) time". That's not actually the case. There are SIMD instructions that do execute in ~O(n) time on some (or even all) microarchitectures. Now, obviously, no one expects VGATHER*PS to be fast, but it is also particularly common for early AVX processors to have the trait of "the VEX-prefixed op (so, using ymm registers) takes twice as much time, so no actual gain over xmm", even on common arithmetic ops.

But, of course, if the compiler selects that "slow" operation for some code, that means the same code accelerates on later microarchitectures. Meanwhile, a JIT compiler like the one used for dispatching WebAssembly instructions will have intimate knowledge of processor at runtime, so can always in theory choose the code that is fastest now, but has to make that instruction selection quickly. So you can see, even from that simple example, we have different design pressures for improving performance in most cases, and thus our definition of "portability" is sculpted to be different than WebAssembly SIMD's version.

ghost · 2022-07-14T05:30:16Z

Is there any consensus here regarding whether Shl<Rhs = Int>/Shr<Rhs = Int> should be implemented for Simd?
(Would a PR be welcome?)

Although, at a minimum, shouldn't only Shl<Rhs = u8>/Shr<Rhs = u8> for Simd be strictly necessary, considering that the largest Simd-supported integer width is 64-bits? I consider Rust stdlib to be weird in this aspect. But of course, consistency ought to be maintained...

workingjubilee mentioned this issue Jan 17, 2022

Define "portable" and elaborate design constraints #227

Closed

calebzulawski mentioned this issue Jul 27, 2023

Add scalar shifts #360

Merged

calebzulawski closed this as completed in #360 Jul 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Shl and Shr implementations should accept a (polymorphic?) scalar argument rather than a vector #225

Shl and Shr implementations should accept a (polymorphic?) scalar argument rather than a vector #225

fleabitdev commented Jan 14, 2022

Lokathor commented Jan 14, 2022

fleabitdev commented Jan 14, 2022

Lokathor commented Jan 14, 2022

calebzulawski commented Jan 14, 2022

Lokathor commented Jan 14, 2022

calebzulawski commented Jan 14, 2022

Lokathor commented Jan 14, 2022

fleabitdev commented Jan 14, 2022

Lokathor commented Jan 14, 2022

fleabitdev commented Jan 14, 2022

Lokathor commented Jan 14, 2022

calebzulawski commented Jan 14, 2022

workingjubilee commented Jan 17, 2022 •

edited

Loading

ghost commented Jul 14, 2022 •

edited by ghost

Loading

Shl and Shr implementations should accept a (polymorphic?) scalar argument rather than a vector #225

Shl and Shr implementations should accept a (polymorphic?) scalar argument rather than a vector #225

Comments

fleabitdev commented Jan 14, 2022

Lokathor commented Jan 14, 2022

fleabitdev commented Jan 14, 2022

Lokathor commented Jan 14, 2022

calebzulawski commented Jan 14, 2022

Lokathor commented Jan 14, 2022

calebzulawski commented Jan 14, 2022

Lokathor commented Jan 14, 2022

fleabitdev commented Jan 14, 2022

Lokathor commented Jan 14, 2022

fleabitdev commented Jan 14, 2022

Lokathor commented Jan 14, 2022

calebzulawski commented Jan 14, 2022

workingjubilee commented Jan 17, 2022 • edited Loading

ghost commented Jul 14, 2022 • edited by ghost Loading

workingjubilee commented Jan 17, 2022 •

edited

Loading

ghost commented Jul 14, 2022 •

edited by ghost

Loading