movemask instruction #131
Comments
Operations that are useful and hard to emulate are high-value candidates for the Wasm SIMD feature set. Since "reasonable cost" can often be subjective, performance cliffs are ideally quantified with architecture-specific performance data when needed.
This is one of the things that motivated the In theory, clang could support existing code that uses This approach could be extended by adding instructions such as
I tend to agree with @sunfishcode in that many use cases could be better served by other instructions. It is unclear how much value a
Just to be clear - I agree that all_true / any_true are good specific versions of movemask. They solve most of the use cases, and they can be more efficiently implemented on many architectures. The issue is that sometimes they aren't enough and you need movemask specifically. When you do need movemask, it's not clear how to efficiently synthesize it out of existing WASM SIMD instructions. @gnzlbg can you clarify what bitmask_select would do? Is the idea that it would be similar to movemask but specify an immediate bit location, so that bitmask_select(7) is equivalent to msb_bitmask?
Right now we have a If we had such an instruction, it would be very useful to add a way to compute such masks from a vector efficiently, e.g., by also adding a That would mean that one can write
I personally use movemask to implement a fast clamp (restricting input to a [min, max] bound). The min/max vector intrinsics are very slow and have a carried dependency. Casting to int and using movemask allows me to check at very little cost whether those intrinsics are needed or whether I need to follow the slow path: https://github.com/numforge/laser/blob/e660eeeb723426e80a7b1187864323d85527d18c/laser/primitives/simd_math/exp_log_sse2.nim#L41 Usage: I use it to implement fast vectorized exponentiation (10x faster than math.h), and fast clamping brings noticeable benefits.
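For illustration, the check-before-clamp pattern looks roughly like this with SSE intrinsics (a sketch only; the function name and bounds are made up, this is not the code from the linked repository):

```
#include <emmintrin.h> // SSE/SSE2

// Sketch of the "check before clamping" idea: use movemask on the comparison
// result and only pay for the clamp when some lane is actually out of range.
static inline __m128 clamp_fast(__m128 x, __m128 lo, __m128 hi)
{
    __m128 out_of_range = _mm_or_ps(_mm_cmplt_ps(x, lo), _mm_cmpgt_ps(x, hi));
    if (_mm_movemask_ps(out_of_range) == 0)
        return x;                                 // fast path: nothing to clamp
    return _mm_min_ps(_mm_max_ps(x, lo), hi);     // slow path: clamp needed
}
```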
@mratsim Note that the use case you have can be solved using the i32x4.any_true instruction.
Through experimentation I've discovered a better way to emulate movemask, at least given a 64-bit target platform. It requires fewer scalar instructions at the cost of doing most math in "64-bit SIMD":

```
static void wasmMoveMask(v128_t mask, unsigned char& mask0, unsigned char& mask1)
{
    v128_t mask_0 = wasmx_shuffle_v32x4(mask, 0, 2, 1, 3);
    // TODO: when Chrome supports v128.const we can try doing vectorized and?
    uint64_t mask_1a = wasm_i64x2_extract_lane(mask_0, 0) & 0x0804020108040201ull;
    uint64_t mask_1b = wasm_i64x2_extract_lane(mask_0, 1) & 0x8040201080402010ull;
    uint64_t mask_2 = mask_1a | mask_1b;
    uint64_t mask_4 = mask_2 | (mask_2 >> 16);
    uint64_t mask_8 = mask_4 | (mask_4 >> 8);
    mask0 = uint8_t(mask_8);
    mask1 = uint8_t(mask_8 >> 32);
}
```

This is still much more expensive than a single instruction on x64, but is starting to approach NEON emulation cost. I think the general sentiment here is that movemask emulation cost on various SIMD architectures is imbalanced enough that it's not an obvious candidate for the cross-platform SIMD subset. So with that in mind I'm going to close this for now; we can reopen it if more data is presented in favor of native support.
I'd like to see how this I think we should not add an instruction that is too slow to be useful on NEON targets, but for your application at least, it seems like an emulated instruction would still beat what you can do without it.
@AndrewScheidecker The NEON emulation path looks like this (AArch64 variant):
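A sketch of that AArch64 approach (per-byte powers of two plus vaddv on each half), illustrative rather than the exact code; it assumes each byte of the input mask is 0x00 or 0xFF, i.e. a comparison result:

```
#include <arm_neon.h>

// Hedged sketch of the AArch64 emulation described in this issue:
// AND the byte mask with per-lane powers of two, then horizontally add
// each 8-byte half to get two 8-bit masks.
static inline void neonMoveMask(uint8x16_t mask, unsigned char& mask0, unsigned char& mask1)
{
    static const uint8_t byte_mask_data[16] = { 1, 2, 4, 8, 16, 32, 64, 128,
                                                1, 2, 4, 8, 16, 32, 64, 128 };
    uint8x16_t masked = vandq_u8(mask, vld1q_u8(byte_mask_data));

    mask0 = (unsigned char)vaddv_u8(vget_low_u8(masked));
    mask1 = (unsigned char)vaddv_u8(vget_high_u8(masked));
}
```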
I would expect this to be faster than the scalar emulation above. However, for other architectures (MIPS/RISC-V) I'm not sure what the codegen strategy even should be (use the scalar path?).
@zeux

```
uint16_t movemask_epu16(v128_t val) {
  const uint16x8_t mask = { 1, 2, 4, 8, 16, 32, 64, 128 };
  return vaddvq_u16(vandq_u16((uint16x8_t)val, mask));
}

uint32_t movemask_epu32(v128_t val) {
  const uint32x4_t mask = { 1, 2, 4, 8 };
  return vaddvq_u32(vandq_u32((uint32x4_t)val, mask));
}

uint64_t movemask_epu64(v128_t val) {
  const uint64x2_t mask = { 1, 2 };
  return vaddvq_u64(vandq_u64((uint64x2_t)val, mask));
}
```

For 32-bit ARM:

```
uint32_t movemask_epu32(v128_t val) {
  const uint32x4_t mask = { 1, 2, 4, 8 };
  const uint32x4_t av = vandq_u32((uint32x4_t)val, mask);
  const uint32x4_t xv = vextq_u32(av, av, 2);
  const uint32x4_t ov = vorrq_u32(av, xv);
  return vgetq_lane_u32(vorrq_u32(ov, vextq_u32(ov, ov, 3)), 0);
}
```
Hi! Brief introduction: I've had fun with SIMD since 2002, and I'm currently working on image compression (JPEG XL) and hashing at Google. Glad to see this discussion; we also have a use case for 'get bits from mask': comparing four things in parallel and interpreting the resulting 4 bits as a number 0-15 (for traversing a decision tree). I understand that performance portability is a concern. It looks like not supporting this means applications that want it spend ~11 operations (wasmMoveMask), whereas ARM emulation would involve only 5. Isn't that a win?
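For concreteness, that "get bits from mask" use case is essentially the following (illustrative sketch, not code from JPEG XL or Highway):

```
#include <xmmintrin.h> // SSE

// Compare four values in parallel and pack the four sign bits of the
// comparison mask into an index in the range 0..15.
static inline int compare4_to_index(__m128 values, __m128 thresholds)
{
    __m128 gt = _mm_cmpgt_ps(values, thresholds); // each lane: all-ones or 0
    return _mm_movemask_ps(gt);                   // 4 sign bits -> 0..15
}
```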
Ironically, I've thought that multi-operation intrinsics might be counterproductive if people use them more than necessary, but that's only when there are reasonable alternatives. all_true / any_true are nice, but we've seen several use cases where they are insufficient. BTW, how about a count_true function that takes masks (FF..FF or 0) and returns the number that are not 0? On x86 that would be movemask+popcnt, on ARMv8/v7 as described here, but ANDing with 1 instead of 1, 2, 4, 8, ...
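A sketch of how such a count_true could lower on each target (the name and semantics are just the ones suggested above, nothing standardized; popcount uses a compiler builtin):

```
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
// x86: movemask followed by popcount. Lanes of 'mask' are all-ones or zero.
static inline int count_true_u8(__m128i mask)
{
    return __builtin_popcount((unsigned)_mm_movemask_epi8(mask));
}
#elif defined(__aarch64__)
#include <arm_neon.h>
// AArch64: AND each byte with 1, then horizontally add the 16 lanes.
static inline int count_true_u8(uint8x16_t mask)
{
    return vaddvq_u8(vandq_u8(mask, vdupq_n_u8(1)));
}
#endif
```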
Since there's ongoing discussion on this I'll reopen this for now.

@MaxGraey Thanks for bringing this up, this is true - I only considered 8-bit movemask because that's what I usually need. If something does get standardized, it would make sense to at least standardize the 32-bit variant as well, because that's also commonly useful.

@jan-wassenberg Yeah, I think it is the case that when you have to use a movemask, it's going to be beneficial to have it as part of the spec, even if the lowering is inefficient - you can't implement it better yourself. So on these grounds I would prefer that the spec adds this instruction.

However, an alternative side of this is something I've frequently encountered with the current SIMD spec: if you aren't sure you have to use a movemask (if you haven't exhausted alternative implementation strategies), it's really tempting to use it, and if there's a performance cliff it's going to be hard to find. This might be endemic to the idea of portable SIMD, but we can try to minimize this. The last few examples I've run into are v8x16.shuffle, which produces really poor fallback code in v8 if it fails to pattern-match to an optimized variant (although the codegen can be vastly improved in theory, see https://bugs.chromium.org/p/v8/issues/detail?id=10117), and f32x4.max, which sounds like a Great Idea to clamp a value to 0 (

So there's something to be said about not exposing movemask, since that will make it less likely that, given alternate more efficient implementation options, movemask will be chosen and will result in a cliff on a less widely used architecture. For example,
I agree that it helps to include all data types. @MaxGraey has shown they are efficient on ARM, and for u16 on x86 we can use the 8-bit path after _mm_packs_epi16(x, 0). @zeux I totally agree about the "moral hazard" (and was recently advocating the same position). Perhaps one compromise would be to include movemask but have some kind of naming that indicates this is potentially slow/exotic? In Highway we put such operations in an ext namespace. Interesting about f32x4.max. Would a zero_if_negative operation help? That could use blendvps on x86, or your implementation on other platforms. |
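For reference, the blendvps mapping mentioned above would look something like this (illustrative sketch; zero_if_negative is only the hypothetical operation discussed here, and blendvps requires SSE4.1):

```
#include <smmintrin.h> // SSE4.1 for _mm_blendv_ps

// blendv selects from the second operand where the mask's sign bit is set,
// so using x itself as the mask zeroes exactly the negative lanes.
static inline __m128 zero_if_negative(__m128 x)
{
    return _mm_blendv_ps(x, _mm_setzero_ps(), x);
}
```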
zero_if_negative is a bit too specific; one alternative implementation strategy that would help here is signselect which is proposed in #124. |
Oh, hadn't seen that one yet, thanks. signselect indeed looks like a good replacement for zero_if_negative.
The lack/slowness of emulating movemask came up recently in our tests as well. It is also visible in Emscripten's support for SSE1 and its synthetic microbenchmark. +1 for adding the a) cross-compiling native code to pseudoinstructions will not be possible. If Wasm SIMD MVP/v1 is to be like a set intersection of SSE and NEON, philosophically it would be strongly preferable for v2 to look more like a set union of SSE and NEON, as opposed to v2 becoming a "fantasy SIMD" instruction set that tries to catch high-level use cases with virtual instructions that do not exist in any relevant hardware.
This movemask_u8 could be slightly faster (1% faster on a rpi4):
@wuxb45 fwiw I've tested this variant on one of my native ARM benchmarks and it looks like it's actually slightly slower on Amazon Graviton2. So perhaps the delta varies based on the specific CPU used.
…it(to #32373348)

Summary: The find_first_set and find_last_set methods are not optimal for NEON; they need to be improved by synthesizing them with horizontal adds (vaddv), which reduces the generated assembly code. In the following case, vaddvq_s16 generates 2 instructions but vpadd_s16 generates 4 instructions:

```
#ifdef __aarch64__
return vaddvq_s16(__asint);
// addv h0, v1.8h
// smov w1, v0.h[0]
#else
return vpadd_s16(
    vpadd_s16(vpadd_s16(__lo64(__asint), __hi64(__asint)), __zero),
    __zero)[0];
// addp v1.8h,v1.8h,v2.8h
// addp v1.8h,v1.8h,v2.8h
// addp v1.8h,v1.8h,v2.8h
// smov w1, v1.h[0]
#endif
```

Further discussion following the linking:
[1]. WebAssembly/simd#201
[2]. WebAssembly/simd#131

Test Plan: test_run.sh
Reviewers: chengbin.cb, liangbin.mj, yifeng.dongyifeng, longfei.alf, chuanqi.xcq
Issue: https://aone.alibaba-inc.com/req/32373348
CR: https://code.aone.alibaba-inc.com/cpp_libs/std-simd/codereview/4534679
SSE2 exposes the movemask instruction (pmovmskb), which extracts the most significant bit from each of the 16 bytes of the vector and returns a 16-bit integer with all the bits combined.
This instruction is very useful for certain types of processing. For example, when performing byte-wise processing on strings, such as scanning the string for a specific character (memchr), pmovmskb can be used to produce a mask with occurrences of the character in a string; bsf/bsr (in WebAssembly that's equivalent to clz/ctz) can be used to quickly iterate over the bits set in this mask (see "Regular expression search" in https://zeux.io/2019/04/20/qgrep-internals/ as one example).
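As a sketch of that pattern (illustrative, not the qgrep code; it assumes the size is a multiple of 16 and uses a compiler builtin for ctz):

```
#include <emmintrin.h> // SSE2
#include <stddef.h>

// memchr-style scan: compare 16 bytes at a time, movemask the result, and
// use ctz to find the first matching byte within the block.
static const char* scan16(const char* data, size_t size, char ch)
{
    __m128i needle = _mm_set1_epi8(ch);
    for (size_t i = 0; i < size; i += 16)
    {
        __m128i block = _mm_loadu_si128((const __m128i*)(data + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, needle));
        if (mask != 0)
            return data + i + __builtin_ctz((unsigned)mask); // first set bit
    }
    return 0; // not found
}
```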
It is also used in fast integer decoding in my vertex data decompressor; in its absence I have to emulate it using scalar math, see https://github.com/zeux/meshoptimizer/blob/master/src/vertexcodec.cpp#L712 - on x64, using the same fallback results in a ~15% performance penalty on the overall benchmark, despite the fact that the instruction does not otherwise dominate the execution cost.
On x86/x64, movemask directly maps to pmovmskb (SSE2).
On PowerPC movemask can be implemented with the vbpermq instruction, typically either as lvsl + vector shift + vbpermq or as load + vbpermq.
On NEON, movemask isn't available natively but it can be easily synthesized with horizontal adds from AArch64 - you need to take the mask, replace each byte with a high bit set with a power of two corresponding to the byte index (this takes a couple of vector shifts) and use vaddv_u8 for each half of the vector. On ARMv7 with NEON you can emulate two vaddv_u8 with three vpadd_u8 so the cost is still somewhat reasonable (6 vector instructions + a couple of scalar instructions to create 16-bit mask from two 8-bit lanes).
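A sketch of that ARMv7 fallback (illustrative only; it assumes each byte of the input mask is 0x00 or 0xFF and uses a loaded constant instead of vector shifts to form the per-byte powers of two):

```
#include <arm_neon.h>

// ARMv7 NEON sketch: AND with per-byte powers of two, then fold each 8-byte
// half with pairwise adds (vpadd) instead of AArch64's vaddv.
static inline unsigned neonMoveMaskV7(uint8x16_t mask)
{
    static const uint8_t byte_mask_data[16] = { 1, 2, 4, 8, 16, 32, 64, 128,
                                                1, 2, 4, 8, 16, 32, 64, 128 };
    uint8x16_t masked = vandq_u8(mask, vld1q_u8(byte_mask_data));

    // Three rounds of pairwise adds reduce each half to a single byte.
    uint8x8_t sum = vpadd_u8(vget_low_u8(masked), vget_high_u8(masked));
    sum = vpadd_u8(sum, sum);
    sum = vpadd_u8(sum, sum);

    // Lane 0 holds the low-half mask, lane 1 the high-half mask.
    return vget_lane_u8(sum, 0) | ((unsigned)vget_lane_u8(sum, 1) << 8);
}
```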
I'm not sure what the emulation strategy would be on MIPS / RISC-V.
I wanted to file this to get a sense of whether this instruction strikes an acceptable balance, given the "performance cliffs" it would have on various architectures.
I'm generally happy with WASM SIMD but the problem with movemask is that there are no other SIMD instructions in WASM SIMD that provide a reasonable emulation path (in particular, no horizontal adds - they are a bit less exotic than vbpermq). Of course horizontal adds also have a non-trivial cost on various architectures, including x64, so emulating movemask through horizontal adds on x64 is bound to result in worse performance on x64 compared to a natively supported instruction.