Remove integer SIMD not-equals instructions #351
Conversation
What are the benefits of this? I don't see how a two-op replacement would be faster than the current single operation. In addition, a single operation reduces code size and is also closer to what Wasm already has (e.g. the scalar ne instructions).
The benefit is to motivate developers (both software developers coding in intrinsics and compiler developers) to use the Equals instructions rather than Not Equals. Note that unlike SIMD, scalar code pays no extra cost for not-equals, so this concern is specific to the SIMD comparisons.
Isn't the resulting machine instruction sequence the same for, let's say, i32x4.ne(a, b) and i32x4.not(i32x4.eq(a, b))?
It is. My point is that developers can often restructure the surrounding code so that the Not operation disappears entirely, rather than doing a mechanical one-for-one replacement.
It sounds like the rewrites you're thinking of, @Maratyszcza, are larger rewrites of the surrounding algorithms than just the simple rewrite of replacing i32x4.ne(a, b) with i32x4.not(i32x4.eq(a, b)).
I agree with the general principle of encouraging apps to use only what's efficient.
Here are two more examples:
In both cases a rewrite completely removes the SIMD Not operation from the generated code.
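For illustration, a minimal sketch of this kind of rewrite, assuming clang's <wasm_simd128.h> intrinsic names (the helper functions here are hypothetical):

```c
/* Sketch only: intrinsic names assume clang's <wasm_simd128.h>
   (compile for a wasm target with -msimd128); the helper functions
   are hypothetical and just illustrate the rewrites. */
#include <stdbool.h>
#include <wasm_simd128.h>

/* Before: "do any lanes differ?" written with i32x4.ne,
   which an engine must lower to an equality compare plus a not. */
static bool lanes_differ_ne(v128_t a, v128_t b) {
  return wasm_v128_any_true(wasm_i32x4_ne(a, b));
}

/* After: the same predicate via i32x4.eq. The scalar negation folds
   into the branch on the all_true result, so no SIMD not is emitted. */
static bool lanes_differ_eq(v128_t a, v128_t b) {
  return !wasm_i32x4_all_true(wasm_i32x4_eq(a, b));
}

/* Before: pick lanes of x where a != b, lanes of y elsewhere. */
static v128_t pick_where_unequal_ne(v128_t x, v128_t y, v128_t a, v128_t b) {
  return wasm_v128_bitselect(x, y, wasm_i32x4_ne(a, b));
}

/* After: swapping the bitselect operands lets the eq mask be used
   directly, again removing the not from the generated code. */
static v128_t pick_where_unequal_eq(v128_t x, v128_t y, v128_t a, v128_t b) {
  return wasm_v128_bitselect(y, x, wasm_i32x4_eq(a, b));
}
```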
Ok, this seems reasonable to me. If we didn't already have the not-equals instructions, I don't think we would add them now. Note that removing them would be a breaking change for some OT users, but we explicitly need to not consider that as a downside in making this decision.
The first step would be to stop generating these instructions in LLVM, and it is not blocked by anything.
I don't see a strong benefit in removing it:
(However, I am not a SIMD writer, so my opinions should be given much lower priority than those of the other experts on this thread.) That said, I also agree with Thomas that, if we did not have this instruction and we were looking at it individually, we might not have added it, given the bar (post opcode renumbering, when we slowed down additions). So in summary: is not-equals a big footgun? IMO, no. Is it painful enough that we should remove it? IMO, no as well. But if we do remove it I will not miss it particularly. I am also interested in how much friction it will cause existing users of SIMD:
Here is my take on the proposed removal. Removing it will incentivize people to come up with workarounds that might just be wrong.
The actual test should be … And finally, removing …
IMO it is far from evident that there's such a difference in performance between … With WAsm SIMD being a performance feature, it should discourage such inefficient solutions.
Agreed. @lemaitre thanks for pointing out the logic error. I think providing both == and != actually makes such errors more likely, because it incentivizes such transforms. Someone may realize (only after looking deep into the manuals or generated code) that != is best avoided, and then change the code at a late stage. If != were not provided at all, written-from-scratch SIMD code would not encounter this problem.
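To make the pitfall concrete, here is a hypothetical example of the kind of mistake such a transform invites (not the specific error discussed above); intrinsic names assume clang's <wasm_simd128.h>:

```c
#include <stdbool.h>
#include <wasm_simd128.h>

/* Intended predicate: "the vectors differ in at least one lane". */
static bool differs_correct(v128_t a, v128_t b) {
  return !wasm_i32x4_all_true(wasm_i32x4_eq(a, b));  /* not(all lanes equal) */
}

/* Tempting but wrong eq-based rewrite: this tests "no lane is equal",
   a much stronger condition than "at least one lane differs". */
static bool differs_wrong(v128_t a, v128_t b) {
  return !wasm_v128_any_true(wasm_i32x4_eq(a, b));   /* not(any lane equal) */
}
```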
I'm wondering if this is true. Multiple times throughout the development of the SIMD extension we've discussed the tradeoffs - when there's an instruction that logically belongs to the ISA, but has suboptimal lowering on current architectures, should we still include it? While I disagree that the penalty for emitting "not" is insignificant (it can have a noticeable perf impact on real workloads; of course "noticeable" means, say, 5% or less, but when you use SIMD I'd expect some users like myself to want to tune the code to get everything out of the hardware), I think the following considerations apply:
So my general observation is that in this proposal we have consistently said the same thing: if an operation is fundamental to the type and you'd naturally expect it to be available in a general purpose instruction set, we will include it in the proposal even if the lowering is suboptimal. Case in point: f32->i32 conversions, byte shifts, probably others (there should be a host of issues in this repository by now). In my view we should make sure LLVM doesn't strength-reduce …
I think the confusion from a missing NE might be greater than the benefit we get from leaving it out. Many developers won't notice (the toolchain hides it), but we also have many authors using intrinsics. Explaining why we left NE out could take more effort than simply providing it, given that emulating NE costs about the same as the NE codegen itself. Regarding education, we can have documentation beside the intrinsics header file that notes the potential performance problem with the NE instructions; this is probably the most visible place for intrinsics users.
I agree that the principle of least surprise more strongly supports keeping it rather than taking it out. Keeping it reduces surprise among those who read the docs, which is hopefully a large group of people in the long run. Taking it out reduces surprise among those who use a poorly tuned compiler or hand-write assembly without reading the docs, which is hopefully a small group of people in the long run.
Quick comment in support of a design principle here that motivates removing this (particularly responding to the earlier comments). The number one goal for WASM SIMD, and for all of WASM, is near-native performance, and the best way to achieve that is by mapping directly to what the hardware gives you. In this case neither x86 nor Arm supports not-equal directly, which means a less efficient lowering, as has been discussed. Symmetry is clearly desired and encouraged and an important goal for the spec, but performance is the ultimate goal. In this case the hardware isn't symmetric (neither x86 nor Arm), and the hardware is the lead to follow and the blueprint to map to. Obviously the points from @ngzhian, @tlively, etc. for keeping (i.e., not removing) are more than fair, with the impact of already being implemented/shipped being the biggest point that resonates. However, the principle @Maratyszcza expresses for removing here is spot on.
Should we look at other instructions where codegen on both architectures is poor? There are a number of problematic instructions (https://github.com/zeux/wasm-simd/blob/master/Instructions.md for an overview, and https://github.com/WebAssembly/simd/issues/created_by/abrown for specific instructions; the titles all say x64, but some of them are similarly poor on ARM, such as all_true and any_true).
@ngzhian It's a good question. I think the reasoning for removing not-equals is a sound one that should be applied generally. The hardware provides the blueprint for the instructions we want, and when all hardware is saying the same thing, that blueprint is even more solid. If getting it right is important, and particularly since the instructions we put in will become legacy, I do think it's fair to consider revisiting and potentially holding back any instruction that could lead to toolchains and app developers baking in inefficient translations to hardware. This particular not-equals instruction doesn't sound too controversial to consider, or even too painful to hold off on, but there are arguments both for and against removing it, and so there are of course other instructions that would be even more controversial to consider. Finding those candidate instructions to review is not too hard, as the lists you point out from @zeux and @abrown contain them, but I think the question is: is there significant appetite to revisit? Personally I do support doing that pass, but I understand that may not necessarily be a shared opinion.
IMO there is value in letting developers and toolchains see lowering inefficiencies, so that the spec itself plays a role in helping to avoid generating inefficient Wasm code in the first place. Documenting the cases might not be sufficient, as it raises maintenance concerns and there are no guarantees that developers and tool writers will always use the documentation appropriately. I think the primary expectation of a Wasm SIMD user is performance; convenience and symmetry are secondary, 'nice to haves'. I would assume that an average SIMD developer is already familiar with a few of these inconveniences, which exist in the native world as well. Taking a dependency on toolchains and runtimes to optimize or work around inefficiencies is a slippery slope, as developers will be unpleasantly surprised if they happen to use sub-optimal toolchains and runtimes (beyond Emscripten and V8) in the future. I am in favor of this proposed change by @Maratyszcza. I agree with @jib6740 that hardware support is the best indicator of an operation being suited to fulfilling the developer expectation of performance from Wasm SIMD. Operations that are highly useful but have less than ideal hardware support are great candidates to drive the future evolution of hardware support and can be included. As we have seen, this specific NE scenario is not too severe compared to some of the other open issues. There is value in taking a closer look at the hardware-support and real-world-use-case criteria and applying them consistently across the known open issues/proposals we have now.
Here's a list of instructions that aren't supported well natively by either ARM64 or x64, based on the v8 implementation:

v8x16.shuffle

Here's a list of instructions whose lowering on Intel architectures (pre-AVX512; I haven't analyzed whether that changes any of these) is expensive to the point of being problematic:

all unsigned integer comparisons

If we consistently apply the hardware support principle, we should remove all of the above. I'm not sure I understand why removing not-equals specifically and keeping all of the above makes sense. The impact of a single bad shuffle dwarfs that of a not-equal comparison by a fair margin.
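As one concrete illustration of the unsigned-comparison case (a sketch, not v8's actual lowering): with no unsigned packed compare before AVX-512, a lowering has to bias both operands by 0x80000000 and fall back to the signed compare, roughly three instructions plus a materialized constant. Expressed with wasm intrinsics (names assume clang's <wasm_simd128.h>, helper name hypothetical):

```c
#include <stdint.h>
#include <wasm_simd128.h>

/* Sketch of the standard workaround for the missing unsigned compare:
   a >u b  is equivalent to  (a ^ 0x80000000) >s (b ^ 0x80000000). */
static v128_t u32x4_gt_via_signed(v128_t a, v128_t b) {
  const v128_t bias = wasm_i32x4_splat((int32_t)0x80000000);
  return wasm_i32x4_gt(wasm_v128_xor(a, bias), wasm_v128_xor(b, bias));
}
```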
I think there is value in assessing these on a case-by-case basis. An instruction with popular real-world use cases is a good candidate for future hardware support and has a better chance of becoming efficient as hardware evolves. Instructions with no real-world uses will remain at a performance disadvantage, as there is little incentive to support them, and I suggest removing them from the MVP. It's not feasible to remove an op after it has been standardized, but such ops can always be reconsidered as future extensions.
Agree. We already have unresolved issues for almost all of these hw-inefficient instructions. If we can establish a consensus on the criteria clarifications, including the hardware support requirements, it will help resolve those issues as well. We have the 'Inclusion criteria' topic on the agenda for the next sync: #203.
Working on top of Arseny's comment, I collated the "problematic" instructions as of V8 HEAD today, with their codegen instruction counts on x64 and ARM64. (Note that the instruction counts shown are worst case; in the case of shifts, we assume non-immediate shifts, which need 4 instructions rather than 1.) Some highlights:
If we were to tackle things in terms of severity:
I haven't looked too closely at the outstanding PRs that suggest new instructions; I suppose we should have consensus on requirements and expectations before we look at them with fresh eyes. A couple that stand out are: #379 (>8 on x64 until AVX512), #350 (>1 on both architectures for 3 out of 8 suggested instructions), #247 (>7 on x64). There is an outstanding issue that V8 cannot put 128-bit constants into memory to make use of rip-relative loads, and so must always materialize masks using SIMD instructions. I eyeballed the cases where we are doing that, and some of those instructions (integer negation, float abs and neg) don't make it to the top of the "worst offender" list, so I think this discussion can ignore this particular V8 limitation. I probably missed something; let me know so I can update the sheet. I hope this is useful data to guide our future discussions. [0] load_splat might be surprising: on ARM, the addressing mode cannot handle the memarg immediate + i32 offset that Wasm has for memory loads, so we need an additional add.
Thank you for the detailed analysis, @ngzhian. However, I think the number of instructions generated in V8 doesn't tell the whole story. First, we need to consider how expensive it would be to emulate these operations with other WAsm SIMD instructions if we took the corresponding instruction out of the spec. Secondly, due to the limitation on loading 128-bit constants in V8, it often prefers longer instruction sequences in order to avoid loading constants. As far as I understand, this limitation is likely to be removed in the future as WAsm SIMD support in V8 matures, so we should avoid making decisions grounded in temporary limitations of a single WAsm engine. I think a better way to compare instruction complexity is through unrestricted assembly implementations like the ones in my PRs. IMO instructions in the WebAssembly SIMD specification should be strict Pareto improvements over their polyfill on the (ARM64, x86-64 SSE4) performance curve, but that doesn't mean they must lower to a single native instruction on either architecture.
I agree with this argument, but only when the difference in instruction count is below a threshold beyond which arch-specific perf cliffs appear (a threshold we can reach group consensus on). I disagree that the spirit of avoiding instructions without good (2+) hardware support penalizes a few architectures. On the contrary, it helps developers avoid inefficient code that will run fast only on certain devices. If emulating an op takes a significantly higher number of instructions on one platform vs. another, and the op is used in a perf-critical path, it could easily tank performance on certain devices and surprise developers with a loss of perf, or even regressions, from using Wasm SIMD. This runs the risk of affecting the usability of a performance-oriented feature like SIMD. These operations can be safely included in the spec once some form of feature detection gives developers a way to get around such perf cliffs, so they are better candidates for post-MVP additions; this re-emphasizes the usefulness of the conditional sections and feature detection proposals. AVX512 availability is very limited, and there is a high likelihood that a significant percentage of platforms running Wasm SIMD code won't be able to use it in the future either, so support in AVX512 is not a good indicator of architecture support.
I think the "perf cliff" argument is irrelevant and to some extent unavoidable: some operations will always be cheaper on one architecture than on another. To me, the only interesting metric is: is it beneficial on at least one common architecture while not being detrimental on the others?
Many mid-range Intel laptops released recently have AVX512 support, so AVX512 machines will be a significant proportion of the installed base in the future. Also, I often see comments about SSE4, but given the number of AVX2 machines in the wild, I think WASM should be optimized for AVX2 and merely be compatible with SSE4 (i.e. not optimized for it).
Just a quick note about AArch64 - since people are considering different SIMD ISA extensions on x86 (e.g. AVX-512), it makes sense to do the same for AArch64. SVE(2) helps with the implementation of some of the other operations, in particular bitmask extraction; not NE comparisons, however.
Closing as it was decided in the 10/16/2020 meeting to keep the SIMD Not Equals instructions for consistency with scalar instructions.
Currently the WebAssembly SIMD specification includes integer Not Equals instructions. These instructions are problematic, as neither x86 SSE4.1 nor ARM NEON includes equivalents. In practice, software developers can almost always replace expressions using these Not Equals instructions with equivalent expressions using Equals instructions, and the latter perform better due to their direct mapping onto native SIMD instruction sets. In cases where a Not Equals predicate is desired, it can be emulated as e.g.
i32x4.not(i32x4.eq(a, b))
and would result in the same native code generation as the i32x4.ne(a, b) instruction.
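In intrinsics terms (assuming clang's <wasm_simd128.h> names; the function name here is hypothetical), the emulation is a one-liner:

```c
#include <wasm_simd128.h>

/* Sketch: emulating i32x4.ne with eq + not. An engine lowers both spellings
   to the same two native instructions (roughly pcmpeqd + pxor on SSE4.1,
   cmeq + mvn on NEON), so nothing is lost by writing it this way. */
static v128_t i32x4_ne_emulated(v128_t a, v128_t b) {
  return wasm_v128_not(wasm_i32x4_eq(a, b));
}
```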