LLVM's "optimization" of shuffles penalizes x64 codegen #196
Ouch, that's a major pessimization. This is the kind of performance cliff that we are all trying to avoid.
This might be the same issue as in #118. This looks like the same operation, @zeux, please correct me if I am wrong. Back in #118 (comment) we realized that it might be necessary to add some of the common operations, like unpack, or at least that we should consider adding them. Such operations are common across platforms, and code depends on those particular patterns being faster than the generic shuffle operation.
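For instance, a minimal sketch of such a pattern, assuming the current wasm_simd128.h naming (the function name is illustrative): an 8-bit "unpack low" is just a generic shuffle with a particular mask that engines can pattern-match to a single instruction on x86.

```c
#include <wasm_simd128.h>

// Interleave the low 8 bytes of a and b; indices 0-15 select bytes from a,
// 16-31 select bytes from b. This exact mask maps to punpcklbw on x86.
v128_t unpacklo_u8(v128_t a, v128_t b) {
  return wasm_i8x16_shuffle(a, b, 0, 16, 1, 17, 2, 18, 3, 19,
                                  4, 20, 5, 21, 6, 22, 7, 23);
}
```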
Thanks for the writeup, @zeux! Having real numbers is helpful for this conversation. The pessimization here looks serious enough that I believe disabling shuffle merging in the LLVM backend is a prudent short-term and medium-term solution. In the long term, I would like to get inline assembly working so we can solve this in user code rather than in the compiler. Does that plan sound reasonable to everyone?
Inline asm as a long-term solution would imply that it's ok for developers to use a cost model for shuffles. If we're going to say that's ok, would we be ok with LLVM having a cost model too? If so, LLVM could use it to avoid merging shuffles in cases where the cost model advises against it, and then we wouldn't need to depend on inline asm :-).
LLVM only optimizes shuffles if it can prove that the combined shuffle is better than the original (one shuffle in place of two, for example). There are options:

1. Teach the LLVM WebAssembly backend a cost model for shuffles, so that it only rewrites a shuffle when the result is expected to be at least as fast.
2. Add new WebAssembly instructions for the common fast shuffles, making the fast cases explicit to the toolchain.
3. Disable shuffle combining in the WebAssembly backend entirely.
Options 1 and 2 both allow the toolchain to have a cost model, meaning that optimizations should have a positive impact. Option 3 would fix this case, but would disable combining multiple slow shuffles into one, which would come with a performance penalty at some point. In general I think wasm should have a cost model for shuffles, because all platforms we care about have one, and fast shuffle instructions are pretty similar across the board. In my opinion the best way to do it is via new instructions - that would keep the spec for the generic shuffle simple and would make instruction choices more explicit. The hardware-accelerated shuffle ops that come to mind:
@zeux can you think of anything else?
@penzn I'd add a few to the list:
I am not actually sure I agree that new instructions are the optimal way to handle this, though. They add a lot of spec complexity. Granted, they would come with code size / codegen time benefits, but I'd still be curious to know whether there are real-world scenarios where shuffle merging is at all beneficial.
@tlively By "inline assembly", do you mean using inline wasm code that LLVM can't reason about instead of intrinsics for shuffles? |
Yes, precisely. We could use it in the intrinsics header to create a shuffle that is opaque to the optimizer.
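A minimal sketch of that idea, assuming wasm inline asm eventually accepts v128 register operands (it did not yet at the time of this discussion); the function name is illustrative:

```c
#include <wasm_simd128.h>

// The empty asm statement is a black box to LLVM: it must assume v changes,
// so shuffles of the result cannot be merged or have their masks rewritten.
// No actual code is emitted for it.
static inline v128_t v128_opaque(v128_t v) {
  __asm__("" : "+r"(v));
  return v;
}
```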
One possible alternative (if I understand this right) is that we could add a wasm-specific shuffle instruction that is opaque to the optimizer and keep the current generic shuffle. That way we can arbitrate between the two in user code: e.g. we could expose a few pseudo-intrinsics like the ones discussed here, mirroring the hardware capabilities through the wasm-specific shuffle instruction, while keeping the general one on the optimizable instruction. I don't know enough about how inline assembly works in LLVM, so maybe these are equivalent - but a custom opaque intrinsic might leave the optimizer the freedom to move the instruction or eliminate it entirely, without ever replacing it with another one.
Yes, a custom intrinsic would work nicely. I actually already have a patch implementing this solution ready to go if that sounds appropriate to everyone.
@zeux, shuffle merging is used extensively in the CPU backends. There must be a concrete motivation for it; in general, operations like byte shuffles are costly (as a general rule, but more so on some platforms and less on others). I haven't found an exact workload yet, but I will post one if I do.
I doubt it would be optimal from LLVM's point of view, as all of the shuffle transformations are tied to its single IR shuffle instruction. The change would need to justify why wasm is different from other platforms, especially since we are trying to do this to mimic other platforms' behavior (namely, that different shuffle masks are not equivalent in cost). I think we should make such assumptions explicit in the description of the shuffle instruction, or in the spec in general.
A clear-cut case where shuffle merging is beneficial is when a user writes multiple calls to the generic shuffle intrinsic that compose into a single shuffle.
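A sketch of that situation (illustrative code, not from the thread), assuming the current wasm_simd128.h naming:

```c
#include <wasm_simd128.h>

// Rotating 32-bit lanes left by one, twice, composes into a rotate by two.
// Folding the two shuffles into the single mask (2, 3, 0, 1) is a genuine
// win on every target: one shuffle instead of two.
static v128_t rot1(v128_t v) { return wasm_i32x4_shuffle(v, v, 1, 2, 3, 0); }
static v128_t rot2(v128_t v) { return rot1(rot1(v)); }
```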
I disagree. The spec does not describe a cost model for WebAssembly because engines should not need to have any particular cost model to be spec-compliant. But that does not mean that toolchains should not do work to optimize the real-world performance of WebAssembly applications. What we have in this conversation that we did not have before is hard data showing a large real-world slowdown due to shuffle merging, and that is justification enough. As a pragmatic matter, we have to provide a workaround in the tools to support our users even if there are no spec changes.
This is true of the normative text of the spec, but the spec can also provide non-normative performance guidance; for precedent, see the existing treatment of memory alignment hints. Defining new instructions which perform special cases of shuffles and are intended to be "fast", or just adding non-normative notes, or perhaps even a non-normative appendix, to the spec describing which special cases of shuffles are expected to be "fast", are both plausible approaches.
Just to have more than one example... I ran into this again today :) While optimizing the store part of a SIMD kernel, I wanted to optimize the shuffling/mixing of values needed to pack multiple 16-bit values into the output vector. I initially had code like this:
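The original snippet was not preserved here; a plausible sketch of the shape of such code, assuming a 16-bit lane blend built from masking operations (names are illustrative):

```c
#include <wasm_simd128.h>

// Keep the even 16-bit lanes of a and the odd 16-bit lanes of b by masking
// each input and or-ing the halves together.
static v128_t blend16_via_mask(v128_t a, v128_t b) {
  const v128_t mask = wasm_i16x8_const(-1, 0, -1, 0, -1, 0, -1, 0);
  return wasm_v128_or(wasm_v128_and(a, mask),      // even lanes of a
                      wasm_v128_andnot(b, mask));  // odd lanes of b
}
```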
In an attempt to optimize this by removing redundant ands/masks, I decided to use shuffles that should map directly to the SSE pblendw instruction, like so:
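Again, a plausible sketch rather than the original code: the same blend expressed as one shuffle whose mask alternates lanes between the two inputs, which is exactly the pattern pblendw implements.

```c
// Indices 0-7 select 16-bit lanes from a, 8-15 from b; SSE4.1 engines can
// lower this mask to a single pblendw with immediate 0b10101010.
static v128_t blend16_via_shuffle(v128_t a, v128_t b) {
  return wasm_i16x8_shuffle(a, b, 0, 9, 2, 11, 4, 13, 6, 15);
}
```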
This replaces 3 ands and 2 ors with 2 shuffles that should map directly to pblendw - clearly a win! Of course, LLVM thought otherwise: it looked at the shuffles used in the blends, understood that res_0 and res_1 have components that are not used (because they are masked out by the following shuffles), and replaced these two with shuffles that don't pattern-match:
As a result, a 4 GB/s kernel started running at 1.4 GB/s.

Edit: noteworthy is the fact that in both examples, the "optimization" LLVM makes leaves things strictly worse or no better - it doesn't merge any instructions in either of the cases I mentioned; it just changes one of the shuffle masks involved, which is very unlikely to replace a "bad" shuffle with a "good" one and much more likely to make a good shuffle bad.
Ouch. I have heard that this has come up with native shuffles at some point and was fixed by distinguishing between shuffle masks, which would work for this case. @topperc, do you have any more context?

@sunfishcode - thanks for pointing out alignment. I think we should add that to the spec.

I would prefer separate instructions, mainly because those masks represent separate instructions in hardware and correspond to separate intrinsics in existing code (it would also make instruction descriptions more self-contained). However, I don't feel strongly one way or the other. I don't think we should disable all shuffle transformations to preserve masks - that would mean encouraging users to write shuffle masks matching "instruction X on platform Y", which would bake exact hardware knowledge into the code and would not necessarily be cross-platform.
We discussed this at the sync meeting today and reached tentative consensus on a path forward. To check that we have wider community consensus, please consider this proposed solution and vote with a 👍 if you support it, 👀 if you're neutral, and 👎 if you oppose it. The proposed solution is to make the shuffles produced by the WebAssembly SIMD intrinsics opaque to LLVM's optimizer via a dedicated intrinsic, so that user-written shuffle masks reach the engine unchanged.
Would it be fair to characterize this position as: WebAssembly will have a de facto cost model for shuffles, just not an official one in the normative spec?
Yes, that might be fair, depending on your expectation of how official the cost model will be. We don't expect there to be a cost model in any normative part of the spec, but we would be happy for a cost model based on empirical, cross-platform data to be implemented in the tools and documented non-normatively. I would expect such documentation to be about as official as the tool-conventions repo.
I think being conservative would be better than having an undefined, or worse, wrong cost model (which is likely what is happening now). I think we need to keep the door open not only for documenting the lowering, but also for breaking up the shuffle operation, as this one may prove to be too generic. Though I do know having one shuffle operation was intentional, and I am personally at peace with this work not happening, at least in the current proposal.
It's worth noting that there are some other cases where a cost model may be appropriate for LLVM. I'm not sure what the full extent of SIMD transformations looks like, but one basic example I've run into is LLVM replacing
@zeux It would be great if you could start a new tracking issue for small codegen issues like that. One of the downsides of the division between the tool and engine work is that I don't have great visibility into that sort of issue, so this is very helpful feedback.
Summary: Although using `__builtin_shufflevector` and the `shufflevector` instruction works fine, they are not opaque to the optimizer. As a result, DAGCombine can potentially reduce the number of shuffles and change the shuffle masks. This is unexpected behavior for users of the WebAssembly SIMD intrinsics who have crafted their shuffles to optimize the code generated by engines. This patch solves the problem by adding a new shuffle intrinsic that is opaque to the optimizers in line with the decision of the WebAssembly SIMD contributors at WebAssembly/simd#196 (comment). In the future we may implement custom DAG combines to properly optimize shuffles and replace this solution. Reviewers: aheejin, dschuff Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits Tags: #clang, #llvm Differential Revision: https://reviews.llvm.org/D66983
I believe this can now be closed, as LLVM now uses a shuffle intrinsic that's opaque to the optimizer.
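For reference, a minimal usage sketch assuming the post-patch wasm_simd128.h, where the shuffle macros expand to the new optimizer-opaque builtin instead of `__builtin_shufflevector`; the function name is illustrative:

```c
#include <wasm_simd128.h>

// The lane indices written here now reach the engine verbatim; LLVM no
// longer merges this shuffle with its neighbors or rewrites its mask.
v128_t interleave_low_u32(v128_t a, v128_t b) {
  return wasm_i32x4_shuffle(a, b, 0, 4, 1, 5);
}
```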
So we've discussed this previously at some point on the WASM call, but I keep hitting this issue and I feel like something must be done.
Briefly, the issue is as follows: engines like v8 pattern-match specific shuffle masks to fast hardware instructions, but LLVM freely rewrites the masks that user code specifies, so carefully chosen masks can fall off the engine's fast path.
Here's the recent example I hit. Given this code that tries to extract a 16-bit component from each of the four 64-bit halves of two vectors:
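(The original snippet did not survive; what follows is a hypothetical reconstruction with illustrative names, relying on the wasmx_shuffle_v32x4 helper shown right after it.)

```c
#include <wasm_simd128.h>

// Gather the low 32-bit lane of each 64-bit half of a and b, then mask each
// 32-bit lane down to its low 16 bits. The upper two bytes of every lane end
// up unused, which is what tempts LLVM into rewriting the shuffle mask.
static v128_t extract_16(v128_t a, v128_t b) {
  const v128_t zmask = wasm_i32x4_const(0xffff, 0xffff, 0xffff, 0xffff);
  v128_t sel = wasmx_shuffle_v32x4(a, b, 0, 2, 0, 2);
  return wasm_v128_and(sel, zmask);
}
```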
With `wasmx_shuffle_v32x4` defined as follows to simulate SSE2 `shufps`:
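(A plausible definition, assumed rather than recovered from the original: as with shufps, the first two lane indices select from a and the last two from b.)

```c
#define wasmx_shuffle_v32x4(a, b, c0, c1, c2, c3) \
  wasm_i32x4_shuffle(a, b, (c0), (c1), 4 + (c2), 4 + (c3))
```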
LLVM notices that the results of the shuffle have two bytes in each 32-bit lane unused and "optimizes" the shuffle to:
V8 doesn't recognize this as a shuffle it can map to a fast instruction sequence, and generates this:
This code sequence is catastrophically bad. To put it in perspective, this code runs at 3.2 GB/s without this problem and at 1.1 GB/s with it - and that is even though the loop without this code sequence is actually pretty large, ~70 SSE/AVX instructions plus some amount of loop/branch scaffolding that v8 emits. This shuffle is merely a small piece of a larger transform, and it alone wreaks havoc on the performance.
Now, clearly the instruction sequence doesn't have to be this bad. However, note that this is actually a shuffle with two distinct arguments - so even if v8 could pre-compute the shuffle masks, this would still result in something like two pshufbs plus an or.
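For reference, a sketch of that generic two-input lowering in SSE intrinsics (the masks here are placeholders, not the kernel's actual masks):

```c
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8 (pshufb)

// One pshufb per input - bytes whose mask byte has the high bit set come out
// as zero - combined with a single por.
static __m128i shuffle2(__m128i a, __m128i b, __m128i mask_a, __m128i mask_b) {
  return _mm_or_si128(_mm_shuffle_epi8(a, mask_a),
                      _mm_shuffle_epi8(b, mask_b));
}
```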
So LLVM actively pessimizes the code that the programmer writes. I've hit this issue so many times, and every time it takes a while to figure out how to apply a fragile workaround - in this case I had to mark `zmask` as volatile, spending an extra stack memory load just to make sure LLVM doesn't do anything stupid.

I know we discussed that adding target-specific logic to LLVM transforms seems problematic, since that's a lot of code to maintain. Have we considered not optimizing shuffles at all? I've yet to see evidence that LLVM combining shuffles without being aware of the target platform can produce beneficial results.