Cranelift AArch64: Support the SIMD bitmask extraction operations #2296
Comments
Anton, thank you for making these measurements. It's a shame that it isn't a clear win on all targets. One other variant that I'd like to ask about is formation of the magic constant without doing a load. Currently what SpiderMonkey has implemented is forming the constant like this:
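A minimal sketch of that kind of mov-based materialization (arbitrary register choices, not necessarily the exact SpiderMonkey sequence):

```asm
// Build the byte-weight constant 0x8040201008040201 in a general-purpose
// register with a MOVZ/MOVK chain, then broadcast it to both 64-bit lanes.
movz    x0, #0x0201
movk    x0, #0x0804, lsl #16
movk    x0, #0x2010, lsl #32
movk    x0, #0x8040, lsl #48
dup     v0.2d, x0
```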
I remember reading somewhere (in one of the optimisation guides) that loads have a 3- or 4-cycle latency. Whereas ...
I am actually working on improving vector constant materialization right now, and in particular on replicated patterns. While I am thinking mostly about using ... My suggestion is to use ...
My understanding is that the ...
In PR #2310 the idea is actually to concentrate the handling of constants in a single place, so that we can easily improve code generation in the future. We can also add special cases inside it.

I intend to run some benchmarks again, but they will be slightly different, because we are no longer interested in the approach to calculating the bitmask, but in the best way to materialize the magic constants (which is an orthogonal issue).
I have finally managed to run several microbenchmarks for the constant materialization. My base case is the sequence of moves shown above.
The first alternative is a straight literal load. The second one is a literal load, but only of 64 bits (since we are dealing with a replicated pattern), followed by a move from the first vector element to the other. And the third one is similar to the second one, but uses a replicated load instead. Each of the alternatives also has a variant in which the literal is positioned immediately after the load, so it has to be branched over.
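For illustration, the variants correspond roughly to the following; the labels, register choices, and literal placement here are assumptions rather than the exact benchmark code:

```asm
// Alternative 1: straight 128-bit literal load.
        ldr     q0, .Lconst

// Alternative 2: 64-bit literal load, then copy element 0 to element 1.
        ldr     d1, .Lconst
        mov     v1.d[1], v1.d[0]

// Alternative 3: 64-bit replicated load (LD1R takes a base register, not a literal).
        adr     x0, .Lconst
        ld1r    {v2.2d}, [x0]

// The branch-over-the-literal variants place the pool entry right after
// the load; shown here for the 128-bit case, the others are analogous.
        ldr     q3, 1f
        b       2f
1:      .quad   0x8040201008040201
        .quad   0x8040201008040201
2:
// Shared literal pool entry with the replicated magic pattern.
.Lconst:
        .quad   0x8040201008040201
        .quad   0x8040201008040201
```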
Note that it is not really possible to establish a dependency chain between loop iterations, so what I have measured is essentially throughput (not latency), especially on out-of-order processors. Here are the results for various microarchitectures in terms of the speedup achieved by the alternatives, i.e. higher numbers mean that the latter are faster (have higher throughput):
The results are based on median runtimes from 20 runs; standard errors were 0.21% or less. Overall, the literal load approaches without a branch are a clear win, with some microarchitectures (especially the little ones like A53 and A55) preferring the variants that transfer only 64 bits from memory.
Shift and accumulate instructions could be used instead of a constant mask.
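One possible shape of that idea, assuming the input is a comparison result (each byte all zeroes or all ones), is to narrow each byte to a single bit and then fold neighbouring lanes together with `USRA` and doubling shift amounts. This is just an illustration with arbitrary registers, not a tested lowering:

```asm
// v0.16b: comparison result, each byte is 0x00 or 0xff.
ushr    v1.16b, v0.16b, #7      // keep only the sign bit: each byte is now 0 or 1
usra    v1.8h,  v1.8h,  #7      // fold byte pairs: 2 mask bits per 16-bit lane
usra    v1.4s,  v1.4s,  #14     // fold again: 4 mask bits per 32-bit lane
usra    v1.2d,  v1.2d,  #28     // fold again: 8 mask bits in each 64-bit lane's low byte
umov    w0, v1.b[0]             // mask bits for lanes 0-7
umov    w1, v1.b[8]             // mask bits for lanes 8-15
orr     w0, w0, w1, lsl #8      // final 16-bit mask in w0
```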
We pass all SIMD tests on aarch64, so I believe these operations should be supported; closing issue now.
The bitmask extraction operations (WebAssembly/simd#201) have already become a part of the fixed-width SIMD proposal, so we should definitely implement them in Cranelift's AArch64 backend (and elsewhere, of course, but that's a separate discussion). This issue is going to concentrate on implementation approaches, and in particular on the instruction sequences that will be used, which is probably going to be the most challenging part because those operations don't have a straightforward mapping into the A64 instruction set.
Let's start with `i8x16.bitmask`, since it is the most complicated one - the suggested lowering from the proposal is roughly the sequence sketched below, where `Vs` is the input vector register, `Wd` is the output general-purpose register, and `Vt` and `Vt2` are temporary vector registers.
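Roughly (see WebAssembly/simd#201 for the authoritative version; `Qt2` and `Ht` below are the Q and H views of `Vt2` and `Vt`):

```asm
sshr    Vt.16b,  Vs.16b,  #7          // broadcast each byte's sign bit across the byte
ldr     Qt2,     .Lmask               // .Lmask: 0x8040201008040201 in each 64-bit half
and     Vt.16b,  Vt.16b,  Vt2.16b     // keep one unique bit per byte
ext     Vt2.16b, Vt.16b,  Vt.16b, #8  // move the high half down
zip1    Vt.16b,  Vt.16b,  Vt2.16b     // interleave low/high halves into 16-bit lanes
addv    Ht,      Vt.8h                // horizontal add gathers all 16 mask bits
umov    Wd,      Vt.h[0]
```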
The main issue with this instruction sequence is that it uses the horizontal reduction `ADDV`, which is quite expensive on some microarchitectures, so here's an alternative:

We also use `CMLT #0` instead of `SSHR`, since it has better throughput in some cases, but, of course, the best approach would be to avoid it completely. AFAIK the expected usage of the bitmask extraction operations is to work on vector comparison results, which are already in the required form (the bits in each lane are either all 0 or all 1), so we could avoid the initial instruction by clever IR pattern matching (the bare minimum would be `maybe_input_insn(ctx, inputs[0], Opcode::Icmp)` and so on).

I wrote a microbenchmark for each instruction sequence by putting it in a loop and completing it with a `dup vs.16b, wd`, which establishes a loop-carried dependency and makes it possible to measure latency. Here are the results for various microarchitectures in terms of the speedup achieved by the alternative sequence, i.e. higher numbers mean that the latter is faster (has lower latency):

The results are based on median runtimes from 20 runs; standard errors were 0.02% or less, with one exception on the Neoverse N1, which was 1.95% because the very first run took significantly longer than the rest.
Unfortunately, there isn't a clear winner - the alternative instruction sequence is faster on big cores, while the one from the proposal is faster on the little core, i.e. Cortex-A53. However, the speedups achieved on the big cores are greater than the slowdown experienced on the little core.
Future architectural extensions could enable further simplification; in particular, the bit permute instructions that are part of the second version of the Scalable Vector Extension (SVE2) may reduce the number of instructions and avoid the literal load at the same time.
cc @julian-seward1