Cranelift AArch64: Support the SIMD bitmask extraction operations #2296
Comments
Anton, thank you for making these measurements. It's a shame that it isn't a clear win on all targets. One other variant that I'd like to ask about is formation of the magic constant without doing a load. Currently what SpiderMonkey has implemented is forming the constant like this:
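A minimal sketch of that kind of mov-based materialization (arbitrary register choices, not necessarily the exact SpiderMonkey sequence):

```asm
// Build the byte-weight constant 0x8040201008040201 in a general-purpose
// register with a MOVZ/MOVK chain, then broadcast it to both 64-bit lanes.
movz    x0, #0x0201
movk    x0, #0x0804, lsl #16
movk    x0, #0x2010, lsl #32
movk    x0, #0x8040, lsl #48
dup     v0.2d, x0
```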
I remember reading somewhere (in one of the optimisation guides) that loads have a 3- or 4-cycle latency. Whereas ...
I am actually working on improving vector constant materialization right now, and in particular on replicated patterns. While I am thinking mostly about using ... My suggestion is to use ...
My understanding is that the ...
In PR #2310 the idea is actually to concentrate the handling of constants in a single place, so that we can easily improve code generation in the future. We can also add special cases inside it.

I intend to run some benchmarks again, but they will be slightly different, because we are no longer interested in the approach to calculating the bitmask, but in the best way to materialize the magic constants (which is an orthogonal issue).
I have finally managed to run several microbenchmarks for the constant materialization. My base case is the sequence of moves shown above.
The first alternative is a straight literal load. The second one is a literal load, but only of 64 bits (since we are dealing with a replicated pattern), followed by a move from the first vector element to the other. And the third one is similar to the second one, but uses a replicated load instead. Each of the alternatives also has a variant in which the literal is positioned immediately after the load, so it has to be branched over.
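For illustration, the variants correspond roughly to the following; the labels, register choices, and literal placement here are assumptions rather than the exact benchmark code:

```asm
// Alternative 1: straight 128-bit literal load.
        ldr     q0, .Lconst

// Alternative 2: 64-bit literal load, then copy element 0 to element 1.
        ldr     d1, .Lconst
        mov     v1.d[1], v1.d[0]

// Alternative 3: 64-bit replicated load (LD1R takes a base register, not a literal).
        adr     x0, .Lconst
        ld1r    {v2.2d}, [x0]

// The branch-over-the-literal variants place the pool entry right after
// the load; shown here for the 128-bit case, the others are analogous.
        ldr     q3, 1f
        b       2f
1:      .quad   0x8040201008040201
        .quad   0x8040201008040201
2:
// Shared literal pool entry with the replicated magic pattern.
.Lconst:
        .quad   0x8040201008040201
        .quad   0x8040201008040201
```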
Note that it is not really possible to establish a dependency chain between loop iterations, so what I have measured is essentially throughput (not latency), especially on out-of-order processors. Here are the results for various microarchitectures in terms of the speedup achieved by the alternatives, i.e. higher numbers mean that the latter are faster (have higher throughput):
The results are based on median runtimes from 20 runs; standard errors were 0.21% or less. Overall, the literal load approaches without a branch are a clear win, with some microarchitectures (especially the little ones like A53 and A55) preferring the variants that transfer only 64 bits from memory.
Shift and accumulate instructions could be used instead of a constant mask.
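One possible shape of that idea, assuming the input is a comparison result (each byte all zeroes or all ones), is to narrow each byte to a single bit and then fold neighbouring lanes together with `USRA` and doubling shift amounts. This is just an illustration with arbitrary registers, not a tested lowering:

```asm
// v0.16b: comparison result, each byte is 0x00 or 0xff.
ushr    v1.16b, v0.16b, #7      // keep only the sign bit: each byte is now 0 or 1
usra    v1.8h,  v1.8h,  #7      // fold byte pairs: 2 mask bits per 16-bit lane
usra    v1.4s,  v1.4s,  #14     // fold again: 4 mask bits per 32-bit lane
usra    v1.2d,  v1.2d,  #28     // fold again: 8 mask bits in each 64-bit lane's low byte
umov    w0, v1.b[0]             // mask bits for lanes 0-7
umov    w1, v1.b[8]             // mask bits for lanes 8-15
orr     w0, w0, w1, lsl #8      // final 16-bit mask in w0
```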
We pass all SIMD tests on aarch64, so I believe these operations should be supported; closing issue now.
The bitmask extraction operations (WebAssembly/simd#201) have already become a part of the fixed-width SIMD proposal, so we should definitely implement them in Cranelift's AArch64 backend (and elsewhere, of course, but that's a separate discussion). This issue is going to concentrate on implementation approaches, and in particular on the instruction sequences that will be used, which is probably going to be the most challenging part because those operations don't have a straightforward mapping into the A64 instruction set.
Let's start with `i8x16.bitmask`, since it is the most complicated one - the suggested lowering from the proposal is roughly the sequence sketched below, where `Vs` is the input vector register, `Wd` is the output general-purpose register, and `Vt` and `Vt2` are temporary vector registers.
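Roughly (see WebAssembly/simd#201 for the authoritative version; `Qt2` and `Ht` below are the Q and H views of `Vt2` and `Vt`):

```asm
sshr    Vt.16b,  Vs.16b,  #7          // broadcast each byte's sign bit across the byte
ldr     Qt2,     .Lmask               // .Lmask: 0x8040201008040201 in each 64-bit half
and     Vt.16b,  Vt.16b,  Vt2.16b     // keep one unique bit per byte
ext     Vt2.16b, Vt.16b,  Vt.16b, #8  // move the high half down
zip1    Vt.16b,  Vt.16b,  Vt2.16b     // interleave low/high halves into 16-bit lanes
addv    Ht,      Vt.8h                // horizontal add gathers all 16 mask bits
umov    Wd,      Vt.h[0]
```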
The main issue with this instruction sequence is that it uses the horizontal reduction `ADDV`, which is quite expensive on some microarchitectures, so here's an alternative:

We also use `CMLT #0` instead of `SSHR`, since it has better throughput in some cases, but, of course, the best approach would be to avoid it completely. AFAIK the expected usage of the bitmask extraction operations is to work on vector comparison results, which are already in the required form (the bits in each lane are either all 0 or all 1), so we could avoid the initial instruction by clever IR pattern matching (the bare minimum would be `maybe_input_insn(ctx, inputs[0], Opcode::Icmp)` and so on).

I wrote a microbenchmark for each instruction sequence by putting it in a loop and completing it with a `dup vs.16b, wd`, which establishes a loop-carried dependency and makes it possible to measure latency. Here are the results for various microarchitectures in terms of the speedup achieved by the alternative sequence, i.e. higher numbers mean that the latter is faster (has lower latency):

The results are based on median runtimes from 20 runs; standard errors were 0.02% or less, with one exception on the Neoverse N1, which was 1.95% because the very first run took significantly longer than the rest.
Unfortunately, there isn't a clear winner - the alternative instruction sequence is faster on big cores, while the one from the proposal is faster on the little core, i.e. Cortex-A53. However, the speedups achieved on the big cores are greater than the slowdown experienced on the little core.
Future architectural extensions could enable further simplification; in particular, the bit permute instructions that are part of the second version of the Scalable Vector Extension (SVE2) may reduce the number of instructions and avoid the literal load at the same time.
cc @julian-seward1