Various AArch64 speed hacks #809
Conversation
- Use inline assembly to force `umaddl` in the scalarRound code
  - Prevents the slow 64-bit `MADD`, which has +4c latency and a 2c stall on little cores.
  - Also prevents a mask.
  - Improves scalar speed even on big cores; should also improve mixed NEON speed on those little cores, although I haven't tested it.
- Use inline assembly for `vmlal_u32`
  - Fixes NEON performance being garbage on GCC < 11, because it can't fold `vget_low`/`vget_high` into `vmlal` (arm_neon.h uses inline assembly).
  - Documents the presumed reason why `umlal` must come before the add.
  - Doubles as Clang's asm guard. (The old asm guard is still used on the single vector path.)
  - Much cleaner in the code itself.
- Fix header detection
  - Include arm_neon.h when `__ARM_FEATURE_SVE` is defined.
  - Don't include arm_neon.h if only `__aarch64__` is defined (e.g. `-march=armv8-a+nofp`).
  - Don't include arm_neon.h on < ARMv7 MSVC (which would only happen on old WinCE toolchains).
- Fix the generateSecret `MOVK` hack
  - Add it for GCC, which does that too now.
  - Remove the assert, because the change to use `unreachable()` undoes it.
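For illustration, a minimal sketch of the `umaddl` idea (the helper name and exact constraints here are illustrative; the actual macro in the patch may differ):

```c
#include <stdint.h>

/* Sketch: multiply the low 32 bits of lhs and rhs and accumulate into the
 * 64-bit acc with a single UMADDL, instead of letting the compiler widen
 * both operands and emit a 64-bit MADD plus an AND mask. */
static inline uint64_t mult32to64_add64(uint64_t lhs, uint64_t rhs, uint64_t acc)
{
#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
    uint64_t ret;
    __asm__("umaddl %x0, %w1, %w2, %x3"
            : "=r" (ret)
            : "r" (lhs), "r" (rhs), "r" (acc));
    return ret;
#else
    /* Portable fallback with explicit 32-bit truncation. */
    return (uint64_t)(uint32_t)lhs * (uint64_t)(uint32_t)rhs + acc;
#endif
}
```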
Hmm... interestingly, I also get up to 21 GB/s (instead of ~14, and almost as good as the AFAIK-optimal XXH3 at 23 GB/s) if I put a compiler guard at the end of XXH64_round(), which forces the compiler to emit the instructions sequentially. However, my phone's processor has a fast multiplier.

```asm
// Before
.L197:
ldp x7, x4, [x5]
ldp x3, x6, [x5, #16]
add x5, x5, #32
madd x7, x7, x9, x12
madd x4, x4, x9, x13
madd x3, x3, x9, x10
madd x6, x6, x9, x1
ror x7, x7, #33
ror x4, x4, #33
ror x3, x3, #33
mul x12, x7, x8
ror x6, x6, #33
mul x13, x4, x8
mul x10, x3, x8
mul x1, x6, x8
cmp x2, x5
bhi .L197
```

```asm
// After
.L197:
ldr x7, [x3]
madd x7, x7, x11, x10
ror x7, x7, #33
mul x10, x7, x9
ldr x6, [x3, #8]
madd x6, x6, x11, x12
ror x6, x6, #33
mul x12, x6, x9
ldr x6, [x3, #16]
madd x4, x6, x11, x4
ror x4, x4, #33
mul x4, x4, x9
ldr x5, [x3, #24]
madd x5, x5, x11, x8
ror x5, x5, #33
mul x8, x5, x9
add x3, x3, #32
cmp x1, x3
bhi .L197
```

While it is bad that the loads are split into single `ldr`s instead of `ldp` pairs, the per-lane ordering avoids back-to-back multiplies. As mentioned before, after a 64-bit multiply the multiply pipeline is stalled for 2 cycles, so an in-order chip is forced to wait 2 cycles before it can begin dispatching the next multiply. Also, all of the small-multiplier chips I know of have only one multiply pipeline, so sequential multiplies can't be dual-issued. As a result, the first code output would have ~12 stall cycles from the sequential multiplies, which adds up to almost as much as the latency from the dependencies. 🤔 I'm gonna have to boot up my old phones and tablets to test, I guess.

TL;DR: the ARM pipeline never ceases to both amaze me and frustrate me.

Edit: Nope, the Cortex-A53 does not like that one bit. It goes from 1.6 GB/s to 993 MB/s on GCC 12. This is actually the same effect XXH32 has: XXH32 gets a whopping 660 MB/s with this ordering, while the same ordering as the "Before" code (just with 32-bit operations) gets 1.6 GB/s, like XXH64. If I run the same code on my good phone, it goes from 10 GB/s to 7 GB/s. |
Cortex-A53 results are in.
Clang is really bad at interleaving XXH3 (even with …). Additionally, 2.4 GB/s is what I get from assembly, so that is pretty dope. |
The XXH32/XXH64 ordering thing interests me, as the optimizations vary wildly depending on the processor. Just take XXH32.
I wonder if the optimal play is to go for the happy medium. A 2.4x speedup on the A53 is definitely more significant than a 30% performance decrease on the X1. 🤔 (It also sucks that runtime dispatch isn't really possible here either: while you can potentially figure out the current core type with some syscalls, that can change at any time, since you can be rescheduled from a big core to a little core and back.) |
It seems so |
Just realized that it could be possible for the compiler to generate the Cortex-A53 erratum sequence with the `umaddl`. My tablet supposedly has this erratum and hasn't seen an issue, but the fix is to prevent the compiler from putting a load/store directly before a multiply-accumulate. (Usually a `nop` is inserted between them.)

Nevermind, the conditions are never going to be met: it requires the multiply result to be forwarded directly into the next multiply-accumulate. While this could happen if the compiler decided to unroll and reorder the adds and multiplies, it won't do that, because since it is an inline assembly instruction, it doesn't know the operation is commutative. 😜

Additionally, one frustrating thing about XXH32 is that the best codegen (outside of …). Also, guarding one or even two variables isn't enough: Clang so aggressively wants to vectorize that it will literally attempt to scatter-gather load the lanes into half vectors if they aren't consecutive, which requires 2 slow load instructions and some pointer math, because NEON loads don't have immediate offsets. |
Scratch that, I found a better solution for XXH32. The asm guards don't need to be marked `volatile`. That was causing a reorder barrier, since the compiler assumed external side effects, when all we need to tell it is "put this in a register and assume it was modified somewhere between the two times it is accessed". This was also why Clang was comedically bad at interleaving XXH3: it simply couldn't. It still doesn't get the max XXH32 speed on my phone, but I'm willing to say that my phone is the weird one, since the optimal code for it is bad for any in-order processor.
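For reference, a minimal sketch of what such a non-`volatile` guard can look like (the macro name is illustrative, not necessarily the one used in xxHash):

```c
#if defined(__GNUC__) || defined(__clang__)
/* An empty asm statement with a "+r" constraint: it forces `var` into a
 * register and tells the compiler the value may have changed here, which
 * blocks vectorization and value tracking of `var` without the full
 * reordering barrier that a `volatile` asm would impose. */
#  define GUARD_REG(var) __asm__("" : "+r" (var))
#else
#  define GUARD_REG(var) do { (void)(var); } while (0)
#endif
```
|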
What's the current status of this PR? Still a work in progress? |
The volatile in the asm guards was causing a reorder barrier which was preventing interleaving of XXH32 and XXH3 NEON. Additionally, with the better asm guard, Clang does better with intrinsics.
That should do it. I removed the `volatile` from the asm guards and brought back the Clang guards. This does cause a slight slowdown for XXH32 on the X1, but a great benefit on the A53. This should also benefit x86 XXH32. Also, Clang's XXH3 is now (almost) on par with GCC's, as it can interleave now. |
Also, I'd like to see how the M1 does with this. |
Update: I found my old SSH key, and the M1 still prefers 8 NEON lanes. GCC 12 oddly has a strict aliasing issue which doesn't occur on Linux. I might need to add a workaround for that.

Since I needed to boot my old phone (Snapdragon 730 / Cortex-A76) anyways, I did a quick bench on it. GCC 12 had a slight performance regression, but it is only by about 4% (12.3 -> 11.8 GB/s), and this is despite the original code having the slower `MADD`. Might be some ordering shenanigans that can eventually be worked out. Clang 15 did increase performance slightly (10.3 -> 10.8 GB/s). Still not optimal, though; this might be worth looking into in the future. Neither of these performance changes is particularly bad, though. |
This fixes GCC 12 on macOS ARM64, as well as the existing s390x zVector bug that was previously worked around with a `__sync_synchronize()`, by using `__attribute__((may_alias))`, which excludes a type from strict-aliasing optimizations. `vld1q`/`vst1q` are still avoided, due to older compiler versions having issues with load-store optimizations. This is not needed on x86, because the SSE types are impossible to not alias.
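As a rough sketch of the `may_alias` approach (the typedef and helper names here are illustrative, not necessarily the ones in the actual patch):

```c
#include <arm_neon.h>

/* Tagging a vector type with may_alias tells GCC/Clang that accesses through
 * it may alias any other object, so strict-aliasing optimizations cannot
 * assume the accumulator buffer is untouched by these loads/stores. */
typedef uint64x2_t aliasing_u64x2 __attribute__((may_alias));

/* Load two 64-bit accumulators through the aliasing type; this sidesteps the
 * strict-aliasing violation without needing vld1q_u64(), which older
 * compilers handle poorly as noted above. */
static inline uint64x2_t load_acc_pair(const void* ptr)
{
    return *(const aliasing_u64x2*)ptr;
}
```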
Killed two birds with one stone. The issue was in fact the strict aliasing violation from the vector accesses. The `__sync_synchronize()` is no longer necessary. I think this is good for now. |
I made some measurements of this version on some ARM devices I have around, and here is what I found:
So this PR seems to provide more benefits to the Scalar mode. |
Yes. The benefits to scalar also apply to hybrid NEON, primarily on the in-order cores, since they can't dispatch the scalar instructions for free unless the NEON pipelines are waiting.

It also affects GCC 10 and below. Since the implementation of `vget_low_u32`/`vget_high_u32` doesn't get folded into `vmlal_u32` there, you get this:

```asm
// (Literal vget_low_u32)
dup d4, v1.d[0]
dup d5, v2.d[0]
umlal v0.2d, v4.2s, v5.2s
// (Literal vget_high_u32)
dup d4, v1.d[1]
dup d5, v2.d[1]
umlal v3.2d, v4.2s, v5.2s
```

instead of the intended result of this:

```asm
umlal v0.2d, v1.2s, v2.2s
umlal2 v3.2d, v1.4s, v2.4s
```

Also, the scalar changes make it go from this:

```asm
lsr x8, x1, #32
and x9, x1, #0xFFFFFFFF
// stalls the multiplier for 2c on little cores
madd x0, x8, x9, x0
```

to this:

```asm
lsr x8, x1, #32
umaddl x0, w8, w1, x0
```

which avoids both the mask and the 2-cycle multiplier stall on little cores.
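For completeness, a hedged sketch of the inline-assembly `UMLAL` helper described above (the function name is illustrative, and the real xxHash code may differ):

```c
#include <arm_neon.h>

/* Emit UMLAL directly: widen the two low u32 lanes of lhs and rhs, multiply
 * them, and accumulate into the two u64 lanes of acc. Writing the instruction
 * in inline assembly keeps GCC 10 and below from materializing
 * vget_low_u32/vget_high_u32 as separate DUP instructions. */
static inline uint64x2_t
umlal_low_u32(uint64x2_t acc, uint32x4_t lhs, uint32x4_t rhs)
{
#if defined(__GNUC__) && defined(__aarch64__)
    __asm__("umlal %0.2d, %1.2s, %2.2s"
            : "+w" (acc)
            : "w" (lhs), "w" (rhs));
    return acc;
#else
    /* Intrinsic fallback; newer compilers fold this correctly. */
    return vmlal_u32(acc, vget_low_u32(lhs), vget_low_u32(rhs));
#endif
}
```
|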
I think I know the exact reason why the `umlal` must come before the add.
Normally, the most logical thing would be to put the slowest instructions last, so the compiler puts the `umlal` last. However, on the pipeline level, this is the issue: in the double-width NEON path, the two …
When the … |
We could do per-CPU dispatch. With … |
I don't think we can do that with the pure function guarantee. |
I think that since the result is the same no matter which implementation we use, it's fine despite being marked pure. |
Actually, this is seemingly impossible to do safely. |