Various AArch64 speed hacks #809
Conversation
- Use inline assembly to force `umaddl` in the scalarRound code
  - Prevents the slow 64-bit `MADD`, which has +4c latency and a 2c stall on little cores.
  - Also prevents a mask.
  - Improves scalar speed even on big cores; should also improve mixed NEON speed on those little cores, although I haven't tested it.
- Use inline assembly for `vmlal_u32`
  - Fixes NEON performance being garbage on GCC < 11, because it can't fold `vget_low`/`vget_high` into `vmlal` (arm_neon.h uses inline assembly).
  - Documents the presumed reason why `umlal` must come before the add.
  - Doubles as Clang's asm guard. (The old asm guard is still used on the single vector path.)
  - Much cleaner in the code itself.
- Fix header detection
  - Include arm_neon.h when `__ARM_FEATURE_SVE` is defined.
  - Don't include arm_neon.h if only `__aarch64__` is defined (e.g. `-march=armv8-a+nofp`).
  - Don't include arm_neon.h on < ARMv7 MSVC (which would only happen on old WinCE toolchains).
- Fix the generateSecret `MOVK` hack
  - Add it for GCC, which does that too now.
  - Remove the assert, because the change to use `unreachable()` undoes it.
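For illustration, a minimal sketch of the `umaddl` idea (the helper name and exact constraints here are illustrative; the actual macro in the patch may differ):

```c
#include <stdint.h>

/* Sketch: multiply the low 32 bits of lhs and rhs and accumulate into the
 * 64-bit acc with a single UMADDL, instead of letting the compiler widen
 * both operands and emit a 64-bit MADD plus an AND mask. */
static inline uint64_t mult32to64_add64(uint64_t lhs, uint64_t rhs, uint64_t acc)
{
#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
    uint64_t ret;
    __asm__("umaddl %x0, %w1, %w2, %x3"
            : "=r" (ret)
            : "r" (lhs), "r" (rhs), "r" (acc));
    return ret;
#else
    /* Portable fallback with explicit 32-bit truncation. */
    return (uint64_t)(uint32_t)lhs * (uint64_t)(uint32_t)rhs + acc;
#endif
}
```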
Hmm... interestingly, I also get up to 21 GB/s (instead of ~14, and almost as good as the AFAIK-optimal XXH3 at 23 GB/s) if I put a compiler guard at the end of XXH64_round(), which forces the compiler to emit the instructions sequentially. However, my phone's processor has a fast multiplier.

```asm
// Before
.L197:
ldp x7, x4, [x5]
ldp x3, x6, [x5, #16]
add x5, x5, #32
madd x7, x7, x9, x12
madd x4, x4, x9, x13
madd x3, x3, x9, x10
madd x6, x6, x9, x1
ror x7, x7, #33
ror x4, x4, #33
ror x3, x3, #33
mul x12, x7, x8
ror x6, x6, #33
mul x13, x4, x8
mul x10, x3, x8
mul x1, x6, x8
cmp x2, x5
bhi .L197
```

```asm
// After
.L197:
ldr x7, [x3]
madd x7, x7, x11, x10
ror x7, x7, #33
mul x10, x7, x9
ldr x6, [x3, #8]
madd x6, x6, x11, x12
ror x6, x6, #33
mul x12, x6, x9
ldr x6, [x3, #16]
madd x4, x6, x11, x4
ror x4, x4, #33
mul x4, x4, x9
ldr x5, [x3, #24]
madd x5, x5, x11, x8
ror x5, x5, #33
mul x8, x5, x9
add x3, x3, #32
cmp x1, x3
bhi .L197
```

While it is bad that the loads are split into single `ldr`s instead of `ldp` pairs, the per-lane ordering avoids back-to-back multiplies. As mentioned before, after a 64-bit multiply the multiply pipeline is stalled for 2 cycles, so an in-order chip is forced to wait 2 cycles before it can begin dispatching the next multiply. Also, all of the small-multiplier chips I know of have only one multiply pipeline, so sequential multiplies can't be dual-issued. As a result, the first code output would have ~12 stall cycles from the sequential multiplies, which adds up to almost as much as the latency from the dependencies. 🤔 I'm gonna have to boot up my old phones and tablets to test, I guess.

TL;DR: the ARM pipeline never ceases to both amaze me and frustrate me.

Edit: Nope, the Cortex-A53 does not like that one bit. It goes from 1.6 GB/s to 993 MB/s on GCC 12. This is actually the same effect XXH32 has: XXH32 gets a whopping 660 MB/s with this ordering, while the same ordering as the "Before" code (just with 32-bit operations) gets 1.6 GB/s, like XXH64. If I run the same code on my good phone, it goes from 10 GB/s to 7 GB/s. |
Cortex-A53 results are in.
Clang is really bad at interleaving XXH3 (even with …). Additionally, 2.4 GB/s is what I get from assembly, so that is pretty dope. |
The XXH32/XXH64 ordering thing interests me, as the optimizations vary wildly depending on the processor. Just take XXH32.
I wonder if the optimal play is to go for the happy medium. A 2.4x speedup on the A53 is definitely more significant than a 30% performance decrease on the X1. 🤔 (It also sucks that runtime dispatch isn't really possible here either: while you can potentially figure out the current core type with some syscalls, that can change at any time, since you can be rescheduled from a big core to a little core and back.) |
It seems so |
Just realized that it could be possible for the compiler to generate the Cortex-A53 erratum sequence with the `umaddl`. My tablet supposedly has this erratum and hasn't seen an issue, but the fix is to prevent the compiler from putting a load/store directly before a multiply-accumulate. (Usually a `nop` is inserted between them.)

Nevermind, the conditions are never going to be met: it requires the multiply result to be forwarded directly into the next multiply-accumulate. While this could happen if the compiler decided to unroll and reorder the adds and multiplies, it won't do that, because since it is an inline assembly instruction, it doesn't know the operation is commutative. 😜

Additionally, one frustrating thing about XXH32 is that the best codegen (outside of …). Also, guarding one or even two variables isn't enough: Clang so aggressively wants to vectorize that it will literally attempt to scatter-gather load the lanes into half vectors if they aren't consecutive, which requires 2 slow load instructions and some pointer math, because NEON loads don't have immediate offsets. |
Scratch that, I found a better solution for XXH32. The asm guards don't need to be marked `volatile`. That was causing a reorder barrier, since the compiler assumed external side effects, when all we need to tell it is "put this in a register and assume it was modified somewhere between the two times it is accessed". This was also why Clang was comedically bad at interleaving XXH3: it simply couldn't. It still doesn't get the max XXH32 speed on my phone, but I'm willing to say that my phone is the weird one, since the optimal code for it is bad for any in-order processor.
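For reference, a minimal sketch of what such a non-`volatile` guard can look like (the macro name is illustrative, not necessarily the one used in xxHash):

```c
#if defined(__GNUC__) || defined(__clang__)
/* An empty asm statement with a "+r" constraint: it forces `var` into a
 * register and tells the compiler the value may have changed here, which
 * blocks vectorization and value tracking of `var` without the full
 * reordering barrier that a `volatile` asm would impose. */
#  define GUARD_REG(var) __asm__("" : "+r" (var))
#else
#  define GUARD_REG(var) do { (void)(var); } while (0)
#endif
```
|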
What's the current status of this PR? Still a work in progress? |
The volatile in the asm guards was causing a reorder barrier which was preventing interleaving of XXH32 and XXH3 NEON. Additionally, with the better asm guard, Clang does better with intrinsics.
That should do it. I removed the `volatile` from the asm guards and brought back the Clang guards. This does cause a slight slowdown for XXH32 on the X1, but a great benefit on the A53. This should also benefit x86 XXH32. Also, Clang's XXH3 is now (almost) on par with GCC's, as it can interleave now. |
Also, I'd like to see how the M1 does with this. |
Update: I found my old SSH key, and the M1 still prefers 8 NEON lanes. GCC 12 oddly has a strict aliasing issue which doesn't occur on Linux. I might need to add a workaround for that.

Since I needed to boot my old phone (Snapdragon 730 / Cortex-A76) anyways, I did a quick bench on it. GCC 12 had a slight performance regression, but it is only by about 4% (12.3 -> 11.8 GB/s), and this is despite the original code having the slower `MADD`. Might be some ordering shenanigans that can eventually be worked out. Clang 15 did increase performance slightly (10.3 -> 10.8 GB/s). Still not optimal, though; this might be worth looking into in the future. Neither of these performance changes is particularly bad, though. |
This fixes GCC 12 on macOS ARM64, as well as the existing s390x zVector bug that was previously worked around with a `__sync_synchronize()`, by using `__attribute__((may_alias))`, which excludes a type from strict-aliasing optimizations. `vld1q`/`vst1q` are still avoided, due to older compiler versions having issues with load-store optimizations. This is not needed on x86, because the SSE types are impossible to not alias.
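As a rough sketch of the `may_alias` approach (the typedef and helper names here are illustrative, not necessarily the ones in the actual patch):

```c
#include <arm_neon.h>

/* Tagging a vector type with may_alias tells GCC/Clang that accesses through
 * it may alias any other object, so strict-aliasing optimizations cannot
 * assume the accumulator buffer is untouched by these loads/stores. */
typedef uint64x2_t aliasing_u64x2 __attribute__((may_alias));

/* Load two 64-bit accumulators through the aliasing type; this sidesteps the
 * strict-aliasing violation without needing vld1q_u64(), which older
 * compilers handle poorly as noted above. */
static inline uint64x2_t load_acc_pair(const void* ptr)
{
    return *(const aliasing_u64x2*)ptr;
}
```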
Killed two birds with one stone. The issue was in fact the strict aliasing violation from the vector accesses. The `__sync_synchronize()` is no longer necessary. I think this is good for now. |
I made some measurements of this version on some ARM devices I have around, and here is what I found:
So this PR seems to provide more benefits to the Scalar mode. |
Yes. The benefits to scalar also apply to hybrid NEON, primarily on the in-order cores, since they can't dispatch the scalar instructions for free unless the NEON pipelines are waiting.

It also affects GCC 10 and below. Since the implementation of `vget_low_u32`/`vget_high_u32` doesn't get folded into `vmlal_u32` there, you get this:

```asm
// (Literal vget_low_u32)
dup d4, v1.d[0]
dup d5, v2.d[0]
umlal v0.2d, v4.2s, v5.2s
// (Literal vget_high_u32)
dup d4, v1.d[1]
dup d5, v2.d[1]
umlal v3.2d, v4.2s, v5.2s
```

instead of the intended result of this:

```asm
umlal v0.2d, v1.2s, v2.2s
umlal2 v3.2d, v1.4s, v2.4s
```

Also, the scalar changes make it go from this:

```asm
lsr x8, x1, #32
and x9, x1, #0xFFFFFFFF
// stalls the multiplier for 2c on little cores
madd x0, x8, x9, x0
```

to this:

```asm
lsr x8, x1, #32
umaddl x0, w8, w1, x0
```

which avoids both the mask and the 2-cycle multiplier stall on little cores.
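For completeness, a hedged sketch of the inline-assembly `UMLAL` helper described above (the function name is illustrative, and the real xxHash code may differ):

```c
#include <arm_neon.h>

/* Emit UMLAL directly: widen the two low u32 lanes of lhs and rhs, multiply
 * them, and accumulate into the two u64 lanes of acc. Writing the instruction
 * in inline assembly keeps GCC 10 and below from materializing
 * vget_low_u32/vget_high_u32 as separate DUP instructions. */
static inline uint64x2_t
umlal_low_u32(uint64x2_t acc, uint32x4_t lhs, uint32x4_t rhs)
{
#if defined(__GNUC__) && defined(__aarch64__)
    __asm__("umlal %0.2d, %1.2s, %2.2s"
            : "+w" (acc)
            : "w" (lhs), "w" (rhs));
    return acc;
#else
    /* Intrinsic fallback; newer compilers fold this correctly. */
    return vmlal_u32(acc, vget_low_u32(lhs), vget_low_u32(rhs));
#endif
}
```
|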
I think I know the exact reason why the `umlal` must come before the add.
Normally, the most logical thing would be to put the slowest instructions last, so the compiler puts the `umlal` last. However, on the pipeline level, this is the issue: in the double-width NEON path, the two …
When the … |
We could do per-CPU dispatch. With … |
I don't think we can do that with the pure function guarantee. |
I think that since the result is the same no matter which implementation we use, it's fine despite being marked pure. |
Actually, this is seemingly impossible to do safely. |