
Various AArch64 speed hacks #809
Merged: 3 commits into Cyan4973:dev, Mar 11, 2023

Conversation

@easyaspi314 (Contributor) commented Mar 2, 2023

- Use inline assembly to force `UMADDL` in the scalarRound code (a minimal sketch follows this list)
  - Prevents the slow 64-bit `MADD`, which has +4c latency and a 2c stall on little cores.
  - Also avoids a separate masking instruction.
  - Improves scalar speed even on big cores, and should also improve mixed NEON speed on those little cores, although I haven't tested it.
- Use inline assembly for `vmlal_u32`
  - Fixes NEON performance being garbage on GCC < 11, which can't fold `vget_low/high` into `vmlal` (its arm_neon.h uses inline assembly).
  - Documents the presumed reason why `umlal` must come before `add`.
  - Doubles as Clang's asm guard. (The old asm guard is still used on the single-vector path.)
  - Much cleaner in the code itself.
- Fix header detection
  - Include arm_neon.h when `__ARM_FEATURE_SVE` is defined.
  - Don't include arm_neon.h if only `__aarch64__` is defined (e.g. `-march=armv8-a+nofp`).
  - Don't include arm_neon.h on pre-ARMv7 MSVC (which would only happen on old WinCE toolchains).
- Fix the generateSecret MOVK hack
  - Add it for GCC, which now does the same thing.
  - Remove the assert, because the change to use `unreachable()` undoes it.
- Make asm guards non-volatile to allow reordering.
- Fix GCC strict aliasing issues with both NEON and VSX (same bug). `__sync_synchronize()` is no longer necessary.
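A minimal sketch of the UMADDL trick (the names and the fallback here are illustrative, not necessarily the committed code):

```c
#include <stdint.h>

/* Multiply the low 32 bits of lhs by the low 32 bits of rhs, then add acc,
 * as a single UMADDL on AArch64. The fallback shows what the compiler would
 * otherwise express as a mask plus a 64-bit MADD. */
#if defined(__GNUC__) && defined(__aarch64__)
static inline uint64_t mult32to64_add64(uint64_t lhs, uint64_t rhs, uint64_t acc)
{
    uint64_t ret;
    __asm__("umaddl %x0, %w1, %w2, %x3"
            : "=r" (ret)
            : "r" (lhs), "r" (rhs), "r" (acc));
    return ret;
}
#else
#  define mult32to64_add64(lhs, rhs, acc) \
       ((uint64_t)(uint32_t)(lhs) * (uint64_t)(uint32_t)(rhs) + (acc))
#endif
```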

@easyaspi314 commented Mar 3, 2023

Hmm... Interestingly, I also get up to 21 GB/s (instead of ~14, and almost as good as what I believe is optimal for XXH3, 23 GB/s) if I put a compiler guard at the end of XXH64_round(), which forces the compiler to emit the round instructions sequentially. However, my phone's processor has a fast multiplier.
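For context, "a compiler guard at the end of XXH64_round()" means roughly the following (a sketch with the public XXH64 constants spelled out; the exact guard placement is what is being experimented with here):

```c
#include <stdint.h>

#define XXH_PRIME64_1 0x9E3779B185EBCA87ULL
#define XXH_PRIME64_2 0xC2B2AE3D27D4EB4FULL
#define XXH_rotl64(x, r) (((x) << (r)) | ((x) >> (64 - (r))))

/* The empty asm ties acc to a general-purpose register and marks it as
 * modified, which discourages the compiler from interleaving the four
 * round computations of the bulk loop. */
static uint64_t XXH64_round(uint64_t acc, uint64_t input)
{
    acc += input * XXH_PRIME64_2;
    acc  = XXH_rotl64(acc, 31);
    acc *= XXH_PRIME64_1;
    __asm__("" : "+r" (acc));   /* the compiler guard */
    return acc;
}
```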

```asm
// Before
.L197:
        ldp     x7, x4, [x5]
        ldp     x3, x6, [x5, #16]
        add     x5, x5, #32
        madd    x7, x7, x9, x12
        madd    x4, x4, x9, x13
        madd    x3, x3, x9, x10
        madd    x6, x6, x9, x1
        ror     x7, x7, #33
        ror     x4, x4, #33
        ror     x3, x3, #33
        mul     x12, x7, x8
        ror     x6, x6, #33
        mul     x13, x4, x8
        mul     x10, x3, x8
        mul     x1, x6, x8
        cmp     x2, x5
        bhi     .L197
```

```asm
// After
.L197:
        ldr     x7, [x3]
        madd    x7, x7, x11, x10
        ror     x7, x7, #33
        mul     x10, x7, x9
        ldr     x6, [x3, #8]
        madd    x6, x6, x11, x12
        ror     x6, x6, #33
        mul     x12, x6, x9
        ldr     x6, [x3, #16]
        madd    x4, x6, x11, x4
        ror     x4, x4, #33
        mul     x4, x4, x9
        ldr     x5, [x3, #24]
        madd    x5, x5, x11, x8
        ror     x5, x5, #33
        mul     x8, x5, x9
        add     x3, x3, #32
        cmp     x1, x3
        bhi     .L197
```

While it is bad that the loads are split into `ldr`s instead of `ldp`s, this might be better even on chips with small multipliers.

As mentioned before, after a 64-bit multiply, the multiply pipeline is stalled for 2 cycles, so an in-order chip will be forced to wait 2 cycles before it can begin dispatching the next multiply.

Also, all the small-multiplier chips I know of only have one multiply pipeline, so sequential multiplies can't be dual-issued.

So, as a result, the first code output would have ~12 stall cycles from the sequential multiplies, which adds up to be almost as bad as the latency from the dependencies. 🤔

I'm gonna have to boot up my old phones and tablets to test I guess.

TL;DR: the ARM pipeline never ceases to both amaze me and frustrate me.

Edit: Nope, the Cortex-A53 does not like that one bit. Goes from 1.6 GB/s to 993 MB/s on GCC 12.

Also, this is actually the same effect that XXH32 has: XXH32 gets a whopping 660 MB/s here, while the same ordering as the "Before" listing, just with 32-bit operations, gets 1.6 GB/s like XXH64.

If I run the same code on my good phone, it goes from 10 GB/s to 7 GB/s.

@easyaspi314 commented Mar 3, 2023

Cortex-A53 results are in.

XXH3_64b 100KB benchmark:

| Compiler | old XXH_SCALAR | old XXH_NEON | new XXH_SCALAR | new XXH_NEON |
| --- | --- | --- | --- | --- |
| GCC 12 | 1903 MB/s | 2263 MB/s | 2267 MB/s | 2450 MB/s |
| GCC 10 | 1825 MB/s | 1877 MB/s | 2289 MB/s | 2065 MB/s |
| Clang 15 | 1783 MB/s | 1946 MB/s | 2055 MB/s | 2035 MB/s |
| Clang 11 | 1717 MB/s | 1819 MB/s | 1917 MB/s | 1964 MB/s |

Clang is really bad at interleaving XXH3 (even with -mtune=cortex-a53), which greatly impacts things on an in-order processor.

Additionally, 2.4 GB/s is what I get from assembly so that is pretty dope.

@easyaspi314 commented Mar 3, 2023

The XXH32/XXH64 ordering thing interests me, as the optimal ordering varies wildly with the processor.

Just take XXH32.

| Kind | Speed (A53) | Speed (X1) |
| --- | --- | --- |
| NEON | 1.8 GB/s | 3.5 GB/s |
| Scalar, rounds in order | 660 MB/s | 10 GB/s |
| Scalar, rounds interleaved | 1.6 GB/s | 7.7 GB/s |

I wonder if the optimal play is to go for the happy medium. A 2.4x speedup on the A53 is definitely more significant than a 30% performance decrease on the X1. 🤔

(It also sucks that it is impossible to dispatch on the core type either — while you can potentially figure it out temporarily with some syscalls, that can change at any time, since you can be rescheduled from a big core to a little core and back.)

@Cyan4973 (Owner) commented Mar 3, 2023

> I wonder if the optimal play is to go for the happy medium. 🤔

It seems so.

@easyaspi314 commented Mar 5, 2023

Just realized that it could be possible for the compiler to trigger the Cortex-A53 multiply-accumulate erratum with the umaddl. My tablet supposedly has this erratum and hasn't seen an issue, but the fix is to prevent a load/store from being placed directly before a multiply-accumulate (usually by inserting a nop). However, I have an idea that ensures this with the least penalty: use that stall cycle to do the shift, which, if I use the variable form, is always 1 cycle even on the A53 line. It's still not great having the dependency there, but I'd rather that than the multiply result randomly being corrupted.

Never mind, the conditions are never going to be met: it requires the multiply result to be forwarded directly to the next one. While this could happen if the compiler decided to unroll and reorder the adds and multiplies, it won't do that, because it is an inline assembly instruction and the compiler doesn't know it is commutative 😜


Additionally, one frustrating thing about XXH32 is that the best codegen (outside of -fno-tree-slp-vectorize) comes from putting all 4 guards at the bottom of the loop (sketched below), which is no problem on AArch64 with its 31 GPRs, but x86 will probably have issues because of how few registers it has.
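A sketch of that shape, with the xxhash.h internals (XXH32_round, the primes, the guard) spelled out for illustration:

```c
#include <stdint.h>
#include <string.h>

#define XXH_PRIME32_1 0x9E3779B1U
#define XXH_PRIME32_2 0x85EBCA77U
#define XXH_rotl32(x, r) (((x) << (r)) | ((x) >> (32 - (r))))
/* volatile at the time of this comment; see below about dropping it */
#define XXH_COMPILER_GUARD(var) __asm__ __volatile__("" : "+r" (var))

static uint32_t XXH32_round(uint32_t acc, uint32_t input)
{
    acc += input * XXH_PRIME32_2;
    acc  = XXH_rotl32(acc, 13);
    acc *= XXH_PRIME32_1;
    return acc;
}

static uint32_t readLE32(const uint8_t* p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* assumes a little-endian host */
    return v;
}

/* Bulk loop with all four guards grouped at the bottom, keeping the four
 * accumulators in registers across the whole loop body. */
static void XXH32_bulk(uint32_t acc[4], const uint8_t* input, const uint8_t* limit)
{
    uint32_t v1 = acc[0], v2 = acc[1], v3 = acc[2], v4 = acc[3];
    do {
        v1 = XXH32_round(v1, readLE32(input)); input += 4;
        v2 = XXH32_round(v2, readLE32(input)); input += 4;
        v3 = XXH32_round(v3, readLE32(input)); input += 4;
        v4 = XXH32_round(v4, readLE32(input)); input += 4;
        XXH_COMPILER_GUARD(v1);
        XXH_COMPILER_GUARD(v2);
        XXH_COMPILER_GUARD(v3);
        XXH_COMPILER_GUARD(v4);
    } while (input < limit);
    acc[0] = v1; acc[1] = v2; acc[2] = v3; acc[3] = v4;
}
```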

Also, guarding one or even two variables isn't enough — Clang so aggressively wants to vectorize that it will literally attempt to scatter-gather load the lanes into half vectors if they aren't consecutive, which requires 2 slow load instructions and some pointer math, because NEON loads don't have immediate offsets.

@easyaspi314 commented Mar 5, 2023

Scratch that, I found a better solution for XXH32. The asm guards don't need to be marked volatile. The volatile was acting as a reordering barrier, since the compiler had to assume external side effects, when all we need to tell it is "put this in a register and treat it as modified somewhere between the two times it is accessed". This was also why Clang was comedically bad at interleaving XXH3: it simply couldn't.

It still doesn't reach the maximum XXH32 speed on my phone, but I'm willing to say that my phone is the weird one here, since the code that is optimal for it is bad for any in-order processor.
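The distinction, sketched with illustrative macro names:

```c
/* The volatile form claims unknown side effects, so the compiler will not
 * move other statements across it: it acts as a scheduling barrier. */
#define GUARD_VOLATILE(var) __asm__ __volatile__("" : "+r" (var))

/* The plain form only says "var lives in a register and is modified here";
 * surrounding instructions can still be reordered and interleaved freely. */
#define GUARD(var)          __asm__             ("" : "+r" (var))
```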

@Cyan4973 commented Mar 5, 2023

What's the current status of this PR? Still a work in progress?

Commit message:

The volatile in the asm guards was causing a reorder barrier which was preventing interleaving of XXH32 and XXH3 NEON.

Additionally, with the better asm guard, Clang does better with intrinsics.
@easyaspi314 commented Mar 5, 2023

That should do it. I removed the volatile from the asm guards and brought back the Clang guards.

This does cause a slight slowdown for XXH32 on the X1, but it's a great benefit on the A53. It should also benefit x86 XXH32.

Also, Clang's XXH3 is now (almost) on par with GCC's, as it can interleave now.

@easyaspi314 commented Mar 5, 2023

Also, I'd like to see how XXH3_NEON_LANES=6 works on Apple Silicon. With the reordering fixed, it could actually be faster than 8 NEON lanes.
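For anyone who wants to try it, XXH3_NEON_LANES is a compile-time knob; something along these lines (or -DXXH3_NEON_LANES=6 on the command line) should do it:

```c
/* Use 6 NEON lanes; the remaining accumulator lanes fall back to the scalar
 * path, which can run alongside the NEON pipelines. */
#define XXH3_NEON_LANES 6
#define XXH_INLINE_ALL
#include "xxhash.h"
```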

I'd test it myself on the compiler farm, but I have to wait for my new SSH key to sync 💀 ...wait, I think the key is on my old phone.

@easyaspi314 commented Mar 5, 2023

Update: I found my old SSH key, and the M1 still prefers 8 NEON lanes. GCC 12 oddly has a strict aliasing issue there which doesn't occur on Linux. I might need to add a may_alias attribute. Come to think of it, that might also be the source of the `__sync_synchronize()` issue on s390x.

Since I needed to boot my old phone (Snapdragon 730 / Cortex-A76) anyway, I did a quick bench on it. GCC 12 had a slight performance regression, but only by about 4% (12.3 -> 11.8). This is despite the original code having the slower madd.

There might be some ordering shenanigans that can eventually be worked out. Clang 15 did increase performance slightly (10.3 -> 10.8), though it's still not optimal. This might be worth looking into in the future.

Neither of these performance changes is particularly bad, though.

Commit message:

This fixes GCC 12 on macOS ARM64, as well as the existing s390x zVector bug that was worked around with a `__sync_synchronize()`, by using `__attribute__((may_alias))`, which excludes a type from strict aliasing optimizations.

`vld1q`/`vst1q` are still avoided due to older versions having issues with load-store optimizations.

This is not needed on x86 because the SSE types are impossible to not alias.
@easyaspi314 commented Mar 5, 2023

Killed two birds with one stone. The issue was in fact the strict aliasing violation from the `xacc` pointer, and it was the exact same bug that was breaking s390x and forcing a memory barrier.

The `xacc` pointers for NEON and VSX are now pointers to vectors carrying `__attribute__((may_alias))`, which excludes them from GCC's strict-aliasing-breaking "optimizations".
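Roughly, the shape of the fix (the typedef name here is illustrative):

```c
#include <arm_neon.h>

/* A NEON vector type that GCC must assume may alias other types, so casting
 * the accumulator array to a pointer of this type is no longer a
 * strict-aliasing violation the optimizer is allowed to exploit. */
typedef uint64x2_t aliasing_uint64x2_t __attribute__((may_alias));

/* inside the NEON accumulate loop (illustrative):
 *   aliasing_uint64x2_t* const xacc = (aliasing_uint64x2_t*) acc;
 */
```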

I think this is good for now.

@Cyan4973 commented:
I made some measurements of this version on some ARM devices around, and here is what I found:

| arm64 cpu | vector | dev | pr809 | diff |
| --- | --- | --- | --- | --- |
| M1 | NEON | 35.5 GB/s | 35.5 GB/s | none |
| M1 | Scalar | 20.0 GB/s | 22.2 GB/s | +11% |
| Snap8g1 | NEON | 25.5 GB/s | 26.6 GB/s | +4% |
| Snap8g1 | Scalar | 13.6 GB/s | 15.8 GB/s | +16% |

So this PR seems to provide more benefits to the Scalar mode.
Is that the expectation?

@easyaspi314 commented Mar 10, 2023

> I made some measurements of this version on some ARM devices around, and here is what I found:
>
> | arm64 cpu | vector | dev | pr809 | diff |
> | --- | --- | --- | --- | --- |
> | M1 | NEON | 35.5 GB/s | 35.5 GB/s | none |
> | M1 | Scalar | 20.0 GB/s | 22.2 GB/s | +11% |
> | Snap8g1 | NEON | 25.5 GB/s | 26.6 GB/s | +4% |
> | Snap8g1 | Scalar | 13.6 GB/s | 15.8 GB/s | +16% |
>
> So this PR seems to provide more benefits to the Scalar mode.
> Is that the expectation?

Yes. The benefits to scalar also apply to hybrid NEON, primarily on the in-order cores since they can't dispatch the scalar instructions for free unless the NEON pipelines are waiting.

It also affects GCC 10 and below: since their arm_neon.h implements `vmlal_u32` with inline assembly, GCC can't detect the pattern and merge the intrinsics. This caused it to output this garbage:

```asm
// (Literal vget_low_u32)
dup     d4, v1.d[0]
dup     d5, v2.d[0]
umlal   v0.2d, v4.2s, v5.2s
// (Literal vget_high_u32)
dup     d4, v1.d[1]
dup     d5, v2.d[1]
umlal   v3.2d, v4.2s, v5.2s
```

instead of the intended result:

```asm
umlal   v0.2d, v1.2s, v2.2s
umlal2  v3.2d, v1.4s, v2.4s
```
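The inline-assembly replacement is roughly of this shape (a sketch; the names and exact formatting may differ from the committed code):

```c
#include <arm_neon.h>

/* acc.2d += (low half of lhs) * (low half of rhs), forced to a single umlal. */
static inline uint64x2_t vmlal_low_u32_forced(uint64x2_t acc, uint32x4_t lhs, uint32x4_t rhs)
{
    __asm__("umlal %0.2d, %1.2s, %2.2s" : "+w" (acc) : "w" (lhs), "w" (rhs));
    return acc;
}

/* acc.2d += (high half of lhs) * (high half of rhs), forced to umlal2. */
static inline uint64x2_t vmlal_high_u32_forced(uint64x2_t acc, uint32x4_t lhs, uint32x4_t rhs)
{
    __asm__("umlal2 %0.2d, %1.4s, %2.4s" : "+w" (acc) : "w" (lhs), "w" (rhs));
    return acc;
}
```

Because the full 128-bit vectors are passed straight in, there is no `vget_low`/`vget_high` for GCC < 11 to fail to fold, and the "+w" operand on `acc` doubles as the guard.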

Also, the scalar changes take this:

```asm
    lsr     x8, x1, #32
    and     x9, x1, #0xFFFFFFFF
    // stalls the multiplier for 2c on little cores
    madd    x0, x8, x9, x0
```

and turn it into this:

```asm
    lsr     x8, x1, #32
    umaddl  x0, w8, w1, x0
```

which avoids the `and` instruction and doesn't stall the multiplier on little cores.

@easyaspi314 commented Mar 11, 2023

I think I know the exact reason why add must come after umlal.

umlal isn't fast, and the compiler is aware of that in its cost model. Compared to most NEON instructions, which execute in 2 cycles, umlal is a 5-cycle instruction, and it is additionally restricted to half of the NEON pipelines.

Normally, the most logical thing would be to put the slowest instructions last, so the compiler puts the umlals at the very end of the loop.

However, at the pipeline level, this is the issue: in the double-width NEON path, the two umlal(2) instructions both depend on the uzp1/uzp2 results; the only independent instruction is the add. Once the umlals are issued there is nothing else left to process, so there is nothing to do on the other NEON pipelines. All they can do is wait, and the loop effectively serializes while waiting for the pipelines that can execute umlal(2).

| F0 | F1 |
| --- | --- |
| add | uzp2 |
| umlal | add |
| (latency) | (idle) |
| umlal2 | (idle) |
| (latency) | (idle) |

When the umlal is forced to come first, the compiler reorders it so that it comes early in the loop, alongside instructions that can execute on the other pipelines. This means that the pipelines are almost always processing data, and we are at maximum efficiency.

@Cyan4973 merged commit a60bd05 into Cyan4973:dev on Mar 11, 2023
@goldsteinn (Contributor) commented Mar 28, 2023

> The XXH32/XXH64 ordering thing interests me, as the optimal ordering varies wildly with the processor.
>
> Just take XXH32.
>
> | Kind | Speed (A53) | Speed (X1) |
> | --- | --- | --- |
> | NEON | 1.8 GB/s | 3.5 GB/s |
> | Scalar, rounds in order | 660 MB/s | 10 GB/s |
> | Scalar, rounds interleaved | 1.6 GB/s | 7.7 GB/s |
>
> I wonder if the optimal play is to go for the happy medium. A 2.4x speedup on the A53 is definitely more significant than a 30% performance decrease on the X1. 🤔
>
> (It also sucks that it is impossible to dispatch on the core type either — while you can potentially figure it out temporarily with some syscalls, that can change at any time, since you can be rescheduled from a big core to a little core and back.)

We could do per-CPU dispatch. With rseq, getting the CPU id is just the cost of a single read. Something similar might also be preferable on Intel ADL machines.
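A sketch of what "a single read" looks like, assuming Linux with glibc >= 2.35 (which registers rseq per thread and exports __rseq_offset/__rseq_size) and a compiler providing __builtin_thread_pointer(); illustrative only, not a concrete proposal for xxHash:

```c
#include <stddef.h>
#include <sys/rseq.h>   /* struct rseq, __rseq_offset, __rseq_size (glibc >= 2.35) */

/* The kernel keeps rseq->cpu_id current for the running thread, so reading
 * the CPU number is a single load from thread-local memory, no syscall. */
static inline int current_cpu(void)
{
    if (__rseq_size == 0)
        return -1;  /* rseq not registered (unsupported or blocked); caller must fall back */
    struct rseq* rs = (struct rseq*)((char*)__builtin_thread_pointer() + __rseq_offset);
    return (int)rs->cpu_id;
}
```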

@easyaspi314 commented:
> We could do per-CPU dispatch. With rseq, getting the CPU id is just the cost of a single read. Something similar might also be preferable on Intel ADL machines.

I don't think we can do that with the pure function guarantee.

@goldsteinn commented:
> > We could do per-CPU dispatch. With rseq, getting the CPU id is just the cost of a single read. Something similar might also be preferable on Intel ADL machines.
>
> I don't think we can do that with the pure function guarantee.

I think that since the result is the same no matter which impl we use, it's fine despite being pure.

@easyaspi314 commented Apr 6, 2023

Actually, this is seemingly impossible to do safely. rseq is blocked by seccomp on Android (and presumably a few other distros). Trying to execute `rseq(NULL, 0, 0, 0)` will raise a SIGSYS and crash the program.
