
[NEON] Split XXH3 into 6 NEON lanes and 2 scalar lanes on aarch64 #632

Merged · 1 commit merged into Cyan4973:dev on Dec 1, 2021

Conversation

@easyaspi314 (Contributor) commented Dec 1, 2021

Instead of pure NEON, on AArch64 with size optimizations disabled, XXH3 will now use 6 NEON lanes and 2 scalar lanes by default.

This makes use of the otherwise inactive integer pipeline and results in a massive speedup, especially with the previously underperforming GCC.

Note that this doesn't benefit ARMv7-A in the slightest.

This can be configured with the XXH3_NEON_LANES macro, which can be either 2, 4, 6, or 8.
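
For example, a minimal sketch of how the split can be tuned at build time, assuming the single-translation-unit `XXH_INLINE_ALL` usage (the macro can also be passed on the compiler command line, e.g. `-DXXH3_NEON_LANES=8`):

```c
/* Override the NEON/scalar lane split before pulling in the implementation.
 * XXH3_NEON_LANES counts how many of XXH3's 8 accumulator lanes use NEON;
 * the remaining lanes take the scalar path. */
#define XXH3_NEON_LANES 6   /* 2, 4, 6, or 8; 6 is the new AArch64 default */
#define XXH_INLINE_ALL
#include "xxhash.h"

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char msg[] = "hello, xxh3";
    XXH64_hash_t h = XXH3_64bits(msg, strlen(msg));
    printf("XXH3_64b: %016llx\n", (unsigned long long)h);
    return 0;
}
```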

Google Pixel 4a (2.21 GHz Snapdragon 730/Cortex-A76), xxhsum -b

|          |    Before   |     After    |  Diff. |
|----------|-------------|--------------|--------|
| GCC 11.1 | 7434.8 MB/s |  9814.9 MB/s | +32.0% |
| Clang 13 | 8788.4 MB/s | 10158.2 MB/s | +15.5% |
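
To illustrate the idea itself, here is a hypothetical, heavily simplified sketch of a split accumulation loop; it is not the real XXH3 round, and the helper name `accumulate_split` is made up, but it shows how 6 lanes can stay on NEON while 2 lanes run through the general-purpose registers:

```c
/* Illustrative sketch only, not the actual XXH3 round.
 * Lanes 0..5 are processed with NEON intrinsics, lanes 6..7 with plain
 * integer code, so an out-of-order core can keep both the SIMD and the
 * scalar pipelines busy at the same time. */
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>

#define SKETCH_NEON_LANES 6   /* hypothetical analogue of XXH3_NEON_LANES */

static void accumulate_split(uint64_t acc[8],
                             const uint64_t input[8],
                             const uint64_t secret[8])
{
    size_t i;
    /* NEON part: three uint64x2_t vectors cover lanes 0..5. */
    for (i = 0; i < SKETCH_NEON_LANES; i += 2) {
        uint64x2_t a = vld1q_u64(acc + i);
        uint64x2_t d = veorq_u64(vld1q_u64(input + i), vld1q_u64(secret + i));
        vst1q_u64(acc + i, vaddq_u64(a, d));
    }
    /* Scalar part: lanes 6..7 use the integer registers, which would
     * otherwise sit idle while the NEON loop above does its work. */
    for (i = SKETCH_NEON_LANES; i < 8; i++) {
        acc[i] += input[i] ^ secret[i];
    }
}
#endif /* __aarch64__ && __ARM_NEON */
```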

@easyaspi314 (Contributor, Author) commented Dec 1, 2021

And with that:

  • We finally break the 10 GB/s barrier on my phone
  • GCC is finally pulling its own weight
  • My phone now beats my i5-430M laptop (it gets 9.8 GB/s on Clang 13)

Unfortunately, this doesn't seem to benefit x86_64 at all when I try the same split with SSE2.

@Cyan4973 (Owner) commented Dec 1, 2021

Somehow, this feels like a distant cousin of heterogeneous multi-core processing, though at the more granular level of instruction ports.
It's nice, innovative, and interesting; I like it.

@easyaspi314 (Contributor, Author) commented Dec 1, 2021

It's Quake for the Pentium all over again 😅

I honestly don't know why I didn't think of that before; I knew most ARM chips can execute ARM and NEON instructions at the same time.

@easyaspi314 (Contributor, Author) commented:

Should I move the scalar variant to the top?

@Cyan4973 (Owner) commented Dec 1, 2021

> Should I move the scalar variant to the top?

So that they stand closer to their component `XXH3_scalar*Round()` functions?
Yes, maybe, but this can wait for another PR.

@easyaspi314 (Contributor, Author) commented Dec 1, 2021

Well, I am already moving the round implementation to the top because I need to use it in both paths, and it isn't used only as the last `#else` case.

@Cyan4973 (Owner) commented Dec 1, 2021

You could also declare them at the top, and keep their implementation in the scalar category.

Whatever seems simpler / more readable.

@Cyan4973 merged commit 4f38270 into Cyan4973:dev on Dec 1, 2021
@easyaspi314 (Contributor, Author) commented Dec 1, 2021

> You could also declare them at the top, and keep their implementation in the scalar category.
> Whatever seems simpler / more readable.

I did the forward decl for now.
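
For reference, a hedged sketch of that forward-declaration pattern; the exact signature below is an assumption for illustration, only the `XXH3_scalarRound` name comes from the discussion above:

```c
#include <stddef.h>

/* Declared near the top of the file so that both the NEON accumulate loop
 * (for the lanes past XXH3_NEON_LANES) and the pure scalar path can call it. */
static void XXH3_scalarRound(void* acc, const void* input,
                             const void* secret, size_t lane);

/* ... NEON code path: handles lanes 0..XXH3_NEON_LANES-1 with intrinsics,
 * then calls XXH3_scalarRound() for the remaining lanes ... */

/* Definition stays grouped with the other scalar routines further down. */
static void XXH3_scalarRound(void* acc, const void* input,
                             const void* secret, size_t lane)
{
    /* per-lane scalar mixing would go here */
    (void)acc; (void)input; (void)secret; (void)lane;
}
```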

I also added some explanations of ARM's uop dispatcher for why it works.

Or, if you want the TL;DR:
*(attached image: arm all pipelines.jpg)*

Edit: I guess I was too slow lol, I pushed after the PR was merged 😂

@Cyan4973 (Owner) commented Dec 1, 2021

Just make another PR ;)
Your additional explanations are well worth integrating.

@easyaspi314 (Contributor, Author) commented:

I should test this on the in-order base model Cortex-A53. My mom has a Verizon Ellipsis 10 with one.

But yeah, I'll make a new PR with some new documentation.

@easyaspi314 (Contributor, Author) commented Dec 1, 2021

Cortex-A53 test results

"Mom, can I have an AArch64 CPU?"
"We already have an AArch64 CPU at home."
The AArch64 CPU at home:

$ ./xxhsum-full-neon -b
xxhsum-full-neon 0.8.1 by Yann Collet 
compiled as 64-bit aarch64 + NEON little endian with Clang 13.0.1  
Sample of 100 KB...        
 1#XXH32                         :     102400 ->     7130 it/s (  696.3 MB/s)   
 3#XXH64                         :     102400 ->    17398 it/s ( 1699.0 MB/s)   
 5#XXH3_64b                      :     102400 ->    19757 it/s ( 1929.4 MB/s)   
11#XXH128                        :     102400 ->    19737 it/s ( 1927.4 MB/s)   
$ ./xxhsum-6-neon -b
xxhsum-6-neon 0.8.1 by Yann Collet 
compiled as 64-bit aarch64 + NEON little endian with Clang 13.0.1  
Sample of 100 KB...        
 1#XXH32                         :     102400 ->     7128 it/s (  696.1 MB/s)   
 3#XXH64                         :     102400 ->    17400 it/s ( 1699.2 MB/s)   
 5#XXH3_64b                      :     102400 ->    19918 it/s ( 1945.1 MB/s)   
11#XXH128                        :     102400 ->    19907 it/s ( 1944.1 MB/s)   

So there are no major drawbacks between the two builds, which is good.
But that XXH32 result is interesting in its own right.

@Cyan4973 (Owner) commented Dec 1, 2021

Meaning, XXH32 is rather slower than expected?

@easyaspi314 (Contributor, Author) commented Dec 2, 2021

> Meaning, XXH32 is rather slower than expected?

Actually, I forgot that XXH32 was slower on the older CPUs, lol. Although I do see a slight load/store optimization I can make.

I also tested this PR on my old Pixel 2 XL (Snapdragon 835/Cortex-A73). It has the older 3-micro-op dispatch buffer and only gets a ~4% speedup from this (5.1 -> 5.3 GB/s).

So for the three main dispatch styles, we don't ever lose performance, as I predicted.

  • In-order dual issue (A53): no change
  • OoO 3 micro-op (A73): slight improvement
  • OoO 8 micro-op (A76): big improvement

I will make a note of it.
