
[NEON] Split XXH3 into 6 NEON lanes and 2 scalar lanes on aarch64 #632

Merged · 1 commit merged into Cyan4973:dev on Dec 1, 2021

Conversation

@easyaspi314 (Contributor) commented Dec 1, 2021

Instead of pure NEON, on AArch64 with size optimizations disabled, XXH3 will now use 6 NEON lanes and 2 scalar lanes by default.

This makes use of the otherwise inactive integer pipeline and results in a massive speedup, especially with the previously underperforming GCC.

Note that this doesn't benefit ARMv7-A in the slightest.

This can be configured with the XXH3_NEON_LANES macro, which can be either 2, 4, 6, or 8.
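
For example, a minimal sketch of how the split can be tuned at build time, assuming the single-translation-unit `XXH_INLINE_ALL` usage (the macro can also be passed on the compiler command line, e.g. `-DXXH3_NEON_LANES=8`):

```c
/* Override the NEON/scalar lane split before pulling in the implementation.
 * XXH3_NEON_LANES counts how many of XXH3's 8 accumulator lanes use NEON;
 * the remaining lanes take the scalar path. */
#define XXH3_NEON_LANES 6   /* 2, 4, 6, or 8; 6 is the new AArch64 default */
#define XXH_INLINE_ALL
#include "xxhash.h"

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char msg[] = "hello, xxh3";
    XXH64_hash_t h = XXH3_64bits(msg, strlen(msg));
    printf("XXH3_64b: %016llx\n", (unsigned long long)h);
    return 0;
}
```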

Google Pixel 4a (2.21 GHz Snapdragon 730/Cortex-A76), xxhsum -b

|          |    Before   |     After    |  Diff. |
|----------|-------------|--------------|--------|
| GCC 11.1 | 7434.8 MB/s |  9814.9 MB/s | +32.0% |
| Clang 13 | 8788.4 MB/s | 10158.2 MB/s | +15.5% |
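
To illustrate the idea itself, here is a hypothetical, heavily simplified sketch of a split accumulation loop; it is not the real XXH3 round, and the helper name `accumulate_split` is made up, but it shows how 6 lanes can stay on NEON while 2 lanes run through the general-purpose registers:

```c
/* Illustrative sketch only, not the actual XXH3 round.
 * Lanes 0..5 are processed with NEON intrinsics, lanes 6..7 with plain
 * integer code, so an out-of-order core can keep both the SIMD and the
 * scalar pipelines busy at the same time. */
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>

#define SKETCH_NEON_LANES 6   /* hypothetical analogue of XXH3_NEON_LANES */

static void accumulate_split(uint64_t acc[8],
                             const uint64_t input[8],
                             const uint64_t secret[8])
{
    size_t i;
    /* NEON part: three uint64x2_t vectors cover lanes 0..5. */
    for (i = 0; i < SKETCH_NEON_LANES; i += 2) {
        uint64x2_t a = vld1q_u64(acc + i);
        uint64x2_t d = veorq_u64(vld1q_u64(input + i), vld1q_u64(secret + i));
        vst1q_u64(acc + i, vaddq_u64(a, d));
    }
    /* Scalar part: lanes 6..7 use the integer registers, which would
     * otherwise sit idle while the NEON loop above does its work. */
    for (i = SKETCH_NEON_LANES; i < 8; i++) {
        acc[i] += input[i] ^ secret[i];
    }
}
#endif /* __aarch64__ && __ARM_NEON */
```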

@easyaspi314 (Contributor, Author) commented Dec 1, 2021

And with that:

  • We finally break the 10 GB/s barrier on my phone
  • GCC is finally pulling its own weight
  • My phone now beats my i5-430M laptop (it gets 9.8 GB/s on Clang 13)

Unfortunately, this doesn't seem to benefit x86_64 at all when I try the same split with SSE2.

@Cyan4973 (Owner) commented Dec 1, 2021

Somehow, this feels like a distant cousin of heterogeneous multi-core processing, though at the more granular level of instruction ports.
It's nice, innovative, and interesting; I like it.

@easyaspi314 (Contributor, Author) commented Dec 1, 2021

It's Quake for the Pentium all over again 😅

I honestly don't know why I didn't think of that before; I knew most ARM chips can execute ARM and NEON instructions at the same time.

@easyaspi314 (Contributor, Author) commented:

Should I move the scalar variant to the top?

@Cyan4973 (Owner) commented Dec 1, 2021

> Should I move the scalar variant to the top?

So that they stand closer to their component `XXH3_scalar*Round()` functions?
Yes, maybe, but this can wait for another PR.

@easyaspi314 (Contributor, Author) commented Dec 1, 2021

Well, I am already moving the round implementation to the top because I need to use it in both paths, and it isn't used only as the last `#else` case.

@Cyan4973 (Owner) commented Dec 1, 2021

You could also declare them at the top, and keep their implementation in the scalar category.

Whatever seems simpler / more readable.

@Cyan4973 merged commit 4f38270 into Cyan4973:dev on Dec 1, 2021
@easyaspi314 (Contributor, Author) commented Dec 1, 2021

> You could also declare them at the top, and keep their implementation in the scalar category.
> Whatever seems simpler / more readable.

I did the forward decl for now.
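
For reference, a hedged sketch of that forward-declaration pattern; the exact signature below is an assumption for illustration, only the `XXH3_scalarRound` name comes from the discussion above:

```c
#include <stddef.h>

/* Declared near the top of the file so that both the NEON accumulate loop
 * (for the lanes past XXH3_NEON_LANES) and the pure scalar path can call it. */
static void XXH3_scalarRound(void* acc, const void* input,
                             const void* secret, size_t lane);

/* ... NEON code path: handles lanes 0..XXH3_NEON_LANES-1 with intrinsics,
 * then calls XXH3_scalarRound() for the remaining lanes ... */

/* Definition stays grouped with the other scalar routines further down. */
static void XXH3_scalarRound(void* acc, const void* input,
                             const void* secret, size_t lane)
{
    /* per-lane scalar mixing would go here */
    (void)acc; (void)input; (void)secret; (void)lane;
}
```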

I also added some explanations of ARM's uop dispatcher for why it works.

Or, if you want the TL;DR:
*(attached image: arm all pipelines.jpg)*

Edit: I guess I was too slow lol, I pushed after the PR was merged 😂

@Cyan4973 (Owner) commented Dec 1, 2021

Just make another PR ;)
Your additional explanations are well worth integrating.

@easyaspi314 (Contributor, Author) commented:

I should test this on the in-order base model Cortex-A53. My mom has a Verizon Ellipsis 10 with one.

But yeah, I'll make a new PR with some new documentation.

@easyaspi314 (Contributor, Author) commented Dec 1, 2021

Cortex-A53 test results

"Mom, can I have an AArch64 CPU?"
"We already have an AArch64 CPU at home."
The AArch64 CPU at home:

$ ./xxhsum-full-neon -b
xxhsum-full-neon 0.8.1 by Yann Collet 
compiled as 64-bit aarch64 + NEON little endian with Clang 13.0.1  
Sample of 100 KB...        
 1#XXH32                         :     102400 ->     7130 it/s (  696.3 MB/s)   
 3#XXH64                         :     102400 ->    17398 it/s ( 1699.0 MB/s)   
 5#XXH3_64b                      :     102400 ->    19757 it/s ( 1929.4 MB/s)   
11#XXH128                        :     102400 ->    19737 it/s ( 1927.4 MB/s)   
$ ./xxhsum-6-neon -b
xxhsum-6-neon 0.8.1 by Yann Collet 
compiled as 64-bit aarch64 + NEON little endian with Clang 13.0.1  
Sample of 100 KB...        
 1#XXH32                         :     102400 ->     7128 it/s (  696.1 MB/s)   
 3#XXH64                         :     102400 ->    17400 it/s ( 1699.2 MB/s)   
 5#XXH3_64b                      :     102400 ->    19918 it/s ( 1945.1 MB/s)   
11#XXH128                        :     102400 ->    19907 it/s ( 1944.1 MB/s)   

So there are no major drawbacks between the two builds, which is good.
But that XXH32 result is interesting in its own right.

@Cyan4973 (Owner) commented Dec 1, 2021

Meaning, XXH32 is rather slower than expected?

@easyaspi314 (Contributor, Author) commented Dec 2, 2021

> Meaning, XXH32 is rather slower than expected?

Actually, I forgot that XXH32 was slower on the older CPUs, lol. Although I do see a slight load/store optimization I can make.

I also tested this PR on my old Pixel 2 XL (Snapdragon 835/Cortex-A73). It has the older 3-micro-op dispatch buffer and only gets a ~4% speedup from this (5.1 -> 5.3 GB/s).

So for the three main dispatch styles, we don't ever lose performance, as I predicted.

  • In-order dual issue (A53): no change
  • OoO 3 micro-op (A73): slight improvement
  • OoO 8 micro-op (A76): big improvement

I will make a note of it.
