[NEON] Split XXH3 into 6 NEON lanes and 2 scalar lanes on aarch64 #632
Instead of pure NEON, on AArch64 with size optimizations disabled, XXH3 will now use 6 lanes with NEON and 2 lanes with scalar by default. This makes use of the otherwise inactive integer pipeline and results in a massive speedup, especially with the previously underperforming GCC.

Note that this doesn't benefit ARMv7-a in the slightest.

This can be configured with the `XXH3_NEON_LANES` macro, which can be either 2, 4, 6, or 8.

Google Pixel 4a (2.21 GHz Snapdragon 730/Cortex-A76), `xxhsum -b`:

| Compiler | Before | After | Diff. |
|---|---|---|---|
| GCC 11.1 | 7434.8 MB/s | 9814.9 MB/s | +32.0% |
| Clang 13 | 8788.4 MB/s | 10158.2 MB/s | +15.5% |
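To make the lane split concrete, here is a minimal, portable sketch of the idea. It is not the code from this PR: the round function is a simplified model of an XXH3-style accumulate step, and every name in it (`scalar_round`, `vector_part`, `scalar_part`, `accumulate_split`, `split_matches_scalar`, `ACC_NB`) is hypothetical. The "vector" half is plain C standing in for NEON so the example compiles anywhere; the point is only that the first N lanes and the remaining lanes are independent work that can be handed to two different pipelines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ACC_NB 8  /* 64-bit accumulator lanes per stripe (assumed for this sketch) */

/* One lane of a simplified accumulate round (illustrative model only). */
static void scalar_round(uint64_t acc[ACC_NB], const uint64_t input[ACC_NB],
                         const uint64_t secret[ACC_NB], size_t lane)
{
    uint64_t data_val = input[lane];
    uint64_t data_key = data_val ^ secret[lane];
    acc[lane ^ 1] += data_val;                                 /* feeds the paired lane */
    acc[lane]     += (data_key & 0xFFFFFFFFu) * (data_key >> 32); /* 32x32 -> 64 multiply */
}

/* Stand-in for the NEON path: lanes [0, vec_lanes), two at a time,
 * mirroring how a 128-bit register holds a pair of 64-bit lanes. */
static void vector_part(uint64_t acc[ACC_NB], const uint64_t input[ACC_NB],
                        const uint64_t secret[ACC_NB], size_t vec_lanes)
{
    for (size_t lane = 0; lane < vec_lanes; lane += 2) {
        scalar_round(acc, input, secret, lane);
        scalar_round(acc, input, secret, lane + 1);
    }
}

/* Integer-pipeline path: the remaining lanes [first_lane, ACC_NB). */
static void scalar_part(uint64_t acc[ACC_NB], const uint64_t input[ACC_NB],
                        const uint64_t secret[ACC_NB], size_t first_lane)
{
    for (size_t lane = first_lane; lane < ACC_NB; lane++)
        scalar_round(acc, input, secret, lane);
}

/* vec_lanes plays the role of a lane-count setting such as 2, 4, 6, or 8. */
static void accumulate_split(uint64_t acc[ACC_NB], const uint64_t input[ACC_NB],
                             const uint64_t secret[ACC_NB], size_t vec_lanes)
{
    /* Lanes are paired by (lane ^ 1), so the split must fall on an even boundary. */
    assert(vec_lanes % 2 == 0 && vec_lanes <= ACC_NB);
    vector_part(acc, input, secret, vec_lanes);
    scalar_part(acc, input, secret, vec_lanes);
}

/* Returns 1 if the split produces the same accumulators as the all-scalar path. */
static int split_matches_scalar(size_t vec_lanes)
{
    uint64_t input[ACC_NB], secret[ACC_NB], a[ACC_NB], b[ACC_NB];
    for (size_t i = 0; i < ACC_NB; i++) {
        input[i]  = 0x9E3779B97F4A7C15ULL * (i + 1);           /* arbitrary test data */
        secret[i] = 0xC2B2AE3D27D4EB4FULL ^ (i * 0x0123456789ABCDEFULL);
        a[i] = b[i] = i;
    }
    accumulate_split(a, input, secret, 0);         /* every lane on the scalar path */
    accumulate_split(b, input, secret, vec_lanes); /* mixed split */
    return memcmp(a, b, sizeof a) == 0;
}
```

In this model the split point changes only which path does the work, never the result, which is why the lane count can be tuned per target. With the real library, the setting would presumably be chosen at compile time, for example by defining `XXH3_NEON_LANES` before including the header or passing `-DXXH3_NEON_LANES=8` to the compiler.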
And with that:
Unfortunately, this doesn't seem to affect x86_64 in any positive manner if I do it with SSE2.
Somehow, this feels like a distant cousin of heterogeneous multi-core processing, though at a more granular instruction port level.
It's Quake for the Pentium all over again 😅 I honestly don't know why I didn't think of that before; I knew most ARM chips execute ARM and NEON instructions at the same time.
Should I move the scalar variant to the top?
So that they stand closer to their component
Well I am already moving the round implementation to the top because I need to use it in both, and it isn't used as the last
You could also declare them at the top, and keep their implementation in the scalar category. Whatever seems simpler / more readable.
I did the forward decl for now. I also added some explanations of ARM's uop dispatcher for why it works. Edit: I guess I was too slow lol, I pushed after the PR was merged 😂
Just make another PR ;)
I should test this on the in-order base model Cortex-A53. My mom has a Verizon Ellipsis 10 with one. But yeah I'll make a new PR with some new documentation.
Cortex-A53 test results "Mom, can I have an AArch64 CPU?"
So that means there are no major drawbacks between the two, which is good.
meaning,
Actually, I forgot that XXH32 was slower on the older CPUs lol. Although I do see a slight optimization I can make in load/store. I also tested this PR on my old Pixel 2 XL (Snap 835/Cortex-A73). This has the older 3 micro-op dispatch buffer, and only gets a ~4% speedup from this (5.1->5.3 GB/s). So for the three main dispatch styles, we don't ever lose performance, as I predicted.
I will make a note of it.