Optimize XXH3_accumulate_512_neon #734
Merged
dougallj force-pushed the neon-opt branch 2 times, most recently from 7ec024a to 1132f36 (August 31, 2022 02:53)
cc @easyaspi314
I can confirm the performance gains on M1.

Slight boost on my Pixel 4a (10.4 -> 10.8). I get 10.9 (no clue why) if I do

    uint32x4x2_t zipped = vuzpq_u32(vreinterpretq_u32_u64(data_key1), vreinterpretq_u32_u64(data_key2));
    uint32x4_t data_key_lo = zipped.val[0];
    uint32x4_t data_key_hi = zipped.val[1];

Plus this also makes this block compatible with armv7-a.
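To make the de-interleave in the comment above easier to follow, here is a portable scalar sketch of what `vuzpq_u32` produces when its two inputs are 2x64-bit vectors reinterpreted as 4x32-bit lanes (little-endian lane order assumed). The function names `unzip_lo_hi` and `mul_lo_hi` are hypothetical and are not part of xxHash. The 32x32 -> 64-bit product is shown only to illustrate why having the low and high halves in separate vectors is convenient; this is a model under those assumptions, not the PR's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vuzpq_u32(reinterpret(key1), reinterpret(key2)):
 * lo[] receives the even 32-bit lanes (low half of each 64-bit element),
 * hi[] receives the odd 32-bit lanes (high half of each 64-bit element).
 * Hypothetical helper, for illustration only. */
static void unzip_lo_hi(const uint64_t key1[2], const uint64_t key2[2],
                        uint32_t lo[4], uint32_t hi[4])
{
    size_t i;
    uint64_t all[4];

    assert(key1 != NULL && key2 != NULL && lo != NULL && hi != NULL);

    all[0] = key1[0];
    all[1] = key1[1];
    all[2] = key2[0];
    all[3] = key2[1];

    for (i = 0; i < 4; i++) {
        lo[i] = (uint32_t)all[i];         /* low 32 bits of the element  */
        hi[i] = (uint32_t)(all[i] >> 32); /* high 32 bits of the element */
    }
}

/* Widening 32x32 -> 64-bit multiply of one lo/hi pair, the kind of
 * per-lane product a widening NEON multiply computes. */
static uint64_t mul_lo_hi(uint32_t lo, uint32_t hi)
{
    return (uint64_t)lo * (uint64_t)hi;
}
```

With a single unzip, all four low halves land in one array and all four high halves in the other, which mirrors `zipped.val[0]` and `zipped.val[1]` in the snippet above.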
dougallj force-pushed the neon-opt branch 2 times, most recently from 1e10bf2 to 90cf8fb (August 31, 2022 16:37)
Cyan4973 approved these changes Sep 2, 2022
pkhuong pushed a commit to backtrace-labs/umash that referenced this pull request Sep 11, 2022
NEON's PMULL extension only multiplies low halves with low halves, or high halves with high halves. This means we need a shuffle to implement umash's PH mixer. We can at least shuffle *two* 128-bit values at once with a single VEXTQ, then compute the carryless product of the low and high halves separately.

We also prefer to implement this small operation with inline assembly to make sure the PMULL and EOR instructions are paired correctly for fusion on the M1's Firestorm unit (https://dougallj.github.io/applecpu/firestorm.html).

Why Firestorm? That's what I have easy access to, I don't know if any other uarch has similar fusion that might be helpful, and the core's width makes it a great fit for umash: with this and previous patches, umash hits more than 16.5 byte/cycle (52.8 GB/s) on the 3.2 GHz performance cores for aligned inputs in L1, and 14.9 byte/cycle (47.7 GB/s) for misaligned inputs in L1. That's ~40% more throughput than xxh3 with Cyan4973/xxHash#734 applied.

TESTED=on gcc103, clang and gcc, with and without inline asm.

```
pkhuong@penguin:~/umash$ (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5  -
pkhuong@gcc103:~/umash$ clang-14 $CFLAGS umash.c example.c -o example && (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5  -
pkhuong@gcc103:~/umash$ clang-14 $CFLAGS -DUMASH_INLINE_ASM=0 umash.c example.c -o example && (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5  -
pkhuong@gcc103:~/umash$ gcc-12 $CFLAGS umash.c example.c -o example && (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5  -
pkhuong@gcc103:~/umash$ gcc-12 $CFLAGS -DUMASH_INLINE_ASM=0 umash.c example.c -o example && (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5  -
```

Signed-off-by: Paul Khuong <pvk@pvk.ca>
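For readers unfamiliar with carryless multiplication, the following is a minimal scalar sketch of a 64x64 -> 128-bit carryless product, plus two thin wrappers that model the "low halves with low halves, high halves with high halves" restriction mentioned in the commit message. The names `clmul64`, `pmull_low_model`, `pmull_high_model`, and `u128_parts` are hypothetical; this is neither umash's implementation nor the inline assembly the commit refers to.

```c
#include <stdint.h>

/* 128-bit result split into two 64-bit halves (hypothetical helper type). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* Carryless (polynomial, GF(2)) multiply: like schoolbook multiplication,
 * but partial products are combined with XOR instead of addition. */
static u128_parts clmul64(uint64_t a, uint64_t b)
{
    u128_parts r = { 0, 0 };
    int i;

    for (i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i != 0) {
                r.hi ^= a >> (64 - i); /* bits shifted out of the low half */
            }
        }
    }
    return r;
}

/* Model of a low-half multiply: element 0 of each 2x64-bit vector. */
static u128_parts pmull_low_model(const uint64_t a[2], const uint64_t b[2])
{
    return clmul64(a[0], b[0]);
}

/* Model of a high-half multiply: element 1 of each 2x64-bit vector. */
static u128_parts pmull_high_model(const uint64_t a[2], const uint64_t b[2])
{
    return clmul64(a[1], b[1]);
}
```

Because neither wrapper can multiply element 0 of one vector with element 1 of another, a mixer that needs such a cross product has to move the halves into matching positions first, which is the shuffle the commit message describes.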
pkhuong pushed a commit to backtrace-labs/umash that referenced this pull request Sep 12, 2022
Curiosity got the better of me, so I made the changes described in #733.
In a custom benchmark on M1, those changes give a ~4% speedup (34.6 GB/s -> 36.1 GB/s). Bumping `XXH3_NEON_LANES` to 8 reaches 37.3 GB/s, for a ~7.7% speedup overall.
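The percentages quoted above follow directly from the throughput figures. A small sketch of the arithmetic, with the hypothetical helper name `speedup_percent`:

```c
#include <assert.h>

/* Relative speedup of `after` over `before`, in percent.
 * Hypothetical helper, for illustration only. */
static double speedup_percent(double before, double after)
{
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}
```

Calling `speedup_percent(34.6, 36.1)` gives a little over 4%, and `speedup_percent(34.6, 37.3)` gives a little under 8%, in line with the numbers reported in the description.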