Optimize XXH3_accumulate_512_neon #734

Merged Sep 7, 2022 (1 commit)


dougallj

Curiosity got the better of me, so I made the changes described in #733.

In a custom benchmark on M1, those changes give a ~4% speedup (34.6 GB/s -> 36.1 GB/s). Bumping XXH3_NEON_LANES to 8 reaches 37.3 GB/s, for a ~7.7% speedup overall.

dougallj force-pushed the neon-opt branch 2 times, most recently from 7ec024a to 1132f36, August 31, 2022 02:53

Cyan4973 commented Aug 31, 2022

cc @easyaspi314

I can confirm the performance gains on M1.


easyaspi314 commented Aug 31, 2022

Slight boost on my Pixel 4a (10.4->10.8)

I get 10.9 (no clue why) if I do

uint32x4x2_t zipped = vuzpq_u32(vreinterpretq_u32_u64(data_key1), vreinterpretq_u32_u64(data_key2));
uint32x4_t data_key_lo = zipped.val[0];
uint32x4_t data_key_hi = zipped.val[1];

Plus this also makes this block compatible with armv7-a.
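
To make the snippet above easier to follow, here is a portable scalar sketch of what `vuzpq_u32` does to the two 64-bit-lane inputs. It is a model for illustration only, not code from the PR: `model_u32x4x2`, `model_vuzpq_u32` and `split_lo_hi` are hypothetical names, and a little-endian lane layout (low 32 bits of each 64-bit lane first) is assumed.

```c
#include <stdint.h>

/* Hypothetical stand-in for uint32x4x2_t: two vectors of four 32-bit lanes. */
typedef struct { uint32_t val[2][4]; } model_u32x4x2;

/* Scalar model of vuzpq_u32(a, b): val[0] gathers the even-indexed lanes
 * of a followed by those of b, val[1] gathers the odd-indexed lanes. */
static model_u32x4x2 model_vuzpq_u32(const uint32_t a[4], const uint32_t b[4])
{
    model_u32x4x2 r;
    for (int i = 0; i < 2; i++) {
        r.val[0][i]     = a[2 * i];
        r.val[0][i + 2] = b[2 * i];
        r.val[1][i]     = a[2 * i + 1];
        r.val[1][i + 2] = b[2 * i + 1];
    }
    return r;
}

/* Model of the reinterpret + unzip sequence: view two vectors of two
 * 64-bit lanes as 32-bit lanes (little-endian layout assumed), then
 * unzip them so that val[0] holds all low halves and val[1] all high halves. */
static model_u32x4x2 split_lo_hi(const uint64_t k1[2], const uint64_t k2[2])
{
    uint32_t a[4], b[4];
    for (int i = 0; i < 2; i++) {
        a[2 * i]     = (uint32_t)k1[i];         /* low 32 bits  */
        a[2 * i + 1] = (uint32_t)(k1[i] >> 32); /* high 32 bits */
        b[2 * i]     = (uint32_t)k2[i];
        b[2 * i + 1] = (uint32_t)(k2[i] >> 32);
    }
    return model_vuzpq_u32(a, b);
}
```

In this model, `val[0]` plays the role of `data_key_lo` and `val[1]` the role of `data_key_hi`, each covering four 64-bit lanes at once.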

dougallj force-pushed the neon-opt branch 2 times, most recently from 1e10bf2 to 90cf8fb, August 31, 2022 16:37
Cyan4973 merged commit c420b59 into Cyan4973:dev on Sep 7, 2022
pkhuong pushed a commit to backtrace-labs/umash that referenced this pull request Sep 11, 2022
neon's pmull extension only multiplies low halves with low halves, or
high halves with high halves.  This means we need a shuffle to
implement umash's PH mixer.  We can at least shuffle *two* 128-bit
values at once with a single VEXTQ, then compute the carryless product
of the low and high halves separately.

We also prefer to implement this small operation with inline
assembly to make sure the PMULL and EOR instructions are paired
correctly for fusion on the M1's firestorm
unit (https://dougallj.github.io/applecpu/firestorm.html).

Why firestorm? That's what I have easy access to, I don't know if any
other uarch has similar fusion that might be helpful, and the core's
width makes it a great fit for umash: with this and previous patches,
umash hits over 16.5 bytes/cycle (52.8 GB/s) on the 3.2 GHz
performance cores for aligned inputs in L1, and 14.9 b/c (47.7 GB/s)
for misaligned inputs in L1.  That's ~40% more throughput than xxh3
with Cyan4973/xxHash#734 applied.

TESTED=on gcc103, clang and gcc, with and without inline asm.
```
pkhuong@penguin:~/umash$ (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5  -
pkhuong@gcc103:~/umash$ clang-14 $CFLAGS umash.c example.c -o example && (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5  -
pkhuong@gcc103:~/umash$ clang-14 $CFLAGS -DUMASH_INLINE_ASM=0 umash.c example.c -o example && (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5  -
pkhuong@gcc103:~/umash$ gcc-12 $CFLAGS umash.c example.c -o example && (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5  -
pkhuong@gcc103:~/umash$ gcc-12 $CFLAGS -DUMASH_INLINE_ASM=0 umash.c example.c -o example && (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5  -
```

Signed-off-by: Paul Khuong <pvk@pvk.ca>
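
The shuffle described in the commit message above can be modelled in portable scalar C. This is a sketch under stated assumptions, not umash's actual code: it assumes the mixer needs the carryless product of each 128-bit value's low half with its own high half, and every name here (`u128`, `clmul64`, `model_vext1`, `model_pmull_low`, `model_pmull_high`, `cross_products`) is hypothetical. It shows how a single VEXTQ-style shuffle of two values can feed both a low-by-low and a high-by-high multiply.

```c
#include <stdint.h>

/* Hypothetical 128-bit value: lo is 64-bit lane 0, hi is lane 1. */
typedef struct { uint64_t lo, hi; } u128;

/* Bit-by-bit carryless (polynomial over GF(2)) 64x64 -> 128-bit multiply. */
static u128 clmul64(uint64_t a, uint64_t b)
{
    u128 r = { 0, 0 };
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i > 0)
                r.hi ^= a >> (64 - i); /* bits shifted out of the low word */
        }
    }
    return r;
}

/* Model of VEXTQ on 64-bit lanes with offset 1: { a.hi, b.lo }. */
static u128 model_vext1(u128 a, u128 b)
{
    u128 r = { a.hi, b.lo };
    return r;
}

/* PMULL-style multiply: lane 0 of x by lane 0 of y. */
static u128 model_pmull_low(u128 x, u128 y) { return clmul64(x.lo, y.lo); }

/* PMULL2-style multiply: lane 1 of x by lane 1 of y. */
static u128 model_pmull_high(u128 x, u128 y) { return clmul64(x.hi, y.hi); }

/* Compute a.lo (x) a.hi and b.lo (x) b.hi with one shared shuffle. */
static void cross_products(u128 a, u128 b, u128 *pa, u128 *pb)
{
    u128 s = model_vext1(a, b);   /* { a.hi, b.lo } */
    *pa = model_pmull_low(a, s);  /* a.lo (x) a.hi  */
    *pb = model_pmull_high(b, s); /* b.hi (x) b.lo  */
}
```

Carryless multiplication is commutative, so `b.hi (x) b.lo` equals `b.lo (x) b.hi`; that is what lets one shuffle serve both products in this model.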
pkhuong pushed a commit to backtrace-labs/umash that referenced this pull request Sep 12, 2022