Optimize XXH3_accumulate_512_neon #734

dougallj · 2022-08-31T02:42:30Z

Curiosity got the better of me, so I made the changes described in #733.

In a custom benchmark on M1, these changes give a ~4% speedup (34.6 GB/s -> 36.1 GB/s). Bumping XXH3_NEON_LANES to 8 reaches 37.3 GB/s, for a ~7.7% speedup overall.

Cyan4973 · 2022-08-31T05:05:54Z

cc @easyaspi314

I can confirm the performance gains on M1.

easyaspi314 · 2022-08-31T15:28:30Z

Slight boost on my Pixel 4a (10.4->10.8)

I get 10.9 (no clue why) if I do

uint32x4x2_t zipped = vuzpq_u32(vreinterpretq_u32_u64(data_key1), vreinterpretq_u32_u64(data_key2));
uint32x4_t data_key_lo = zipped.val[0];
uint32x4_t data_key_hi = zipped.val[1];

Plus this also makes this block compatible with armv7-a.
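
For readers without the intrinsics reference at hand, here is a minimal scalar sketch of what that `vuzpq_u32` call computes, assuming a little-endian target. The helper names (`unzip_lo_hi`, `mul_lo_hi`, `unzip_selfcheck`) are hypothetical and are not part of xxHash. The unzip gathers the even 32-bit lanes of both inputs into `val[0]` and the odd lanes into `val[1]`, which for reinterpreted 64-bit lanes means the low halves and the high halves respectively. The multiply step is only an illustration of how the two halves would then feed a 32x32->64 widening multiply.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of
 *   vuzpq_u32(vreinterpretq_u32_u64(k1), vreinterpretq_u32_u64(k2))
 * on a little-endian target: lo[] plays the role of zipped.val[0]
 * (even 32-bit lanes), hi[] the role of zipped.val[1] (odd lanes). */
static void unzip_lo_hi(const uint64_t k1[2], const uint64_t k2[2],
                        uint32_t lo[4], uint32_t hi[4])
{
    const uint64_t all[4] = { k1[0], k1[1], k2[0], k2[1] };
    for (int i = 0; i < 4; i++) {
        lo[i] = (uint32_t)(all[i] & 0xFFFFFFFFu); /* low half of 64-bit lane i  */
        hi[i] = (uint32_t)(all[i] >> 32);         /* high half of 64-bit lane i */
    }
}

/* Illustrative 32x32 -> 64-bit widening multiply of the two halves. */
static void mul_lo_hi(const uint32_t lo[4], const uint32_t hi[4],
                      uint64_t prod[4])
{
    for (int i = 0; i < 4; i++) {
        prod[i] = (uint64_t)lo[i] * (uint64_t)hi[i];
    }
}

/* Small self-check: every lane's halves land in the expected output. */
static void unzip_selfcheck(void)
{
    const uint64_t k1[2] = { 0x0000000200000003ULL, 0x0000000400000005ULL };
    const uint64_t k2[2] = { 0x0000000600000007ULL, 0x0000000800000009ULL };
    uint32_t lo[4], hi[4];
    uint64_t prod[4];

    unzip_lo_hi(k1, k2, lo, hi);
    mul_lo_hi(lo, hi, prod);
    assert(lo[0] == 3 && hi[0] == 2 && prod[0] == 6);
    assert(lo[3] == 9 && hi[3] == 8 && prod[3] == 72);
}
```

One unzip handles two 128-bit vectors at once, so both the low and high halves of four 64-bit lanes come out of a single call.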

NEON's PMULL extension only multiplies low halves with low halves, or high halves with high halves. This means we need a shuffle to implement umash's PH mixer. We can at least shuffle *two* 128-bit values at once with a single VEXTQ, then compute the carryless product of the low and high halves separately.

We also prefer to implement this small operation with inline assembly, to make sure the PMULL and EOR instructions are paired correctly for fusion on the M1's Firestorm unit (https://dougallj.github.io/applecpu/firestorm.html).

Why Firestorm? That's what I have easy access to, I don't know if any other uarch has similar fusion that might be helpful, and the core's width makes it a great fit for umash: with this and previous patches, umash hits over 16.5 bytes/cycle (52.8 GB/s) on the 3.2 GHz performance cores for aligned inputs in L1, and 14.9 bytes/cycle (47.7 GB/s) for misaligned inputs in L1. That's ~40% more throughput than xxh3 with Cyan4973/xxHash#734 applied.

TESTED=on gcc103, clang and gcc, with and without inline asm.
```
pkhuong@penguin:~/umash$ (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5 -
pkhuong@gcc103:~/umash$ clang-14 $CFLAGS umash.c example.c -o example && (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5 -
pkhuong@gcc103:~/umash$ clang-14 $CFLAGS -DUMASH_INLINE_ASM=0 umash.c example.c -o example && (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5 -
pkhuong@gcc103:~/umash$ gcc-12 $CFLAGS umash.c example.c -o example && (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5 -
pkhuong@gcc103:~/umash$ gcc-12 $CFLAGS -DUMASH_INLINE_ASM=0 umash.c example.c -o example && (for i in `seq 0 70000`; do ./example "$(seq 0 70000 | head -c $i)"; done) | md5sum
cf89ae90a5797a1d56b1ecb53cd9b7c5 -
```

Signed-off-by: Paul Khuong <pvk@pvk.ca>
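
As a reference point for the carryless product mentioned in the commit message above, here is a minimal portable sketch of a 64x64 -> 128-bit polynomial multiplication over GF(2), which is the operation a single PMULL performs on one pair of 64-bit halves. The names `clmul128`, `clmul64`, and `clmul64_selfcheck` are hypothetical. This is a slow bit-by-bit model for checking results, not umash's implementation and not a substitute for the instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit result holder: lo = bits 0..63, hi = bits 64..127. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} clmul128;

/* Carryless (XOR-instead-of-add) multiply of two 64-bit polynomials.
 * For each set bit i of b, XOR in a shifted left by i. */
static clmul128 clmul64(uint64_t a, uint64_t b)
{
    clmul128 r = { 0, 0 };
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1u) {
            r.lo ^= a << i;
            if (i != 0) {
                /* Bits of a that were shifted out of the low word. */
                r.hi ^= a >> (64 - i);
            }
        }
    }
    return r;
}

/* Small self-check: (x + 1)^2 = x^2 + 1 over GF(2), so 3 * 3 = 5. */
static void clmul64_selfcheck(void)
{
    clmul128 r = clmul64(3, 3);
    assert(r.lo == 5 && r.hi == 0);
}
```

Because PMULL pairs low with low or high with high, a mixer that needs the low half of one operand against the high half of another must shuffle first, which is the role the commit message gives to VEXTQ.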

dougallj force-pushed the neon-opt branch 2 times, most recently from 7ec024a to 1132f36 Compare August 31, 2022 02:53

dougallj force-pushed the neon-opt branch 2 times, most recently from 1e10bf2 to 90cf8fb Compare August 31, 2022 16:37

optimize XXH3_accumulate_512_neon

620facc

dougallj force-pushed the neon-opt branch from 90cf8fb to 620facc Compare August 31, 2022 16:48

Cyan4973 approved these changes Sep 2, 2022


Cyan4973 merged commit c420b59 into Cyan4973:dev Sep 7, 2022
