Flatten ZSTD_row_getMatchMask #2681
Conversation
* Remove the SIMD abstraction layer.
* Add big endian support.
* Align `hashTags` within `tagRow` to a 16-byte boundary.
* Switch SSE2 to use aligned reads.
* Optimize scalar path using SWAR.
* Optimize NEON path for `n == 32`.
* Work around minor clang issue for NEON (https://bugs.llvm.org/show_bug.cgi?id=49577)
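To illustrate the SWAR idea in the scalar path, here is a minimal sketch (the function name and shape are mine, not zstd's actual code): compare 8 row tags at once against a single search tag, using a carry-free per-byte zero test and a multiply to gather the per-byte results into an 8-bit match mask.

```c
#include <stdint.h>

/* Hypothetical sketch of a SWAR match mask (not zstd's implementation).
 * Returns an 8-bit mask: bit i is set iff byte i of `row` equals `tag`. */
static uint64_t swar_match_mask8(uint64_t row, uint8_t tag)
{
    const uint64_t lo7   = 0x7F7F7F7F7F7F7F7FULL;
    const uint64_t splat = (uint64_t)tag * 0x0101010101010101ULL; /* broadcast tag to all 8 bytes */
    const uint64_t x     = row ^ splat;       /* matching bytes become 0x00 */
    /* Exact per-byte zero test: t has 0x80 in every zero byte of x, 0x00 elsewhere.
     * (x & lo7) + lo7 never carries across byte lanes, so lanes stay independent. */
    const uint64_t t = ~(((x & lo7) + lo7) | x | lo7);
    /* Gather the 8 high bits into the top byte; this multiply is carry-free
     * because t only has bits at the 0x80 positions. */
    return (t * 0x0002040810204081ULL) >> 56;
}
```

The per-byte zero test here is exact (no false positives from borrow propagation), which matters because the result is used directly as a match bitmap rather than just an "any match" flag.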
AFAICT, the test failures are all: which leaves me mystified as to how any SIMD ever passed these tests... Do we need (or want)
Thanks for these contributions! These NEON and scalar optimizations look great. I've left some comments that mostly have to do with style/code structure. I'll get around to benchmarking the scalar-mode improvements at some point as well.
Amusingly, NEON gets cheaper when
That's actually good to know; we're also going to add support for 64 row entries.
This looks good to me! @terrelln do you have any thoughts?
Thanks @aqrit for this great contribution 🚀
Excellent work @aqrit!
const U16 hi = (U16)vgetq_lane_u8(t3, 8);
const U16 lo = (U16)vgetq_lane_u8(t3, 0);
return ZSTD_rotateRight_U16((hi << 8) | lo, head);
} else { /* rowEntries == 32 */
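The `ZSTD_rotateRight_U16` call above rotates the match mask so it lines up with the row's circular head position. As context for readers, a portable sketch of such a rotate (my naming; assuming the helper behaves like a standard 16-bit rotr) looks like this:

```c
#include <stdint.h>

/* Hypothetical portable rotate-right for a 16-bit mask (not zstd's exact code).
 * The `& 15` on both shift amounts keeps every shift in [0, 15],
 * avoiding undefined behavior when r == 0 or r >= 16. */
static uint16_t rotateRight_u16(uint16_t v, unsigned r)
{
    r &= 15;
    return (uint16_t)((v >> r) | (v << ((16u - r) & 15)));
}
```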
@aqrit: This NEON code must be designed specifically for each `rowEntries`. Hence, we have an implementation for 16, for 32, and then for 64. If we would like to introduce 128 or even 256, it's necessary to write another implementation.
Ignoring the issue of generating the final mask (whose width directly depends on the number of entries), does it seem possible to rewrite the main test loop in a way that would be generic and scale naturally with `rowEntries` (provided it's a power of 2)?
For an example of what I mean by "an implementation which can scale with `rowEntries`", please have a look at https://github.com/facebook/zstd/blob/dev/lib/compress/zstd_lazy.c#L992, which uses SSE2.
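The kind of scaling being asked about can be sketched as follows (function and parameter names are mine, not zstd's): loop over 16-byte SSE2 chunks and shift each chunk's `_mm_movemask_epi8` result into the appropriate slice of a wide mask, so the same loop body handles any power-of-2 `rowEntries` from 16 up to 64.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of a rowEntries-generic SSE2 match-mask loop.
 * Returns a mask with bit i set iff tags[i] == tag, for rowEntries <= 64. */
static uint64_t sse2_match_mask(const uint8_t* tags, size_t rowEntries, uint8_t tag)
{
    const __m128i splat = _mm_set1_epi8((char)tag);
    uint64_t mask = 0;
    size_t i;
    for (i = 0; i < rowEntries; i += 16) {
        /* compare 16 tags at once; matching lanes become 0xFF */
        const __m128i chunk = _mm_loadu_si128((const __m128i*)(const void*)(tags + i));
        const __m128i eq    = _mm_cmpeq_epi8(chunk, splat);
        /* pack the 16 lane results into 16 bits and place them at offset i */
        mask |= (uint64_t)(uint16_t)_mm_movemask_epi8(eq) << i;
    }
    return mask;
}
```

The per-chunk work is identical regardless of row size; only the loop count and the width of the accumulated mask change, which is the property the comment above is asking for.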
128/256 would just loop the 64-byte path.
I've done no benchmarking but... it seems really painful to emulate `pmovmskb` using ARMv7-A NEON. AArch64 added the SIMD instruction `addv`, which might be useful for simplifying all this.
On AArch64, is the ARMv7-A path actually faster than the scalar fallback?
Benchmark:
tl;dr: compression speed of the scalar path increases by about 22%. The scalar path is now only about 4% slower than the SSE2 path.
Benched on my desktop (i7-8700K) using `gcc version 9.3.0`.