Add initial aarch64 neon support #114
Conversation
Oops, in the comment it says what command was used, sorry for the ping!
Here are the initial benchmark numbers (tested on an 8-core M1 Pro):

So it seems like we only lose for small/tiny haystacks with more common occurrences. I think reducing unrolling would probably help in that case, since there is a lot of overhead in searching much more than we need to when the needle is likely to be found within the first 16 bytes (especially for memchr2/3), but it seems difficult to decide when to reduce unrolling based on the haystack size and the frequency of needle occurrences in the haystack.
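As a rough illustration of the unrolling overhead being discussed here (this is an editor's sketch of the general technique, not the PR's actual loop), an unrolled NEON search typically compares several 16-byte chunks per iteration and ORs the results together so that only one reduction has to be checked:

```rust
use core::arch::aarch64::*;

// Compare 64 bytes against the needle vector `nv` and report whether any byte
// matched. Only one horizontal check is paid per 64 bytes, which helps long
// haystacks but is wasted work when the match sits in the first 16 bytes.
// Assumes at least 64 readable bytes at `ptr`.
unsafe fn any_match_64(ptr: *const u8, nv: uint8x16_t) -> bool {
    let a = vceqq_u8(vld1q_u8(ptr), nv);
    let b = vceqq_u8(vld1q_u8(ptr.add(16)), nv);
    let c = vceqq_u8(vld1q_u8(ptr.add(32)), nv);
    let d = vceqq_u8(vld1q_u8(ptr.add(48)), nv);
    // OR the four comparison masks so a single reduction covers all 64 bytes.
    let any = vorrq_u8(vorrq_u8(a, b), vorrq_u8(c, d));
    vmaxvq_u8(any) != 0
}
```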
I think this is what is remaining for this PR:
Hoping to get any sort of initial feedback, though, to see if the PR is going in the right direction so far.
@redzic What do you most need help with to complete this PR? Sounds like 2 and 4 from your list above are the most obvious candidates.
@CeleritasCelery Any help would be good, but I guess the "bureaucratic" stuff like the build.rs version check is what's preventing this from being merged now. Or maybe @BurntSushi has just been busy and hasn't had time to review the code and make sure everything is correct and up to the same quality as the rest of the codebase. (Also, if I remember correctly, he didn't have an ARM machine to test on, which makes things more difficult.)
I don't believe we need a build.rs version check. If the rustc version is too low,
I plan to bump the MSRV to Rust 1.60 soon, so we definitely shouldn't need to worry about version sniffing here. Also, I have an M2 mac mini in the mail so that I can actually start playing around and testing this stuff too. (And get it working for
I took some time over the weekend to look into this implementation. I thought the use of pair-wise addition as a replacement for movemask was pretty clever. Also, using highly generic code resulted in saving about 200 lines overall. I do have a concern (perhaps unfounded) that it will be harder to reliably optimize code that is so generic, particularly when comparing

I did a couple of refactoring things, like replacing manual loops with `map`.

Before:

```rust
let mut nv: [uint8x16_t; N] = [vdupq_n_u8(0); N];
for i in 0..N {
    nv[i] = vdupq_n_u8(n[i]);
}
```

After:

```rust
let nv = n.map(|x| vdupq_n_u8(x));
```

I found an article from Arm about how to emulate SSE movemask in NEON. It even specifically mentions memchr. Their recommendation was to use shrn (shift right and narrow). I implemented that and saw a notable improvement in the mem{r}chr{2,3} benchmarks. However, the mem{r}chr1 benchmarks were more mixed: I saw improvements in every benchmark but the

My repo is located at https://github.com/CeleritasCelery/memchr/tree/aarch64-neon

@redzic let me know if you would like me to merge any of these changes into your PR.

Benchmarks: current PR baseline compared to shrn version; shrn compared with shrn and pre-computed masks.
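To make the shrn trick concrete, here is a minimal sketch of emulating movemask on NEON along the lines the Arm article describes (an editor's illustration, not code from the PR or the branch above; `neon_movemask` and `find_in_chunk` are made-up names):

```rust
use core::arch::aarch64::*;

// Shift each 16-bit lane right by 4 and narrow to 8 bits. This packs the
// per-byte 0x00/0xFF comparison results into one 64-bit value, one nibble
// per byte lane, which plays the role of SSE2's movemask.
unsafe fn neon_movemask(eq: uint8x16_t) -> u64 {
    let nibbles = vshrn_n_u16::<4>(vreinterpretq_u16_u8(eq));
    vget_lane_u64::<0>(vreinterpret_u64_u8(nibbles))
}

// Find the first occurrence of `needle` in a single 16-byte chunk.
// Assumes at least 16 readable bytes at `chunk`.
unsafe fn find_in_chunk(chunk: *const u8, needle: u8) -> Option<usize> {
    let eq = vceqq_u8(vld1q_u8(chunk), vdupq_n_u8(needle));
    let mask = neon_movemask(eq);
    if mask == 0 {
        None
    } else {
        // Each byte lane occupies 4 bits of the mask.
        Some(mask.trailing_zeros() as usize / 4)
    }
}
```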
@CeleritasCelery Thank you! Especially for that link. I just got my M2 mac mini set up yesterday and I'm now looking into this. I think I grok all the critical parts of
With the relevant snippet of code being:
And the specific thing they're talking about in the article,
But I don't understand the utility of this function. It looks like

The code then seems to go on to only check the lower 64 bits of

I'll note that there is a horizontal max vector op, and that seems like what you'd want here. But that's
cc @danlark1 in case you were inclined to share your wisdom here. It looks like you are the author of https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon :-)
This is describing how we check if the vector is zero:

```rust
unsafe fn eq0(x: uint8x16_t) -> bool {
    let low = vreinterpretq_u64_u8(vpmaxq_u8(x, x));
    vgetq_lane_u64::<0>(low) == 0
}
```

vpmaxq_u8 is the vector version of umaxp. It takes two 16-byte vectors and pair-wise reduces them down to a single 16-byte vector. Calling it on the same vector leads to the result of the pair-wise max being duplicated in the upper 8 bytes and the lower 8 bytes. Then we can extract the lower 8 bytes as a u64.

Here is a diagram to (hopefully) make it clearer:

```
a = [1, 2, 3, 4, ..., 15, 16; 16]
b = [1, 2, 3, 4, ..., 15, 16; 16]
# after calling vpmaxq on (a, b), each lane contains the max of a pair of lanes
vpmaxq = [1&2, 3&4, 5&6, ..., 15&16, 1&2, 3&4, ..., 15&16; 16]
```
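For comparison with eq0 above, here is a minimal sketch (an editor's addition, not code from the PR) of the horizontal-max alternative mentioned earlier: vmaxvq_u8 (umaxv) reduces the whole vector to its maximum byte in one step, so a non-zero result means at least one lane of the comparison mask was set.

```rust
use core::arch::aarch64::*;

// `eq` is assumed to be a vceqq_u8 comparison result (lanes are 0x00 or 0xFF).
unsafe fn any_match(eq: uint8x16_t) -> bool {
    vmaxvq_u8(eq) != 0
}
```

Whether this or the pair-wise umaxp trick is faster comes down to instruction latency and pipeline usage on a given core, which the next comment goes into.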
Hi, thanks for including me.

umaxp (unsigned maximum pairwise) is an instruction which takes 2 consecutive bytes of the concatenation of the second and third operands and produces the biggest byte of every pair. It means that if the second and third operands are the same, the first 16 bytes will be reduced to 8 bytes where every byte is the maximum of its pair. Then

```
fmov synd, dend
cbnz synd, L(end)
```

takes the 64-bit integer and compares it with zero. That is correct for checking whether you have any match inside a vector.

As said, umaxp is better when the range is big enough, as it utilizes 2 pipelines where shrn uses just 1 on modern server hardware. You are also correct about umaxv: it's a vertical maximum; however, on server platforms such as Neoverse N2 it has latency 2 or 4 and uses only V1 (rather than latency 2 and both V0/V1), whereas umaxp's latency is only 2 and both pipelines are used. Say, if you test it on Graviton 3, umaxp will be better. For Apple I think it should be the same.

SHRN is good for emulating movemask and thus pretty good for small/mid ranges for mem* functions. umaxp is good for long ranges and for increasing throughput. I believe this is consistent with the benchmarks. Then there comes the question of what you are optimizing for. Most memchr calls are for strings up to 64-128 bytes, where shrn dominates. In the aarch64 optimized-routines we decided to use shrn for strings <= 16 bytes and umaxp for > 16 bytes because we haven't seen much regression for small/mid ranges. I wanted to go with only shrn, as memchr on big strings is rare in production, but did not push hard. Any version will be a great default in my opinion.
Nice nice. So it turns out I was looking at
OK, so now that I've sat down and familiarized myself with neon enough that I could write a
Hi! We (Arm) are happy to look at this kind of thing. I'm not sure if rustbot works here, but if it does, then you can ping us that way. Alternatively, we lurk on Zulip in
@jacobbramley Oh nice! There is no rustbot here, but once I have something for aarch64 I'll ping y'all for review. Thank you for chiming in.
I think I have an alternative to the movemask dance: If we can afford a pre-loaded
@llogiq I almost have this done. I'm basically in the final polishing/testing/measuring phases. It should be very easy to experiment with alternative approaches at that point. :-)
Closed by #129 (see PR comment for benchmark results). Thanks everyone for pushing this forward. I don't think I would have ended up with something so good without it. :)
This PR adds a NEON implementation of all `mem{r}chr` functions. This PR does not add NEON support for `memmem`, as deeper changes in the code are needed for the `Vector` trait for it to work efficiently with NEON. `memmem` support could possibly be addressed in a future PR.

Since AArch64 was introduced as part of ARMv8, and since ARMv8 requires NEON support, no runtime detection mechanism is needed for AArch64.
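To illustrate that last point, here is a minimal self-contained sketch (an editor's illustration, not the crate's actual dispatch code; `memchr_sketch` is a made-up name) of selecting the NEON path purely at compile time, since NEON is a mandatory part of AArch64:

```rust
#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::*;

// On AArch64 the NEON path can be taken unconditionally: no
// is_aarch64_feature_detected! check or build.rs version sniffing is required.
#[cfg(target_arch = "aarch64")]
pub fn memchr_sketch(needle: u8, haystack: &[u8]) -> Option<usize> {
    unsafe {
        let nv = vdupq_n_u8(needle);
        let mut i = 0;
        // Scan full 16-byte chunks with NEON.
        while i + 16 <= haystack.len() {
            let eq = vceqq_u8(vld1q_u8(haystack.as_ptr().add(i)), nv);
            if vmaxvq_u8(eq) != 0 {
                // A match exists in this chunk; locate its exact offset.
                return haystack[i..i + 16]
                    .iter()
                    .position(|&b| b == needle)
                    .map(|p| i + p);
            }
            i += 16;
        }
        // Scalar tail for the remaining bytes.
        haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
    }
}

#[cfg(not(target_arch = "aarch64"))]
pub fn memchr_sketch(needle: u8, haystack: &[u8]) -> Option<usize> {
    // Portable fallback for every other target.
    haystack.iter().position(|&b| b == needle)
}
```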