
Add initial aarch64 neon support #114

Closed

Conversation

@redzic

redzic commented Jul 8, 2022

This PR adds a NEON implementation of all mem{r}chr functions. It does not add NEON support for memmem, since the Vector trait needs deeper changes to work efficiently with NEON; memmem support could be addressed in a future PR.

Since AArch64 was introduced as part of ARMv8, and since ARMv8 requires NEON support, no runtime detection mechanism is needed for AArch64.

@redzic
Author

redzic commented Jul 8, 2022

@BurntSushi do you know how to run the entire benchmark suite for the functions in this crate only (i.e., not libc or anything else) so that the results are formatted like they are in this comment: #103 (comment)?

Oops, in the comment it says what command was used, sorry for the ping!

@redzic
Author

redzic commented Jul 8, 2022

Here are the initial benchmark numbers (tested on an 8-core M1 Pro):

[screenshot: benchmark results on an M1 Pro]

So it seems like we only lose on small/tiny haystacks where matches are more common. I think reducing the unrolling would probably help in that case (there is a lot of overhead in searching far more than we need to when the needle is likely to be found within the first 16 bytes, especially for memchr2/3), but it seems difficult to decide when to reduce unrolling based on the haystack size and the frequency of needle occurrences in the haystack.

@redzic
Author

redzic commented Jul 9, 2022

I think this is what is remaining for this PR:

  1. Ensure this works on big-endian targets (there are apparently some big-endian ARM processors). I'll see whether I can run the test suite under QEMU or something similar to make sure it works.
  2. Try aligning memory accesses. I only saw about a 2-3% improvement (if that) on the memchr1/huge/never benchmark, but maybe it's still worth trying again and testing the other benchmarks as well.
  3. Probably should add some comments explaining how certain parts of the code work, since there are notable differences from the sse2/avx2 implementation: the x86 code uses movemask, while the ARM code uses masking and pairwise addition to figure out the matching index, and pairwise max to figure out whether the equality vector contains any matches at all (a rough sketch of the movemask emulation is included right after this list).
  4. Add some kind of check in build.rs to disable the NEON code if the rustc version is too low. I'll have to do some more research to figure out how to do this.
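
For reference, the masking + pairwise-addition trick from item 3 looks roughly like this (a simplified sketch of the technique; the function name and exact shape are illustrative, not the literal code in the diff):

use core::arch::aarch64::*;

// Turn a 16-lane equality vector (0xFF where the needle matched, 0x00
// elsewhere) into a 16-bit mask with one bit per byte, like SSE2 movemask.
unsafe fn pairwise_movemask(eq: uint8x16_t) -> u16 {
    // Keep only the bit that corresponds to each lane's position within its byte.
    let bits: [u8; 16] = [1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128];
    let masked = vandq_u8(eq, vld1q_u8(bits.as_ptr()));
    // Three rounds of pairwise addition fold the 16 lanes into two bytes:
    // one byte for lanes 0..8 and one for lanes 8..16.
    let sum = vpaddq_u8(masked, masked);
    let sum = vpaddq_u8(sum, sum);
    let sum = vpaddq_u8(sum, sum);
    vgetq_lane_u16::<0>(vreinterpretq_u16_u8(sum))
}

The first matching index is then mask.trailing_zeros(), and the "does anything match at all?" question is answered separately with the pairwise max.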

Hoping to get any sort of initial feedback though to see if the PR is going in the right direction so far.

@CeleritasCelery

@redzic What do you most need help with to complete this PR? Sounds like 2 and 4 from your list above are the most obvious candidates.

@redzic
Author

redzic commented Jun 10, 2023

@CeleritasCelery Any help would be good, but I guess the "bureaucratic" stuff like the build.rs version check is what's preventing this from being merged right now. Or maybe @BurntSushi has just been busy and hasn't had time to review the code and make sure everything is correct and up to the same quality as the rest of the codebase (also, if I remember correctly, he doesn't have an ARM machine to test on, which makes things more difficult).

@CeleritasCelery

I don't believe we need a build.rs version check. If the rustc version is too low, cfg(target_feature = "neon") evaluates to false and the neon code won't be compiled. I tested this with 1.58.0 (the version before the NEON intrinsics were stabilized) and it worked as expected: the neon code was conditionally removed.
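
Concretely, the gating I mean looks something like this (an illustration of the mechanism only; the module names are made up):

// NEON code only exists when the target architecture and feature are both
// present; on an older rustc the cfg is simply false and the module vanishes.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
mod neon_impl {
    // NEON-accelerated mem{r}chr implementations would live here.
}

// Everything else keeps using the portable fallback.
#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
mod fallback_impl {
    // Portable implementations would live here.
}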

@BurntSushi
Owner

I plan to bump the MSRV to Rust 1.60 soon, so we definitely shouldn't need to worry about version sniffing here.

Also, I have an M2 mac mini in the mail so that I can actually start playing around and testing this stuff too. (And get it working for memmem and also Teddy in aho-corasick.)

@CeleritasCelery

CeleritasCelery commented Jul 10, 2023

I took some time over the weekend to look into this implementation. I thought the use of pairwise addition as a replacement for movemask was pretty clever. Using highly generic code also saved about 200 lines overall. I do have a concern (perhaps unfounded) that it will be harder to reliably optimize code that is so generic, particularly when comparing [T; N] to just having multiple variables. But I don't have a lot to base that on and would need to test it out.

I did a couple of refactorings, like replacing manual loops with map.

before:

let mut nv: [uint8x16_t; N] = [vdupq_n_u8(0); N];
for i in 0..N {
    nv[i] = vdupq_n_u8(n[i]);
}

after:

let nv = n.map(|x| vdupq_n_u8(x));

I found an article from Arm about how to emulate SSE movemask in NEON. It even specifically mentions memchr. Their recommendation was to use shrn (shift right and narrow). I implemented that and saw a notable improvement in the mem{r}chr{2,3} benchmarks. However, the mem{r}chr1 benchmarks were more mixed: I saw improvements in every benchmark but the huge/common ones, which saw a significant 20% slowdown. I tracked this down to whether or not the "movemask" is hoisted out of the if-chain. If I precompute the masks, then huge/common is about as fast as the pairwise-add version (the current PR), but all the other benchmarks are 5-20% worse compared to computing the masks inside the if-chains. Not sure why that would be the case. My best guess is that huge/common tends to have the needle at the end of a 64-byte chunk, so computing up front gives better ILP in that case.
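
The core of the shrn trick is tiny; roughly this (a sketch of the idea, not the exact code in my branch):

use core::arch::aarch64::*;

// Shift each 16-bit pair right by 4 and narrow back to 8 bits, so every byte
// of the comparison result contributes one nibble of the resulting u64.
unsafe fn shrn_movemask(eq: uint8x16_t) -> u64 {
    let nibbles = vshrn_n_u16::<4>(vreinterpretq_u16_u8(eq));
    vget_lane_u64::<0>(vreinterpret_u64_u8(nibbles))
}

The first matching offset then falls out as mask.trailing_zeros() / 4.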

repo

my repo is located at https://github.com/CeleritasCelery/memchr/tree/aarch64-neon

@redzic let me know if you would like me to merge any of these changes into your PR.

benchmarks

current PR baseline compared to shrn version

[screenshot: benchmark results]

shrn compared with shrn and pre-computed masks

[screenshot: benchmark results]

@BurntSushi
Owner

@CeleritasCelery Thank you! Especially for that link. I just got my M2 mac mini set up yesterday and I'm now looking into this. I think I grok all the critical parts of memchr on aarch64 except for one: the use of umaxp. The article says this:

We would like to note that the main loop in memchr still looks for the maximum value out of four 32-bit integers through umaxp instruction: when comparing, it checks that the max byte is not zero. If it is not, it uses shrn to get the mask. Experiments showed this is faster for strings (>128 characters) as on cores like Neoverse N1, it uses 2 pipelines V0/V1 whereas shrn uses only one pipeline V1, but both have the same latency. Such an approach showed better results overall and is suitable for more workloads. So, if you are checking for existence in a loop, consider using umaxp instruction followed by shrn: it might have a threshold where it is faster than only using shrn.

With the relevant snippet of code being:

L(loop32):
        ldr     qdata, [src, 16]!
        cmeq    vhas_chr.16b, vdata.16b, vrepchr.16b
        umaxp   vend.16b, vhas_chr.16b, vhas_chr.16b            /* 128->64 */
        fmov    synd, dend
        cbnz    synd, L(end)

And the specific thing they're talking about in the article, umaxp:

umaxp   vend.16b, vhas_chr.16b, vhas_chr.16b

But I don't understand the utility of this function. It looks like vend.16b is just going to be equivalent to vhas_chr.16b, since it's just taking the pairwise max of each element, but is using the same vector for both source registers.

The code then seems to go on to only check the lower 64 bits of vend.16b. Which... can't be right. So I'm missing something about how umaxp is being used. I'm pretty sure I'm just misreading it. Can anyone shine a light here?

I'll note that there is a horizontal max vector op, and that seems like what you'd want here. But that's umaxv, not umaxp. So why not use umaxv here?

@BurntSushi
Owner

cc @danlark1 in case you were inclined to share your wisdom here. It looks like you are the author of https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon :-)

@CeleritasCelery

CeleritasCelery commented Jul 11, 2023

But I don't understand the utility of this function. It looks like vend.16b is just going to be equivalent to vhas_chr.16b, since it's just taking the pairwise max of each element, but is using the same vector for both source registers.
The code then seems to go on to only check the lower 64 bits of vend.16b. Which... can't be right. So I'm missing something about how umaxp is being used. I'm pretty sure I'm just misreading it. Can anyone shine a light here?

This is describing how we check if the vector is zero.

unsafe fn eq0(x: uint8x16_t) -> bool {
    let low = vreinterpretq_u64_u8(vpmaxq_u8(x, x));
    vgetq_lane_u64(low, 0) == 0
}

vpmaxq_u8 is the vector version of umaxp. It takes two 16-byte vectors and pairwise-reduces them down to a single 16-byte vector. Calling it with the same vector for both arguments leads to the result of the pairwise max being duplicated in the upper 8 bytes and the lower 8 bytes. Then we can extract the lower 8 bytes as a u64 and compare to zero. Or, if you apply a mask first, you could use a leading/trailing bit count to get the offset, just like with movemask.

Here is a diagram to (hopefully) make it clearer

a = [1, 2, 3, 4, ..., 15, 16; 16]
b = [1, 2, 3, 4, ..., 15, 16; 16]
# after calling vpmaxq on the same vector twice, each lane contains the max of a pair of lanes
vpmaxq = [1&2, 3&4, 5&6, ..., 15&16, 1&2, 3&4, ..., 15&16; 16]

@danlark1

danlark1 commented Jul 11, 2023

Hi, thanks for including me

umaxp (unsigned maximum pairwise) is an instruction which takes each pair of consecutive bytes from the concatenation of its second and third operands and produces the biggest byte of every pair. This means that if the second and third operands are the same, the first 16 bytes are reduced to 8 bytes, where every result_byte[i] = max(source_byte[2 * i], source_byte[2 * i + 1]).

Then

fmov    synd, dend
cbnz    synd, L(end)

takes the 64-bit integer and compares it with zero, which is a correct way to check whether you have any match inside the vector. As said, umaxp is better when the range is big enough, as it utilizes 2 pipelines whereas shrn uses just 1 on modern server hardware.

You are also correct about umaxv; it's a maximum across the whole vector. However, on server platforms such as Neoverse N2 it has latency 2 or 4 and uses only the V1 pipeline (rather than latency 2 with both V0/V1), whereas umaxp's latency is only 2 and both pipelines are used.

Say, if you test it on Graviton 3, umaxp will be better.

[image: instruction latency/pipeline table]

For Apple I think it should be the same

SHRN is good for emulating movemask and thus pretty good for small/mid ranges for mem* functions. umaxp is good for long ranges and increasing the throughput. I believe it's consistent with the benchmarks.

Then there comes the question of what you are optimizing for. Most memchr calls are for strings up to 64-128 bytes, where shrn dominates. In the aarch64 optimized routines we decided to use shrn for strings <= 16 bytes and umaxp for > 16 bytes, because we haven't seen much regression for small-to-mid ranges. I wanted to go with only shrn, since memchr on big strings is rare in production, but did not push hard. Any version will be a great default in my opinion.
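
In intrinsics, the combination I'm describing looks roughly like this (an illustrative sketch only, not code from the PR or from optimized-routines):

use core::arch::aarch64::*;

// Pairwise max (umaxp) answers "any match in this 16-byte chunk?" cheaply;
// the shrn mask is only computed on the rarer hit path to recover the offset.
unsafe fn find_in_chunk(eq: uint8x16_t) -> Option<u32> {
    let folded = vreinterpretq_u64_u8(vpmaxq_u8(eq, eq));
    if vgetq_lane_u64::<0>(folded) == 0 {
        return None;
    }
    let mask = vget_lane_u64::<0>(vreinterpret_u64_u8(vshrn_n_u16::<4>(
        vreinterpretq_u16_u8(eq),
    )));
    Some(mask.trailing_zeros() / 4)
}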

@BurntSushi
Owner

Nice nice. So it turns out I was looking at vmaxq_u8 instead of vpmaxq_u8. Derp. Thank you all for the explanations, they were very helpful. Also thank you @danlark1 for that extra info about umaxv. I suspected as much that it would be sub-optimal in some respect.

@BurntSushi
Owner

OK, so now that I've sat down and familiarized myself with neon enough that I could write a memchr routine (and the memmem routines too), I've finally looked at this PR with a closer eye. I think that while this looks great, it is generally not my style and not something I'm keen on maintaining. The code is way too hard for me to follow. I generally prefer the duplicative but simpler code found in x86-64. I'm going to experiment with writing memchr routines for aarch64 that more closely mimic the x86-64 routines, and potentially even make those generic based on a lower-level generic Vector trait like I did for memmem. I'm not sure if I'll wind up there though, because of the movemask shenanigans. But if I don't, my answer will be to just duplicate the code.

@jacobbramley

Hi! We (Arm) are happy to look at this kind of thing. I'm not sure if rustbot works here, but if it does, then you can ping us that way. Alternatively, we lurk on Zulip in t-compiler/arm (amongst other streams). If there's anywhere else we should watch for requests like this, please let us know!

@BurntSushi
Owner

@jacobbramley Oh nice! There is no rustbot here, but once I have something for aarch64 I'll ping y'all for review. Thank you for chiming in.

@llogiq

llogiq commented Aug 24, 2023

I think I have an alternative to the movemask dance: if we can afford a pre-loaded POS_MASK = uint8x16_t(0, 1, 2, 3, 4, ..., 15), we can use vminvq_u8(vorrq_u8(vmvnq_u8(eq), POS_MASK)) to get either the first matching position (in 0..16) or 255 if no match was found. Not sure if that's actually faster, but it seems to use fewer steps. I don't have an ARM machine to test with though.
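
Spelled out as a function (untested on my end, so treat it as a sketch):

use core::arch::aarch64::*;

// Return the index of the first matching lane (0..16), or 255 if nothing matched.
unsafe fn first_match(eq: uint8x16_t) -> u8 {
    let pos: [u8; 16] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    let pos_mask = vld1q_u8(pos.as_ptr());
    // Matching lanes (0xFF in eq) become their own index after NOT + OR,
    // while non-matching lanes become 0xFF.
    let indexed = vorrq_u8(vmvnq_u8(eq), pos_mask);
    // The horizontal minimum picks out the earliest match (or 255 for "no match").
    vminvq_u8(indexed)
}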

@BurntSushi
Owner

@llogiq I almost have this done. I'm basically in the final polishing/testing/measuring phases. It should be very easy to experiment with alternative approaches at that point. :-)

@BurntSushi
Owner

Closed by #129 (see PR comment for benchmark results).

Thanks everyone for pushing this forward. I don't think I would have ended up with something so good without it. :)

@BurntSushi BurntSushi closed this Aug 28, 2023