Implement Neon SIMD #39
@@ -71,11 +97,75 @@ impl u32x8 {
Self(cmp_eq_mask_i32_m128i(self.0, rhs.0), cmp_eq_mask_i32_m128i(self.1, rhs.1))
} else if #[cfg(all(feature = "simd", target_feature = "simd128"))] {
Self(u32x4_eq(self.0, rhs.0), u32x4_eq(self.1, rhs.1))
} else if #[cfg(all(feature = "simd", target_arch = "aarch64", target_feature = "neon"))] {
aarch64 can be without neon?
Yeah it just uses the fallback code then.
I mean the actual hardware. Seems like a redundant check.
Ohh, I'm not entirely sure tbh. You can certainly tell Rust to compile for aarch64 without that target feature though, which would start breaking things. So you might say "alright let's just remove the target_arch check then instead", but there's Neon on ARMv7 too and it doesn't have some of these instructions (I may look into adding support for that before merging too).
Well, Rust crashes when a user tries to disable sse/sse2, so... If you think it's needed, then fine. As long as it runs on M1 I'm fine.
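The gating discussed in this thread can be sketched as two definitions of the same function, selected at compile time. This is only an illustration: `add4` is a hypothetical helper, not a function from this PR, and the real code uses `cfg_if` with an additional `feature = "simd"` check. The NEON intrinsics used here come from `core::arch::aarch64`.

```rust
// Sketch only: `add4` is a hypothetical helper, not code from this PR.

// NEON path: compiled only when targeting AArch64 *and* the `neon`
// target feature is enabled at compile time.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use core::arch::aarch64::{vaddq_f32, vld1q_f32, vst1q_f32};
    let mut out = [0.0f32; 4];
    // SAFETY: `a`, `b` and `out` each hold four contiguous f32 values,
    // and the cfg above guarantees NEON is available.
    unsafe {
        let sum = vaddq_f32(vld1q_f32(a.as_ptr()), vld1q_f32(b.as_ptr()));
        vst1q_f32(out.as_mut_ptr(), sum);
    }
    out
}

// Fallback path: used on every other target, including AArch64 builds
// where `neon` has been disabled.
#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}
```

Keeping both conditions means the intrinsics are never referenced on a target where the `core::arch::aarch64` module does not exist, which is the concern raised about ARMv7.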
No way to test ARM code on CI. This is the real bummer.
stdarch tests ARM code on CI:
@jrmuizel They are using Docker. Seems like an overly complicated solution. Looks like Circle CI provides ARM VMs.
The cross tool is trivial to set up in CI and all you need to do is
Interesting. Would you mind including this in this PR? Also, I would love to reduce unsafe usage, but it looks like safe_arch doesn't support ARM. Right now, tiny-skia doesn't rely on any unsafe code blocks and I want to keep it that way.
Alright, so NEON intrinsics are about to stabilize in Rust 1.59, so I've started working on this again.
I've written a PR against bytemuck that allows removing the unsafe transmutes for the NEON vectors. It already got merged. I also ported all the NEON intrinsics into safe functions for merging into safe_arch. I'll open that PR soon. A thing I've noticed though is that this will almost definitely require 1.59 as the minimum Rust version on at least AArch64, unless we introduce some kind of extra feature that gates AArch64 from using SIMD on earlier Rust versions. Also since the const generics are unavoidable, I think we need at least 1.51 as the new minimum Rust version (the new bytemuck and safe_arch will require this too, at least according to tiny-skia's Cargo.toml).
I see. I guess an MSRV bump is unavoidable. But I would like to keep ARM feature-gated or something. Or maybe we could specify a separate MSRV for each target? Would we still be able to build x86 on an older version? Also, thanks a lot for your work. I really don't have time lately and those changes are far from trivial.
Could you also update the Readme? Currently, it mentions that ARM is not supported in the Performance section. Thanks.
Hi! Any updates on this?
Not yet, I have a WIP branch for safe_arch, but I haven't opened a PR for it yet. I'll probably revisit this soon.
We can test it via
@CryZe I've dropped
This adds support for the NEON SIMD instructions.
d173dce to 89cf8e7
Well that was easy :D (didn't expect the CI to work first try, especially considering I didn't even run the tests locally).
It's quite easy to run tests for other architectures with `cross`.
@@ -104,3 +104,11 @@ impl core::ops::Div<f32x2> for f32x2 {
])
}
}

fn pmax(a: f32, b: f32) -> f32 {
@CryZe Is this a no_std variant?
No, basically as I've noticed back then when I opened this PR, there's a bunch of platform differences when it comes to NaN, but Skia doesn't seem to care about them. Rust's default max is "unnecessarily slow" in that it cares for NaN (and I believe other edge cases), so I ported the pmin/pmax logic from wasm (that also is meant to ignore NaN and just do the fastest thing possible) so it can be used in the fallback code.
Are you sure? It seems it's just an intrinsic: https://doc.rust-lang.org/stable/src/core/num/f32.rs.html#749
I compiled this list back then that really shows how much of a mess min and max are: https://gist.github.com/CryZe/30cc76f4629cb0846d5a9b8d13144649
Yes, it just calls into fmaxf, which is C's overcomplicated function: https://rust.godbolt.org/z/KKz7875sr
OK, I will look into it in more detail. We can leave it as is for now.
Hmm, I see. We technically shouldn't have NaN values to begin with, so a custom version should be fine.
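For readers following the NaN discussion, here is a minimal sketch of what a pseudo-min/max pair looks like. It assumes the WebAssembly `pmin`/`pmax` definition (a single comparison and a select); the body actually used in this PR is not shown in the thread and may differ, and `pmin` is included only for symmetry.

```rust
// Sketch only: follows WebAssembly's pseudo-min/max definition, which
// may differ from the body this PR actually uses.

/// Pseudo-maximum: returns `b` when `a < b`, otherwise `a`.
/// No NaN handling: any comparison with NaN is false, so `a` is returned.
fn pmax(a: f32, b: f32) -> f32 {
    if a < b { b } else { a }
}

/// Pseudo-minimum: returns `b` when `b < a`, otherwise `a`.
fn pmin(a: f32, b: f32) -> f32 {
    if b < a { b } else { a }
}
```

With this definition the result depends on argument order when a NaN is involved: `pmax(f32::NAN, 1.0)` is NaN while `pmax(1.0, f32::NAN)` is `1.0`. `f32::max` returns the non-NaN operand in both cases, which is the extra work the custom version skips.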
@@ -95,6 +113,10 @@ impl core::ops::Add for f32x4 {
Self(unsafe { _mm_add_ps(self.0, rhs.0) })
} else if #[cfg(all(feature = "simd", target_feature = "simd128"))] {
Self(f32x4_add(self.0, rhs.0))
} else if #[cfg(all(feature = "simd", target_arch = "aarch64", target_feature = "neon"))] {
Do we still need `target_arch = "aarch64"`? Shouldn't `target_feature = "neon"` be enough?
Yeah we had this discussion before. I believe I want to keep this for the eventual followup PR that includes 32-bit ARM into the equation (which doesn't have neon by default).
Oh yes. Verbose, but fine.
} else if #[cfg(all(feature = "simd", target_arch = "aarch64", target_feature = "neon"))] {
unsafe {
Self(
core::mem::transmute(vceqq_f32(self.0, rhs.0)),
Why do we need transmute here?
Because your methods here are "weird" in that they do a comparison but return a floating point type, even though they really are masks. On the other architectures we got "lucky" in that they don't differentiate the types.
Well, this is definitely Skia being "weird". And it seems like they use vreinterpretq_f32_u32 instead. Should we as well? Not sure how different it is from transmute.
Though I believe I added Neon types to bytemuck, so we could just use bytemuck here.
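To make the "mask stored in a float vector" point concrete, here is a rough sketch. The function names are hypothetical and not taken from tiny-skia. The portable version builds each lane from raw bits with `f32::from_bits`. The AArch64 version shows the `vreinterpretq_f32_u32` alternative to `transmute` mentioned above; both are bit-for-bit reinterpretations, but the intrinsic only accepts a `uint32x4_t` and only returns a `float32x4_t`, so the types are checked.

```rust
// Sketch only: hypothetical helpers illustrating the mask-as-float idea.

/// Portable reference: each lane is all-ones bits when equal, all-zero
/// bits otherwise, stored in an f32 without any numeric conversion.
fn cmp_eq_mask(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        let bits: u32 = if a[i] == b[i] { u32::MAX } else { 0 };
        out[i] = f32::from_bits(bits);
    }
    out
}

/// NEON variant: `vceqq_f32` yields a `uint32x4_t` mask, which
/// `vreinterpretq_f32_u32` reinterprets as `float32x4_t`.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
fn cmp_eq_mask_neon(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use core::arch::aarch64::{vceqq_f32, vld1q_f32, vreinterpretq_f32_u32, vst1q_f32};
    let mut out = [0.0f32; 4];
    // SAFETY: all three arrays hold four contiguous f32 values, and the
    // cfg above guarantees NEON is available.
    unsafe {
        let mask = vceqq_f32(vld1q_f32(a.as_ptr()), vld1q_f32(b.as_ptr()));
        vst1q_f32(out.as_mut_ptr(), vreinterpretq_f32_u32(mask));
    }
    out
}
```

An all-ones lane is a NaN when read as a float, so such masks should be inspected with `to_bits()` rather than compared with `==`.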
Thanks for the patch! I took a while to merge (10 months...), but we're finally here. Hopefully I will publish a new release this weekend.
Almost 3x faster on Apple M1. Amazing!
Excellent! I think the README needs an update, as it still reads:
Yes, thanks!
This adds support for the NEON SIMD instructions. They are still nightly only and some instructions are missing, but they are close to finishing implementing all of them, so stabilization shouldn't be too far off. There are a few semantic differences between the different targets, which leaves some open questions in the meantime.