Implement Neon SIMD #39

Merged
merged 2 commits into from
Aug 24, 2022

CryZe
Contributor

@CryZe CryZe commented Oct 21, 2021

This adds support for the NEON SIMD instructions. The intrinsics are still nightly-only and some of them are missing, but the remaining ones are close to being implemented, so stabilization shouldn't be too far off. There are a few semantic differences between the targets, which leaves some open questions in the meantime.

@@ -71,11 +97,75 @@ impl u32x8 {
Self(cmp_eq_mask_i32_m128i(self.0, rhs.0), cmp_eq_mask_i32_m128i(self.1, rhs.1))
} else if #[cfg(all(feature = "simd", target_feature = "simd128"))] {
Self(u32x4_eq(self.0, rhs.0), u32x4_eq(self.1, rhs.1))
} else if #[cfg(all(feature = "simd", target_arch = "aarch64", target_feature = "neon"))] {
Collaborator

aarch64 can be without neon?

Contributor Author

Yeah it just uses the fallback code then.

Collaborator

I mean the actual hardware. Seems like a redundant check.

Contributor Author

@CryZe CryZe Oct 21, 2021

Ohh, I'm not entirely sure tbh. You can certainly tell Rust to compile for aarch64 without that target feature though, which would start breaking things. So you might say "alright let's just remove the target_arch check then instead", but there's Neon on ARMv7 too and it doesn't have some of these instructions (I may look into adding support for that before merging too).

Collaborator

Well, Rust crashes when a user tries to disable sse/sse2, so... If you think it's needed, then fine. As long as it runs on M1 I'm fine.
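
The gating discussed in this thread can be sketched as follows. This is a minimal illustration, not the PR's code: `add4` is a hypothetical helper, the arrays stand in for the crate's vector types, and the `feature = "simd"` condition from the diff is left out to keep the example short. The NEON branch is compiled only when the target is AArch64 and NEON is statically enabled; every other configuration, including AArch64 built without the `neon` target feature, gets the scalar fallback.

```rust
// NEON path: selected only for aarch64 builds with NEON statically enabled.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use core::arch::aarch64::{vaddq_f32, vld1q_f32, vst1q_f32};

    let mut out = [0.0f32; 4];
    // SAFETY: all pointers come from 4-element f32 arrays, and the cfg above
    // guarantees that NEON is available at compile time.
    unsafe {
        let sum = vaddq_f32(vld1q_f32(a.as_ptr()), vld1q_f32(b.as_ptr()));
        vst1q_f32(out.as_mut_ptr(), sum);
    }
    out
}

// Scalar fallback: used on every other target, and on aarch64 when the
// `neon` target feature has been disabled.
#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}
```

Callers use `add4` the same way on every target; keeping both conditions means a build with NEON disabled silently falls back instead of failing to compile.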

@RazrFalcon
Collaborator

No way to test ARM code on CI. This is the real bummer.

@jrmuizel

@CryZe CryZe changed the title from "Implements Neon SIMD" to "Implement Neon SIMD" Oct 21, 2021
@RazrFalcon
Collaborator

@jrmuizel They are using docker. Seems like an overly complicated solution. Looks like Circle CI provides ARM VMs.

@CryZe
Contributor Author

CryZe commented Oct 21, 2021

The cross tool is trivial to set up in CI: all you need to do is run `cross test` instead of `cargo test`. It supports aarch64 and many other architectures (internally it also uses docker and qemu).

@RazrFalcon
Collaborator

Interesting. Would you mind including this into this PR?

Also, I would love to reduce unsafe usage, but looks like safe_arch doesn't support arm. Right now, tiny-skia doesn't rely on any unsafe code blocks and I want to keep it that way.

@CryZe
Contributor Author

CryZe commented Dec 23, 2021

Alright, so NEON intrinsics are about to stabilize in Rust 1.59, so I've started working on this again.

> Also, I would love to reduce unsafe usage, but looks like safe_arch doesn't support arm. Right now, tiny-skia doesn't rely on any unsafe code blocks and I want to keep it that way.

I've written a PR against bytemuck that allows removing the unsafe transmutes for the NEON vectors. It already got merged. I also ported all the NEON intrinsics into safe functions for merging into safe_arch. I'll open that PR soon.

A thing I've noticed though is that this will almost definitely require 1.59 as the minimum Rust version on at least AArch64, unless we introduce some kind of extra feature that gates AArch64 from using SIMD on earlier Rust versions.

Also since the const generics are unavoidable, I think we need at least 1.51 as the new minimum Rust version (the new bytemuck and safe_arch will require this too, at least according to tiny-skia's Cargo.toml).

@RazrFalcon
Collaborator

I see. I guess an MSRV bump is unavoidable. But I would like to keep ARM feature-gated or something. Or maybe we could specify two MSRVs, one for each target? Would we still be able to build x86 on an older version?

Also, thanks a lot for your work. I really don't have time lately and those changes are far from trivial.

@RazrFalcon
Collaborator

Could you also update the Readme? Currently, it mentions that ARM is not supported in the Performance section. Thanks.

@RazrFalcon
Collaborator

Hi! Any updates on this?

@CryZe
Contributor Author

CryZe commented Jun 4, 2022

Not yet, I have a WIP branch for safe arch, but I haven't opened a PR for it yet. I'll probably revisit this soon.

@Brooooooklyn

Brooooooklyn commented Aug 11, 2022

> No way to test ARM code on CI. This is the real bummer.

We can test it via multiarch/ubuntu-core:aarch64-focal docker image, which is qemu under the hood, and it supports all neon instructions: https://github.com/Brooooooklyn/canvas/runs/7711666646?check_suite_focus=true

@RazrFalcon RazrFalcon mentioned this pull request Aug 23, 2022
@RazrFalcon
Collaborator

@CryZe I've dropped safe_arch support in favor of explicit SIMD. This should make this patch way easier.

This adds support for the NEON SIMD instructions.
@CryZe CryZe force-pushed the neon-simd branch 2 times, most recently from d173dce to 89cf8e7 Compare August 24, 2022 19:07
@CryZe CryZe marked this pull request as ready for review August 24, 2022 19:07
@CryZe
Contributor Author

CryZe commented Aug 24, 2022

Well that was easy :D (didn't expect the CI to work first try, especially considering I didn't even run the tests locally).

It's quite easy to run tests for other architectures with `cross`.
@@ -104,3 +104,11 @@ impl core::ops::Div<f32x2> for f32x2 {
])
}
}

fn pmax(a: f32, b: f32) -> f32 {
Collaborator

@CryZe Is this a no_std variant?

Contributor Author

No, basically as I've noticed back then when I opened this PR, there's a bunch of platform differences when it comes to NaN, but Skia doesn't seem to care about them. Rust's default max is "unnecessarily slow" in that it cares for NaN (and I believe other edge cases), so I ported the pmin/pmax logic from wasm (that also is meant to ignore NaN and just do the fastest thing possible) so it can be used in the fallback code.

Collaborator

Are you sure? It seems it's just an intrinsic: https://doc.rust-lang.org/stable/src/core/num/f32.rs.html#749

Contributor Author

I compiled this list back then that really shows how much of a mess min and max are: https://gist.github.com/CryZe/30cc76f4629cb0846d5a9b8d13144649

Contributor Author

Yes, it just calls into fmaxf, which is C's overcomplicated function: https://rust.godbolt.org/z/KKz7875sr

Collaborator

Ok, I will look into it in more details. We can leave it as is for now.

Contributor Author

Collaborator

Hmm, I see. We technically shouldn't have NaN values to begin with, so a custom version should be fine.
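
For readers following the NaN discussion, here is a small sketch of what a "pseudo" min/max looks like. The body of the PR's `pmax` is not shown above, so this is an assumption: it follows WebAssembly's `pmin`/`pmax` definitions (`b < a ? b : a` and `a < b ? b : a`), which the author says the fallback was ported from.

```rust
/// Pseudo-maximum: a single comparison with no dedicated NaN handling.
/// Any comparison involving NaN is false, so `a` is returned in that case.
fn pmax(a: f32, b: f32) -> f32 {
    if a < b { b } else { a }
}

/// Pseudo-minimum, the mirror image of `pmax`.
fn pmin(a: f32, b: f32) -> f32 {
    if b < a { b } else { a }
}
```

The difference from the standard library only shows up with NaN inputs: `f32::max(f32::NAN, 1.0)` returns `1.0`, while `pmax(f32::NAN, 1.0)` returns NaN and `pmax(1.0, f32::NAN)` returns `1.0`. If NaN never reaches this code, as suggested above, both give the same results.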

@@ -95,6 +113,10 @@ impl core::ops::Add for f32x4 {
Self(unsafe { _mm_add_ps(self.0, rhs.0) })
} else if #[cfg(all(feature = "simd", target_feature = "simd128"))] {
Self(f32x4_add(self.0, rhs.0))
} else if #[cfg(all(feature = "simd", target_arch = "aarch64", target_feature = "neon"))] {
Collaborator

Do we still need target_arch = "aarch64"? Shouldn't target_feature = "neon" be enough?

Contributor Author

Yeah we had this discussion before. I believe I want to keep this for the eventual followup PR that includes 32-bit ARM into the equation (which doesn't have neon by default).

Collaborator

Oh yes. Verbose, but fine.

} else if #[cfg(all(feature = "simd", target_arch = "aarch64", target_feature = "neon"))] {
unsafe {
Self(
core::mem::transmute(vceqq_f32(self.0, rhs.0)),
Collaborator

Why do we need transmute here?

Contributor Author

Because your methods here are "weird" in that they do a comparison but return a floating point type, even though they really are masks. On the other architectures we got "lucky" in that they don't differentiate the types.

Collaborator

Well, this is definitely Skia being "weird". And it seems like they use vreinterpretq_f32_u32 instead. Should we as well? Not sure how different it is from transmute.

Contributor Author

Though I believe I added Neon types to bytemuck, so we could just use bytemuck here.
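
To make the type mismatch concrete, here is a rough sketch of the `vreinterpretq_f32_u32` option mentioned above. It is not the PR's code: `cmp_eq_mask` is a hypothetical helper working on plain arrays. On NEON, `vceqq_f32` returns a `uint32x4_t` mask, and the reinterpret intrinsic changes only the vector's type, not its bits. The scalar branch produces the same bit pattern so the example behaves identically on other targets.

```rust
// NEON path: compare, then reinterpret the integer mask as float lanes.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
fn cmp_eq_mask(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use core::arch::aarch64::{
        float32x4_t, uint32x4_t, vceqq_f32, vld1q_f32, vreinterpretq_f32_u32, vst1q_f32,
    };

    let mut out = [0.0f32; 4];
    // SAFETY: all pointers come from 4-element f32 arrays, and the cfg above
    // guarantees that NEON is available at compile time.
    unsafe {
        // Each lane is all ones where the inputs are equal, all zeros otherwise.
        let mask: uint32x4_t = vceqq_f32(vld1q_f32(a.as_ptr()), vld1q_f32(b.as_ptr()));
        // Same 128 bits, now typed as four f32 lanes.
        let as_float: float32x4_t = vreinterpretq_f32_u32(mask);
        vst1q_f32(out.as_mut_ptr(), as_float);
    }
    out
}

// Scalar fallback that builds the same all-ones / all-zeros lanes.
#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
fn cmp_eq_mask(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let lane = |x: f32, y: f32| f32::from_bits(if x == y { u32::MAX } else { 0 });
    [lane(a[0], b[0]), lane(a[1], b[1]), lane(a[2], b[2]), lane(a[3], b[3])]
}
```

A "true" lane is an all-ones bit pattern, which is a NaN when read as `f32`, so the result should be inspected with `f32::to_bits` rather than compared as floats.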

@RazrFalcon
Collaborator

Thanks for the patch! It took a while to merge (10 months...), but we're finally here. Hopefully I will publish a new release this weekend.

@RazrFalcon RazrFalcon merged commit 3457f25 into linebender:master Aug 24, 2022
@RazrFalcon
Collaborator

Almost 3x faster on Apple M1. Amazing!

@Shnatsel

Excellent! I think the README needs an update, as it still reads:

> Skia also supports ARM NEON instructions, which are unavailable in a stable Rust at the moment. Therefore a fallback scalar implementation will be used instead on ARM and other non-x86 targets. So if you're targeting ARM, you better stick with Skia.

@RazrFalcon
Collaborator

Yes, thanks!

@CryZe CryZe deleted the neon-simd branch August 25, 2022 14:09
5 participants