Implement Neon SIMD #39

CryZe · 2021-10-21T19:17:48Z

This adds support for the NEON SIMD instructions. They are still nightly-only and some instructions are missing, but the remaining ones are close to being implemented, so stabilization shouldn't be too far off. There are a few semantic differences between the different targets, which leave some open questions in the meantime.

src/wide/f32x2_t.rs

src/wide/f32x4_t.rs

src/wide/f32x8_t.rs

src/wide/f32x16_t.rs

src/wide/u32x8_t.rs

RazrFalcon · 2021-10-21T19:35:14Z

src/wide/u32x8_t.rs

@@ -71,11 +97,75 @@ impl u32x8 {
                Self(cmp_eq_mask_i32_m128i(self.0, rhs.0), cmp_eq_mask_i32_m128i(self.1, rhs.1))
            } else if #[cfg(all(feature = "simd", target_feature = "simd128"))] {
                Self(u32x4_eq(self.0, rhs.0), u32x4_eq(self.1, rhs.1))
+            } else if #[cfg(all(feature = "simd", target_arch = "aarch64", target_feature = "neon"))] {


aarch64 can be without neon?

Yeah it just uses the fallback code then.

I mean the actual hardware. Seems like a redundant check.

Ohh, I'm not entirely sure tbh. You can certainly tell Rust to compile for aarch64 without that target feature though, which would start breaking things. So you might say "alright let's just remove the target_arch check then instead", but there's Neon on ARMv7 too and it doesn't have some of these instructions (I may look into adding support for that before merging too).

Well, Rust crashes when a user tries to disable sse/sse2, so... If you think it's needed, then fine. As long as it runs on M1 I'm fine.
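The gating discussed in this thread can be sketched as a pair of `cfg`-selected functions. This is a minimal illustration and not the PR's code: `add4` is a hypothetical helper, and the crate's `feature = "simd"` condition is left out so the snippet builds without a Cargo feature.

```rust
// NEON path: compiled only when targeting aarch64 with the `neon`
// target feature enabled, mirroring the condition in the diff above.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use core::arch::aarch64::{vaddq_f32, vld1q_f32, vst1q_f32};
    let mut out = [0.0f32; 4];
    // The loads and the store take raw pointers, so they need `unsafe`.
    // Both arrays hold exactly four lanes, which is what they read/write.
    unsafe {
        let sum = vaddq_f32(vld1q_f32(a.as_ptr()), vld1q_f32(b.as_ptr()));
        vst1q_f32(out.as_mut_ptr(), sum);
    }
    out
}

// Scalar fallback: selected on every other target, including an
// aarch64 build where the `neon` target feature has been turned off.
#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}
```

Calling `add4([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])` returns `[11.0, 22.0, 33.0, 44.0]` on either path. Because the two conditions are exact complements, exactly one definition is compiled for any target.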

RazrFalcon · 2021-10-21T19:38:40Z

No way to test ARM code on CI. This is the real bummer.

jrmuizel · 2021-10-21T19:40:32Z

stdarch tests ARM code on CI:
https://github.com/rust-lang/stdarch/blob/master/.github/workflows/main.yml

RazrFalcon · 2021-10-21T19:54:40Z

@jrmuizel They are using docker. Seems like an overly complicated solution. Looks like Circle CI provides ARM VMs.

CryZe · 2021-10-21T19:57:06Z

The `cross` tool is trivial to set up in CI and all you need to do is run `cross test` instead of `cargo test`. It supports aarch64 and many other architectures (internally it also uses docker and qemu).

RazrFalcon · 2021-10-21T20:00:17Z

Interesting. Would you mind including this into this PR?

Also, I would love to reduce unsafe usage, but looks like safe_arch doesn't support arm. Right now, tiny-skia doesn't rely on any unsafe code blocks and I want to keep it that way.

src/pipeline/highp.rs

CryZe · 2021-12-23T16:32:03Z

Alright, so NEON intrinsics are about to stabilize in Rust 1.59, so I've started working on this again.

> Also, I would love to reduce unsafe usage, but looks like safe_arch doesn't support arm. Right now, tiny-skia doesn't rely on any unsafe code blocks and I want to keep it that way.

I've written a PR against bytemuck that allows removing the unsafe transmutes for the NEON vectors. It already got merged. I also ported all the NEON intrinsics into safe functions for merging into safe_arch. I'll open that PR soon.

A thing I've noticed though is that this will almost definitely require 1.59 as the minimum Rust version on at least AArch64, unless we introduce some kind of extra feature that gates AArch64 from using SIMD on earlier Rust versions.

Also since the const generics are unavoidable, I think we need at least 1.51 as the new minimum Rust version (the new bytemuck and safe_arch will require this too, at least according to tiny-skia's Cargo.toml).

RazrFalcon · 2021-12-23T18:07:24Z

I see. I guess an MSRV bump is unavoidable. But I would like to keep ARM feature-gated or something. Or maybe we could specify a separate MSRV for each target? Would we still be able to build x86 on an older version?

Also, thanks a lot for your work. I really don't have time lately and those changes are far from trivial.

RazrFalcon · 2021-12-25T12:38:54Z

Could you also update the Readme? Currently, it mentions that ARM is not supported in the Performance section. Thanks.

RazrFalcon · 2022-06-04T11:01:03Z

Hi! Any updates on this?

CryZe · 2022-06-04T11:12:58Z

Not yet, I have a WIP branch for safe arch, but I haven't opened a PR for it yet. I'll probably revisit this soon.

Brooooooklyn · 2022-08-11T05:08:35Z

> No way to test ARM code on CI. This is the real bummer.

We can test it via multiarch/ubuntu-core:aarch64-focal docker image, which is qemu under the hood, and it supports all neon instructions: https://github.com/Brooooooklyn/canvas/runs/7711666646?check_suite_focus=true

RazrFalcon · 2022-08-24T17:07:27Z

@CryZe I've dropped safe_arch support in favor of explicit SIMD. This should make this patch way easier.

This adds support for the NEON SIMD instructions.

CryZe · 2022-08-24T19:07:45Z

Well that was easy :D (didn't expect the CI to work first try, especially considering I didn't even run the tests locally).

It's quite easy to run tests for other architectures with `cross`.

RazrFalcon · 2022-08-24T19:18:51Z

path/src/f32x2_t.rs

@@ -104,3 +104,11 @@ impl core::ops::Div<f32x2> for f32x2 {
        ])
    }
 }
+
+fn pmax(a: f32, b: f32) -> f32 {


@CryZe Is this a no_std variant?

No. As I noticed back when I opened this PR, there are a bunch of platform differences when it comes to NaN, but Skia doesn't seem to care about them. Rust's default max is "unnecessarily slow" in that it handles NaN (and I believe other edge cases), so I ported the pmin/pmax logic from wasm (which is also meant to ignore NaN and just do the fastest thing possible) so it can be used in the fallback code.

Are you sure? It seems it's just an intrinsic: https://doc.rust-lang.org/stable/src/core/num/f32.rs.html#749

I compiled this list back then that really shows how much of a mess min and max are: https://gist.github.com/CryZe/30cc76f4629cb0846d5a9b8d13144649

Yes, it just calls into fmaxf, which is C's overcomplicated function: https://rust.godbolt.org/z/KKz7875sr

Ok, I will look into it in more detail. We can leave it as is for now.

https://docs.rs/libm/latest/src/libm/math/fmaxf.rs.html#11

Hmm, I see. We technically shouldn't have NaN values to begin with, so a custom version should be fine.
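For reference, the wasm-style pseudo-min/max mentioned in this thread can be sketched as follows. The bodies are an assumption based on the WebAssembly SIMD definitions (`pmax(a, b) = a < b ? b : a`, `pmin(a, b) = b < a ? b : a`); the diff excerpt shows only the `pmax` signature, and `pmin` here is an illustrative counterpart.

```rust
/// Pseudo-maximum: a single comparison and a select, with no NaN handling.
/// If either operand is NaN the comparison is false and `a` is returned.
fn pmax(a: f32, b: f32) -> f32 {
    if a < b { b } else { a }
}

/// Pseudo-minimum: the mirror image of `pmax`.
/// If either operand is NaN the comparison is false and `a` is returned.
fn pmin(a: f32, b: f32) -> f32 {
    if b < a { b } else { a }
}
```

For ordinary inputs these agree with `f32::max` and `f32::min`. They differ only when a NaN is involved: `pmax(f32::NAN, 1.0)` returns NaN because the first operand is passed through, whereas `f32::NAN.max(1.0)` returns `1.0`. That is the edge case the thread decides it does not need to care about.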

RazrFalcon · 2022-08-24T19:20:45Z

src/wide/f32x4_t.rs

@@ -95,6 +113,10 @@ impl core::ops::Add for f32x4 {
                Self(unsafe { _mm_add_ps(self.0, rhs.0) })
            } else if #[cfg(all(feature = "simd", target_feature = "simd128"))] {
                Self(f32x4_add(self.0, rhs.0))
+            } else if #[cfg(all(feature = "simd", target_arch = "aarch64", target_feature = "neon"))] {


Do we still need target_arch = "aarch64"? Shouldn't target_feature = "neon" be enough?

Yeah we had this discussion before. I believe I want to keep this for the eventual followup PR that includes 32-bit ARM into the equation (which doesn't have neon by default).

Oh yes. Verbose, but fine.

RazrFalcon · 2022-08-24T19:21:22Z

src/wide/f32x8_t.rs

+            } else if #[cfg(all(feature = "simd", target_arch = "aarch64", target_feature = "neon"))] {
+                unsafe {
+                    Self(
+                        core::mem::transmute(vceqq_f32(self.0, rhs.0)),


Why do we need transmute here?

Because your methods here are "weird" in that they do a comparison but return a floating point type, even though they really are masks. On the other architectures we got "lucky" in that they don't differentiate the types.

Well, this is definitely Skia being "weird". And it seems like they use vreinterpretq_f32_u32 instead. Should we as well? Not sure how different it is from transmute.

Though I believe I added Neon types to bytemuck, so we could just use bytemuck here.
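To illustrate the reinterpretation being discussed, here is a hedged sketch of a comparison that returns its mask in a float-typed value. `cmp_eq` is a hypothetical free function, not the crate's method. The NEON path uses `vreinterpretq_f32_u32`, which, as far as I know, is a pure bit-for-bit cast from `uint32x4_t` to `float32x4_t` and so should behave like the `transmute` in the diff while keeping the types checked.

```rust
// NEON path: `vceqq_f32` yields a `uint32x4_t` mask, which is then
// reinterpreted (no value conversion) as `float32x4_t`.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
fn cmp_eq(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use core::arch::aarch64::{vceqq_f32, vld1q_f32, vreinterpretq_f32_u32, vst1q_f32};
    let mut out = [0.0f32; 4];
    // Pointer-based loads/stores require `unsafe`; each array has four lanes.
    unsafe {
        let mask = vceqq_f32(vld1q_f32(a.as_ptr()), vld1q_f32(b.as_ptr()));
        vst1q_f32(out.as_mut_ptr(), vreinterpretq_f32_u32(mask));
    }
    out
}

// Scalar fallback: build the same all-ones / all-zeros bit patterns
// and store them in `f32` lanes via `f32::from_bits`.
#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
fn cmp_eq(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let lane = |x: f32, y: f32| f32::from_bits(if x == y { u32::MAX } else { 0 });
    [lane(a[0], b[0]), lane(a[1], b[1]), lane(a[2], b[2]), lane(a[3], b[3])]
}
```

A "true" lane has every bit set, which reads as a NaN when interpreted as a float, so the result should be inspected through `f32::to_bits` (or used only as a bitmask) rather than compared as a number.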

RazrFalcon · 2022-08-24T19:42:50Z

Thanks for the patch! It took a while to merge (10 months...), but we're finally here. Hopefully I will publish a new release this weekend.

RazrFalcon · 2022-08-24T21:33:10Z

Almost 3x faster on Apple M1. Amazing!

Shnatsel · 2022-08-24T22:05:27Z

Excellent! I think the README needs an update, as it still reads:

> Skia also supports ARM NEON instructions, which are unavailable in a stable Rust at the moment. Therefore a fallback scalar implementation will be used instead on ARM and other non-x86 targets. So if you're targeting ARM, you better stick with Skia.

RazrFalcon · 2022-08-25T14:05:55Z

Yes, thanks!