Support for ARM SVE2. #8051
Conversation
…st at 128 and 256 bits.
…ectors. Use proper prefix for neon intrinsics. Comment cleanups.
…and SVE2 with simd_op_check_sve2 fails, as both possibilities need to be allowed for 128-bit or smaller operations.
Change name of test class to avoid confusion.
I can't seem to add Steve Suzuki to the reviewers list. Apologies if I'm missing something.
Replace internal_error with a no-op return when CodeGen_LLVM::match_vector_type_scalable is called on a scalar.
Remove dead code.
…with lanes of zero, which is not correct.
Looks like a legit failure in correctness_vector_reductions on arm64.
…width multiplier applied to intrinsics.
…look like a TODO is needed anymore.
Why is the IR sharing issue only affecting this SIMD op check test, not others?
The other simd op check tests just use Exprs and some dummy calls to functions that get subbed out with per-test-case local ImageParams, so there's no shared mutable IR. This test defines Funcs that are used by multiple test cases. It's the sharing of those Funcs that caused the problem.
Though now that you say that, I guess we should move the deep copying code I added into the base class so that this isn't an issue elsewhere later.
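A minimal sketch of the idea, with invented names rather than the actual test code: build a fresh Func per test case instead of sharing one, so no mutable IR crosses test-case boundaries:

```c++
#include "Halide.h"
using namespace Halide;

// Hypothetical illustration only. A Func defined once and reused by
// several test cases is shared mutable IR; rebuilding it per test case
// (or deep-copying it first) keeps each case independent.
Func make_test_func(const ImageParam &in) {
    Var x("x");
    Func f("f");
    f(x) = in(x) * 2;  // each call yields an independent Func and fresh IR
    return f;
}
```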
Is this ready to land? Are there remaining issues? Should I run a global presubmit inside Google first?
It should be ready to land, pending validation that it does not break anything in existing ARM support, so please run a global presubmit. We can discuss testing correctness one-on-one. My first attempt to test this under QEMU failed completely, because FreeBSD on QEMU on macOS (M3) is an exercise in disk corruption.
Looks to be clean inside Google; are we ready to land?
I still need to benchmark it on our workloads.
On average it makes no difference to our neon workloads, though neon codegen definitely changed in a bunch of places. I did find one minor inefficiency while eyeballing things, which in turn uncovered an inefficiency in main. Consider this pipeline:
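The pipeline itself didn't survive in this copy of the thread; a minimal hypothetical reconstruction of the pattern described below (a float32 input whose floor and ceil are cast to int32, vectorized) would be something like:

```c++
#include "Halide.h"
using namespace Halide;

int main() {
    // Hypothetical reconstruction: floor() and ceil() of a float input,
    // each cast to int32, inside a vectorized loop.
    ImageParam in(Float(32), 1);
    Var x("x");
    Func f("f");
    f(x) = cast<int32_t>(floor(in(x))) + cast<int32_t>(ceil(in(x)));
    f.vectorize(x, 4);
    f.compile_to_assembly("f.s", {in}, Target("arm-64-linux"));
    return 0;
}
```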
On both main and the branch this generates the inner loop:
It rounds up and rounds down as a float, then converts to an int. It should instead be using a different rounding mode on the int conversions and just using fcvtms and fcvtps. If you remove the .vectorize call, then on main we get:
and on the branch we get:
It looks like the branch goes down the vector path even in the scalar case, and hits the issue we already have with vector floor/ceil. For the vector case, it looks like the LLVM IR is:
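The IR snippet didn't survive here; roughly, a floor-then-convert sequence (value names illustrative) looks like:

```llvm
%f = call <4 x float> @llvm.floor.v4f32(<4 x float> %x)
%i = fptosi <4 x float> %f to <4 x i32>
```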
but we should instead be calling the intrinsic llvm.aarch64.neon.fcvtms.v4i32.v4f32. I'll open a PR onto this branch.
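In IR terms, a hedged one-line sketch of the desired form (value names again illustrative) is a single converting call:

```llvm
%i = call <4 x i32> @llvm.aarch64.neon.fcvtms.v4i32.v4f32(<4 x float> %x)
```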
Oops, looks like I broke something. Will fix.
Can't seem to repro locally, and in the LLVM commit log there have been reverts of AArch64 stuff. I'll just do a merge with main and see what happens.
This revealed a bug. FindIntrinsics was not enabled for scalars anyway, so it was semi-pointless.
Heavily based on Steve Suzuki's work in #6781. Hopefully easier to merge, with less effect on existing ARM support and fewer constraints on CodeGen_LLVM.