Support for ARM SVE2. #8051
Conversation
…st at 128 and 256 bits.
…ectors. Use proper prefix for neon intrinsics. Comment cleanups.
…and SVE2 with simd_op_check_sve2 fails, as both possibilities need to be allowed for 128-bit or smaller operations.
Change name of test class to avoid confusion.
I can't seem to add Steve Suzuki to the reviewers list. Apologies if I'm missing something.
Replace internal_error with a no-op return when CodeGen_LLVM::match_vector_type_scalable is called on a scalar.
Remove dead code.
…with lanes of zero, which is not correct.
Looks like a legit failure in correctness_vector_reductions on arm64.
…width multiplier applied to intrinsics.
…look like a TODO is needed anymore.
Why is the IR sharing issue only affecting this SIMD op check test, not others?
The other simd op check tests just use Exprs and some dummy calls to functions that get subbed out with per-test-case local ImageParams, so there's no shared mutable IR. This test defines Funcs that are used by multiple test cases. It's the sharing of those Funcs that caused the problem.
Though now that you say that, I guess we should move the deep copying code I added into the base class so that this isn't an issue elsewhere later.
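A minimal sketch of the idea, with invented names rather than the actual test code: build a fresh Func per test case instead of sharing one, so no mutable IR crosses test-case boundaries:

```c++
#include "Halide.h"
using namespace Halide;

// Hypothetical illustration only. A Func defined once and reused by
// several test cases is shared mutable IR; rebuilding it per test case
// (or deep-copying it first) keeps each case independent.
Func make_test_func(const ImageParam &in) {
    Var x("x");
    Func f("f");
    f(x) = in(x) * 2;  // each call yields an independent Func and fresh IR
    return f;
}
```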
Is this ready to land? Are there remaining issues? Should I run a global presubmit inside Google first?
It should be ready to land, pending validation that it does not break anything in existing ARM support, so please run a global presubmit. We can discuss testing correctness one-on-one. My first attempt to test this under QEMU failed completely, because FreeBSD on QEMU on macOS (M3) is an exercise in disk corruption.
Looks to be clean inside Google; are we ready to land?
I still need to benchmark it on our workloads.
On average it makes no difference to our neon workloads, though neon codegen definitely changed in a bunch of places. I did find one minor inefficiency while eyeballing things, which in turn uncovered an inefficiency in main. Consider this pipeline:
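The pipeline itself didn't survive in this copy of the thread; a minimal hypothetical reconstruction of the pattern described below (a float32 input whose floor and ceil are cast to int32, vectorized) would be something like:

```c++
#include "Halide.h"
using namespace Halide;

int main() {
    // Hypothetical reconstruction: floor() and ceil() of a float input,
    // each cast to int32, inside a vectorized loop.
    ImageParam in(Float(32), 1);
    Var x("x");
    Func f("f");
    f(x) = cast<int32_t>(floor(in(x))) + cast<int32_t>(ceil(in(x)));
    f.vectorize(x, 4);
    f.compile_to_assembly("f.s", {in}, Target("arm-64-linux"));
    return 0;
}
```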
On both main and the branch this generates the inner loop:
It rounds up and rounds down as a float, then converts to an int. It should instead be using a different rounding mode on the int conversions and just using fcvtms and fcvtps. If you remove the .vectorize call, then on main we get:
and on the branch we get:
It looks like the branch goes down the vector path even in the scalar case, and hits the issue we already have with vector floor/ceil. For the vector case, it looks like the LLVM IR is:
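The IR snippet didn't survive here; roughly, a floor-then-convert sequence (value names illustrative) looks like:

```llvm
%f = call <4 x float> @llvm.floor.v4f32(<4 x float> %x)
%i = fptosi <4 x float> %f to <4 x i32>
```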
but we should instead be calling the intrinsic llvm.aarch64.neon.fcvtms.v4i32.v4f32. I'll open a PR onto this branch.
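In IR terms, a hedged one-line sketch of the desired form (value names again illustrative) is a single converting call:

```llvm
%i = call <4 x i32> @llvm.aarch64.neon.fcvtms.v4i32.v4f32(<4 x float> %x)
```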
Oops, looks like I broke something. Will fix.
Can't seem to repro locally, and in the LLVM commit log there have been reverts of AArch64 stuff. I'll just do a merge with main and see what happens.
This revealed a bug. FindIntrinsics was not enabled for scalars anyway, so it was semi-pointless.
Heavily based on Steve Suzuki's work in #6781. Hopefully easier to merge, with less effect on existing ARM support and fewer constraints on CodeGen_LLVM.