Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arm_neon.h intrinsics should be target-gated, not preprocessor-gated #56480

Closed
davidben opened this issue Jul 11, 2022 · 14 comments
Closed

arm_neon.h intrinsics should be target-gated, not preprocessor-gated #56480

davidben opened this issue Jul 11, 2022 · 14 comments
Labels
clang:frontend Language frontend issues, e.g. anything involving "Sema"

Comments

@davidben
Copy link
Contributor

davidben commented Jul 11, 2022

Clang's intrinsics headers on Arm contain code like:

#if !defined(__ARM_NEON)
#error "NEON support not enabled"
#else

or:

#if __ARM_ARCH >= 8 && defined(__ARM_FEATURE_AES)
#ifdef __LITTLE_ENDIAN__
__ai uint8x16_t vaesdq_u8(uint8x16_t __p0, uint8x16_t __p1) {
...

(Generated by https://github.com/llvm/llvm-project/blob/main/clang/utils/TableGen/NeonEmitter.cpp.)

This means that one can only use Arm intrinsics in TUs that mark the feature as available for the entire intrinsic, e.g. via -march flags. In contrast, the x86 intrinsics are consistently defined, but tagged with __attribute__((__target__("whatever"))):
https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/avx2intrin.h#L18

The x86 variant is both easier to use, as it doesn't require messing with your project's build definition, and safer, since you can just write a target("avx2") function and then gate the unmarked -> marked transition on some suitable CPUID check. (Or perhaps even use multi-versioning, though I believe target attributes are usable even without that.)

In contrast, Clang's Arm story requires messing with build definitions and risks ODR violations. Suppose your NEON-enabled TU included some inline functions, where Clang happened to vectorize code and use NEON instructions. If that copy of the inline function won, the resulting binary would inadvertently require NEON, even if the overall target wasn't meant to require NEON.

GCC fixed their Arm intrinsics, back in 2015, to be target-gated instead. Arm intrinsics would be much more usable if Clang could get parity here.
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=ae5e29239e28818f807cf11775c95c4243d9a256;hp=b8c7c62b2dbbdf355adb56d8250e68222ae0febb

@davidben
Copy link
Contributor Author

Playing around with godbolt, it looks like making this work for NEON itself may be more involved than just fixing the header:

<source>:5:24: error: 'neon_vector_type' attribute is not supported on targets missing 'neon' or 'mve'; specify an appropriate -march= or -mcpu=
typedef __attribute__((neon_vector_type(4))) uint32_t uint32x4_t;

Not sure about other features, which don't depend on feature-gated types.

@jan-wassenberg
Copy link

+1, this is important for SVE and even NEON adoption.
Here's a Compiler Explorer example of this in action using GCC.

@nico
Copy link
Contributor

nico commented Jul 12, 2022

Seems like a cool feature.

I also ran into this while trying to write a faster SHA256 using intrinsics over at #56121.

Adding a random assortment of folks who touched arm_neon.td and arm_sve.td: @davemgreen @dcandler @sdesmalen-arm , wdyt?

@davidben linked to the GCC change already.

9fc7fb2 is the change that did the corresponding change for the intel intrinsics long ago.

@llvmbot
Copy link
Member

llvmbot commented Jul 12, 2022

@llvm/issue-subscribers-backend-aarch64

@kyrilltkachov
Copy link
Contributor

This looks somewhat related to the function multiversioning work from @ilinpv https://reviews.llvm.org/D127812 but perhaps is not blocked by it?

@davidben
Copy link
Contributor Author

Yeah, I think they're related but can be done independently. Doing this would make function multiversioning much more useful (otherwise function multiversioning is limited to compiler vectorization, as I understand), but function multiversioning is not necessary for this to be useful, since the application can always manage the dispatch itself.

@davemgreen
Copy link
Collaborator

This does sound like a good idea, to help the usability of the intrinsics. GCC uses the same target attributes technique on the functions, then relies on an error happening when inlining those functions when the target features do not match. This is the same method that X86 uses, and seems to work OK for AArch64 from these examples:
https://godbolt.org/z/odWjW43xf

(GCC actually uses a different method for arm_sve.h, where they just define a single pragma that tells the compiler that it should include definitions for all the needed sve acle intrinsics).

Changing the tablegen emitter to use those target attributes as opposed to ifdefs looks fairly straightforward. There are some issues however:

  • The Neon functions give errors during inlining, but the SVE definitions use __attribute__((__clang_arm_builtin_alias(__builtin_sve_svmmla_s32))), so do not see the same error until it hits a backend "failed to select" fatal error.
  • The types need to be defined. This means defining types like bfloat16_t without bfloat16 support, which currently gives an error.
  • GCC spells it target("+aes") (or target("arch=armv9-a")). LLVM spells it target("aes"). We would need to align the two.
  • I only looked at changing the SVE emitter. We would need to make sure AArch32 still works correctly, and we may of course run into other problems..

@jan-wassenberg
Copy link

@davemgreen thanks for looking into this.
As to the target("+aes") vs target("aes") difference, from my perspective it would be fine to use whichever is more convenient for the compiler. We anyway use different syntax between gcc and clang:
pragma clang attribute push(__attribute__((target(targets_str))), apply_to = function)) vs
GCC target targets_str.

@davemgreen
Copy link
Collaborator

OK that's good. Thanks for clarifying. It is something that seems worth fixing in general, as the difference between compilers is likely to trip some people up.

@jan-wassenberg
Copy link

Sure, all other things being equal, more compatibility across compilers is nice :)

davemgreen added a commit that referenced this issue Jan 4, 2023
This patch makes SVE intrinsics more useable by gating them on the
target, not by ifdef preprocessor macros. See #56480. This alters the
SVEEmitter for arm_sve.h to remove the #ifdef guards and instead use
TARGET_BUILTIN with the correct features so that the existing "'func'
needs target feature sve" error will be generated when sve is not
present.

The ArchGuard containing defines in the SVEEmitter are changed to
TargetGuard containing target features. In the arm_neon.h emitter there
are both existing ArchGuard ifdefs mixed with new TargetGuard target
feature guards, so the name is change in the SVE too for consistency.
The few functions that are present in arm_sve.h (as opposed to builtin
aliases) have __attribute__((target("sve"))) added. Some of the tests
needed to be rejigged a little, as well as updating the error message,
as the error now happens at a later point.

Differential Revision: https://reviews.llvm.org/D131064
@davemgreen
Copy link
Collaborator

We have been making some changes recently, such as these:
30b67c6 [AArch64] Make ACLE intrinsics always available part1
09aaf19 [AArch64] Make ACLE intrinsics always available part MTE
b879f99 [AArch64][ARM] Alter most of arm_neon.h to be target-based, not preprocessor based.
af1bb28 [AArch64][ARM] Alter v8.3a complex neon intrinsics to be target-based, not preprocessor based
9c48b7f [AArch64][ARM] Alter v8.1a neon intrinsics to be target-based, not preprocessor based
6f1e430 [AArch64] Alter v8.5a FRINT neon intrinsics to be target-based, not preprocessor based
e7deca5 [AArch64] Alter arm_fp16.h to be target-based, not preprocessor based.
6cac7c2 [AArch64] Alter arm_sve.h to be target-based, not preprocessor based.
9978529 [AArch64] Alter arm_neon_sve_bridge.h to be target-based, not preprocessor based.

There are a couple of left-over bits and pieces, but I believe this should now be working.

Hacking HWY_HAVE_RUNTIME_DISPATCH to 1 in highway seems to work, but that is a very limited test. It would be good if you could give it a go and see if it meets your needs.

@jan-wassenberg
Copy link

That's fantastic news, thanks for the heads-up!
Yes, I believe changing HWY_HAVE_RUNTIME_DISPATCH to 1 would be sufficient. If the tests outputs include [..]/NEON and [..]/SVE in the same binary, without setting -march to +sve, then we've succeeded.

I'll be happy to double-check this when back in early Feb, presumably this will require building clang from source.

Should we then update the Highway code to expect this capability in Clang 16?

@Lukacma
Copy link
Contributor

Lukacma commented Sep 23, 2024

This issue has been fixed in #95224.

@Lukacma Lukacma closed this as completed Sep 23, 2024
@EugeneZelenko EugeneZelenko added clang:frontend Language frontend issues, e.g. anything involving "Sema" and removed backend:AArch64 clang:headers Headers provided by Clang, e.g. for intrinsics labels Sep 23, 2024
@llvmbot
Copy link
Member

llvmbot commented Sep 23, 2024

@llvm/issue-subscribers-clang-frontend

Author: David Benjamin (davidben)

Clang's intrinsics headers on Arm contain code like:
#if !defined(__ARM_NEON)
#error "NEON support not enabled"
#else

or:

#if __ARM_ARCH &gt;= 8 &amp;&amp; defined(__ARM_FEATURE_AES)
#ifdef __LITTLE_ENDIAN__
__ai uint8x16_t vaesdq_u8(uint8x16_t __p0, uint8x16_t __p1) {
...

(Generated by https://github.com/llvm/llvm-project/blob/main/clang/utils/TableGen/NeonEmitter.cpp.)

This means that one can only use Arm intrinsics in TUs that mark the feature as available for the entire intrinsic, e.g. via -march flags. In contrast, the x86 intrinsics are consistently defined, but tagged with __attribute__((__target__("whatever"))):
https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/avx2intrin.h#L18

The x86 variant is both easier to use, as it doesn't require messing with your project's build definition, and safer, since you can just write a target("avx2") function and then gate the unmarked -> marked transition on some suitable CPUID check. (Or perhaps even use multi-versioning, though I believe target attributes are usable even without that.)

In contrast, Clang's Arm story requires messing with build definitions and risks ODR violations. Suppose your NEON-enabled TU included some inline functions, where Clang happened to vectorize code and use NEON instructions. If that copy of the inline function won, the resulting binary would inadvertently require NEON, even if the overall target wasn't meant to require NEON.

GCC fixed their Arm intrinsics, back in 2015, to be target-gated instead. Arm intrinsics would be much more usable if Clang could get parity here.
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=ae5e29239e28818f807cf11775c95c4243d9a256;hp=b8c7c62b2dbbdf355adb56d8250e68222ae0febb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:frontend Language frontend issues, e.g. anything involving "Sema"
Projects
None yet
Development

No branches or pull requests

9 participants