arm_neon.h intrinsics should be target-gated, not preprocessor-gated #56480

davidben · 2022-07-11T19:44:42Z

Clang's intrinsics headers on Arm contain code like:

#if !defined(__ARM_NEON)
#error "NEON support not enabled"
#else

or:

#if __ARM_ARCH >= 8 && defined(__ARM_FEATURE_AES)
#ifdef __LITTLE_ENDIAN__
__ai uint8x16_t vaesdq_u8(uint8x16_t __p0, uint8x16_t __p1) {
...

(Generated by https://github.com/llvm/llvm-project/blob/main/clang/utils/TableGen/NeonEmitter.cpp.)

This means that one can only use Arm intrinsics in TUs that mark the feature as available for the entire intrinsic, e.g. via -march flags. In contrast, the x86 intrinsics are consistently defined, but tagged with __attribute__((__target__("whatever"))):
https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/avx2intrin.h#L18

The x86 variant is both easier to use, as it doesn't require messing with your project's build definition, and safer, since you can just write a target("avx2") function and then gate the unmarked -> marked transition on some suitable CPUID check. (Or perhaps even use multi-versioning, though I believe target attributes are usable even without that.)

In contrast, Clang's Arm story requires messing with build definitions and risks ODR violations. Suppose your NEON-enabled TU included some inline functions, where Clang happened to vectorize code and use NEON instructions. If that copy of the inline function won, the resulting binary would inadvertently require NEON, even if the overall target wasn't meant to require NEON.

GCC fixed their Arm intrinsics, back in 2015, to be target-gated instead. Arm intrinsics would be much more usable if Clang could get parity here.
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=ae5e29239e28818f807cf11775c95c4243d9a256;hp=b8c7c62b2dbbdf355adb56d8250e68222ae0febb

The text was updated successfully, but these errors were encountered:

davidben · 2022-07-11T19:55:16Z

Playing around with godbolt, it looks like making this work for NEON itself may be more involved than just fixing the header:

<source>:5:24: error: 'neon_vector_type' attribute is not supported on targets missing 'neon' or 'mve'; specify an appropriate -march= or -mcpu=
typedef __attribute__((neon_vector_type(4))) uint32_t uint32x4_t;

Not sure about other features, which don't depend on feature-gated types.

jan-wassenberg · 2022-07-12T11:30:55Z

+1, this is important for SVE and even NEON adoption.
Here's a Compiler Explorer example of this in action using GCC.

nico · 2022-07-12T12:58:13Z

Seems like a cool feature.

I also ran into this while trying to write a faster SHA256 using intrinsics over at #56121.

Adding a random assortment of folks who touched arm_neon.td and arm_sve.td: @davemgreen @dcandler @sdesmalen-arm , wdyt?

@davidben linked to the GCC change already.

9fc7fb2 is the change that did the corresponding change for the intel intrinsics long ago.

llvmbot · 2022-07-12T13:36:00Z

@llvm/issue-subscribers-backend-aarch64

kyrilltkachov · 2022-07-13T16:35:55Z

This looks somewhat related to the function multiversioning work from @ilinpv https://reviews.llvm.org/D127812 but perhaps is not blocked by it?

davidben · 2022-07-13T16:50:30Z

Yeah, I think they're related but can be done independently. Doing this would make function multiversioning much more useful (otherwise function multiversioning is limited to compiler vectorization, as I understand), but function multiversioning is not necessary for this to be useful, since the application can always manage the dispatch itself.

davemgreen · 2022-07-20T13:14:02Z

This does sound like a good idea, to help the usability of the intrinsics. GCC uses the same target attributes technique on the functions, then relies on an error happening when inlining those functions when the target features do not match. This is the same method that X86 uses, and seems to work OK for AArch64 from these examples:
https://godbolt.org/z/odWjW43xf

(GCC actually uses a different method for arm_sve.h, where they just define a single pragma that tells the compiler that it should include definitions for all the needed sve acle intrinsics).

Changing the tablegen emitter to use those target attributes as opposed to ifdefs looks fairly straightforward. There are some issues however:

The Neon functions give errors during inlining, but the SVE definitions use __attribute__((__clang_arm_builtin_alias(__builtin_sve_svmmla_s32))), so do not see the same error until it hits a backend "failed to select" fatal error.
The types need to be defined. This means defining types like bfloat16_t without bfloat16 support, which currently gives an error.
GCC spells it target("+aes") (or target("arch=armv9-a")). LLVM spells it target("aes"). We would need to align the two.
I only looked at changing the SVE emitter. We would need to make sure AArch32 still works correctly, and we may of course run into other problems..

jan-wassenberg · 2022-07-20T13:29:28Z

@davemgreen thanks for looking into this.
As to the target("+aes") vs target("aes") difference, from my perspective it would be fine to use whichever is more convenient for the compiler. We anyway use different syntax between gcc and clang:
pragma clang attribute push(__attribute__((target(targets_str))), apply_to = function)) vs
GCC target targets_str.

davemgreen · 2022-07-21T07:11:27Z

OK that's good. Thanks for clarifying. It is something that seems worth fixing in general, as the difference between compilers is likely to trip some people up.

jan-wassenberg · 2022-07-21T07:22:48Z

Sure, all other things being equal, more compatibility across compilers is nice :)

This patch makes SVE intrinsics more useable by gating them on the target, not by ifdef preprocessor macros. See #56480. This alters the SVEEmitter for arm_sve.h to remove the #ifdef guards and instead use TARGET_BUILTIN with the correct features so that the existing "'func' needs target feature sve" error will be generated when sve is not present. The ArchGuard containing defines in the SVEEmitter are changed to TargetGuard containing target features. In the arm_neon.h emitter there are both existing ArchGuard ifdefs mixed with new TargetGuard target feature guards, so the name is change in the SVE too for consistency. The few functions that are present in arm_sve.h (as opposed to builtin aliases) have __attribute__((target("sve"))) added. Some of the tests needed to be rejigged a little, as well as updating the error message, as the error now happens at a later point. Differential Revision: https://reviews.llvm.org/D131064

davemgreen · 2023-01-18T19:23:07Z

We have been making some changes recently, such as these:
30b67c6 [AArch64] Make ACLE intrinsics always available part1
09aaf19 [AArch64] Make ACLE intrinsics always available part MTE
b879f99 [AArch64][ARM] Alter most of arm_neon.h to be target-based, not preprocessor based.
af1bb28 [AArch64][ARM] Alter v8.3a complex neon intrinsics to be target-based, not preprocessor based
9c48b7f [AArch64][ARM] Alter v8.1a neon intrinsics to be target-based, not preprocessor based
6f1e430 [AArch64] Alter v8.5a FRINT neon intrinsics to be target-based, not preprocessor based
e7deca5 [AArch64] Alter arm_fp16.h to be target-based, not preprocessor based.
6cac7c2 [AArch64] Alter arm_sve.h to be target-based, not preprocessor based.
9978529 [AArch64] Alter arm_neon_sve_bridge.h to be target-based, not preprocessor based.

There are a couple of left-over bits and pieces, but I believe this should now be working.

Hacking HWY_HAVE_RUNTIME_DISPATCH to 1 in highway seems to work, but that is a very limited test. It would be good if you could give it a go and see if it meets your needs.

jan-wassenberg · 2023-01-19T09:40:59Z

That's fantastic news, thanks for the heads-up!
Yes, I believe changing HWY_HAVE_RUNTIME_DISPATCH to 1 would be sufficient. If the tests outputs include [..]/NEON and [..]/SVE in the same binary, without setting -march to +sve, then we've succeeded.

I'll be happy to double-check this when back in early Feb, presumably this will require building clang from source.

Should we then update the Highway code to expect this capability in Clang 16?

Lukacma · 2024-09-23T09:27:26Z

This issue has been fixed in #95224.

llvmbot · 2024-09-23T13:54:57Z

@llvm/issue-subscribers-clang-frontend

Author: David Benjamin (davidben)

Clang's intrinsics headers on Arm contain code like:

#if !defined(__ARM_NEON)
#error "NEON support not enabled"
#else

or:

#if __ARM_ARCH &gt;= 8 &amp;&amp; defined(__ARM_FEATURE_AES)
#ifdef __LITTLE_ENDIAN__
__ai uint8x16_t vaesdq_u8(uint8x16_t __p0, uint8x16_t __p1) {
...

(Generated by https://github.com/llvm/llvm-project/blob/main/clang/utils/TableGen/NeonEmitter.cpp.)

This means that one can only use Arm intrinsics in TUs that mark the feature as available for the entire intrinsic, e.g. via -march flags. In contrast, the x86 intrinsics are consistently defined, but tagged with __attribute__((__target__("whatever"))):
https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/avx2intrin.h#L18

The x86 variant is both easier to use, as it doesn't require messing with your project's build definition, and safer, since you can just write a target("avx2") function and then gate the unmarked -> marked transition on some suitable CPUID check. (Or perhaps even use multi-versioning, though I believe target attributes are usable even without that.)

In contrast, Clang's Arm story requires messing with build definitions and risks ODR violations. Suppose your NEON-enabled TU included some inline functions, where Clang happened to vectorize code and use NEON instructions. If that copy of the inline function won, the resulting binary would inadvertently require NEON, even if the overall target wasn't meant to require NEON.

GCC fixed their Arm intrinsics, back in 2015, to be target-gated instead. Arm intrinsics would be much more usable if Clang could get parity here.
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=ae5e29239e28818f807cf11775c95c4243d9a256;hp=b8c7c62b2dbbdf355adb56d8250e68222ae0febb

github-actions bot added the new issue label Jul 11, 2022

davemgreen added the backend:AArch64 label Jul 12, 2022

EugeneZelenko removed the new issue label Jul 12, 2022

jan-wassenberg mentioned this issue Jul 12, 2022

compile regression for aarch64 in libjxl caused by recent changes in d8867c9 google/highway#836

Closed

jan-wassenberg mentioned this issue Jul 18, 2022

Implement runtime dispatch on riscv64 google/highway#838

Closed

malaterre mentioned this issue Jul 18, 2022

riscv_vector.h intrinsics should be target-gated, not preprocessor-gated #56592

Open

RKSimon added the clang:headers Headers provided by Clang, e.g. for intrinsics label Jan 19, 2023

malaterre mentioned this issue Mar 29, 2023

HWY_CMAKE_ARM7:BOOL=ON vs Dynamic Dispatch google/highway#1271

Closed

randombit mentioned this issue Apr 10, 2023

'++crypto' is not a recognized feature for this target (ignoring feature) randombit/botan#3487

Open

lephilousophe mentioned this issue Jan 21, 2024

GRAPHICS: Use target pragmas instead of compiler flags to prevent ODR problems scummvm/scummvm#5581

Merged

Lukacma closed this as completed Sep 23, 2024

EugeneZelenko added clang:frontend Language frontend issues, e.g. anything involving "Sema" and removed backend:AArch64 clang:headers Headers provided by Clang, e.g. for intrinsics labels Sep 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

arm_neon.h intrinsics should be target-gated, not preprocessor-gated #56480

arm_neon.h intrinsics should be target-gated, not preprocessor-gated #56480

davidben commented Jul 11, 2022 •

edited by VoltrexKeyva

Loading

davidben commented Jul 11, 2022

jan-wassenberg commented Jul 12, 2022

nico commented Jul 12, 2022

llvmbot commented Jul 12, 2022

kyrilltkachov commented Jul 13, 2022

davidben commented Jul 13, 2022

davemgreen commented Jul 20, 2022

jan-wassenberg commented Jul 20, 2022

davemgreen commented Jul 21, 2022

jan-wassenberg commented Jul 21, 2022

davemgreen commented Jan 18, 2023

jan-wassenberg commented Jan 19, 2023

Lukacma commented Sep 23, 2024

llvmbot commented Sep 23, 2024

arm_neon.h intrinsics should be target-gated, not preprocessor-gated #56480

arm_neon.h intrinsics should be target-gated, not preprocessor-gated #56480

Comments

davidben commented Jul 11, 2022 • edited by VoltrexKeyva Loading

davidben commented Jul 11, 2022

jan-wassenberg commented Jul 12, 2022

nico commented Jul 12, 2022

llvmbot commented Jul 12, 2022

kyrilltkachov commented Jul 13, 2022

davidben commented Jul 13, 2022

davemgreen commented Jul 20, 2022

jan-wassenberg commented Jul 20, 2022

davemgreen commented Jul 21, 2022

jan-wassenberg commented Jul 21, 2022

davemgreen commented Jan 18, 2023

jan-wassenberg commented Jan 19, 2023

Lukacma commented Sep 23, 2024

llvmbot commented Sep 23, 2024

davidben commented Jul 11, 2022 •

edited by VoltrexKeyva

Loading