arm_neon.h intrinsics should be target-gated, not preprocessor-gated #56480
Comments
Playing around with godbolt, it looks like making this work for NEON itself may be more involved than just fixing the header:
Not sure about other features, which don't depend on feature-gated types. |
+1, this is important for SVE and even NEON adoption. |
Seems like a cool feature. I also ran into this while trying to write a faster SHA256 using intrinsics over at #56121. Adding a random assortment of folks who touched arm_neon.td and arm_sve.td: @davemgreen @dcandler @sdesmalen-arm, wdyt? @davidben already linked to the GCC change. 9fc7fb2 is the commit that made the corresponding change for the Intel intrinsics long ago. |
@llvm/issue-subscribers-backend-aarch64 |
This looks somewhat related to the function multiversioning work from @ilinpv https://reviews.llvm.org/D127812 but perhaps is not blocked by it? |
Yeah, I think they're related but can be done independently. Doing this would make function multiversioning much more useful (otherwise function multiversioning is limited to compiler vectorization, as I understand), but function multiversioning is not necessary for this to be useful, since the application can always manage the dispatch itself. |
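To make "the application can always manage the dispatch itself" concrete, here is a minimal sketch in C, assuming Linux on AArch64 and a Clang whose `arm_acle.h` is target-gated. The `__clang_major__ >= 16` cutoff, the `"crc"` feature spelling, and all function names (`crc32c`, `crc32c_hw`, `crc32c_portable`, `pick_crc32c`) are assumptions for illustration, not something confirmed in this thread. On any other platform only the portable path is compiled.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable bitwise CRC-32C (Castagnoli), used when no hardware path exists. */
static uint32_t crc32c_portable(const uint8_t *p, size_t n) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Assumption: target-gated headers are available from this Clang version on. */
#if defined(__clang__) && (__clang_major__ >= 16) && \
    defined(__aarch64__) && defined(__linux__)
#include <arm_acle.h>
#include <sys/auxv.h>
#ifndef HWCAP_CRC32
#define HWCAP_CRC32 (1UL << 7) /* AArch64 Linux hwcap bit for CRC32 */
#endif
#define HAVE_CRC_HW_PATH 1

/* Only this function is compiled with the CRC extension enabled. */
__attribute__((target("crc")))
static uint32_t crc32c_hw(const uint8_t *p, size_t n) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++)
        crc = __crc32cb(crc, p[i]);
    return ~crc;
}
#endif

typedef uint32_t (*crc32c_fn)(const uint8_t *, size_t);

/* Runtime check that gates the unmarked -> marked call. */
static crc32c_fn pick_crc32c(void) {
#ifdef HAVE_CRC_HW_PATH
    if (getauxval(AT_HWCAP) & HWCAP_CRC32)
        return crc32c_hw;
#endif
    return crc32c_portable;
}

/* Public entry point: resolves the implementation on first use.
   Not thread-safe as written; a real program would resolve it at startup. */
uint32_t crc32c(const uint8_t *p, size_t n) {
    static crc32c_fn impl;
    if (!impl)
        impl = pick_crc32c();
    return impl(p, n);
}
```

The rest of the program calls only `crc32c`, so no build flags change and no other TU is compiled with the extra feature enabled.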
This does sound like a good idea, to help the usability of the intrinsics. GCC uses the same target attributes technique on the functions, then relies on an error happening when inlining those functions when the target features do not match. This is the same method that X86 uses, and seems to work OK for AArch64 from these examples: (GCC actually uses a different method for arm_sve.h, where they just define a single pragma that tells the compiler that it should include definitions for all the needed sve acle intrinsics). Changing the tablegen emitter to use those target attributes as opposed to ifdefs looks fairly straightforward. There are some issues however:
@davemgreen thanks for looking into this. |
OK that's good. Thanks for clarifying. It is something that seems worth fixing in general, as the difference between compilers is likely to trip some people up. |
Sure, all other things being equal, more compatibility across compilers is nice :) |
This patch makes SVE intrinsics more usable by gating them on the target, not by ifdef preprocessor macros. See #56480.
This alters the SVEEmitter for arm_sve.h to remove the #ifdef guards and instead use TARGET_BUILTIN with the correct features, so that the existing "'func' needs target feature sve" error will be generated when sve is not present. The ArchGuard containing defines in the SVEEmitter is changed to a TargetGuard containing target features. In the arm_neon.h emitter there are both existing ArchGuard ifdefs and new TargetGuard target feature guards, so the name is changed in the SVE emitter too for consistency. The few functions that are present in arm_sve.h (as opposed to builtin aliases) have __attribute__((target("sve"))) added.
Some of the tests needed to be rejigged a little, and the error message updated, as the error now happens at a later point.
Differential Revision: https://reviews.llvm.org/D131064
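As a rough illustration of what the patch enables, the sketch below calls SVE intrinsics from a single `target("sve")` function in a TU built without `-march=...+sve`. It assumes Clang on AArch64 Linux with the patch applied. The `__clang_major__ >= 16` guard is an assumption, and the names `add_f32` and `add_f32_sve` are made up for the example. GCC spells the attribute string differently (`"+sve"`), so the SVE path here is limited to Clang.

```c
#include <stddef.h>

/* Assumption: the target-gated arm_sve.h is available from this version on. */
#if defined(__clang__) && (__clang_major__ >= 16) && \
    defined(__aarch64__) && defined(__linux__)
#include <arm_sve.h>
#include <sys/auxv.h>
#ifndef HWCAP_SVE
#define HWCAP_SVE (1UL << 22) /* AArch64 Linux hwcap bit for SVE */
#endif
#define HAVE_SVE_PATH 1

/* SVE is enabled for this function only, not for the whole TU. */
__attribute__((target("sve")))
static void add_f32_sve(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += svcntw()) {
        /* Predicate covers the lanes that are still in range. */
        svbool_t pg = svwhilelt_b32_u64(i, n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
    }
}
#endif

/* Unmarked caller: checks for SVE at runtime before entering the SVE code. */
void add_f32(float *dst, const float *a, const float *b, size_t n) {
#ifdef HAVE_SVE_PATH
    if (getauxval(AT_HWCAP) & HWCAP_SVE) {
        add_f32_sve(dst, a, b, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Calling an SVE intrinsic directly from the unmarked `add_f32` would be expected to produce the "needs target feature sve" error quoted in the patch description, which is the safety net the target-gated approach provides.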
We have been making some changes recently, such as these: There are a couple of left-over bits and pieces, but I believe this should now be working. Hacking HWY_HAVE_RUNTIME_DISPATCH to 1 in highway seems to work, but that is a very limited test. It would be good if you could give it a go and see if it meets your needs. |
That's fantastic news, thanks for the heads-up! I'll be happy to double-check this when I'm back in early Feb; presumably this will require building clang from source. Should we then update the Highway code to expect this capability in Clang 16? |
This issue has been fixed in #95224. |
@llvm/issue-subscribers-clang-frontend Author: David Benjamin (davidben)
Clang's intrinsics headers on Arm contain code like:
    #if !defined(__ARM_NEON)
    #error "NEON support not enabled"
    #else
or:
    #if __ARM_ARCH >= 8 && defined(__ARM_FEATURE_AES)
    #ifdef __LITTLE_ENDIAN__
    __ai uint8x16_t vaesdq_u8(uint8x16_t __p0, uint8x16_t __p1) {
    ...
(Generated by https://github.com/llvm/llvm-project/blob/main/clang/utils/TableGen/NeonEmitter.cpp.)
This means that one can only use Arm intrinsics in TUs that mark the feature as available for the entire TU, e.g. via -march flags. In contrast, the x86 intrinsics are consistently defined, but tagged with __attribute__((__target__("whatever"))):
https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/avx2intrin.h#L18
The x86 variant is both easier to use, as it doesn't require messing with your project's build definition, and safer, since you can just write a target("avx2") function and then gate the unmarked -> marked transition on some suitable CPUID check. (Or perhaps even use multi-versioning, though I believe target attributes are usable even without that.)
In contrast, Clang's Arm story requires messing with build definitions and risks ODR violations. Suppose your NEON-enabled TU included some inline functions, where Clang happened to vectorize code and use NEON instructions. If that copy of the inline function won, the resulting binary would inadvertently require NEON, even if the overall target wasn't meant to require NEON.
GCC fixed their Arm intrinsics, back in 2015, to be target-gated instead. Arm intrinsics would be much more usable if Clang could get parity here.
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=ae5e29239e28818f807cf11775c95c4243d9a256;hp=b8c7c62b2dbbdf355adb56d8250e68222ae0febb
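For readers unfamiliar with the x86 pattern described above, here is a minimal sketch of it: one `target("avx2")` function, with the unmarked -> marked transition gated on a CPUID-based check (`__builtin_cpu_supports`, available in GCC and Clang on x86). The function names `add_u32`, `add_u32_avx2` and `add_u32_portable` are hypothetical. On non-x86 targets only the portable path is compiled.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline implementation, built with the TU's default target features. */
static void add_u32_portable(uint32_t *dst, const uint32_t *a,
                             const uint32_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX2_PATH 1

/* AVX2 is enabled for this function only; no -mavx2 on the command line. */
__attribute__((target("avx2")))
static void add_u32_avx2(uint32_t *dst, const uint32_t *a,
                         const uint32_t *b, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(va, vb));
    }
    for (; i < n; i++) /* scalar tail */
        dst[i] = a[i] + b[i];
}
#endif

/* Unmarked entry point: the CPUID check guards the call into AVX2 code. */
void add_u32(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n) {
#ifdef HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2")) {
        add_u32_avx2(dst, a, b, n);
        return;
    }
#endif
    add_u32_portable(dst, a, b, n);
}
```

Because the AVX2 code lives in a `static` function with its own target attribute, any inline functions shared with other TUs are still compiled for the baseline target, which avoids the ODR hazard described above.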