Add CPUID submodule #37479

giordano · 2020-09-08T22:50:38Z

This is a small internal module which defines the API to query the Instruction
Set Architecture (ISA) of the current CPU. This can be used for example to
select an artifact in a JLL package which is compatible with the current CPU.

This is related to #37320, and might change to accommodate some needs of that PR.

Fix #36367. CC: @staticfloat.

yuyichao · 2020-09-08T23:06:45Z

As I said in #36367 this should not be a public API since using this to generate code is wrong. A strictly private API would be fine and should just use the existing C function instead.

yuyichao · 2020-09-08T23:10:06Z

Also, this is capturing the build time environment which is wrong. Note that the selection of external binary also must be done at runtime and not package installation time or precompilation time. This is the problem shared by a lot of packages that must be fixed before they can be accepted to base/stdlib.

yuyichao

Fix detection time, do not export julia API.

staticfloat · 2020-09-08T23:24:27Z

As I said in #36367 this should not be a public API since using this to generate code is wrong.

Sorry, I read through again and it's just not connecting for me. How would this be used for code generation?

yuyichao · 2020-09-08T23:35:14Z

For example, LoopVectorization.jl must not (and as I understand it currently is) using anything like this to determine what code to generate.
The target used for julia code and the actual hardware target are usually the same. However, whenever they disagree the hardware one is almost always the wrong one to use.

And again, having some dirty hack that should work for library loading most of the time is fine, as long as it's clear that no one outside of base should use it and the API may be deleted at any time without notice.

giordano · 2020-09-08T23:53:48Z

Fair point about detecting the features at runtime. Regarding the API, my understanding is that something is considered "public" if its documentation/docstring is exposed in the manual, which isn't the case in this pull request and wasn't my intention.

Regarding the use of jl_test_cpu_feature, should we manually copy the numeric values of all features in features_*.h to this module and manually have a mapping between the value and the feature it represent, keeping this always in sync with the features defined on the C side?

yuyichao · 2020-09-08T23:56:59Z

Regarding the API, my understanding is that something is considered "public" if its documentation/docstring is exposed in the manual, which isn't the case in this pull request and wasn't my intention.

For now this can just go in the module that does the loading. That'll lower the chance of someone using it.

Regarding the use of jl_test_cpu_feature, should we manually copy the numeric values of all features in features_*.h to this module and manually have a mapping between the value and the feature it represent, keeping this always in sync with the features defined on the C side?

No.

you can define them by generating a jl file from the features_*.h headers.

You can just run a julia script through C preprocessor. We do that for quite a few other headers.

giordano · 2020-09-09T00:11:47Z

We do that for quite a few other headers.

Any example to look at?

yuyichao · 2020-09-09T00:23:45Z

julia/base/Makefile

Lines 24 to 34 in 4a412e3

    
           $(BUILDDIR)/pcre_h.jl: $(PCRE_INCL_PATH) 
        
           	@$(call PRINT_PERL, $(CPP) -D PCRE2_CODE_UNIT_WIDTH=8 -dM $< | perl -nle '/^\s*#define\s+PCRE2_(\w*)\s*\(?($(PCRE_CONST))\)?u?\s*$$/ and print index($$1, "ERROR_") == 0 ? "const $$1 = Cint($$2)" : "const $$1 = UInt32($$2)"' | LC_ALL=C sort > $@) 
        
           $(BUILDDIR)/errno_h.jl: 
        
           	@$(call PRINT_PERL, echo '#include <errno.h>' | $(CPP) -dM - | perl -nle 'print "const $$1 = Int32($$2)" if /^#define\s+(E\w+)\s+(\d+)\s*$$/' | LC_ALL=C sort > $@) 
        
           $(BUILDDIR)/file_constants.jl: $(SRCDIR)/../src/file_constants.h 
        
           	@$(call PRINT_PERL, $(CPP_STDOUT) -DJULIA $< | perl -nle 'print "$$1 0o$$2" if /^(\s*const\s+[A-z_]+\s+=)\s+(0[0-9]*)\s*$$/; print "$$1" if /^\s*(const\s+[A-z_]+\s+=\s+([1-9]|0x)[0-9A-z]*)\s*$$/' > $@) 
        
           $(BUILDDIR)/uv_constants.jl: $(SRCDIR)/../src/uv_constants.h $(build_includedir)/uv/errno.h 
        
           	@$(call PRINT_PERL, $(CPP_STDOUT) "-I$(LIBUV_INC)" -DJULIA $< | tail -n 16 > $@)

codecov-commenter · 2020-09-09T08:36:27Z

Codecov Report

Merging #37479 into master will increase coverage by 0.03%.
The diff coverage is 88.73%.

@@            Coverage Diff             @@
##           master   #37479      +/-   ##
==========================================
+ Coverage   87.54%   87.57%   +0.03%     
==========================================
  Files         351      351              
  Lines       71009    71009              
==========================================
+ Hits        62166    62188      +22     
+ Misses       8843     8821      -22

Impacted Files	Coverage Δ
stdlib/Printf/src/Printf.jl	`70.99% <ø> (+1.10%)`	⬆️
base/compiler/ssair/passes.jl	`88.74% <50.00%> (-0.26%)`	⬇️
base/compiler/ssair/ir.jl	`83.47% <80.00%> (+0.08%)`	⬆️
base/compiler/tfuncs.jl	`86.54% <84.00%> (-0.11%)`	⬇️
stdlib/REPL/src/REPL.jl	`81.37% <94.44%> (-0.19%)`	⬇️
base/compiler/abstractinterpretation.jl	`85.86% <100.00%> (ø)`
base/compiler/validation.jl	`88.88% <100.00%> (+0.10%)`	⬆️
base/ryu/shortest.jl	`96.69% <100.00%> (+1.65%)`	⬆️
base/strings/substring.jl	`93.44% <100.00%> (+0.05%)`	⬆️
base/process.jl	`89.59% <0.00%> (-1.01%)`	⬇️
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f13e11c...4a9ed1e. Read the comment docs.

giordano · 2020-09-09T19:03:12Z

Ok, I exported jl_test_cpu_feature and a Perl script kindly provided by Elliot generates a Julia file with the list of the numerical features in Julia. Now there is a function to get the list of features.

This module is not going to stay in Sys but I'll move it later and I marked this PR as draft, also because some details of the API need to be ironed out, depending on exactly how we need to use the microarchitectures in the Artifacts module.

base/sysinfo.jl

giordano · 2020-09-12T22:35:37Z

I've moved the module under the new BinaryPlatforms module, so I believe I've now addressed all original comments

staticfloat

I think the largest piece missing from this is any kind of mapping for aarch64 and armv7l. Once we have those pieces, this looks exactly like what I'm looking for.

base/binaryplatforms.jl

giordano · 2020-09-15T23:03:36Z

I addressed the last two comments

giordano · 2020-09-16T00:27:54Z

On the aarch64 build machine I get the following:

julia> using Base.BinaryPlatforms.CPUID

julia> for f in filter(n -> startswith(String(n), "JL_AArch64"), names(CPUID; all=true))
           CPUID.test_cpu_feature(getfield(CPUID, f)) && println(f)
       end
JL_AArch64_aes
JL_AArch64_crc
JL_AArch64_sha2

@yuyichao any clue why only these features are set? In particular, the machine is armv8.1-a, shouldn't JL_AArch64_v8_1a be set as well?

julia> CPUID.test_cpu_feature(CPUID.JL_AArch64_v8_1a)
false

yuyichao · 2020-09-16T04:24:23Z

ISA version is impossible to detect (it's not a thing, basically). The only way to set it is through guessing. What is the CPU? Looking at the list, the only v8.1a I've seen is thunderx2....

KristofferC · 2020-09-16T05:13:47Z

base/binaryplatforms.jl

+# https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html.  Keep in sync with
+# `arch_march_isa_mapping`.
+const ISAs_by_family = Dict(
+    "x86_64" => (


Does it make sense to make these enums so you know from the type what possible the possible values are?

I like Dict's better so that we can add to them via Pkg hooks in the future.

I didn't mean to change the Dict, just the entries, "core2", "nehalem" etc. So you know that the keys are arches, and not say, the life work of Shakespeare, and you can reflect on the instances of the enum.

If we change the keys like "core2" from a string to an enum type, how do we add a new value to that enum type at runtime? What I'm considering is that Julia 1.6 is around for 3+ years, and in that time we want to add new ISAs, which we will be able to do from within a package by running something like Base.BinaryPlatforms.ISAs_by_family["x86_64"]["cannonlake"] = ....

We can already reflect on keys(ISAs_by_family["x86_64"]), what does an enum type give us?

If we want to be able to interactively add new ISA's at runtime from packages then it clearly can't be an enum. I didn't know that was desirable, especially when it seems like there are multiple data structures here that need to be kept in sync and that the features are generated from the C-code which made it look like the data here is the one source of truth. But if that's not the case, then yeah, can't use an enum.

which we will be able to do from within a package by running something like Base.BinaryPlatforms.ISAs_by_family["x86_64"]["cannonlake"] = ....

Note that nothing from this PR will be a stable API.

I changed how arch_march_isa_mapping is defined below, which should make it easier to keep that dictionary in sync with this one.

giordano · 2020-09-16T12:27:49Z

What is the CPU?

The features shown in /proc/cpuinfo are

fp asimd evtstrm aes pmull sha1 sha2 crc32

I'm a bit confused by the comments in

julia/src/features_aarch64.h

Lines 14 to 15 in 8bdf569

    
           // JL_FEATURE_DEF(aes, 3, 0) // HWCAP_AES. Implied by `aes` 
        
           JL_FEATURE_DEF(aes, 4, 0) // HWCAP_PMULL, ID_AA64ISAR0_EL1.AES == 2

What does it mean that aes is implied by aes? The feature at line 15 is called aes (like the one in line 14) but the comment mentions PMULL, is this correct?

yuyichao · 2020-09-16T12:42:52Z

The features shown in /proc/cpuinfo are

.... but what CPU is it...

What does it mean that aes is implied by aes? The feature at line 15 is called aes (like the one in line 14) but the comment mentions PMULL, is this correct?

Yes it is correct. It says HWCAP_AES. Implied by `aes`. The exported name is to match the LLVM one and LLVM's aes actually means HWCAP_PMULL (ID_AA64ISAR0_EL1.AES == 2) rather than HWCAP_AES (ID_AA64ISAR0_EL1.AES == 1 || ID_AA64ISAR0_EL1.AES == 2). The commented out name also says aes simply because there's nothing else I can think of to put there....

giordano · 2020-09-16T12:47:25Z

.... but what CPU is it...

The list of features was the only information I could find in /proc/cpuinfo, no reference to the model. lscpu says ThunderX 88XX

yuyichao · 2020-09-16T12:50:43Z

The list of features was the only information I could find in /proc/cpuinfo, no reference to the model. lspcu says ThunderX 88XX

What's the full content of /proc/cpuinfo.

lspcu says ThunderX 88XX

ThunderX should be v8

giordano · 2020-09-16T12:52:32Z

What's the full content of /proc/cpuinfo.

processor       : 0
BogoMIPS        : 200.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x43
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0x0a1
CPU revision    : 1

repeated a bunch of times for all CPUs

yuyichao · 2020-09-16T12:54:56Z

CPU implementer : 0x43
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0x0a1
CPU revision    : 1

Yes that's a thunderx88 and we should be detecting it correctly. Why do you believe it's v8.1a?

staticfloat · 2020-09-16T15:54:09Z

I was told the machine was a Thunder X2. Heh. Well this is good progress!

staticfloat · 2020-09-16T17:36:27Z

Confirmed that on a Graviton2, this successfully identifies it as v8.2-a!

yuyichao · 2020-09-16T17:50:20Z

Since this detection is unreliable (it's based on known name) I would not recommend relying on it. You should check for the features you need instead.

staticfloat · 2020-09-16T17:55:27Z

We get to set the features we need; e.g. we're generating binaries in BB by using compiler flags like "-march=armv8.2-a". Can you point me to the right table to use to translate from the JL_AArch64_XXX flags to ensure that any binary generated with -march=armv8.2-a can run on that CPU?

yuyichao · 2020-09-16T18:05:00Z

Well, that's the part that needs to be fixed. v8.2a comes with features (don't remember which one off the top of my head) that cannot be reliably detected.

staticfloat · 2020-09-16T18:08:32Z

Fixed by whom? Is it the LLVM detection that is unreliable?

yuyichao · 2020-09-16T18:14:18Z

Fixed by whom?

Fixed by you, to change the compilation flags.

Is it the LLVM detection that is unreliable?

No, there's simply no way at the hardware level to detect which version it is. The only thing available is the feature set. And the linux kernel also does not expose all the features that LLVM may use.

staticfloat · 2020-09-16T18:22:44Z

So is what you're saying that we should avoid the -march=armv8.X-a flags completely, and instead use armv8a+feature1+feature2+feature3, and only use features that can be reliably detected from Julia?

yuyichao · 2020-09-16T18:26:49Z

Yes. At least until someone can find a way to reliably detect all features in all 8.*-a versions. It would be OK to give them aliases like v8.1-a but that shouldn't be the thing passed to the compiler.

staticfloat · 2020-09-16T18:30:58Z

Okay, this we can work with. So you suggest that we look through what is defined in https://github.com/JuliaLang/julia/blob/master/src/features_aarch64.h, take those as the official list of flags that can be reliably detected on AArch64, and map those to what is implied by each armv8.X-a level. We can then setup BB to define each "named" armv8.x-a level to that set of features, then we can test them here in the CPUID module. And we'll completely ignore the JL_AArch64_v8_Xa flags?

yuyichao · 2020-09-16T18:38:04Z

Yes.

yuyichao · 2020-09-16T18:39:54Z

All of those maps to kernel ABI so that's about as stable as it can be. It's possible that more could be added later for earlier versions but doing that shouldn't break anything.

staticfloat · 2020-09-16T19:43:50Z

As a datapoint, I looked at what archspec does (the microarchitecture management library used by spack) and it looks like they just use what you can get from /proc/cpuinfo, so they're in the same position as us, and they also only use families of feature flags.

Example mappings for thunderx2 and graviton2.

I'm going to collect here a list of features that we should map to: I'm going to edit this comment until I'm happy with it. :)

First off, some useful resources:

GCC aarch64 core definitions: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/aarch64/aarch64-cores.def
GCC base aarch64 architecture flags: https://github.com/gcc-mirror/gcc/blob/40d00b7aab9be046a98b282a149ff3829887ba79/gcc/config/aarch64/aarch64.h#L170-L181
armv8.0-a: The base level, no optional features enabled. Should run on every armv8 processor under the sun. GCC says that fp and asimd should always be on from this level.
armv8.1-a: GCC says: [fp, asimd], lse, crc and rdma (we call this rdm?).
armv8.2-a: GCC doesn't give us any extra required extensions here, but on this page it mentions that fp16 and ras are on by default, but I don't see that reflected in the code.
armv8.3-a: GCC doesn't give us any extra required extensions here, but on this page it mentions that rcpc and compnom are on by default, but I don't see that reflected in the code.
armv8.4-a: GCC says: [fp, asimd, lse, crc, rdma], fp16fml, dotprod
armv8.5-a: GCC says: [fp, asimd, lse, crc, rdma, fp16fml, dotprod], sb, ssbs, predres (this is not available in LLVM, apparently?)

Notably, I see that there are no base armv8.X-a levels that require common extensions such as the crypto extensions (e.g. (aes, sha2) on armv8.1- and (aes, sha2, sha3, sm4) on armv8.2+), or the sve extension, which we may want to support.

Questions:

fp and asimd are both commented out in the header, presumably because they'll always be on. Can we un-comment them so that we have less implicit knowledge and it's easier to do exhaustive tests?
Given your domain knowledge of cores, do you think there are useful levels that we should compile for? For x86_64 it seems clear that the avx, avx2, etc... feature boundaries are useful stepping stones for performance. In this case, I can only guess that we should support some base levels but I'm not sure what optional extensions we should build for, if any. Right now, I'm leaning toward us supporting the following:
- armv8.0-a -> [fp, asimd] - matches rpi4 in aarch64 mode, and thunderx88
- armv8.1-a -> [fp, asimd, lse, crc, rdma]
- armv8.2-a_crypto -> [fp, asimd, lse, crc, rdma, aes, sha2] - matches thunderx2, carmel, graviton2, and even Apple A12z (according to pytorch's cpuinfo)
- armv8.4-a_crypto_sve -> [fp, asimd, lse, crc, rdma, fp16fml, dotprod, aes, sha2, sve] - matches a64fx

yuyichao · 2020-09-16T20:27:15Z

First off, some useful resources:

You can just use our definition.

julia/src/processor_arm.cpp

Lines 230 to 341 in 4864161

    
           constexpr auto generic = get_feature_masks(); 
        
           constexpr auto armv8a_crc = get_feature_masks(crc); 
        
           constexpr auto armv8a_crc_crypto = armv8a_crc | get_feature_masks(aes, sha2); 
        
           constexpr auto armv8_1a = armv8a_crc | get_feature_masks(v8_1a, lse, rdm); // lor 
        
           constexpr auto armv8_1a_crypto = armv8_1a | get_feature_masks(aes, sha2); 
        
           constexpr auto armv8_2a = armv8_1a | get_feature_masks(v8_2a, ccpp); 
        
           constexpr auto armv8_2a_crypto = armv8_2a | get_feature_masks(aes, sha2); 
        
           constexpr auto armv8_3a = armv8_2a | get_feature_masks(v8_3a, jsconv, complxnum, rcpc); 
        
           constexpr auto armv8_3a_crypto = armv8_3a | get_feature_masks(aes, sha2); 
        
           constexpr auto armv8_4a = armv8_3a | get_feature_masks(v8_4a, dit, rcpc_immo, fmi); 
        
           constexpr auto armv8_4a_crypto = armv8_4a | get_feature_masks(aes, sha2); 
        
           constexpr auto armv8_5a = armv8_4a | get_feature_masks(v8_5a, sb, ccdp, altnzcv, fptoint); 
        
           constexpr auto armv8_6a = armv8_5a | get_feature_masks(v8_6a, i8mm, bf16); 
        
           // For ARM cores, the features required can be found in the technical reference manual 
        
           // The relevant register values and the features they are related to are: 
        
           // ID_AA64ISAR0_EL1: 
        
           //     .AES: aes, pmull 
        
           //     .SHA1: sha1 
        
           //     .SHA2: sha2, sha512 
        
           //     .CRC32: crc 
        
           //     .Atomic: les 
        
           //     .RDM: rdm 
        
           //     .SHA3: sha3 
        
           //     .SM3: sm3 (sm4) 
        
           //     .SM4: sm4 
        
           //     .DP: dotprod 
        
           //     .FHM: fp16fml 
        
           //     .TS: fmi, altnzcz 
        
           //     .RNDR: rand 
        
           // ID_AA64ISAR1_EL1 
        
           //     .JSCVT: jsconv 
        
           //     .FCMA: complxnum 
        
           //     .LRCPC: rcpc, rcpc_immo 
        
           //     .DPB: ccpp, ccdp 
        
           //     .SB: sb 
        
           //     .APA/.API: paca (pa) 
        
           //     .GPA/.GPI: paga (pa) 
        
           //     .FRINTTS: fptoint 
        
           //     .I8MM: i8mm 
        
           //     .BF16: bf16 
        
           //     .DGH: dgh 
        
           // ID_AA64PFR0_EL1 
        
           //     .FP: fullfp16 
        
           //     .SVE: sve 
        
           //     .DIT: dit 
        
           //     .BT: bti 
        
           // ID_AA64PFR1_EL1.SSBS: ssbs 
        
           // ID_AA64MMFR2_EL1.AT: uscat 
        
           // ID_AA64ZFR0_EL1 
        
           //     .SVEVer: sve2 
        
           //     .AES: sve2-aes, sve2-pmull 
        
           //     .BitPerm: sve2-bitperm 
        
           //     .SHA3: sve2-sha3 
        
           //     .SM4: sve2-sm4 
        
           //     .F32MM: f32mm 
        
           //     .F64MM: f64mm 
        
           constexpr auto arm_cortex_a34 = armv8a_crc; 
        
           constexpr auto arm_cortex_a35 = armv8a_crc; 
        
           constexpr auto arm_cortex_a53 = armv8a_crc; 
        
           constexpr auto arm_cortex_a55 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16, ssbs); 
        
           constexpr auto arm_cortex_a57 = armv8a_crc; 
        
           constexpr auto arm_cortex_a65 = armv8_2a | get_feature_masks(rcpc, fullfp16, ssbs); 
        
           constexpr auto arm_cortex_a72 = armv8a_crc; 
        
           constexpr auto arm_cortex_a73 = armv8a_crc; 
        
           constexpr auto arm_cortex_a75 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16); 
        
           constexpr auto arm_cortex_a76 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16, ssbs); 
        
           constexpr auto arm_cortex_a77 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16, ssbs); 
        
           constexpr auto arm_cortex_a78 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16, ssbs); // spe 
        
           constexpr auto arm_cortex_x1 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16, ssbs); // spe 
        
           constexpr auto arm_neoverse_e1 = armv8_2a | get_feature_masks(rcpc, fullfp16, ssbs); 
        
           constexpr auto arm_neoverse_n1 = armv8_2a | get_feature_masks(dotprod, rcpc, fullfp16, ssbs); 
        
           constexpr auto arm_zeus = armv8_4a | get_feature_masks(sve, i8mm, bf16, fullfp16, ssbs, rand); 
        
           constexpr auto cavium_thunderx = armv8a_crc_crypto; 
        
           constexpr auto cavium_thunderx88 = armv8a_crc_crypto; 
        
           constexpr auto cavium_thunderx88p1 = armv8a_crc_crypto; 
        
           constexpr auto cavium_thunderx81 = armv8a_crc_crypto; 
        
           constexpr auto cavium_thunderx83 = armv8a_crc_crypto; 
        
           constexpr auto cavium_thunderx2t99 = armv8_1a_crypto; 
        
           constexpr auto cavium_thunderx2t99p1 = cavium_thunderx2t99; 
        
           constexpr auto cavium_octeontx2 = armv8_2a_crypto; 
        
           constexpr auto fujitsu_a64fx = armv8_2a | get_feature_masks(sha2, fullfp16, sve, complxnum); 
        
           constexpr auto hisilicon_tsv110 = armv8_2a_crypto | get_feature_masks(dotprod, fullfp16); 
        
           constexpr auto hxt_phecda = armv8a_crc_crypto; 
        
           constexpr auto marvell_thunderx3t110 = armv8_3a_crypto; 
        
           constexpr auto nvidia_denver1 = generic; // TODO? (crc, crypto) 
        
           constexpr auto nvidia_denver2 = armv8a_crc_crypto; 
        
           constexpr auto nvidia_carmel = armv8_2a_crypto | get_feature_masks(fullfp16); 
        
           constexpr auto apm_xgene1 = generic; 
        
           constexpr auto apm_xgene2 = generic; // TODO? 
        
           constexpr auto apm_xgene3 = generic; // TODO? 
        
           constexpr auto qualcomm_kyro = armv8a_crc_crypto; 
        
           constexpr auto qualcomm_falkor = armv8a_crc_crypto | get_feature_masks(rdm); 
        
           constexpr auto qualcomm_saphira = armv8_4a_crypto; 
        
           constexpr auto samsung_exynos_m1 = armv8a_crc_crypto; 
        
           constexpr auto samsung_exynos_m2 = armv8a_crc_crypto; 
        
           constexpr auto samsung_exynos_m3 = armv8a_crc_crypto; 
        
           constexpr auto samsung_exynos_m4 = armv8_2a_crypto | get_feature_masks(dotprod, fullfp16); 
        
           constexpr auto samsung_exynos_m5 = samsung_exynos_m4; 
        
           constexpr auto apple_a7 = armv8a_crc_crypto; 
        
           constexpr auto apple_a10 = armv8a_crc_crypto | get_feature_masks(rdm); 
        
           constexpr auto apple_a11 = armv8_2a_crypto | get_feature_masks(fullfp16); 
        
           constexpr auto apple_a12 = armv8_3a_crypto | get_feature_masks(fullfp16); 
        
           constexpr auto apple_a13 = armv8_4a_crypto | get_feature_masks(fp16fml, fullfp16, sha3); 
        
           constexpr auto apple_s4 = apple_a12; 
        
           constexpr auto apple_s5 = apple_a12;

I believe this is the most complete list you can get from anywhere.

fp and asimd are both commented out in the header, presumably because they'll always be on. Can we un-comment them so that we have less implicit knowledge and it's easier to do exhaustive tests?

No, they are just required feature. Soft floating point is not supported. I'm not sure what can be tested about something that just isn't supposed to work.

Given your domain knowledge of cores, do you think there are useful levels that we should compile for?

It was "clear" for x86 only because there are changes on the vector registers that somewhat overshadows other changes. There are actually also a few other extensions like bit manipulation and cache (prefetch, invalidate, flush) manipulations, half precisions, that doesn't really fit into the sse/avx evolution timeline. Of course since there are only two major players there aren't too many lines of succession. Even then, AMD is known for deprecating old and unused extensions in newer processors and actually removing them so it's not actually not as simple as the illusion one might get from just looking at the vector registers.

On AArch64, the only extra SIMD level exists is SVE (and SVE2 I guess) so there just isn't the same evolution line that overshadow everything else. Also, SVE may or may not actually work with our compiler and unless someone get a chip/server from fujitsu or is working on fugaku you'll not be able to actually use the instruction. There is at least still another useful enough line of succession on aarch64 though, which is the memory model. 8.1-atomic (i.e. lse) and later ones like rcpc are interesting enough (and according to Keno good enough to run rr) so that's one feature that's worth supporting, though probably only for libraries that does their own locking etc.

Other than the atomics, there are basically about the same number of other random interesting features. They are just not burried among a million avx512 flavors........ The usuable hardwares with these extensions are scarce enough that I don't really know what are the ones that people uses. (if anyone want to teach me how to hack a mid-high end phone/tablet so that I can run gdb on it I'll be very interested.............) OTOH, despite the dozens of cores and hundreds of SoC names, the actual unique features are actually pretty limited. If you have a look at the feature lists we have, most cores are basically either v8-a or v8.2-a and potentially with extensions that agrees with the cortex-a line fairly well. Apple seems to be the most agressive player and is bumping their ISA version on every new chip since a11 but unless someone can figure out how to run linux on there it won't be very relavent for linux libraries.

So in the end I'll say just go with what known cores have is probably fine. This is also how I decided to suggest thunderx originally as an additional target since at the time it seems like it'll be the most immediate "server" platform that people can use. Feature wise this shouldn't give too many variance to support (it seems that v8 and v8.2,rcpc,fullfp16 would be good, maybe adding carmel if people care a lot since it seems to be missing rcpc). I'm not sure if there's anything that's worth optimizing for a particular core but I kind of doubt it. With more and more players (apple and marvell being the main exceptions) using straight or tweaked arm cortex lines the performance model is probably not going to diverge too much....

staticfloat · 2020-09-16T21:23:55Z

Thanks, that's really helpful Yichao!

No, they are just required feature. Soft floating point is not supported. I'm not sure what can be tested about something that just isn't supposed to work.

It's not about testing with a platform that doesn't have fp, it's about having our feature lists be consistent the rest of the world. It's annoying to have to carry the cognitive load of remembering that even though /proc/cpuinfo specifies fp and asmid, the fact that Julia doesn't advertise support for those flags doesn't mean it's not compatible, we just have to remember that it is automatically included.

I'm not sure if there's anything that's worth optimizing for a particular core but I kind of doubt it. With more and more players (apple and marvell being the main exceptions) using straight or tweaked arm cortex lines the performance model is probably not going to diverge too much....

So it sounds like just two marches for now may be okay:

armv8 => armv8.0-a+fp+asimd (yes, I know fp and asimd are already on by default, I just want to be explicit)
armv8_2_crypto => armv8.0-a+fp+asimd+lse+crc+rdma+aes+sha2

I'm not sure how important rcpc or fullfp16 is; is it worthwhile adding an armv8_2_crypto_rcpc? If you don't think it's that big of a deal, I vote that we just go with these two for now and we can add more later.

Co-authored-by: Elliot Saba <staticfloat@gmail.com>

yuyichao · 2020-09-16T22:07:16Z

It's annoying to have to carry the cognitive load of remembering that even though /proc/cpuinfo specifies fp and asmid, the fact that Julia doesn't advertise support for those flags doesn't mean it's not compatible, we just have to remember that it is automatically included.

Well, you are free to ignore these anywhere in the stack if you want copy-paste from /proc/cpuinfo etc. However, note that they are not the LLVM name so if anything if they are added here they will not be called these. (LLVM names are neon and fp-armv8 I believe) and these aren't the only one either so you always have to know the translation. That's why I added document to map between llvm name, HWCAP name and feature register bits.

Also, the two are always exactly the same according to the arm architecture manual so allowing both will be even more confusing. It's the same reason many other features are commented out, (e.g. asimdhp, sm3, svei8mm, svebf16)

yuyichao · 2020-09-16T22:09:10Z

I'm not sure how important rcpc or fullfp16 is; is it worthwhile adding an armv8_2_crypto_rcpc? If you don't think it's that big of a deal, I vote that we just go with these two for now and we can add more later.

According to Keno, rcpc is suspected to be required by rr. I don't think it'll be important for many external libraries (and still, I believe very few of these are important for external libraries and the rest can and should do internal dispatch.........................)

This is a small internal module which defines the API to query the Instruction Set Architecture (ISA) of the current CPU. This can be used for example to select an artifact in a JLL package which is compatible with the current CPU.

staticfloat

LGTM, I say ready to merge.

Thanks @yuyichao for his careful attention, and @giordano for his work putting this together!

staticfloat · 2020-09-17T01:38:51Z

According to Keno, rcpc is suspected to be required by rr.

He's not 100% certain, so we're going to do some experiments to figure out exactly what is required by building rr with different march'es and figuring out what we need. Once we know, we can add a mapping here (or even dynamically include it via a Pkg hook that I'm writing)

chriselrod · 2020-09-18T05:32:08Z

For example, LoopVectorization.jl must not (and as I understand it currently is) using anything like this to determine what code to generate.
The target used for julia code and the actual hardware target are usually the same. However, whenever they disagree the hardware one is almost always the wrong one to use.

And again, having some dirty hack that should work for library loading most of the time is fine, as long as it's clear that no one outside of base should use it and the API may be deleted at any time without notice.

@yuyichao I didn't see this. It would be great to have an API that provides features that are safe to really on.
LoopVectorization hasn't relied on CpuId for a couple months, instead queering LLVM with getHostCpuFeaures. I made the change from CpuId because some people experienced crashes when running on avx512 hardware while LLVM not listing it among the features (the recent bug on mac's).

A more abstracted api would be generally useful. It's be great to have a reliable FMA_NATIVE, for example.

yuyichao · 2020-09-18T05:42:57Z

The answer from LLVM was wrong as well.

chriselrod · 2020-09-18T05:57:45Z

It solved the issue on Macs, but it is incorrect when starting Julia with -Civybridge. I haven't noticed any issues outside of specifying arch on startup yet.
Is there any way to do this correctly?

Aside from "don't generate machine specific code from Julia at all".
Maybe someday I can try implementing it in LLVM, but experimenting is much easier in Julia -- at least for me -- and the library is still immature, lacking many capabilities I intend it to have. Meaning there is still a lot more experimenting to be done.

Unless I'm mistaken, getHostCpuFeatures should be correct as long as llvm's rather is the host, which will be the case unless user's deliberately specified something else when starting Julia.

vchuravy · 2020-09-22T00:26:29Z

base/Makefile

+@echo >> $@
+endef
+
+$(BUILDDIR)/features_h.jl: ../src/features_x86.h ../src/features_aarch32.h ../src/features_aarch64.h


Broke out of dir build":

vchuravy@odin ~/b/julia-replloader [2]> make -j make[1]: *** No rule to make target '../src/features_x86.h', needed by 'features_h.jl'. Stop. make[1]: *** Waiting for unfinished jobs.... make: *** [/home/vchuravy/src/julia/Makefile:66: julia-base] Error 2 make: *** Waiting for unfinished jobs....

yuyichao requested changes Sep 8, 2020

View reviewed changes

giordano force-pushed the mg/cpuid branch from 4a9ed1e to 5d76385 Compare September 9, 2020 18:53

giordano marked this pull request as draft September 9, 2020 19:01

giordano commented Sep 9, 2020

View reviewed changes

base/sysinfo.jl Outdated Show resolved Hide resolved

giordano force-pushed the mg/cpuid branch 2 times, most recently from fc6e8f7 to 14cd2d1 Compare September 12, 2020 22:32

giordano marked this pull request as ready for review September 12, 2020 22:34

staticfloat requested changes Sep 14, 2020

View reviewed changes

base/binaryplatforms.jl Show resolved Hide resolved

base/binaryplatforms.jl Outdated Show resolved Hide resolved

giordano force-pushed the mg/cpuid branch from 14cd2d1 to 4ce5867 Compare September 15, 2020 22:56

KristofferC reviewed Sep 16, 2020

View reviewed changes

Export jl_test_cpu_feature and generate Julia list of CPU features

f40dbbd

Co-authored-by: Elliot Saba <staticfloat@gmail.com>

giordano force-pushed the mg/cpuid branch 2 times, most recently from 19e5300 to d73df58 Compare September 16, 2020 23:25

Add CPUID submodule

f1bb7df

This is a small internal module which defines the API to query the Instruction Set Architecture (ISA) of the current CPU. This can be used for example to select an artifact in a JLL package which is compatible with the current CPU.

giordano force-pushed the mg/cpuid branch from d73df58 to f1bb7df Compare September 16, 2020 23:45

staticfloat approved these changes Sep 17, 2020

View reviewed changes

staticfloat merged commit 0f782b2 into JuliaLang:master Sep 17, 2020

giordano deleted the mg/cpuid branch September 17, 2020 22:12

vchuravy reviewed Sep 22, 2020

View reviewed changes

Add CPUID submodule #37479

Add CPUID submodule #37479

Conversation

giordano commented Sep 8, 2020

yuyichao commented Sep 8, 2020

yuyichao commented Sep 8, 2020

yuyichao left a comment

Choose a reason for hiding this comment

staticfloat commented Sep 8, 2020

yuyichao commented Sep 8, 2020 • edited Loading

giordano commented Sep 8, 2020

yuyichao commented Sep 8, 2020 • edited Loading

giordano commented Sep 9, 2020

yuyichao commented Sep 9, 2020

codecov-commenter commented Sep 9, 2020

Codecov Report

giordano commented Sep 9, 2020

giordano commented Sep 12, 2020

staticfloat left a comment

Choose a reason for hiding this comment

giordano commented Sep 15, 2020 • edited Loading

giordano commented Sep 16, 2020

yuyichao commented Sep 16, 2020

KristofferC Sep 16, 2020

Choose a reason for hiding this comment

staticfloat Sep 16, 2020

Choose a reason for hiding this comment

KristofferC Sep 16, 2020 • edited Loading

Choose a reason for hiding this comment

staticfloat Sep 16, 2020 • edited Loading

Choose a reason for hiding this comment

KristofferC Sep 16, 2020 • edited Loading

Choose a reason for hiding this comment

yuyichao Sep 16, 2020

Choose a reason for hiding this comment

giordano Sep 16, 2020

Choose a reason for hiding this comment

giordano commented Sep 16, 2020

yuyichao commented Sep 16, 2020

giordano commented Sep 16, 2020 • edited Loading

yuyichao commented Sep 16, 2020 • edited Loading

giordano commented Sep 16, 2020

yuyichao commented Sep 16, 2020

staticfloat commented Sep 16, 2020

staticfloat commented Sep 16, 2020

yuyichao commented Sep 16, 2020

staticfloat commented Sep 16, 2020

yuyichao commented Sep 16, 2020

staticfloat commented Sep 16, 2020

yuyichao commented Sep 16, 2020

staticfloat commented Sep 16, 2020

yuyichao commented Sep 16, 2020

staticfloat commented Sep 16, 2020

yuyichao commented Sep 16, 2020

yuyichao commented Sep 16, 2020

staticfloat commented Sep 16, 2020 • edited Loading

yuyichao commented Sep 16, 2020

staticfloat commented Sep 16, 2020

yuyichao commented Sep 16, 2020

yuyichao commented Sep 16, 2020

staticfloat left a comment

Choose a reason for hiding this comment

staticfloat commented Sep 17, 2020 • edited Loading

chriselrod commented Sep 18, 2020

yuyichao commented Sep 18, 2020

chriselrod commented Sep 18, 2020 • edited Loading

vchuravy Sep 22, 2020

Choose a reason for hiding this comment

yuyichao commented Sep 8, 2020 •

edited

Loading

yuyichao commented Sep 8, 2020 •

edited

Loading

giordano commented Sep 15, 2020 •

edited

Loading

KristofferC Sep 16, 2020 •

edited

Loading

staticfloat Sep 16, 2020 •

edited

Loading

KristofferC Sep 16, 2020 •

edited

Loading

giordano commented Sep 16, 2020 •

edited

Loading

yuyichao commented Sep 16, 2020 •

edited

Loading

staticfloat commented Sep 16, 2020 •

edited

Loading

staticfloat commented Sep 17, 2020 •

edited

Loading

chriselrod commented Sep 18, 2020 •

edited

Loading