Improve cpu info for non-x86 architectures #9

Open · chriselrod opened this issue May 24, 2020 · 10 comments

@chriselrod (Member) commented May 24, 2020

Currently, it uses a generic build script.

This script assumes:

const REGISTER_SIZE = 16
const REGISTER_COUNT = 16
const CACHELINE_SIZE = 64
const SIMD_NATIVE_INTEGERS = true

If any of these assumptions is violated, dependent libraries (e.g., LoopVectorization) are likely to generate suboptimal code. If these numbers undershoot, some performance is left on the table, but the code should still perform reasonably well. If they overshoot, the consequences could be dire: register spills galore.

I believe some ARM CPUs do not have SIMD Float64, so perhaps this should be handled somehow.

Ideally, we'd use a library like CpuId.jl to query hardware info, like we do for AMD and Intel.
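
For the x86 path, that query might look roughly like this; a minimal sketch assuming CpuId.jl's exported simdbytes(), cachelinesize(), and cpufeature(::Symbol) functions, with the register count inferred from the ISA since it isn't reported directly:

using CpuId

# Rough sketch of the x86 query path, assuming CpuId.jl's exported API.
const REGISTER_SIZE  = simdbytes()      # 16, 32, or 64 bytes depending on SSE/AVX/AVX-512
const CACHELINE_SIZE = cachelinesize()  # typically 64 on x86
# The register count isn't reported directly; infer it from the ISA:
# 64-bit x86 has 32 vector registers with AVX-512, 16 otherwise.
const REGISTER_COUNT = cpufeature(:AVX512F) ? 32 : 16
const SIMD_NATIVE_INTEGERS = true       # x86 has had SIMD integer instructions since SSE2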

chriselrod changed the title from "Improve build script for non-x86 architectures" to "Improve cpu info for non-x86 architectures" on Jul 29, 2020
@iamed2 commented Jul 29, 2020

arm64 is available as a Travis test environment, it might be good to set up testing in a branch and see where expectations match reality. It's Arm v8.0-A, SBSA Level 3, so it should support 64-bit floating point SIMD.

@chriselrod (Member Author) commented:

Yes, that comment was a little out of date. You're right.
It would be great to start testing it on ARM.

I'll have to figure out how to do feature detection.
Are CPUs that don't support Float64 SIMD actually common? What about Raspberry Pis?

It would also be good to know whether 32 floating-point registers are the norm; that would help with a lot of operations, like BLAS.
Testing would confirm that things work without unexpected errors or crashes, but performance benchmarks would be helpful as well.

@iamed2 commented Jul 30, 2020

I think all AArch64/arm64 CPUs with SIMD will support Float64, so you'd be okay making that particular assumption based on the architecture. AArch32 is more complicated since it covers many more versions of ARM; based on the ARMv8 spec, I think some AArch32 processors can run SIMD operations on regular registers as well as SIMD registers, which complicates things further. I'm pretty sure Raspberry Pis have been AArch32 for a long time; I'm not sure whether there are newer ones that are AArch64, but the vast majority will be AArch32. That said, AArch64 is going to be a huge new user base, since Amazon is heavily advertising their new Graviton2 processors and Apple is switching to it for seemingly all of their products.

Not sure about the number of registers. The ARMv8-A spec is here: https://documentation-service.arm.com/static/5f20515cbb903e39c84dc459 but that's only one spec.

For feature detection on Linux, the kernel provides asm/hwcap.h (e.g., https://android.googlesource.com/kernel/msm/+/android-7.1.0_r0.2/arch/arm64/include/uapi/asm/hwcap.h). You can get the feature data using hwcap = ccall(:getauxval, Culong, (Culong,), 16) (AT_HWCAP is 16, apparently) and then & it with the feature bits from the header. Not sure what other operating systems provide.
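
A minimal Linux-only sketch of that approach, assuming the AArch64 hwcap header linked above, where HWCAP_FP and HWCAP_ASIMD are bits 0 and 1:

# Linux-only sketch: read the AArch64 hwcap bits via getauxval(3).
# AT_HWCAP = 16; HWCAP_FP and HWCAP_ASIMD are bits 0 and 1 in asm/hwcap.h.
const AT_HWCAP    = Culong(16)
const HWCAP_FP    = Culong(1) << 0
const HWCAP_ASIMD = Culong(1) << 1

hwcap = ccall(:getauxval, Culong, (Culong,), AT_HWCAP)
has_fp    = (hwcap & HWCAP_FP)    != 0   # scalar floating point
has_asimd = (hwcap & HWCAP_ASIMD) != 0   # Advanced SIMD (128-bit vectors, incl. Float64)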

@chriselrod (Member Author) commented:

There's also this for feature detection:

using Libdl
# Find the LLVM shared library Julia itself has loaded.
llvmlib = Libdl.dlopen(only(filter(lib -> occursin(r"LLVM\b", basename(lib)), Libdl.dllist())))
# LLVMGetHostCPUFeatures returns a comma-separated "+feat"/"-feat" string for the host CPU.
gethostcpufeatures = Libdl.dlsym(llvmlib, :LLVMGetHostCPUFeatures)
features_cstring = ccall(gethostcpufeatures, Cstring, ())
# Split on commas; drop entries whose second character is a digit.
features = filter(ext -> (m = match(r"\d", ext); isnothing(m) ? true : m.offset != 2), split(unsafe_string(features_cstring), ','))

But from what I recall of discussions on Julia issues/PRs, this is going to be incomplete on ARM.

I'll still have to see how to translate hwcap into the information I want, the most important of which is the 4 constants from the opening post.

@iamed2 commented Jul 30, 2020

I think you can find cache info from the same kernel function getauxval using a different AT_ value: https://man7.org/linux/man-pages/man3/getauxval.3.html

Not sure about that register information. @PallHaraldsson answered a related question on Quora so he might have some suggestions.

I'll note that the ARMv8 spec linked above does tell you exactly how many registers of each kind it has, so in the worst case you could hardcode the numbers by architecture.
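
If the auxiliary-vector route pans out, a sketch might look like the following; AT_DCACHEBSIZE = 19 is taken from glibc's elf.h, and which architectures actually populate it is an open question (it can simply come back as 0):

# Sketch only: ask the auxiliary vector for the data-cache block size.
# AT_DCACHEBSIZE = 19 per glibc's elf.h; many architectures leave it unset (0).
const AT_DCACHEBSIZE = Culong(19)
dcache_block = ccall(:getauxval, Culong, (Culong,), AT_DCACHEBSIZE)
cacheline_size = dcache_block == 0 ? 64 : Int(dcache_block)  # fall back to 64 if unreported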

@chriselrod (Member Author) commented:

I think hard-coding the number of registers would be worth it.
Currently, LoopVectorization only accounts for the number of vector registers and just hopes performance isn't hurt too badly by spilling the others (it does spill integer registers in some cases, but I think doing so is worthwhile there).

For cross-platform cache information, perhaps we should depend on Hwloc.jl?

Although I'm not opposed to making OS-specific performance optimizations (we already use faster SIMD special functions on Linux).
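
A sketch of what per-architecture hardcoding might look like, using ISA baselines (AArch64's Advanced SIMD defines 32 128-bit vector registers V0-V31; 32-bit ARM with NEON has 16 Q registers; 64-bit x86 has 16 vector registers without AVX-512); the function name and generic fallback numbers here are hypothetical:

# Hypothetical fallback table keyed on Sys.ARCH, used only when the CPU
# can't be queried directly. Numbers are ISA baselines, not per-model tuning.
function default_register_info()
    if Sys.ARCH === :x86_64
        (register_size = 16, register_count = 16)   # SSE2 baseline; 32 registers only with AVX-512
    elseif Sys.ARCH === :aarch64
        (register_size = 16, register_count = 32)   # ASIMD: 32 × 128-bit V registers
    elseif startswith(string(Sys.ARCH), "arm")      # 32-bit ARM, e.g. armv7l with NEON
        (register_size = 16, register_count = 16)   # 16 × 128-bit Q registers
    else
        (register_size = 8, register_count = 16)    # conservative generic fallback
    end
end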

@iamed2 commented Aug 17, 2020

For cross-platform cache information, perhaps we should depend on Hwloc.jl?

That's a good idea! Looks like it needs some fixes first though: JuliaParallel/Hwloc.jl#31

@turingtest37 commented Dec 2, 2020

I believe I am struggling with the same issue others have reported: errors compiling LoopVectorization on a Jetson Nano.
The problem seems to be that Hwloc does not report any L3 cache, which line 8 of topology.jl assumes to exist.
Here's what the compile error looks like on my Nano:

[ Info: Precompiling VectorizationBase [3d5dd08c-fd9d-11e8-17fa-ed2836048c2f]
ERROR: LoadError: LoadError: type NullAttr has no field size
Stacktrace:
 [1] getproperty(::Hwloc.NullAttr, ::Symbol) at ./Base.jl:33
 [2] top-level scope at /home/doug/.julia/packages/VectorizationBase/26Yla/src/topology.jl:8
 [3] include(::Function, ::Module, ::String) at ./Base.jl:380
 [4] include at ./Base.jl:368 [inlined]
 [5] include(::String) at /home/doug/.julia/packages/VectorizationBase/26Yla/src/VectorizationBase.jl:1
 [6] top-level scope at /home/doug/.julia/packages/VectorizationBase/26Yla/src/VectorizationBase.jl:270
 [7] include(::Function, ::Module, ::String) at ./Base.jl:380
 [8] include(::Module, ::String) at ./Base.jl:368
 [9] top-level scope at none:2
 [10] eval at ./boot.jl:331 [inlined]
 [11] eval(::Expr) at ./client.jl:467
 [12] top-level scope at ./none:3
in expression starting at /home/doug/.julia/packages/VectorizationBase/26Yla/src/topology.jl:8
in expression starting at /home/doug/.julia/packages/VectorizationBase/26Yla/src/VectorizationBase.jl:270

Digging deeper,

julia> using Hwloc
julia> const TOPOLOGY = Hwloc.topology_load();
julia> topology = Hwloc.topology_load()
D0: L0 P0 Machine  
    D1: L0 P0 Package  
        D2: L0 P-1 L2Cache  Cache{size=2097152,depth=2,linesize=64,associativity=16,type=Unified}
            D3: L0 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L0 P0 Core  
                    D5: L0 P0 PU  
            D3: L1 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L1 P1 Core  
                    D5: L1 P1 PU  
            D3: L2 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L2 P2 Core  
                    D5: L2 P2 PU  
            D3: L3 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L3 P3 Core  
                    D5: L3 P3 PU  
julia> const CACHE = TOPOLOGY.children[1].children[1];
        D2: L0 P-1 L2Cache  Cache{size=2097152,depth=2,linesize=64,associativity=16,type=Unified}
            D3: L0 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L0 P0 Core  
                    D5: L0 P0 PU  
            D3: L1 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L1 P1 Core  
                    D5: L1 P1 PU  
            D3: L2 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L2 P2 Core  
                    D5: L2 P2 PU  
            D3: L3 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L3 P3 Core  
                    D5: L3 P3 PU  

Subsequently (from topology.jl, line 8)

julia> CACHE.children[1].children[1]
                D4: L0 P0 Core  
                    D5: L0 P0 PU  
julia> CACHE.children[1].children[1].attr
julia>

I'd be happy to help fix this!
-thanks
doug

@chriselrod (Member Author) commented:

I shouldn't assume an L3 cache exists. Lots of non-x86 CPUs have only two levels of cache, including the M1.

@chriselrod (Member Author) commented Dec 2, 2020

I'd be happy to help fix this!

You're welcome to make a PR!
You could set all the references corresponding to a missing cache level to nothing.

Or, for CACHE_SIZE and CACHE_COUNT, just make them tuples whose length equals the number of cache levels.
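
A rough sketch of the tuple approach, assuming the Hwloc.jl API shown above (objects expose children, and cache nodes carry an attr of type Hwloc.Cache with size and depth fields); collect_caches! is a hypothetical helper:

using Hwloc

# Recursively gather every Hwloc.Cache in the topology, grouped by level
# (attr.depth: 1 = L1, 2 = L2, ...), without assuming an L3 exists.
function collect_caches!(caches, obj)
    obj.attr isa Hwloc.Cache && push!(get!(caches, obj.attr.depth, Hwloc.Cache[]), obj.attr)
    foreach(child -> collect_caches!(caches, child), obj.children)
    return caches
end

caches  = collect_caches!(Dict{Int,Vector{Hwloc.Cache}}(), Hwloc.topology_load())
nlevels = isempty(caches) ? 0 : maximum(keys(caches))

const CACHE_SIZE  = ntuple(l -> first(caches[l]).size, nlevels)  # bytes per cache at level l
const CACHE_COUNT = ntuple(l -> length(caches[l]), nlevels)      # number of caches at level l

On the Jetson Nano topology posted above this would yield length-2 tuples (L1 and L2 only), instead of erroring on a missing L3.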
