Improve cpu info for non-x86 architectures #9

Open · chriselrod opened this issue May 24, 2020 · 10 comments

@chriselrod (Member) commented May 24, 2020

Currently, it uses a generic build script.

This script assumes:

const REGISTER_SIZE = 16
const REGISTER_COUNT = 16
const CACHELINE_SIZE = 64
const SIMD_NATIVE_INTEGERS = true

If any of these assumptions is violated, dependent libraries (e.g., LoopVectorization) are likely to generate suboptimal code. If these numbers undershoot, some performance is left on the table, but the code should still perform reasonably well. If they overshoot, the consequences could be dire: register spills galore.

I believe some ARM CPUs do not have SIMD Float64, so perhaps this should be handled somehow.

Ideally, we'd use a library like CpuId.jl to query hardware info, like we do for AMD and Intel.
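
For the x86 path, that query might look roughly like this; a minimal sketch assuming CpuId.jl's exported simdbytes(), cachelinesize(), and cpufeature(::Symbol) functions, with the register count inferred from the ISA since it isn't reported directly:

using CpuId

# Rough sketch of the x86 query path, assuming CpuId.jl's exported API.
const REGISTER_SIZE  = simdbytes()      # 16, 32, or 64 bytes depending on SSE/AVX/AVX-512
const CACHELINE_SIZE = cachelinesize()  # typically 64 on x86
# The register count isn't reported directly; infer it from the ISA:
# 64-bit x86 has 32 vector registers with AVX-512, 16 otherwise.
const REGISTER_COUNT = cpufeature(:AVX512F) ? 32 : 16
const SIMD_NATIVE_INTEGERS = true       # x86 has had SIMD integer instructions since SSE2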

chriselrod changed the title from "Improve build script for non-x86 architectures" to "Improve cpu info for non-x86 architectures" on Jul 29, 2020
@iamed2 commented Jul 29, 2020

arm64 is available as a Travis test environment, it might be good to set up testing in a branch and see where expectations match reality. It's Arm v8.0-A, SBSA Level 3, so it should support 64-bit floating point SIMD.

@chriselrod (Member Author) commented:

Yes, that comment was a little out of date. You're right.
It would be great to start testing it on ARM.

I'll have to figure out how to do feature detection.
Are CPUs that don't support Float64 SIMD actually common? What about Raspberry Pis?

It would also be good to know whether 32 floating-point registers are the norm; that would help with a lot of operations, like BLAS.
Testing would confirm that things work without unexpected errors or crashes, but performance benchmarks would be helpful as well.

@iamed2 commented Jul 30, 2020

I think all AArch64/arm64 CPUs with SIMD will support Float64, so you'd be okay making that particular assumption based on the architecture. AArch32 is more complicated since it covers many more versions of ARM; based on the ARMv8 spec, I think some AArch32 processors can run SIMD operations on regular registers as well as SIMD registers, which complicates things further. I'm pretty sure Raspberry Pis have been AArch32 for a long time; I'm not sure whether there are newer ones that are AArch64, but the vast majority will be AArch32. That said, AArch64 is going to be a huge new user base, since Amazon is heavily advertising their new Graviton2 processors and Apple is switching to it for seemingly all of their products.

Not sure about the number of registers. The ARMv8-A spec is here: https://documentation-service.arm.com/static/5f20515cbb903e39c84dc459 but that's only one spec.

For feature detection on Linux, the kernel provides asm/hwcap.h (e.g., https://android.googlesource.com/kernel/msm/+/android-7.1.0_r0.2/arch/arm64/include/uapi/asm/hwcap.h). You can get the feature data using hwcap = ccall(:getauxval, Culong, (Culong,), 16) (AT_HWCAP is 16, apparently) and then & it with the feature bits from the header. Not sure what other operating systems provide.
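
A minimal Linux-only sketch of that approach, assuming the AArch64 hwcap header linked above, where HWCAP_FP and HWCAP_ASIMD are bits 0 and 1:

# Linux-only sketch: read the AArch64 hwcap bits via getauxval(3).
# AT_HWCAP = 16; HWCAP_FP and HWCAP_ASIMD are bits 0 and 1 in asm/hwcap.h.
const AT_HWCAP    = Culong(16)
const HWCAP_FP    = Culong(1) << 0
const HWCAP_ASIMD = Culong(1) << 1

hwcap = ccall(:getauxval, Culong, (Culong,), AT_HWCAP)
has_fp    = (hwcap & HWCAP_FP)    != 0   # scalar floating point
has_asimd = (hwcap & HWCAP_ASIMD) != 0   # Advanced SIMD (128-bit vectors, incl. Float64)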

@chriselrod (Member Author) commented:

There's also this for feature detection:

using Libdl
# Find the LLVM shared library Julia itself has loaded.
llvmlib = Libdl.dlopen(only(filter(lib -> occursin(r"LLVM\b", basename(lib)), Libdl.dllist())))
# LLVMGetHostCPUFeatures returns a comma-separated "+feat"/"-feat" string for the host CPU.
gethostcpufeatures = Libdl.dlsym(llvmlib, :LLVMGetHostCPUFeatures)
features_cstring = ccall(gethostcpufeatures, Cstring, ())
# Split on commas; drop entries whose second character is a digit.
features = filter(ext -> (m = match(r"\d", ext); isnothing(m) ? true : m.offset != 2), split(unsafe_string(features_cstring), ','))

But from what I recall of discussions on Julia issues/PRs, this is going to be incomplete on ARM.

I'll still have to see how to translate hwcap into the information I want, the most important of which is the 4 constants from the opening post.

@iamed2 commented Jul 30, 2020

I think you can find cache info from the same kernel function getauxval using a different AT_ value: https://man7.org/linux/man-pages/man3/getauxval.3.html

Not sure about that register information. @PallHaraldsson answered a related question on Quora so he might have some suggestions.

I'll note that the ARMv8 spec linked above does tell you exactly how many registers of each kind it has, so in the worst case you could hardcode the numbers by architecture.
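
If the auxiliary-vector route pans out, a sketch might look like the following; AT_DCACHEBSIZE = 19 is taken from glibc's elf.h, and which architectures actually populate it is an open question (it can simply come back as 0):

# Sketch only: ask the auxiliary vector for the data-cache block size.
# AT_DCACHEBSIZE = 19 per glibc's elf.h; many architectures leave it unset (0).
const AT_DCACHEBSIZE = Culong(19)
dcache_block = ccall(:getauxval, Culong, (Culong,), AT_DCACHEBSIZE)
cacheline_size = dcache_block == 0 ? 64 : Int(dcache_block)  # fall back to 64 if unreported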

@chriselrod (Member Author) commented:

I think hard-coding the number of registers would be worth it.
Currently, LoopVectorization only accounts for the number of vector registers and just hopes performance isn't hurt too badly by spilling the others (it does spill integer registers in some cases, but I think doing so is worthwhile there).

For cross-platform cache information, perhaps we should depend on Hwloc.jl?

Although I'm not opposed to making OS-specific performance optimizations (we already use faster SIMD special functions on Linux).
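
A sketch of what per-architecture hardcoding might look like, using ISA baselines (AArch64's Advanced SIMD defines 32 128-bit vector registers V0-V31; 32-bit ARM with NEON has 16 Q registers; 64-bit x86 has 16 vector registers without AVX-512); the function name and generic fallback numbers here are hypothetical:

# Hypothetical fallback table keyed on Sys.ARCH, used only when the CPU
# can't be queried directly. Numbers are ISA baselines, not per-model tuning.
function default_register_info()
    if Sys.ARCH === :x86_64
        (register_size = 16, register_count = 16)   # SSE2 baseline; 32 registers only with AVX-512
    elseif Sys.ARCH === :aarch64
        (register_size = 16, register_count = 32)   # ASIMD: 32 × 128-bit V registers
    elseif startswith(string(Sys.ARCH), "arm")      # 32-bit ARM, e.g. armv7l with NEON
        (register_size = 16, register_count = 16)   # 16 × 128-bit Q registers
    else
        (register_size = 8, register_count = 16)    # conservative generic fallback
    end
end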

@iamed2 commented Aug 17, 2020

For cross-platform cache information, perhaps we should depend on Hwloc.jl?

That's a good idea! Looks like it needs some fixes first though: JuliaParallel/Hwloc.jl#31

@turingtest37 commented Dec 2, 2020

I believe I am struggling with the same issue others have reported: errors compiling LoopVectorization on a Jetson Nano.
The problem seems to be that Hwloc does not report any L3 cache, which line 8 of topology.jl assumes to exist.
Here's what the compile error looks like on my Nano:

[ Info: Precompiling VectorizationBase [3d5dd08c-fd9d-11e8-17fa-ed2836048c2f]
ERROR: LoadError: LoadError: type NullAttr has no field size
Stacktrace:
 [1] getproperty(::Hwloc.NullAttr, ::Symbol) at ./Base.jl:33
 [2] top-level scope at /home/doug/.julia/packages/VectorizationBase/26Yla/src/topology.jl:8
 [3] include(::Function, ::Module, ::String) at ./Base.jl:380
 [4] include at ./Base.jl:368 [inlined]
 [5] include(::String) at /home/doug/.julia/packages/VectorizationBase/26Yla/src/VectorizationBase.jl:1
 [6] top-level scope at /home/doug/.julia/packages/VectorizationBase/26Yla/src/VectorizationBase.jl:270
 [7] include(::Function, ::Module, ::String) at ./Base.jl:380
 [8] include(::Module, ::String) at ./Base.jl:368
 [9] top-level scope at none:2
 [10] eval at ./boot.jl:331 [inlined]
 [11] eval(::Expr) at ./client.jl:467
 [12] top-level scope at ./none:3
in expression starting at /home/doug/.julia/packages/VectorizationBase/26Yla/src/topology.jl:8
in expression starting at /home/doug/.julia/packages/VectorizationBase/26Yla/src/VectorizationBase.jl:270

Digging deeper,

julia> using Hwloc
julia> const TOPOLOGY = Hwloc.topology_load();
julia> topology = Hwloc.topology_load()
D0: L0 P0 Machine  
    D1: L0 P0 Package  
        D2: L0 P-1 L2Cache  Cache{size=2097152,depth=2,linesize=64,associativity=16,type=Unified}
            D3: L0 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L0 P0 Core  
                    D5: L0 P0 PU  
            D3: L1 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L1 P1 Core  
                    D5: L1 P1 PU  
            D3: L2 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L2 P2 Core  
                    D5: L2 P2 PU  
            D3: L3 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L3 P3 Core  
                    D5: L3 P3 PU  
julia> const CACHE = TOPOLOGY.children[1].children[1];
        D2: L0 P-1 L2Cache  Cache{size=2097152,depth=2,linesize=64,associativity=16,type=Unified}
            D3: L0 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L0 P0 Core  
                    D5: L0 P0 PU  
            D3: L1 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L1 P1 Core  
                    D5: L1 P1 PU  
            D3: L2 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L2 P2 Core  
                    D5: L2 P2 PU  
            D3: L3 P-1 L1Cache  Cache{size=32768,depth=1,linesize=64,associativity=2,type=Data}
                D4: L3 P3 Core  
                    D5: L3 P3 PU  

Subsequently (from topology.jl, line 8)

julia> CACHE.children[1].children[1]
                D4: L0 P0 Core  
                    D5: L0 P0 PU  
julia> CACHE.children[1].children[1].attr
julia>

I'd be happy to help fix this!
-thanks
doug

@chriselrod (Member Author) commented:

I shouldn't assume an L3 cache exists. Lots of non-x86 CPUs have only two levels of cache, including the M1.

@chriselrod (Member Author) commented Dec 2, 2020

I'd be happy to help fix this!

You're welcome to make a PR!
You could set all the references corresponding to a missing cache level to nothing.

Or, for CACHE_SIZE and CACHE_COUNT, just make them tuples whose length equals the number of cache levels.
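
A rough sketch of the tuple approach, assuming the Hwloc.jl API shown above (objects expose children, and cache nodes carry an attr of type Hwloc.Cache with size and depth fields); collect_caches! is a hypothetical helper:

using Hwloc

# Recursively gather every Hwloc.Cache in the topology, grouped by level
# (attr.depth: 1 = L1, 2 = L2, ...), without assuming an L3 exists.
function collect_caches!(caches, obj)
    obj.attr isa Hwloc.Cache && push!(get!(caches, obj.attr.depth, Hwloc.Cache[]), obj.attr)
    foreach(child -> collect_caches!(caches, child), obj.children)
    return caches
end

caches  = collect_caches!(Dict{Int,Vector{Hwloc.Cache}}(), Hwloc.topology_load())
nlevels = isempty(caches) ? 0 : maximum(keys(caches))

const CACHE_SIZE  = ntuple(l -> first(caches[l]).size, nlevels)  # bytes per cache at level l
const CACHE_COUNT = ntuple(l -> length(caches[l]), nlevels)      # number of caches at level l

On the Jetson Nano topology posted above this would yield length-2 tuples (L1 and L2 only), instead of erroring on a missing L3.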
