Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible incorrect AVX512 detection #1947

Closed
yuyichao opened this issue Jan 5, 2019 · 29 comments
Closed

Possible incorrect AVX512 detection #1947

yuyichao opened this issue Jan 5, 2019 · 29 comments
Milestone

Comments

@yuyichao
Copy link
Contributor

yuyichao commented Jan 5, 2019

Ref JuliaLang/julia#29652 (comment), see my version of the code in julia (which is mostly copied and updated from LLVM).

Disclaimer: I do not have the hardware to test this so I cannot tell if this is already fixed or not (by looking at the code I assume not, unless it's handled in a very different way than avx support). That's why I wished that the two people that have actually experienced this issue to report them instead. I'm reporting it now since I just noticed that none of them seem to have done that already. Feel free to close if I'm just missing the issue in my search somehow. I won't be able to help testing any patch but I'm happy to explain the code that does the testing in julia (I feel like it's fairly simple and very similar to the avx one though....).

@yuyichao
Copy link
Contributor Author

yuyichao commented Jan 5, 2019

Oh, just realize that I didn't put a description of the issue here so just to summarize the issue so that ppl don't have to read through the julia issue thread.

The OpenBLAS avx512 detection code seems to be purely based on processor part number. While this is probably necessary for figuring out the performance model, it's not enough to guarantee the support of the AVX512 feature. Similar to AVX, both the CPUID feature bit and the zmm state bit in XCR0 needs to be checked to make sure both the processor and the OS supports AVX512.

If I'm reading my code correctly, the bits should be the 0x10000 and 0x20000 bits of EBX for CPUID(EAX=7, ECX=0) and the 0xe0 bit of xcr0.

@mk
Copy link

mk commented Jan 5, 2019

I happy to test this on my machine with avx512.

@martin-frbg
Copy link
Collaborator

If I am reading this correctly, it would affect "only" VM setups where the virtual hardware claims to have the host CPU but does not provide all its features ? At least as far as I know, there are no physical processors around that identify as SkylakeX and do not support AVX512 (?)
"Historically" neither VMs nor prepackaged OpenBLAS distributed with Julia or similar played much of a role so past reactions to similiar issues (typically over AVX2 I think) seems to have been to tell to "go fix your VM configuration". On the other hand, the runtime detection code in dynamic.c already has an
active check for AVX that probably dates back to GotoBLAS, most likely none of the later contributors (me included) realized the need add the equivalent for later features.

@yuyichao
Copy link
Contributor Author

yuyichao commented Jan 5, 2019

active check for AVX that probably dates back to GotoBLAS

Nope. 735ca38

@martin-frbg
Copy link
Collaborator

Tentative fix in #1949 - took me a while to realize that the implementation of cpuid was not zeroing ecx as required.

@arnavs
Copy link

arnavs commented Jan 7, 2019

@martin-frbg I have the hardware for this and would be happy to test, but am unclear on the process. Let me know if you think that'd help.

@martin-frbg
Copy link
Collaborator

Generally one can download a PR as a conventional diff file (to be applied with the patch command by appending .diff to its URL. I have merged this one now to make testing easier - the initial CI failure on OSX appears to have been caused by "homebrew" no longer providing gcc for OSX 10.11. Interestingly, now that I updated the Travis config to use OSX 10.12, the build test fails with an assembler error in the SkylakeX kernel code that was not seen previously.

@martin-frbg
Copy link
Collaborator

I would still appreciate if one of you could check out the current develop branch and check if the detection problem is solved.

@martin-frbg martin-frbg added this to the 0.3.6 milestone Jan 12, 2019
@mk
Copy link

mk commented Jan 12, 2019

Happy to do this next week. Do you have instructions for what I should best run? I guess another test would be to run Julia in docker against this PR but not sure how hard that is.

@martin-frbg
Copy link
Collaborator

As I understand it, you would probably need to build OpenBLAS on OSX (outside docker) and then replace the stock Julia libopenblas.so in the docker container with it.

@mk
Copy link

mk commented Jan 14, 2019

Unfortunately I'm running into build errors trying to build the develop branch at 21c0f2a. You can see the full build output as well as only the errors on https://gist.github.com/mk/b3ac286cbcbbaae81aeb811fb753224b

Please let me know what else I could try. Does it have to be built on my system or could I also download a binary of openblas to test this?

@martin-frbg
Copy link
Collaborator

That build error ("register %xmm16 is only available with AVX512" despite already compiling with -march=skylake-avx512) looks like it could be LLVM bug 36202 here: http://lists.llvm.org/pipermail/llvm-bugs/2018-February/062338.html - if that is the case you could try adding -mcpu=skx in system.cmake or Makefile.x86_64. A binary built on a different system would probably work as well (but I know next to nothing about OSX).

@mk
Copy link

mk commented Jan 15, 2019

Can you tell me how to add that option?

diff --git i/Makefile.x86_64 w/Makefile.x86_64
index 1b7fe3ef..c88c3057 100644
--- i/Makefile.x86_64
+++ w/Makefile.x86_64
@@ -1,4 +1,4 @@
-# CCOMMON_OPT  += -DFASTCPU
+CCOMMON_OPT    += -mcpu=skx
 
 ifeq ($(OSNAME), SunOS)
 ifdef BINARY64

I tried setting it with the diff above bug then I get:

clang: warning: argument unused during compilation: '-mcpu=skx' [-Wunused-command-line-argument]

@martin-frbg
Copy link
Collaborator

Strange, I copied that suggestion straight from the llvm bug. Indeed the current clang documentation seems to know -mcpu only for MIPS targets, could you try -mavx512vl instead ?

@mk
Copy link

mk commented Jan 15, 2019

Still get the same error error: register %xmm25 is only available with AVX512 unfortunately with this diff:

diff --git i/Makefile.x86_64 w/Makefile.x86_64
index 1b7fe3ef..166926f0 100644
--- i/Makefile.x86_64
+++ w/Makefile.x86_64
@@ -1,4 +1,4 @@
-# CCOMMON_OPT  += -DFASTCPU
+CCOMMON_OPT    += -mavx512vl
 
 ifeq ($(OSNAME), SunOS)
 ifdef BINARY64

@martin-frbg
Copy link
Collaborator

Could be that it wants -mavx512f (for foundation level AVX512) in addition, although I would have expected any sane compiler to enable all such "lesser" features automatically.

@mk
Copy link

mk commented Jan 15, 2019

Same result. I'll try to build it on another OSX machine without Skylake and copy the binary to my system. The CPU feature detection should still work since it's a runtime feature, correct? Will have to continue on this tomorrow.

@mk
Copy link

mk commented Jan 15, 2019

Actually I had still time to give this a try and it seems to work:

root@0317aaf31644:/usr/local# mv /libopenblas_haswellp-r0.3.6.dev.so  /usr/local/julia/bin/../lib/julia/libopenblas64_.so
root@0317aaf31644:/usr/local# julia
WARNING: Error during initialization of module LinearAlgebra:
ErrorException("could not load symbol "openblas_get_config64_":
/usr/local/julia/bin/../lib/julia/libopenblas64_.so: undefined symbol: openblas_get_config64_")
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.1 (2018-09-29)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

Julia>

Not sure about the warnings though.

@martin-frbg
Copy link
Collaborator

That last message looks as if Julia expects OpenBLAS to be built with the INTERFACE64=1 option (and "64" postfix added to all function symbols accordingly) so it actually failed to load the library. (And with "haswell" in the name it was not built for runtime detection a.k.a DYNAMIC_ARCH=1 anyway)

@mk
Copy link

mk commented Jan 16, 2019

I tried building it with the options Julia uses but I still couldn't get it to work.

@yuyichao can you maybe help? Does Julia maybe have a canary build against the develop branch against OpenBLAS or do you have instructions for building OpenBLAS for Julia? I did see the makefile but it seems to be executed in a larger context.

@martin-frbg
Copy link
Collaborator

Unfortunately our CI build on OSX is falling over an "invalid % escape in inline assembly" citing the "%{1to8%} in the SkylakeX DGEMM microkernel ever since I updated its xcode environment, else it would perhaps be possible to make it put the generated library somewhere. (The update became necessary when Homebrew stopped providing gcc builds for the old OSX version, most likely the old xcode was not capable of avx512 at all). I can only build Linux libraries locally, which are probably useless on OSX.

@yuyichao
Copy link
Contributor Author

I'm not sure about all the options to build the openblas to the exact same config used by julia binary.

OTOH, I feel like it'll be easier to test to use a C/fortran repro?

@yuyichao
Copy link
Contributor Author

(And I have no idea about the assembler error or how to make clang happy etc.................)

@martin-frbg
Copy link
Collaborator

martin-frbg commented Jan 18, 2019

I have "fixed" the clang problem by updating the Travis config to use an even more recent version of xcode but have not looked into if/how it is possible to store the output of the CI run.
EDIT: apparently xcode 9.4 is needed to avoid the "invalid % escape" error (not sure yet if and how to catch this at the Makefile level). (I updated the travis config to xcode 10.1 as 32bit compiles hit an unrelated problem with their xcode9.4 image)

@martin-frbg
Copy link
Collaborator

To test this, it is probably sufficient to build OpenBLAS with DYNAMIC_ARCH=1 on OSX (with xcode 9.4 or later according to the CI results), and copy the compiled openblas_utest to the docker container.
If all goes as planned, running the openblas_utest binary (which is statically linked against openblas and requires no input files) should display a warning that it is falling back to Haswell kernels due to lack of AVX512 support (or even to Sandybridge if the docker environment does not provide AVX2 either)

@musm
Copy link

musm commented Apr 2, 2019

FYI several julia users also ran into this bug, including myself.

@martin-frbg
Copy link
Collaborator

@musm can you check if this is actually fixed on the develop branch, as it should be ?

@mk
Copy link

mk commented Apr 15, 2019

@martin-frbg finally coming back to this, sorry for the long silence. I can't build openblas on OS X and run it on docker, that gives me

./openblas_utest 
bash: ./openblas_utest: cannot execute binary file: Exec format error

But that's expected, right?

Compiling and running the utest util on docker, gives me:

root@065da2653635:/OpenBLAS/utest# DYNAMIC_ARCH=1 make
cc -O2 -DUTEST_CHECK -DSANITY_CHECK -DREFNAME=f_ -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DDYNAMIC_ARCH -DNO_LAPACK -DNO_LAPACKE -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=10 -DMAX_PARALLEL_NUMBER=1 -DVERSION=\"0.3.6.dev\" -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I..  -o openblas_utest utest_main.o test_amax.o test_rotmg.o test_axpy.o test_dotu.o test_dsdot.o test_swap.o test_rot.o test_fork.o ../libopenblasp-r0.3.6.dev.a -lm -lpthread 
./openblas_utest
TEST 1/21 amax:samax [OK]
TEST 2/21 drotmg:drotmg_D1_big_D2_big_flag_zero [OK]
TEST 3/21 drotmg:rotmg_D1eqD2_X1eqX2 [OK]
TEST 4/21 drotmg:rotmg_issue1452 [OK]
TEST 5/21 drotmg:rotmg [OK]
TEST 6/21 axpy:caxpy_inc_0 [OK]
TEST 7/21 axpy:saxpy_inc_0 [OK]
TEST 8/21 axpy:zaxpy_inc_0 [OK]
TEST 9/21 axpy:daxpy_inc_0 [OK]
TEST 10/21 zdotu:zdotu_offset_1 [OK]
TEST 11/21 zdotu:zdotu_n_1 [OK]
TEST 12/21 dsdot:dsdot_n_1 [OK]
TEST 13/21 swap:cswap_inc_0 [OK]
TEST 14/21 swap:sswap_inc_0 [OK]
TEST 15/21 swap:zswap_inc_0 [OK]
TEST 16/21 swap:dswap_inc_0 [OK]
TEST 17/21 rot:csrot_inc_0 [OK]
TEST 18/21 rot:srot_inc_0 [OK]
TEST 19/21 rot:zdrot_inc_0 [OK]
TEST 20/21 rot:drot_inc_0 [OK]
TEST 21/21 fork:safety [OK]
RESULTS: 21 tests (21 ok, 0 failed, 0 skipped) ran in 340 ms

@martin-frbg
Copy link
Collaborator

Thanks, so hopefully fixed with the just released 0.3.6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants