performance on AMD Ryzen and Threadripper #1461
Here, I report on a first test with modified cache sizes, based on the cache sizes reported by lstopo. I performed the test on the current development version. Assuming that L1 and L2 have to be reported per core and L3 in total, I modified the relevant lines of getarch.c accordingly; I hope these settings are correctly translated from lstopo. After this, config.h picked up the new values. Eventually, I ran the installation step. After this, config.h had changed again: note that in particular L1_CODE_SIZE was reset to 32KB, and some of the L?_ASSOCIATIVE values changed as well. On the other hand, /usr/local/include/openblas_config.h, which was generated during the installation, has copied the entries of the config.h generated during compilation (i.e. it has the right L1_CODE_SIZE of 64KB). I have not observed any performance improvement from the modification of the cache sizes, but I wonder whether the changes to config.h (L1_CODE_SIZE) during installation may have adverse effects on performance. I will do more tests using the HASWELL target instead of ZEN in the development version and try to find out about loop unrolling.
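For reference, the TR1950X cache layout translates into config.h entries of roughly the following shape (values taken from the CPU specification; the exact entries from the test above were not preserved in this thread, and the associativity lines are omitted):

```c
/* Illustrative config.h cache entries for a TR1950X, following the
 * convention above: L1 and L2 per core, L3 in total. */
#define L1_CODE_SIZE   65536     /* 64 KB L1 instruction cache per core */
#define L1_DATA_SIZE   32768     /* 32 KB L1 data cache per core */
#define L2_SIZE       524288     /* 512 KB unified L2 per core */
#define L3_SIZE     33554432     /* 32 MB L3 in total (8 MB per 4-core CCX) */
```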
I suspect that as you ran the make install step without repeating the TARGET=ZEN argument, config.h was regenerated with the autodetected values.
@martin-frbg You guessed correctly. Including the TARGET=ZEN argument in the installation led to the right entries in config.h after installation, but this did not improve performance.
BTW this was also the case in the original LibGoto - the L2 size was used only to derive the xGEMM_P parameters for Core2, Opteron and earlier; the L1 and L3 sizes are apparently unused. It seems most of the hardware parameters are detected and reported "just in case" now, but perhaps cpu development has stabilized in the sense that a given type id will no longer have variants that differ in L1 or L2 properties.
Somewhere deep in Wikipedia it is said that the Zen architecture (EPYC/Threadripper) changed its cache from write-through to write-back. That may mean that the effective cache size easily halves.
Somehow I doubt that, or at least its relevance for the detected cache sizes. On the other hand, I feel there is a need to find out which of the many parameters in param.h and elsewhere - faithfully copied for new cpu models - are actually used in the current code, and how critical their influence is.
I have been looking a bit more at the suspected performance shortfall on the AMD Threadripper 1950X, its causes, and the consequences of thread oversubscription on the TR1950X.
no. threads | runtime 0.2.19 [s] | runtime 0.2.20 [s] | runtime devel [s] These numbers are for compilation of openblas and my code using gcc/gfortran v. 7.2 with optimisation level -O2 (-O3 reduces the run times by about 1 s). So, up to 4 threads the performance is comparable, and thread-oversubscription does not really seem to play a big role. For a larger number of threads, v. 0.2.19 still sees a decent speedup, but for versions 0.2.20 and devel there is a clear deterioration in performance when opting for a higher number of threads. Note that these examples were run after correcting lapack-netlib/SRC/Makefile of versions 0.2.20 and devel and recompiling (cf. #1425), so I get correct results in all of these tests. Since there are general improvements in speed when selecting a recent openblas version, I would like to get past the problem with thread-oversubscription. However, I would need sequential versions of ZGBTRF and ZGBTRS inside the aforementioned loop and parallelised versions of DSYRK, DPOSV, etc. in other parts of the code. This seems to require compiling sequential and parallel versions of the openblas library, linking to both, and then somehow picking the right (sequential or parallel) version of each required BLAS and LAPACK routine. Now, if openblas came as a Fortran module, this would be a no-brainer, because one could just use the "use ..., only :: ... => ..." mechanism of Fortran to restrict the import of symbols and to rename them. I have been searching the net for possible linker-related solutions that provide similar mechanisms, but to no avail. Do you have a document or webpage with a solution to this at hand?
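One possible workaround, not discussed further in this thread, is to load a sequential and a parallel build of the library as separate shared objects and resolve the needed routines explicitly, so the identical symbol names never clash. A minimal sketch, assuming hypothetical install paths for the two builds:

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical install paths for a sequential and an OpenMP build. */
#define SEQ_LIB "/opt/openblas-seq/lib/libopenblas.so"
#define OMP_LIB "/opt/openblas-omp/lib/libopenblas.so"

/* Fortran-callable ZGBTRF under its usual underscored symbol name. */
typedef void (*zgbtrf_t)(int *m, int *n, int *kl, int *ku,
                         void *ab, int *ldab, int *ipiv, int *info);

int main(void) {
    /* RTLD_LOCAL keeps each library's symbols private to its own handle. */
    void *seq = dlopen(SEQ_LIB, RTLD_NOW | RTLD_LOCAL);
    void *par = dlopen(OMP_LIB, RTLD_NOW | RTLD_LOCAL);
    if (!seq || !par) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* Sequential ZGBTRF for use inside the OpenMP loop; DSYRK, DPOSV etc.
     * would be resolved from the parallel handle in the same way. */
    zgbtrf_t zgbtrf_seq = (zgbtrf_t)dlsym(seq, "zgbtrf_");
    if (!zgbtrf_seq) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    dlclose(par);
    dlclose(seq);
    return 0;
}
```

Link with -ldl; from Fortran, the resolved pointers could be wrapped via ISO_C_BINDING.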
Some thoughts -
Fine, I will run git bisect on Monday.
Thanks. BTW there is a utility function openblas_set_num_threads(), but according to #803 not all parts of the code honor it.
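Where it is honored, usage is straightforward; a minimal sketch (the thread counts are illustrative):

```c
/* Exported by OpenBLAS. */
extern void openblas_set_num_threads(int num_threads);

void banded_solves(void) {
    openblas_set_num_threads(1);   /* sequential BLAS inside the OpenMP loop */
    /* ... ZGBTRF / ZGBTRS calls ... */
    openblas_set_num_threads(16);  /* parallel BLAS for DSYRK, DPOSV, ... */
}
```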
ZGBTRF did not change between LAPACK 3.6.1 and 3.8.0, and is mostly IZAMAX + ZSCAL + ZGERU + ZTRSM + ZGEMM. Nothing I'd immediately identify as having undergone any drastic changes since 0.2.19 (except the post-0.2.20 GEMM thread scheduling mentioned above). Perhaps a good starting point would be 9e4b697 - a few months after 0.2.19, and shortly before both a big LAPACK update (with risk of collateral damage) and a set of thread safety fixes. If all is well up to that point, it should still display the good performance of 0.2.19.
I have finished bisecting between 0.2.19 and 0.2.20 to find the cause of the performance degradation reported in point 3 (see above) when ZGBTRF and ZGBTRS are called from within an OpenMP-parallelised loop. The resulting output is:
87c7d10 is the first bad commit
:040000 040000 9f41e2cd82dc83e84b65d32000d6341cc7e417a8 bcd37e483226009ef28ac179f7268fe419e0b73d M driver
As already reported in point 3 (see above), there was an additional performance degradation between 0.2.20 and the development version. Would you like to have the bisection results on that too, or shall we see whether fixing the problem between 0.2.19 and 0.2.20 removes the additional degradation?
This is bad, as it means we will probably need someone more experienced with thread programming than me to improve this. :-( Given your find, it is entirely possible that the later degradation is from #1299, where I touched the same files again, nine months older but probably no wiser. (A brute-force test could be to drop the 0.2.19 memory.c into current develop and see if this restores the original performance - I think it would still compile despite some intervening changes.)
I have copied driver/others/memory.c from 0.2.19 to the development version and recompiled successfully. This has restored the good performance observed in 0.2.19.
I have now merged a (supposed) fix that uses all the additional locks only in multithreaded builds that do not employ OpenMP. This should restore pre-0.2.20 OpenMP performance without completely reverting the thread safety fixes; hopefully something can be done about their (probable) performance impact on pure pthreads builds in the future. (Though possibly the impact was worst with OpenMP, if the change was adding another layer of locks on top of what OpenMP already imposed.)
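The shape of that fix, as a sketch rather than the literal memory.c diff (the guard condition is as described above; the macro and variable names are simplified):

```c
/* The extra allocation lock is only compiled in for threaded builds that
 * do not use OpenMP; OpenMP builds rely on their existing serialization. */
#if defined(SMP) && !defined(USE_OPENMP)
  LOCK_COMMAND(&alloc_lock);
#endif
  /* ... scan and update the global buffer table ... */
#if defined(SMP) && !defined(USE_OPENMP)
  UNLOCK_COMMAND(&alloc_lock);
#endif
```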
The new version does not deliver correct results when compiled with make TARGET=ZEN USE_OPENMP=1 BINARY=64 COMMON_OPT='-O2 -march=znver1 -mtune=znver1' FC=gfortran and run with more than 1 thread.
Sorry. It seems I added ifdefs around a few locks that were there unconditionally in 0.2.19. Somehow none of the standard tests was able to flag this on my hardware.
Sorry, I still get wrong results.
Hmm, thanks. I had missed one spot near line 1120 that had a blas_unlock from the earlier version still commented out; hopefully this was causing it. Apart from that, there are only cases where I had to move a lock outside an if() block that uses a thread variable in the conditional - I do not see a functional difference but can duplicate the offending code block if necessary.
I have recompiled, but still get wrong results except when only 1 thread is used.
Hopefully, the latter message (a "bad unallocation" error from memory.c) can help to locate the problem.
I have reverted the previous commit for now. While that "bad unallocation" message does originate from memory.c, I do not understand how the revised version of my changes could be causing it.
Unfortunately I cannot reproduce your problem with any of the tests, nor with the software I normally use.
Do you think there are any further tests I could do, or would you recommend just copying driver/others/memory.c from 0.2.19 to the development version to get higher performance (despite the thread safety issues in this older version)?
I am about to upload another PR where absolutely all locking-related changes will be encapsulated in conditionals that leave OpenMP builds unaffected.
With the updated memory.c, the results are fine, but the run-time performance is as degraded as in v. 0.2.20.
Weird. I believe a side-by-side comparison of the old and new memory.c will show that they are now functionally equivalent (with respect to locking) for USE_OPENMP=1. Did you do a full rebuild (starting from a make clean)?
Running the testsuite?
And with NUMA on (aka channel) and 8 threads (OPENBLAS_NUM_THREADS=8 ./testsuite)?
Is your compiler getting the Zen-specific optimisation flags?
My compiler did not get them. Where are they set? What about SMT (it is activated for now), is there any recommendation? I will test ASAP.
I can only refer to Agner Fog's analysis at https://www.agner.org/optimize/microarchitecture.pdf where (as I understand it) he comes to the conclusion that multithreading is more efficient than it was on earlier AMD and Intel designs (with the caveat that inter-thread communication should be kept within the same 4-cpu die if possible). Unfortunately I cannot identify any obvious bottlenecks in the current (Haswell) kernels even after reading his description of the microarchitectures.
And I suspect brada4 was only using "testsuite" as shorthand for your current numpy code. (There are some test codes in the benchmark directory of the distribution, and there is xianyi's BLAS-Tester project that is derived from the ATLAS test codes, but I am not aware of anything actually named testsuite.)
@jcolafrancesco : Many thanks for the tests! @martin-frbg and @brada4 : Sorry for not being in contact after promising tests. First, one of my RAM modules was broken. After that, I had too much teaching. With a reservation, I can confirm @jcolafrancesco 's finding that changing the SWITCH_RATIO parameter has a pretty marginal effect. I tested this 3 weeks ago or so with a freshly downloaded openblas version and one from spring. Here, the multi-threaded performance was worse in the recently downloaded version and basically degraded to the level from before @martin-frbg fixed issue #1468 on February 28. Unless this description raises immediate suspicions about the degraded performance, I can download the latest development version, recheck and bisect. Also, AMD published an updated blis/libflame version on their developer webpage a couple of months ago or so. I have not had the time yet to look at how much this has improved Threadripper support.
@jcolafrancesco : Regarding SMT, I have not seen any performance benefits from it on the TR1950X. Going beyond 16 threads, there was rather a slight performance degradation of a few per cent. Generally, I recommend that you test setting thread affinity in both UMA and NUMA mode to see what works best for your specific work tasks. From what I have seen, it is beneficial to set thread affinity and avoid having your data move through the Infinity Fabric between the two main parts of the CPU to different DIMM modules. For my typical workloads, it is fastest to run 2 or 4 simulations in parallel on 8 or 4 cores, respectively, and just do that in NUMA mode.
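A quick way to verify where threads actually land under a given affinity setting is to report the executing CPU from each OpenMP thread; a minimal check (Linux-specific, compile with -fopenmp):

```c
#define _GNU_SOURCE
#include <sched.h>   /* sched_getcpu() */
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Each thread reports the hardware CPU it runs on, so pinning across
     * dies/NUMA nodes can be checked against the intended layout. */
    #pragma omp parallel
    printf("OpenMP thread %2d running on cpu %2d\n",
           omp_get_thread_num(), sched_getcpu());
    return 0;
}
```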
@tkswe88 memory.c should be basically the same as in #1468 from the end of February (where you confirmed that the performance was back to earlier levels) - since then, there has been a very promising rewrite of the code with the intent of using thread-local storage. Unfortunately this new code had unexpected side effects at first, so I decided to make the "old" version the default again after two consecutive releases that were "broken" for some high-profile codes that indirectly rely on OpenBLAS.
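The core idea of the thread-local-storage approach, reduced to a sketch (names and sizes illustrative, not the actual memory.c code): each thread caches its own buffer, so the common allocation path needs no global lock at all.

```c
#include <stdlib.h>

#define BUFFER_SIZE (32UL << 20)   /* illustrative buffer size */

/* One cached buffer per thread; no other thread can ever touch this
 * pointer, so the fast path is lock-free by construction. */
static __thread void *tls_buffer = NULL;

void *blas_memory_alloc_sketch(void) {
    if (tls_buffer == NULL)
        tls_buffer = malloc(BUFFER_SIZE);   /* first use in this thread */
    return tls_buffer;
}
```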
20k * 20k * 8 bytes * 3 matrices makes 9.6 GB for double precision... Should get better with working RAM.
Yubeic also posted on AMD's website: https://community.amd.com/thread/228619 He is communicating the same results, and someone is comparing them with the theoretical limit he could reach with his Threadripper (canyouseeme, 23 May 2018, 04:35):
The same poster is obtaining 0.36 seconds on dgemm with 4096*4096 matrices, and this is totally on par with tkswe88's results (around 0.3 seconds) and mine. Do you think that canyouseeme's reasoning is correct? In that case it would cast some doubt on yubeic's results. I'm starting to think that we should just stop taking those results into account.
canyouseeme is also reporting (22 May 2018, 03:04):
I've made some tests with a new C code directly using the cblas API:
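The test code itself was not preserved in this excerpt; a minimal timing harness of the kind described, using the standard cblas_dgemm interface on the 4096x4096 case discussed above, might look like:

```c
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

int main(void) {
    double *A = malloc((size_t)N * N * sizeof(double));
    double *B = malloc((size_t)N * N * sizeof(double));
    double *C = malloc((size_t)N * N * sizeof(double));
    for (size_t i = 0; i < (size_t)N * N; i++) { A[i] = 1.0; B[i] = 0.5; C[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* dgemm performs 2*N^3 floating-point operations. */
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("dgemm %d x %d: %.3f s, %.1f GFLOPS\n",
           N, N, sec, 2.0 * N * N * (double)N / sec * 1e-9);
    free(A); free(B); free(C);
    return 0;
}
```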
I'm not observing any significant change.
@martin-frbg What specific preprocessor flag is required during compilation of memory.c to restore the optimal performance?
@tkswe88 there should be no specific flag necessary - the active code in memory.c should be in the same state as on February 28, except that the file is now approximately twice as big, as it also contains a reimplementation that uses thread-local storage (available by setting USE_TLS=1).
What's the status of this ticket, and is there a way to help it move forward?
Needs testers, especially for Threadripper, and generally someone who knows how to optimize code for Zen. I just bought a Ryzen 2700 and hope to be able to do at least some quick and simple tests in the next few days. (Top of my list is to see how the recent improvements for Haswell map to Ryzen, e.g. optimal settings of SWITCH_RATIO and GEMM_PREFERED_SIZE in param.h.)
Not much experience optimizing code, but I do have access to a TR1950X and can regularly run tests on it. Could you describe in detail what kind of tests you would need?
I am not that sure myself, but for a start it would probably help to run the dgemm and sgemm benchmarks to see if increasing the SWITCH_RATIO for ZEN in param.h from the default of 4 has a similar impact on performance as it did for HASWELL, where it is now 32 - my very first tests suggest that 16 may be optimal for the 8-core Ryzen, but 32 did not appear to be much worse. A second task would be to see if fenrus75's recent improvements in the Haswell xGEMM_BETA and xGEMM_ONCOPY routines (see the entries referencing skylake files in kernel/x86_64/KERNEL.HASWELL) are applicable to the Zen architecture despite its allegedly inferior AVX2 hardware.
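Concretely, the first experiment amounts to editing the ZEN section of param.h and rebuilding; a sketch with the values under discussion (16 as a starting point, not a tuned result):

```c
/* In the ZEN section of param.h -- experimental values, not tuned results. */
#undef  SWITCH_RATIO
#define SWITCH_RATIO        16   /* default for ZEN was 4; HASWELL now uses 32 */
#undef  GEMM_PREFERED_SIZE
#define GEMM_PREFERED_SIZE  16   /* spelling as in the OpenBLAS source */
```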
There is a new best practice guide from the PRACE project for the AMD EPYC architecture at http://www.prace-ri.eu/IMG/pdf/Best-Practice-Guide-AMD.pdf
That's more about configuring a node in a compute cluster. AMD has not made 3dnow processors in decades...
@martin-frbg and @brada4 I see adverse performance effects on the TR1950X when using the memory.c file provided in the current development version (0.3.6) and running on more than 8 cores. It does not seem to matter whether I define USE_TLS or not. I have also tried commenting out lines 1045-1047 in Makefile.system. When replacing memory.c with the one from #1468, I get good performance again.
Can you tell if 0.3.5 worked for you? (Some earlier changes that might have affected locking were inadvertently removed in 0.3.5 and put back in develop.) From your reference to #1468 I assume you are compiling with OpenMP.
Actually #1468 was among the changes that were accidentally dropped in the creation of the "Frankenstein" version of memory.c that combined both the TLS and non-TLS versions - this is supposed to have been rectified by #2004 three weeks ago, but there may be an interaction with intervening changes from #1785 and #1814.
@martin-frbg I followed #2040 to https://raw.githubusercontent.com/xianyi/OpenBLAS/af480b02a4a45df377acf9be0d6078609bb345c2/driver/others/memory.c, copied this version of memory.c into a snapshot of the openblas development version from Friday and recompiled. This has restored the performance to where I would like it to be. Many thanks for your help and effort! About a year ago or so, I presented a first comparison of openblas and blis/flame performance. Since then, blis and libflame have caught up quite a bit and seem to have become thread-safe. In my standard code, I can use either openblas or blis/flame with OpenMP parallelisation. openblas tends to be around 1-4% faster on small and moderately sized workloads. However, when pure DGEMM performance is required and the matrices become rather big, openblas seems to be the clear winner at the moment. See below for a comparison with 1 and 16 threads. The results were obtained with the openblas development and blis versions downloaded last Friday, after compilation using gcc 9.0 on Ubuntu Linux 18.04 (4.15 kernel).
Thank you very much for the confirmation, merging that PR now.
This report comes right from #1425, where the discussion drifted off from thread safety in openblas v. 0.2.20 to performance on AMD Ryzen and Threadripper processors (in this particular case a TR1950X). It seems worthwhile to discuss this in a separate thread. Until now we had the following discussion:
@tkswe88 : After plenty of initial tests with the AMD TR1950X processor, it seems that openblas (tested 0.2.19, 0.2.20 and the development version on Ubuntu 17.10 with kernel 4.13, gcc and gfortran v. 7.2) runs roughly 20% slower on the TR1950X than on an i7-4770 (4 cores) when using 1, 2 or 4 threads. This is somewhat surprising, given that both CPUs support at most AVX2 and thus should be comparable in terms of vectorisation potential. I have already adjusted the OpenMP thread affinity to rule out that the (hidden) NUMA architecture of the TR1950X causes the lower performance. Other measures I took were 1) a BIOS upgrade, 2) a Linux kernel upgrade to 4.15, and 3) an increase of the DIMM frequency from 2133 to 2666 MHz. Except for the latter, which gave a speedup of roughly 3%, these measures did not have any effect on execution speed. Do you have any idea where the degraded performance on the TR1950X comes from? Is this related to a current lack of optimization in openblas, or do we just have to wait for the next major release of gcc/gfortran to fix the problem? Of course, I would be willing to run tests if this was of help in developing openblas.
@brada4 : AMD has slightly slower AVX and AVX2 units per core; by no means slow in general, and it still has a heap of cores to spare. Sometimes optimal AVX2 saturation means the whole CPU package drops to its base, i.e. non-turbo, frequency.
@martin-frbg: Could also be that getarch is mis-detecting the cache sizes on TR, or the various hardcoded block sizes from param.h for loop unrolling are "more wrong" on TR than they were on the smaller Ryzen. Overall support for the Ryzen architecture is currently limited to treating it like Haswell, see #1133, #1147. There may be other assembly instructions besides AVX2 that are slower on Ryzen (#1147 mentions movntp*).
@tkswe88: Are there any tests I could do to find out about cache size detection errors or more appropriate settings for loop unrolling?
@brada4 : What you asked martin - the copied parameters may need to be doubled or halved, at least here: https://github.com/xianyi/OpenBLAS/pull/1133/files#diff-7a3ef0fabb9c6c40aac5ae459c3565f0
@martin-frbg: You could simply check the autodetected information in config.h against the specification. (As far as I can determine, L3 size seems to be ignored as a parameter).
As far as appropriate settings go, the benchmark directory contains a few tests that can be used for timing individual functions. Adjusting the values in param.h (see https://github.com/xianyi/OpenBLAS/pull/1157/files) is a bit of a black art though.