
Test and tune for Zen 2 #2180

Open
TiborGY opened this issue Jul 8, 2019 · 47 comments

@TiborGY (Contributor) commented Jul 8, 2019

Zen 2 is now released, bringing a number of improvements to the table.
Most notably, it now has 256-bit-wide AVX units. This should in theory allow performance parity with Haswell through Coffee Lake CPUs, and initial results suggest this is true (at least for a single thread).
https://i.imgur.com/sFhxPrW.png

The chips also have double the L3 cache and a generally reworked cache hierarchy. One thing to note is that these chips do not have enough TLB entries to cover all of L2 and L3, so huge pages might be a little more important.

I might be able to get my hands on a Zen 2 system in ~1-2 months.

@brada4 (Contributor) commented Jul 8, 2019

It is not L3 cache per core or NUMA domain, it is per socket, something like 1-2 MB per core in place of Haswell's 2.5 MB.
The L1d, smaller than Zen 1's, actually matches that of Haswell.
Probably neither is accounted for yet, even for Zen 1; there have just been some lengthy discussions about how to work around BIOSes with broken NUMA support.

@TiborGY (Contributor, Author) commented Jul 8, 2019

It is not L3 cache per core or NUMA domain, it is per socket, something like 1-2 MB per core in place of Haswell's 2.5 MB.

I have no idea what you are talking about. The 3700X has 8 cores and a total of 32 MiB of L3. Internally, each cluster of 4 cores shares its L3, so it is more like 2x16 MiB of L3. That still works out to 4 MiB of L3 per core. I have no idea where you are getting the 1-2 MiB from.

L3 cache is not shared between the 4-core core complexes (CCXs), not even within the same die.

@wjc404 (Contributor) commented Jul 14, 2019

@TiborGY I also found that kernel tuning is required for Zen 2. I tested single-threaded DGEMM performance of OpenBLAS (TARGET=HASWELL) on a Ryzen 7 3700X at a fixed 3.0 GHz clock and got ~33 GFLOPS, far behind the theoretical maximum (48 GFLOPS at 3.0 GHz). By the way, I also tested my own DGEMM subroutine and got ~44 GFLOPS.
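(For reference, the theoretical maximum follows from Zen 2's two 256-bit FMA pipes per core: 3.0 GHz x 2 FMAs/cycle x 4 doubles per 256-bit vector x 2 flops per FMA = 48 GFLOPS.)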

@wjc404 (Contributor) commented Jul 14, 2019

AIDA64_Cache_Mem_Test_R7-3700X
The L3 of the R7 3700X is fast, but the memory latency is still a problem.
I think the enlarged L3 allows larger blocks of matrix B to be packed, reducing the bandwidth required for accessing matrices A and C and mitigating the slow memory access.

@wjc404 (Contributor) commented Jul 14, 2019

I read the code of OpenBLAS's Haswell DGEMM kernel and found that the 2 most common FP instructions are vfmadd231pd and (chained) vpermpd. I roughly tested the latency of vfmadd231pd and vpermpd on the i9-9900K and R7 3700X and found that vfmadd231pd has a latency of 5 cycles on both CPUs; for vpermpd, however, the latency on the R7 3700X (6 cycles) is double that on the 9900K (3 cycles). I suspect the performance problem on Zen 2 results from the vpermpd instructions.
test_program.tar.gz
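
For anyone who does not want to unpack the archive, the usual way to measure this (a minimal sketch of my own, not necessarily what the attached program does) is to time a long chain of dependent instructions for latency and a stream of independent ones for throughput, e.g. in AT&T syntax:

        # latency_sketch.S -- build with: gcc -c latency_sketch.S
        # Call latency_loop(iters) / throughput_loop(iters) from C and divide
        # the elapsed core cycles by (4 * iters) to get latency resp. 1/IPC.
        .text
        .globl  latency_loop
latency_loop:                             # %rdi = iteration count
1:      vpermpd $0xb1, %ymm0, %ymm0       # each result feeds the next -> latency chain
        vpermpd $0xb1, %ymm0, %ymm0
        vpermpd $0xb1, %ymm0, %ymm0
        vpermpd $0xb1, %ymm0, %ymm0
        dec     %rdi
        jnz     1b
        ret

        .globl  throughput_loop
throughput_loop:                          # independent destinations -> instructions overlap
1:      vpermpd $0xb1, %ymm0, %ymm1
        vpermpd $0xb1, %ymm0, %ymm2
        vpermpd $0xb1, %ymm0, %ymm3
        vpermpd $0xb1, %ymm0, %ymm4
        dec     %rdi
        jnz     1b
        ret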

@martin-frbg (Collaborator) commented:

Interesting observation. I now see this doubling of latency for vpermpd mentioned in Agner Fog's https://www.agner.org/optimize/instruction_tables.pdf for Zen, so this apparently still applies to Zen 2 as well (and it is obviously just as relevant for the old issue #1461).

@TiborGY (Contributor, Author) commented Jul 15, 2019

The L3 of the R7 3700X is fast, but the memory latency is still a problem.

The reason why your memory latency is sky-high is your memory clock. 2133 MHz is a huge performance nerf for Ryzen CPUs, because the internal bus that connects the cores to the memory controller (and to each other) runs at half the memory clock (this bus is conceptually similar to Intel's mesh/uncore clock).

102 ns is crazy high, even for Ryzen. IMO 2400 MHz should be the bare minimum anyone uses, and even that only because ECC UDIMMs are kinda hard to find above that speed. If someone is not using ECC, 2666 MHz or, better, 3000 MHz is very much recommended. You can easily shave 20 ns off the figure you measured.

@wjc404 (Contributor) commented Jul 15, 2019

I removed the vperm instructions in the macros "KERNEL4x12_M1", "KERNEL4x12_M2", "KERNEL4x12_E" and "KERNEL4x12_SUB" of the file "dgemm_kernel_4x8_haswell.S", recompiled OpenBLAS, and found a 1/4 speedup in a subsequent DGEMM test (of course the numerical results were no longer meaningful), which shows that the performance penalty comes from vpermpd.
Screenshot from 2019-07-15 11-37-29

(Test on the R7 3700X, 1 thread, 3.6 GHz.)

On the R5 1600 the performance degradation is not significant (OpenBLAS with the zen target gave 27 GFLOPS against a theoretical single-thread maximum of 29 GFLOPS), probably because the halved FMA throughput on Zen 1 hides the latency of vpermpd.

@wjc404 (Contributor) commented Jul 15, 2019

I also tested the latencies of some other AVX instructions on the R7 3700X, in a way similar to my previous test of vpermpd. The results are as follows:

Instruction   vblendpd   vperm2f128   vshufpd
Latency       1 cycle    3 cycles     1 cycle

The expensive vpermpd can be replaced by a proper combination of the 3 instructions tested above (vblendpd and vshufpd should also be cheaper on common Intel CPUs).
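
As an illustration of that kind of substitution (my own example with arbitrary immediates, not code taken from the kernel), a full 4-element reversal done by a single cross-lane vpermpd can be built from a lane swap plus an in-lane swap:

        # vpermpd $0x1b, %ymm0, %ymm1            gives [a3,a2,a1,a0] but costs 6 cycles on Zen 2
        vperm2f128 $0x01, %ymm0, %ymm0, %ymm1  # swap the 128-bit halves:  [a2,a3,a0,a1]
        vshufpd    $0x05, %ymm1, %ymm1, %ymm1  # swap within each half:    [a3,a2,a1,a0]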

@brada4 (Contributor) commented Jul 15, 2019

but the memory latency is still a problem

Are you serious? Memory running at X GHz serves only that many words per second; there is no shortcut. (Well, there is one, called cache.)

@wjc404 (Contributor) commented Jul 15, 2019

I changed 8 vpermpd instructions to vshufpd in the first 4 "KERNEL4x12_*" macros in the file "dgemm_kernel_4x8_haswell.S" and got a 1/4 speedup while maintaining correct results.
dgemm_kernel_4x8_haswell.S.txt
Screenshot from 2019-07-15 13-37-34
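
For reference, the kind of one-for-one replacement that works here (illustrative immediate; the ones actually used in the kernel may differ): an in-lane element swap does not need the cross-lane permute unit at all:

        # vpermpd $0xb1, %ymm0, %ymm1            gives [a1,a0,a3,a2], 6-cycle latency on Zen 2
        vshufpd $0x05, %ymm0, %ymm0, %ymm1     # same result, 1-cycle latency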

@wjc404 (Contributor) commented Jul 15, 2019

I then modified the macro "SAVE4x12" in a similar way and got a 0.3% performance improvement. Now the performance is about 9/10 of the theoretical maximum.
dgemm_kernel_4x8_haswell.S.txt
Screenshot from 2019-07-15 14-36-52

@wjc404 (Contributor) commented Jul 15, 2019

Test of more AVX(2) instructions on packed doubles on the R7 3700X (1 thread at 3.6 GHz):
Screenshot from 2019-07-15 23-52-10
test_of_common_avx2_instructions.zip
Instruction     IPC    Latency
vpermpd         0.8    6 cycles
vblendpd        2.0    1 cycle
vperm2f128      1.0    3 cycles
vshufpd         2.0    1 cycle
vfmadd231pd     2.0    5 cycles
vaddpd          2.0    3 cycles
vmulpd          2.0    3 cycles
vhaddpd         0.5    6-7 cycles

A similar test on the i9-9900K (1 thread, 4.4 GHz); the chained vfmadd231pd test ran endlessly, so it was removed from this run (luckily I had measured it previously with different code):
Screenshot from 2019-07-16 15-38-59
Instruction     IPC    Latency
vpermpd         1.0    3 cycles
vblendpd        3.0    1 cycle
vperm2f128      1.0    3 cycles
vshufpd         1.0    1 cycle
vfmadd231pd     2.0    ~5 cycles (previous test)
vaddpd          2.0    4 cycles
vmulpd          2.0    4 cycles
vhaddpd         0.5    6 cycles

@wjc404 (Contributor) commented Jul 15, 2019

I also found that alternating vaddpd and vmulpd in the test code can reach a total IPC of 4 on Zen 2, whereas it is only 2 on the i9-9900K.
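
A sketch of what such a test loop can look like (my own illustration, not the attached code): with enough independent accumulators to cover the 3-cycle latencies, the adds and multiplies can presumably issue down different pipes and all four FP pipes stay busy:

        .globl  addmul_loop
addmul_loop:                              # %rdi = iteration count, %ymm12 = some constant
1:      vaddpd  %ymm12, %ymm0,  %ymm0     # 6 independent add chains ...
        vmulpd  %ymm12, %ymm1,  %ymm1     # ... interleaved with 6 independent mul chains
        vaddpd  %ymm12, %ymm2,  %ymm2
        vmulpd  %ymm12, %ymm3,  %ymm3
        vaddpd  %ymm12, %ymm4,  %ymm4
        vmulpd  %ymm12, %ymm5,  %ymm5
        vaddpd  %ymm12, %ymm6,  %ymm6
        vmulpd  %ymm12, %ymm7,  %ymm7
        vaddpd  %ymm12, %ymm8,  %ymm8
        vmulpd  %ymm12, %ymm9,  %ymm9
        vaddpd  %ymm12, %ymm10, %ymm10
        vmulpd  %ymm12, %ymm11, %ymm11
        dec     %rdi
        jnz     1b
        ret

Twelve independent chains per iteration means that at IPC 4 each chain is only revisited every 3 cycles, which matches the 3-cycle add/mul latency, so nothing stalls.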

@wjc404 (Contributor) commented Jul 16, 2019

A simple test of AVX load and store instructions on packed doubles on the R7 3700X (3.6 GHz, 1 thread):
Screenshot from 2019-07-16 14-25-08
test_load_store_avx_doubles.zip

Instruction (AT&T)        Max IPC
vmovapd mem,reg           2
vmovupd mem,reg           2
vmaskmovpd mem,reg,reg    2
vbroadcastsd mem,reg      2
vmovapd reg,mem           1
vmovupd reg,mem           1
vmaskmovpd reg,reg,mem    1/6

@wjc404 (Contributor) commented Jul 16, 2019

The same load/store test on i9-9900K (4.4 GHz, 1 thread)
Screenshot from 2019-07-16 15-16-34

It shared the same maximum IPCs as the R7 3700X, except for "vmaskmovpd reg,reg,mem" (IPC = 1).

@wjc404 (Contributor) commented Jul 16, 2019

Unlike vpermpd, vpermilpd shares the same latency and IPC as vshufpd on the R7 3700X, so it can also replace vpermpd in some cases.

@wjc404 (Contributor) commented Jul 30, 2019

Data sharing between CCXs is still problematic.
Synchronization latency of shared data between cores, tested on the R7 3700X (3.6 GHz):
Screenshot from 2019-07-31 07-05-20

Here's the code:
INTER-CORE LATENCY.zip

the same test on i9-9900K:
Screenshot from 2019-07-31 07-11-53
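
For anyone who prefers not to download the archive, the usual shape of such a test (a minimal sketch of my own, not the attached code) is a ping-pong of one cache line between two pinned threads:

// pingpong.c -- minimal core-to-core latency sketch, not the attached benchmark.
// Two threads pinned to chosen cores bounce a flag through one shared cache line.
// Build: gcc -O2 -pthread pingpong.c -o pingpong
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000
static _Atomic int flag __attribute__((aligned(64)));   // keep the flag on its own cache line

static void pin(int cpu) {
    cpu_set_t s;
    CPU_ZERO(&s);
    CPU_SET(cpu, &s);
    pthread_setaffinity_np(pthread_self(), sizeof(s), &s);
}

static void *responder(void *arg) {
    pin(*(int *)arg);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1) ;   // wait for ping
        atomic_store_explicit(&flag, 2, memory_order_release);             // send pong
    }
    return NULL;
}

int main(void) {
    int ping_cpu = 0, pong_cpu = 4;   // pick two cores in the same or in different CCXs
    pthread_t t;
    pthread_create(&t, NULL, responder, &pong_cpu);
    pin(ping_cpu);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);             // send ping
        while (atomic_load_explicit(&flag, memory_order_acquire) != 2) ;   // wait for pong
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("round-trip latency: %.1f ns\n", ns / ROUNDS);
    return 0;
}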

@wjc404 (Contributor) commented Jul 31, 2019

Synchronization bandwidth of shared data between cores, tested on the R7 3700X (3.6 GHz):
Screenshot from 2019-07-31 12-50-34

the same test on i9-9900K:
Screenshot from 2019-07-31 12-40-03

codes:
INTER-CORE BANDWIDTH.zip

@brada4 (Contributor) commented Jul 31, 2019

AMD looks like 4-core clusters?
Does this show up anywhere in the NUMA tables?

@TiborGY (Contributor, Author) commented Jul 31, 2019

AMD looks like 4-core clusters?
Does this show up anywhere in the NUMA tables?

It is accurately shown by lstopo: the L3 cache is not shared between CCXs. But it is shown as a single NUMA node, since memory access is uniform for all cores, so technically it is not a NUMA setup.
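
For anyone wanting to verify this on their own box, hwloc's text-mode output shows it directly (each CCX appears as its own L3 group of four cores, all under a single NUMA node):

$ lstopo-no-graphics --no-io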

@TiborGY (Contributor, Author) commented Jul 31, 2019

@wjc404 What sort of fabric clock (FCLK) are you running? The inter-core bandwidth between the CCXs is probably largely affected by FCLK.

@brada4 (Contributor) commented Jul 31, 2019

Well, not exposed, but 3x faster...
It is quite important that the same data does not get dragged around the outer cache without need. There is essentially no software exposure; probably the only option is to hack affinity so that all threads stay in the same space with a shared L3.

@wjc404 (Contributor) commented Jul 31, 2019

Sorry, I don't know where to read the FCLK frequency. It should be the default one for a 3.6 GHz CPU clock.

@martin-frbg (Collaborator) commented Jul 31, 2019

I believe AMD put in some effort to make the Linux and Windows10 schedulers aware of the special topology. OpenBLAS itself probably has little chance to create a "useful" default affinity map on its own without knowing the "bigger picture" of what kind of code it was called from and what the overall system utilization is.
Perhaps a wiki page collecting links to Ryzen "best practices" whitepapers like
https://www.suse.com/documentation/suse-best-practices/singlehtml/optimizing-linux-for-amd-epyc-with-sle-12-sp3/optimizing-linux-for-amd-epyc-with-sle-12-sp3.html#sec.memory_cpu_binding or the PRACE document linked in #1461 (comment) might be useful.

(I think FCLK is proportional to the clock speed of the RAM installed in a particular system, so it could be that the DDR4-2133 memory shown on your AIDA screenshot leads to less-than-optimal performance of the interconnect.)
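
Until the library can do this itself, the practical workaround from those documents is to keep an OpenBLAS run inside one CCX by hand, for example (illustrative command; the binary name and core numbering depend on the system):

$ OPENBLAS_NUM_THREADS=4 taskset -c 0-3 ./my_benchmark   # 4 threads pinned to one CCX, one shared L3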

@TiborGY (Contributor, Author) commented Jul 31, 2019 via email

@wjc404 (Contributor) commented Jul 31, 2019

@TiborGY Thanks for your guidance. The FCLK frequency setting in my BIOS is AUTO.
On Windows 10 the Ryzen Master utility shows a fabric clock of 1200 MHz.

@martin-frbg (Collaborator) commented:

So by replacing your memory with DDR4-3600 you could increase FCLK to 1800 MHz, which would make the cross-CCX transfers look less ugly (though at an added cost of something like $150 per 16 GB).

@TiborGY (Contributor, Author) commented Jul 31, 2019 via email

@brada4 (Contributor) commented Jul 31, 2019

It is HyperTransport (Intel's rough equivalent is QPI), though I have no idea how the modern version handles clocking, power saving, etc.

@TiborGY (Contributor, Author) commented Jul 31, 2019

It is HyperTransport (Intel's rough equivalent is QPI), though I have no idea how the modern version handles clocking, power saving, etc.

Not anymore. It was HyperTransport before Zen; the official marketing name for the current fabric is "Infinity Fabric".

@brada4 (Contributor) commented Jul 31, 2019

It is not userspace-programmable. If the scheduler knows about it, we might be able to just group threads into cluster-sized groups accessing the same pieces of memory, avoiding L3-to-L3 copies.

It claims roughly 40 GB/s; duplex? half each way? Are you at the optimum already?

@marxin (Contributor) commented Feb 10, 2020

test_of_common_avx2_instructions.zip
Instruction     IPC    Latency
vpermpd         0.8    6 cycles
vblendpd        2.0    1 cycle
vperm2f128      1.0    3 cycles
vshufpd         2.0    1 cycle
vfmadd231pd     2.0    5 cycles
vaddpd          2.0    3 cycles
vmulpd          2.0    3 cycles
vhaddpd         0.5    6-7 cycles

Hello.

Thank you very much for the measurement script. I modified it a bit and pushed it here:
https://github.com/marxin/instruction-tester

For a znver1 CPU I got different numbers (model name: AMD Ryzen 7 2700X Eight-Core Processor):

make test
gcc -march=haswell testinst.S testinst.c -o testinst
./testinst
CPU frequency: 4.30 GHz
GOPs per second for vpermpd indep. instructions: 2.137337e+00, rec. throughput: 2.01
GOPs per second for vpermpd chained instructions: 2.150827e+00, latency: 2.00

GOPs per second for vpermilpd indep. instructions: 4.301699e+00, rec. throughput: 1.00
GOPs per second for vpermilpd chained instructions: 4.296690e+00, latency: 1.00

GOPs per second for vblendpd indep. instructions: 4.298875e+00, rec. throughput: 1.00
GOPs per second for vblendpd chained instructions: 4.301755e+00, latency: 1.00

GOPs per second for vperm2f128 indep. instructions: 1.435560e+00, rec. throughput: 3.00
GOPs per second for vperm2f128 chained instructions: 1.439942e+00, latency: 2.99

GOPs per second for vshufpd indep. instructions: 4.296961e+00, rec. throughput: 1.00
GOPs per second for vshufpd chained instructions: 4.296540e+00, latency: 1.00

GOPs per second for vfmadd231pd indep. instructions: 4.296248e+00, rec. throughput: 1.00
GOPs per second for vfmadd231pd chained instructions: 8.651844e-01, latency: 4.97

GOPs per second for vaddpd indep. instructions: 4.286476e+00, rec. throughput: 1.00
GOPs per second for vaddpd chained instructions: 1.443964e+00, latency: 2.98

GOPs per second for vmulpd indep. instructions: 4.304053e+00, rec. throughput: 1.00
GOPs per second for vmulpd chained instructions: 1.086745e+00, latency: 3.96

GOPs per second for vhaddpd indep. instructions: 1.433505e+00, rec. throughput: 3.00
GOPs per second for vhaddpd chained instructions: 6.227662e-01, latency: 6.90

I verified the numbers against Agner Fog's "Instruction tables" manual and we got the same numbers.
I'm also sending numbers for znver2 (model name: AMD EPYC 7702 64-Core Processor):

$ make test
gcc -march=haswell testinst.S testinst.c -o testinst
./testinst
CPU frequency: 3.35 GHz
GOPs per second for vpermpd indep. instructions: 2.582493e+00, rec. throughput: 1.30
GOPs per second for vpermpd chained instructions: 5.568692e-01, latency: 6.02

GOPs per second for vpermilpd indep. instructions: 6.679462e+00, rec. throughput: 0.50
GOPs per second for vpermilpd chained instructions: 3.340770e+00, latency: 1.00

GOPs per second for vblendpd indep. instructions: 6.682278e+00, rec. throughput: 0.50
GOPs per second for vblendpd chained instructions: 3.338153e+00, latency: 1.00

GOPs per second for vperm2f128 indep. instructions: 3.339144e+00, rec. throughput: 1.00
GOPs per second for vperm2f128 chained instructions: 1.113484e+00, latency: 3.01

GOPs per second for vshufpd indep. instructions: 6.679295e+00, rec. throughput: 0.50
GOPs per second for vshufpd chained instructions: 3.338552e+00, latency: 1.00

GOPs per second for vfmadd231pd indep. instructions: 6.677935e+00, rec. throughput: 0.50
GOPs per second for vfmadd231pd chained instructions: 6.681326e-01, latency: 5.01

GOPs per second for vaddpd indep. instructions: 6.679347e+00, rec. throughput: 0.50
GOPs per second for vaddpd chained instructions: 1.113059e+00, latency: 3.01

GOPs per second for vmulpd indep. instructions: 6.681665e+00, rec. throughput: 0.50
GOPs per second for vmulpd chained instructions: 1.113511e+00, latency: 3.01

GOPs per second for vhaddpd indep. instructions: 1.670478e+00, rec. throughput: 2.01
GOPs per second for vhaddpd chained instructions: 5.135085e-01, latency: 6.52

@marxin (Contributor) commented Feb 10, 2020

I then modified the macro "SAVE4x12" in a similar way and got a 0.3% performance improvement. Now the performance is about 9/10 of the theoretical maximum.
dgemm_kernel_4x8_haswell.S.txt

Hey.
Can you please share the benchmark so that I can test it on my machines? ;)
Thanks.

@marxin (Contributor) commented Feb 13, 2020

Hey.

I've just prepared a comparison on one znver1 and one znver2 machine for all releases from 0.3.3 to 0.3.8. I've used the following script:
https://github.com/marxin/BLAS-Tester/blob/benchmark-script/test-all.py

which runs BLAS-Tester binaries with the following arguments:

$ ./test-all.py
1/12: taskset 0x1 ./bin/xsl1blastst -R all -N 67108864 67108864 1 -X 5 1 1 1 1 1
2/12: taskset 0x1 ./bin/xdl1blastst -R all -N 67108864 67108864 1 -X 5 1 1 1 1 1
3/12: taskset 0x1 ./bin/xcl1blastst -R all -N 67108864 67108864 1 -X 5 1 1 1 1 1
4/12: taskset 0x1 ./bin/xzl1blastst -R all -N 33554432 33554432 1 -X 5 1 1 1 1 1
5/12: taskset 0x1 ./bin/xsl2blastst -R all -N 8192 8192 1 -X 5 1 1 1 1 1
6/12: taskset 0x1 ./bin/xdl2blastst -R all -N 8192 8192 1 -X 5 1 1 1 1 1
7/12: taskset 0x1 ./bin/xcl2blastst -R all -N 8192 8192 1 -X 5 1 1 1 1 1
8/12: taskset 0x1 ./bin/xzl2blastst -R all -N 4096 4096 1 -X 5 1 1 1 1 1
9/12: taskset 0x1 ./bin/xsl3blastst -R all -N 2048 2048 1 -a 5 1 1 1 1 1
10/12: taskset 0x1 ./bin/xdl3blastst -R all -N 2048 2048 1 -a 5 1 1 1 1 1
11/12: taskset 0x1 ./bin/xcl3blastst -R all -N 1024 1024 1 -a 5 1 1 1 1 1 1 1 1 1 1
12/12: taskset 0x1 ./bin/xzl3blastst -R all -N 1024 1024 1 -a 5 1 1 1 1 1 1 1 1 1 1

all numbers are collected here:
https://docs.google.com/spreadsheets/d/1Xb3HWbsEuMeMf1mfRPP1AdnQTYxGU-7Rmm-khzMxz98/edit#gid=228273818 (the spreadsheet contains 3 sheets).

Based on the numbers I was able to identify the following problems:

  1. I found a typo in isamax; the speed will be restored once "Fix iamax sse implementation and add utests" (#2414) is merged
  2. there is a speed drop of ~5% for GEMM, SYMM, SYR2K, SYRK, TRMM after 92b1021 ("Optimize AVX2 SGEMM & STRMM", #2361, @wjc404); I also verified that locally on my AMD Ryzen 7 2700X machine
  3. there is a speed drop for both znver1 and znver2 after 28e9645 ("Replace vpermpd with vpermilpd in the Haswell/Zen zdot microkernel", #2190, @wjc404); the patch was supposed to speed it up; I can confirm vpermilpd has lower latency (and higher throughput), but it is slower for some reason in the benchmark

I'm going to bisect other performance issues.
Feel free to comment on the selected benchmarking workloads.

@wjc404 (Contributor) commented Feb 13, 2020

@marxin I did most of the SGEMM and DGEMM benchmarks with the 2 programs "sgemmtest_new" and "dgemmtest_new" in my repository GEMM_AVX2_FMA3. When using them on Zen processors, please set the environment variable MKL_DEBUG_CPU_TYPE to 5.
For benchmarking level-3 subroutines, monitoring the CPU frequency is recommended (if it has not been done before), as thermal throttling can affect results.
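
Concretely, that amounts to something like the following before a run (the benchmark invocation is just a placeholder):

$ export MKL_DEBUG_CPU_TYPE=5                  # let MKL take its AVX2 path on AMD CPUs
$ watch -n 1 "grep 'cpu MHz' /proc/cpuinfo"    # in another terminal, watch for thermal throttling
$ ./dgemmtest_new ...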

@marxin (Contributor) commented Feb 14, 2020

@marxin I did most of the SGEMM and DGEMM benchmarks with the 2 programs "sgemmtest_new" and "dgemmtest_new" in my repository GEMM_AVX2_FMA3. When using them on Zen processors, please set the environment variable MKL_DEBUG_CPU_TYPE to 5.

OK, I see the program depends on an MKL header file (and needs to be linked against MKL).
Can you please make the code more portable? It would be great to have it as part of this repository or BLAS-Tester; could you please do that?

For benchmarking level-3 subroutines, monitoring the CPU frequency is recommended (if it has not been done before), as thermal throttling can affect results.

Sure. One difference is that you probably use OpenMP with multiple threads, am I right?
Can you please re-test the numbers with the corresponding GEMM test in BLAS-Tester?

@martin-frbg (Collaborator) commented:

@marxin couldn't you use the provided binaries from wjc404's repo (which also have MKL statically linked)? And I seem to recall performance figures were obtained for both single and multiple threads.

@wjc404 (Contributor) commented Feb 14, 2020

@marxin If you have confirmed a significant performance drop of SGEMM (especially in serial execution with dimensions > 4000) on zen/zen+ chips after PR #2361, then you can try to specify different SGEMM kernels for zen and zen2 (probably by editing "KERNEL.ZEN" & "param.h" and modifying the CPU detection code, to choose "sgemm_kernel_16x4_haswell.S" for zen/zen+ and "sgemm_kernel_8x4_haswell.c" for zen2) and make it a PR. Unfortunately I cannot access Google sites from China to download your results, and currently I don't have a machine with a zen/zen+ CPU to test on.
I would be grateful if you could figure out the reason for the SGEMM performance drop (memory-bound or core-bound factors?) so that I can modify the new kernel code accordingly to improve its compatibility with the older zen processors.

@martin-frbg (Collaborator) commented:

I believe the speed drops in xDOT post 0.3.6 might be due to #1965 if they are not just an artefact. If I read your table correctly, your figures for DSDOT/SDSDOT are even worse than for ZDOT, and they definitely did not receive any changes except that fix for undeclared clobbers.
(Possibly the compiler was able to apply some dangerous optimizations before).

@martin-frbg (Collaborator) commented:

@wjc404 this is marxin's spreadsheet exported from the google docs site in .xlsx format
OpenBLAS - AMD ZEN.xlsx

@wjc404 (Contributor) commented Feb 18, 2020

@martin-frbg Thanks.
@marxin Maybe the changed settings in "param.h" played a role. I didn't realize this since I have never had a chance to test SGEMM on EPYC CPUs. Could you try SGEMM_DEFAULT_P = 640 and SGEMM_DEFAULT_Q = 448 (or even larger)? (Modify lines 669, 675 and 678 in param.h and recompile OpenBLAS 0.3.8.)
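
For clarity, the suggested experiment is just a change to the SGEMM blocking parameters in the Zen section of param.h, roughly like this (illustrative; the exact lines and surrounding #ifdef blocks depend on the OpenBLAS version):

/* param.h, Zen section -- values from the suggestion above */
#define SGEMM_DEFAULT_P  640
#define SGEMM_DEFAULT_Q  448   /* "or even larger" */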

@marxin (Contributor) commented Feb 18, 2020

@marxin I see it's about parallel performance with >4 threads.

Note that my spreadsheet only contains results for single-threaded runs. I haven't had time to run parallel tests. I'm planning to do that.

Most likely the changed settings in "param.h" played a role. I didn't realize this since I have never had a chance to test SGEMM on EPYC CPUs. Could you try SGEMM_DEFAULT_P = 640 and SGEMM_DEFAULT_Q = 448 (or even larger)? (Modify lines 669, 675 and 678 in param.h and recompile OpenBLAS 0.3.8.)

Yes, I will test the suggested changes.

@marxin (Contributor) commented Feb 18, 2020

@marxin If you have confirmed significant performance drop of SGEMM (especially in serial execution with dimensions > 4000) on zen/zen+ chips after PR #2361 , then you can try to specify different SGEMM kernels for zen and zen2 (probably by editing "KERNEL.ZEN" & "param.h" and modifying CPU detection codes, to choose "sgemm_kernel_16x4_haswell.S" for zen/zen+ and "sgemm_kernel_8x4_haswell.c" for zen2) and make it a PR.

Ok, I've just made a minimal reversion of #2361 which restores speed on znver1 and it also helps on znver2. Let's discuss that in #2430.

@marxin (Contributor) commented Feb 18, 2020

I believe the speed drops in xDOT post 0.3.6 might be due to #1965 if they are not just an artefact. If I read your table correctly, your figures for DSDOT/SDSDOT are even worse than for ZDOT, and they definitely did not receive any changes except that fix for undeclared clobbers.
(Possibly the compiler was able to apply some dangerous optimizations before).

I've just re-run that locally and I cannot reproduce the slower numbers on the current develop branch.

@martin-frbg (Collaborator) commented:

Perhaps with Ryzen vs. EPYC we are introducing some other variable besides znver1/znver2, even when running on a single core? Unfortunately I cannot run benchmarks on my 2700K in the next few days (and I remember it was not easy to force it to run at a fixed core frequency with actually reproducible speeds).

@MigMuc commented Feb 18, 2020

I ran the benchmark given above on my new Ryzen 7 3700X. I set the CPU frequency to 3.6 GHz (verified with zenmonitor), switching off any Turbo Core boost or Precision Boost Overdrive settings in the BIOS. I have 2x8 GB of RAM installed at 3200 MHz. The results for the last 3 releases of OpenBLAS are given in the spreadsheet.
OpenBLAS-AMD_Ryzen_R7_3700X_3600MHz.xlsx
I can confirm that SGEMM is slightly faster with the releases before v0.3.8 than in the current release.
