scal benchmark: eliminate y, move init/timing out of loop #3847

Merged
merged 1 commit into OpenMathLib:develop on Dec 2, 2022

Conversation

bartoldeman
Contributor

Removing y avoids cache effects (if y is as large as the L1 cache, the main array x gets evicted from it).
Moving the init and timing out of the loop makes the scal benchmark behave like the gemm benchmark, and allows higher accuracy for smaller test cases, since the loop overhead is much smaller than the timing overhead.
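
Roughly, the change restructures the timing like this (a minimal self-contained sketch using cblas_dscal and clock_gettime, not the actual benchmark/scal.c, which uses OpenBLAS's own timing and build plumbing):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

static double wall(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    const int from = 1024, to = 8192, step = 1024, loops = 10000;
    double *x = malloc((size_t)to * sizeof(double));
    for (int n = from; n <= to; n += step) {
        for (int i = 0; i < n; i++)              /* init once, outside the repeat loop */
            x[i] = (double)rand() / RAND_MAX;
        double t0 = wall();                      /* timing taken once, around all repetitions */
        for (int l = 0; l < loops; l++)
            cblas_dscal(n, 1.0 + 1e-9, x, 1);    /* alpha near 1 so 10000 repeats don't overflow */
        double t = (wall() - t0) / loops;
        printf("%8d : %10.2f MFlops %12.6f sec\n", n, n / t * 1e-6, t);
    }
    free(x);
    return 0;
}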

Example:
OPENBLAS_LOOPS=10000 ./dscal.goto 1024 8192 1024
on AMD Zen2 (EPYC 7532) with 32 KB (4K doubles) of L1 data cache per core.

Before
From : 1024  To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
   SIZE       Flops
   1024 :     5627.08 MFlops   0.000000 sec
   2048 :     5907.34 MFlops   0.000000 sec
   3072 :     5553.30 MFlops   0.000001 sec
   4096 :     5446.38 MFlops   0.000001 sec
   5120 :     5504.61 MFlops   0.000001 sec
   6144 :     5501.80 MFlops   0.000001 sec
   7168 :     5547.43 MFlops   0.000001 sec
   8192 :     5548.46 MFlops   0.000001 sec

After
From : 1024  To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
   SIZE       Flops
   1024 :     6310.28 MFlops   0.000000 sec
   2048 :     6396.29 MFlops   0.000000 sec
   3072 :     6439.14 MFlops   0.000000 sec
   4096 :     6327.14 MFlops   0.000001 sec
   5120 :     5628.24 MFlops   0.000001 sec
   6144 :     5616.41 MFlops   0.000001 sec
   7168 :     5553.13 MFlops   0.000001 sec
   8192 :     5600.88 MFlops   0.000001 sec

We can see the L1->L2 switchover point is now where it should be (just above 4096 doubles, the L1 capacity), and the MFlops figures for L1-resident sizes are more accurate.

@bartoldeman
Contributor Author

bartoldeman commented Nov 29, 2022

Two notes/questions:

  1. Many other benchmarks have the init inside the loop, but gemm does not; I couldn't figure out why. Should they be adjusted too?
  2. For some reason I don't understand, kernel/x86_64/dscal_microk_haswell-2.c uses xmm registers, whereas ymm registers are twice as fast on Zen2 (also Broadwell), at least for the inner cache levels (see the sketch after this list). Even stranger, the sscal kernel uses the old code in scal_sse.S, unlike [dcz]scal. Do you know why? I can supply PRs for those too.
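
To illustrate point 2, here is the idea in intrinsics form (a sketch only; dscal_microk_haswell-2.c itself is written as inline assembly, and this ignores remainder handling, assuming n is a multiple of 2 resp. 4):

#include <immintrin.h>

/* xmm (SSE2-width) inner loop: 2 doubles per load/multiply/store */
static void dscal_xmm(long n, double alpha, double *x) {
    __m128d va = _mm_set1_pd(alpha);
    for (long i = 0; i < n; i += 2)
        _mm_storeu_pd(&x[i], _mm_mul_pd(va, _mm_loadu_pd(&x[i])));
}

/* ymm (AVX-width) inner loop: 4 doubles per load/multiply/store.
   Since dscal is store-bound and these CPUs sustain about one register store
   per cycle, doubling the store width roughly doubles L1-resident throughput. */
static void dscal_ymm(long n, double alpha, double *x) {
    __m256d va = _mm256_set1_pd(alpha);
    for (long i = 0; i < n; i += 4)
        _mm256_storeu_pd(&x[i], _mm256_mul_pd(va, _mm256_loadu_pd(&x[i])));
}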

@martin-frbg
Collaborator

  1. Could be simply down to when the benchmarks were added - most appear to have been written by the same contributor in 2014 (and some of them were later used as templates for others), and he wrote the GEMM one years later, when he may have been much more aware of cache effects.
  2. The Haswell kernel precedes the introduction of Ryzen2 by three years, and probably nobody was aware of the register speed issue. I have no idea if the absence of microkernels for SSCAL is/was due to limited benefit expected from a newer-than-SSE2 solution, or if it was never seen as a bottleneck in real-world code. Unfortunately the author of the files in question stopped contributing very suddenly and for unknown (possibly even medical) reasons in early 2017.

@brada4
Contributor

brada4 commented Dec 1, 2022

Level 1 BLAS is bound by memory speed - you get only a few flops per memory access. A better characteristic would be GB/s towards RAM for Level 1 and in many cases Level 2, but for Level 3 it is more like N^(3/2) flops per memory access (roughly, gemm does ~2N^3 flops over ~3N^2 elements of data).

Not here, but for other benchmarks, memset()-ing the destination allocation is needed to cancel out memory overcommitment effects.
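
(Concretely, the idea is just to touch the freshly allocated buffer before the timed region; a sketch, not code from any particular benchmark:)

#include <stdlib.h>
#include <string.h>

/* malloc() may return lazily mapped (overcommitted) pages; the first write to
   each page then costs a page fault inside the measurement. memset forces the
   pages to be backed before the benchmark loop starts. */
double *y = malloc((size_t)n * sizeof(double));
memset(y, 0, (size_t)n * sizeof(double));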

The best place to start would be the benchmarks that have matching python/julia/R scripts in the subdirectory.

@bartoldeman
Contributor Author

I also did some benchmarks with bandwidth-1.11.2d from https://zsmith.co/bandwidth.php (patched to allow avx2/ymm testing on AMD) and the test from #2180 (comment).

Basically, for dscal the bottleneck is store bandwidth, and many modern x86 CPUs can sustain one register store per cycle to L1 cache. E.g. if I hit 12417.18 MFlops for dscal at 3.2 GHz on Zen2, that is fairly close to the theoretical 4 doubles/cycle x 3.2 GHz = 12.8 GFlops, and ymm is double xmm, about 100 vs. 50 GB/s.

On my Broadwell test the difference between ymm and xmm was gone for the L2 cache, but on Zen2 there is still an advantage in L2 and even L3, though in main memory it is gone, that is, down to ~1700 MFlops, or 13.6 GB/s, which could be improved using instructions that bypass the cache (non-temporal moves, vmovntdq and co).
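
For reference, those GB/s numbers count only the store traffic - dscal does one multiply and one 8-byte store per element - so the conversion is simply MFlops x 8 bytes (my helper name, not from the benchmark):

/* store bandwidth implied by a dscal MFlops figure:
   12417 MFlops -> ~99 GB/s, 1700 MFlops -> ~13.6 GB/s */
static double dscal_store_gbps(double mflops) {
    return mflops * 8.0 / 1000.0;
}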

@brada4
Contributor

brada4 commented Dec 1, 2022

More like X cache lines per memory cycle, but OK, we are on a similar page.

@martin-frbg martin-frbg added this to the 0.3.22 milestone Dec 2, 2022
@martin-frbg martin-frbg merged commit 65984fb into OpenMathLib:develop Dec 2, 2022