scal benchmark: eliminate y, move init/timing out of loop #3847

Merged
merged 1 commit into OpenMathLib:develop on Dec 2, 2022

Conversation

bartoldeman
Contributor

Removing y avoids cache effects (if y is as large as the L1 cache, the main array x gets evicted from it).
Moving the init and timing out of the loop makes the scal benchmark behave like the gemm benchmark, and allows higher accuracy for smaller test cases, since the loop overhead is much smaller than the timing overhead.
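
Roughly, the change restructures the timing like this (a minimal self-contained sketch using cblas_dscal and clock_gettime, not the actual benchmark/scal.c, which uses OpenBLAS's own timing and build plumbing):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

static double wall(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    const int from = 1024, to = 8192, step = 1024, loops = 10000;
    double *x = malloc((size_t)to * sizeof(double));
    for (int n = from; n <= to; n += step) {
        for (int i = 0; i < n; i++)              /* init once, outside the repeat loop */
            x[i] = (double)rand() / RAND_MAX;
        double t0 = wall();                      /* timing taken once, around all repetitions */
        for (int l = 0; l < loops; l++)
            cblas_dscal(n, 1.0 + 1e-9, x, 1);    /* alpha near 1 so 10000 repeats don't overflow */
        double t = (wall() - t0) / loops;
        printf("%8d : %10.2f MFlops %12.6f sec\n", n, n / t * 1e-6, t);
    }
    free(x);
    return 0;
}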

Example:
OPENBLAS_LOOPS=10000 ./dscal.goto 1024 8192 1024
on AMD Zen2 (EPYC 7532) with 32 KB (4K doubles) of L1 data cache per core.

Before
From : 1024  To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
   SIZE       Flops
   1024 :     5627.08 MFlops   0.000000 sec
   2048 :     5907.34 MFlops   0.000000 sec
   3072 :     5553.30 MFlops   0.000001 sec
   4096 :     5446.38 MFlops   0.000001 sec
   5120 :     5504.61 MFlops   0.000001 sec
   6144 :     5501.80 MFlops   0.000001 sec
   7168 :     5547.43 MFlops   0.000001 sec
   8192 :     5548.46 MFlops   0.000001 sec

After
From : 1024  To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
   SIZE       Flops
   1024 :     6310.28 MFlops   0.000000 sec
   2048 :     6396.29 MFlops   0.000000 sec
   3072 :     6439.14 MFlops   0.000000 sec
   4096 :     6327.14 MFlops   0.000001 sec
   5120 :     5628.24 MFlops   0.000001 sec
   6144 :     5616.41 MFlops   0.000001 sec
   7168 :     5553.13 MFlops   0.000001 sec
   8192 :     5600.88 MFlops   0.000001 sec

We can see the L1->L2 switchover point is now where it should be (just above 4096 doubles, the L1 capacity), and the MFlops figures for L1-resident sizes are more accurate.

@bartoldeman
Contributor Author

bartoldeman commented Nov 29, 2022

Two notes/questions:

  1. Many other benchmarks have the init inside the loop, but gemm does not; I couldn't figure out why. Should they be adjusted too?
  2. For some reason I don't understand, kernel/x86_64/dscal_microk_haswell-2.c uses xmm registers, whereas ymm registers are twice as fast on Zen2 (also Broadwell), at least for the inner cache levels (see the sketch after this list). Even stranger, the sscal kernel uses the old code in scal_sse.S, unlike [dcz]scal. Do you know why? I can supply PRs for those too.
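
To illustrate point 2, here is the idea in intrinsics form (a sketch only; dscal_microk_haswell-2.c itself is written as inline assembly, and this ignores remainder handling, assuming n is a multiple of 2 resp. 4):

#include <immintrin.h>

/* xmm (SSE2-width) inner loop: 2 doubles per load/multiply/store */
static void dscal_xmm(long n, double alpha, double *x) {
    __m128d va = _mm_set1_pd(alpha);
    for (long i = 0; i < n; i += 2)
        _mm_storeu_pd(&x[i], _mm_mul_pd(va, _mm_loadu_pd(&x[i])));
}

/* ymm (AVX-width) inner loop: 4 doubles per load/multiply/store.
   Since dscal is store-bound and these CPUs sustain about one register store
   per cycle, doubling the store width roughly doubles L1-resident throughput. */
static void dscal_ymm(long n, double alpha, double *x) {
    __m256d va = _mm256_set1_pd(alpha);
    for (long i = 0; i < n; i += 4)
        _mm256_storeu_pd(&x[i], _mm256_mul_pd(va, _mm256_loadu_pd(&x[i])));
}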

@martin-frbg
Collaborator

  1. Could be simply down to when the benchmarks were added - most appear to have been written by the same contributor in 2014 (and some of them were later used as templates for others), and he wrote the GEMM one years later, when he may have been much more aware of cache effects.
  2. The Haswell kernel precedes the introduction of Ryzen2 by three years, and probably nobody was aware of the register speed issue. I have no idea if the absence of microkernels for SSCAL is/was due to limited benefit expected from a newer-than-SSE2 solution, or if it was never seen as a bottleneck in real-world code. Unfortunately the author of the files in question stopped contributing very suddenly and for unknown (possibly even medical) reasons in early 2017.

@brada4
Contributor

brada4 commented Dec 1, 2022

Level 1 BLAS is bound by memory speed - you get only a few flops per memory access. A better characteristic would be GB/s towards RAM for Level 1 and in many cases Level 2, but for Level 3 it is more like N^(3/2) flops per memory access (roughly, gemm does ~2N^3 flops over ~3N^2 elements of data).

Not here, but for other benchmarks, memset()-ing the destination allocation is needed to cancel out memory overcommitment effects.
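
(Concretely, the idea is just to touch the freshly allocated buffer before the timed region; a sketch, not code from any particular benchmark:)

#include <stdlib.h>
#include <string.h>

/* malloc() may return lazily mapped (overcommitted) pages; the first write to
   each page then costs a page fault inside the measurement. memset forces the
   pages to be backed before the benchmark loop starts. */
double *y = malloc((size_t)n * sizeof(double));
memset(y, 0, (size_t)n * sizeof(double));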

The best place to start would be the benchmarks that have matching python/julia/R scripts in the subdirectory.

@bartoldeman
Contributor Author

I also did some benchmarks with bandwidth-1.11.2d from https://zsmith.co/bandwidth.php (patched to allow avx2/ymm testing on AMD) and the test from #2180 (comment).

Basically, for dscal the bottleneck is store bandwidth, and many modern x86 CPUs can sustain one register store per cycle to L1 cache. E.g. if I hit 12417.18 MFlops for dscal at 3.2 GHz on Zen2, that is fairly close to the theoretical 4 doubles/cycle x 3.2 GHz = 12.8 GFlops, and ymm is double xmm, about 100 vs. 50 GB/s.

On my Broadwell test the difference between ymm and xmm was gone for the L2 cache, but on Zen2 there is still an advantage in L2 and even L3, though in main memory it is gone, that is, down to ~1700 MFlops, or 13.6 GB/s, which could be improved using instructions that bypass the cache (non-temporal moves, vmovntdq and co).
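
For reference, those GB/s numbers count only the store traffic - dscal does one multiply and one 8-byte store per element - so the conversion is simply MFlops x 8 bytes (my helper name, not from the benchmark):

/* store bandwidth implied by a dscal MFlops figure:
   12417 MFlops -> ~99 GB/s, 1700 MFlops -> ~13.6 GB/s */
static double dscal_store_gbps(double mflops) {
    return mflops * 8.0 / 1000.0;
}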

@brada4
Contributor

brada4 commented Dec 1, 2022

More like X cache lines per memory cycle, but OK, we are on a similar page.

@martin-frbg martin-frbg added this to the 0.3.22 milestone Dec 2, 2022
@martin-frbg martin-frbg merged commit 65984fb into OpenMathLib:develop Dec 2, 2022