scal benchmark: eliminate y, move init/timing out of loop #3847
Conversation
Removing y avoids cache effects (if y is the size of the L1 cache, the main array x is evicted from it).
Moving init and timing out of the loop makes the scal benchmark behave like the gemm benchmark, and allows higher accuracy for smaller test cases, since the loop overhead is much smaller than the timing overhead.
Example:
OPENBLAS_LOOPS=10000 ./dscal.goto 1024 8192 1024
on AMD Zen2 (7532) with 32k (4k doubles) of L1 data cache per core.
Before
From : 1024 To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
SIZE Flops
1024 : 5627.08 MFlops 0.000000 sec
2048 : 5907.34 MFlops 0.000000 sec
3072 : 5553.30 MFlops 0.000001 sec
4096 : 5446.38 MFlops 0.000001 sec
5120 : 5504.61 MFlops 0.000001 sec
6144 : 5501.80 MFlops 0.000001 sec
7168 : 5547.43 MFlops 0.000001 sec
8192 : 5548.46 MFlops 0.000001 sec
After
From : 1024 To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
SIZE Flops
1024 : 6310.28 MFlops 0.000000 sec
2048 : 6396.29 MFlops 0.000000 sec
3072 : 6439.14 MFlops 0.000000 sec
4096 : 6327.14 MFlops 0.000001 sec
5120 : 5628.24 MFlops 0.000001 sec
6144 : 5616.41 MFlops 0.000001 sec
7168 : 5553.13 MFlops 0.000001 sec
8192 : 5600.88 MFlops 0.000001 sec
We can see the L1->L2 switchover point is now where it should be (the drop now falls between 4096 and 5120 doubles, i.e. right at the 32k L1 capacity), and the MFlops figure for sizes that fit in L1 is more accurate.
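For context, the restructured timing looks roughly like this (a sketch only, not the exact benchmark/scal.c code: run_dscal is a hypothetical helper, and cblas_dscal/gettimeofday stand in for the benchmark's own SCAL call and timer):

```c
#include <stdlib.h>
#include <sys/time.h>
#include <cblas.h>

/* Sketch of the restructured benchmark loop: initialize once,
 * time the whole repetition loop once, divide by the loop count,
 * so the timer overhead is paid once instead of per iteration. */
static double run_dscal(int n, double *x, int loops)
{
    struct timeval start, stop;

    for (int i = 0; i < n; i++)          /* init once, outside the timed region */
        x[i] = (double)rand() / RAND_MAX;

    gettimeofday(&start, NULL);
    for (int l = 0; l < loops; l++)      /* only the dscal calls are timed */
        cblas_dscal(n, 2.0, x, 1);
    gettimeofday(&stop, NULL);

    double sec = (stop.tv_sec - start.tv_sec)
               + (stop.tv_usec - start.tv_usec) * 1e-6;
    return sec / loops;                  /* seconds per dscal call */
}
```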
Two notes/questions:
Level 1 BLAS is bound to memory speed - you get only a few flops per memory access. A better characteristic would be GB/s towards RAM for Level 1, and in many cases for Level 2, but Level 3 is more like N^(3/2) flops per N memory accesses. Not here, but for other benchmarks, memset()-ing the destination allocation is needed to cancel out memory overcommitment effects. The best place to start would be the benchmarks that match the python/julia/R scripts in the subdirectory.
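As a rough illustration of that point, a hedged sketch of the conversion (dscal_mflops_to_gbs is a hypothetical helper, not part of the OpenBLAS benchmarks; it assumes dscal does one flop per element, i.e. one 8-byte store plus optionally one 8-byte load):

```c
#include <stdio.h>

/* Convert a dscal MFlops figure into memory bandwidth.  For dscal one
 * flop maps to one 8-byte store, plus one 8-byte load if read traffic
 * is counted too, so bytes_per_flop is 8 (stores only) or 16 (read+write). */
static double dscal_mflops_to_gbs(double mflops, double bytes_per_flop)
{
    return mflops * bytes_per_flop / 1000.0;   /* MB/s -> GB/s */
}

int main(void)
{
    /* e.g. the 6310.28 MFlops L1 result above is ~50 GB/s of stores,
     * or ~101 GB/s counting read and write traffic */
    printf("%.1f GB/s stores\n", dscal_mflops_to_gbs(6310.28, 8.0));
    printf("%.1f GB/s total\n",  dscal_mflops_to_gbs(6310.28, 16.0));
    return 0;
}
```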
I also did some benchmarks with bandwidth-1.11.2d from https://zsmith.co/bandwidth.php (patched to allow avx2/ymm testing on AMD) and the test from #2180 (comment). Basically for dscal the bottleneck is store bandwidth, and many modern x86 CPUs can sustain one register store per cycle to L1 cache. E.g. if I hit 12417.18 MFlops for dscal at 3.2 GHz on Zen2, that is fairly close to the theoretical 4 x 3.2 = 12.8 Gflops (one 4-double ymm store per cycle), and ymm is double xmm, about 100 vs. 50 GB/s. On my Broadwell test the difference between ymm and xmm was gone for L2 cache, but on Zen2 there is still an advantage in L2 and even L3, though in main memory it is gone, that is, down to ~1700 MFlops, or 13.6 GB/s, which could be improved using instructions that bypass the cache (non-temporal moves).
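The cache-bypassing idea could look roughly like this (a sketch with AVX intrinsics, assuming x is 32-byte aligned and n is a multiple of 4; dscal_nt is a hypothetical name, not an OpenBLAS kernel, and real kernels would handle alignment and tails):

```c
#include <immintrin.h>

/* Scale x by alpha using non-temporal (streaming) stores so the
 * written cache lines bypass the cache hierarchy on the way out. */
static void dscal_nt(long n, double alpha, double *x)
{
    __m256d va = _mm256_set1_pd(alpha);
    for (long i = 0; i < n; i += 4) {
        __m256d vx = _mm256_load_pd(&x[i]);             /* aligned 4-double load  */
        _mm256_stream_pd(&x[i], _mm256_mul_pd(vx, va)); /* store, bypassing cache */
    }
    _mm_sfence();   /* make the streaming stores globally visible before returning */
}
```

This only pays off for arrays well beyond cache size; for in-cache sizes the regular stores shown in the benchmark are the right choice.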
More like X cache lines per memory cycle, but OK, we are on a similar page.