Pathological threading behavior in OpenBLAS -> makes my test suite 4x slower than if using reference BLAS #731
Yeah, simply doing small matrix computations in the main thread would certainly be a big improvement in this case. Beyond that, Julian Taylor (who helped me track this down, and in my experience is usually pretty on point about high-performance code) had some further thoughts on how to possibly address the more fundamental problem:
> It is fairly obvious from your profiler output that eliminating the system calls (i.e. `OMP_NUM_THREADS=1`) will speed up your computation 20x, while speeding up OpenBLAS 20x will speed it up 2%.
>
> The system calls are OpenBLAS busy-waiting while making calls to …
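For reference, a minimal sketch of that single-thread workaround applied from inside Python; this is not from the original exchange, and it assumes the environment variables are set before numpy (and hence the BLAS) is loaded:

```python
# Sketch: pin OpenBLAS to one thread before numpy loads it. OpenBLAS
# reads these variables at load time, so set them before the import.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"  # used by OpenMP-built OpenBLAS

import numpy as np

print(np.linalg.svd(np.random.rand(10, 10))[1])  # runs single-threaded
```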
Sorry, maybe that was too terse to be clear... I think there are two possible improvements here that are probably both independently valuable:
In your case, with only small matrices involved, a single-threaded version should be optimal.
I fixed …
10 s for 400,000 pthread_create() calls is fast.
@brada4: no idea what you're talking about with … (After a bit of checking, it seems you don't actually contribute anything to OpenBLAS besides posting lots and lots of unhelpful comments on issues? I guess I will ignore you from now on, then.)
Why don't you use single-threaded OpenBLAS? Creating threads is much heavier than the processing needed for your LAPACK call. OpenBLAS threading must be avoided in such cases by reasonably raising the threading cutoff, much like #103.
In interface/swap.c. The others are not in the binary, unlikely with a casual compile.
Try this (it will make the small *swap routines single-threaded), in interface/swap.c:

```c
// disable multi-threading when incx == 0 or incy == 0
if (incx == 0 || incy == 0)
    nthreads = 1;

// (suggestion: also take this single-threaded branch for small n)
if (nthreads == 1) {
```
For dswap I cannot find any value of n where multi-threading is useful. From the OpenBLAS benchmark directory:
So it looks like we should just remove multi-threading in …
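A hedged way to sanity-check that claim outside the OpenBLAS benchmark directory is to time scipy's `dswap` wrapper directly; the sizes and repeat counts below are arbitrary choices, not the original benchmark:

```python
# Hypothetical cross-check of the dswap claim above (not the OpenBLAS
# benchmark itself): time scipy's dswap wrapper over power-of-two sizes.
# Run once with OPENBLAS_NUM_THREADS=1 and once without, then compare.
import timeit

import numpy as np
from scipy.linalg.blas import dswap

for p in range(4, 24):
    n = 1 << p
    x = np.random.rand(n)
    y = np.random.rand(n)
    t = timeit.timeit(lambda: dswap(x, y), number=100)
    print("n=2**%2d: %8.1f us per call" % (p, t / 100 * 1e6))
```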
The LAPACK under your code calls sswap lots of times with n=2; I just picked some random value.
Cannot comment on the "why", just on the "when": I dug up a copy of the original GotoBLAS sources from 2006 and it already had the multithreading in level1/swap/swap.c (even the "if incx==0..." mentioned above was added only in #6).
Basically the same effect. What about #ifdef SMPTEST? It could even be reduced to swapping pointers...
Total benchmark time includes (mostly) rand() on the main thread.
I rigged the benchmark to test only 1<<x sizes, basically showing that 2 CPUs go from a huge to a mild disadvantage until all caches overflow (AMD A10-5750M), and then give some marginal benefit after that (after WHAT exactly is hard to estimate). Changing incx/incy, or adding a 3rd CPU, does not change the result. @jeromerobert can you measure with yours?
Some Haswell parts have a 128 MB L4 cache; I think it is safe to bring in the 2nd thread only past that size.
You are flattering me by avoiding thrashing my 2 MB cache. (Warning: incomplete; there is no ASM for an L3-cache CPUID query, and it does not respect shared caches.)
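The arithmetic behind those cache-size cutoffs is simple enough to sketch; the crossover point below is a rough estimate, assuming the two dswap operand vectors of 8-byte doubles are the only data in play:

```python
# Rough arithmetic for the cache discussion above: the n at which the
# two dswap operand vectors (8-byte doubles each) outgrow a given cache.
def crossover_n(cache_bytes):
    return cache_bytes // (2 * 8)  # two vectors of 8-byte doubles

for name, size in [("2 MB cache", 2 << 20), ("128 MB L4", 128 << 20)]:
    print("%-10s: n > %s" % (name, format(crossover_n(size), ",")))
```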
I maintain a Python library called "patsy", which runs tests on Travis-CI, and I was trying to figure out why some of my test runs were failing due to timeouts.
I eventually tracked it down: it turns out that some of the test configurations I was using were linked against OpenBLAS, and some were not. The OpenBLAS runs were taking >4x longer to complete, even though the test suite actually spends most of its time doing things other than linear algebra...
It turns out that the problem is some truly pathological behavior in OpenBLAS's threading code. Here's a simple example of two identical runs of the test suite, one with OpenBLAS in its default mode and one with `OMP_NUM_THREADS=1`: https://travis-ci.org/pydata/patsy/builds/100343158

The biggest culprit appears to be the fact that one of the tests does a few tens of thousands of SVDs on small matrices. Here's a simple test case:
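(The script itself did not survive the copy here; what follows is a minimal sketch consistent with the description above, with the matrix size and iteration count as guesses:)

```python
# Hypothetical reconstruction of svd-demo.py; the original script is not
# reproduced in this issue text. The size (10x10) and count (40000) are
# guesses matching "a few tens of thousands of SVDs on small matrices".
import numpy as np

def main():
    rng = np.random.RandomState(0)
    for _ in range(40000):
        np.linalg.svd(rng.rand(10, 10))

if __name__ == "__main__":
    main()
```

Timing it as `OMP_NUM_THREADS=1 python svd-demo.py` versus a plain `python svd-demo.py` reproduces the comparison below.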
Running `time python svd-demo.py` on my laptop in different configurations:

So having access to multiple threads makes things take more than 2x longer by wall-clock time, and 8x more CPU time.
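(The exact numbers are not preserved here. As a sketch of how that wall-vs-CPU gap can be measured in-process, assuming numpy is linked against the BLAS under test; `time.process_time()` sums CPU time across all threads:)

```python
# Sketch: measure wall-clock vs process CPU time for the SVD loop.
# With busy-waiting BLAS threads, CPU time inflates far beyond wall time.
import time

import numpy as np

wall0, cpu0 = time.time(), time.process_time()
for _ in range(10000):
    np.linalg.svd(np.random.rand(10, 10))
print("wall %.2fs, cpu %.2fs" % (time.time() - wall0,
                                 time.process_time() - cpu0))
```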
Here's the top few lines from a `perf report`, showing that OpenBLAS is spending the vast majority of that time simply churning through the kernel scheduler, accomplishing nothing useful: