Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pathological threading behavior in OpenBLAS -> makes my test suite 4x slower than if using reference BLAS #731

Closed
njsmith opened this issue Jan 5, 2016 · 21 comments

Comments

@njsmith
Copy link

njsmith commented Jan 5, 2016

I maintain a python library called "patsy", which runs tests on Travis-CI, and I was trying to figure out why some of my test runs were failing due to timeouts.

I eventually tracked it down to, it turns out that some of the test configurations I was using were linked to OpenBLAS, and some were not. The OpenBLAS runs were taking >4x longer to complete. This is true even though the test suite actually spends most of its time doing other things besides linear algebra...

It turns out that the problem is some truly pathological behavior in OpenBLAS's threading code. Here's a simple example of two identical runs of the test suite, one with OpenBLAS in its default mode, and one with OMP_NUM_THREADS=1: https://travis-ci.org/pydata/patsy/builds/100343158

The biggest culprit appears to be the fact that one of the tests does a few tens of thousands of SVDs on small matrices. Here's a simple test case:

import numpy as np

# Generate a 20x2 matrix
r = np.random.RandomState(0)
m = r.randn(20, 2)

# Calculate the svd
for i in xrange(100000):
    np.linalg.svd(m)

Running time python svd-demo.py on my laptop in different configurations:

# Reference BLAS:
1.98s user 0.26s system 113% cpu 1.973 total
# OpenBLAS with OMP_NUM_THREADS=1
2.06s user 0.01s system 99% cpu 2.070 total
# OpenBLAS with no tweaks
5.80s user 10.87s system 379% cpu 4.392 total

So having access to multiple threads makes things take more than 2x longer by wall clock time, and 8x more CPU time.

Here's the top few lines from a perf report, showing that OpenBLAS is spending the vast majority of that time simply churning through the kernel scheduler, accomplishing nothing useful:

Overhead  Command  Shared Object                    Symbol                     ◆
   9.87%  python   [kernel.kallsyms]                [k] __schedule             ▒
   7.22%  python   [kernel.kallsyms]                [k] entry_SYSCALL_64       ▒
   7.21%  python   [kernel.kallsyms]                [k] pick_next_task_fair    ▒
   7.06%  python   libc-2.21.so                     [.] __sched_yield          ▒
   5.55%  python   [kernel.kallsyms]                [k] entry_SYSCALL_64_after_▒
   4.77%  python   libopenblas_nehalemp-r0.2.14.so  [.] blas_thread_server     ▒
   4.70%  python   [kernel.kallsyms]                [k] native_sched_clock     ▒
   4.56%  python   [kernel.kallsyms]                [k] cpuacct_charge         ▒
   4.39%  python   [kernel.kallsyms]                [k] _raw_spin_lock_irq     ▒
   4.18%  python   [kernel.kallsyms]                [k] _raw_spin_lock         ▒
   3.56%  python   [kernel.kallsyms]                [k] update_curr            ▒
   2.46%  python   [kernel.kallsyms]                [k] system_call_fast_compar▒
   2.33%  python   [kernel.kallsyms]                [k] pick_next_entity       ▒
@jeromerobert
Copy link
Contributor

It looks like #79. There is at least a problem in interface/ger.c. It call GER_THREAD whatever the size of the matrix. We need something like f8f2e26 or b985cea.

@njsmith
Copy link
Author

njsmith commented Jan 6, 2016

Yeah, simply doing small matrix computations in the main thread would certainly be a big improvement in this case.

Beyond that, Julian Taylor (who helped me track this down, in IME is usually pretty on point about high-performance code) had some further thoughts on how to possibly address the more fundamental problem:

<jtaylor> probably will look something like this: https://twitter.com/jtaylor108/status/556833272190468096
[...]
<jtaylor> yeah thats openblas
<jtaylor> being stupid
<njs> that is extraordinary, this isn't even linear algebra code
<jtaylor> sched_yield is an aweful idea
<jtaylor> it did what you though 15 years ago
<jtaylor> on linux is dumps your process at the end of the queue rescheduling everything
<jtaylor> massive overhead
<njs> I'm sorta curious what would happen if one used LD_PRELOAD to stub-out sched_yield and ran the same benchmark
<jtaylor> it would probably not be much better
<jtaylor> as it would instead busy loop
<jtaylor> probably a little better, but it should just use a standard spinlock

@brada4
Copy link
Contributor

brada4 commented Jan 7, 2016

It is fairly obvious from your profiler output that eliminating system calls (i.e OMP_NUM_THREADS=1_ will speed up your computation 20x, while speeding up OpenBLAS 20x will sped it up 2%

@njsmith
Copy link
Author

njsmith commented Jan 7, 2016

The system calls are OpenBLAS busy-waiting while making calls to sched_yield.

@njsmith
Copy link
Author

njsmith commented Jan 7, 2016

Sorry, maybe that was too terse to be clear... I think there are two possible improvements here, that are probably both independently valuable:

  • OpenBLAS shouldn't use multiple threads at all for small matrices -- even if your threading is super-efficient, it still has some overhead, so it's only worthwhile if the matrix is large enough to amortize this overhead.
  • OpenBLAS probably shouldn't use sched_yield + busywaiting as a synchronization primitive -- it may beat real synchronization primitives in some specialized cases, but this is going to be fragile and highly dependent on all kinds of system-specific details (kernel version, system load, hardware, exact matrix sizes, etc. etc.). And when it goes wrong it goes really wrong. If OpenBLAS had been using real synchronization primitives, then the CPU cost penalty for using multiple threads instead of a single thread would have probably been a few percent, not a factor of 8. (see also)

@brada4
Copy link
Contributor

brada4 commented Jan 7, 2016

In your case with only small matrices involved a single-thread version should be optimal.
Above that it is wild guessing if it is page per CPU or few or hugepage or few or numa stride, or gigabyte.
Let me invent micrometer to measure

@jeromerobert
Copy link
Contributor

I fixed interface/ger.c in my fork and it makes things much better. But it's still not good as interface/swap.c have the same curse.

@brada4
Copy link
Contributor

brada4 commented Jan 15, 2016

10s for 400000 pthread_create()-s is fast.
pthread_create wraps system call and you see that whole logistics around it take 98% of execution time.
creating a thread (or even worse more of them) is definitely heavier than churning 88 bytes on a SSE register file that curiously is only 256 bytes large.
you can always call openblas_set_num_threads(1) before operations on very small matrices

@njsmith
Copy link
Author

njsmith commented Jan 15, 2016

@brada4: no idea what you're talking about with pthread_create. And calling openblas_set_numthreads before operations on small matrices is absurd. Notice that this problem was discovered after a code distributor switched out their blas implementation 2 layers underneath me -- I didn't even know I was using openblas.

(After a bit of checking, it seems you don't actually contribute anything to openblas besides posting lots and lots of unhelpful comments on issues? I guess I will ignore you from now on then.)

@brada4
Copy link
Contributor

brada4 commented Jan 16, 2016

Why dont you use single-threaded OpenBLAS? Creating threads are much heavier than processing needed for your LAPACK call.
Is it OK for you to apply OPENBLAS_NUM_THREADS permanently or rebuild a single-threaded OpenBLAS?
btw numpypy takes about .3s on your test sample with 1-thread OpenBLAS.

OpenBLAS threading must be avoided in such cases by reasonably rising threading cutoff much like #103
Actually the threading is not applied at LAPACK frontend (s)gesdd but in one or more later calls to BLAS and it is a lot of them to say softly.
For next hours I will be looking into which of them misses (creates threads no matter input size) in your case.

@brada4
Copy link
Contributor

brada4 commented Jan 16, 2016

Iinterface/swap.c
does not choose UP swap code if matrix size is small (called with incx=incy=2 in this case)
and affects test case.

Other not in binary, unlikely with casual compile
in interface/ger.c hidden under #ifdef SMPTEST

@brada4
Copy link
Contributor

brada4 commented Jan 16, 2016

Try this (it will turn small *swap routines single-threaded)

interface/swap.c

//disable multi-thread when incx==0 or incy==0
//In that case, the threads would be dependent.
-- if (incx == 0 || incy == 0)
++ if (incx == 0 || incy == 0 || n<4096)
nthreads = 1;

if (nthreads == 1) {

@jeromerobert
Copy link
Contributor

We need a consensus on #742 first.

@brada4 how did you chose the threshold (i.e. 4096) ?

@jeromerobert
Copy link
Contributor

For dswap I cannot find any value of n where multi-threading is useful. From OpenBLAS benchmark directory:

$ time bash -c 'OPENBLAS_NUM_THREADS=1 OPENBLAS_LOOPS=10 ./dswap.goto 1000000 10000000 1000000'
From : 1000000  To : 10000000 Step = 1000000 Inc_x = 1 Inc_y = 1 Loops = 10
   SIZE       Flops
 1000000 :     5280.53 MBytes
 2000000 :     3808.07 MBytes
 3000000 :     3807.05 MBytes
 4000000 :     3780.67 MBytes
 5000000 :     3766.09 MBytes
 6000000 :     3749.88 MBytes
 7000000 :     3709.89 MBytes
 8000000 :     3743.32 MBytes
 9000000 :     3736.98 MBytes
 10000000 :     3718.41 MBytes

real    0m14.091s
user    0m14.020s
sys     0m0.056s

$ time bash -c 'OPENBLAS_NUM_THREADS=2 OPENBLAS_LOOPS=10 ./dswap.goto 1000000 10000000 1000000'
From : 1000000  To : 10000000 Step = 1000000 Inc_x = 1 Inc_y = 1 Loops = 10
   SIZE       Flops
 1000000 :     7075.26 MBytes
 2000000 :     6552.81 MBytes
 3000000 :     5028.81 MBytes
 4000000 :     4628.56 MBytes
 5000000 :     4388.18 MBytes
 6000000 :     4246.62 MBytes
 7000000 :     4106.84 MBytes
 8000000 :     4096.21 MBytes
 9000000 :     4060.36 MBytes
 10000000 :     4085.70 MBytes

real    0m26.831s
user    0m29.196s
sys     0m9.528s

So it looks like we should just remove multi-threading in swap.c. Yet it would be good to know why it was implemented. Can someone else reproduce the same behavior ?

@brada4
Copy link
Contributor

brada4 commented Jan 16, 2016

Your lapack under your code calls sswap looots of times with n=2, i just picked some random value

@martin-frbg
Copy link
Collaborator

Cannot comment on the "why", just on "when" - I dug up a copy of the original GotoBLAS sources from 2006 and it already had the multithreading in level1/swap/swap.c (even the "if incx==0..." mentioned above was added only in #6)
Perhaps it made more sense back then, or it just seemed like a good idea at the time (when dualcore processors just hit the market, so SMP mostly meant relatively rare multisocket systems)

@brada4
Copy link
Contributor

brada4 commented Jan 16, 2016

Basically same effect. What about #ifdef SMPTEST? It can be even reduced to swapping pointers...
4-core piledriver (total of 8MB in caches, making sample 10x that to strip their effect)

> /usr/bin/time bash -c 'OPENBLAS_LOOPS=100 OPENBLAS_NUM_THREADS=1 ./sswap.goto 20000000 20000000'
From : 20000000  To : 20000000 Step =   1 Inc_x = 1 Inc_y = 1 Loops = 100
   SIZE       Flops
 20000000 :     2568.05 MBytes
61.60user 0.04system 1:01.74elapsed 99%CPU (0avgtext+0avgdata 159052maxresident)k
0inputs+0outputs (0major+1520minor)pagefaults 0swaps
> /usr/bin/time bash -c 'OPENBLAS_LOOPS=100 OPENBLAS_NUM_THREADS=2 ./sswap.goto 20000000 20000000'
From : 20000000  To : 20000000 Step =   1 Inc_x = 1 Inc_y = 1 Loops = 100
   SIZE       Flops
 20000000 :     2998.16 MBytes
155.00user 0.04system *2:21.85elapsed* 109%CPU (0avgtext+0avgdata 159064maxresident)k
0inputs+0outputs (0major+1524minor)pagefaults 0swaps

@brada4
Copy link
Contributor

brada4 commented Jan 16, 2016

Total benchmark time includes (mostly) rand() on main thread
Dropping measurement loop start a bit down in benchmarks/swap.c I can measure small (20%) speedup 2 threads vs 1

@brada4
Copy link
Contributor

brada4 commented Jan 17, 2016

Rigged benchmark to just test 1<<x cases, basically showing that 2 CPUs have huge to mild disadvantage until all caches are pouring over (AMD A10-5750M), and then it gives some marginal benefit after (after WHAT is hard to estimate). Incx incy, or 3rd CPU does not change result

@jeromerobert can you measure with yours?

bench_patch.txt
bench_csv.txt

@brada4
Copy link
Contributor

brada4 commented Jan 17, 2016

Some haswell has 128MB L4 cache I think it is safe to prime up 2nd thread right after that.

jeromerobert added a commit to jeromerobert/OpenBLAS that referenced this issue Jan 19, 2016
@brada4
Copy link
Contributor

brada4 commented Jan 19, 2016

You are flattering me with avoiding thrashing my 2MB cache. (Warning: incomplete there is no ASM for L3 cache CPUID, does not respect shared caches)

#ifdef SMP
  nthreads = 2;

  //disable multi-thread when incx==0 or incy==0
  //In that case, the threads would be dependent.
  //prevent cache thrashing with multiple threads
  if (incx == 0 || incy == 0 || 2*n*sizeof(FLOAT)<=MAX(2*1024*1024,get_l2_size()/*+L3*/*num_cpu_avail(1)))
          nthreads = 1;

  if (nthreads == 1) {
#endif

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants