Pathological threading behavior in OpenBLAS -> makes my test suite 4x slower than if using reference BLAS #731

njsmith · 2016-01-05T13:50:54Z

I maintain a python library called "patsy", which runs tests on Travis-CI, and I was trying to figure out why some of my test runs were failing due to timeouts.

I eventually tracked it down to, it turns out that some of the test configurations I was using were linked to OpenBLAS, and some were not. The OpenBLAS runs were taking >4x longer to complete. This is true even though the test suite actually spends most of its time doing other things besides linear algebra...

It turns out that the problem is some truly pathological behavior in OpenBLAS's threading code. Here's a simple example of two identical runs of the test suite, one with OpenBLAS in its default mode, and one with OMP_NUM_THREADS=1: https://travis-ci.org/pydata/patsy/builds/100343158

The biggest culprit appears to be the fact that one of the tests does a few tens of thousands of SVDs on small matrices. Here's a simple test case:

import numpy as np

# Generate a 20x2 matrix
r = np.random.RandomState(0)
m = r.randn(20, 2)

# Calculate the svd
for i in xrange(100000):
    np.linalg.svd(m)

Running time python svd-demo.py on my laptop in different configurations:

# Reference BLAS:
1.98s user 0.26s system 113% cpu 1.973 total
# OpenBLAS with OMP_NUM_THREADS=1
2.06s user 0.01s system 99% cpu 2.070 total
# OpenBLAS with no tweaks
5.80s user 10.87s system 379% cpu 4.392 total

So having access to multiple threads makes things take more than 2x longer by wall clock time, and 8x more CPU time.

Here's the top few lines from a perf report, showing that OpenBLAS is spending the vast majority of that time simply churning through the kernel scheduler, accomplishing nothing useful:

Overhead  Command  Shared Object                    Symbol                     ◆
   9.87%  python   [kernel.kallsyms]                [k] __schedule             ▒
   7.22%  python   [kernel.kallsyms]                [k] entry_SYSCALL_64       ▒
   7.21%  python   [kernel.kallsyms]                [k] pick_next_task_fair    ▒
   7.06%  python   libc-2.21.so                     [.] __sched_yield          ▒
   5.55%  python   [kernel.kallsyms]                [k] entry_SYSCALL_64_after_▒
   4.77%  python   libopenblas_nehalemp-r0.2.14.so  [.] blas_thread_server     ▒
   4.70%  python   [kernel.kallsyms]                [k] native_sched_clock     ▒
   4.56%  python   [kernel.kallsyms]                [k] cpuacct_charge         ▒
   4.39%  python   [kernel.kallsyms]                [k] _raw_spin_lock_irq     ▒
   4.18%  python   [kernel.kallsyms]                [k] _raw_spin_lock         ▒
   3.56%  python   [kernel.kallsyms]                [k] update_curr            ▒
   2.46%  python   [kernel.kallsyms]                [k] system_call_fast_compar▒
   2.33%  python   [kernel.kallsyms]                [k] pick_next_entity       ▒

The text was updated successfully, but these errors were encountered:

jeromerobert · 2016-01-05T21:34:26Z

It looks like #79. There is at least a problem in interface/ger.c. It call GER_THREAD whatever the size of the matrix. We need something like f8f2e26 or b985cea.

njsmith · 2016-01-06T01:36:09Z

Yeah, simply doing small matrix computations in the main thread would certainly be a big improvement in this case.

Beyond that, Julian Taylor (who helped me track this down, in IME is usually pretty on point about high-performance code) had some further thoughts on how to possibly address the more fundamental problem:

<jtaylor> probably will look something like this: https://twitter.com/jtaylor108/status/556833272190468096
[...]
<jtaylor> yeah thats openblas
<jtaylor> being stupid
<njs> that is extraordinary, this isn't even linear algebra code
<jtaylor> sched_yield is an aweful idea
<jtaylor> it did what you though 15 years ago
<jtaylor> on linux is dumps your process at the end of the queue rescheduling everything
<jtaylor> massive overhead
<njs> I'm sorta curious what would happen if one used LD_PRELOAD to stub-out sched_yield and ran the same benchmark
<jtaylor> it would probably not be much better
<jtaylor> as it would instead busy loop
<jtaylor> probably a little better, but it should just use a standard spinlock

brada4 · 2016-01-07T00:38:11Z

It is fairly obvious from your profiler output that eliminating system calls (i.e OMP_NUM_THREADS=1_ will speed up your computation 20x, while speeding up OpenBLAS 20x will sped it up 2%

njsmith · 2016-01-07T01:35:31Z

The system calls are OpenBLAS busy-waiting while making calls to sched_yield.

njsmith · 2016-01-07T01:52:09Z

Sorry, maybe that was too terse to be clear... I think there are two possible improvements here, that are probably both independently valuable:

OpenBLAS shouldn't use multiple threads at all for small matrices -- even if your threading is super-efficient, it still has some overhead, so it's only worthwhile if the matrix is large enough to amortize this overhead.
OpenBLAS probably shouldn't use sched_yield + busywaiting as a synchronization primitive -- it may beat real synchronization primitives in some specialized cases, but this is going to be fragile and highly dependent on all kinds of system-specific details (kernel version, system load, hardware, exact matrix sizes, etc. etc.). And when it goes wrong it goes really wrong. If OpenBLAS had been using real synchronization primitives, then the CPU cost penalty for using multiple threads instead of a single thread would have probably been a few percent, not a factor of 8. (see also)

brada4 · 2016-01-07T12:44:46Z

In your case with only small matrices involved a single-thread version should be optimal.
Above that it is wild guessing if it is page per CPU or few or hugepage or few or numa stride, or gigabyte.
Let me invent micrometer to measure

jeromerobert · 2016-01-15T17:17:16Z

I fixed interface/ger.c in my fork and it makes things much better. But it's still not good as interface/swap.c have the same curse.

brada4 · 2016-01-15T19:48:05Z

10s for 400000 pthread_create()-s is fast.
pthread_create wraps system call and you see that whole logistics around it take 98% of execution time.
creating a thread (or even worse more of them) is definitely heavier than churning 88 bytes on a SSE register file that curiously is only 256 bytes large.
you can always call openblas_set_num_threads(1) before operations on very small matrices

njsmith · 2016-01-15T20:12:42Z

@brada4: no idea what you're talking about with pthread_create. And calling openblas_set_numthreads before operations on small matrices is absurd. Notice that this problem was discovered after a code distributor switched out their blas implementation 2 layers underneath me -- I didn't even know I was using openblas.

(After a bit of checking, it seems you don't actually contribute anything to openblas besides posting lots and lots of unhelpful comments on issues? I guess I will ignore you from now on then.)

brada4 · 2016-01-16T07:30:15Z

Why dont you use single-threaded OpenBLAS? Creating threads are much heavier than processing needed for your LAPACK call.
Is it OK for you to apply OPENBLAS_NUM_THREADS permanently or rebuild a single-threaded OpenBLAS?
btw numpypy takes about .3s on your test sample with 1-thread OpenBLAS.

OpenBLAS threading must be avoided in such cases by reasonably rising threading cutoff much like #103
Actually the threading is not applied at LAPACK frontend (s)gesdd but in one or more later calls to BLAS and it is a lot of them to say softly.
For next hours I will be looking into which of them misses (creates threads no matter input size) in your case.

brada4 · 2016-01-16T15:12:03Z

Iinterface/swap.c
does not choose UP swap code if matrix size is small (called with incx=incy=2 in this case)
and affects test case.

Other not in binary, unlikely with casual compile
in interface/ger.c hidden under #ifdef SMPTEST

brada4 · 2016-01-16T16:41:03Z

Try this (it will turn small *swap routines single-threaded)

interface/swap.c

//disable multi-thread when incx==0 or incy==0
//In that case, the threads would be dependent.
-- if (incx == 0 || incy == 0)
++ if (incx == 0 || incy == 0 || n<4096)
nthreads = 1;

if (nthreads == 1) {

jeromerobert · 2016-01-16T16:57:47Z

We need a consensus on #742 first.

@brada4 how did you chose the threshold (i.e. 4096) ?

jeromerobert · 2016-01-16T17:19:58Z

For dswap I cannot find any value of n where multi-threading is useful. From OpenBLAS benchmark directory:

$ time bash -c 'OPENBLAS_NUM_THREADS=1 OPENBLAS_LOOPS=10 ./dswap.goto 1000000 10000000 1000000'
From : 1000000  To : 10000000 Step = 1000000 Inc_x = 1 Inc_y = 1 Loops = 10
   SIZE       Flops
 1000000 :     5280.53 MBytes
 2000000 :     3808.07 MBytes
 3000000 :     3807.05 MBytes
 4000000 :     3780.67 MBytes
 5000000 :     3766.09 MBytes
 6000000 :     3749.88 MBytes
 7000000 :     3709.89 MBytes
 8000000 :     3743.32 MBytes
 9000000 :     3736.98 MBytes
 10000000 :     3718.41 MBytes

real    0m14.091s
user    0m14.020s
sys     0m0.056s

$ time bash -c 'OPENBLAS_NUM_THREADS=2 OPENBLAS_LOOPS=10 ./dswap.goto 1000000 10000000 1000000'
From : 1000000  To : 10000000 Step = 1000000 Inc_x = 1 Inc_y = 1 Loops = 10
   SIZE       Flops
 1000000 :     7075.26 MBytes
 2000000 :     6552.81 MBytes
 3000000 :     5028.81 MBytes
 4000000 :     4628.56 MBytes
 5000000 :     4388.18 MBytes
 6000000 :     4246.62 MBytes
 7000000 :     4106.84 MBytes
 8000000 :     4096.21 MBytes
 9000000 :     4060.36 MBytes
 10000000 :     4085.70 MBytes

real    0m26.831s
user    0m29.196s
sys     0m9.528s

So it looks like we should just remove multi-threading in swap.c. Yet it would be good to know why it was implemented. Can someone else reproduce the same behavior ?

brada4 · 2016-01-16T17:52:51Z

Your lapack under your code calls sswap looots of times with n=2, i just picked some random value

martin-frbg · 2016-01-16T18:36:03Z

Cannot comment on the "why", just on "when" - I dug up a copy of the original GotoBLAS sources from 2006 and it already had the multithreading in level1/swap/swap.c (even the "if incx==0..." mentioned above was added only in #6)
Perhaps it made more sense back then, or it just seemed like a good idea at the time (when dualcore processors just hit the market, so SMP mostly meant relatively rare multisocket systems)

brada4 · 2016-01-16T18:53:07Z

Basically same effect. What about #ifdef SMPTEST? It can be even reduced to swapping pointers...
4-core piledriver (total of 8MB in caches, making sample 10x that to strip their effect)

> /usr/bin/time bash -c 'OPENBLAS_LOOPS=100 OPENBLAS_NUM_THREADS=1 ./sswap.goto 20000000 20000000'
From : 20000000  To : 20000000 Step =   1 Inc_x = 1 Inc_y = 1 Loops = 100
   SIZE       Flops
 20000000 :     2568.05 MBytes
61.60user 0.04system 1:01.74elapsed 99%CPU (0avgtext+0avgdata 159052maxresident)k
0inputs+0outputs (0major+1520minor)pagefaults 0swaps
> /usr/bin/time bash -c 'OPENBLAS_LOOPS=100 OPENBLAS_NUM_THREADS=2 ./sswap.goto 20000000 20000000'
From : 20000000  To : 20000000 Step =   1 Inc_x = 1 Inc_y = 1 Loops = 100
   SIZE       Flops
 20000000 :     2998.16 MBytes
155.00user 0.04system *2:21.85elapsed* 109%CPU (0avgtext+0avgdata 159064maxresident)k
0inputs+0outputs (0major+1524minor)pagefaults 0swaps

brada4 · 2016-01-16T20:55:55Z

Total benchmark time includes (mostly) rand() on main thread
Dropping measurement loop start a bit down in benchmarks/swap.c I can measure small (20%) speedup 2 threads vs 1

brada4 · 2016-01-17T10:11:19Z

Rigged benchmark to just test 1<<x cases, basically showing that 2 CPUs have huge to mild disadvantage until all caches are pouring over (AMD A10-5750M), and then it gives some marginal benefit after (after WHAT is hard to estimate). Incx incy, or 3rd CPU does not change result

@jeromerobert can you measure with yours?

bench_patch.txt
bench_csv.txt

brada4 · 2016-01-17T13:12:38Z

Some haswell has 128MB L4 cache I think it is safe to prime up 2nd thread right after that.

Ref OpenMathLib#731

brada4 · 2016-01-19T21:45:47Z

You are flattering me with avoiding thrashing my 2MB cache. (Warning: incomplete there is no ASM for L3 cache CPUID, does not respect shared caches)

#ifdef SMP
  nthreads = 2;

  //disable multi-thread when incx==0 or incy==0
  //In that case, the threads would be dependent.
  //prevent cache thrashing with multiple threads
  if (incx == 0 || incy == 0 || 2*n*sizeof(FLOAT)<=MAX(2*1024*1024,get_l2_size()/*+L3*/*num_cpu_avail(1)))
          nthreads = 1;

  if (nthreads == 1) {
#endif

njsmith mentioned this issue Jan 8, 2016

Suprisingly slow dsyrk on Mac #730

Closed

jeromerobert added a commit to jeromerobert/OpenBLAS that referenced this issue Jan 19, 2016

Disable multi-threading for small matrices in [z]ger

3ae30cd

Ref OpenMathLib#731

xianyi closed this as completed in 66eafb1 Jan 20, 2016

jakirkham mentioned this issue Apr 21, 2016

Add OpenBLAS conda-forge/staged-recipes#364

Merged

jeromerobert mentioned this issue May 13, 2016

multithreaded eigen decomposition slower than single-threaded version, with NumPy #873

Closed

martin-frbg mentioned this issue May 19, 2016

Deadlock in saxpy_ #883

Closed

martin-frbg mentioned this issue Jul 14, 2016

caffe + openblas not using cpu to the max #923

Closed

martin-frbg mentioned this issue Jan 21, 2017

Slower performance of OpenBLAS for inverse operation on ARMv7 #1045

Closed

martin-frbg mentioned this issue Jun 21, 2018

Threading threshold tuning needed for sgemm/dgemm #1622

Closed

This was referenced Jul 12, 2018

MRG: Add pip requirements.txt and Travis runs mne-tools/mne-python#5251

Merged

ENH: Use one thread in OpenBLAS scipy/scipy#9056

Merged

SebWouters mentioned this issue Dec 7, 2018

Tests failing with openblas SebWouters/CheMPS2#70

Closed

ghop02 mentioned this issue Nov 1, 2020

Set openblas num threads to 1 to improve multiprocessing performance act-now-coalition/covid-data-model#776

Merged

bertsky mentioned this issue Feb 12, 2021

ocrd_tesserocr processors waste CPU performance because of numpy blas threads OCR-D/ocrd_tesserocr#157

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pathological threading behavior in OpenBLAS -> makes my test suite 4x slower than if using reference BLAS #731

Pathological threading behavior in OpenBLAS -> makes my test suite 4x slower than if using reference BLAS #731

njsmith commented Jan 5, 2016

jeromerobert commented Jan 5, 2016

njsmith commented Jan 6, 2016

brada4 commented Jan 7, 2016

njsmith commented Jan 7, 2016

njsmith commented Jan 7, 2016

brada4 commented Jan 7, 2016

jeromerobert commented Jan 15, 2016

brada4 commented Jan 15, 2016

njsmith commented Jan 15, 2016

brada4 commented Jan 16, 2016

brada4 commented Jan 16, 2016

brada4 commented Jan 16, 2016

jeromerobert commented Jan 16, 2016

jeromerobert commented Jan 16, 2016

brada4 commented Jan 16, 2016

martin-frbg commented Jan 16, 2016

brada4 commented Jan 16, 2016

brada4 commented Jan 16, 2016

brada4 commented Jan 17, 2016

brada4 commented Jan 17, 2016

brada4 commented Jan 19, 2016

Pathological threading behavior in OpenBLAS -> makes my test suite 4x slower than if using reference BLAS #731

Pathological threading behavior in OpenBLAS -> makes my test suite 4x slower than if using reference BLAS #731

Comments

njsmith commented Jan 5, 2016

jeromerobert commented Jan 5, 2016

njsmith commented Jan 6, 2016

brada4 commented Jan 7, 2016

njsmith commented Jan 7, 2016

njsmith commented Jan 7, 2016

brada4 commented Jan 7, 2016

jeromerobert commented Jan 15, 2016

brada4 commented Jan 15, 2016

njsmith commented Jan 15, 2016

brada4 commented Jan 16, 2016

brada4 commented Jan 16, 2016

brada4 commented Jan 16, 2016

jeromerobert commented Jan 16, 2016

jeromerobert commented Jan 16, 2016

brada4 commented Jan 16, 2016

martin-frbg commented Jan 16, 2016

brada4 commented Jan 16, 2016

brada4 commented Jan 16, 2016

brada4 commented Jan 17, 2016

brada4 commented Jan 17, 2016

brada4 commented Jan 19, 2016