OpenMP nested parallelism bug in BLIS #267
Comments
@fgvanzee So far I've been able to trace this as far as |
Interestingly the program runs just fine in valgrind, which is quite annoying. |
@devinamatthews Doesn't Valgrind pad allocations to detect overruns and thus may inadvertently fix the bug if the buffer overrun is by a small amount? I've found that Valgrind missed off-by-1 errors before. |
NVM, I recompiled and that went away and I got the behavior described by @jkd2016. Setting |
I'm not looking at the spec right now, but I'm not sure |
Wouldn't it have to return |
Or, we could test |
Detect when OpenMP uses fewer threads than requested and correct accordingly, so that we don't wait forever for nonexistent threads. Fixes #267.
@jkd2016 will you try the |
Well, it doesn't quite work yet: I forgot that the number of threads is also saved in the |
It should work now. |
@devinamatthews I'm having a hard time understanding your patch. Would you mind walking me through what it does? |
Basically it checks if the actual number of threads spawned is equal to the number of threads requested. If not, then it reinitializes the number of threads (expressed by and in the |
The problem was that BLIS expected to get N threads, but the omp parallel region only spawned one (because nesting was disabled). Then, any barrier will get stuck while that one thread waits on the N-1 other nonexistent ones. |
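A minimal sketch of the failure mode and the correction described above, in C. It is not the actual BLIS patch: `team_t`, `team_init`, `team_fix_size` and `spawn_and_check` are hypothetical names chosen for illustration. The only OpenMP call used is `omp_get_num_threads()`, and the OpenMP parts are guarded by `_OPENMP` so the file also builds serially.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for a thread team: a barrier built on this
   would release only after `expected` threads have arrived. */
typedef struct {
    int expected; /* number of threads a barrier waits for */
} team_t;

/* Record the number of threads the caller asked for. */
void team_init(team_t *team, int requested)
{
    assert(requested >= 1);
    team->expected = requested;
}

/* If the runtime delivered fewer threads than requested, shrink the
   team so that a barrier never waits for threads that do not exist.
   Returns 1 if a correction was made, 0 otherwise. */
int team_fix_size(team_t *team, int actual)
{
    assert(actual >= 1);
    if (actual < team->expected) {
        team->expected = actual;
        return 1;
    }
    return 0;
}

/* Request `requested` threads, observe how many were really spawned,
   and return the corrected team size. */
int spawn_and_check(int requested)
{
    team_t team;
    int actual = 1; /* serial build: exactly one thread */

    team_init(&team, requested);
#ifdef _OPENMP
    #pragma omp parallel num_threads(requested)
    {
        /* One thread records the real team size; with nesting
           disabled inside an outer region this is 1. */
        #pragma omp single
        actual = omp_get_num_threads();
    }
#endif
    team_fix_size(&team, actual);
    return team.expected;
}
```

Without the correction step, a barrier sized for the requested count would wait indefinitely on the threads that were never created, which matches the hang reported in this issue.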
I'm still not quite following. Is nesting disabled by default? Let's assume
resulted in two threads, each calling |
Second question: yes. @jkd2016 in the context of the full Elk program, I am guessing that you have OpenMP nesting explicitly enabled, right? If so, then there may be additional problems as evidenced by the segfault. If you could isolate them, then I'll take a whack at it. |
I've been testing for the last hour or so with different thread number combinations and all is well: no segfaults at all. |
@jkd2016 excellent. Let us know when you are confident enough that it is fixed and I'll merge. |
Hold off on merging. I'm going to add comments. :) |
Okay, comments added in 3612eca. |
I ran a few additional tests and found that with
the code runs extremely slowly, most likely because it's spawning 32*32 threads. With
It's much faster but not as fast as OpenBLAS, which seems to spawn fewer threads. It's getting late here, but I'll pick this up tomorrow using Elk's omp_hold and omp_free commands which strictly limit the total number of threads to the number of cores. Night-night. |
Finished testing. Everything works as expected. No segfaults, no thread over- or under-subscription, no hang-ups, and bli_thread_set_num_threads can now be called from Fortran. Here is a simple Fortran program illustrating how nested threading is handled in Elk (the required module is attached).
When omp_hold is called, Elk checks how many threads are idle and assigns the number of threads accordingly. This way the number of threads is never larger than the number of processors. Oddly enough, it's now OpenBLAS that's not working with this example: it seems to ignore calls to 'openblas_set_num_threads'. I'll go and bother them next. Incidentally, both gfortran and ifort report the number of processors equal to twice the number of physical cores because of hyperthreading. This always makes our numerical codes run more slowly than using the number of physical cores as the maximum number of threads. So BLIS is now a part of Elk. The next release should be in a few months, which gives me some time for testing, and maybe we can try to get some of the BLIS extensions to work as well (e.g. bli_zger). |
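The hold/free idea described above can be sketched in C as a simple thread budget. This is an illustration only, not Elk's module: `budget_t`, `budget_init`, `budget_hold` and `budget_free` are hypothetical names, and the sketch shows the bookkeeping serially. A real version would need to protect the counter with a critical section or atomics, since it is updated from inside parallel regions.

```c
#include <assert.h>

/* Hypothetical global budget: at most `max_threads` threads may be
   active at once (for example, the number of physical cores). */
typedef struct {
    int max_threads;
    int active; /* threads currently in use, including the caller */
} budget_t;

void budget_init(budget_t *b, int max_threads)
{
    assert(max_threads >= 1);
    b->max_threads = max_threads;
    b->active = 1; /* the initial thread is always running */
}

/* Ask for up to `wanted` threads for a parallel region. The caller
   already occupies one thread, so it can always continue serially.
   Returns the number of threads granted (at least 1). */
int budget_hold(budget_t *b, int wanted)
{
    int idle = b->max_threads - b->active;
    int granted = 1 + idle; /* caller's thread plus all idle ones */

    if (granted > wanted) granted = wanted;
    if (granted < 1) granted = 1;
    b->active += granted - 1;
    return granted;
}

/* Return the threads obtained from a matching budget_hold call. */
void budget_free(budget_t *b, int granted)
{
    b->active -= granted - 1;
    assert(b->active >= 1);
}
```

With a budget of 8, an outer region asking for 4 threads gets 4; a nested region asking for 32 then gets only 5 (its own thread plus the 4 still idle), so the total never exceeds 8 regardless of what each level requests.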
Have you tried setting the thread-to-core affinity via
Thanks for giving us another success story to tell! :) Please keep in touch with us going forward. |
Incidentally, both gfortran and ifort report the number of
processors equal to twice the number of physical cores because of
hyperthreading. This always makes our numerical codes run more slowly than
using the number of physical cores as the maximum number of threads.
Any OpenMP implementation should return the number of hardware threads rather than the number of physical cores because it is impossible to portably determine what is optimal and it is often workload dependent. On architectures like Blue Gene/Q and Knights Corner, one has to use more than one thread per core to saturate instruction throughput.
Have you tried setting the thread-to-core affinity via GOMP_CPU_AFFINITY?
Granted, this is a GNU-specific OpenMP feature, if I'm not mistaken. But,
in my experience, when you find the right mapping it seems to work well. In
your case, you would want a mapping that instantiates and binds each new
thread to a "hypercore" such that only one hypercore per physical core is
mapped.
OMP_PROC_BIND=SPREAD and OMP_PLACES=THREADS should work.
With KMP (Intel/LLVM), use KMP_AFFINITY=compact,granularity=fine and KMP_HW_SUBSET=1s,12c,1t for a 12-core CPU.
|
Here is a simple parallel code which is typical of that found in Elk:
The code was compiled with Intel Fortran version 18.0.0 using
and run on a 24 core Xeon E5-2680. Using
I get
With
the times are
Setting
gives
Finally, the Intel-only options
give the worst times of all:
Incidentally, setting
yields the warning:
From this example, and from comparing the overall run-times for Elk, we concluded that hyperthreading gave no advantage and even degraded the performance slightly. Thus Elk uses the number of CPUs reported by the OS and not the Fortran compiler. |
By Xeon E5-2680, I assume you mean Xeon E5-2680v3, which is a 12-core processor. You need to use The timing you report for In |
@jkd2016 In any case, please do not time allocate and deallocate in OpenMP parallel loops. I really hope Elk does not actually do this, because it is really not efficient, particularly with Intel Fortran, because |
@jkd2016 Also, in my experience, a threaded loop over sequential GEMM is slower than a sequential loop over threaded GEMM for relatively large matrices (e.g. dim=2000). |
Utterly impossible. The code is extremely complicated. If you want a taste, look at the parallel loop in the routine 'gwsefm' and then dive into 'gwsefmk'. Try doing all that without allocating new arrays. Since you work for Intel, here's a request: make allocating arrays in parallel loops as efficient as possible. |
The opposite is true for the code above: With
the best time is
Using
the best time is
|
Fair enough, but what environment variables should be set to beat the 21 second wall-clock time obtained with the vanilla options
|
@devinamatthews @jkd2016 Are you both good with the changes in the |
It is certainly not impossible. I have to perform such transformations on NWChem regularly, e.g. https://github.com/nwchemgit/nwchem/blob/master/src/ccsd/ccsd_itm_omp.F#L90, not just starting from Fortran allocatable, but as part of replacing the thread-unsafe stack allocator NWChem used by default (MA) with Fortran heap management while adding all of the OpenMP business along the way. Please rename 0001-hoist-allocate-outside-of-parallel-region.txt to 0001-hoist-allocate-outside-of-parallel-region.patch and
I made this request years ago. Unfortunately, lock-free memory allocators that also perform well outside of a multi-threaded context are not easy to implement. As for 1x24 vs 24x1, I would not try to run MKL across two sockets. What do you see for 1x12 vs 12x1 running within a single NUMA node? |
I forgot the second use of |
Fine by me. If any further issues crop up, I'll report them immediately. |
Thanks Jeff -- I really appreciate you looking into this. The main problem is not with the uppermost loop in gensefm, it's the subroutines called in gensefmk. These are complicated, have further nested loops and call additional subroutines which require their own allocatable arrays. I've tried where possible to use automatic arrays (when they are not too large), but it's impossible to pre-allocate all the required arrays in 'gensefm' and pass them down the subroutine tree to 'gensefmk' and beyond. (By 'impossible' I mean 'unfeasibly difficult'.) Nevertheless, I've simplified your patch down to this:
The timings in seconds are:
In the grand scheme of things, this is an acceptable overhead. (FYI: gensefmk is probably the most complicated routine in Elk and calculates a type of Feynman diagram. The subroutine that usually calls it (gwbandstr) typically runs for around 1000 CPU days) |
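For readers following the allocation discussion, here is a small sketch of what hoisting an allocation out of a parallel loop looks like. It is written in C rather than Fortran and is not the patch from this thread: `kernel`, `run_alloc_in_loop` and `run_alloc_hoisted` are hypothetical names, and the work done is a placeholder. Without `-fopenmp` the pragmas are ignored and both versions run serially.

```c
#include <assert.h>
#include <stdlib.h>

/* Placeholder for real work: fill a scratch buffer and reduce it. */
double kernel(double *scratch, size_t n, int i)
{
    double sum = 0.0;
    for (size_t k = 0; k < n; k++) {
        scratch[k] = (double)(i + 1);
        sum += scratch[k];
    }
    return sum;
}

/* Version A: allocate and free the scratch array in every iteration,
   so the allocator is hit `iters` times from inside the loop. */
double run_alloc_in_loop(int iters, size_t n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < iters; i++) {
        double *scratch = malloc(n * sizeof *scratch);
        assert(scratch != NULL);
        total += kernel(scratch, n, i);
        free(scratch);
    }
    return total;
}

/* Version B: each thread allocates its scratch array once, before the
   work-sharing loop, and frees it once afterwards. */
double run_alloc_hoisted(int iters, size_t n)
{
    double total = 0.0;
    #pragma omp parallel reduction(+:total)
    {
        double *scratch = malloc(n * sizeof *scratch);
        assert(scratch != NULL);
        #pragma omp for
        for (int i = 0; i < iters; i++)
            total += kernel(scratch, n, i);
        free(scratch);
    }
    return total;
}
```

Both versions compute the same result; the second only reduces the number of allocator calls from one per iteration to one per thread. As noted above, this is straightforward for a top-level loop but becomes much harder when the allocations happen deep inside called subroutines.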
There appears to be a threading issue with BLIS compiled with OpenMP and run inside a parallel nested loop.
It crashes the Elk code with a segmentation fault.
I can't reproduce the seg. fault with a small example, but the following program never terminates on our AMD machine:
The above code has to be compiled with
Both the number of OpenMP and BLIS threads have to be larger than 1:
If either OMP_NUM_THREADS=1 or BLIS_NUM_THREADS=1 then the code runs fine.
If BLIS is compiled with pthreads:
then the code also runs fine.
The official AMD release (https://developer.amd.com/amd-cpu-libraries/blas-library/) was compiled with OpenMP and therefore also has the bug.