
OpenMP nested parallelism bug in BLIS #267

Closed
jkd2016 opened this issue Oct 11, 2018 · 35 comments

@jkd2016

jkd2016 commented Oct 11, 2018

There appears to be a threading issue when BLIS is compiled with OpenMP and called inside a nested parallel loop.

It crashes the Elk code with a segmentation fault.

I can't reproduce the seg. fault with a small example, but the following program never terminates on our AMD machine:

program test
implicit none
integer, parameter :: n=20
integer i
real(8), allocatable :: a(:,:),b(:,:),c(:,:)

!$OMP PARALLEL DEFAULT(SHARED) &
!$OMP PRIVATE(a,b,c)
!$OMP DO
do i=1,100
  allocate(a(n,n),b(n,n),c(n,n))
  a=1.d0
  b=1.d0
  call dgemm('N','N',n,n,n,1.d0,a,n,b,n,0.d0,c,n)
  deallocate(a,b,c)
end do
!$OMP END DO
!$OMP END PARALLEL

end program

The above code has to be compiled with

gfortran -fopenmp test.f90 lapack.a libblis.a

Both the number of OpenMP threads and the number of BLIS threads have to be larger than 1:

export OMP_NUM_THREADS=2
export BLIS_NUM_THREADS=2

If either OMP_NUM_THREADS=1 or BLIS_NUM_THREADS=1 then the code runs fine.

If BLIS is compiled with pthreads:

./configure -t pthreads zen

then the code also runs fine.

The official AMD release (https://developer.amd.com/amd-cpu-libraries/blas-library/) was compiled with OpenMP and therefore also has the bug.

@devinamatthews
Member

@fgvanzee So far I've been able to trace this as far as frame/3/bli_l3_packm.c where cntl_mem_p->pblk.buf_sys = 0x1, so almost definitely a memory pool problem. I'll keep digging.

@devinamatthews
Member

Interestingly the program runs just fine in valgrind, which is quite annoying.

@jeffhammond
Member

@devinamatthews Doesn't Valgrind pad allocations to detect overruns, and thus might it inadvertently hide the bug if the buffer overrun is small? I've seen Valgrind miss off-by-one errors before.

@devinamatthews
Member

Never mind: I recompiled, that went away, and I now see the behavior described by @jkd2016. Setting OMP_NESTED=TRUE makes it go away, so I think the fix is to use omp_get_max_threads() to limit how many threads are used (i.e. expected) in the OpenMP thread decorator.

@jeffhammond
Member

I'm not looking at the spec right now, but I'm not sure omp_get_max_threads() solves a problem caused by nested parallel regions. This function only tells you the maximum number of threads in a given parallel region (see the GCC docs).

@devinamatthews
Member

Wouldn't it have to return 1 if nesting (at this level) is disabled, though? Granted, there is always omp_get_nested(), but some implementations have a maximum nesting level, a maximum total number of threads, etc.

@devinamatthews
Member

Or, we could test omp_get_num_threads() after the fact and adjust accordingly.

devinamatthews added a commit that referenced this issue Oct 11, 2018
Detect when OpenMP uses fewer threads than requested and correct accordingly, so that we don't wait forever for nonexistent threads. Fixes #267.
@devinamatthews
Member

@jkd2016 will you try the nested-omp-patch branch and verify that it fixes your problem?

@devinamatthews
Member

Well, it doesn't quite work yet: I forgot that the number of threads is also saved in the rntm_t struct. But the threads are partitioned amongst the various loops there, so it's not easy to change willy-nilly. I'll make the patch handle only the cases where the number of threads we get is exactly what we asked for, or one.

@devinamatthews
Member

It should work now.

@fgvanzee
Member

@devinamatthews I'm having a hard time understanding your patch. Would you mind walking me through what it does?

@devinamatthews
Member

Basically, it checks whether the actual number of threads spawned is equal to the number of threads requested. If not, it reinitializes the number of threads (expressed in the n_threads, g_comm, and rntm variables). I specifically check whether the new number of threads is one, because if it is not, it would be somewhat more complicated to reinitialize the rntm structure properly.

@devinamatthews
Member

The problem was that BLIS expected to get N threads, but the omp parallel region only spawned one (because nesting was disabled). Then, any barrier will get stuck while that one thread waits on the N-1 other nonexistent ones.
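
For illustration, a minimal Fortran sketch of the same detect-and-fall-back idea (the actual fix lives in BLIS's C thread decorator; the names below are purely illustrative):

program detect_threads
use omp_lib
implicit none
integer, parameter :: nt_req=4
integer nt_act

! Request nt_req threads; the runtime may hand back fewer, e.g. when this
! region is nested inside another parallel region and nesting is disabled.
!$OMP PARALLEL NUM_THREADS(nt_req) PRIVATE(nt_act)
nt_act=omp_get_num_threads()
if (nt_act.lt.nt_req) then
  ! Fewer threads arrived than requested. Here we just report it; BLIS instead
  ! re-initializes its thread bookkeeping to the actual team size so that no
  ! barrier ends up waiting on threads that were never created.
  if (omp_get_thread_num().eq.0) print *,'got',nt_act,'of',nt_req,'threads'
end if
!$OMP END PARALLEL

end program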

@fgvanzee
Member

The problem was that BLIS expected to get N threads, but the omp parallel region only spawned one (because nesting was disabled). Then, any barrier will get stuck while that one thread waits on the N-1 other nonexistent ones.

I'm still not quite following. Is nesting disabled by default?

Let's assume OMP_NUM_THREADS = 2 and BLIS_NUM_THREADS = 2. You're saying that the following parallel region

!$OMP PARALLEL DEFAULT(SHARED) &
!$OMP PRIVATE(a,b,c)
!$OMP DO
do i=1,100
  allocate(a(n,n),b(n,n),c(n,n))
  a=1.d0
  b=1.d0
  call dgemm('N','N',n,n,n,1.d0,a,n,b,n,0.d0,c,n)
  deallocate(a,b,c)
end do
!$OMP END DO
!$OMP END PARALLEL

resulted in two threads, each calling dgemm(). Each call to dgemm() attempted to spawn one additional thread (for a total of two per dgemm()), but since nesting was disabled and OpenMP saw that two threads were already created at the application level, it didn't allow any additional threads within BLIS? (And then the issue with barriers cropped up, as you described.)
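
In other words, something like this minimal sketch (assuming omp_lib; not BLIS code, just a plain OpenMP illustration), where the inner team has size 1 unless nesting is enabled?

program nesting_check
use omp_lib
implicit none

! Nesting is often disabled by default. Uncommenting the next line (or setting
! OMP_NESTED=true / OMP_MAX_ACTIVE_LEVELS=2 in the environment) enables it.
! call omp_set_max_active_levels(2)

!$OMP PARALLEL NUM_THREADS(2)
!$OMP PARALLEL NUM_THREADS(2)
! Prints 1 per outer thread when nesting is disabled,
! and 2 per inner thread when nesting is enabled.
print *,'inner team size:',omp_get_num_threads()
!$OMP END PARALLEL
!$OMP END PARALLEL

end program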

@devinamatthews
Member

Second question: yes.
First question: it depends on the implementation.

@jkd2016 in the context of the full Elk program, I am guessing that you have OpenMP nesting explicitly enabled, right? If so, then there may be additional problems, as evidenced by the segfault. If you could isolate them, I'll take a whack at it.

@jkd2016
Author

jkd2016 commented Oct 11, 2018

Second question: yes.
First question: it depends on the implementation.

@jkd2016 in the context of the full Elk program, I am guessing that you have OpenMP nesting explicitly enabled, right? If so, then there may be additional problems, as evidenced by the segfault. If you could isolate them, I'll take a whack at it.

I've been testing for the last hour or so with different thread number combinations and all is well: no segfaults at all.

@devinamatthews
Member

@jkd2016 excellent. Let us know when you are confident enough that it is fixed and I'll merge.

@fgvanzee
Member

@jkd2016 excellent. Let us know when you are confident enough that it is fixed and I'll merge.

Hold off on merging. I'm going to add comments. :)

@fgvanzee
Member

Okay, comments added in 3612eca.

@jkd2016
Author

jkd2016 commented Oct 11, 2018

I ran a few additional tests and found that with

export OMP_NUM_THREADS=32
export BLIS_NUM_THREADS=32
export OMP_DYNAMIC=false

the code runs extremely slowly, most likely because it's spawning 32 × 32 = 1024 threads.

With

export OMP_DYNAMIC=true

it's much faster, but not as fast as OpenBLAS, which seems to spawn fewer threads.

It's getting late here, but I'll pick this up tomorrow using Elk's omp_hold and omp_free routines, which strictly limit the total number of threads to the number of cores.

Night-night.

@jkd2016
Author

jkd2016 commented Oct 12, 2018

Finished testing. Everything works as expected: no segfaults, no thread over- or under-subscription, no hang-ups, and bli_thread_set_num_threads can now be called from Fortran.

Here is a simple Fortran program illustrating how nested threading is handled in Elk (the required module is attached).

program test
use modomp
implicit none
integer, parameter :: n=2000,nloop=10
integer i,nthd1,nthd2
real(8), allocatable :: a(:,:),b(:,:),c(:,:)

maxthd=0

call omp_init

print *,'maxthd',maxthd

call omp_hold(nloop,nthd1)
print *,'nthd1',nthd1
!$OMP PARALLEL DEFAULT(SHARED) &
!$OMP PRIVATE(a,b,c,nthd2) &
!$OMP NUM_THREADS(nthd1)
!$OMP DO
do i=1,nloop
  allocate(a(n,n),b(n,n),c(n,n))
  a=1.d0
  b=1.d0
  call omp_hold(maxthd,nthd2)
  print *,'nthd2',nthd2
  call bli_thread_set_num_threads(nthd2)
  call dgemm('N','N',n,n,n,1.d0,a,n,b,n,0.d0,c,n)
  call omp_free(nthd2)
  call bli_thread_set_num_threads(1)
  deallocate(a,b,c)
end do
!$OMP END DO
!$OMP END PARALLEL
call omp_free(nthd1)

end program

When omp_hold is called, Elk checks how many threads are idle and assigns the number of threads accordingly. This way the number of threads is never larger than the number of processors.
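
For readers who don't open the attachment, here is a rough sketch of what such a reservation scheme might look like. This is not the actual modomp module, just a guess at the mechanism; only the names maxthd, omp_init, omp_hold and omp_free are taken from the example above.

module modomp_sketch
! Illustrative only: the real module is in the attached modomp.zip.
use omp_lib
implicit none
integer maxthd     ! maximum total number of threads allowed
integer nbusy      ! number of threads currently reserved

contains

subroutine omp_init
! a value of maxthd <= 0 means "use the OpenMP default"
if (maxthd.le.0) maxthd=omp_get_max_threads()
nbusy=1
end subroutine

subroutine omp_hold(nwant,nthd)
! reserve up to nwant of the currently idle threads; return the number granted
integer, intent(in) :: nwant
integer, intent(out) :: nthd
!$OMP CRITICAL(modomp_lock)
nthd=max(min(nwant,maxthd-nbusy+1),1)
nbusy=nbusy+nthd-1
!$OMP END CRITICAL(modomp_lock)
end subroutine

subroutine omp_free(nthd)
! return previously reserved threads to the pool
integer, intent(in) :: nthd
!$OMP CRITICAL(modomp_lock)
nbusy=nbusy-nthd+1
!$OMP END CRITICAL(modomp_lock)
end subroutine

end module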

Oddly enough, it's now OpenBLAS that's not working with this example: it seems to ignore calls to 'openblas_set_num_threads'. I'll go and bother them next.

Incidentally, both gfortran and ifort report a number of processors equal to twice the number of physical cores because of hyperthreading. This always makes our numerical codes run more slowly than using the number of physical cores as the maximum number of threads.

So BLIS is now a part of Elk. The next release should be in a few months, which gives me some time for testing, and maybe we can try to get some of the BLIS extensions working as well (e.g. bli_zger).

modomp.zip

@fgvanzee
Member

Incidentally, both gfortran and ifort report a number of processors equal to twice the number of physical cores because of hyperthreading. This always makes our numerical codes run more slowly than using the number of physical cores as the maximum number of threads.

Have you tried setting the thread-to-core affinity via GOMP_CPU_AFFINITY? Granted, this is a GNU-specific OpenMP feature, if I'm not mistaken. But, in my experience, when you find the right mapping it seems to work well. In your case, you would want a mapping that instantiates and binds each new thread to a "hypercore" such that only one hypercore per physical core is mapped.
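
For example, something along these lines (the numbering is only a guess: on many Linux systems logical CPUs 0-23 would be the first hardware thread of each physical core, but this should be checked with lscpu or /proc/cpuinfo first):

export GOMP_CPU_AFFINITY="0-23"

or, if the hyperthread siblings are interleaved with the physical cores,

export GOMP_CPU_AFFINITY="0-47:2"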

So BLIS is now a part of Elk. The next release should be in a few months, which gives me some time for testing, and maybe we can try to get some of the BLIS extensions working as well (e.g. bli_zger).

Thanks for giving us another success story to tell! :) Please keep in touch with us going forward.

@jeffhammond
Member

jeffhammond commented Oct 13, 2018 via email

@jkd2016
Author

jkd2016 commented Oct 14, 2018

Here is a simple parallel code which is typical of that found in Elk:

program test
implicit none
integer, parameter :: n=2000,nloop=1000
integer i
real(8), allocatable :: a(:,:),b(:,:),c(:,:)
!$OMP PARALLEL DEFAULT(SHARED) &
!$OMP PRIVATE(a,b,c)
!$OMP DO
do i=1,nloop
  allocate(a(n,n),b(n,n),c(n,n))
  a=1.d0
  b=1.d0
  call dgemm('N','N',n,n,n,1.d0,a,n,b,n,0.d0,c,n)
  deallocate(a,b,c)
end do
!$OMP END DO
!$OMP END PARALLEL
end program

The code was compiled with Intel Fortran version 18.0.0 using

ifort -O3 -ip -axCORE-AVX2,AVX,SSE4.2 -qopenmp test.f90 -L/cluster/intel_2018/mkl/lib/intel64/ -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread

and run on a 24-core Xeon E5-2680.

Using

export OMP_NUM_THREADS=24
export MKL_NUM_THREADS=1

I get

real	0m21.051s
user	8m0.550s
sys	0m2.342s

With

export OMP_NUM_THREADS=48

the times are

real	0m21.470s
user	16m41.245s
sys	0m3.736s

Setting

export OMP_PROC_BIND=SPREAD
export OMP_PLACES=THREADS

gives

real	0m21.712s
user	16m57.692s
sys	0m3.681s

Finally, the Intel-only options

export KMP_AFFINITY=compact,granularity=fine
export KMP_HW_SUBSET=1s,12c,1t

give the worst times of all:

real	0m39.500s
user	7m46.666s
sys	0m3.486s

Incidently, setting

export KMP_HW_SUBSET=1s,24c,1t

yields the warning:

OMP: Warning #242: KMP_HW_SUBSET ignored: too many cores requested.

For this example and comparing the overall run-times for Elk we concluded that hyperthreading gave no advantage and even degraded the performance slightly.

Thus Elk uses the number of CPUs reported by the OS and not the Fortran compiler.

@jeffhammond
Member

jeffhammond commented Oct 15, 2018

By Xeon E5-2680, I assume you mean the Xeon E5-2680v3, which is a 12-core processor. You need to use KMP_HW_SUBSET=2s,12c,1t to run on all the cores, or KMP_HW_SUBSET=2s,12c,2t to run on all the hardware threads.

The timing you report for KMP_HW_SUBSET is about what I'd expect given that this environment variable is ignored. KMP_AFFINITY=compact,granularity=fine is going to pack the hardware threads, and if you had OMP_NUM_THREADS=24 in your environment already, it would only run on one socket.

In KMP_HW_SUBSET, s=socket, c=core (per socket), t=thread (per core).

@jeffhammond
Member

@jkd2016 In any case, please do not time allocate and deallocate in OpenMP parallel loops. I really hope Elk does not actually do this, because it is really not efficient, particularly with Intel Fortran, because ALLOCATABLE requires a lock.

@jeffhammond
Member

@jkd2016 Also, in my experience, a threaded loop over sequential GEMM is slower than a sequential loop over threaded GEMM for relatively large matrices (e.g. dim=2000).

@jkd2016
Author

jkd2016 commented Oct 15, 2018

@jkd2016 In any case, please do not time allocate and deallocate in OpenMP parallel loops. I really hope Elk does not actually do this, because it is really not efficient, particularly with Intel Fortran, because ALLOCATABLE requires a lock.

Utterly impossible. The code is extremely complicated. If you want a taste, look at the parallel loop in the routine 'gwsefm' and then dive into 'gwsefmk'. Try doing all that without allocating new arrays.

Since you work for Intel, here's a request: make allocating arrays in parallel loops as efficient as possible.

@jkd2016
Author

jkd2016 commented Oct 15, 2018

@jkd2016 Also, in my experience, a threaded loop over sequential GEMM is slower than a sequential loop over threaded GEMM for relatively large matrices (e.g. dim=2000).

The opposite is true for the code above:

With

export OMP_NUM_THREADS=24
export MKL_NUM_THREADS=1

the best time is

real	0m21.148s
user	8m5.911s
sys	0m11.487s

Using

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=24

the best time is

real	0m48.207s
user	18m33.178s
sys	0m41.911s

@jkd2016
Author

jkd2016 commented Oct 15, 2018

By Xeon E5-2680, I assume you mean the Xeon E5-2680v3, which is a 12-core processor. You need to use KMP_HW_SUBSET=2s,12c,1t to run on all the cores, or KMP_HW_SUBSET=2s,12c,2t to run on all the hardware threads.

The timing you report for KMP_HW_SUBSET is about what I'd expect given that this environment variable is ignored. KMP_AFFINITY=compact,granularity=fine is going to pack the hardware threads, and if you had OMP_NUM_THREADS=24 in your environment already, it would only run on one socket.

In KMP_HW_SUBSET, s=socket, c=core (per socket), t=thread (per core).

Fair enough, but what environment variables should be set to beat the 21-second wall-clock time obtained with the vanilla options

export OMP_NUM_THREADS=24
export MKL_NUM_THREADS=1

@fgvanzee
Member

@devinamatthews @jkd2016 Are you both good with the changes in the nested-omp-patch branch? I'm preparing a new version release and would like to merge it, if it's ready.

@jeffhammond
Member

jeffhammond commented Oct 15, 2018

@jkd2016

Utterly impossible. The code is extremely complicated. If you want a taste, look at the parallel loop in the routine 'gwsefm' and then dive into 'gwsefmk'. Try doing all that without allocating new arrays.

It is certainly not impossible. I have to perform such transformations on NWChem regularly, e.g. https://github.com/nwchemgit/nwchem/blob/master/src/ccsd/ccsd_itm_omp.F#L90, not just starting from Fortran allocatable, but as part of replacing the thread-unsafe stack allocator NWChem used by default (MA) with Fortran heap management while adding all of the OpenMP business along the way.

Please rename 0001-hoist-allocate-outside-of-parallel-region.txt to 0001-hoist-allocate-outside-of-parallel-region.patch and apply it with git am to see whether this is a valid transformation to hoist the allocation and deallocation of se outside of the parallel loop.

Since you work for Intel, here's a request: make allocating arrays in parallel loops as efficient as possible.

I made this request years ago. Unfortunately, lock-free memory allocators that also perform well outside of a multi-threaded context are not easy to implement.

As for 1x24 vs 24x1, I would not try to run MKL across two sockets. What do you see for 1x12 vs 12x1 running within a single NUMA node?
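
For instance (numactl assumed to be available; check the node and CPU numbering with numactl --hardware first):

# 12 outer threads x 1 MKL thread, pinned to socket 0
OMP_NUM_THREADS=12 MKL_NUM_THREADS=1 numactl --cpunodebind=0 --membind=0 ./a.out

# 1 outer thread x 12 MKL threads, pinned to socket 0
OMP_NUM_THREADS=1 MKL_NUM_THREADS=12 numactl --cpunodebind=0 --membind=0 ./a.out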

@jeffhammond
Member

I forgot the second use of se. New patch:
0001-hoist-allocate-outside-of-parallel-region.txt

@jkd2016
Author

jkd2016 commented Oct 16, 2018

@devinamatthews @jkd2016 Are you both good with the changes in the nested-omp-patch branch? I'm preparing a new version release and would like to merge it, if it's ready.

Fine by me. If any further issues crop up, I'll report them immediately.

@jkd2016
Author

jkd2016 commented Oct 16, 2018

I forgot the second use of se. New patch:
0001-hoist-allocate-outside-of-parallel-region.txt

Thanks Jeff -- I really appreciate you looking into this.

The main problem is not with the uppermost loop in gensefm; it's with the subroutines called in gensefmk. These are complicated, have further nested loops, and call additional subroutines that require their own allocatable arrays.

I've tried where possible to use automatic arrays (when they are not too large), but it's impossible to pre-allocate all the required arrays in 'gensefm' and pass them down the subroutine tree to 'gensefmk' and beyond. (By 'impossible' I mean 'unfeasibly difficult'.)

Nevertheless, I've simplified your patch down to this:

program test
implicit none
integer, parameter :: n=2000,nloop=1000
integer nt,tid,i
integer(8) c0,c1,cr
real(8), allocatable :: a(:,:,:),b(:,:,:),c(:,:,:)
integer omp_get_max_threads,omp_get_thread_num

call system_clock(c0)
!$OMP PARALLEL DEFAULT(SHARED) &
!$OMP PRIVATE(a,b,c)
!$OMP DO
do i=1,nloop
  allocate(a(n,n,1),b(n,n,1),c(n,n,1))
  a(:,:,1)=1.d0
  b(:,:,1)=1.d0
  call dgemm('N','N',n,n,n,1.d0,a,n,b,n,0.d0,c,n)
  deallocate(a,b,c)
end do
!$OMP END DO
!$OMP END PARALLEL
call system_clock(c1,cr)
print *,dble(c1-c0)/dble(cr)


call system_clock(c0)
nt=omp_get_max_threads()
allocate(a(n,n,nt),b(n,n,nt),c(n,n,nt))
!$OMP PARALLEL DEFAULT(SHARED) &
!$OMP PRIVATE(tid)
!$OMP DO
do i=1,nloop
  tid=omp_get_thread_num()+1
  a(:,:,tid)=1.d0
  b(:,:,tid)=1.d0
  call dgemm('N','N',n,n,n,1.d0,a(:,:,tid),n,b(:,:,tid),n,0.d0,c(:,:,tid),n)
end do
!$OMP END DO
!$OMP END PARALLEL
call system_clock(c1,cr)
print *,dble(c1-c0)/dble(cr)

end program

The timings in seconds are:

   21.9067180000000 
   20.4162200000000

In the grand scheme of things, this is an acceptable overhead.

(FYI: gensefmk is probably the most complicated routine in Elk and calculates a type of Feynman diagram. The subroutine that usually calls it (gwbandstr) typically runs for around 1000 CPU days)
