Hierarchical Parallelism
YAKL supports two-level hierarchical parallelism, and this capability is increasing over time. This is accomplished with `yakl::parallel_outer`, `yakl::parallel_inner`, `yakl::fence_inner`, and `yakl::single_inner`. The syntax necessarily increases in complexity with the choice of hierarchical parallelism. Please see a minimal example below.
```C++
parallel_outer( "blah" , Bounds<1>(n1) , YAKL_LAMBDA (int k, InnerHandler handler) {
  parallel_inner( Bounds<2>(n2,n3) , [&] (int j, int i) {
    arr3d(k,j,i) = 2.;
  } , handler );
  fence_inner( handler );
  parallel_inner( Bounds<2>(n2,n3) , [&] (int j, int i) {
    arr3d(k,j,i) = 3.;
  } , handler );
  fence_inner( handler );
  single_inner( [&] () {
    arr3d(k,0,0) = 0;
  } , handler );
} , LaunchConfig<n2*n3>() );
```
The `LaunchConfig<>()` object tells YAKL the size of the inner parallelism, which must be known at compile time. `parallel_outer` will create an `InnerHandler` object and pass it into the functor / lambda passed to `parallel_outer`. This object is then passed to `parallel_inner`, `fence_inner`, and `single_inner`.
`parallel_outer` maps to CUDA "grid"-level parallelism (or "gang"-level parallelism in OpenACC), and `parallel_inner` maps to CUDA "block"-level parallelism (or "vector"-level parallelism in OpenACC).
The `parallel_inner` function executes the contained lambda in parallel over all threads within a CUDA block. The `single_inner` function executes the contained lambda with only a single thread within the CUDA block. `fence_inner` synchronizes all threads within a CUDA block (e.g., `__syncthreads()` in CUDA).
Inner reductions are coming soon, and atomics will operate in the same manner at the inner level as they do elsewhere.
Note that you can collapse multiple loops into both the outer and the inner levels of parallelism, as the `Bounds<2>(n2,n3)` passed to `parallel_inner` above does.