Hierarchical Parallelism

Matt Norman edited this page Jul 13, 2022 · 6 revisions

YAKL supports two-level hierarchical parallelism, and this capability continues to expand over time. It is accomplished with yakl::parallel_outer, yakl::parallel_inner, yakl::fence_inner, and yakl::single_inner. Choosing hierarchical parallelism necessarily increases the syntactic complexity compared to a flat launch. Please see a minimal example below.

```cpp
parallel_outer( "blah" , Bounds<1>(n1) , YAKL_LAMBDA (int k, InnerHandler handler) {
  parallel_inner( Bounds<2>(n2,n3) , [&] (int j, int i) {
    arr3d(k,j,i) = 2.;
  } , handler );
  fence_inner( handler );
  parallel_inner( Bounds<2>(n2,n3) , [&] (int j, int i) {
    arr3d(k,j,i) = 3.;
  } , handler );
  fence_inner( handler );
  single_inner( [&] () {
    arr3d(k,0,0) = 0;
  } , handler );
} , LaunchConfig<n2*n3>() );
```

The LaunchConfig<>() object tells YAKL the size of the inner parallelism, which must be known at compile time. parallel_outer creates an InnerHandler object and passes it into the functor / lambda given to parallel_outer. That handler is then passed on to parallel_inner, fence_inner, and single_inner.

parallel_outer maps to CUDA "grid" level parallelism (or "gang" level parallelism in OpenACC), and parallel_inner maps to CUDA "block" level parallelism (or "vector" level parallelism in OpenACC).

The parallel_inner function executes the contained lambda in parallel over all threads within a CUDA block. The single_inner function executes the contained lambda with only a single thread within the CUDA block. fence_inner synchronizes all threads within a CUDA block (e.g., __syncthreads() in CUDA).

Inner reductions are coming soon. Atomics will operate in the same manner.

Note that you can collapse multiple loops into both the outer and inner levels of parallelism.