# Hierarchical Parallelism
YAKL supports two-level hierarchical parallelism with `yakl::parallel_outer`, `yakl::parallel_inner`, `yakl::fence_inner`, and `yakl::single_inner`. Be warned that the syntax necessarily increases in complexity with hierarchical parallelism: there are computational concerns that YAKL can no longer shield the user from as it can with vanilla `parallel_for`. A small example follows:
```cpp
using yakl::c::parallel_outer;
using yakl::c::parallel_inner;
using yakl::c::Bounds;
using yakl::single_inner;
using yakl::fence_inner;

parallel_outer( "blah" , Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i, InnerHandler handler) {
  // Load data
  parallel_inner( nz , [&] (int k) {
    w_velocity(j,i,1+k) = state(idW,k,j,i);
  } , handler );
  fence_inner( handler );
  // Boundary conditions
  single_inner( [&] () {
    w_velocity(j,i,0   ) = 0;
    w_velocity(j,i,nz+1) = 0;
  } , handler );
  fence_inner( handler );
  // Compute momentum from velocity and pre-computed density
  parallel_inner( nz+2 , [&] (int k) {
    w_momentum(j,i,k) = w_velocity(j,i,k) * density(j,i,k);
  } , handler );
} , LaunchConfig<nz+2>() );  // largest inner loop has nz+2 iterations
```
Note that `parallel_outer` translates to simple `for` loops for the serial CPU backend and to OpenMP-threaded loops for the OpenMP CPU threading backend. `parallel_inner` translates to simple `for` loops for both the serial and OpenMP-threaded CPU backends, meaning it is not threaded under OpenMP CPU threading. `fence_inner` does nothing for the serial and OpenMP-threaded CPU backends.
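To make this concrete, here is a rough sketch, not YAKL's actual generated code, of what the example above reduces to on the serial CPU backend:

```cpp
// Conceptual serial-CPU lowering of the example above (illustration only):
// parallel_outer and parallel_inner become plain nested loops, single_inner's
// body runs once per outer iteration, and fence_inner disappears entirely.
for (int j=0; j < ny; j++) {
  for (int i=0; i < nx; i++) {
    // parallel_inner( nz , ... )
    for (int k=0; k < nz; k++) { w_velocity(j,i,1+k) = state(idW,k,j,i); }
    // single_inner: runs exactly once
    w_velocity(j,i,0   ) = 0;
    w_velocity(j,i,nz+1) = 0;
    // parallel_inner( nz+2 , ... )
    for (int k=0; k < nz+2; k++) { w_momentum(j,i,k) = w_velocity(j,i,k) * density(j,i,k); }
  }
}
```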
The `yakl::c::parallel_outer` and `yakl::fortran::parallel_outer` functions distribute parallel threads over a CUDA "grid" or an OpenACC "gang". This is the coarsest level of parallelism on GPUs. The function has the following signature:
```cpp
template <class F, int N, bool simple, int VecLen=YAKL_DEFAULT_VECTOR_LEN, bool B4B=false>
inline void parallel_outer( char const *str , Bounds<N,simple> const &bounds , F const &f ,
                            LaunchConfig<VecLen,B4B> config = LaunchConfig<>() );
```
Call `parallel_outer` as, e.g.,

```cpp
parallel_outer( "mylabel" , Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i, InnerHandler handler) {
  // My lambda code
} , LaunchConfig<inner_loop_size>() );
```

or

```cpp
parallel_outer( "mylabel" , Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i, InnerHandler handler) {
  // My lambda code
} , LaunchConfig<MAX_INNER_LOOP_SIZE>().set_inner_size(actual_inner_loop_size) );
```

Ensure that `actual_inner_loop_size <= MAX_INNER_LOOP_SIZE`.
Always accept an `InnerHandler` object after the outer loop indices in the functor or lambda you pass to `parallel_outer`. The lambda you pass to `parallel_outer` must be created with `YAKL_LAMBDA`.
As with the other `parallel_for` APIs, the leading string label `str` is optional and can be omitted. As always, you can pass an integer type or a loop bound expression `{lower_bound,upper_bound,stride}` in place of a `Bounds` or `SimpleBounds` object. `f` is the functor or lambda you're distributing across threads. The `LaunchConfig` object gives YAKL important information about the kernel launch.
The first template parameter of `LaunchConfig` tells YAKL the maximum size of the inner loop. As a template parameter, it must be known at compile time; YAKL requires this mostly for computational efficiency. It is highly recommended that you set it to a compile-time-known value that bounds the size of the inner parallel loop from above. If you do not set this template parameter, the maximum defaults to the default vector length for the given architecture, which may not be as large as you want. In CUDA, you will get a runtime failure at kernel launch if the inner loop size exceeds the `VecLen` template parameter, but with other architectures, for now, you get a silently wrong answer that will be very difficult to track down.
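For instance, here is a hedged sketch of the safe pattern when the inner extent is known only at runtime (`NZ_MAX` and the surrounding names are hypothetical, chosen for illustration):

```cpp
constexpr int NZ_MAX = 128;  // hypothetical compile-time cap on the inner extent
// nz is the runtime inner extent; nz <= NZ_MAX must hold, or CUDA fails at
// kernel launch and other backends may silently produce wrong answers.
parallel_outer( "kernel" , Bounds<2>(ny,nx) ,
    YAKL_LAMBDA (int j, int i, InnerHandler handler) {
  parallel_inner( nz , [&] (int k) {
    // ... per-level work ...
  } , handler );
} , LaunchConfig<NZ_MAX>().set_inner_size(nz) );
```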
There are two ways to tell YAKL the inner loop size. You can pass `LaunchConfig<max_vector_length>()`, which automatically sets the inner loop size to `max_vector_length` for you. Or you can pass `LaunchConfig<max_vector_length>().set_inner_size(actual_vector_length)` to `parallel_outer`, making sure that `actual_vector_length <= max_vector_length`.
The `InnerHandler` object you accept in the functor or lambda you pass to `parallel_outer` is needed by some of the backends, and therefore it must be used with all backends for portability. Pass this object to all `inner` routines, as in the example above.
The `yakl::c::parallel_inner` and `yakl::fortran::parallel_inner` functions distribute parallel threads over a CUDA "block" or an OpenACC "vector" (necessarily nested within a `parallel_outer` call). This is the finest level of parallelism on GPUs. It launches the passed lambda or functor over all inner-loop threads in parallel with no guarantee of execution order. The function has the following signature:
```cpp
template <class F, int N, bool simple>
inline void parallel_inner( Bounds<N,simple> const &bounds , F const &f , InnerHandler handler );
```
Call `parallel_inner` as, e.g.,

```cpp
parallel_inner( Bounds<2>(nz,nvars) , [&] (int k, int ll) {
  // My lambda code
} , handler );
```
The object `handler` is the `InnerHandler` object passed into the lambda called by `parallel_outer`.

**Important:** use `[&]` to create this C++ lambda rather than `YAKL_LAMBDA`. Execution is already on the device, so a simple capture-by-reference lambda is what you need. Most backends will give compile-time warnings or errors if you attempt to use `YAKL_LAMBDA` to create the lambda expression passed to `parallel_inner`.
As always, you can pass an integer type or a loop bound expression `{lower_bound,upper_bound,stride}` in place of a `Bounds` or `SimpleBounds` object.
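For instance, assuming C-style `{lower,upper}` bounds include the upper bound, these two calls would be equivalent:

```cpp
// Hedged sketch: assuming C-style {lower,upper} bounds are inclusive,
// both calls iterate k over 0 .. nz-1.
parallel_inner( nz , [&] (int k) {
  w_velocity(j,i,1+k) = state(idW,k,j,i);
} , handler );

parallel_inner( {0,nz-1} , [&] (int k) {
  w_velocity(j,i,1+k) = state(idW,k,j,i);
} , handler );
```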
The `yakl::c::single_inner` and `yakl::fortran::single_inner` functions execute a single thread at the same parallelism level as `parallel_inner` (necessarily nested within a `parallel_outer` call). `single_inner` does not synchronize threads before or after it runs, so unless the user adds `fence_inner(handler)` before and after, the other threads at the inner level keep executing in parallel with no guarantee of order. Therefore, `single_inner` is nearly always surrounded by `fence_inner` calls to ensure the single thread executes in isolation. The function has the following signature:
```cpp
template <class F>
inline void single_inner( F const &f , InnerHandler handler );
```
Call `single_inner` as, e.g.,

```cpp
yakl::fence_inner(handler);
single_inner( [&] () {
  // My lambda code
} , handler );
yakl::fence_inner(handler);
```
As with `parallel_inner`, the user must pass an `InnerHandler` object. The lambda passed to `single_inner` must accept zero parameters, and again, it must be created with `[&]`, not `YAKL_LAMBDA`.
The YAKL routine `inline void fence_inner( InnerHandler handler )` synchronizes threads within the inner level of parallelism (a CUDA "block" or OpenACC "vector"). It is equivalent to CUDA's and HIP's `__syncthreads()`, for instance.
`fence_inner` must be nested within a `parallel_outer` call, and you will likely get errors if you attempt to call it outside of a `parallel_outer` call with the GPU backends.
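To illustrate why the fence matters, here is a hedged sketch of a read-after-write pattern across inner threads (`w_grad` is a hypothetical array introduced only for this example):

```cpp
parallel_outer( "raw_example" , Bounds<2>(ny,nx) ,
    YAKL_LAMBDA (int j, int i, InnerHandler handler) {
  // Each inner thread writes its own element of w_velocity.
  parallel_inner( nz , [&] (int k) {
    w_velocity(j,i,k+1) = state(idW,k,j,i);
  } , handler );
  // Without this fence, the reads below could observe stale values written
  // by *other* inner threads.
  fence_inner( handler );
  // Each inner thread reads an element written by a neighboring thread
  // (w_velocity(j,i,0) is assumed to have been set earlier).
  parallel_inner( nz , [&] (int k) {
    w_grad(j,i,k) = w_velocity(j,i,k+1) - w_velocity(j,i,k);
  } , handler );
} , LaunchConfig<nz>() );  // assumes nz is known at compile time
```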