# KernelOps usage
The `KernelOps` struct is used to keep track of resources (such as shared memory) that may be used by kernels, and to allow targets to pass in arguments (such as shared memory pointers allocated at kernel launch) when the objects that use them are constructed.
The QUDA operations that are tracked by `KernelOps` are:

- `op_blockSync` (currently only in SYCL branch)
- `op_warp_combine` (currently only in SYCL branch)
- `thread_array`
- `ThreadLocalCache`
- `SharedMemoryCache`
- `BlockReduce` (currently only in SYCL branch)
Kernels using these operations need to inherit from the corresponding `KernelOps` type. For example (from https://github.com/lattice/quda/blob/develop/include/kernels/gauge_force.cuh#L47):
```c++
template <typename Arg> struct GaugeForce : KernelOps<thread_array<int, 4>> {
  const Arg &arg;
  template <typename... OpsArgs>
  constexpr GaugeForce(const Arg &arg, const OpsArgs &...ops) : KernelOpsT(ops...), arg(arg)
  {
  }
```
`GaugeForce` uses `thread_array<int, 4>`. Also note that the constructor now takes an additional optional argument pack, `const OpsArgs &...ops`, which gets passed into the `KernelOpsT` constructor (`KernelOpsT` is a member type of `KernelOps`, see https://github.com/lattice/quda/blob/develop/include/targets/generic/kernel_ops.h#L98).
Instantiating the `thread_array` object now requires passing in the `KernelOps` structure. Since it is a parent of the `GaugeForce` functor, passing in `*this` from inside the `operator()` routine suffices:
```c++
__device__ __host__ void operator()(int x_cb, int parity, int dir)
{
  ...
  thread_array<int, 4> dx {*this};
```
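To make the pattern concrete, here is a minimal, self-contained sketch of the same inheritance and constructor-forwarding structure. The types here are hypothetical stand-ins, not the actual QUDA definitions:

```c++
#include <cstdio>

// Stand-in for KernelOps: in QUDA this tracks resources such as shared memory.
template <typename... Ops> struct KernelOpsSketch {
  using KernelOpsT = KernelOpsSketch<Ops...>; // alias used in mem-initializer lists
};

// Stand-in for thread_array<int, 4>: constructed from the ops struct (*this).
struct thread_array_sketch {
  int data[4];
  template <typename Ops> thread_array_sketch(const Ops &) : data {} { }
};

struct Functor : KernelOpsSketch<thread_array_sketch> {
  int arg;
  template <typename... OpsArgs>
  constexpr Functor(int arg, const OpsArgs &...ops) : KernelOpsT(ops...), arg(arg) { }
  void operator()() const
  {
    thread_array_sketch dx {*this}; // pass the parent ops struct via *this
    std::printf("arg = %d, dx[0] = %d\n", arg, dx.data[0]);
  }
};

int main() { Functor {7}(); }
```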
Here's an example from https://github.com/lattice/quda/blob/develop/include/kernels/hisq_paths_force.cuh#L308:

```c++
template <typename Param> struct AllThreeAllLepageLinkOps {
  using Link = Matrix<complex<typename Param::Arg::real>, Param::Arg::nColor>;
  using Ops = KernelOps<ThreadLocalCache<Link>>;
};

template <typename Param> struct AllThreeAllLepageLink : AllThreeAllLepageLinkOps<Param>::Ops {
  ...
  using typename AllThreeAllLepageLinkOps<Param>::Ops::KernelOpsT;
  template <typename... OpsArgs>
  constexpr AllThreeAllLepageLink(const Param &param, const OpsArgs &...ops) : KernelOpsT(ops...), arg(param.arg)
  {
  }
  ...
  __device__ __host__ void operator()(int x_cb, int parity)
  {
    ...
    ThreadLocalCache<Link> Uab_cache {*this};
```
For convenience, a helper struct `AllThreeAllLepageLinkOps` is created to define the needed `KernelOps<ThreadLocalCache<Link>>` type, `AllThreeAllLepageLinkOps<Param>::Ops`, that `AllThreeAllLepageLink` inherits from. Again the instantiation of `ThreadLocalCache<Link>` must pass in the `KernelOps` struct contained in the functor, `*this`. In this case `ThreadLocalCache<Link>` has storage for a single `Link` variable per thread.
Note that here the `KernelOpsT` type must be explicitly declared,
`using typename AllThreeAllLepageLinkOps<Param>::Ops::KernelOpsT;`,
whereas in the previous example it wasn't necessary. This is because the `KernelOps` type being inherited from, `AllThreeAllLepageLinkOps<Param>::Ops`, depends on the template parameter `Param` and is thus a dependent type. In this case the `KernelOpsT` member is not automatically exposed to the child type (`struct AllThreeAllLepageLink`) and must be explicitly requested.
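This is standard C++ dependent-base name lookup. A minimal, self-contained illustration (generic stand-in types, not QUDA code):

```c++
template <typename... Ops> struct KernelOpsSketch {
  using KernelOpsT = KernelOpsSketch<Ops...>;
};

struct OpA { };

// Non-dependent base: KernelOpsT is found in the base automatically.
struct Simple : KernelOpsSketch<OpA> {
  Simple() : KernelOpsT() { }
};

template <typename T> struct OpsHelper {
  using Ops = KernelOpsSketch<T>;
};

// Dependent base: members of OpsHelper<T>::Ops are invisible until requested.
template <typename T> struct Kernel : OpsHelper<T>::Ops {
  using typename OpsHelper<T>::Ops::KernelOpsT; // without this, KernelOpsT is undeclared
  Kernel() : KernelOpsT() { }
};

int main()
{
  Simple s;
  Kernel<OpA> k;
}
```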
Another usage in the same file is https://github.com/lattice/quda/blob/develop/include/kernels/hisq_paths_force.cuh#L680:

```c++
template <typename Param> struct AllFiveAllSevenLinkOps {
  static constexpr int cache_len = Param::sig_positive ? 3 : 2;
  using Link = Matrix<complex<typename Param::Arg::real>, Param::Arg::nColor>;
  using Ops = KernelOps<ThreadLocalCache<Link, cache_len>>;
};

template <typename Param> struct AllFiveAllSevenLink : AllFiveAllSevenLinkOps<Param>::Ops {
  ...
  __device__ __host__ void operator()(int x_cb, int parity)
  {
    ...
    constexpr int cache_len = sig_positive ? 3 : 2;
    ThreadLocalCache<Link, cache_len> Matrix_cache {*this};
```
This is similar to above, except that `ThreadLocalCache<Link, cache_len>` contains an array of either 3 or 2 `Link` elements per thread. Note that `ThreadLocalCache<Link>` acts like a single `Link` object, while `ThreadLocalCache<Link, 1>` is like an array of one `Link`, which is semantically different (the array must be indexed to get at the `Link` object). As a special case, `ThreadLocalCache<Link, 0>` acts like a single `Link` object, exactly like `ThreadLocalCache<Link>`, as if the 0 weren't there. This is useful when specifying an offset (the optional third template parameter; see https://github.com/lattice/quda/wiki/KernelOps-usage#multiple-kernelops-types for more details).
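As a hypothetical illustration of the difference (this fragment is not taken from the QUDA sources; it assumes the single-object and array-style interfaces described above, and `Uab` and `Ubc` are hypothetical `Link` values):

```c++
ThreadLocalCache<Link> one {*this};
one = Uab * Ubc;  // acts like a single Link: assignable and convertible
Link a = one;

ThreadLocalCache<Link, 2> two {*this};
two[0] = Uab;     // array semantics: must be indexed
Link b = two[1];
```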
A simple example using `SharedMemoryCache` is from https://github.com/lattice/quda/blob/develop/include/kernels/dslash_clover_helper.cuh#L198:

```c++
template <typename Arg>
using NdegTwistCloverApplyOps
  = KernelOps<SharedMemoryCache<ColorSpinor<typename Arg::real, Arg::nColor, Arg::nSpin / 2>>>;

template <typename Arg> struct NdegTwistCloverApply : NdegTwistCloverApplyOps<Arg> {
  ...
  using half_fermion = ColorSpinor<typename Arg::real, Arg::nColor, Arg::nSpin / 2>;
  ...
  __device__ __host__ inline void operator()(int x_cb, int src_flavor, int parity)
  {
    ...
    SharedMemoryCache<half_fermion> cache {*this};
```
This creates a `SharedMemoryCache` with one `half_fermion` per thread, which can be accessed by other threads in the same block. `SharedMemoryCache` also allows one to specify cache dimensions that differ from the block dimensions.
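A hypothetical usage fragment follows. The save/sync/load-style method names are assumptions about the `SharedMemoryCache` interface (consult its definition in the QUDA sources for the actual API), and `chi` and `flavor` are hypothetical variables:

```c++
SharedMemoryCache<half_fermion> cache {*this};
cache.save(chi);                               // assumed: publish this thread's value to the block
cache.sync();                                  // assumed: barrier so all saves are visible
half_fermion other = cache.load_y(1 - flavor); // assumed: read a partner thread's value
```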
An example is in https://github.com/lattice/quda/blob/develop/include/kernels/block_transpose.cuh#L45:

```c++
template <typename Arg> struct BlockTransposeKernelOps {
  struct CacheDims {
    static constexpr dim3 dims(dim3 block)
    {
      block.x += 1;
      block.z = 1;
      return block;
    }
  };
  using color_spinor_t = ColorSpinor<typename Arg::real, 1, Arg::nSpin>;
  using CacheT = SharedMemoryCache<color_spinor_t, CacheDims>;
  using Ops = KernelOps<CacheT>;
};

template <typename Arg> struct BlockTransposeKernel : BlockTransposeKernelOps<Arg>::Ops {
  ...
  __device__ __host__ inline void operator()(int x_cb, int)
  {
    ...
    typename BlockTransposeKernelOps<Arg>::CacheT cache {*this};
```
The dimensions of the cache are obtained from the return value of the `CacheDims::dims` member function, which receives the current block size as an argument. Here, for example, a block of `dim3(32, 4, 2)` would give a cache of dimensions `(33, 4, 1)`; padding the x dimension by one element like this is a common technique for avoiding shared-memory bank conflicts in transpose-style kernels.
The `NoKernelOps` struct can be inherited from for kernels that don't use any of the tracked kernel operations. This generally isn't necessary, but it is useful in situations where a kernel only conditionally uses an operation, as in https://github.com/lattice/quda/blob/develop/include/kernels/dslash_domain_wall_m5.cuh#L212:
```c++
template <typename Arg, bool shared = false> struct d5Params {
  using Vec = ColorSpinor<typename Arg::real, Arg::nColor, mobius_m5::use_half_vector() ? 4 / 2 : 4>;
  using Cache = SharedMemoryCache<Vec>;
  using Ops = std::conditional_t<shared, KernelOps<Cache>, NoKernelOps>;
};
```
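A hypothetical sketch of how such a kernel might branch on the flag (this is not the actual dslash_domain_wall_m5.cuh code):

```c++
template <typename Arg, bool shared = false> struct D5Sketch : d5Params<Arg, shared>::Ops {
  ...
  __device__ __host__ inline void operator()(int x_cb, int s, int parity)
  {
    if constexpr (shared) {
      typename d5Params<Arg, shared>::Cache cache {*this}; // only valid when Ops includes the cache
      // ... exchange neighboring values through the cache ...
    } else {
      // ... purely thread-local path, no shared memory needed ...
    }
  }
};
```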
## Multiple KernelOps types

Multiple kernel operations can be included as template parameters to `KernelOps`. For example, `KernelOps<thread_array<int, 4>, SharedMemoryCache<Link>>` would be for a kernel that uses both a `thread_array<int, 4>` and a `SharedMemoryCache<Link>`. Note that these objects would overlap in shared memory, so they can only be safely used one at a time. For concurrent use of multiple operations, an offset must be given to one of the operations so that it occupies a separate region of shared memory. One example comes from the `ThreadLocalCache` use in https://github.com/lattice/quda/blob/develop/include/kernels/gauge_stout.cuh#L118:
```c++
template <typename Arg> struct OvrImpSTOUTOps {
  using real = typename Arg::Float;
  using Complex = complex<real>;
  using Link = Matrix<complex<real>, Arg::nColor>;
  using StapCacheT = ThreadLocalCache<Link, 0, computeStapleRectangleOps>; // offset by computeStapleRectangleOps
  using RectCacheT = ThreadLocalCache<Link, 0, StapCacheT>;                // offset by StapCacheT
  using Ops = combineOps<computeStapleRectangleOps, KernelOps<StapCacheT, RectCacheT>>;
};

template <typename Arg> struct OvrImpSTOUT : OvrImpSTOUTOps<Arg>::Ops {
  __device__ __host__ inline void operator()(int x_cb, int parity, int dir)
  {
    ...
    typename OvrImpSTOUTOps<Arg>::StapCacheT Stap {*this};
    typename OvrImpSTOUTOps<Arg>::RectCacheT Rect {*this};
```
`OvrImpSTOUT` calls `computeStapleRectangle`, which has its own operation requirements,

```c++
using computeStapleRectangleOps = KernelOps<thread_array<int, 4>>;
```

which must be usable concurrently with the `ThreadLocalCache`s that store the staple and rectangle. This is handled by offsetting the staple cache by `computeStapleRectangleOps` (which is `KernelOps<thread_array<int, 4>>`), and then offsetting the rectangle cache by the staple cache type `StapCacheT`. Note that `StapCacheT` includes the offset for the `thread_array`, so `RectCacheT` will automatically be offset by both the `thread_array<int, 4>` and the `ThreadLocalCache<Link>` for the staple. The offset type is always the last template parameter of the operation. Since `ThreadLocalCache` takes an optional second template parameter, an array length, we must specify it too. Here an array length of 0 is a special case that makes `ThreadLocalCache` act like a single `Link` object instead of an array of objects.
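Conceptually, the resulting per-block shared-memory layout looks like the following (a sketch of the intent, not literal allocation code):

```c++
// offset 0:                        thread_array<int, 4>  (from computeStapleRectangleOps)
// after thread_array:              StapCacheT, one Link per thread
// after thread_array + StapCacheT: RectCacheT, one Link per thread
//
// StapCacheT's type already carries the thread_array offset, so using
// StapCacheT as RectCacheT's offset accumulates both regions.
```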