
The KernelOps struct is used to keep track of resources (such as shared memory) that may be used by kernels and to allow targets to pass in arguments (that could contain things like shared memory pointers that were allocated at kernel launch) when objects that use them are constructed.

The QUDA operations that are tracked by KernelOps are

  • op_blockSync (currently only in SYCL branch)
  • op_warp_combine (currently only in SYCL branch)
  • thread_array
  • ThreadLocalCache
  • SharedMemoryCache
  • BlockReduce (currently only in SYCL branch)

Introduction and thread_array example

Kernels using these operations need to inherit from the corresponding KernelOps type. For example (from https://github.com/lattice/quda/blob/develop/include/kernels/gauge_force.cuh#L47)

template <typename Arg> struct GaugeForce : KernelOps<thread_array<int, 4>> {
    const Arg &arg;
    template <typename... OpsArgs>
    constexpr GaugeForce(const Arg &arg, const OpsArgs &...ops) : KernelOpsT(ops...), arg(arg)
    {
    }

GaugeForce uses thread_array<int, 4>. Also note that the constructor now takes additional optional arguments, const OpsArgs &...ops, which are passed on to the KernelOpsT constructor (KernelOpsT is a member type of KernelOps, https://github.com/lattice/quda/blob/develop/include/targets/generic/kernel_ops.h#L98).
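
For reference, the essential structure can be sketched roughly as follows. This is an illustrative simplification, not the actual definition in kernel_ops.h: the point is only that KernelOps collects the tracked operation types, exposes a KernelOpsT member type (shown here, as an assumption, as a self-alias), and has a constructor that accepts whatever target-specific arguments are forwarded at kernel launch.

  // Illustrative sketch only -- not the actual QUDA definition in kernel_ops.h.
  template <typename... T> struct KernelOpsSketch {
    using KernelOpsT = KernelOpsSketch<T...>; // member type used by functor constructors (self-alias is an assumption)
    template <typename... Args> constexpr KernelOpsSketch(const Args &...) { } // accepts target-supplied arguments
  };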

Instantiating the thread_array object now requires passing in the KernelOps structure. Since this is a parent of the GaugeForce functor, passing in *this from inside the operator() routine suffices

    __device__ __host__ void operator()(int x_cb, int parity, int dir)
    {
      ...
      thread_array<int, 4> dx {*this};
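
The dx object can then be indexed like an ordinary per-thread array of four ints in the rest of the kernel body.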

ThreadLocalCache examples

Here's an example from https://github.com/lattice/quda/blob/develop/include/kernels/hisq_paths_force.cuh#L308

    template <typename Param> struct AllThreeAllLepageLinkOps {
      using Link = Matrix<complex<typename Param::Arg::real>, Param::Arg::nColor>;
      using Ops = KernelOps<ThreadLocalCache<Link>>;
    };

    template <typename Param> struct AllThreeAllLepageLink : AllThreeAllLepageLinkOps<Param>::Ops {
      ...
      using typename AllThreeAllLepageLinkOps<Param>::Ops::KernelOpsT;
      template <typename... OpsArgs>
      constexpr AllThreeAllLepageLink(const Param &param, const OpsArgs &...ops) : KernelOpsT(ops...), arg(param.arg)
      {
      }
      ...
      __device__ __host__ void operator()(int x_cb, int parity)
      {
        ...
        ThreadLocalCache<Link> Uab_cache {*this};

For convenience, a helper struct AllThreeAllLepageLinkOps is created to define the KernelOps<ThreadLocalCache<Link>> type that AllThreeAllLepageLink needs to inherit from; this type is available as AllThreeAllLepageLinkOps<Param>::Ops. Again, the instantiation of ThreadLocalCache<Link> must pass in the KernelOps struct, which is contained in the functor, so *this is used. In this case ThreadLocalCache<Link> has storage for a single Link variable per thread.

Note that here the KernelOpsT type must be explicitly brought into scope

    using typename AllThreeAllLepageLinkOps<Param>::Ops::KernelOpsT;

whereas in the previous example, it wasn't necessary. This is because the KernelOps being inherited from

AllThreeAllLepageLinkOps<Param>::Ops

depends on the parameter Param and is thus a dependent type. In this case the KernelOpsT member is not automatically exposed to the child type (struct AllThreeAllLepageLink) and must be explicitly requested.
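
This is standard C++ behavior for dependent base classes and is independent of QUDA; a minimal standalone illustration is

    template <typename T> struct Base {
      using type = T;
    };

    template <typename T> struct Helper {
      using Ops = Base<T>;
    };

    struct NonDependent : Base<int> {
      type x; // OK: the base is not a dependent type, so its member type is found automatically
    };

    template <typename T> struct Dependent : Helper<T>::Ops {
      using typename Helper<T>::Ops::type; // required: the base depends on T, so its members must be requested explicitly
      type x;
    };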

Another usage in this file is https://github.com/lattice/quda/blob/develop/include/kernels/hisq_paths_force.cuh#L680

    template <typename Param> struct AllFiveAllSevenLinkOps {
      static constexpr int cache_len = Param::sig_positive ? 3 : 2;
      using Link = Matrix<complex<typename Param::Arg::real>, Param::Arg::nColor>;
      using Ops = KernelOps<ThreadLocalCache<Link, cache_len>>;
    };

    template <typename Param> struct AllFiveAllSevenLink : AllFiveAllSevenLinkOps<Param>::Ops {
      ...
      __device__ __host__ void operator()(int x_cb, int parity)
      {
        ...
        constexpr int cache_len = sig_positive ? 3 : 2;
        ThreadLocalCache<Link, cache_len> Matrix_cache {*this};

This is similar to above, except that ThreadLocalCache<Link, cache_len> contains an array of either 3 or 2 Link elements per thread.

Note that ThreadLocalCache<Link> acts like a single Link object, while ThreadLocalCache<Link, 1> acts like an array of one Link, which is semantically different (the array must be indexed to get the Link object). As a special case, ThreadLocalCache<Link, 0> acts as a single Link object, exactly as if the 0 weren't there, i.e. like ThreadLocalCache<Link>. This is useful when specifying an offset (the optional third template parameter; see https://github.com/lattice/quda/wiki/KernelOps-usage#multiple-kernelops-types for more details).
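
As a rough sketch of the distinction, assuming (as the descriptions above suggest) that the single-object forms can be used directly as a Link while the array form is accessed by indexing:

    ThreadLocalCache<Link> U_single {*this};   // behaves like a single Link
    ThreadLocalCache<Link, 0> U_zero {*this};  // length 0: also behaves like a single Link
    ThreadLocalCache<Link, 1> U_array {*this}; // array of one Link: must be indexed

    Link a = U_single;   // used directly as a Link (assumed accessor behavior)
    Link b = U_zero;     // same as above
    Link c = U_array[0]; // the array form needs an index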

SharedMemoryCache examples

A simple example is from https://github.com/lattice/quda/blob/develop/include/kernels/dslash_clover_helper.cuh#L198

  template <typename Arg>
  using NdegTwistCloverApplyOps
    = KernelOps<SharedMemoryCache<ColorSpinor<typename Arg::real, Arg::nColor, Arg::nSpin / 2>>>;

  template <typename Arg> struct NdegTwistCloverApply : NdegTwistCloverApplyOps<Arg> {
    ...
    using half_fermion = ColorSpinor<typename Arg::real, Arg::nColor, Arg::nSpin / 2>;
    ...
    __device__ __host__ inline void operator()(int x_cb, int src_flavor, int parity)
    {
      ...
      SharedMemoryCache<half_fermion> cache {*this};

This creates a SharedMemoryCache with one half_fermion per thread (that can be accessed by other threads in the same block).
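
A typical pattern is to save a per-thread value, synchronize the block, and then read a value written by another thread. The following is a hedged sketch of that pattern; the save/sync/load_y member names and the index used here should be checked against the actual SharedMemoryCache interface.

      SharedMemoryCache<half_fermion> cache {*this};
      half_fermion out;
      // ... compute out for this thread ...
      cache.save(out);                               // store this thread's value for the block
      cache.sync();                                  // wait until all threads in the block have written
      half_fermion other = cache.load_y(1 - flavor); // read a value written by another thread (illustrative index)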

SharedMemoryCache also allows one to specify cache dimensions that are different from the block dimensions. An example is in https://github.com/lattice/quda/blob/develop/include/kernels/block_transpose.cuh#L45

  template <typename Arg> struct BlockTransposeKernelOps {
    struct CacheDims {
      static constexpr dim3 dims(dim3 block)
      {
        block.x += 1;
        block.z = 1;
        return block;
      }
    };
    using color_spinor_t = ColorSpinor<typename Arg::real, 1, Arg::nSpin>;
    using CacheT = SharedMemoryCache<color_spinor_t, CacheDims>;
    using Ops = KernelOps<CacheT>;
  };

  template <typename Arg> struct BlockTransposeKernel : BlockTransposeKernelOps<Arg>::Ops {
    ...
    __device__ __host__ inline void operator()(int x_cb, int)
    {
      ...
      typename BlockTransposeKernelOps<Arg>::CacheT cache {*this};

The dimensions of the cache are obtained from the return value of the CacheDims::dims member function, which gets the current block size as an argument.

NoKernelOps

The NoKernelOps struct can be inherited from for kernels that don't use any of the tracked kernel operations. This generally isn't necessary, but it is useful in situations where a kernel only conditionally uses an operation; see https://github.com/lattice/quda/blob/develop/include/kernels/dslash_domain_wall_m5.cuh#L212

  template <typename Arg, bool shared = false> struct d5Params {
    using Vec = ColorSpinor<typename Arg::real, Arg::nColor, mobius_m5::use_half_vector() ? 4 / 2 : 4>;
    using Cache = SharedMemoryCache<Vec>;
    using Ops = std::conditional_t<shared, KernelOps<Cache>, NoKernelOps>;
  };
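
A kernel functor can then inherit from the conditional Ops type and only construct the cache when shared is enabled. The following is a hypothetical sketch of this pattern (the functor name and body are illustrative, and it assumes the conditional base exposes a usable KernelOpsT in both cases):

  template <typename Arg, bool shared = false> struct D5Sketch : d5Params<Arg, shared>::Ops {
    using typename d5Params<Arg, shared>::Ops::KernelOpsT; // dependent base, so brought into scope explicitly
    const Arg &arg;
    template <typename... OpsArgs>
    constexpr D5Sketch(const Arg &arg, const OpsArgs &...ops) : KernelOpsT(ops...), arg(arg) { }

    __device__ __host__ void operator()(int x_cb, int s, int parity)
    {
      if constexpr (shared) {
        typename d5Params<Arg, shared>::Cache cache {*this}; // only constructed when Ops includes the cache
        // ...
      }
      // ...
    }
  };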

Multiple KernelOps types

Multiple kernel operations can be included as template parameters to KernelOps. For example

KernelOps<thread_array<int,4>, SharedMemoryCache<Link>>

would be for a kernel that uses both a thread_array<int, 4> and a SharedMemoryCache<Link>. Note that these objects would overlap in shared memory, so they can only safely be used one at a time.
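
A hypothetical sketch of such non-concurrent use, where each operation is constructed and used in its own scope so that the overlapping shared memory is never needed by both at the same time:

  __device__ __host__ void operator()(int x_cb, int parity)
  {
    {
      thread_array<int, 4> dx {*this}; // uses the shared-memory region
      // ... finish all use of dx here ...
    }
    {
      SharedMemoryCache<Link> cache {*this}; // reuses the same region once dx is out of scope
      // ...
    }
  }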

For concurrent use of multiple operations, an offset must be given to one of the operations, so that it occupies a separate region of shared memory. One example comes from ThreadLocalCache use in https://github.com/lattice/quda/blob/develop/include/kernels/gauge_stout.cuh#L118

  template <typename Arg> struct OvrImpSTOUTOps {
    using real = typename Arg::Float;
    using Complex = complex<real>;
    using Link = Matrix<complex<real>, Arg::nColor>;
    using StapCacheT = ThreadLocalCache<Link, 0, computeStapleRectangleOps>; // offset by computeStapleRectangleOps
    using RectCacheT = ThreadLocalCache<Link, 0, StapCacheT>;                // offset by StapCacheT
    using Ops = combineOps<computeStapleRectangleOps, KernelOps<StapCacheT, RectCacheT>>;
  };

  template <typename Arg> struct OvrImpSTOUT : OvrImpSTOUTOps<Arg>::Ops {
    __device__ __host__ inline void operator()(int x_cb, int parity, int dir)
    {
      ...
      typename OvrImpSTOUTOps<Arg>::StapCacheT Stap {*this};
      typename OvrImpSTOUTOps<Arg>::RectCacheT Rect {*this};

OvrImpSTOUT calls computeStapleRectangle, which has its own operation requirements

  using computeStapleRectangleOps = KernelOps<thread_array<int, 4>>;

which needs to be used concurrently with the ThreadLocalCache objects that store the staple and rectangle. This is handled by offsetting the staple cache by computeStapleRectangleOps (which is KernelOps<thread_array<int, 4>>), and then offsetting the rectangle cache by the staple cache type StapCacheT. Note that StapCacheT includes the offset for the thread_array, so RectCacheT is automatically offset by both the thread_array<int, 4> and the ThreadLocalCache<Link> for the staple.

The offset type is always the last template parameter of the operation. Since ThreadLocalCache takes an optional second template argument, the array length, it must also be specified when an offset is given. Here the array length of 0 is the special case that makes ThreadLocalCache act like a single Link object instead of an array of objects.
