
Implement GPU MinLoc reduction #2882

Closed
wants to merge 2 commits

Conversation

eschnett
Contributor

Summary

I am looking for a GPU-enabled MinLoc reduction operator. This PR provides a proof-of-concept implementation. I am looking for feedback.

Additional background

To implement MinLoc, I view the quantity that is to be reduced as a two-element tuple: (1) the quantity that is actually reduced, and (2) an additional, arbitrary payload (the location). I add the corresponding functions that handle tuples. For example, the Less operation compares only the first element of the tuple, whereas a __shfl_down_sync needs to shuffle both elements of the tuple.

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • change answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@WeiqunZhang
Member

What's the size of the tuple (i.e., sizeof(Tuple))? How do you plan to use it? Is this going to be inside a kernel that also does some other jobs, or a function that does MinLoc only? Instead of using atomics, you could save the block-reduce results in device memory and then launch a second kernel with a single block to further reduce them. In our experience, this is faster than using atomics built with atomicCAS.
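The two-pass scheme described above can be sketched on the host (this is my own illustration, not AMReX code): pass 1 reduces each "block" of the input to a partial result stored in a scratch buffer (device memory on the GPU); pass 2 plays the role of the single-block kernel that reduces the partials. No atomics are involved.

```cpp
#include <vector>
#include <limits>
#include <algorithm>
#include <cstddef>

// Pass 1: each block of `block` elements produces one partial minimum.
// Pass 2: a single "block" reduces the partials to the final answer.
double two_pass_min (std::vector<double> const& a, std::size_t block)
{
    std::vector<double> partials; // stand-in for the device scratch buffer
    for (std::size_t b = 0; b < a.size(); b += block) {          // pass 1
        double m = std::numeric_limits<double>::max();
        std::size_t e = std::min(b + block, a.size());
        for (std::size_t i = b; i < e; ++i) { m = std::min(m, a[i]); }
        partials.push_back(m);
    }
    double m = std::numeric_limits<double>::max();               // pass 2
    for (double p : partials) { m = std::min(m, p); }
    return m;
}
```

On the GPU, pass 1 would be one kernel launch writing one partial per thread block, and pass 2 a second launch with a single block.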

@eschnett
Contributor Author

The size of the tuple would be 2 (although that could be generalized). The first element is the value, the second the location (stored as a single integer, probably 64 bits). The value to be reduced would be, e.g., tuple{val, index} (calculated on each thread), and that would also be the reduction result. The index in the result can then be converted back into an (i, j, k, n) tuple.
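The index bookkeeping mentioned above might look like the following sketch (the names `Cell`, `pack`, and `unpack` are hypothetical): a cell (i, j, k, n) on an nx-by-ny-by-nz box is packed into one 64-bit integer on each thread, and the reduction result is unpacked again afterwards.

```cpp
#include <cstdint>

struct Cell { int i, j, k, n; };

// Pack (i,j,k,n) into a single 64-bit linear index.
int64_t pack (Cell c, int64_t nx, int64_t ny, int64_t nz) {
    return c.i + nx*(c.j + ny*(int64_t(c.k) + nz*int64_t(c.n)));
}

// Invert the packing to recover (i,j,k,n) from the reduction result.
Cell unpack (int64_t idx, int64_t nx, int64_t ny, int64_t nz) {
    Cell c;
    c.i = int(idx % nx); idx /= nx;
    c.j = int(idx % ny); idx /= ny;
    c.k = int(idx % nz); idx /= nz;
    c.n = int(idx);
    return c;
}
```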

This would be used inside a ParReduce construct, together with a few other reductions.

We can use your suggestions instead, implementing CUDA code for our combined reduction.

@WeiqunZhang
Member

If index is 64 bits, the total size of val and index will be more than 64 bits, so you won't be able to atomically write/read the whole tuple. ParReduce uses a two-pass approach that does not use atomics.

@eschnett
Contributor Author

I thought I implemented just the functionality that ParReduce requires. If atomics are not necessary, then even better.

@WeiqunZhang
Member

I don't think you need to implement those things. It seems that all you need is the following, so that amrex's reduce function knows how to initialize the tuple to its maximum value. It seems to work; it uses cub::BlockReduce.

diff --git a/Src/Base/AMReX_Reduce.H b/Src/Base/AMReX_Reduce.H
index 9c07b7b4a..2dd241714 100644
--- a/Src/Base/AMReX_Reduce.H
+++ b/Src/Base/AMReX_Reduce.H
@@ -133,7 +133,12 @@ struct ReduceOpMin
     void local_update (T& d, T const& s) const noexcept { d = amrex::min(d,s); }
 
     template <typename T>
-    constexpr void init (T& t) const noexcept { t = std::numeric_limits<T>::max(); }
+    constexpr std::enable_if_t<std::numeric_limits<T>::is_specialized>
+    init (T& t) const noexcept { t = std::numeric_limits<T>::max(); }
+
+    template <typename T>
+    constexpr std::enable_if_t<!std::numeric_limits<T>::is_specialized>
+    init (T& t) const noexcept { t = T::max(); }
 };
 
 struct ReduceOpMax

template <typename TV, typename TI>
struct MinLocTag
{
    TV value;
    TI index;

    static constexpr MinLocTag<TV,TI> max () {
        return MinLocTag<TV,TI>{std::numeric_limits<TV>::max(),
                                std::numeric_limits<TI>::max()};
    }

    friend constexpr bool operator< (MinLocTag<TV,TI> const& a, MinLocTag<TV,TI> const& b)
    {
        return a.value < b.value;
    }
};
        auto const& ma = mf.const_arrays();
        MinLocTag<Real,Long> minloc = amrex::ParReduce(TypeList<ReduceOpMin>{},
                                                       TypeList<MinLocTag<Real,Long> >{}, mf,
        [=] AMREX_GPU_DEVICE (int box_no, int i, int j, int k) noexcept
            -> GpuTuple<MinLocTag<Real,Long> >
        {
            Long nc = ncell;
            Long index = i + nc*j + nc*nc*k;
            return { MinLocTag<Real,Long>{ma[box_no](i,j,k), index} };
        });
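The semantics of the snippet above can be exercised on the host (a sketch of my own, using a standalone copy of MinLocTag): init() uses MinLocTag::max() as the identity element, and the reduce combines with operator<, which compares values only, so the index rides along as a payload.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

template <typename TV, typename TI>
struct MinLocTag
{
    TV value;
    TI index;

    // Identity element for a min reduction, as required by init().
    static constexpr MinLocTag max () {
        return {std::numeric_limits<TV>::max(), std::numeric_limits<TI>::max()};
    }

    // Only the value participates in the comparison.
    friend constexpr bool operator< (MinLocTag const& a, MinLocTag const& b) {
        return a.value < b.value;
    }
};

// Serial stand-in for the ParReduce over MinLocTag elements.
MinLocTag<double,long> reduce_minloc (std::vector<MinLocTag<double,long>> const& v)
{
    auto r = MinLocTag<double,long>::max();
    for (auto const& t : v) { r = std::min(r, t); } // std::min uses operator<
    return r;
}
```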

Here is a test https://github.com/WeiqunZhang/amrex-devtests/blob/main/minloc/main.cpp

@eschnett
Contributor Author

Thanks! I'll try this.

@WeiqunZhang
Member

Let me know if it works for you. I think it should also just work for HIP. We do have to do more coding for Intel GPUs.

@WeiqunZhang
Member

Here is a draft of the changes: #2885. I also updated the test to do both minloc and maxloc.

@WeiqunZhang
Member

This is superseded by #2885.
