
Implement GPU MinLoc reduction #2882

Closed
wants to merge 2 commits

Conversation

eschnett
Contributor

Summary

I am looking for a GPU-enabled MinLoc reduction operator. This PR provides a proof-of-concept implementation. I am looking for feedback.

Additional background

To implement MinLoc, I view the quantity that is to be reduced as a two-element tuple: (1) the quantity that is actually reduced, and (2) an additional, arbitrary payload (the location). I add the corresponding functions that handle tuples. For example, the Less operation compares only the first element of the tuple, whereas a __shfl_down_sync needs to shuffle both elements of the tuple.

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • change answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@WeiqunZhang
Member

What's the size of the tuple (i.e., sizeof(Tuple))? How do you plan to use it? Is this going to be inside a kernel that also does some other jobs, or a function that does MinLoc only? Instead of using atomics, you could save the block-reduce results in device memory and then launch a second kernel with a single block to further reduce them. In our experience, this is faster than using atomics built with atomicCAS.
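The two-pass scheme described above can be sketched on the host (this is my own illustration, not AMReX code): pass 1 reduces each "block" of the input to a partial result stored in a scratch buffer (device memory on the GPU); pass 2 plays the role of the single-block kernel that reduces the partials. No atomics are involved.

```cpp
#include <vector>
#include <limits>
#include <algorithm>
#include <cstddef>

// Pass 1: each block of `block` elements produces one partial minimum.
// Pass 2: a single "block" reduces the partials to the final answer.
double two_pass_min (std::vector<double> const& a, std::size_t block)
{
    std::vector<double> partials; // stand-in for the device scratch buffer
    for (std::size_t b = 0; b < a.size(); b += block) {          // pass 1
        double m = std::numeric_limits<double>::max();
        std::size_t e = std::min(b + block, a.size());
        for (std::size_t i = b; i < e; ++i) { m = std::min(m, a[i]); }
        partials.push_back(m);
    }
    double m = std::numeric_limits<double>::max();               // pass 2
    for (double p : partials) { m = std::min(m, p); }
    return m;
}
```

On the GPU, pass 1 would be one kernel launch writing one partial per thread block, and pass 2 a second launch with a single block.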

@eschnett
Contributor Author

The size of the tuple would be 2 (although that could be generalized). The first element is the value, the second the location (stored as a single integer, probably 64 bits). The value to be reduced would be, e.g., tuple{val, index} (calculated on each thread), and that would also be the reduction result. The index in the result can then be converted back into an (i, j, k, n) tuple.
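The index bookkeeping mentioned above might look like the following sketch (the names `Cell`, `pack`, and `unpack` are hypothetical): a cell (i, j, k, n) on an nx-by-ny-by-nz box is packed into one 64-bit integer on each thread, and the reduction result is unpacked again afterwards.

```cpp
#include <cstdint>

struct Cell { int i, j, k, n; };

// Pack (i,j,k,n) into a single 64-bit linear index.
int64_t pack (Cell c, int64_t nx, int64_t ny, int64_t nz) {
    return c.i + nx*(c.j + ny*(int64_t(c.k) + nz*int64_t(c.n)));
}

// Invert the packing to recover (i,j,k,n) from the reduction result.
Cell unpack (int64_t idx, int64_t nx, int64_t ny, int64_t nz) {
    Cell c;
    c.i = int(idx % nx); idx /= nx;
    c.j = int(idx % ny); idx /= ny;
    c.k = int(idx % nz); idx /= nz;
    c.n = int(idx);
    return c;
}
```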

This would be used inside a ParReduce construct, together with a few other reductions.

We can use your suggestions instead, implementing CUDA code for our combined reduction.

@WeiqunZhang
Member

If index is 64 bits, the total size of val and index will be more than 64 bits, so you won't be able to atomically write/read the whole tuple. ParReduce uses a two-pass approach that does not use atomics.

@eschnett
Contributor Author

I thought I implemented just the functionality that ParReduce requires. If atomics are not necessary, then even better.

@WeiqunZhang
Member

I don't think you need to implement those things. It seems that all you need is the following, so that amrex's reduce function knows how to initialize the tuple to its maximum value. It seems to work; it uses cub::BlockReduce.

diff --git a/Src/Base/AMReX_Reduce.H b/Src/Base/AMReX_Reduce.H
index 9c07b7b4a..2dd241714 100644
--- a/Src/Base/AMReX_Reduce.H
+++ b/Src/Base/AMReX_Reduce.H
@@ -133,7 +133,12 @@ struct ReduceOpMin
     void local_update (T& d, T const& s) const noexcept { d = amrex::min(d,s); }
 
     template <typename T>
-    constexpr void init (T& t) const noexcept { t = std::numeric_limits<T>::max(); }
+    constexpr std::enable_if_t<std::numeric_limits<T>::is_specialized>
+    init (T& t) const noexcept { t = std::numeric_limits<T>::max(); }
+
+    template <typename T>
+    constexpr std::enable_if_t<!std::numeric_limits<T>::is_specialized>
+    init (T& t) const noexcept { t = T::max(); }
 };
 
 struct ReduceOpMax

template <typename TV, typename TI>
struct MinLocTag
{
    TV value;
    TI index;

    static constexpr MinLocTag<TV,TI> max () {
        return MinLocTag<TV,TI>{std::numeric_limits<TV>::max(),
                                std::numeric_limits<TI>::max()};
    }

    friend constexpr bool operator< (MinLocTag<TV,TI> const& a, MinLocTag<TV,TI> const& b)
    {
        return a.value < b.value;
    }
};
        auto const& ma = mf.const_arrays();
        MinLocTag<Real,Long> minloc = amrex::ParReduce(TypeList<ReduceOpMin>{},
                                                       TypeList<MinLocTag<Real,Long> >{}, mf,
        [=] AMREX_GPU_DEVICE (int box_no, int i, int j, int k) noexcept
            -> GpuTuple<MinLocTag<Real,Long> >
        {
            Long nc = ncell;
            Long index = i + nc*j + nc*nc*k;
            return { MinLocTag<Real,Long>{ma[box_no](i,j,k), index} };
        });
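The semantics of the snippet above can be exercised on the host (a sketch of my own, using a standalone copy of MinLocTag): init() uses MinLocTag::max() as the identity element, and the reduce combines with operator<, which compares values only, so the index rides along as a payload.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

template <typename TV, typename TI>
struct MinLocTag
{
    TV value;
    TI index;

    // Identity element for a min reduction, as required by init().
    static constexpr MinLocTag max () {
        return {std::numeric_limits<TV>::max(), std::numeric_limits<TI>::max()};
    }

    // Only the value participates in the comparison.
    friend constexpr bool operator< (MinLocTag const& a, MinLocTag const& b) {
        return a.value < b.value;
    }
};

// Serial stand-in for the ParReduce over MinLocTag elements.
MinLocTag<double,long> reduce_minloc (std::vector<MinLocTag<double,long>> const& v)
{
    auto r = MinLocTag<double,long>::max();
    for (auto const& t : v) { r = std::min(r, t); } // std::min uses operator<
    return r;
}
```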

Here is a test https://github.com/WeiqunZhang/amrex-devtests/blob/main/minloc/main.cpp

@eschnett
Contributor Author

Thanks! I'll try this.

@WeiqunZhang
Member

Let me know if it works for you. I think it should also just work for HIP. We do have to do more coding for Intel GPUs.

@WeiqunZhang
Member

Here is a draft of the changes: #2885. I also updated the test to do both minloc and maxloc.

@WeiqunZhang
Member

This is superseded by #2885.
