Implement GPU MinLoc reduction #2882
Conversation
What's the size of the tuple (i.e., sizeof(Tuple))? How do you plan to use it? Is this going to be inside a kernel that also does other work, or a function that does MinLoc only? Instead of using atomics, you could save the block reduce results in device memory. Then you launch a second kernel with only one block to further reduce the block reduce results. In our experience, this is faster than using atomics built with atomicCAS.
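A minimal host-side sketch of the two-pass scheme suggested above, written in plain C++ so it can be run without a GPU. The names ValLoc, min_loc, block_partials, and final_reduce are hypothetical and are not AMReX or CUDA APIs. On a device, the first function would correspond to a multi-block kernel that writes one partial result per block to device memory, and the second to a follow-up kernel launched with a single block.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical value/location pair (illustration only, not an AMReX type).
struct ValLoc {
    double value;
    std::int64_t loc;
};

// Identity for a min reduction: any real element compares lower or wins the tie.
inline ValLoc min_identity() {
    return {std::numeric_limits<double>::max(),
            std::numeric_limits<std::int64_t>::max()};
}

// Keep the smaller value; break ties by the smaller location so the
// result does not depend on the order in which partials are combined.
inline ValLoc min_loc(ValLoc a, ValLoc b) {
    if (b.value < a.value || (b.value == a.value && b.loc < a.loc)) return b;
    return a;
}

// Pass 1: every "block" reduces its own slice and stores one partial result.
// No atomics are needed because each block writes to its own slot.
std::vector<ValLoc> block_partials(const std::vector<double>& data,
                                   std::size_t block_size) {
    std::vector<ValLoc> partials;
    for (std::size_t start = 0; start < data.size(); start += block_size) {
        const std::size_t end = std::min(start + block_size, data.size());
        ValLoc best = min_identity();
        for (std::size_t i = start; i < end; ++i) {
            best = min_loc(best, ValLoc{data[i], static_cast<std::int64_t>(i)});
        }
        partials.push_back(best);
    }
    return partials;
}

// Pass 2: a single "block" reduces the per-block partial results.
ValLoc final_reduce(const std::vector<ValLoc>& partials) {
    ValLoc best = min_identity();
    for (const ValLoc& p : partials) best = min_loc(best, p);
    return best;
}
```

Calling final_reduce(block_partials(data, block_size)) returns the minimum value together with its index in data. The second pass only touches one element per block, which is why it can run in a single block.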
The size of the tuple would be 2 (although that could be generalized). The first element is the value, the second the location (stored as a single integer, probably 64 bits). The value that is to be reduced would e.g. be This would be used inside a We can use your suggestions instead, implementing CUDA code for our combined reduction. |
If |
I thought I implemented just the functionality that |
I don't think you need to implement those things. It seems that all you need is the following so that amrex's reduce function knows how to initialize it to the maximum value. It seems to work. It used cub::BlockReduce.
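The general idea of telling a generic reduction how to initialize a value/location pair can be illustrated with a small host-side C++ sketch. This is not the AMReX code referred to above: ValueIndex, MinIdentity, and reduce_min are hypothetical names, and the trait-based approach is only one assumed way to supply the initial (maximum) value.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical pair of a value and its location (illustration only).
template <typename TV, typename TI>
struct ValueIndex {
    TV value;
    TI index;
};

// Ordering looks at the value only; the index is carried along as payload.
template <typename TV, typename TI>
constexpr bool operator<(const ValueIndex<TV, TI>& a,
                         const ValueIndex<TV, TI>& b) {
    return a.value < b.value;
}

// Hypothetical trait: the starting value for a min reduction over T.
template <typename T>
struct MinIdentity {
    static constexpr T value() { return std::numeric_limits<T>::max(); }
};

// Specialization so the generic reduction knows how to initialize a pair.
template <typename TV, typename TI>
struct MinIdentity<ValueIndex<TV, TI>> {
    static constexpr ValueIndex<TV, TI> value() {
        return {std::numeric_limits<TV>::max(), std::numeric_limits<TI>::max()};
    }
};

// Generic min reduction: works for plain scalars and for ValueIndex alike.
template <typename T>
T reduce_min(const std::vector<T>& items) {
    T result = MinIdentity<T>::value();
    for (const T& x : items) {
        if (x < result) result = x;
    }
    return result;
}
```

With this layout the reduction loop itself stays unchanged; only the comparison and the initial value are specific to the pair type.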
Here is a test https://github.com/WeiqunZhang/amrex-devtests/blob/main/minloc/main.cpp |
Thanks! I'll try this. |
Let me know if it works for you. I think it should also just work for HIP. We do have to do more coding for Intel GPUs. |
Here is a draft of the changes: #2885. I also updated the test to do both minloc and maxloc. |
This is superseded by #2885. |
Summary
I am looking for a GPU-enabled MinLoc reduction operator. This PR provides a proof-of-concept implementation, and I am looking for feedback.
Additional background
To implement MinLoc, I view the quantity that is to be reduced as a tuple of two elements: (1) the value that is to be reduced, and (2) an additional arbitrary payload (the location). I add corresponding functions that handle tuples. For example, the Less operation compares only the first element of the tuple, whereas a __shfl_down_sync needs to shuffle both elements of the tuple.
Checklist
The proposed changes:
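To illustrate the point made under Additional background, here is a host-side C++ emulation of a warp-level MinLoc reduction. It is a sketch under stated assumptions, not the code in this PR: shfl_down is a hypothetical stand-in that mimics how __shfl_down_sync moves one field between the 32 lanes of a warp (a lane whose source is out of range keeps its own value), and warp_min_loc is an illustrative name.

```cpp
#include <array>
#include <cstdint>

constexpr int kWarpSize = 32;

struct ValLoc {
    double value;
    std::int64_t loc;
};

// Emulates a shuffle-down of ONE field: lane i reads the value held by
// lane i + delta; lanes whose source is out of range keep their own value.
template <typename T>
std::array<T, kWarpSize> shfl_down(const std::array<T, kWarpSize>& lanes,
                                   int delta) {
    std::array<T, kWarpSize> out = lanes;
    for (int lane = 0; lane + delta < kWarpSize; ++lane) {
        out[lane] = lanes[lane + delta];
    }
    return out;
}

// Tree reduction over one warp. The comparison looks at the value only,
// but both the value and the location must be shuffled so that the
// location travels together with the value it belongs to.
ValLoc warp_min_loc(std::array<double, kWarpSize> val,
                    std::array<std::int64_t, kWarpSize> loc) {
    for (int delta = kWarpSize / 2; delta > 0; delta /= 2) {
        const auto other_val = shfl_down(val, delta);
        const auto other_loc = shfl_down(loc, delta);
        for (int lane = 0; lane < kWarpSize; ++lane) {
            if (other_val[lane] < val[lane]) {  // "Less" on the first element
                val[lane] = other_val[lane];
                loc[lane] = other_loc[lane];
            }
        }
    }
    return {val[0], loc[0]};  // lane 0 holds the warp-wide result
}
```

If only the value were shuffled, lane 0 would end up with the correct minimum but with its own, unrelated location.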