Efficient cuda kernels for reductions #382
I've now finished the min_to and max_to kernels.
Looks great, thanks for this contribution! Just have some questions to make sure I'm following 🚀
Awesome changes, thanks for the contribution!
So far, this implements a more efficient sum_to kernel whose maximum write contention is the number of blocks (groups of 1024 threads) running concurrently. Operations within each block scale with `log2(min(chunk_size, block_size))`. Resolves #332, and will depend on #380 for @ViliamVadocz's fix to atomicMaxf and atomicMinf.
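For readers following along, here is a minimal sketch of the general pattern described above: each block reduces its values in shared memory with a halving tree, then issues a single atomic write. It is not the code in this PR. The kernel name `block_sum`, the `scratch` buffer, and the launch configuration are hypothetical, and it reduces a whole array to one scalar rather than handling per-chunk outputs, so the step count here is `log2(block_size)` only. It assumes the block size is a power of two.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Threads per block, matching the 1024 mentioned in the description.
constexpr unsigned int BLOCK_SIZE = 1024;

// Hypothetical kernel (not from the PR): sums `n` floats into `*out`.
// Assumes blockDim.x == BLOCK_SIZE and that BLOCK_SIZE is a power of two.
__global__ void block_sum(const float *in, float *out, size_t n) {
    __shared__ float scratch[BLOCK_SIZE];

    // Grid-stride loop: each thread accumulates a private partial sum.
    float acc = 0.0f;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        acc += in[i];
    }
    scratch[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory: log2(blockDim.x) halving steps.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // One atomic write per block, so contention on `out` is bounded by
    // the number of blocks rather than the number of threads.
    if (threadIdx.x == 0) {
        atomicAdd(out, scratch[0]);
    }
}

int main() {
    const size_t n = 1 << 20;
    float *in = nullptr;
    float *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (size_t i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    // 64 blocks is an arbitrary choice for this sketch.
    block_sum<<<64, BLOCK_SIZE>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("sum = %.1f (n = %zu)\n", *out, n);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

With this layout, only thread 0 of each block touches `out`, which is where the "number of blocks" contention bound comes from. A min or max variant would need a float atomic in place of `atomicAdd`, which is presumably why this PR depends on the atomicMaxf and atomicMinf fix in #380.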