New findmin/max implementation using single-pass reduction #576
This PR replaces the two-kernel findminmax implementation with a single-pass one by @tkf, #484 (comment), as proposed by @Ellipse0934, #320 (comment). I only rebased it and cleaned it up a little.
Performance is about as @Ellipse0934 reported: massive improvements in some cases and a small regression in others (which was much less pronounced on my GPU). As the implementation also fixes a bug, I'd say that's worth it. FWIW, I don't think the performance penalty comes from not using warp intrinsics, but rather from the bloated code generated by a tuple-heavy reduce closure. Recent/beefy GPUs like mine are often less sensitive to that.
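For readers unfamiliar with the idea, here is a minimal CPU-only sketch of what a single-pass findmin looks like: each element is paired with its index, and one reduction over those tuples yields both the value and its position. This is not the code from this PR; `minpair` and `single_pass_findmin` are hypothetical names chosen for illustration, and the sketch ignores the `NaN` handling that `Base.findmin` performs.

```julia
# Illustrative sketch only; `minpair` and `single_pass_findmin` are
# hypothetical names, not functions from this PR or from any package.

# Combine two (value, index) tuples. Prefer the smaller value; on ties,
# prefer the smaller index so the result does not depend on the order in
# which a parallel reduction happens to combine elements.
# Assumption: no NaN values (Base.findmin treats NaN specially).
function minpair(a, b)
    (va, ia), (vb, ib) = a, b
    va < vb && return a
    vb < va && return b
    return ia <= ib ? a : b
end

# One reduction over (value, index) tuples instead of one pass to find
# the extremum and a second pass to locate it.
single_pass_findmin(xs) = reduce(minpair, zip(xs, eachindex(xs)))

xs = [3.0, 1.0, 4.0, 1.0, 5.0]
@assert single_pass_findmin(xs) == findmin(xs)
```

The reduction operator carries a tuple through every step, which is the kind of tuple-heavy closure mentioned above.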
Fixes #553, fixes #320 (comment)