This repository has been archived by the owner on Mar 21, 2024. It is now read-only.
Select algorithms are based on the decoupled look-back approach. Therefore, subsequent thread blocks are guaranteed to write data strictly after the input data was read. Unlike the partition family, select writes data only from one side of the array, so it's safe to have the `in` iterator equal to the `out` one. The only issue is `LOAD_LDG`, which is used by default. Replacing `LOAD_LDG` with `LOAD_CA` leads to a 50% slowdown on Kepler and about a 30% slowdown on Maxwell. To avoid a performance regression on these architectures, I've forbidden in-place execution in the existing overloads and left `LOAD_LDG` there. Since it'd be unfortunate to lose the in-place option, I introduced an in-place overload that takes exactly one argument. This also addresses the following issue.

The unique subset of algorithms reads data outside of the thread block tile. This leads to data races. It's possible to introduce an in-place version, but it'd require more work (caching pre-tile data in temporary storage).
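The safety argument for in-place select can be illustrated with a sequential host-side model. This is only a sketch of the invariant, not the CUB implementation: the helper name `select_if_in_place` is hypothetical, and the loop stands in for thread blocks that process tiles in order. Because selected items are compacted toward the front of the array, the write position never gets ahead of the read position, so no unread input is overwritten.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential model of in-place select (stream compaction).
// Hypothetical helper for illustration; not part of CUB.
template <typename T, typename Pred>
std::size_t select_if_in_place(std::vector<T>& data, Pred pred)
{
    std::size_t write = 0; // next output slot, grows from the front only
    for (std::size_t read = 0; read < data.size(); ++read)
    {
        const T value = data[read]; // input is read before its slot can be reused
        if (pred(value))
        {
            // Invariant: output never overtakes input, so only
            // already-consumed slots are overwritten.
            assert(write <= read);
            data[write++] = value;
        }
    }
    return write; // number of selected items
}
```

Calling `select_if_in_place(v, pred)` on a `std::vector<int>` leaves the selected items in the first `n` slots, where `n` is the returned count; the contents past `n` are leftovers from the input. The model does not capture the unique case: there each item is compared with its predecessor, which may belong to another tile and could already have been overwritten in a parallel run.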