Update compile-time shared memory usage check for device_partition #543
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The device partition algorithm uses a default_select_config struct to detect which architecture we're running on.
The default_select_config struct eventually causes the creation of a struct of type limit_block_size. This struct is used to detect if the launch configuration that's being used (block size and amount of shared memory per thread) will cause on the selected device to use more than 32 KiB of shared memory. If so, then limit_block_size attempts to reduce the block size (divides it by 2) and checks the shared memory usage again.
If the element type is large enough, it is possible to get into a situation where, even if we use the minimum block size (a single wavefront of threads) and give the threads the minimum possible number of elements to work on (1 each), we will still use more than 32 KiB of shared memory.
The limit_block_size struct assumes that the amount of shared memory that will be used is equal to the block size multiplied by the amount of memory required per thread. However, the device partition algorithm actually requires slightly more shared memory than this, because it does an extra allocation to store the lookback scan's state.
It's not really feasible to move this lookback scan state out of shared memory because all threads in the block need access to it.
This change modifies the limit_block_size struct so that it accepts an "ExtraSharedMemory" template parameter, and updates the shared memory check it performs so that it takes this value into account.
It also updates the device partition's config-creating code so that it passes in the size of the lookback scan state.