Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Update compile-time shared memory usage check for device_partition (#543)

* Update compile-time shared memory usage check for device_partition

The device partition algorithm uses a default_select_config struct to detect which architecture we're running on. The default_select_config struct eventually causes the creation of a struct of type limit_block_size. This struct is used to detect whether the launch configuration being used (block size and amount of shared memory per thread) will cause the selected device to use more than 32 KiB of shared memory. If so, limit_block_size attempts to reduce the block size (divides it by 2) and checks the shared memory usage again.

If the element type is large enough, it is possible to get into a situation where, even if we use the minimum block size (a single wavefront of threads) and give the threads the minimum possible number of elements to work on (1 each), we will still use more than 32 KiB of shared memory.

The limit_block_size struct assumes that the amount of shared memory that will be used is equal to the block size multiplied by the amount of memory required per thread. However, the device partition algorithm actually requires slightly more shared memory than this, because it does an extra allocation to store the lookback scan's state. It's not really feasible to move this lookback scan state out of shared memory because all threads in the block need access to it.

This change modifies the limit_block_size struct so that it accepts an "ExtraSharedMemory" template parameter, and updates the shared memory check it performs so that it takes this value into account. It also updates the device partition's config-creating code so that it passes in the size of the lookback scan state.

* Add device partition unit test to check behaviour around shared memory limit

Test the edge case where the data passed to the device partition algorithm will consume the maximum allowable amount of shared memory. Since the algorithm itself also requires some shared memory to store state, this should push us over the max limit. In this case, the block size should be reduced to compensate.

---------

Co-authored-by: root <root@ixt-hq-106.rocm.amd.com>
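The halving check described in the commit message can be sketched roughly as follows. This is an illustration only, not the code from this commit: the name limit_block_size_sketch, the constants shared_memory_limit and min_block_size, the 64-thread wavefront size, and the template parameter layout are all assumptions. Only the "ExtraSharedMemory" parameter name, the 32 KiB limit, and the divide-by-2 behaviour come from the message itself.

```cpp
#include <cstddef>

// Hypothetical constants for illustration; not taken from the library.
constexpr std::size_t  shared_memory_limit = 32 * 1024; // 32 KiB, per the commit message
constexpr unsigned int min_block_size      = 64;        // assumed size of one wavefront

// Primary template: selected when the estimated usage
// (BlockSize * BytesPerThread + ExtraSharedMemory) fits within the limit,
// or when the block size cannot be reduced any further.
template<unsigned int BlockSize,
         std::size_t  BytesPerThread,
         std::size_t  ExtraSharedMemory = 0,
         bool         Stop
         = (BlockSize * BytesPerThread + ExtraSharedMemory <= shared_memory_limit)
           || (BlockSize <= min_block_size)>
struct limit_block_size_sketch
{
    static constexpr unsigned int value = BlockSize;
};

// Specialization: the estimate exceeds the limit and the block is still
// larger than the minimum, so halve the block size and check again.
template<unsigned int BlockSize, std::size_t BytesPerThread, std::size_t ExtraSharedMemory>
struct limit_block_size_sketch<BlockSize, BytesPerThread, ExtraSharedMemory, false>
{
    static constexpr unsigned int value
        = limit_block_size_sketch<BlockSize / 2, BytesPerThread, ExtraSharedMemory>::value;
};
```

With 256 threads at 128 bytes each, the per-thread data alone is exactly 32 KiB, so the sketch keeps the block size when ExtraSharedMemory is 0 and halves it to 128 once any extra allocation (for example, a 16-byte state) is counted. This mirrors the edge case the new unit test is described as covering. Note that the sketch stops at the assumed minimum block size even if the estimate still exceeds the limit.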
Showing 4 changed files with 187 additions and 22 deletions.