Update compile-time shared memory usage check for device_partition #543

umfranzw · 2024-03-26T16:09:40Z

The device partition algorithm uses a default_select_config struct to detect which architecture we're running on.

The default_select_config struct eventually causes the creation of a struct of type limit_block_size. This struct is used to detect whether the launch configuration being used (block size and amount of shared memory per thread) will cause the selected device to use more than 32 KiB of shared memory. If so, limit_block_size attempts to reduce the block size (dividing it by 2) and checks the shared memory usage again.

If the element type is large enough, it is possible to get into a situation where, even if we use the minimum block size (a single wavefront of threads) and give the threads the minimum possible number of elements to work on (1 each), we will still use more than 32 KiB of shared memory.

The limit_block_size struct assumes that the amount of shared memory that will be used is equal to the block size multiplied by the amount of memory required per thread. However, the device partition algorithm actually requires slightly more shared memory than this, because it does an extra allocation to store the lookback scan's state.

It's not really feasible to move this lookback scan state out of shared memory because all threads in the block need access to it.

This change modifies the limit_block_size struct so that it accepts an "ExtraSharedMemory" template parameter, and updates the shared memory check it performs so that it takes this value into account.

It also updates the device partition's config-creating code so that it passes in the size of the lookback scan state.
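The adjusted check can be sketched roughly as follows. This is a minimal illustration of the idea described above, not the rocPRIM implementation: the function names, the constexpr-function form (the real code uses a struct with template parameters), and the minimum block size of 64 (an assumed wavefront size) are all hypothetical. Only the 32 KiB limit, the halving step, and the extra shared memory term come from the description.

```cpp
#include <cstddef>

// Hypothetical sketch only; names and constants below are illustrative.
// Requires C++14 (loop inside a constexpr function).

// 32 KiB shared memory limit, as stated in the description.
constexpr std::size_t shared_memory_limit = 32 * 1024;
// Assumed minimum block size: one wavefront of 64 threads.
constexpr unsigned int min_block_size = 64;

// Total shared memory for a launch: per-thread storage plus any extra
// allocation (for example, the lookback scan state).
constexpr std::size_t shared_memory_usage(unsigned int block_size,
                                          std::size_t  bytes_per_thread,
                                          std::size_t  extra_shared_memory)
{
    return block_size * bytes_per_thread + extra_shared_memory;
}

// Halve the block size while the launch would exceed the limit,
// stopping at the minimum block size.
constexpr unsigned int limited_block_size(unsigned int block_size,
                                          std::size_t  bytes_per_thread,
                                          std::size_t  extra_shared_memory = 0)
{
    while(block_size > min_block_size
          && shared_memory_usage(block_size, bytes_per_thread, extra_shared_memory)
                 > shared_memory_limit)
    {
        block_size /= 2;
    }
    return block_size;
}

// 256 threads * 128 bytes is exactly 32 KiB, so without the extra term
// the block size is kept.
static_assert(limited_block_size(256, 128) == 256, "fits exactly");
// Any extra allocation pushes the same launch over the limit, so the
// block size is halved.
static_assert(limited_block_size(256, 128, 16) == 128, "reduced once");
```

With a sufficiently large per-thread footprint, the loop bottoms out at the minimum block size and the launch still does not fit, which is the situation where a custom configuration is needed.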

…y limit Test the edge case where the data passed to the device partition algorithm will consume the maximum allowable amount of shared memory. Since the algorithm itself also requires some shared memory to store state, this should push us over the max limit. In this case, the block size should be reduced to compensate.

nolmoonen

I have no comments about the changes; they look good to me. I do want to note that the configuration obtained by passing rocprim::default_config (which uses default_select_config in this case) should be viewed mostly as a heuristic. It is impractical to design a default config that both compiles (does not use more than the allowed shared memory) for types of arbitrary size and also gives good performance.

Instead, for arbitrarily large types, the suggested route is to pass a custom configuration, for example

using config = select_config<
	128,
	1,
	::rocprim::block_load_method::block_load_transpose,
	::rocprim::block_load_method::block_load_transpose,
	::rocprim::block_load_method::block_load_transpose,
	::rocprim::block_scan_algorithm::using_warp_scan>;

which can be experimented with to reach the desired performance. (Configurations for common types are determined by autotuning, which is soon to be added for partition.)

Naraenda · 2024-04-04T11:15:15Z

NTA

umfranzw requested review from stanleytsang-amd, RobsonRLemos and lawruble13 as code owners March 26, 2024 16:09

umfranzw requested a review from Naraenda March 28, 2024 13:24

Naraenda requested a review from nolmoonen April 3, 2024 12:40

nolmoonen approved these changes Apr 3, 2024

Naraenda approved these changes Apr 4, 2024

umfranzw merged commit 609ae19 into ROCm:develop Apr 4, 2024
8 of 15 checks passed
