Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update compile-time shared memory usage check for device_partition #543

Merged
merged 2 commits into from
Apr 4, 2024

Conversation

umfranzw
Copy link
Collaborator

The device partition algorithm uses a default_select_config struct to detect which architecture we're running on.

The default_select_config struct eventually causes the creation of a struct of type limit_block_size. This struct is used to detect if the launch configuration that's being used (block size and amount of shared memory per thread) will cause on the selected device to use more than 32 KiB of shared memory. If so, then limit_block_size attempts to reduce the block size (divides it by 2) and checks the shared memory usage again.

If the element type is large enough, it is possible to get into a situation where, even if we use the minimum block size (a single wavefront of threads) and give the threads the minimum possible number of elements to work on (1 each), we will still use more than 32 KiB of shared memory.

The limit_block_size struct assumes that the amount of shared memory that will be used is equal to the block size multiplied by the amount of memory required per thread. However, the device partition algorithm actually requires slightly more shared memory than this, because it does an extra allocation to store the lookback scan's state.

It's not really feasible to move this lookback scan state out of shared memory because all threads in the block need access to it.

This change modifies the limit_block_size struct so that it accepts an "ExtraSharedMemory" template parameter, and updates the shared memory check it performs so that it takes this value into account.

It also updates the device partition's config-creating code so that it passes in the size of the lookback scan state.

The device partition algorithm uses a default_select_config struct
to detect which architecture we're running on.

The default_select_config struct eventually causes the creation of
a struct of type limit_block_size. This struct is used to detect if
the launch configuration that's being used (block size and amount of
shared memory per thread) will cause on the selected device to use
more than 32 KiB of shared memory. If so, then limit_block_size
attempts to reduce the block size (divides it by 2) and checks the
shared memory usage again.

If the element type is large enough, it is possible to get into
a situation where, even if we use the minimum block size (a single
wavefront of threads) and give the threads the minimum possible
number of elements to work on (1 each), we will still use more
than 32 KiB of shared memory.

The limit_block_size struct assumes that the amount of shared memory
that will be used is equal to the block size multiplied by the amount
of memory required per thread. However, the device partition algorithm
actually requires slightly more shared memory than this, because it
does an extra allocation to store the lookback scan's state.

It's not really feasible to move this lookback scan state out of
shared memory because all threads in the block need access to it.

This change modifies the limit_block_size struct so that it accepts
an "ExtraSharedMemory" template parameter, and updates the shared
memory check it performs so that it takes this value into account.

It also updates the device partition's config-creating code so that
it passes in the size of the lookback scan state.
…y limit

Test the edge case where the data passed to the device partition
algorithm will consume the maximum allowable amount of shared memory.
Since the algorithm itself also requires some shared memory to store
state, this should push us over the max limit. In this case,
the block size should be reduced to compensate.
@Naraenda Naraenda requested a review from nolmoonen April 3, 2024 12:40
Copy link
Collaborator

@nolmoonen nolmoonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no comments about the changes, they look good to me. I do want to note that the configuration obtained by passing rocprim::default_config (using default_select_config in this case), should be viewed mostly as a heuristic. It is unpractical to design a default config that both compiles (does not use more than allowed shared memory) for types with arbitrary size and also gives good performance.

Instead, for arbitrarily large types, the suggested route is to pass a custom configuration, for example

using config = select_config<
	128,
	1,
	::rocprim::block_load_method::block_load_transpose,
	::rocprim::block_load_method::block_load_transpose,
	::rocprim::block_load_method::block_load_transpose,
	::rocprim::block_scan_algorithm::using_warp_scan>;

which can be experimented with to reach desired performance. (Configuration for common types are solved by autotuning, which is soon to be added for partition.)

@Naraenda
Copy link
Member

Naraenda commented Apr 4, 2024

NTA

@umfranzw umfranzw merged commit 609ae19 into ROCm:develop Apr 4, 2024
8 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants