Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always use size 16 sub-groups in single work-group radix sort if supported #1833

Merged
merged 4 commits into from
Sep 13, 2024

Conversation

mmichel11
Copy link
Contributor

Previously, an IGC workaround has caused kernels compiled with [[sycl::reqd_sub_group_size(32)]] to be silently compiled with size 16 sub-groups with -O0 on certain devices. To comply with the SYCL spec, this case will now throw a synchronous exception at JIT time.

Single work-group radix sort uses a sub-group sizes of 32 for the smallest inputs and 16 for larger inputs to minimize register pressure. I have experimented with the current mainline version, a version only using sub-group size of 16, and a version removing the requirement altogether. The first two versions performed similarly with no measurable difference, and explicitly requiring a sub-group size of 16 offered up to a ~15% benefit over the version with no requirement at all.

By using only sub-group sizes of 16 in single work-group radix sort, we are able to avoid the IGC issue while not impacting performance.

This change is added to avoid a bug where IGC cannot compile SIMD32
kernels with -O0 compilation flags. No performance impact is observed.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
…ely"

This reverts commit 7572f84. After
experimentation with cold cache benchmarks, benefit of up to ~15% was
observed requiring size 16 sub-groups for the larger single work-group
cases.

IGC compiles the kernels with sub-group size 32 here so leaving the
requirement is needed to maximize performance.
@danhoeflinger
Copy link
Contributor

There is a comment on line 819 of parallel_backend_sycl_radix_sort.h which needs to be updated with this PR.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I'd like @MikeDvorskiy to also be able to look at this.

@mmichel11
Copy link
Contributor Author

There is a comment on line 819 of parallel_backend_sycl_radix_sort.h which needs to be updated with this PR.

I made a few tweaks to the comment to match the new behavior.

Copy link
Contributor

@MikeDvorskiy MikeDvorskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind regarding the changes, if it doesn't make the performance worse. (As written in PR description - it doesn't)

@mmichel11 mmichel11 merged commit 87d85d5 into main Sep 13, 2024
22 checks passed
@mmichel11 mmichel11 deleted the dev/mmichel11/one_wg_radix_sort_simd16 branch September 13, 2024 15:44
mmichel11 added a commit that referenced this pull request Sep 25, 2024
…orted (#1833)

* Only use sub-group sizes of 16 in one-wg radix sort

This change is added to avoid a bug where IGC cannot compile SIMD32
kernels with -O0 compilation flags. No performance impact is observed.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

---------

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
timmiesmith pushed a commit that referenced this pull request Nov 21, 2024
…orted (#1833)

* Only use sub-group sizes of 16 in one-wg radix sort

This change is added to avoid a bug where IGC cannot compile SIMD32
kernels with -O0 compilation flags. No performance impact is observed.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

---------

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants