Work around crashes and incorrect results in scan-based algorithms when compiling with -O0 #1997
Conversation
IGC intentionally forces a sub-group size of 16 on certain iGPUs to work around a known issue. We have to determine this by first compiling the kernels to see if the required sub-group size is respected.
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
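For illustration only (the PR itself detects the mismatch by catching the JIT-time exception, as described at the end of this page), a compiled kernel's effective sub-group size can also be inspected through standard SYCL 2020 kernel queries. The kernel name below is hypothetical, and which descriptor best reflects IGC's forced size of 16 is an assumption:

#include <sycl/sycl.hpp>
#include <cstdint>

class __scan_kernel; // hypothetical name; the kernel must be defined somewhere in the application

// Compile the kernel through a bundle, then ask what sub-group size the
// backend can actually give it on this device. On the affected iGPUs this
// may be 16 even though 32 was requested (descriptor choice is an assumption).
bool
__sub_group_size_respected(sycl::queue& __q, std::uint32_t __required)
{
    auto __bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(
        __q.get_context(), {sycl::get_kernel_id<__scan_kernel>()});
    sycl::kernel __k = __bundle.get_kernel(sycl::get_kernel_id<__scan_kernel>());
    return __k.get_info<sycl::info::kernel_device_specific::max_sub_group_size>(
               __q.get_device()) == __required;
}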
I have not seen examples of this usage pattern in our code before.
I don't have a strong preference between the way I implemented it and directly adding try ... catch throughout this header. I initially did it this way to avoid having many try/catch statements here. Let me leave this open for now so others can weigh in on whether they prefer a more functional approach (sketched below) or direct try...catch blocks.
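A minimal sketch of the kind of helper meant by "a more functional approach", assuming both callables return the same type; the helper name and the fallback shape are illustrative, not the PR's actual code:

#include <sycl/sycl.hpp>

// Illustrative helper: centralizes the try/catch so each call site passes a
// primary submission and a fallback instead of repeating the handling inline.
template <typename _PrimaryPath, typename _FallbackPath>
auto
__run_with_fallback(_PrimaryPath __primary, _FallbackPath __fallback)
{
    try
    {
        return __primary();
    }
    catch (const sycl::exception& __e)
    {
        // Only the known "required sub-group size unsupported" failure is
        // absorbed; anything else is re-thrown to the user.
        if (__e.code() != sycl::errc::kernel_not_supported)
            throw;
        return __fallback();
    }
}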
using _ScanKernel = oneapi::dpl::__par_backend_hetero::__internal::__kernel_name_generator<
    __reduce_then_scan_scan_kernel, _CustomName, _InRng, _OutRng, _GenScanInput, _ReduceOp, _ScanInputTransform,
    _WriteOp, _InitType, _Inclusive, _IsUniquePattern>;
static auto __kernels = __internal::__kernel_compiler<_ReduceKernel, _ScanKernel>::__compile(__exec);
I wanted to point out this unique case and ensure there are no issues with this approach. When benchmarking the single work-group sizes, I was seeing some overhead after switching from the kernel provider to the kernel compiler. Some of the overhead seems related to the kernel bundle itself, which is unavoidable. However, __kernel_compiler creates a std::vector in order to call sycl::get_kernel_bundle, and this allocation/deallocation on each call was leading to measurable slowdowns with small input sizes. To fix this, I have made the variable static, since __kernels should be the same for each call. This is beneficial assuming the application makes multiple calls to the scan-based algorithm, which I expect is the most common case.
"I wanted to point this unique case out and ensure there are no issues with this approach."
I am thinking about possible issues with the uniqueness of the kernels in some corner cases. Is it possible to end up with different kernels for the same instantiation of __parallel_transform_reduce_then_scan? I can imagine, for example:

// no kernel name specified: unnamed lambda case
sycl::queue queue_a{selector_vendor_a{}};
sycl::queue queue_b{selector_vendor_b{}};
// policy_a and policy_b have the same type
dpl::execution::device_policy policy_a(queue_a);
dpl::execution::device_policy policy_b(queue_b);
// ... containers and predicates have the same types
dpl::copy_if(policy_a, ...);
dpl::copy_if(policy_b, ...); // will it use the kernels compiled for queue_a (and thus device "a")?

Such cases are highly unlikely, and I cannot think of any others. I would rather document this as a known limitation with a workaround (name the kernel, as sketched below) than compromise performance (if the overhead is large enough).
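A sketch of the "name the kernel" workaround, using oneDPL's make_device_policy; the kernel names copy_if_a and copy_if_b are illustrative:

#include <sycl/sycl.hpp>
#include <oneapi/dpl/execution>

// Distinct policy names produce distinct kernel name instantiations, so the
// static kernels cached for queue_a can never be picked up for queue_b.
void
__disambiguate(sycl::queue& queue_a, sycl::queue& queue_b)
{
    auto policy_a = oneapi::dpl::execution::make_device_policy<class copy_if_a>(queue_a);
    auto policy_b = oneapi::dpl::execution::make_device_policy<class copy_if_b>(queue_b);
    // dpl::copy_if(policy_a, ...); // compiled and cached for queue_a's device
    // dpl::copy_if(policy_b, ...); // separate instantiation for queue_b's device
}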
I agree with the approach, but I need to review the PR more thoroughly. Indeed, according to the SYCL 2020 spec, launching a kernel decorated with the reqd_sub_group_size attribute may throw an exception depending on what is inside that kernel...
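As background, the sub-group sizes a device supports can be queried with a standard SYCL 2020 descriptor; a minimal sketch follows. Note this is a device-level property: on the affected iGPUs the device can advertise 32 while a particular kernel compiled at -O0 still cannot use it, which is why the JIT-time exception has to be handled anyway.

#include <sycl/sycl.hpp>
#include <algorithm>
#include <cstddef>
#include <vector>

// Standard SYCL 2020 query: the list of sub-group sizes the device supports.
// A specific kernel compiled at -O0 can still fail with
// errc::kernel_not_supported even if 32 appears in this list.
bool
__device_supports_sub_group_size(const sycl::device& __d, std::size_t __size)
{
    const std::vector<std::size_t> __sizes = __d.get_info<sycl::info::device::sub_group_sizes>();
    return std::find(__sizes.begin(), __sizes.end(), __size) != __sizes.end();
}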
I have a question regarding this part: "Secondly, after discussion with compiler developers, kernel compilation must be separated from execution of the kernel to prevent corruption of the underlying sycl::queue". Is it a limitation/bug of the DPC++ SYCL implementation?
// Macro to check if the exception thrown when a kernel cannot be run on a device does not align with
// sycl::errc::kernel_not_supported as required by the SYCL spec. Detects the Intel DPC++ and open-source intel/llvm
// compilers.
#ifdef _ONEDPL_LIBSYCL_VERSION
I would suggest limiting it to a future compiler version.
Limited it up to 20250200 for now for the icpx compiler. With an open-source compiler build, I believe this will stay enabled until we manually update it in the future, since there is no __INTEL_LLVM_COMPILER macro there.
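A hedged sketch of the version gating being described; the guard's macro name is hypothetical, while _ONEDPL_LIBSYCL_VERSION, __INTEL_LLVM_COMPILER, and the 20250200 cutoff come from the discussion above:

// Hypothetical spelling of the guard (the PR's actual macro name may differ).
// icpx defines __INTEL_LLVM_COMPILER (e.g. 20250200); open-source intel/llvm
// builds do not, so they keep the workaround until this check is revisited.
#ifdef _ONEDPL_LIBSYCL_VERSION
#    if !defined(__INTEL_LLVM_COMPILER) || (__INTEL_LLVM_COMPILER < 20250200)
#        define _ONEDPL_WORKAROUND_REQD_SUB_GROUP_SIZE_O0 1
#    endif
#endif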
@@ -1099,7 +1099,7 @@ struct __write_to_id_if_else
 template <typename _ExecutionPolicy, typename _Range1, typename _Range2, typename _UnaryOperation, typename _InitType,
           typename _BinaryOperation, typename _Inclusive>
 auto
-__parallel_transform_scan(oneapi::dpl::__internal::__device_backend_tag __backend_tag, _ExecutionPolicy&& __exec,
+__parallel_transform_scan(oneapi::dpl::__internal::__device_backend_tag __backend_tag, const _ExecutionPolicy& __exec,
Let's keep the original variant with a forwarding reference, which is aligned with other patterns, unless the change is necessary.
There's a tricky scenario with aligning the future return type across the different potential paths when we use the forwarding reference (see the sketch below). I will take another look and see whether it can be implemented that way.
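A minimal, self-contained illustration of the tension (all names hypothetical): with an auto return type, every return statement must deduce to the same type, so both submission paths have to agree on one future type even though a forwarding reference lets them take different routes.

#include <utility>

struct __future_t { int __path; }; // stand-in for the backend's future type

template <typename _Policy>
__future_t __submit_reduce_then_scan(_Policy&&) { return {0}; }

template <typename _Policy>
__future_t __submit_fallback_scan(_Policy&&) { return {1}; }

// With `auto`, both return statements must deduce to the same type. If the
// two paths returned different future types, this would fail to compile,
// which is the alignment problem described above.
template <typename _Policy>
auto
__parallel_scan_sketch(_Policy&& __exec, bool __use_reduce_then_scan)
{
    if (__use_reduce_then_scan)
        return __submit_reduce_then_scan(std::forward<_Policy>(__exec));
    return __submit_fallback_scan(std::forward<_Policy>(__exec));
}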
I think it is more of a limitation of the existing DPC++ implementation. From my discussion, it seems there is no easy fix, which is why this workaround was recommended. From their perspective, it is not a bug, since the SYCL specification does not specify any behavior regarding resubmission to a queue after an exception such as this.
On certain integrated graphics architectures, sub-group sizes of 32 are not supported for kernels with certain properties when compiled with -O0 using the icpx compiler. The compiler is normally able to work around this issue by compiling to a sub-group size of 16 instead. However, in cases where an explicit sub-group size is required, the compiler throws an exception at JIT time. This issue directly affects our reduce-then-scan implementation, which has a required sub-group size of 32.

To properly work around this issue, several things must be done. Firstly, exception handling is implemented to catch this synchronous exception while re-throwing any other exceptions back to the user. Secondly, after discussion with compiler developers, kernel compilation must be separated from execution of the kernel to prevent corruption of the underlying sycl::queue that occurs when this exception is thrown after implicit buffer accessor dependencies around the kernel have been established. To do this, kernel bundles are used to first compile the kernel before executing. A minimal sketch of this compile-then-execute pattern follows.
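The sketch below illustrates the pattern under stated assumptions (kernel name, sizes, and the fallback comment are illustrative, not the PR's actual code): the bundle is built first so a JIT failure surfaces before the queue has any dependencies around the kernel, and only the expected error code is absorbed.

#include <sycl/sycl.hpp>

class __rts_kernel; // illustrative kernel name

void
__run_scan_kernel(sycl::queue& __q)
{
    try
    {
        // Step 1: compile via a kernel bundle. A device that cannot honor the
        // required sub-group size fails here, before any submission state exists.
        auto __bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(
            __q.get_context(), {sycl::get_kernel_id<__rts_kernel>()});

        // Step 2: execute with the pre-compiled bundle.
        __q.submit([&](sycl::handler& __cgh) {
            __cgh.use_kernel_bundle(__bundle);
            __cgh.parallel_for<__rts_kernel>(
                sycl::nd_range<1>{sycl::range<1>{32}, sycl::range<1>{32}},
                [] [[sycl::reqd_sub_group_size(32)]] (sycl::nd_item<1>) {});
        });
        __q.wait();
    }
    catch (const sycl::exception& __e)
    {
        if (__e.code() != sycl::errc::kernel_not_supported)
            throw; // any unrelated error still reaches the user
        // fall back to an implementation without the required sub-group size
    }
}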