Work around crashes and incorrect results in scan-based algorithms when compiling with -O0 #1997
Conversation
IGC intentionally forces a sub-group size of 16 on certain iGPUs to work around a known issue. We have to determine this by first compiling the kernels to see if the required sub-group size is respected.
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
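For illustration only (the PR itself detects the mismatch by catching the JIT-time exception, as described at the end of this page), a compiled kernel's effective sub-group size can also be inspected through standard SYCL 2020 kernel queries. The kernel name below is hypothetical, and which descriptor best reflects IGC's forced size of 16 is an assumption:

#include <sycl/sycl.hpp>
#include <cstdint>

class __scan_kernel; // hypothetical name; the kernel must be defined somewhere in the application

// Compile the kernel through a bundle, then ask what sub-group size the
// backend can actually give it on this device. On the affected iGPUs this
// may be 16 even though 32 was requested (descriptor choice is an assumption).
bool
__sub_group_size_respected(sycl::queue& __q, std::uint32_t __required)
{
    auto __bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(
        __q.get_context(), {sycl::get_kernel_id<__scan_kernel>()});
    sycl::kernel __k = __bundle.get_kernel(sycl::get_kernel_id<__scan_kernel>());
    return __k.get_info<sycl::info::kernel_device_specific::max_sub_group_size>(
               __q.get_device()) == __required;
}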
I have not seen examples of this usage pattern in our code before.
I don't have a strong preference between the way I implemented it and directly adding try ... catch throughout this header. I initially did it this way to avoid having many try/catch statements here. Let me leave this open for now so others can weigh in on whether they prefer a more functional approach (sketched below) or direct try...catch blocks.
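A minimal sketch of the kind of helper meant by "a more functional approach", assuming both callables return the same type; the helper name and the fallback shape are illustrative, not the PR's actual code:

#include <sycl/sycl.hpp>

// Illustrative helper: centralizes the try/catch so each call site passes a
// primary submission and a fallback instead of repeating the handling inline.
template <typename _PrimaryPath, typename _FallbackPath>
auto
__run_with_fallback(_PrimaryPath __primary, _FallbackPath __fallback)
{
    try
    {
        return __primary();
    }
    catch (const sycl::exception& __e)
    {
        // Only the known "required sub-group size unsupported" failure is
        // absorbed; anything else is re-thrown to the user.
        if (__e.code() != sycl::errc::kernel_not_supported)
            throw;
        return __fallback();
    }
}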
using _ScanKernel = oneapi::dpl::__par_backend_hetero::__internal::__kernel_name_generator<
    __reduce_then_scan_scan_kernel, _CustomName, _InRng, _OutRng, _GenScanInput, _ReduceOp, _ScanInputTransform,
    _WriteOp, _InitType, _Inclusive, _IsUniquePattern>;
static auto __kernels = __internal::__kernel_compiler<_ReduceKernel, _ScanKernel>::__compile(__exec);
I wanted to point out this unique case and ensure there are no issues with this approach. When benchmarking the single work-group sizes, I was seeing some overhead after switching from the kernel provider to the kernel compiler. Some of the overhead seems related to the kernel bundle itself, which is unavoidable. However, __kernel_compiler creates a std::vector in order to call sycl::get_kernel_bundle, and this allocation/deallocation on each call was leading to measurable slowdowns with small input sizes. To fix this, I have made the variable static, since __kernels should be the same for each call. This is beneficial assuming the application makes multiple calls to the scan-based algorithm, which I expect is the most common case.
"I wanted to point this unique case out and ensure there are no issues with this approach."
I am thinking about possible issues with the uniqueness of the kernels in some corner cases. Is it possible to end up with different kernels for the same instantiation of __parallel_transform_reduce_then_scan? I can imagine, for example:

// no kernel name specified: unnamed lambda case
sycl::queue queue_a{selector_vendor_a{}};
sycl::queue queue_b{selector_vendor_b{}};
// policy_a and policy_b have the same type
dpl::execution::device_policy policy_a(queue_a);
dpl::execution::device_policy policy_b(queue_b);
// ... containers and predicates have the same types
dpl::copy_if(policy_a, ...);
dpl::copy_if(policy_b, ...); // will it use the kernels compiled for queue_a (and thus device "a")?

Such cases are highly unlikely, and I cannot think of any others. I would rather document this as a known limitation with a workaround (name the kernel, as sketched below) than compromise performance (if the overhead is large enough).
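A sketch of the "name the kernel" workaround, using oneDPL's make_device_policy; the kernel names copy_if_a and copy_if_b are illustrative:

#include <sycl/sycl.hpp>
#include <oneapi/dpl/execution>

// Distinct policy names produce distinct kernel name instantiations, so the
// static kernels cached for queue_a can never be picked up for queue_b.
void
__disambiguate(sycl::queue& queue_a, sycl::queue& queue_b)
{
    auto policy_a = oneapi::dpl::execution::make_device_policy<class copy_if_a>(queue_a);
    auto policy_b = oneapi::dpl::execution::make_device_policy<class copy_if_b>(queue_b);
    // dpl::copy_if(policy_a, ...); // compiled and cached for queue_a's device
    // dpl::copy_if(policy_b, ...); // separate instantiation for queue_b's device
}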
I agree with the approach, but I need to review the PR more thoroughly. Indeed, according to the SYCL 2020 spec, launching a kernel decorated with the reqd_sub_group_size attribute may throw an exception depending on what is inside that kernel...
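As background, the sub-group sizes a device supports can be queried with a standard SYCL 2020 descriptor; a minimal sketch follows. Note this is a device-level property: on the affected iGPUs the device can advertise 32 while a particular kernel compiled at -O0 still cannot use it, which is why the JIT-time exception has to be handled anyway.

#include <sycl/sycl.hpp>
#include <algorithm>
#include <cstddef>
#include <vector>

// Standard SYCL 2020 query: the list of sub-group sizes the device supports.
// A specific kernel compiled at -O0 can still fail with
// errc::kernel_not_supported even if 32 appears in this list.
bool
__device_supports_sub_group_size(const sycl::device& __d, std::size_t __size)
{
    const std::vector<std::size_t> __sizes = __d.get_info<sycl::info::device::sub_group_sizes>();
    return std::find(__sizes.begin(), __sizes.end(), __size) != __sizes.end();
}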
I have a question regarding this part: "Secondly, after discussion with compiler developers, kernel compilation must be separated from execution of the kernel to prevent corruption of the underlying sycl::queue". Is it a limitation/bug of the DPC++ SYCL implementation?
// Macro to check if the exception thrown when a kernel cannot be run on a device does not align with
// sycl::errc::kernel_not_supported as required by the SYCL spec. Detects the Intel DPC++ and open-source intel/llvm
// compilers.
#ifdef _ONEDPL_LIBSYCL_VERSION
I would suggest limiting it to a future compiler version.
Limited it up to 20250200 for now for the icpx compiler. With an open-source compiler build, I believe this will stay enabled until we manually update it in the future, since there is no __INTEL_LLVM_COMPILER macro there.
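A hedged sketch of the version gating being described; the guard's macro name is hypothetical, while _ONEDPL_LIBSYCL_VERSION, __INTEL_LLVM_COMPILER, and the 20250200 cutoff come from the discussion above:

// Hypothetical spelling of the guard (the PR's actual macro name may differ).
// icpx defines __INTEL_LLVM_COMPILER (e.g. 20250200); open-source intel/llvm
// builds do not, so they keep the workaround until this check is revisited.
#ifdef _ONEDPL_LIBSYCL_VERSION
#    if !defined(__INTEL_LLVM_COMPILER) || (__INTEL_LLVM_COMPILER < 20250200)
#        define _ONEDPL_WORKAROUND_REQD_SUB_GROUP_SIZE_O0 1
#    endif
#endif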
@@ -1099,7 +1099,7 @@ struct __write_to_id_if_else
 template <typename _ExecutionPolicy, typename _Range1, typename _Range2, typename _UnaryOperation, typename _InitType,
           typename _BinaryOperation, typename _Inclusive>
 auto
-__parallel_transform_scan(oneapi::dpl::__internal::__device_backend_tag __backend_tag, _ExecutionPolicy&& __exec,
+__parallel_transform_scan(oneapi::dpl::__internal::__device_backend_tag __backend_tag, const _ExecutionPolicy& __exec,
Let's keep the original variant with a forwarding reference, which is aligned with other patterns, unless the change is necessary.
There's a tricky scenario with aligning the future return type across the different potential paths when we use the forwarding reference (see the sketch below). I will take another look and see whether it can be implemented that way.
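A minimal, self-contained illustration of the tension (all names hypothetical): with an auto return type, every return statement must deduce to the same type, so both submission paths have to agree on one future type even though a forwarding reference lets them take different routes.

#include <utility>

struct __future_t { int __path; }; // stand-in for the backend's future type

template <typename _Policy>
__future_t __submit_reduce_then_scan(_Policy&&) { return {0}; }

template <typename _Policy>
__future_t __submit_fallback_scan(_Policy&&) { return {1}; }

// With `auto`, both return statements must deduce to the same type. If the
// two paths returned different future types, this would fail to compile,
// which is the alignment problem described above.
template <typename _Policy>
auto
__parallel_scan_sketch(_Policy&& __exec, bool __use_reduce_then_scan)
{
    if (__use_reduce_then_scan)
        return __submit_reduce_then_scan(std::forward<_Policy>(__exec));
    return __submit_fallback_scan(std::forward<_Policy>(__exec));
}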
I think it is more of a limitation of the existing DPC++ implementation. From my discussion, it seems there is no easy fix, which is why this workaround was recommended. From their perspective, it is not a bug, since the SYCL specification does not specify any behavior regarding resubmission to a queue after an exception such as this.
On certain integrated graphics architectures, sub-group sizes of 32 are not supported for kernels with certain properties when compiled with -O0 using the icpx compiler. The compiler is normally able to work around this issue by compiling to a sub-group size of 16 instead. However, in cases where an explicit sub-group size is required, the compiler throws an exception at JIT time. This issue directly affects our reduce-then-scan implementation, which has a required sub-group size of 32.

To properly work around this issue, several things must be done. Firstly, exception handling is implemented to catch this synchronous exception while re-throwing any other exceptions back to the user. Secondly, after discussion with compiler developers, kernel compilation must be separated from execution of the kernel to prevent corruption of the underlying sycl::queue that occurs when this exception is thrown after implicit buffer accessor dependencies around the kernel have been established. To do this, kernel bundles are used to first compile the kernel before executing. A minimal sketch of this compile-then-execute pattern follows.
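The sketch below illustrates the pattern under stated assumptions (kernel name, sizes, and the fallback comment are illustrative, not the PR's actual code): the bundle is built first so a JIT failure surfaces before the queue has any dependencies around the kernel, and only the expected error code is absorbed.

#include <sycl/sycl.hpp>

class __rts_kernel; // illustrative kernel name

void
__run_scan_kernel(sycl::queue& __q)
{
    try
    {
        // Step 1: compile via a kernel bundle. A device that cannot honor the
        // required sub-group size fails here, before any submission state exists.
        auto __bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(
            __q.get_context(), {sycl::get_kernel_id<__rts_kernel>()});

        // Step 2: execute with the pre-compiled bundle.
        __q.submit([&](sycl::handler& __cgh) {
            __cgh.use_kernel_bundle(__bundle);
            __cgh.parallel_for<__rts_kernel>(
                sycl::nd_range<1>{sycl::range<1>{32}, sycl::range<1>{32}},
                [] [[sycl::reqd_sub_group_size(32)]] (sycl::nd_item<1>) {});
        });
        __q.wait();
    }
    catch (const sycl::exception& __e)
    {
        if (__e.code() != sycl::errc::kernel_not_supported)
            throw; // any unrelated error still reaches the user
        // fall back to an implementation without the required sub-group size
    }
}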