Integer arithmetic with overflow checking #3755
base: main
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.

/ok to test
🟨 CI finished in 2h 27m: Pass: 92%/151 | Total: 3d 03h | Avg: 29m 59s | Max: 1h 19m | Hits: 62%/209394

| Project | Modified? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper | |

Modifications in project or dependencies?

| Project | Modified? |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | +/- |
| python | +/- |
| CCCL C Parallel Library | +/- |
| Catch2Helper | +/- |

🏃 Runner counts (total jobs: 151)

| # | Runner |
|---|---|
| 108 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
I've already implemented the saturation arithmetic in #3449; there are just some compiler issues I haven't resolved yet. However, the behaviour is not equivalent: saturation arithmetic just clamps the result. If you need the overflow flag as a result, you may check out that implementation; there are some clever ways to optimize the behaviour on device.
thanks, @davebayer. Indeed, I was going to ask you to take a look at this PR. I'll check whether I can drop the current one if it is redundant with saturation arithmetic.
In my opinion having
let me summarize the differences:
The main open question that I have is whether we want the same semantics as the intrinsic. This would make the implementation more complex without clear benefits IMO (but I could be wrong).
I am against this. I think the user should be consistent with the types passed to the function:

```cpp
int16_t fn(int16_t x)
{
  auto [result, overflow] = cuda::add_overflow(x, 10);
  if (overflow)
  {
    throw std::runtime_error("Error");
  }
  return result;
}
```

The user clearly wants to check against
I would follow the

```cpp
namespace cuda
{

template <class _Tp>
struct op_overflow_result
{
  _Tp value;
  bool overflow;
};

template <class _Tp>
op_overflow_result<_Tp> op_overflow(_Tp __lhs, _Tp __rhs)
{
  op_overflow_result<_Tp> __ret;
  __ret.overflow = __builtin_op_overflow(__lhs, __rhs, &__ret.value);
  return __ret;
}

} // namespace cuda
```
Based on internal discussion and current CUB use cases (https://github.com/NVIDIA/cccl/blob/main/cub/cub/agent/agent_reduce.cuh#L424 and https://github.com/NVIDIA/cccl/blob/main/cub/cub/device/dispatch/dispatch_histogram.cuh#L801), the functions will only check if an operation is valid or not, without providing the result. This is not redundant with the actual computation.
/ok to test
🟨 CI finished in 2h 48m: Pass: 93%/151 | Total: 3d 00h | Avg: 28m 47s | Max: 1h 19m | Hits: 63%/213614
🟨 CI finished in 1h 38m: Pass: 96%/158 | Total: 3d 16h | Avg: 33m 33s | Max: 1h 37m | Hits: 64%/234380
🟨 CI finished in 1h 45m: Pass: 98%/158 | Total: 3d 04h | Avg: 29m 10s | Max: 1h 20m | Hits: 69%/242620
@davebayer It would be great to have your feedback 😊
🟩 CI finished in 1h 50m: Pass: 100%/158 | Total: 3d 04h | Avg: 28m 51s | Max: 1h 25m | Hits: 68%/248217
I've tried to implement the functionality here: https://github.com/davebayer/cccl/tree/overflow_arithmetic/libcudacxx/include/cuda/__numeric

We can implement the

```cpp
template <class _Tp>
bool is_op_overflow(_Tp __lhs, _Tp __rhs)
{
  return op_overflow(__lhs, __rhs).overflow;
}
```

This approach should bring better performance, too. What do you think, @fbusato?
not sure if I'm understanding it correctly.
I will update the PR description to better reflect the intent of this functionality |
Yes, I am referring to the solution I proposed. Actually, the fastest way to check whether an operation overflows is to compute the result and check the overflow flags and the result. I've checked the assembly generated by the compilers and it does exactly that.

I've implemented a fully functional version in both host and device code, preferring builtins and falling back to a generic implementation.
As touched on Monday, I prefer not to waste already available information.
That is why I would prefer the approach of computing the result and also passing around a flag that signifies whether overflow occurred.
I believe there is effectively never a situation where we are completely uninterested in the result of an operation and just want to throw in that hypothetical case.
So throwing away the result in all common cases seems wasteful.
Maybe I should have introduced the solution better. All of the functions have 2 overloads:

```cpp
template <class T>
constexpr bool op_overflow(T x, T y, T& result) noexcept;

template <class T>
constexpr overflow_arithmetic_result_t<T> op_overflow(T x, T y) noexcept;
```

They can be used as:

```cpp
// ...
int val;
if (cuda::add_overflow(x, y, val))
{
  // handle overflow
}
// use `val`
// ...
```

and

```cpp
// ...
if (auto res = add_overflow(x, y))
{
  // handle overflow saved in `res.overflow`
  // use result saved in `res.value`
}
// ...
```

I've already discussed the design with @miscco and he seems to be happy with it. However, in the current implementation all of the inputs must be of the same type. If you insist on type mixing and returning the common type, I can change the implementation. What are your thoughts on this, @fbusato? :)
This is not a waste of available information. Checking for overflow could involve different operations than the actual computation. @davebayer I like the idea of the overloads, but I would prefer to keep
I only optimized the multiplication for device, because I did not come up with anything better than what the generic C++ implementation does. I'd like to demonstrate that there is no performance benefit from having

My implementation generates the same PTX as clang-cuda's

There are the extended-precision integer arithmetic instructions, but we have no way of getting the

The only improvement I see is that NVCC seems to have trouble using predicates, so I could use inline PTX to fix that, but it would bring more complexity to the whole thing.
Add/Subtraction
Your idea is very nice, but I would argue the opposite. Even in the worst case for the comparison (
Multiplication:
Thoughts: I'm still convinced that checking for overflow and computing the operations are two different things:
Personally, I would like to have both versions: boolean value only, and with the result.

Final note about the parameter types: using different types +
Description
Provide the following functions to check if addition, subtraction, multiplication, or division of two integrals (including 128-bit integers) overflows the maximum value or underflows the minimum value of the common type (`cuda::std::common_type_t<T, U>`).

Inspired by https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html and https://clang.llvm.org/docs/LanguageExtensions.html#checked-arithmetic-builtins
Useful when/where an undefined behavior sanitizer is not available (e.g. device code) and for assertions.