
fix softmax, math unary/binary kernel int overflow #8472

Merged
5 commits merged into master on Jun 24, 2022
Conversation

chengtbf
Contributor

Fix potential int32 overflow in the softmax and math unary/binary CUDA kernels.

For kernels that perform integer division on i inside CUDA_1D_KERNEL_LOOP, an if check dispatches into separate branches, so performance is not affected in the common case where int32 indexing is sufficient.
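(A minimal sketch of this dispatch pattern with a placeholder elementwise kernel; the names SquareKernel/LaunchSquare, the op, and the launch configuration are illustrative, not the actual kernels touched by this PR.)

#include <algorithm>
#include <cstdint>
#include <limits>
#include <cuda_runtime.h>

// Kernel templated on the index type: the grid-stride loop variable has type
// IDX, so it cannot overflow when IDX is int64_t.
template<typename T, typename IDX>
__global__ void SquareKernel(IDX elem_cnt, const T* x, T* y) {
  for (IDX i = blockIdx.x * blockDim.x + threadIdx.x; i < elem_cnt;
       i += static_cast<IDX>(gridDim.x) * blockDim.x) {
    y[i] = x[i] * x[i];  // placeholder elementwise op
  }
}

// Launcher picks int32_t while it is safe and falls back to int64_t only for
// very large tensors, so the common case keeps the cheaper 32-bit indexing.
template<typename T>
void LaunchSquare(cudaStream_t stream, int64_t elem_cnt, const T* x, T* y) {
  const int block = 256;
  const int grid = static_cast<int>(
      std::min<int64_t>(std::max<int64_t>((elem_cnt + block - 1) / block, 1), 4096));
  if (elem_cnt < std::numeric_limits<int32_t>::max() / 2) {
    SquareKernel<T, int32_t>
        <<<grid, block, 0, stream>>>(static_cast<int32_t>(elem_cnt), x, y);
  } else {
    SquareKernel<T, int64_t><<<grid, block, 0, stream>>>(elem_cnt, x, y);
  }
}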

dx[i] = __hmul(dy[row_id], __hsub(prob[i], __float2half(1.0)));
} else {
dx[i] = __hmul(dy[row_id], prob[i]);
if (elem_cnt >= 2147483647) {
Contributor


I think it would be better to move this check outside: add a template parameter IDX and dispatch on the index type (DispatchIndexType) at the call site.

Contributor


if (elem_cnt < GetMaxVal<int32_t>() / 2) {

}

Contributor Author

@chengtbf Jun 23, 2022


I think it would be better to move this check outside: add a template parameter IDX and dispatch on the index type (DispatchIndexType) at the call site.

Something like this at the call site?

if (elem_cnt < GetMaxVal<int32_t>() / 2) {
  ComputeEntropyGpuHalf<K, int32_t><<<BlocksNum4ThreadsNum(num_instances), kCudaThreadsNumPerBlock, 0,
                                      stream->As<ep::CudaStream>()->cuda_stream()>>>(
      num_instances, num_classes, depth, lower_bound, reinterpret_cast<const half*>(x), labels,
      reinterpret_cast<half*>(y));
} else {
  ComputeEntropyGpuHalf<K, int64_t><<<BlocksNum4ThreadsNum(num_instances), kCudaThreadsNumPerBlock, 0,
                                      stream->As<ep::CudaStream>()->cuda_stream()>>>(
      num_instances, num_classes, depth, lower_bound, reinterpret_cast<const half*>(x), labels,
      reinterpret_cast<half*>(y));
}
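For reference, the same if/else could also be factored into a small generic dispatcher so each call site writes the launch only once. The sketch below is illustrative and not necessarily the DispatchIndexType utility mentioned above:

#include <cstdint>
#include <limits>
#include <utility>

// Illustrative helper: invoke `func` with the element count cast to int32_t
// when that is safe, otherwise with the original int64_t value. Because `func`
// is a generic lambda, decltype of its argument yields the chosen index type.
template<typename Func>
void DispatchIndexType(int64_t elem_cnt, Func&& func) {
  if (elem_cnt < std::numeric_limits<int32_t>::max() / 2) {
    std::forward<Func>(func)(static_cast<int32_t>(elem_cnt));
  } else {
    std::forward<Func>(func)(elem_cnt);
  }
}

// Usage sketch (launch configuration elided):
//   DispatchIndexType(num_instances, [&](auto n) {
//     using IDX = decltype(n);
//     ComputeEntropyGpuHalf<K, IDX><<<grid, block, 0, cuda_stream>>>(n, /*...*/);
//   });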

Contributor


Yes, that works.

Contributor Author


if (elem_cnt < GetMaxVal<int32_t>() / 2) {

Why divide by 2 here? @guo-ran

Contributor Author


Guo Ran, 5:08 PM:
This is because the loop does not advance by 1 each iteration but by the total number of threads, so to avoid going past the valid index range we seem to uniformly divide the max by 2 throughout the codebase.
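Put differently: with int32 indexing, the stride added at the end of each grid-stride iteration can push the index past INT32_MAX before the bounds check runs. A small illustration (the numbers are made up):

// Illustrative only: why int32 indexing needs headroom well below INT32_MAX.
__global__ void Example(int32_t elem_cnt, float* y) {
  // The stride of a grid-stride loop is gridDim.x * blockDim.x, e.g. 1'048'576.
  for (int32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < elem_cnt;
       i += static_cast<int32_t>(gridDim.x * blockDim.x)) {
    y[i] = 0.f;
    // If elem_cnt were 2'147'483'000 and i == 2'147'482'999 on the last
    // iteration, the next increment (+1'048'576) would exceed INT32_MAX
    // (2'147'483'647): signed overflow is undefined behavior and in practice
    // wraps to a negative index. Keeping elem_cnt < GetMaxVal<int32_t>() / 2
    // leaves room for any realistic stride, so the increment never crosses
    // INT32_MAX.
  }
}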

Contributor Author


Changed here.

@chengtbf chengtbf requested a review from oneflow-ci-bot June 24, 2022 10:14
@github-actions
Contributor

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8472/

@github-actions
Contributor

CI failed when running job: cuda-misc. PR label automerge has been removed

@github-actions
Contributor

Speed stats:

@chengtbf chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot June 24, 2022 15:30
@github-actions
Contributor

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.7ms (= 12972.7ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 149.4ms (= 14939.7ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.15 (= 149.4ms / 129.7ms)

OneFlow resnet50 time: 75.8ms (= 7582.8ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.2ms (= 8522.9ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.12 (= 85.2ms / 75.8ms)

OneFlow resnet50 time: 50.1ms (= 10028.3ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 65.6ms (= 13121.1ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.31 (= 65.6ms / 50.1ms)

OneFlow resnet50 time: 41.1ms (= 8228.9ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.2ms (= 8834.9ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.07 (= 44.2ms / 41.1ms)

OneFlow resnet50 time: 35.6ms (= 7127.2ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 37.9ms (= 7583.9ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.06 (= 37.9ms / 35.6ms)

OneFlow swin dataloader time: 0.406s (= 81.251s / 200, num_workers=1)
PyTorch swin dataloader time: 0.149s (= 29.871s / 200, num_workers=1)
Relative speed: 0.368 (= 0.149s / 0.406s)

OneFlow swin dataloader time: 0.128s (= 25.585s / 200, num_workers=4)
PyTorch swin dataloader time: 0.044s (= 8.717s / 200, num_workers=4)
Relative speed: 0.341 (= 0.044s / 0.128s)

OneFlow swin dataloader time: 0.063s (= 12.619s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.386s / 200, num_workers=8)
Relative speed: 0.348 (= 0.022s / 0.063s)

❌ OneFlow resnet50 time: 145.6ms (= 14561.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 169.0ms (= 16898.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 169.0ms / 145.6ms)

OneFlow resnet50 time: 94.1ms (= 9411.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 112.7ms (= 11272.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 112.7ms / 94.1ms)

OneFlow resnet50 time: 69.6ms (= 13916.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 88.3ms (= 17655.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 88.3ms / 69.6ms)

OneFlow resnet50 time: 57.0ms (= 11402.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 80.3ms (= 16058.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.41 (= 80.3ms / 57.0ms)

OneFlow resnet50 time: 52.2ms (= 10435.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.4ms (= 15288.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.47 (= 76.4ms / 52.2ms)

@github-actions
Contributor

CI failed when running job: cpu-module. PR label automerge has been removed

@chengtbf chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot June 24, 2022 16:28
@github-actions
Contributor

Speed stats:

@chengtbf chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot June 24, 2022 16:36
@github-actions
Contributor

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8472/

@mergify mergify bot merged commit 3ea445a into master Jun 24, 2022
@mergify mergify bot deleted the dev_cc_int_overflow branch June 24, 2022 17:39