Fix gather kernel check #7979

Merged
37 commits merged into master from fix_gather_kernel_check
Apr 17, 2022

Conversation

@BBuf BBuf commented Apr 7, 2022

Added an index check to the gather op so that index values are restricted to the length of the specified dimension, and aligned the error message.
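
As a rough illustration of what such a check guards against, here is a minimal host-side sketch for the 1-D case. It is not the kernel from this PR: the function name Gather1D, the use of std::vector, the exception type, and the extra lower-bound test are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical sketch (not OneFlow code): gather from a 1-D input,
// rejecting any index outside [0, dim_length).
template<typename T, typename IDX_T>
std::vector<T> Gather1D(const std::vector<T>& input, const std::vector<IDX_T>& index) {
  const IDX_T dim_length = static_cast<IDX_T>(input.size());
  std::vector<T> out;
  out.reserve(index.size());
  for (const IDX_T x : index) {
    // Assumption: negative indices are rejected too; the PR snippet only shows x < dim_length.
    if (x < 0 || x >= dim_length) { throw std::out_of_range("gather index is out of bounds"); }
    out.push_back(input[static_cast<std::size_t>(x)]);
  }
  return out;
}
```

Without the check, an out-of-range index would read memory outside the input buffer instead of failing with a clear message.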

oneflow/user/kernels/dim_gather_kernel_util.h (outdated)
XPU_1D_KERNEL_LOOP(index_offset, elem_cnt) {
IDX_T coordinate[kDimGatherMaxDimCount] = {0};
const IDX_T x = index[index_offset];
#ifdef WITH_CUDA

Contributor

I asked juncheng; this should be __CUDA_ARCH__.

Contributor Author

ok

Contributor Author

What is the difference here?
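
For readers with the same question, a short sketch of the distinction. __CUDA_ARCH__ is defined by nvcc only while it compiles device code, so it separates the device pass from the host pass of the same source. WITH_CUDA, by contrast, appears to be a build-level switch that is set for every translation unit in a CUDA-enabled build, including the CPU path of a shared header; that reading of WITH_CUDA is an assumption, since the thread does not spell it out. The HOST_DEVICE macro and the function name below are hypothetical.

```cpp
#include <cassert>

// Hypothetical portability macro so the sketch also builds with a plain host compiler.
#if defined(__CUDACC__)
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Returns true only when this function was compiled in nvcc's device pass.
// In the host pass (or with a host-only compiler) __CUDA_ARCH__ is undefined.
HOST_DEVICE inline bool CompiledForDevice() {
#ifdef __CUDA_ARCH__
  return true;
#else
  return false;
#endif
}
```

A guard on __CUDA_ARCH__ therefore enables a device-side assert only in code that actually runs on the GPU, while a guard on a build-wide macro would also affect the host version of the same loop.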

XPU_1D_KERNEL_LOOP(index_offset, elem_cnt) {
IDX_T coordinate[kDimGatherMaxDimCount] = {0};
const IDX_T x = index[index_offset];
#ifdef __CUDA_ARCH__
assert(x < dim_length && "gather index is out of bounds");

Contributor

Suggested change
- assert(x < dim_length && "gather index is out of bounds");
+ assert(x < dim_length);  // gather index is out of bounds

Wouldn't a comment be enough here? The string only serves as a comment anyway, doesn't it?

Contributor Author

It seems to be a bit different: if this check is triggered, the error is printed directly on the command line.

Contributor

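I don't think so; assert shouldn't be able to print anything.

The string here is just an expression whose value is a valid pointer address (never 0). So the value passed to assert depends entirely on dim_length, which is why I think the string only serves as a comment.

Also, I searched the code, and none of the asserts in .cu files add a string like this.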

@@ -24,30 +24,30 @@ namespace user_op {
template<typename IN_T, typename IDX_T>
__global__ void DoCUDADimGather(const DimOpIndexNdHelper<IDX_T> input_nd_helper,
const DimOpIndexNdHelper<IDX_T> index_nd_helper, int ndim,
- int64_t elem_cnt, int32_t dim, const IDX_T* index,
+ int64_t elem_cnt, int64_t dim_length, int32_t dim, const IDX_T* index,

Contributor

Suggested change
int64_t elem_cnt, int64_t dim_length, int32_t dim, const IDX_T* index,
int64_t elem_cnt, int64_t dim_length, int32_t dim, const IDX_T* index,

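If int32_t is sufficient for int64_t dim_length, it would be better to use int32_t; slimming down the arguments passed to a global kernel is important.
That said, I'm not sure whether int32_t is enough; you are in a better position to judge.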

Contributor Author

OK, I'll change it back to int32_t.
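
If there is any doubt about whether a dimension length fits in int32_t, one option is to narrow it on the host with an explicit range check before launching the kernel. The sketch below is only an illustration of that idea; the function name NarrowDimLength and the exception type are assumptions, and nothing in this thread indicates that the PR does this.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not OneFlow code): narrow an int64_t length to int32_t,
// failing loudly instead of silently truncating.
inline int32_t NarrowDimLength(int64_t dim_length) {
  if (dim_length < 0 || dim_length > std::numeric_limits<int32_t>::max()) {
    throw std::overflow_error("dim_length does not fit in int32_t");
  }
  return static_cast<int32_t>(dim_length);
}
```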

@github-actions

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/7979/

@BBuf BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot April 13, 2022 07:14
@BBuf BBuf added the automerge label Apr 13, 2022
@github-actions

CI failed when running job: cuda-module. PR label automerge has been removed

@github-actions

CI failed when running job: cuda-benchmark. PR label automerge has been removed

@BBuf BBuf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot April 14, 2022 01:57
@BBuf BBuf added the automerge label Apr 16, 2022
@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

✔️ OneFlow resnet50 time: 128.7ms (= 12871.2ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 140.0ms (= 13997.9ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.09 (= 140.0ms / 128.7ms)

OneFlow resnet50 time: 80.4ms (= 8039.0ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.9ms (= 8391.3ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.04 (= 83.9ms / 80.4ms)

OneFlow resnet50 time: 52.6ms (= 10516.8ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 55.5ms (= 11091.3ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.05 (= 55.5ms / 52.6ms)

OneFlow resnet50 time: 41.6ms (= 8314.4ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 47.3ms (= 9468.4ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.14 (= 47.3ms / 41.6ms)

OneFlow resnet50 time: 39.8ms (= 7963.7ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 38.0ms (= 7607.2ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 0.96 (= 38.0ms / 39.8ms)

OneFlow swin dataloader time: 0.251s (= 50.230s / 200, num_workers=1)
PyTorch swin dataloader time: 0.259s (= 51.791s / 200, num_workers=1)
✔️ Relative speed: 1.031 (= 0.259s / 0.251s)

OneFlow swin dataloader time: 0.068s (= 13.586s / 200, num_workers=4)
PyTorch swin dataloader time: 0.067s (= 13.333s / 200, num_workers=4)
✔️ Relative speed: 0.981 (= 0.067s / 0.068s)

OneFlow swin dataloader time: 0.036s (= 7.260s / 200, num_workers=8)
PyTorch swin dataloader time: 0.037s (= 7.479s / 200, num_workers=8)
✔️ Relative speed: 1.030 (= 0.037s / 0.036s)

✔️ OneFlow resnet50 time: 135.3ms (= 13532.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.0ms (= 16002.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 160.0ms / 135.3ms)

OneFlow resnet50 time: 86.4ms (= 8645.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 99.2ms (= 9920.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 99.2ms / 86.4ms)

OneFlow resnet50 time: 61.9ms (= 12373.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.2ms (= 15239.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.23 (= 76.2ms / 61.9ms)

OneFlow resnet50 time: 52.1ms (= 10417.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.8ms (= 13563.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.30 (= 67.8ms / 52.1ms)

OneFlow resnet50 time: 52.3ms (= 10453.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.4ms (= 14487.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.39 (= 72.4ms / 52.3ms)

@github-actions

CI failed when running job: cuda-module. PR label automerge has been removed

@BBuf BBuf added the automerge label Apr 17, 2022
@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

✔️ OneFlow resnet50 time: 128.5ms (= 12849.2ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 138.7ms (= 13873.1ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.08 (= 138.7ms / 128.5ms)

OneFlow resnet50 time: 78.7ms (= 7867.6ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.7ms (= 8365.6ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.06 (= 83.7ms / 78.7ms)

OneFlow resnet50 time: 54.5ms (= 10890.6ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 62.0ms (= 12400.3ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.14 (= 62.0ms / 54.5ms)

OneFlow resnet50 time: 43.2ms (= 8648.7ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 48.6ms (= 9724.9ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.12 (= 48.6ms / 43.2ms)

OneFlow resnet50 time: 37.3ms (= 7460.3ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.4ms (= 7876.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.06 (= 39.4ms / 37.3ms)

OneFlow swin dataloader time: 0.251s (= 50.102s / 200, num_workers=1)
PyTorch swin dataloader time: 0.254s (= 50.786s / 200, num_workers=1)
✔️ Relative speed: 1.014 (= 0.254s / 0.251s)

OneFlow swin dataloader time: 0.065s (= 13.079s / 200, num_workers=4)
PyTorch swin dataloader time: 0.070s (= 13.980s / 200, num_workers=4)
✔️ Relative speed: 1.069 (= 0.070s / 0.065s)

OneFlow swin dataloader time: 0.036s (= 7.295s / 200, num_workers=8)
PyTorch swin dataloader time: 0.039s (= 7.828s / 200, num_workers=8)
✔️ Relative speed: 1.073 (= 0.039s / 0.036s)

✔️ OneFlow resnet50 time: 135.4ms (= 13537.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 157.8ms (= 15776.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 157.8ms / 135.4ms)

OneFlow resnet50 time: 88.1ms (= 8813.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 105.6ms (= 10558.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 105.6ms / 88.1ms)

OneFlow resnet50 time: 59.1ms (= 11816.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.5ms (= 15104.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 75.5ms / 59.1ms)

OneFlow resnet50 time: 53.1ms (= 10618.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 66.5ms (= 13309.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.25 (= 66.5ms / 53.1ms)

OneFlow resnet50 time: 48.2ms (= 9630.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 66.1ms (= 13216.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.37 (= 66.1ms / 48.2ms)

@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

✔️ OneFlow resnet50 time: 128.5ms (= 12850.6ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 141.0ms (= 14103.4ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 141.0ms / 128.5ms)

OneFlow resnet50 time: 80.2ms (= 8021.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.6ms (= 8560.5ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.07 (= 85.6ms / 80.2ms)

OneFlow resnet50 time: 51.3ms (= 10262.3ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 59.7ms (= 11930.7ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.16 (= 59.7ms / 51.3ms)

OneFlow resnet50 time: 41.9ms (= 8379.8ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 46.0ms (= 9196.9ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.10 (= 46.0ms / 41.9ms)

OneFlow resnet50 time: 39.7ms (= 7939.4ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 40.1ms (= 8013.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.01 (= 40.1ms / 39.7ms)

OneFlow swin dataloader time: 0.254s (= 50.883s / 200, num_workers=1)
PyTorch swin dataloader time: 0.253s (= 50.698s / 200, num_workers=1)
✔️ Relative speed: 0.996 (= 0.253s / 0.254s)

OneFlow swin dataloader time: 0.064s (= 12.889s / 200, num_workers=4)
PyTorch swin dataloader time: 0.065s (= 13.046s / 200, num_workers=4)
✔️ Relative speed: 1.012 (= 0.065s / 0.064s)

OneFlow swin dataloader time: 0.036s (= 7.272s / 200, num_workers=8)
PyTorch swin dataloader time: 0.037s (= 7.460s / 200, num_workers=8)
✔️ Relative speed: 1.026 (= 0.037s / 0.036s)

✔️ OneFlow resnet50 time: 135.6ms (= 13561.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 155.9ms (= 15588.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 155.9ms / 135.6ms)

OneFlow resnet50 time: 87.1ms (= 8714.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 99.3ms (= 9932.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.14 (= 99.3ms / 87.1ms)

OneFlow resnet50 time: 62.3ms (= 12457.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.0ms (= 15206.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.22 (= 76.0ms / 62.3ms)

OneFlow resnet50 time: 51.4ms (= 10285.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.8ms (= 15363.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.49 (= 76.8ms / 51.4ms)

OneFlow resnet50 time: 48.4ms (= 9678.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 61.6ms (= 12322.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 61.6ms / 48.4ms)

@mergify mergify bot merged commit 6e9431d into master Apr 17, 2022
@mergify mergify bot deleted the fix_gather_kernel_check branch April 17, 2022 15:00