Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port PadConstantForwardContiguous #4

Closed

Conversation

o2buzzle
Copy link
Collaborator

@o2buzzle o2buzzle commented Apr 15, 2024

This PR ports the PadConstantForwardContiguous OpenCL kernel to MIOpen. Closes MV-379.

Checklist:

  • Kernel ported from OpenCL
  • MIOpen frontend code (prepare/launch the kernel)
  • CPU implementation of the operation (for unit tests)
  • Library components builds and run as expected
  • gtest unit tests (working & passing verification)
  • MIOpenDriver driver code (working & passing verification -V)
  • Code cleanup

Performance comparison with PyTorch ROCm:

bfloat16

op_name dtype size padding direction rocm_kernel_avg miopen_kernel_duration time reduction
PadConstantContiguous bfloat16 [1 1 4 8 8] [0 0 1 1 1 1 1 1 1 1] fwd 9280 6832 1.358313817
PadConstantContiguous bfloat16 [1 1 4 8 8] [0 0 2 2 2 2 2 2 2 2] fwd 8560 6032 1.419098143
PadConstantContiguous bfloat16 [1 1 4 8 8] [0 0 4 4 4 4 4 4 4 4] fwd 8608 5376 1.601190476
PadConstantContiguous bfloat16 [1 2 16 64 256] [0 0 1 1 1 1 1 1 1 1] fwd 20320 22720 0.8943661972
PadConstantContiguous bfloat16 [1 2 16 64 256] [0 0 2 2 2 2 2 2 2 2] fwd 30208 35264 0.8566243194
PadConstantContiguous bfloat16 [1 2 16 64 256] [0 0 4 4 4 4 4 4 4 4] fwd 70177 70080 1.001384132
PadConstantContiguous bfloat16 [1 4 32 128 256] [0 0 1 1 1 1 1 1 1 1] fwd 65616 110833 0.5920258407
PadConstantContiguous bfloat16 [1 4 32 128 256] [0 0 2 2 2 2 2 2 2 2] fwd 104353 154129 0.677049744
PadConstantContiguous bfloat16 [1 4 32 128 256] [0 0 4 4 4 4 4 4 4 4] fwd 242498 259123 0.93584128

float32

op_name dtype size padding direction rocm_kernel_avg miopen_kernel_duration time reduction
PadConstantContiguous float32 [1 1 4 8 8] [0 0 1 1 1 1 1 1 1 1] fwd 10128 6864 1.475524476
PadConstantContiguous float32 [1 1 4 8 8] [0 0 2 2 2 2 2 2 2 2] fwd 8544 6176 1.383419689
PadConstantContiguous float32 [1 1 4 8 8] [0 0 4 4 4 4 4 4 4 4] fwd 9584 5424 1.766961652
PadConstantContiguous float32 [1 2 16 64 256] [0 0 1 1 1 1 1 1 1 1] fwd 26032 22320 1.166308244
PadConstantContiguous float32 [1 2 16 64 256] [0 0 2 2 2 2 2 2 2 2] fwd 44384 34448 1.288434742
PadConstantContiguous float32 [1 2 16 64 256] [0 0 4 4 4 4 4 4 4 4] fwd 125617 67728 1.854727734
PadConstantContiguous float32 [1 4 32 128 256] [0 0 1 1 1 1 1 1 1 1] fwd 98625 107073 0.9211005576
PadConstantContiguous float32 [1 4 32 128 256] [0 0 2 2 2 2 2 2 2 2] fwd 175121 148705 1.177640295
PadConstantContiguous float32 [1 4 32 128 256] [0 0 4 4 4 4 4 4 4 4] fwd 454772 248962 1.826672344

float16

op_name dtype size padding direction rocm_kernel_avg miopen_kernel_duration time reduction
PadConstantContiguous float16 [1 1 4 8 8] [0 0 1 1 1 1 1 1 1 1] fwd 10272 6576 1.562043796
PadConstantContiguous float16 [1 1 4 8 8] [0 0 2 2 2 2 2 2 2 2] fwd 8960 5936 1.509433962
PadConstantContiguous float16 [1 1 4 8 8] [0 0 4 4 4 4 4 4 4 4] fwd 9152 5536 1.653179191
PadConstantContiguous float16 [1 2 16 64 256] [0 0 1 1 1 1 1 1 1 1] fwd 20448 21856 0.9355783309
PadConstantContiguous float16 [1 2 16 64 256] [0 0 2 2 2 2 2 2 2 2] fwd 30368 33888 0.896128423
PadConstantContiguous float16 [1 2 16 64 256] [0 0 4 4 4 4 4 4 4 4] fwd 71424 67264 1.061845861
PadConstantContiguous float16 [1 4 32 128 256] [0 0 1 1 1 1 1 1 1 1] fwd 66641 106353 0.6266019764
PadConstantContiguous float16 [1 4 32 128 256] [0 0 2 2 2 2 2 2 2 2] fwd 105873 147809 0.7162824997
PadConstantContiguous float16 [1 4 32 128 256] [0 0 4 4 4 4 4 4 4 4] fwd 247426 248387 0.9961310375

Average over all cases:

type average
bfloat16 1.037
float32 1.429
float16 1.106

@o2buzzle o2buzzle force-pushed the port/PadConstantFwdContiguous branch from e733b39 to 71b568b Compare April 15, 2024 09:57
@o2buzzle o2buzzle force-pushed the port/PadConstantFwdContiguous branch from 68bdcac to 7a051ed Compare April 16, 2024 02:41
@o2buzzle o2buzzle self-assigned this Apr 16, 2024
@o2buzzle o2buzzle marked this pull request as ready for review April 16, 2024 06:50
@o2buzzle o2buzzle force-pushed the port/PadConstantFwdContiguous branch from f244ab6 to 13843f9 Compare April 16, 2024 07:15
include/miopen/miopen.h Outdated Show resolved Hide resolved
src/kernels/MIOpenPadConstantFwd.cpp Outdated Show resolved Hide resolved
src/solver/pad_constant/pad_constant_fwd_contiguous.cpp Outdated Show resolved Hide resolved
src/kernels/MIOpenPadConstantFwd.cpp Outdated Show resolved Hide resolved
test/gtest/pad_constant.hpp Outdated Show resolved Hide resolved
@o2buzzle
Copy link
Collaborator Author

@et16kr About 98bba90, Are you sure we should combine INPUT_TYPE and OUTPUT_TYPE into one? I think other kernels still specify input and output that way, even when they are of the same type

@et16kr
Copy link

et16kr commented Apr 23, 2024

@et16kr About 98bba90, Are you sure we should combine INPUT_TYPE and OUTPUT_TYPE into one? I think other kernels still specify input and output that way, even when they are of the same type

I knew that there are cases where input and output dtypes are separate, so I deleted my comment, but you already changed it. Please ignore my comments regarding dtype.

@o2buzzle
Copy link
Collaborator Author

I have rolled the changes back.

@kyeonghwanryu
Copy link

  1. Remove rocm_op_avg and use_miopen column
  2. PadConstantContiguous is a name of the kernel, not the operation.
  3. Add geomean result by types.
    The purpose of perf result table is to persuade AMD to push our operations into their repo so it's better to be concise. rocm_op_avg doesn't mean anything if the kernel gaps don't exist or are ignorable. Same goes to use_miopen. If it's TRUE, we'd used MIOpenDriver to measure older implementation so rocm_kernel_avg could be named like miopen_kernel_old.

About other things, good work. You can change your ticket to done once you take care of the perf result table.

@kyeonghwanryu
Copy link

And it looks you need to narrow down the condition in IsApplicable to guarantee better performance.
Think about it while you do next work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants