Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NCHW][SWDEV-312112] fix local buffer alignment bug on transpose+nhwc kernel's solver #1324

Merged
merged 8 commits into from
Dec 4, 2021

Conversation

shaojiewang
Copy link
Contributor

@shaojiewang shaojiewang commented Dec 3, 2021

root cause fix for swdev312112:
[nchw]fix local buffer alignment bug on transpose+nhwc kernel's solver.

added by carlus:

original buffer       workspace
inp NCHW fp16  --> 1# inp NHWC fp16 buffer
wei NCHW fp16  --> 2# wei NHWC fp16 buffer
out NCHW fp16  --> 3# out NHWC fp16 buffer
                   4# out NHWC fp32 buffer  <- bug here

Above is what we are implementing in fwd when input is NCHW but use NHWC kernel with transpose. Basically there should be 4 buffers in workspace, where the first 3 buffers are for transpose the original inp/wei/out buffer. the last one is for some kernel that need use atomic fp32 to accumulate the result, then cast back to fp16. We always allocate a single workspace, and keep an offset inside each of the buffer.
So the problem here is the offset for 4# may not be aligned to multiple of 4 byte, since the previous 3 buffers are fp16. If 4# buffer are not aligned to 4 byte, fp32 atomic will crash.
This PR tries to solve this problem by adding some extra byte to make sure 4# offset is aligned to multiple of 4.

@codecov

This comment has been minimized.

@junliume junliume changed the title [nchw]fix swdev312112 [nchw][SWDEV-312112] fix local buffer alignment bug on transpose+nhwc kernel's solver Dec 3, 2021
@junliume
Copy link
Collaborator

junliume commented Dec 3, 2021

I can confirm that the specific command as listed in [SWDEV-312112] is fixed by this PR.

Meanwhile, we still get RuntimeError: test_nn.py failed! FAILED (failures=13, skipped=78, expected failures=4)

@junliume junliume changed the title [nchw][SWDEV-312112] fix local buffer alignment bug on transpose+nhwc kernel's solver [NCHW][SWDEV-312112] fix local buffer alignment bug on transpose+nhwc kernel's solver Dec 3, 2021
@junliume junliume added this to the ROCm 5.0 milestone Dec 3, 2021
@atamazov
Copy link
Contributor

atamazov commented Dec 3, 2021

This seems Ok as a hotfix. However we should hide details in some suitable abstraction instead hacky code here and there. In this specific case, I can propose some class that can represents metrics of a workspace that logically consists of two sub-buffers, each with its own size and alignment. Very basic example:

WorkspaceBuf2Traits wt(size_0, size_1, alignment_1);
...
wt.GetSize(); // total workspace size
wt.GetOffset1(); // offset of subbuffer 1

Copy link
Contributor

@atamazov atamazov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditionally approved.

@atamazov
Copy link
Contributor

atamazov commented Dec 3, 2021

A complicated example of handling sub-buffers can be found in ConvWinograd3x3MultipassWrW. @shurale-nkn Your recommendations?

Copy link
Collaborator

@junliume junliume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Post merge issue has been raised: #1326

@junliume junliume merged commit 330bdd9 into develop Dec 4, 2021
@atamazov
Copy link
Contributor

atamazov commented Dec 4, 2021

@shaojiewang I assumed that #1324 would resolve #1317 and tried it. Unfortunately this is not so. It may worth double-checking #1324 -- maybe some of the fixes are still missing (and HIP is tolerant to these, but OCL doesn't)?

@shaojiewang
Copy link
Contributor Author

@shaojiewang I assumed that #1324 would resolve #1317 and tried it. Unfortunately this is not so. It may worth double-checking #1324 -- maybe some of the fixes are still missing (and HIP is tolerant to these, but OCL doesn't)?

Yes, I'll check it ASAP.

@shaojiewang
Copy link
Contributor Author

shaojiewang commented Dec 6, 2021

@shaojiewang I assumed that #1324 would resolve #1317 and tried it. Unfortunately this is not so. It may worth double-checking #1324 -- maybe some of the fixes are still missing (and HIP is tolerant to these, but OCL doesn't)?

AFAICS, OCL backend requires a larger alignment(128bytes). I'll prepare another PR to solve both #1317 and #1326

@atamazov
Copy link
Contributor

atamazov commented Dec 6, 2021

@shaojiewang

I'll prepare another PR to solve both #1317 and #1326

Good! Just in case -- the OCL problem is not of blocker priority anymore, so we have time))

@atamazov
Copy link
Contributor

atamazov commented Dec 17, 2021

The find-db NCHW records for

  • ConvAsmImplicitGemmGTCDynamicBwdXdlopsNHWC
  • ConvAsmImplicitGemmGTCDynamicFwdXdlopsNHWC
  • ConvAsmImplicitGemmGTCDynamicWrwXdlopsNHWC

must be refreshed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants