gemm bwd solver result in segment fault #2474

Closed
carlushuang opened this issue Oct 23, 2023 · 8 comments
@carlushuang
Contributor

carlushuang commented Oct 23, 2023

Using the ROCm 5.7 docker with the latest MIOpen commit, all three GEMM backward solvers now fail with a segmentation fault:
GemmBwd1x1_stride2, GemmBwd1x1_stride1, GemmBwdRest

For example:
MIOPEN_ENABLE_LOGGING=1 MIOPEN_FIND_MODE=4 MIOPEN_FIND_ENFORCE=1 MIOPEN_LOG_LEVEL=9 MIOPEN_ENABLE_LOGGING_CMD=1 ./bin/MIOpenDriver convfp16 -n 16 -c 256 -H 16 -W 16 -k 256 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -g 1 -F 2 -t 1 -V 0

MIOPEN_ENABLE_LOGGING=1 MIOPEN_FIND_MODE=4 MIOPEN_FIND_ENFORCE=1 MIOPEN_LOG_LEVEL=9 MIOPEN_ENABLE_LOGGING_CMD=1 ./bin/MIOpenDriver convfp16 -n 16 -c 256 -H 16 -W 16 -k 256 -y 1 -x 1 -p 0 -q 0 -u 2 -v 2 -l 1 -j 1 -g 1 -F 2 -t 1 -V 0

MIOPEN_ENABLE_LOGGING=1 MIOPEN_FIND_MODE=4 MIOPEN_FIND_ENFORCE=1 MIOPEN_LOG_LEVEL=9 MIOPEN_ENABLE_LOGGING_CMD=1 ./bin/MIOpenDriver convfp16 -n 16 -c 256 -H 16 -W 16 -k 512 -y 2 -x 2 -p 0 -q 0 -u 2 -v 2 -l 1 -j 1 -g 1 -F 2 -t 1 -V 0

Example log:
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: MIOpenUtilKernels4.cl.o; args:  -DMIOPEN_USE_FP16=1 -DMIOPEN_USE_FP16x4=0 -DMIOPEN_USE_FP16x8=0 -DMIOPEN_USE_FP32=0 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DMIOPEN_FP8_IEEE_EXPONENT_BIAS=1 -DMIOPEN_FP8_CLIPPING=1 -mcpu=gfx90a
MIOpen(HIP): Info2 [Prepare] INSERT OR REPLACE INTO kern_db(kernel_name, kernel_args, kernel_blob, kernel_hash, uncompressed_size) VALUES(?, ?, ?, ?, ?);
MIOpen(HIP): Info2 [Measure] Db::StoreRecord time: 6.44736 ms
MIOpen(HIP): Info2 [run] kernel_name = transpose_NCHW2CNHW_V2_2D_WG, global_work_dim = { 64, 1, 256 }, local_work_dim = { 64, 1, 1 }
MIOpen(HIP): Info2 [CallGemm] gemm_desc: {isColMajor 0, transA 1, transB 0, m 256, n 1024, k 256, lda 256, ldb 1024, ldc 1024, batch_count 1, strideA 0, strideB 0, strideC 0, alpha 1, beta 0, dataType half, a_cast_type half, b_cast_type half} 
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, int, ConstData_t, int, Data_t, int, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
Segmentation fault (core dumped)
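
As a rough cross-check of the gemm_desc line above, the sketch below applies the standard convolution output-size formula to the driver arguments. It is only a reading of the numbers, not something confirmed in the thread: the GEMM dimensions (m 256, n 1024, k 256) are consistent with the second command (1x1 filter, stride 2), under the assumption that n is batch times output height times output width. The helper name conv_out_size is hypothetical.

```python
def conv_out_size(in_size, pad, dilation, kernel, stride):
    """Standard convolution output size along one spatial dimension."""
    return (in_size + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1


# Values from the second command: -n 16 -c 256 -H 16 -W 16 -k 256
# -y 1 -x 1 -p 0 -q 0 -u 2 -v 2 -l 1 -j 1
n, c, h, w, k = 16, 256, 16, 16, 256
ho = conv_out_size(h, pad=0, dilation=1, kernel=1, stride=2)
wo = conv_out_size(w, pad=0, dilation=1, kernel=1, stride=2)

# Assumption: GEMM n = batch * out_h * out_w, m = input channels, k = output channels.
gemm_m, gemm_n, gemm_k = c, n * ho * wo, k
print(ho, wo, gemm_m, gemm_n, gemm_k)
```

With these inputs the output size is 8x8, so n = 16 * 8 * 8 = 1024, which matches the logged descriptor.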

If the above configs are run with the environment variable MIOPEN_DEBUG_CONV_GEMM=0, everything works.
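
The comparison described above (each config with and without MIOPEN_DEBUG_CONV_GEMM=0) could be scripted roughly as follows. This is a minimal sketch, not part of the original report: the config labels and the helper names build_cmd, build_env, classify and run_all are made up for illustration, the driver path is taken from the commands above, and the logging variables are left out to keep the output short.

```python
import os
import signal
import subprocess

# Only -k, -y/-x and -u/-v differ between the three commands in the report.
# The labels are illustrative and not official solver names.
CONFIGS = {
    "1x1_stride1": dict(k=256, y=1, x=1, u=1, v=1),
    "1x1_stride2": dict(k=256, y=1, x=1, u=2, v=2),
    "2x2_stride2": dict(k=512, y=2, x=2, u=2, v=2),
}


def build_cmd(cfg, driver="./bin/MIOpenDriver"):
    """Rebuild the MIOpenDriver command line used in the report."""
    return [driver, "convfp16", "-n", "16", "-c", "256", "-H", "16", "-W", "16",
            "-k", str(cfg["k"]), "-y", str(cfg["y"]), "-x", str(cfg["x"]),
            "-p", "0", "-q", "0", "-u", str(cfg["u"]), "-v", str(cfg["v"]),
            "-l", "1", "-j", "1", "-g", "1", "-F", "2", "-t", "1", "-V", "0"]


def build_env(disable_gemm):
    """Environment with the find settings from the report, optionally with GEMM off."""
    env = dict(os.environ, MIOPEN_FIND_MODE="4", MIOPEN_FIND_ENFORCE="1")
    env.pop("MIOPEN_DEBUG_CONV_GEMM", None)
    if disable_gemm:
        env["MIOPEN_DEBUG_CONV_GEMM"] = "0"
    return env


def classify(returncode):
    """On POSIX, subprocess reports death by signal N as return code -N."""
    if returncode == 0:
        return "ok"
    if returncode == -signal.SIGSEGV:
        return "segfault"
    return f"exit {returncode}"


def run_all():
    """Run every config twice and print whether it passed or crashed."""
    for name, cfg in CONFIGS.items():
        for disable_gemm in (False, True):
            proc = subprocess.run(build_cmd(cfg), env=build_env(disable_gemm),
                                  stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL)
            mode = "GEMM disabled" if disable_gemm else "GEMM enabled"
            print(f"{name:12s} {mode:14s} {classify(proc.returncode)}")
```

Calling run_all() from the MIOpen build directory would print one line per config and mode. On the affected docker, the report suggests "segfault" with GEMM enabled and "ok" with it disabled.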

@carlushuang
Contributor Author

carlushuang commented Oct 23, 2023

A hotfix for this issue is in #2477.
It disables gemm_bwd completely, so I think a detailed analysis of this solver may be needed.

@atamazov
Contributor

@carlushuang Can you please fix the formatting of the log in the description? See https://github.com/ROCmSoftwarePlatform/MIOpen/wiki/How-to-insert-console-logs-into-github-pages. Thanks.

@atamazov
Contributor

@carlushuang @junliume Please provide me with the ROCm 5.7 docker image (identical to the one in which the problem occurs).

@carlushuang

> with latest MIOpen commit.

Please provide the git hash explicitly, thanks!

@carlushuang
Contributor Author

carlushuang commented Oct 23, 2023

Hi @atamazov, I used rocm/composable_kernel-private:ck_ub20.04_rocm5.7, manually installed all the dependencies, and built MIOpen. The git hash is the latest, 5798751.

@atamazov
Contributor

@carlushuang Thanks! Two more questions:

  • Have you tried other docker images?
  • Is the issue reproducible on other HW, e.g. MI50/60, Navi21, or MI100?

@junliume
Collaborator

junliume commented Oct 23, 2023

> @carlushuang Thanks! Two more questions:
>
>   • Have you tried other docker images?
>   • Is the issue reproducible on other HW, e.g. MI50/60, Navi21, or MI100?

Unfortunately, the issue in the original post cannot be reproduced in the latest mainline docker.

A little more digging into this issue:

fail_rocm_composable_kernel_private_ck_ub2004_rocm57.log
pass_latest_mainline.log

Failing:

#define ROCBLAS_VERSION_MAJOR       3
#define ROCBLAS_VERSION_MINOR       1
#define ROCBLAS_VERSION_PATCH       0
#define ROCBLAS_VERSION_TWEAK       5dfd8804-dirty

Passing:

#define ROCBLAS_VERSION_MAJOR       4
#define ROCBLAS_VERSION_MINOR       0
#define ROCBLAS_VERSION_PATCH       0
#define ROCBLAS_VERSION_TWEAK       98ac82bc-dirty

Another CI docker, rocm/miopen:ci_37df35, actually passes with:

#define ROCBLAS_VERSION_MAJOR       3
#define ROCBLAS_VERSION_MINOR       1
#define ROCBLAS_VERSION_PATCH       0
#define ROCBLAS_VERSION_TWEAK       b80e4220-dirty
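
To make the comparison above easier to read, here is a small sketch that parses the three sets of ROCBLAS_VERSION_* defines and compares them. The define values are copied from this comment; the function name parse_rocblas_version and the variable names are hypothetical, and the header text is passed in as a string rather than read from any particular install path.

```python
import re

_DEFINE = re.compile(
    r"#define\s+ROCBLAS_VERSION_(MAJOR|MINOR|PATCH|TWEAK)\s+(\S+)")


def parse_rocblas_version(header_text):
    """Return (major, minor, patch, tweak) from rocBLAS version defines."""
    fields = dict(_DEFINE.findall(header_text))
    return (int(fields["MAJOR"]), int(fields["MINOR"]),
            int(fields["PATCH"]), fields["TWEAK"])


FAILING = """
#define ROCBLAS_VERSION_MAJOR       3
#define ROCBLAS_VERSION_MINOR       1
#define ROCBLAS_VERSION_PATCH       0
#define ROCBLAS_VERSION_TWEAK       5dfd8804-dirty
"""

PASSING_MAINLINE = """
#define ROCBLAS_VERSION_MAJOR       4
#define ROCBLAS_VERSION_MINOR       0
#define ROCBLAS_VERSION_PATCH       0
#define ROCBLAS_VERSION_TWEAK       98ac82bc-dirty
"""

PASSING_CI = """
#define ROCBLAS_VERSION_MAJOR       3
#define ROCBLAS_VERSION_MINOR       1
#define ROCBLAS_VERSION_PATCH       0
#define ROCBLAS_VERSION_TWEAK       b80e4220-dirty
"""

failing = parse_rocblas_version(FAILING)
passing_ci = parse_rocblas_version(PASSING_CI)

# Same numeric version, different tweak: the two builds come from different commits.
same_numeric = failing[:3] == passing_ci[:3]
same_tweak = failing[3] == passing_ci[3]
print(failing, passing_ci, same_numeric, same_tweak)
```

The failing docker and the passing CI docker share the numeric version 3.1.0 and differ only in the tweak hash, so the major/minor/patch version alone does not separate the failing case from the passing ones.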

@atamazov
Contributor

@junliume Thank you!

@carlushuang
Contributor Author

I manually built a docker image with MIOpen's Dockerfile and tested it; luckily, the problem does not reproduce. So this is an issue specific to that docker image, and it will not block #2459.

3 participants