ResNext3D layer selected incorrectly large WSS causing OOM failure #381

Closed
daniellowell opened this issue Aug 14, 2020 · 12 comments

@daniellowell
Contributor

daniellowell commented Aug 14, 2020

ROCm 3.8 Blocking issue.

JIRA tracking issue for ResNext3D:
http://ontrack-internal.amd.com/browse/SWDEV-246350

In the default hybrid find mode, the Caffe2 model crashes with OOM. In normal find mode it passes.

Analysis:

  • Deleting the user Find-Db and running in default hybrid mode also fails with OOM.

  • MIOPEN_LOG_LEVEL=6 for default hybrid mode shows 1.3GB is requested by layer:

MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 256-1-14-14-3x3x3-256-1-14-14-4-1x1x1-1x1x1-1x1x1-0-NCHW-FP32-B_g256 in file /root/.config/miopen//gfx90878.HIP.2_6_0_8145-rocm-rel-3.7-19-c16087a4.ufdb.txt
MIOpen(HIP): Info2 [FindRecordUnsafe] Key match: 256-1-14-14-3x3x3-256-1-14-14-4-1x1x1-1x1x1-1x1x1-0-NCHW-FP32-B_g256
MIOpen(HIP): Info2 [FindRecordUnsafe] Contents found: miopenConvolutionBwdDataAlgoGEMM:gemm,0.20544,1387266048,rocBlas,<unused>
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.194715 ms
MIOpen(HIP): Info2 [BackwardDataGetWorkSpaceSize] 1387266048
  • Normal mode is actually requesting 5MB:
MIOpen(HIP): Info2 [BackwardDataGetWorkSpaceSize] 5419008
MIOpen(HIP): Info [FindConvBwdDataAlgorithm] requestAlgoCount = 1, workspace = 5419008
MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 256-1-14-14-3x3x3-256-1-14-14-4-1x1x1-1x1x1-1x1x1-0-NCHW-FP32-B_g256 in file /root/.config/miopen//gfx90878.HIP.2_6_0_8145-rocm-rel-3.7-19-c16087a4.ufdb.txt
  • After a normal-mode run starting from a clean ufdb, the Find-Db is populated with the incorrect size (1.3GB) rather than the correct 5MB size.

  • Driver command to reproduce the issue:

MIOPEN_ENABLE_LOGGING_CMD=1 MIOPEN_FIND_MODE=3 MIOPEN_LOG_LEVEL=6 /opt/rocm/miopen/bin/MIOpenDriver conv -n 4 -c 256 --in_d 1 -H 14 -W 14 -k 256 --fil_d 3 -y 3 -x 3 --pad_d 1 -p 1 -q 1 --conv_stride_d 1 -u 1 -v 1 --dilation_d 1 -l 1 -j 1 --spatial_dim 3 -m conv -g 256 -F 2 -t 1 -V 0 -i 1
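
The two workspace figures in the logs above can be pulled out and compared mechanically. This is a minimal Python sketch that assumes only the `[BackwardDataGetWorkSpaceSize]` line format quoted in this comment; the helper name `workspace_sizes` is hypothetical and not part of MIOpen.

```python
import re

# Matches the workspace-size lines exactly as they appear in the quoted logs.
WS_LINE = re.compile(r"\[BackwardDataGetWorkSpaceSize\]\s+(\d+)")

def workspace_sizes(log_text):
    """Return every workspace size (in bytes) reported in the log text."""
    return [int(m.group(1)) for m in WS_LINE.finditer(log_text)]

# Log lines copied from the hybrid-mode and normal-mode runs above.
hybrid_log = "MIOpen(HIP): Info2 [BackwardDataGetWorkSpaceSize] 1387266048"
normal_log = "MIOpen(HIP): Info2 [BackwardDataGetWorkSpaceSize] 5419008"

hybrid_bytes = workspace_sizes(hybrid_log)[0]
normal_bytes = workspace_sizes(normal_log)[0]

# Convert to binary units for readability.
hybrid_gib = hybrid_bytes / 2**30
normal_mib = normal_bytes / 2**20

print(f"hybrid: {hybrid_gib:.2f} GiB, normal: {normal_mib:.2f} MiB")
```

Feeding a full `MIOPEN_LOG_LEVEL=6` log to `workspace_sizes` would list every request, which makes an outlier like the one here easy to spot.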
@atamazov
Contributor

atamazov commented Aug 14, 2020

Comments above show that the system find-db contains incorrect information under 256-1-14-14-3x3x3-256-1-14-14-4-1x1x1-1x1x1-1x1x1-0-NCHW-FP32-B_g256.

@atamazov
Contributor

1387266048 / 256 = 5419008
Something is wrong with computations related to n_groups ( = 256 ).
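
The ratio can be checked against the group count directly. This sketch assumes that the `_g256` suffix of the find-db key encodes the group count, which matches `-g 256` in the driver command; the helper `groups_from_key` is hypothetical.

```python
import re

KEY = "256-1-14-14-3x3x3-256-1-14-14-4-1x1x1-1x1x1-1x1x1-0-NCHW-FP32-B_g256"
RECORDED_WS = 1387266048  # bytes, from the find-db record
EXPECTED_WS = 5419008     # bytes, requested in normal find mode

def groups_from_key(key):
    """Read the group count from a trailing '_g<N>' suffix (assumed format)."""
    m = re.search(r"_g(\d+)$", key)
    return int(m.group(1)) if m else 1

n_groups = groups_from_key(KEY)

# If the recorded size is the expected size scaled once too often by the
# group count, the division is exact and the quotient equals n_groups.
extra_factor, remainder = divmod(RECORDED_WS, EXPECTED_WS)

print(n_groups, extra_factor, remainder)
```

An exact quotient equal to the group count is what points at a duplicated scaling rather than an unrelated miscalculation.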

@atamazov
Contributor

Perhaps we need to look at these changes: https://github.com/AMDComputeLibraries/MLOpen/pull/2340/files

@atamazov
Contributor

@zjing14 Please have a look.

@zjing14
Contributor

zjing14 commented Aug 15, 2020

The multiplication by group_count was moved into miopenConvolutionBwdWeightsAlgoGEMM, but the multiplication by group_count outside was not removed, so the factor is applied twice. Will create a fix.
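
The double scaling can be illustrated with a small sketch. The per-group buffer formula below is an assumption (a float column buffer of channels per group times filter volume times output volume); it is not taken from the MIOpen source, but it reproduces both numbers reported in this issue. All function names are hypothetical.

```python
from math import prod

def per_group_bytes(c_per_group, filter_dims, out_dims, elem_bytes=4):
    """Assumed per-group GEMM column-buffer size in bytes."""
    return c_per_group * prod(filter_dims) * prod(out_dims) * elem_bytes

def algo_workspace(group_count, per_group):
    # After the change: the algorithm-level size already includes group_count.
    return per_group * group_count

def caller_workspace_buggy(group_count, per_group):
    # Leftover outer multiplication: group_count is applied a second time.
    return algo_workspace(group_count, per_group) * group_count

def caller_workspace_fixed(group_count, per_group):
    # Fix: rely on the algorithm-level size without rescaling.
    return algo_workspace(group_count, per_group)

# Shape from the driver command: C = 256, groups = 256, 3x3x3 filter,
# 1x14x14 output, FP32 elements.
groups = 256
per_group = per_group_bytes(256 // groups, (3, 3, 3), (1, 14, 14))

buggy = caller_workspace_buggy(groups, per_group)
fixed = caller_workspace_fixed(groups, per_group)

print(per_group, fixed, buggy)
```

Under these assumptions the fixed path gives the 5419008 bytes seen in normal find mode, and the buggy path gives the 1387266048 bytes stored in the find-db.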

@daniellowell
Contributor Author

daniellowell commented Aug 15, 2020 via email

@daniellowell
Contributor Author

@zjing14 Does this only affect 3D convolutions, or 2D as well?

@zjing14
Contributor

zjing14 commented Aug 15, 2020

Both, probably.

@atamazov
Contributor

atamazov commented Aug 17, 2020

Awesome, thanks guys. Of course this means we have to regenerate the entire find-db.

Actually, only GEMM Backward Data needs to be regenerated, which is ~1/15 of the full regeneration. The process looks like this:

  • (1) Save existing find-db
  • (2) Disable all algorithms except GEMM
  • (3) Run all Backward Group and Forward Transposed Group convolutions (ignore failures when solution is not found)
  • (4) Merge resulting find-db into existing find-db saved at step (1).
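
Step (4) could be sketched as below. This assumes a text find-db in which each line has the form `key=algo:values;algo:values`, inferred from the `Contents found:` log line earlier in this issue; it is not a description of Tuna or of any MIOpen tool. The function names, the `Direct` record, and the corrected GEMM timing are hypothetical sample data.

```python
def parse_db(text):
    """Parse 'key=algo:values;algo:values' lines into {key: {algo: values}}."""
    db = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, payload = line.partition("=")
        records = {}
        for item in payload.split(";"):
            if item:
                algo, _, values = item.partition(":")
                records[algo] = values
        db[key] = records
    return db

def merge_gemm(saved, regenerated):
    """Overwrite only plain GEMM records in 'saved' with regenerated ones."""
    merged = {key: dict(records) for key, records in saved.items()}
    for key, records in regenerated.items():
        for algo, values in records.items():
            # endswith avoids also matching e.g. '...AlgoImplicitGEMM'.
            if algo.endswith("AlgoGEMM"):
                merged.setdefault(key, {})[algo] = values
    return merged

def format_db(db):
    """Serialize back to the assumed one-line-per-key text form."""
    return "\n".join(
        key + "=" + ";".join(f"{a}:{v}" for a, v in records.items())
        for key, records in db.items()
    )

KEY = "256-1-14-14-3x3x3-256-1-14-14-4-1x1x1-1x1x1-1x1x1-0-NCHW-FP32-B_g256"
GEMM = "miopenConvolutionBwdDataAlgoGEMM"

# Step (1): the saved find-db, with the bad GEMM record and an unrelated one.
saved = parse_db(
    f"{KEY}={GEMM}:gemm,0.20544,1387266048,rocBlas,<unused>;"
    "Direct:hypothetical_solver,0.5,0,hypothetical_solver,<unused>"
)
# Steps (2)-(3): a GEMM-only rerun produces a corrected record.
regenerated = parse_db(f"{KEY}={GEMM}:gemm,0.20544,5419008,rocBlas,<unused>")

merged = merge_gemm(saved, regenerated)
print(format_db(merged))
```

Copying `saved` before merging keeps the step (1) backup untouched, so a bad regeneration run can be discarded without loss.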

I am hoping that Tuna has such a capability. @JehandadKhan

@daniellowell
Contributor Author

Yes, we have the ability.
In Tuna we scan the solvers, running each one individually, and populate a database from the results. That database is then exported in Find-db form, so it is straightforward to spot-correct individual solutions.

@atamazov
Contributor

#381 (comment) updated.

@daniellowell
Contributor Author

Implemented.
