ResNext3D layer selected incorrectly large WSS causing OOM failure #381

daniellowell · 2020-08-14T22:24:48Z

ROCm 3.8 Blocking issue.

JIRA tracking issue for ResNext3D:
http://ontrack-internal.amd.com/browse/SWDEV-246350

In the default hybrid find mode, the Caffe2 model crashes with OOM. In normal find mode it passes.

Analysis:

Deleting the user Find-Db and running in default hybrid mode also fails with OOM.
MIOPEN_LOG_LEVEL=6 for default hybrid mode shows 1.3GB is requested by layer:

MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 256-1-14-14-3x3x3-256-1-14-14-4-1x1x1-1x1x1-1x1x1-0-NCHW-FP32-B_g256 in file /root/.config/miopen//gfx90878.HIP.2_6_0_8145-rocm-rel-3.7-19-c16087a4.ufdb.txt
MIOpen(HIP): Info2 [FindRecordUnsafe] Key match: 256-1-14-14-3x3x3-256-1-14-14-4-1x1x1-1x1x1-1x1x1-0-NCHW-FP32-B_g256
MIOpen(HIP): Info2 [FindRecordUnsafe] Contents found: miopenConvolutionBwdDataAlgoGEMM:gemm,0.20544,1387266048,rocBlas,<unused>
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.194715 ms
MIOpen(HIP): Info2 [BackwardDataGetWorkSpaceSize] 1387266048
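To make the size in the record easier to read, here is a minimal sketch that splits the "Contents found" payload from the log above. The helper name `parse_find_db_entry` and the field names are assumptions inferred from the log output, not MIOpen's documented schema. The third field is treated as the workspace size only because it matches the value reported by `BackwardDataGetWorkSpaceSize`.

```python
# Hypothetical helper, not part of MIOpen: split one find-db entry as it
# appears after "Contents found:" in the log. Field meanings are assumed.
def parse_find_db_entry(entry: str) -> dict:
    algorithm, _, values = entry.partition(":")
    fields = values.split(",")
    return {
        "algorithm": algorithm,             # text before the first ':'
        "solver": fields[0],                # assumed: solver name
        "time": float(fields[1]),           # assumed: measured time
        "workspace_bytes": int(fields[2]),  # matches the logged workspace size
    }


entry = "miopenConvolutionBwdDataAlgoGEMM:gemm,0.20544,1387266048,rocBlas,<unused>"
record = parse_find_db_entry(entry)

# Convert to GiB to compare against the "1.3GB" figure quoted in the report.
workspace_gib = record["workspace_bytes"] / 2**30
print(f"{record['algorithm']}: {workspace_gib:.2f} GiB")
```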

Normal mode is actually requesting 5MB:

MIOpen(HIP): Info2 [BackwardDataGetWorkSpaceSize] 5419008
MIOpen(HIP): Info [FindConvBwdDataAlgorithm] requestAlgoCount = 1, workspace = 5419008
MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 256-1-14-14-3x3x3-256-1-14-14-4-1x1x1-1x1x1-1x1x1-0-NCHW-FP32-B_g256 in file /root/.config/miopen//gfx90878.HIP.2_6_0_8145-rocm-rel-3.7-19-c16087a4.ufdb.txt

Find-Db is populated after normal run with clean ufdb with ~~the correct 5MB size~~ the incorrect size (1.3GB).
Driver command to reproduce the issue:

MIOPEN_ENABLE_LOGGING_CMD=1 MIOPEN_FIND_MODE=3 MIOPEN_LOG_LEVEL=6 /opt/rocm/miopen/bin/MIOpenDriver conv -n 4 -c 256 --in_d 1 -H 14 -W 14 -k 256 --fil_d 3 -y 3 -x 3 --pad_d 1 -p 1 -q 1 --conv_stride_d 1 -u 1 -v 1 --dilation_d 1 -l 1 -j 1 --spatial_dim 3 -m conv -g 256 -F 2 -t 1 -V 0 -i 1


atamazov · 2020-08-14T23:24:06Z

~~Comments above show that the system find-db contains incorrect information under 256-1-14-14-3x3x3-256-1-14-14-4-1x1x1-1x1x1-1x1x1-0-NCHW-FP32-B_g256.~~

atamazov · 2020-08-14T23:37:43Z

1387266048 / 256 = 5419008
Something is wrong with computations related to n_groups ( = 256 ).
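The ratio can be checked against the layer shape from the driver command. The sketch below assumes an im2col-style buffer of (C / groups) x filter volume x output volume FP32 elements per group, summed over all groups. This formula is an assumption that happens to reproduce the logged 5419008 bytes; it is not taken from the MIOpen source, and the function name is hypothetical.

```python
FLOAT_BYTES = 4  # FP32


def gemm_bwd_data_workspace(c, groups, fil_d, fil_h, fil_w, out_d, out_h, out_w):
    """Assumed im2col-style buffer size in bytes, summed over all groups."""
    per_group = (c // groups) * fil_d * fil_h * fil_w * out_d * out_h * out_w * FLOAT_BYTES
    return per_group * groups  # group count applied exactly once


# Shape from the driver command: -c 256 -g 256, 3x3x3 filter, 1x14x14 spatial
# size (padding 1 and stride 1 keep the spatial size unchanged).
expected = gemm_bwd_data_workspace(
    c=256, groups=256, fil_d=3, fil_h=3, fil_w=3, out_d=1, out_h=14, out_w=14
)

# Applying the group count a second time reproduces the oversized request.
doubled = expected * 256

print(expected, doubled)
```

Under these assumptions, the first value equals the size requested in normal find mode and the second equals the size stored in the find-db record, which is consistent with the group count being applied twice.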

atamazov · 2020-08-15T00:06:47Z

Perhaps we need to look at these changes: https://github.com/AMDComputeLibraries/MLOpen/pull/2340/files

atamazov · 2020-08-15T00:20:03Z

@zjing14 Please have a look.

zjing14 · 2020-08-15T00:47:13Z

The multiplication by group_count was moved into miopenConvolutionBwdWeightsAlgoGEMM, but the multiplication by group_count outside of it was not removed. I will create a fix.

daniellowell · 2020-08-15T01:01:42Z

Awesome thanks guys. Of course this means we have to regenerate the entire find-db.

daniellowell · 2020-08-15T01:05:44Z

@zjing14 Does this only affect 3D convolutions, or 2D as well?

zjing14 · 2020-08-15T01:15:12Z

Both, probably.

atamazov · 2020-08-17T12:39:28Z

> Awesome thanks guys. Of course this means we have to regenerate the entire find-db.

Actually, only GEMM Backward Data needs to be regenerated, which is ~1/15 of the full regeneration. The process looks like this:

(1) Save existing find-db
(2) Disable all algorithms except GEMM
(3) Run all Backward Group and Forward Transposed Group convolutions (ignore failures when solution is not found)
(4) Merge resulting find-db into existing find-db saved at step (1).

I'm hoping that Tuna has such a capability. @JehandadKhan
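Step (4) of the process above could look roughly like the following sketch. It assumes a text find-db where each line has the form `key=algo:values;algo:values`. That layout, the function names, and the `OtherAlgo` placeholder entry are assumptions for illustration, and this is not how Tuna does it. Only entries for the chosen algorithm are overwritten, and everything else in the saved find-db is kept.

```python
def parse_line(line):
    """Split an assumed 'key=algo:values;algo:values' line into (key, {algo: values})."""
    key, _, payload = line.rstrip("\n").partition("=")
    entries = {}
    for item in payload.split(";"):
        if item:
            algo, _, values = item.partition(":")
            entries[algo] = values
    return key, entries


def format_line(key, entries):
    return key + "=" + ";".join(f"{algo}:{values}" for algo, values in entries.items())


def merge_find_db(saved_lines, regenerated_lines, algo):
    """Overwrite only `algo` entries of the saved find-db with regenerated ones."""
    fresh = dict(parse_line(line) for line in regenerated_lines if "=" in line)
    merged, seen = [], set()
    for line in saved_lines:
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, entries = parse_line(line)
        seen.add(key)
        if algo in fresh.get(key, {}):
            entries[algo] = fresh[key][algo]
        # If regeneration found no solution for this key, the old entry is left as-is.
        merged.append(format_line(key, entries))
    # Keys that exist only in the regenerated find-db are appended unchanged.
    merged += [format_line(k, e) for k, e in fresh.items() if k not in seen]
    return merged


GEMM = "miopenConvolutionBwdDataAlgoGEMM"
saved = [f"k1={GEMM}:gemm,0.20544,1387266048,rocBlas,<unused>;OtherAlgo:other,0.1,0,x,<unused>"]
regenerated = [f"k1={GEMM}:gemm,0.20544,5419008,rocBlas,<unused>"]
merged = merge_find_db(saved, regenerated, GEMM)
print(merged[0])
```

Keeping the untouched algorithms in place is what limits the work to the GEMM entries instead of a full regeneration.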

daniellowell · 2020-08-17T14:40:55Z

Yes, we have the ability.
In Tuna we actually scan the solvers, running them individually, and then populate a database. The database is then exported in the form of Find-db, so from that perspective it is straightforward to spot-correct individual solutions.

atamazov · 2020-08-18T16:46:54Z

#381 (comment) updated.

daniellowell · 2020-08-18T21:35:06Z

Implemented.

daniellowell added bug urgency_blocker labels Aug 14, 2020

daniellowell assigned atamazov Aug 14, 2020

daniellowell assigned daniellowell, sunway513 and lcskrishna Aug 14, 2020

zjing14 mentioned this issue Aug 15, 2020

[Fix] workspace size calc of gemm for group_conv #382

Merged

daniellowell assigned JehandadKhan and cderb Aug 15, 2020

daniellowell closed this as completed Aug 18, 2020
