updating unet3d rcp for bs 56 using habana hp #329

itayhubara · 2023-09-07T13:47:11Z

Old BS56 RCP:
mean 386400.0 (2300 epochs)
mean after removing best/worst 10%: 378472 (2252.8125 epochs)

New BS56 RCP:
mean 376320.0 (2240 epochs)
mean after removing best/worst 10%: 342090.0 (2036.25 epochs)
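
The sample counts and epoch counts above are related by a constant factor. A minimal sketch of that arithmetic follows, assuming 168 samples per epoch; that value is inferred from the ratio 386400 / 2300 and is not stated in this thread. The function and variable names are illustrative only.

```python
# Assumption: samples per epoch is inferred from the quoted figures
# (386400 samples / 2300 epochs), not taken from the RCP files.
SAMPLES_PER_EPOCH = 386400 / 2300


def samples_to_epochs(samples: float) -> float:
    """Convert a samples-to-converge figure into epochs."""
    return samples / SAMPLES_PER_EPOCH


def relative_improvement(old: float, new: float) -> float:
    """Fractional reduction of `new` relative to `old`."""
    return (old - new) / old


# Figures quoted in the PR description (in samples).
old_mean, new_mean = 386400.0, 376320.0
old_trimmed, new_trimmed = 378472.0, 342090.0

print(samples_to_epochs(new_mean))      # epochs for the new plain mean
print(samples_to_epochs(new_trimmed))   # epochs for the new trimmed mean
print(f"{relative_improvement(old_mean, new_mean):.1%}")
print(f"{relative_improvement(old_trimmed, new_trimmed):.1%}")
```

Under that assumption, the new trimmed mean is roughly 9.6% lower than the old one, while the plain mean is only about 2.6% lower.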

github-actions · 2023-09-07T13:47:28Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

erichan1 · 2023-09-07T15:48:06Z

Please give specs on what GPU this was run on and how (e.g., is it the reference code, and is it fp32?).

itayhubara · 2023-09-14T15:54:49Z

This was done on Gaudi2 with bf16, the PyTorch dataloader, and code similar to (but not identical to) the reference code. Please note that Nvidia made 57 runs with the reference code and achieved similar statistics.

[1720 1740 1800 1760 1820 1720 2180 3780 2020 1740 3960 1820 2640 1960 1980 2480 1820 1740 1600 1900 2120 1740 2400 1540 1620 1940 2480 1840 3200 1760 2060 1600 1760 1980 1840 2700 1940 1660 2340 1860 1900 3280 2720 2860 1920 1280 2480 2640 2060 1820 1980 1900 3760 1720 2220 2660 2420]
Average 2143.50
Mean after removing best/worst 10%: 2054.46
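
The two statistics above can be reproduced from the 57 epoch counts. The sketch below assumes that "removing best/worst 10%" means dropping floor(0.1 * n) runs from each end of the sorted list (5 of 57 here); that reading matches the quoted figure, but the actual RCP tooling may trim differently. `trimmed_mean` is a hypothetical helper, not a function from the MLCommons code.

```python
from statistics import mean

# Epochs to converge for the 57 runs quoted above.
epochs = [
    1720, 1740, 1800, 1760, 1820, 1720, 2180, 3780, 2020, 1740,
    3960, 1820, 2640, 1960, 1980, 2480, 1820, 1740, 1600, 1900,
    2120, 1740, 2400, 1540, 1620, 1940, 2480, 1840, 3200, 1760,
    2060, 1600, 1760, 1980, 1840, 2700, 1940, 1660, 2340, 1860,
    1900, 3280, 2720, 2860, 1920, 1280, 2480, 2640, 2060, 1820,
    1980, 1900, 3760, 1720, 2220, 2660, 2420,
]


def trimmed_mean(values, fraction=0.1):
    """Mean after dropping the lowest and highest `fraction` of values.

    Assumption: the number dropped per side is floor(len(values) * fraction).
    """
    k = int(len(values) * fraction)
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


print(len(epochs))                     # 57
print(round(mean(epochs), 2))          # 2143.51
print(round(trimmed_mean(epochs), 2))  # 2054.47
```

The values quoted in the comment (2143.50 and 2054.46) agree with these once truncated to two decimals instead of rounded.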

Since an RCP requires running with fp32 on the reference code, we have 3 options:

1. Finish the 57 runs. If Nvidia can do that, it would be great.
2. Accept the current PR based on the information above.
3. Reject the PR and keep the old RCP.

Please note that both the Habana results and the Nvidia results are better than the old RCP, which averaged 2300 epochs, or 2252 after removing the best/worst 10% (meaning the Habana HPs are indeed better).

pgmpablo157321 · 2023-09-18T23:53:38Z

@itayhubara Is this RCP update meant for training v3.1? In that case could you update your branch and move the changes into the training-3.1.0 folder?

nv-rborkar · 2023-09-21T15:36:32Z

To avoid setting a bad precedent, we should not merge any convergence points that are not derived from running the reference code.

nv-rborkar · 2023-09-27T16:04:48Z

@itayhubara Can Habana create RCPs by running the reference code in FP32 and open a new PR?

updating unet3d rcp for bs 56 using habana hp

92290ed

itayhubara requested review from a team as code owners September 7, 2023 13:47

nv-rborkar closed this Sep 22, 2023

github-actions bot locked and limited conversation to collaborators Sep 22, 2023
