updating unet3d rcp for bs 56 using habana hp #329

Conversation

itayhubara
Contributor

Old BS56 RCP:
mean 386400.0 (2300 epochs)
mean after removing best/worst 10%: 378472 (2252.8125 epochs)

New BS56 RCP:
mean 376320.0 (2240 epochs)
mean after removing best/worst 10%: 342090.0 (2036.25 epochs)
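The sample counts and epoch counts above are related by a constant factor. A minimal sketch of that conversion follows, assuming 168 samples per epoch; that value is inferred here from the ratio 386400 / 2300 and is not stated in the thread, and the helper name is purely illustrative.

```python
# Assumption: 168 samples per epoch, inferred from 386400 / 2300.
# The thread itself does not state this number.
SAMPLES_PER_EPOCH = 168


def samples_to_epochs(samples: float) -> float:
    """Convert a samples-to-converge figure into epochs (hypothetical helper)."""
    return samples / SAMPLES_PER_EPOCH


# Figures quoted in the PR description, in samples.
old_mean = samples_to_epochs(386400.0)
old_trimmed = samples_to_epochs(378472)
new_mean = samples_to_epochs(376320.0)
new_trimmed = samples_to_epochs(342090.0)

print(f"old: mean {old_mean} epochs, trimmed {old_trimmed:.2f} epochs")
print(f"new: mean {new_mean} epochs, trimmed {new_trimmed} epochs")
```

Under this assumption three of the four figures convert exactly; the old trimmed value differs from the quoted 2252.8125 in the third decimal place, presumably because the sample count was truncated to a whole number.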

@itayhubara itayhubara requested review from a team as code owners September 7, 2023 13:47
@github-actions

github-actions bot commented Sep 7, 2023

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@erichan1
Contributor

erichan1 commented Sep 7, 2023

Please give specs on what hardware this was run on and how (e.g. is it the reference code, and in fp32?)

@itayhubara
Contributor Author

This was done on Gaudi2, with bf16, the PyTorch dataloader, and code similar to (but not identical to) the reference code. Please note that Nvidia made 57 runs with the reference code and achieved similar statistics (epochs to converge):

[1720 1740 1800 1760 1820 1720 2180 3780 2020 1740 3960 1820 2640 1960 1980 2480 1820 1740 1600 1900 2120 1740 2400 1540 1620 1940 2480 1840 3200 1760 2060 1600 1760 1980 1840 2700 1940 1660 2340 1860 1900 3280 2720 2860 1920 1280 2480 2640 2060 1820 1980 1900 3760 1720 2220 2660 2420]
Average 2143.50
Mean after removing best/worst 10%: 2054.46
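The two statistics above can be recomputed from the listed runs. The sketch below assumes that "removing best/worst 10%" means dropping floor(0.1 × n) runs from each end of the sorted list (5 of 57 per side); that reading is an assumption that happens to match the quoted figures, not a statement of how the RCP tooling works. The function name `trimmed_mean` is illustrative.

```python
import math

# Epochs to converge for the 57 runs quoted in the comment.
epochs = [
    1720, 1740, 1800, 1760, 1820, 1720, 2180, 3780, 2020, 1740,
    3960, 1820, 2640, 1960, 1980, 2480, 1820, 1740, 1600, 1900,
    2120, 1740, 2400, 1540, 1620, 1940, 2480, 1840, 3200, 1760,
    2060, 1600, 1760, 1980, 1840, 2700, 1940, 1660, 2340, 1860,
    1900, 3280, 2720, 2860, 1920, 1280, 2480, 2640, 2060, 1820,
    1980, 1900, 3760, 1720, 2220, 2660, 2420,
]


def trimmed_mean(values, fraction=0.10):
    """Mean after dropping floor(fraction * n) values from each end.

    Assumption: this is what "removing best/worst 10%" means here.
    """
    k = math.floor(len(values) * fraction)
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)


mean_all = sum(epochs) / len(epochs)
mean_trimmed = trimmed_mean(epochs)

print(f"runs: {len(epochs)}")
print(f"mean: {mean_all:.2f}")
print(f"trimmed mean: {mean_trimmed:.2f}")
```

With this trimming rule the results agree with the quoted 2143.50 and 2054.46 to within rounding of the last digit.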

Since an RCP requires running the reference code in fp32, we have 3 options:

  1. Finish the 57 runs - if Nvidia can do that it would be great
  2. Accept the current PR based on the information above.
  3. Reject the PR and keep the old RCP

Please note that both the Habana and the Nvidia results are better than the old RCP, which averaged 2300 epochs, or 2252 after removing the best/worst 10%. This suggests the Habana HPs are indeed better.

@pgmpablo157321
Contributor

@itayhubara Is this RCP update meant for training v3.1? In that case could you update your branch and move the changes into the training-3.1.0 folder?

@nv-rborkar
Contributor

To avoid setting a bad precedent, we should avoid merging any convergence points that are not derived from running the reference code.

@nv-rborkar nv-rborkar closed this Sep 22, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Sep 22, 2023
@nv-rborkar
Contributor

@itayhubara Can Habana create RCPs by running the reference code in FP32 and open a new PR?
