Unet3d rcp fix v3.1 #327
Conversation
In mlcommons/training#523 the distributed sampler was enabled, which required switching to the `drop_last_batch` setting from the `pad_last_batch` setting used previously. This changed the `samples_per_epoch` for each global batch size and required a different set of RCPs for GBS values that do not divide the dataset evenly. Furthermore, in mlcommons/training#625 a bug was found in the implementation of the distributed sampler, which might have affected convergence. I regenerated all the RCPs (except for BS=2, which does not use the distributed sampler and divides the dataset evenly) to accommodate the changes.

The change for GBS=32 and GBS=80 seems substantial, but please keep in mind that the number of samples is more or less the same, so the total time to train should remain constant:

| GBS | Old samples/epoch | Old RCP mean | New samples/epoch | New RCP mean | Ratio | Expected ratio |
|----:|------------------:|-------------:|------------------:|-------------:|------:|:---------------|
| 32  | 192 | 1974    | 160 | 2409    | 1.22 | 1.2 (192/160) |
| 56  | 168 | 2213    | 168 | 2300    | 1.04 | 1.00 |
| 64  | 192 | ---     | 128 | 3270.77 | ---  | 1.5 (expected old RCP mean: 2180.51) |
| 80  | 240 | 1618.75 | 160 | 2462.51 | 1.52 | 1.5 |
| 84  | --- | 2233.25 | --- | 2308.70 | 1.03 | 1.0 |
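For context, here is a minimal sketch (not the reference implementation's code) of how the two sampler settings change the effective samples per epoch, and therefore the expected ratio between the old and new RCP means. It assumes a training split of 168 cases, which is consistent with the numbers in the table above.

```python
import math

DATASET_SIZE = 168  # KiTS19 training cases (assumption, consistent with the table above)

def samples_per_epoch(global_batch_size: int, drop_last: bool) -> int:
    """Effective number of samples seen per epoch for a given global batch size."""
    if drop_last:
        # drop_last_batch: the incomplete final batch is discarded
        return (DATASET_SIZE // global_batch_size) * global_batch_size
    # pad_last_batch: the final batch is padded up to a full batch
    return math.ceil(DATASET_SIZE / global_batch_size) * global_batch_size

for gbs in (32, 56, 64, 80, 84):
    old = samples_per_epoch(gbs, drop_last=False)  # pad_last_batch (old RCPs)
    new = samples_per_epoch(gbs, drop_last=True)   # drop_last_batch (new RCPs)
    # If samples-to-converge stays roughly constant, the RCP mean in epochs
    # is expected to grow by old/new relative to the old value.
    print(f"GBS={gbs}: old={old}, new={new}, expected ratio={old / new:.2f}")
```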
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Training WG 08/17: Generally accepted as ok in the meeting.
@mmarcinkiewicz Is this RCP update meant for training v3.1? In that case, could you update your branch and move the changes into the training-3.1.0 folder?
Merge commit: …to unet3d-rcp-fix-v3.1 (conflicts in `mlperf_logging/rcp_checker/training_3.0.0/rcps_unet3d.json`)
@pgmpablo157321 changes have been moved to the 3.1.0 folder. Please take a look.
LGTM
UNET3D RCPs don't exhibit the typical convex curve on a plot of epochs to converge vs. GBS. Instead, convergence is a bit chaotic - a result of the dataset being very small and of training running at global batch sizes that are a large fraction of the dataset (1/3 or even 1/2). Because of that, RCP pruning doesn't work as expected and can potentially pass questionable results. Here's the current plot of epochs to converge vs. GBS:
My proposal is to replace epochs to converge with samples to converge (as BERT and LLM currently do). To do that, I simply multiply the epochs to converge by the samples per epoch (sketched below). The new plot would look like:
The behaviour is much more stable, with no visible dips for the RCP pruning to malfunction on.
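If helpful, here is a rough sketch of the proposed conversion. The field names (`bs`, `epochs_to_converge`, `samples_to_converge`) are illustrative placeholders, not the actual `rcps_unet3d.json` schema, and the dataset size is an assumption.

```python
DATASET_SIZE = 168  # KiTS19 training cases (assumption)

def to_samples_to_converge(rcp: dict) -> dict:
    """Convert an epochs-based RCP record to a samples-based one (illustrative schema)."""
    gbs = rcp["bs"]
    # With drop_last_batch the incomplete final batch is discarded.
    samples_per_epoch = (DATASET_SIZE // gbs) * gbs
    rcp["samples_to_converge"] = [e * samples_per_epoch for e in rcp["epochs_to_converge"]]
    return rcp

# Example: a GBS=32 run that converged in 15 epochs -> 15 * 160 = 2400 samples.
print(to_samples_to_converge({"bs": 32, "epochs_to_converge": [15]}))
```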
If the PR is approved, I'll prepare the appropriate changes to the rules and the reference.